A good average hides the column that matters

A synthetic table came back scoring 0.94 on fidelity and it was useless. The number was real: averaged across every column, the generated data matched the distribution of the real data almost perfectly. It was also beside the point, because the one column anyone actually queried, the category that every downstream model split on, had collapsed to a single value. Ninety-something columns reproduced beautifully and carried the mean, and the column the data existed to preserve was dead, and the average was happy to bury that under good news from columns nobody cared about.

This is the failure mode of every quality score that reports a mean. A mean is a machine for hiding its worst input. You can always raise it by improving the parts that were already fine, and a generator under optimization pressure will find that path, dressing up the easy columns while the hard one rots, because the easy columns are where the cheap gains are. If the metric you watch is an average, you are training the thing to lie to you in exactly the place you are not looking.

So the fidelity check I trust does not average and stop. It scores every column and every pairwise relationship, sorts them, and looks at the bottom tenth. Every score in that worst decile has to clear the bar on its own. A perfect score on the easy ninety percent cannot buy back one collapsed field, because the gate is not asking “is this good on the whole.” It is asking “is the worst thing here good enough,” which is the only question whose answer survives contact with a real query that happens to land on the worst thing.
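To make the floor concrete, here is a minimal sketch of that check, not a definitive implementation. It assumes categorical columns scored as one minus the total variation distance. The names `tv_similarity`, `worst_decile`, and `passes_floor`, the 0.80 floor, and the example table are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def tv_similarity(real, synth):
    """1 minus the total variation distance between two categorical samples."""
    real_counts, synth_counts = Counter(real), Counter(synth)
    categories = set(real_counts) | set(synth_counts)
    tvd = 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synth))
        for c in categories
    )
    return 1.0 - tvd

def worst_decile(scores):
    """Return the lowest-scoring tenth of the items (at least one), worst first."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    count = max(1, math.ceil(len(ranked) / 10))
    return ranked[:count]

def passes_floor(scores, floor):
    """True only if every score in the worst decile clears the floor."""
    # The mean is never consulted here.
    return all(score >= floor for _, score in worst_decile(scores))

# Hypothetical table: nineteen columns that reproduce well, one that collapsed
# to a single value in the synthetic data.
scores = {f"col_{i}": 0.98 for i in range(19)}
scores["segment"] = tv_similarity(["a", "b", "c", "d"] * 25, ["a"] * 100)

mean_score = sum(scores.values()) / len(scores)
print(f"mean={mean_score:.2f}",
      f"worst={worst_decile(scores)[0]}",
      f"passes={passes_floor(scores, floor=0.80)}")
```

Because the scores are sorted, requiring the whole bottom tenth to clear the floor is the same as requiring the minimum to clear it. The decile is kept mainly so there is a short list to inspect when the gate fails. Scores for pairwise relationships could go into the same dictionary under their own keys.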

Every gate is a floor, not an average. One breach drops the whole table, score zero, no partial credit.

Fidelity is the easy half. The half that actually keeps me up is privacy, because a synthetic row that is secretly a copy of a real person is a leak wearing a costume, and it will pass a fidelity check with flying colors. Matching the real distribution and memorizing the real records look identical from the outside if all you measure is how real the output seems. So privacy needs its own gates, pointed the other way. Can a classifier tell synthetic rows from real ones? If it can separate them cleanly, the synthesis failed; if it cannot, that is the good case. What fraction of the synthetic rows are genuinely new rather than reissued originals? And how close does a synthetic row sit to its nearest real neighbor, measured against how close real rows already sit to each other? A synthetic record that hugs a real person tighter than real people hug each other is not anonymized; it is a quotation.
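Two of those gates, the novelty fraction and the nearest-neighbor comparison, can be sketched in a few lines. This is an illustration under stated assumptions rather than the method actually used: rows are taken to be tuples of numbers on comparable scales, distance is Euclidean, and the baseline is the median nearest-neighbor distance among real rows. The function names are hypothetical. The classifier test is left out because it needs a modeling library.

```python
import math
from statistics import median

def novel_fraction(real_rows, synth_rows):
    """Share of synthetic rows that are not exact copies of a real row."""
    real_set = set(real_rows)  # rows must be hashable, e.g. tuples
    return sum(row not in real_set for row in synth_rows) / len(synth_rows)

def nearest_distance(row, others):
    """Euclidean distance from row to its closest neighbor in others."""
    return min(math.dist(row, other) for other in others)

def worst_closeness_ratio(real_rows, synth_rows):
    """Closest synthetic-to-real distance, relative to typical real-to-real spacing.

    A value well below 1 means some synthetic row sits closer to a real
    record than real records usually sit to each other.
    """
    real_spacing = median(
        nearest_distance(row, real_rows[:i] + real_rows[i + 1:])
        for i, row in enumerate(real_rows)
    )
    # Assumption: real_spacing > 0, i.e. the real table is not mostly duplicates.
    closest = min(nearest_distance(row, real_rows) for row in synth_rows)
    return closest / real_spacing

# Hypothetical data: four real rows on a unit square, three synthetic rows,
# one of which is an exact copy of a real row.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synth = [(0.5, 0.5), (1.0, 1.0), (0.0, 0.05)]

print(novel_fraction(real, synth), worst_closeness_ratio(real, synth))
```

The ratio takes the minimum over synthetic rows rather than a mean or median, in keeping with the floor idea: one record that quotes a real person is enough to fail, however well the rest are spread out.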

Every one of these is a floor, and a breach of any single floor sends the score to zero. Not a deduction, not a weighted penalty that a strong showing elsewhere can absorb. Zero. The moment you let a privacy failure be traded against a fidelity win, you have built a system that will, under pressure, sell out the privacy to make the average look good, which is the original sin restated one level up. Fail closed or you have not built a gate, you have built a suggestion.
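The fail-closed rule is simple enough to write down directly. The sketch below assumes every measurement has been oriented so that higher is better (a classifier's separation score would need to be inverted first). The gate names and thresholds are hypothetical.

```python
import math

def gated_score(measurements, floors, headline):
    """Return the headline score only if every floor is met; otherwise 0.0."""
    for name, floor in floors.items():
        value = measurements.get(name)
        # A missing or NaN measurement counts as a breach: fail closed.
        if value is None or math.isnan(value) or value < floor:
            return 0.0
    return headline

# Hypothetical floors; real thresholds would be set per table.
floors = {"worst_column": 0.80, "novel_fraction": 0.95, "closeness_ratio": 0.50}

healthy = {"worst_column": 0.91, "novel_fraction": 0.99, "closeness_ratio": 0.80}
collapsed = {"worst_column": 0.25, "novel_fraction": 0.99, "closeness_ratio": 0.80}
unmeasured = {"worst_column": 0.91, "novel_fraction": 0.99}

for case in (healthy, collapsed, unmeasured):
    print(gated_score(case, floors, headline=0.94))
```

Nothing in the loop adds, weights, or averages anything, so a strong result on one gate cannot offset a breach on another. A gate that was never measured is treated the same as a gate that failed.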

I will say what I have not earned the right to claim. This has been run end to end against a small number of real tables, not a fleet of them, and a gate only catches the failure it was written to catch. The collapsed column taught me to floor the fidelity; the near-duplicate row taught me to measure distance. I am sure there is a third failure I have not been handed yet, and when it arrives it will sail through every gate I have, because the gate for it does not exist until something goes wrong enough to write it.