Your gold set is lying to you

I once made a retrieval system more correct and watched its score fall. The change was a scoping fix, checked by hand against the source documents. One evaluation suite, grounded in those same documents, ticked up. The curated gold set dropped three points.

We almost reverted. That’s the standard move at this fork, and I’d guess it happens somewhere every week: a correct fix rolled back to protect a number.

Instead we ran an experiment on the gold set itself. We spoon-fed the system, handing it the reference answers through hints, and measured how high the suite could possibly score. It capped at 85 percent. Our accuracy target was above that. We had spent weeks engineering toward a number the data could not pay out.

Once we started reading the golds the way we read code, the bugs were everywhere. Some had been written from memory of what an earlier version of the system could do, so they encoded old limitations as correctness. Some silently assumed a subset of the corpus and some assumed all of it, which means any consistent scoping choice fails one group or the other. I contributed my own damage before understanding this: three golds were failing, I nudged the system prompt until they passed, three other answers regressed, and the whole change got reverted. When the eval is the target and the eval is wrong, optimization makes the product worse.

The fix that held was boring. No gold stands without its evidence attached. For a document corpus that meant rendering the source page and storing it next to the reference answer, so auditing a gold means looking at a picture instead of redoing the research. The first audit caught a transposed decimal, a missed tie, and a wrong city. With the golds repaired and a judge that graded the core claim instead of the phrasing, a system that had been stuck near 60 percent for weeks measured at 95. Not one point of that came from improving the system.

Two things I still don’t have good answers for. Harvesting golds from production traces gives you the real question distribution, but in one deployed assistant two-thirds of the harvested “questions” were button clicks logged as text, and I don’t know a general filter that doesn’t also throw away real short queries. And golds expire in ways nothing flags: when product policy changed, our “must refuse” golds started penalizing the system for correctly answering questions it was newly allowed to answer. We caught that one by luck.

So the discipline now is the ordinary kind. The gold set lives in version control, changes to it get reviewed like code, every answer carries its evidence, audits run on a schedule. Re-audits of public benchmarks keep finding broken reference answers in double-digit fractions, and a private gold set is in worse shape than any of those, because nobody else will ever look at it. The score drop that started all this was the lucky case: two evals disagreed, so we had to interrogate the referee. A team with one gold set never gets that signal.