Evaluating systems that answer in sentences

Software engineering spent fifty years building the discipline of knowing whether a change made things better or worse: tests, CI, regression suites, the whole apparatus of confidence. Then we started shipping systems whose core component answers in sentences, and most teams quietly abandoned the entire discipline. A prompt gets edited, the engineer tries four questions that used to fail, they pass, and the change ships. Nobody knows what the edit broke, because nothing measures it. I have watched this loop run inside startups and inside serious enterprises, and it is the single most common reason natural-language systems plateau: iterating blind.

The blindness has a specific structure. In deterministic software, a change either breaks the build or it does not, and the test suite localizes the damage. In an LLM system, every change is global. A reworded prompt, a model upgrade, a new chunking strategy, a reranker swap: each one shifts behavior across the entire input distribution at once, improving some answers and degrading others, and the degradations land silently in the cases you did not try. Teams in this state develop a characteristic learned helplessness: they stop touching the system, because the last three times someone touched it, something unrelated got worse and a customer found it first.

The way out is unglamorous: evaluation infrastructure, built with the same seriousness as the test suite and occupying the same place in the development loop. That means a harness that runs on every meaningful change and answers, in minutes, the question every engineer is actually asking: did I just make this better or worse, and where?

Building one has taught me a few things that the papers tend not to emphasize.

The eval set matters more than the metric, and real failures are where eval sets come from. Synthetic question sets measure the system you imagined; production failures measure the system you have. Every wrong answer a user catches goes into the set with the correct answer attached, and the system can never regress on it unnoticed again. An eval set built this way is the accumulated memory of every way the system has failed, and after a year it is one of the most valuable assets the team owns, worth more than any individual component of the pipeline it tests, all of which will be replaced before the eval set is.
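As a rough illustration of that habit, here is a minimal Python sketch of a failure-driven eval set. The names (EvalCase, add_failure, to_jsonl) and the JSONL layout are assumptions made for the example, not a prescribed format; the only idea taken from the text is that each caught failure is stored once, with its correct answer attached.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    # Hypothetical schema: adapt the fields to whatever your system records.
    question: str
    expected_answer: str
    source: str  # where the failure was caught, e.g. a ticket id


def add_failure(cases, question, expected_answer, source):
    """Return a new list with the failure appended, unless the same
    question (ignoring case and surrounding whitespace) is already present."""
    key = question.strip().lower()
    if any(c.question.strip().lower() == key for c in cases):
        return list(cases)
    return [*cases, EvalCase(question.strip(), expected_answer.strip(), source)]


def to_jsonl(cases):
    """Serialize the set as one JSON object per line, so it diffs cleanly
    in version control."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)
```

Keeping the set in version control next to the code is one reasonable choice here, because it gives every case an author, a date, and a review trail.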

Score the stages. An end-to-end score tells you something is wrong and nothing about what. The question that actually directs engineering effort is which stage failed: did retrieval surface the right material and the model misread it, or did the model never see the evidence at all? Those are different teams’ problems with different fixes. Instrument retrieval and generation separately, and most “the AI is wrong” mysteries decompose into ordinary, fixable engineering.
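One way to make that split concrete is sketched below in Python. It assumes each eval case records which documents contain the evidence, which documents retrieval returned, and whether the final answer was graded correct; the function names and record fields are hypothetical, and "at least one relevant document was retrieved" is a deliberately crude stand-in for a real retrieval metric.

```python
from collections import Counter


def classify(relevant_ids, retrieved_ids, answer_correct):
    """Attribute a single eval case to a pipeline stage.

    "ok"          the final answer was graded correct
    "generation"  the evidence was retrieved but the answer was still wrong
    "retrieval"   none of the relevant documents reached the model
    """
    if answer_correct:
        return "ok"
    if set(relevant_ids) & set(retrieved_ids):
        return "generation"
    return "retrieval"


def summarize(records):
    """Count outcomes per stage. Each record is a dict with the (assumed)
    keys "relevant_ids", "retrieved_ids", and "answer_correct"."""
    return Counter(
        classify(r["relevant_ids"], r["retrieved_ids"], r["answer_correct"])
        for r in records
    )
```

A summary such as 40 ok, 3 generation, 12 retrieval points the next week of work at retrieval, which an end-to-end accuracy figure alone would not.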

Be suspicious of your judges. Using a model to grade a model is unavoidable at scale and useful within limits, but it imports the failure modes of the judge into your measurement system. The corrective is human anchoring: sample the judge’s verdicts, have people check them, measure the agreement rate, and treat disagreement as a bug in the harness with the same priority as a bug in the product. The goal is known reliability, so that when the number moves you know whether to believe it.
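The anchoring step can be sketched in a few lines of Python. This assumes verdicts are stored as a mapping from case id to a pass/fail boolean; the function names are hypothetical, and raw agreement is only the simplest possible measure (a chance-corrected statistic may be more appropriate when verdicts are heavily skewed toward pass).

```python
import random


def sample_for_review(case_ids, k, seed=0):
    """Pick up to k cases for human review. A fixed seed keeps the sample
    reproducible, so reviewers and engineers look at the same cases."""
    rng = random.Random(seed)
    ids = sorted(case_ids)
    return rng.sample(ids, min(k, len(ids)))


def agreement(judge_verdicts, human_verdicts):
    """Compare model-judge and human verdicts on the cases both have graded.

    Returns (agreement_rate, disagreeing_case_ids). The disagreements are
    the work queue: each one is a potential bug in the harness.
    """
    shared = sorted(set(judge_verdicts) & set(human_verdicts))
    if not shared:
        raise ValueError("no cases graded by both the judge and a human")
    disagreements = [c for c in shared if judge_verdicts[c] != human_verdicts[c]]
    return 1 - len(disagreements) / len(shared), disagreements
```

Tracking the rate over time matters as much as its value on any one day: a judge whose agreement drops after a judge-prompt or judge-model change has changed what the headline number means.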

Wire it into the gate. An eval harness that engineers run by hand when they remember decays into a dashboard nobody reads. The harness earns its keep when it gates releases the way tests gate merges: a change that drops retrieval quality or answer accuracy past a threshold is blocked mechanically and ships only if a human explicitly decides to override. This is also what makes fast iteration safe. The teams with the strongest evaluation move fastest: they can make aggressive changes, swap in new models the week those models are released, and know by lunch whether the swap holds.
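A gate of this kind can be very small. The Python sketch below assumes the harness emits a dictionary of aggregate scores for the baseline and for the candidate change; the metric names, the thresholds, and the functions gate and exit_code are all illustrative assumptions rather than a standard interface.

```python
def gate(baseline, candidate, max_drop):
    """Return a human-readable failure for every metric whose score fell by
    more than its allowed drop. An empty list means the change may ship.

    baseline, candidate: metric name -> score (higher is better)
    max_drop:            metric name -> largest tolerated decrease
    """
    failures = []
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > allowed:
            failures.append(
                f"{metric}: {baseline[metric]:.3f} -> {candidate[metric]:.3f} "
                f"(drop {drop:.3f} exceeds allowed {allowed:.3f})"
            )
    return failures


def exit_code(failures, override=False):
    """Map the gate result to a process exit code for CI. The override flag
    stands in for an explicit, logged human decision to ship anyway."""
    return 0 if override or not failures else 1
```

In a CI job, the harness would compute both score dictionaries, print the failures, and pass the result of exit_code to sys.exit, so a blocked change fails the pipeline the same way a failing unit test does.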

There is a quieter reason all this matters now. The model layer is improving on someone else’s schedule, and every improvement arrives as an invitation to upgrade. Teams without evaluation experience each upgrade as a leap of faith, and most decline it; they are still running last year’s model out of fear. Teams with evaluation run the harness, read the diff, and decide on evidence.
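"Read the diff" can be taken literally. The Python sketch below compares per-case verdicts from two harness runs, for instance the current model and a candidate upgrade; the function name and the case-id-to-boolean representation are assumptions made for the example.

```python
def diff_runs(old, new):
    """Compare two runs of the same eval set.

    old, new: case id -> True if the case passed in that run.
    Only cases present in both runs are compared.
    """
    shared = old.keys() & new.keys()
    return {
        "fixed": sorted(c for c in shared if not old[c] and new[c]),
        "regressed": sorted(c for c in shared if old[c] and not new[c]),
        "unchanged": sorted(c for c in shared if old[c] == new[c]),
    }
```

Two runs with the same aggregate score can hide several fixes and an equal number of regressions, so the regressed list is usually the part worth reading first.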

A product is a system whose owners know how well it answers questions, can prove it, and can tell within the hour when that stops being true.