Letting an agent merge to main

An agent opened a pull request while I was asleep, a second agent reviewed it, the two of them agreed it was correct, and it merged to main. I found out from the deploy notification the next morning. The change was small and it was right, and the fact that it was right is the least interesting thing about the night, because the question that matters is not whether the agent can write the code. It is what had to be true for me to let that happen without being reckless.

The reflex is to ask whether the model is good enough to trust. That is the wrong question, the way “is the driver good enough to never crash” is the wrong question about a road. You do not get safety from the driver. You get it from the lanes, the guardrails, the speed limits, and the record of who was where when something went wrong. Autonomy is not a property the agent earns by being smart. It is a property of the box you run it in.

So the agent runs in a box. Each task gets its own worktree, a throwaway checkout it cannot escape. A hook sits in front of every tool call and scopes what it can touch: edits land inside the worktree and nowhere else, and the genuinely dangerous verbs, push, deploy, anything that talks to the outside, are simply refused at the boundary. The agent does not need to be talked out of pushing to main. It cannot. The capability is not present in the box it inhabits.

Every tool call it does make is appended to a ledger, and the ledger is hash-chained, each entry carrying the hash of the one before it. That makes the record tamper-evident: you cannot quietly rewrite what the agent did at 3am without breaking the chain, and you can replay the whole run, in order, to see exactly which files it read, which commands it ran, and in what sequence it talked itself into the change. When something does go wrong, and it will, the first thing I want is not an apology from the model. It is the tape.

The agent's competence sits in the middle. Everything around it is the part that lets it run unattended.

A green build gets the change to the door, and the door does not open on a green build alone. Compilation, the test suite, the architecture lint, the check that generated code still matches its spec: passing all of that proves the change is well-formed, not that it is a good idea. So before anything merges, a separate agent reads the diff cold and judges it, and it judges adversarially, looking for the reason to say no. The panel fails closed and it has to be unanimous. One reviewer unconvinced and the change waits for a human. A loud failure that parks the work is cheap. A quiet wrong merge is the expensive one, and the whole arrangement is tuned to trade the second for the first.

The part I did not expect to matter is that the box learns. A finished run leaves its ledger behind, and the ledger is a record of where the agent thrashed: the test it kept failing, the convention it kept rediscovering, the file it wandered into by mistake. That friction gets distilled into the written instructions the next agent reads before it starts, so the same wall is a little lower the second time a build hits it. The system that runs the agents is also slowly writing down what the agents keep getting wrong.

This box has shipped builds 48 through 397 without my hand on each one, and I want to be precise about what that number is evidence of. It is not evidence that the model is trustworthy. It is evidence that the gates held. The day the panel and the agent share a blind spot, they will agree confidently and be wrong together, and no amount of unanimity protects against a failure both halves cannot see. I do not have a clean answer to correlated blindness. For now the honest mitigation is that the tape is always there, and I read it more often than the green checkmarks would suggest I need to.