The model designs, the code enforces

The worst schema my ingestion framework ever produced came from the most carefully engineered prompt I have written for it. The framework turns thousands of heterogeneous documents into queryable databases, and the design prompt had accumulated everything I believed about data modeling: dimensional conventions, key strategies, normalization rules, years of taste compressed into instructions. The model returned a blob. Composite keys built from verbatim strings, no lookup tables, no junctions. Then we tried a plain prompt, essentially “design a normalized schema for this corpus, as a careful data engineer would,” and got surrogate keys, proper relationships, clean third normal form. A blind-judged bake-off ruled out the output format as the culprit; the framing alone did the damage. The model has read more schemas than any engineer alive, and the discipline we packed into the prompt was a constraint suppressing a capability.

That result settled how I divide labor in these pipelines. The model gets exactly two jobs, and both require reading the document. The first is design: deciding what schema fits a corpus and what type a given document is. The second is transcription: extracting what a page says, verbatim, with the page number and a source snippet attached. Everything mechanical is deterministic code: rendering the schema artifacts, validating them, loading, deduplicating, canonicalizing, serving. Not because the model could not do those things, but because code does them the same way twice, and the entire value of a data pipeline is doing things the same way twice.

The schema lesson generalizes: loadability is the framework’s concern, and it never appears in a prompt. The design prompt says nothing about foreign-key mechanics, uniqueness enforcement, or load order, because every such mention trades design quality for compliance with machinery the model has no reason to know exists. Codegen renders the artifacts, validators and compile checks gate them, and the loader absorbs every get-or-create, every upsert, every constraint. We learned the cost of ignoring this by running the inverse experiment: an agent writing pipeline code directly burned its entire budget debugging framework internals instead of designing anything, at several times the cost.
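To make "the loader absorbs every get-or-create" concrete, here is a minimal sketch, assuming SQLite and a hypothetical two-table schema. The table names (`vendor`, `invoice`) and the function names are illustrative, not taken from the framework. The point is that surrogate keys, uniqueness, and load order live entirely in code, so nothing about them has to appear in a prompt.

```python
import sqlite3

# Hypothetical lookup tables that rendered schema code would declare.
LOOKUP_TABLES = {"vendor"}


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with an illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vendor ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE invoice ("
        " id INTEGER PRIMARY KEY,"
        " vendor_id INTEGER NOT NULL REFERENCES vendor(id),"
        " number TEXT NOT NULL,"
        " UNIQUE (vendor_id, number))"
    )
    return conn


def get_or_create(conn: sqlite3.Connection, table: str, name: str) -> int:
    """Return the surrogate key for `name`, inserting a row if needed."""
    if table not in LOOKUP_TABLES:
        # Table names are interpolated, so only known tables are allowed.
        raise ValueError(f"unknown lookup table: {table}")
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    row = conn.execute(
        f"SELECT id FROM {table} WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def load_invoice(conn: sqlite3.Connection, vendor_name: str, number: str) -> None:
    """Load one extracted invoice; re-loading the same record is a no-op."""
    vendor_id = get_or_create(conn, "vendor", vendor_name)  # parent first
    conn.execute(
        "INSERT OR IGNORE INTO invoice (vendor_id, number) VALUES (?, ?)",
        (vendor_id, number),
    )
    conn.commit()
```

Loading the same extraction twice leaves one vendor row and one invoice row, which is the "same way twice" property the pipeline depends on.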

The boundary needs policing from the other side too, because mechanical work migrates into prompts quietly. Real corpora say the same thing five ways, and at some point “emit the canonical value” crept into our extraction prompt. It was a hallucination surface: mapping decisions that must be exact, stable, and auditable were being remade by a model on every document. We pulled canonicalization back into a deterministic trigram function at load time, matching against a maintained vocabulary, and both halves improved. The extractor’s job collapsed to transcribing what the page says, and the mapping became a testable function with version history. We considered embeddings for that resolution step and rejected them: they drift across model versions and over-match by design, and a single over-eager merge corrupts an exact-match key forever.
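A sketch of what load-time canonicalization along these lines could look like, assuming Jaccard similarity over character trigrams and a conservative threshold. The similarity measure, the padding scheme, the threshold value, and the function names are my assumptions for illustration; the text above only specifies "a deterministic trigram function" matched against a maintained vocabulary.

```python
from __future__ import annotations


def trigrams(text: str) -> set[str]:
    """Character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def canonicalize(
    raw: str, vocabulary: list[str], threshold: float = 0.6
) -> str | None:
    """Map a transcribed value to a vocabulary entry, or None if unsure.

    Returning None instead of the nearest entry is deliberate: an
    unmatched value can be reviewed later, a wrong merge cannot.
    """
    best_entry, best_score = None, 0.0
    # Sorting makes tie-breaking independent of vocabulary order.
    for entry in sorted(vocabulary):
        score = similarity(raw, entry)
        if score > best_score:
            best_entry, best_score = entry, score
    return best_entry if best_score >= threshold else None
```

Because the function is pure, each mapping decision can be pinned in a unit test, and a change to the vocabulary or threshold shows up as a reviewable diff rather than as a shift in model behavior.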

The same division keeps transcription honest. A deterministic check verifies that each snippet is literally a substring of its claimed page; a transcription that fails the check does not load. Document identity is the hash of the file’s content, a rule we adopted after filename-based identity silently collided two different documents. And the spec and the raw extraction are both persisted, so the pipeline can re-render and re-load as often as needed without invoking a model again.
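Both rules are small enough to sketch. The following assumes pages are available as a list of extracted text strings and uses SHA-256 for identity; the hash algorithm, the `Extraction` record, and the function names are illustrative assumptions, not the framework's actual types.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    page: int      # 1-based page number claimed by the extractor
    snippet: str   # verbatim source text supporting the value
    value: str     # the transcribed value itself


def document_id(content: bytes) -> str:
    """Identity is derived from the bytes, never from the filename."""
    return hashlib.sha256(content).hexdigest()


def snippet_is_grounded(extraction: Extraction, pages: list[str]) -> bool:
    """True only if the snippet appears verbatim on the claimed page."""
    if not extraction.snippet:
        return False  # an empty string is a substring of everything
    if not 1 <= extraction.page <= len(pages):
        return False  # claimed page does not exist
    return extraction.snippet in pages[extraction.page - 1]


def loadable(extractions: list[Extraction], pages: list[str]) -> list[Extraction]:
    """Keep only extractions that pass the grounding check."""
    return [e for e in extractions if snippet_is_grounded(e, pages)]
```

The check is intentionally strict: no whitespace normalization and no fuzzy matching. Two files with the same name but different bytes get different identifiers, and two copies of the same bytes under different names get the same one.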

The honest limit is that the model designs from evidence. An axis your queries depend on that never surfaces in the documents is unrecoverable by any prompt, however plain. Closing that gap takes feedback from real workloads, which means evaluation.