December 2025
When documents become databases
Schema discovery is a corpus-statistics problem before it is a modeling problem.
The questions that justify pointing AI at an engineering corpus are aggregate questions. What was the average across every test report in the program? How many configurations exceeded the limit? What is the maximum value measured anywhere last year? Retrieval over chunks cannot answer these; the answer has to be computed over rows, which means that at some point the documents have to become a database.
I know the architecture every team draws at this point because I drew it too: classify the corpus into document types, author a schema for each type, extract each document into its table, query with SQL. It is clean, it matches how extraction gets sold, and it works beautifully on the ten documents in the demo.
There is a cheaper test than building it. Ask a model to propose fields for every document in the corpus, then count how often each proposed field recurs. We ran that pass over a few thousand engineering reports and it proposed roughly ten thousand distinct fields. About 74 percent of them appeared in exactly one document. A thin backbone recurred almost everywhere: dates, authors, approvals, part identifiers, pass-fail dispositions. The measurements themselves, the only fields worth aggregating, sat almost entirely in the singleton tail.
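A minimal sketch of that counting pass, assuming the model's proposals have already been collected into a mapping from document ID to proposed field names. The function names, document IDs, and field names here are hypothetical placeholders, not output from the actual run.

```python
from collections import Counter


def field_document_frequency(proposals):
    """Count, for each proposed field, how many documents it appears in.

    `proposals` maps a document ID to the field names proposed for it
    (a hypothetical input shape, assumed for this sketch).
    """
    doc_freq = Counter()
    for fields in proposals.values():
        # set() so a field proposed twice in one document counts once
        doc_freq.update(set(fields))
    return doc_freq


def singleton_share(doc_freq):
    """Fraction of distinct fields that occur in exactly one document."""
    if not doc_freq:
        return 0.0
    singletons = sum(1 for n in doc_freq.values() if n == 1)
    return singletons / len(doc_freq)


# Toy corpus with invented field names, for illustration only.
proposals = {
    "report-001": ["date", "author", "part_id", "peak_torque_nm"],
    "report-002": ["date", "author", "part_id", "burst_pressure_kpa"],
    "report-003": ["date", "author", "leak_rate_sccm", "date"],
}

doc_freq = field_document_frequency(proposals)
share = singleton_share(doc_freq)
print(doc_freq.most_common(3))
print(f"singleton share: {share:.0%}")
```

Counting document frequency rather than raw mentions is the design choice that matters: a field repeated ten times inside one report is still a singleton for schema purposes.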
That distribution forecloses the plan, because a schema per type assumes that documents of the same type share fields, and these mostly do not. We tried to cluster our way out anyway. Asking a model to consolidate the discovered types produced either hundreds of micro-types, each amortizing its authoring cost over a handful of documents, or one mega-type with thousands of columns that are null nearly everywhere.
What the shape wants instead is an observation model. One fact table whose grain is a single measured quantity, carrying the verbatim value, the unit, the measurement conditions, the page and snippet it came from, and a pointer into a vocabulary of canonical properties. The vocabulary is grown out of the corpus itself, by clustering and labeling what was actually extracted, and merges into it are proposed for human review, never applied automatically. Database people will recognize entity-attribute-value here, and their objections are well earned wherever EAV is imposed on data that has a stable schema. The field-frequency histogram is exactly the instrument that says whether yours does. Clinical informatics, facing corpora with the same statistics, converged on this same design fifteen years ago.
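One way the observation model could be laid out, sketched with Python's built-in sqlite3 module. The table and column names are assumptions made for illustration, and so is the `value_num` column, a parsed numeric copy kept beside the verbatim string so that aggregates are possible. The rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Vocabulary of canonical properties, grown from the corpus.
CREATE TABLE property (
    property_id    INTEGER PRIMARY KEY,
    canonical_name TEXT NOT NULL UNIQUE,
    canonical_unit TEXT
);
-- Fact table: one row per measured quantity.
CREATE TABLE observation (
    observation_id INTEGER PRIMARY KEY,
    doc_id      TEXT NOT NULL,
    page        INTEGER NOT NULL,   -- provenance
    snippet     TEXT NOT NULL,      -- provenance
    subject     TEXT,               -- what was measured
    raw_name    TEXT NOT NULL,      -- verbatim field name
    raw_value   TEXT NOT NULL,      -- verbatim value
    value_num   REAL,               -- assumed parsed copy of raw_value
    raw_unit    TEXT,
    conditions  TEXT,
    property_id INTEGER REFERENCES property(property_id)  -- NULL until mapped
);
""")

conn.execute("INSERT INTO property VALUES (1, 'burst_pressure', 'kPa')")
rows = [
    ("report-001", 4, "Burst pressure: 812 kPa at 23 C", "valve A",
     "Burst pressure", "812", 812.0, "kPa", "23 C", 1),
    ("report-002", 7, "P_burst = 790 kPa (ambient)", "valve B",
     "P_burst", "790", 790.0, "kPa", "ambient", 1),
    ("report-003", 2, "Leak rate 0.4 sccm", "valve A",
     "Leak rate", "0.4", 0.4, "sccm", None, None),
]
conn.executemany(
    "INSERT INTO observation (doc_id, page, snippet, subject, raw_name,"
    " raw_value, value_num, raw_unit, conditions, property_id)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# An aggregate question, answered over rows rather than chunks.
max_burst, n_obs = conn.execute("""
    SELECT MAX(o.value_num), COUNT(*)
    FROM observation AS o
    JOIN property AS p ON p.property_id = o.property_id
    WHERE p.canonical_name = 'burst_pressure'
""").fetchone()
print(max_burst, n_obs)
```

The nullable `property_id` is what lets a row land before the vocabulary knows what to call it; the third row stays queryable by its verbatim name and provenance in the meantime.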
Verbatim capture comes first because extraction and ontology mature at different rates. The verbatim layer is stable the day it lands; the canonical vocabulary keeps improving for months. With canonicalization kept as a separate, re-runnable step, every improvement to the vocabulary made weeks-old extractions more queryable without re-reading a single page. Two other findings from the same build: anchoring each value to its subject and conditions from the surrounding context, rather than extracting bare numbers, roughly doubled how often a value could be attributed to the thing it measured. And canonicalization, not extraction, turned out to be the correctness linchpin. The same physical quantity arrives under five phrasings across vendors and decades, and an aggregation over the verbatim names silently undercounts.
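A small sketch of why canonicalization works as a separate, re-runnable step, and of how grouping on verbatim names undercounts. The vocabulary here is a plain lookup from lowercased verbatim name to canonical property; that representation, the function names, and the three sample rows are all assumptions for illustration.

```python
from collections import defaultdict

# Verbatim observations: never modified after extraction.
observations = [
    {"raw_name": "Burst pressure", "value": 812.0},
    {"raw_name": "P_burst", "value": 790.0},
    {"raw_name": "burst press.", "value": 805.0},
]


def canonicalize(rows, vocabulary):
    """Return new rows with a 'property' key; the input rows are left untouched.

    Names missing from the vocabulary map to None rather than being guessed.
    """
    return [
        {**row, "property": vocabulary.get(row["raw_name"].strip().lower())}
        for row in rows
    ]


def count_by(rows, key):
    """Count rows per distinct value of `key`."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)


# Two snapshots of a hypothetical vocabulary, weeks apart.
vocab_v1 = {"burst pressure": "burst_pressure"}
vocab_v2 = {
    **vocab_v1,
    "p_burst": "burst_pressure",
    "burst press.": "burst_pressure",
}

by_raw = count_by(observations, "raw_name")
by_v1 = count_by(canonicalize(observations, vocab_v1), "property")
by_v2 = count_by(canonicalize(observations, vocab_v2), "property")

print(by_raw)  # three groups of one: the same quantity, split three ways
print(by_v1)   # one mapped, two still unmapped
print(by_v2)   # all three under one property, with no page re-read
```

Because `canonicalize` only reads the verbatim rows, rerunning it with a better vocabulary is cheap and cannot damage what was extracted.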
I now measure the field-frequency tail before designing anything. Strong mid-frequency clusters mean conventional wide tables will serve. A thin core with a fat tail means observations. Near-total singletons mean the corpus does not want to be a database at all, and the honest architecture is provenance-rich capture with retrieval on top. The histogram costs an afternoon. The architecture it argues against costs a quarter, and it fails slowly enough that the demo keeps working while it does.
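The three-way reading of the histogram can be written down as a rough rule. Every threshold below is an illustrative placeholder chosen for this sketch, not a number from the build described above, and the function and label names are hypothetical.

```python
def recommend_architecture(
    doc_freq,
    n_docs,
    mid_low=0.05,           # assumed lower edge of "mid-frequency"
    mid_high=0.80,          # assumed upper edge of "mid-frequency"
    mid_cutoff=0.30,        # assumed share of fields that counts as "strong"
    singleton_cutoff=0.95,  # assumed share that counts as "near-total"
):
    """Map a field document-frequency table to one of three architectures.

    `doc_freq` maps field name -> number of documents containing it.
    """
    total = len(doc_freq)
    if total == 0:
        raise ValueError("no fields were proposed")

    singles = sum(1 for n in doc_freq.values() if n == 1) / total
    mid = sum(
        1
        for n in doc_freq.values()
        if n > 1 and mid_low * n_docs <= n <= mid_high * n_docs
    ) / total

    if mid >= mid_cutoff:
        return "wide tables"
    if singles >= singleton_cutoff:
        return "capture plus retrieval"
    return "observations"


# Toy histograms over a 10-document corpus, invented for illustration.
clustered = {"a": 5, "b": 4, "c": 6, "d": 1}
thin_core = {"date": 10, "author": 10, **{f"m{i}": 1 for i in range(8)}}
all_single = {f"m{i}": 1 for i in range(20)}

print(recommend_architecture(clustered, 10))
print(recommend_architecture(thin_core, 10))
print(recommend_architecture(all_single, 10))
```

The cutoffs deserve tuning against a corpus you already understand before trusting the label on one you do not.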