Inducing the schema instead of supplying it

Every document-AI architecture I have drawn or seen drawn starts by assuming the schema. You classify the corpus into document types, you author a table for each type, you extract each document into its table, you query with SQL. The schema is the given; placing data into it is the work. That assumption is the part nobody checks, and it is the part that does not hold. The hard, mostly unsolved problem is the inverse: inducing the relational schema from an arbitrary pile of documents, discovering the columns, their types, the keys, and the joins from the shape and content of the data itself, with no human authoring the DDL up front.

I built a framework to do exactly that and ran it across several real corpora. The organizing idea I borrowed from a categorical account of self-revising discovery systems, which names this precise task, learning the base schema rather than engineering it, as an open problem and gives acceptance criteria rather than an algorithm. The schema is not scaffolding you erect before ingestion. The schema is the artifact you are discovering, and it grows as a sequence of versioned transitions, b0 to b1 to b2, each one driven by what the documents turned out to contain.

The trigger for growth is the residual: an extracted, well-formed value that maps onto no existing column. That empty slot is a precise, non-subjective signal that the schema is too small right there. The question is what to do with it, and the wrong answer is to promote every residual into a column, which is how you end up with thousands of fields that are null almost everywhere. The right answer is an information-theoretic gate. A candidate column is promoted only when it pays for itself in bits across the whole corpus: the cost of carrying the column in the schema, plus the cost of encoding its values under a typed model, has to come in under the cost of leaving those values in an untyped overflow store. Minimum description length is the referee. Rare fields stay in overflow, queryable but unpromoted. Compression alone never decides: one field sat in overflow at a 2,476-bit gain because it had appeared in a single document, and one document is not corroboration.
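A minimal sketch of such a gate, under stated assumptions, may make the accounting concrete. Everything here is hypothetical and chosen for illustration, not taken from the framework: the `Residual` type, the 8-bits-per-character cost of the overflow store, the fixed-width integer model, and the `min_docs=2` corroboration floor.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Residual:
    doc_id: str   # document the value was extracted from
    value: int    # well-formed value that matched no existing column

def overflow_bits(name, residuals):
    # Assumed untyped store: each occurrence repeats key and value as 8-bit text.
    return sum(8 * (len(name) + len(str(r.value))) for r in residuals)

def typed_bits(name, residuals):
    values = [r.value for r in residuals]
    schema_cost = 8 * len(name) + 8      # column name plus a one-byte type tag
    model_cost = 2 * 64                  # min and max kept as 64-bit parameters
    span = max(values) - min(values) + 1
    per_value = max(1, math.ceil(math.log2(span)))  # fixed-width code over the range
    return schema_cost + model_cost + per_value * len(values)

def gate(name, residuals, min_docs=2):
    # Promote only if the column saves bits AND is corroborated across documents.
    gain = overflow_bits(name, residuals) - typed_bits(name, residuals)
    docs = len({r.doc_id for r in residuals})
    return gain > 0 and docs >= min_docs, gain, docs

spread = [Residual(f"doc{i % 20}", 1000 + i) for i in range(40)]  # 20 documents
single = [Residual("doc0", 1000 + i) for i in range(40)]          # same values, 1 document
rare = [Residual("doc0", 1000)]                                   # one stray value

for label, sample in [("spread", spread), ("single", single), ("rare", rare)]:
    print(label, gate("net_revenue", sample))
```

In this toy, the `single` case compresses exactly as well as `spread` and is still refused, which is the point of keeping corroboration as a separate condition rather than folding it into the bit count.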

[Figure: The induction loop. A residual the active schema cannot type drives a versioned migration, gated on bits.]

A schema that grows has to grow safely, because old answers were already given against old versions. So every transition is a verified migration, not a bare ALTER TABLE. Old rows transport forward through a recorded map; a new column is added metadata-only and old rows read null with a reason, never a fabricated value; and before the new version commits, the gold queries that were correct against the previous version are re-asked against the new one and must still return what they returned. If a transition would break a prior commitment, it is blocked and quarantined automatically. The check is honest about what it is: non-regression. It proves a migration did not damage what was already accepted, not that the new schema is good. Those are different claims, and conflating them is how teams convince themselves a growing schema is improving when it is only churning.
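The non-regression check can be sketched with SQLite as a stand-in store. This is an illustration under assumptions, not the framework's migration machinery: the `try_transition` helper, the `filings` table, and the single gold query are all invented here, and the recorded map and the null reasons are left out.

```python
import sqlite3

def try_transition(conn, statement, gold_queries):
    # Answers already accepted against the current schema version.
    accepted = {q: conn.execute(q).fetchall() for q in gold_queries}
    conn.execute("BEGIN")
    try:
        conn.execute(statement)
        for query, expected in accepted.items():
            if conn.execute(query).fetchall() != expected:
                raise ValueError(f"regression on: {query}")
    except (sqlite3.Error, ValueError):
        conn.execute("ROLLBACK")   # blocked: the previous version stays in force
        return False
    conn.execute("COMMIT")
    return True

# isolation_level=None lets us issue BEGIN / COMMIT / ROLLBACK explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE filings (doc_id TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO filings VALUES (?, ?)", [("a", 1200), ("b", 3400)])
gold = ["SELECT SUM(revenue) FROM filings"]

# Additive column: old rows read NULL, gold answers are unchanged, so it commits.
added = try_transition(conn, "ALTER TABLE filings ADD COLUMN net_income INTEGER", gold)
# Rescaling in place changes an accepted answer, so it is rolled back.
rescaled = try_transition(conn, "UPDATE filings SET revenue = revenue / 1000", gold)

print(added, rescaled, conn.execute(gold[0]).fetchall())
print(conn.execute("SELECT net_income FROM filings").fetchall())
```

Note what the sketch does not claim: a committed transition has only been shown not to disturb the gold queries, which says nothing about whether the added column was worth adding.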

The other half is that every cell has to replay. A materialized value carries a derivation back through its transformations to the source span: document, page, bounding box, character offsets. This is a property of the data model, not a citation bolted on at the end, and it is multi-parent, because a value often comes from a table cell plus its caption plus a coreference, grouped under one derivation. Across one build, provenance replayed on 85 of 85 audited columns. The payoff shows up against retrieval: on a double-blind, pre-registered battery of compute-over-rows aggregates, where a numerically correct answer whose cells do not replay is scored wrong, the framework returned 11 of 15 correct with 0 fabricated, while the strongest graph-augmented retrieval baseline invented confident, uncitable numbers on the aggregates it could not compute. A separate finding, which I had hoped would go the other way, is that adding a graph layer over these tabular corpora bought no material retrieval win over plain dense retrieval with reranking. The hard questions were joins and aggregations, not multi-hop walks, and a graph traverses the wrong shape.
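One way to picture provenance as part of the data model is the sketch below. The class names, the `scale_by_caption` transformation, the sample page text, and the placeholder bounding boxes are all assumptions made for illustration; the coreference parent mentioned above is omitted to keep it short.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceSpan:
    doc_id: str
    page: int
    bbox: tuple   # (x0, y0, x1, y1) on the page; placeholder values below
    start: int    # character offsets into the page text
    end: int

@dataclass(frozen=True)
class Derivation:
    transform: str   # name of a registered, deterministic transformation
    parents: tuple   # one or more SourceSpan objects, hence multi-parent

@dataclass(frozen=True)
class Cell:
    value: int
    derivation: Derivation

def scale_by_caption(parts):
    # Combine a table cell with its caption: "1,234" + "(in thousands)".
    number, caption = parts
    factor = 1000 if "thousands" in caption else 1
    return int(number.replace(",", "")) * factor

TRANSFORMS = {"scale_by_caption": scale_by_caption}

def replays(cell, pages):
    # Re-read every parent span from the source and rerun the transformation.
    d = cell.derivation
    parts = [pages[(s.doc_id, s.page)][s.start:s.end] for s in d.parents]
    return TRANSFORMS[d.transform](parts) == cell.value

text = "Revenue 1,234 (in thousands)"
pages = {("filing-a", 4): text}

def span_of(fragment):
    start = text.index(fragment)
    return SourceSpan("filing-a", 4, (0, 0, 0, 0), start, start + len(fragment))

derivation = Derivation("scale_by_caption", (span_of("1,234"), span_of("(in thousands)")))
grounded = Cell(1_234_000, derivation)
fabricated = Cell(1_235_000, derivation)   # plausible number, not in the source

print(replays(grounded, pages), replays(fabricated, pages))
```

Under the scoring rule described above, the second cell counts as wrong however plausible its value looks, because rerunning its derivation against the source does not reproduce it.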

What is proven, then, is the form: verified transitions under a load-bearing gate, with provenance that replays and natural-language questions answered as grounded SQL on multiple real corpora, including nine never-seen documents cold-onboarded in 23.9 seconds. What is not proven is the economics. The theory’s signature result is parsimony, the schema staying compact as evidence piles up, and on heterogeneous corpora mine does the opposite: it accretes genuine new columns, because the documents genuinely keep introducing new concepts. The lever that would buy compression is a merge operator that recognizes when two columns are the same concept under different names, and that one I could not make sound. Embeddings rank name overlap, not concept identity. They happily merge two distinct measures that share a header word and reject a true synonym pair, and no threshold I tried separated the two cases. So the merge stays off, and a real chunk of the theory stays a frontier.
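The failure mode can be shown with a toy. The sketch below uses plain token overlap (Jaccard) as a deliberately crude stand-in for an embedding similarity, and the two label pairs are invented examples, not labels from the corpora above. It illustrates the shape of the problem only; it is not a reproduction of the experiments.

```python
def name_overlap(a, b):
    # Jaccard similarity over lowercase header tokens.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Two distinct measures that share a header word (hypothetical labels).
distinct = ("operating income", "operating expenses")
# Two names for the same concept that share no word (hypothetical labels).
synonyms = ("net income", "profit after tax")

def separating_threshold_exists(distinct_pairs, synonym_pairs):
    # A merge threshold can work only if every true synonym pair scores
    # strictly above every distinct pair.
    lowest_synonym = min(name_overlap(*p) for p in synonym_pairs)
    highest_distinct = max(name_overlap(*p) for p in distinct_pairs)
    return lowest_synonym > highest_distinct

print(name_overlap(*distinct), name_overlap(*synonyms))
print(separating_threshold_exists([distinct], [synonyms]))
```

Any score that tracks surface naming orders these pairs the wrong way round, so lowering the threshold far enough to merge the synonyms merges the distinct measures first.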

Concept identity across documents is the rock everything in this area breaks on. Different filings label the same line item a dozen different ways, and 93 percent of the labels in one financial corpus appeared in exactly one document, so there was nothing to align them against. You can split a same-shaped pair into two columns soundly. Unifying two truly-same concepts that never co-occur is the open part, and I do not have it.