Tables are images

The first honest evaluation I ran on table reading produced a result I did not want. Given the same real document tables, a model reading extracted text landed at roughly a third of the quality the same model achieved looking at the rendered page. Qualitative, not a benchmark, but too wide a gap to argue with.

The gap exists because a table is a coordinate system. The string “66.2” carries no meaning of its own; it means something because of the row label three cells left, the column header two rows up, the unit hiding in a merged cell, the heading naming which configuration the page describes. Flatten the grid into a token stream and those coordinates are gone. The model still answers fluently. The values are just married to the wrong headers, and nobody notices until an aggregate is wrong in a meeting.

The evaluation also caught a lie, and the lie taught me more than the gap did. One text-based path scored suspiciously close to vision, and I nearly believed it before tracing what the provider did with documents submitted in that format: it rendered the page. The model had been looking at the table all along while we congratulated the text pipeline. Since then I treat “what does the model actually see” as the opening question of any evaluation.

The wrong lesson would have been vision everywhere. Before committing to that, I ran a census across tens of thousands of tables in a working engineering corpus. About a fifth were clean enough that a deterministic structure-recovery parser reproduced them perfectly, for free. Another fifth were not tables at all: tables of contents, title blocks, header furniture that detectors flag eagerly, better discarded than read. The remainder, well over half, genuinely needed pixels: jammed cells where columns had collapsed into each other, headers spanning column groups, layouts where text extraction was provably lossy. And the value concentrated in the ugly common cases. The exotic multi-level headers that dominate the literature carried a few percent of the corpus’s numbers.

The census held one inversion that shaped the design. On clean born-digital tables, the deterministic parser was more faithful than the frontier vision model, which once dropped an entire column, emitting empty strings where the parser had captured every cell. Born-digital pages also carry their own ground truth in the embedded text layer, so I could grade both readers without paying a model to judge a model. What survives is a per-table router: deterministic recovery where it provably succeeds, vision where it provably fails, with failure modes detectable from structure alone. The router needed its own humility. My first jammed-cell detector decided dates, version strings, and fiscal quarters were collapsed columns and tried to send most of the corpus to vision. Heuristics over tables are tables too.

Cost yields to one asymmetry: small open-weight vision models read a targeted value from a table image about as well as a frontier model, at close to no cost, and fail badly at exhaustive transcription of dense tables. Point-reads go to the cheap model; faithful full transcription belongs to the strongest model available, applied only to the slice that needs it.

Reading is half the problem. The questions enterprises ask of tables are aggregations: how many, across every document, what is the largest, grouped by what. Those are SQL questions, and a table shredded into retrieval chunks cannot answer them by construction. Cross-document aggregation in our system went from mostly failing to almost always passing on the day table facts landed in a queryable store, with the reading pipeline untouched. We had spent months making the reading faithful, and the deciding move was giving the results somewhere to live.