Retrieval you can audit

Enterprise document corpora are hostile terrain: scanned contracts next to spreadsheets next to engineering reports, each with its own structure, none of it consistent. Standard RAG pipelines flatten that mess into chunks and hope. The answers sound right, sometimes are right, and give the reader no way to tell the difference. For a business decision, an unverifiable answer is worth nothing.

I set the technical direction from the opposite constraint: provenance first. Every fact the system extracts keeps a pointer to the exact source page it came from, and every answer the system gives can be traced back through that chain. If the system claims a deductible is $500, you can open the document and see where.
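
As an illustration of what "keeps a pointer" could mean in practice, here is a minimal sketch of a provenance-carrying record. The class names (SourceRef, Fact, Answer), the file name "policy.pdf", and the page number are hypothetical, chosen only for this example; they are not taken from the actual system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SourceRef:
    """Hypothetical pointer to the exact page a fact came from."""
    document_id: str
    page: int  # 1-based page number


@dataclass(frozen=True)
class Fact:
    """An extracted value that never travels without its source."""
    name: str
    value: str
    source: SourceRef


@dataclass
class Answer:
    """An answer plus the facts it was built from."""
    text: str
    facts: List[Fact] = field(default_factory=list)

    def citations(self) -> List[SourceRef]:
        # Walk the chain back to source pages, dropping duplicates
        # while preserving the order in which facts were used.
        seen: List[SourceRef] = []
        for fact in self.facts:
            if fact.source not in seen:
                seen.append(fact.source)
        return seen


# Illustrative data only: document name and page are made up.
deductible = Fact("deductible", "$500", SourceRef("policy.pdf", 12))
answer = Answer("The deductible is $500.", [deductible])
```

Because the source reference is a required field on every fact, an answer with no traceable origin cannot be constructed in the first place, which is the point of putting provenance first.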

Getting there required treating ingestion as the hard part, not retrieval. Documents are parsed visually rather than as text streams, because layout carries meaning that text extraction destroys. Extraction is schema-aware, so a benefits document and a patent each yield structured fields suited to their type, not undifferentiated prose. And extraction is verified: an independent pass checks what was extracted against the source and repairs what fails, because a pipeline that cannot doubt itself cannot be trusted at scale.
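
The verification step can be sketched in a few lines. This is a toy version under strong assumptions: pages are plain strings, "supported" means the value appears verbatim on the cited page, and "repair" means re-pointing the fact to the first page that does contain it. A real verifier would presumably be far richer; all names and sample data here are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Extracted:
    """A field value together with the page the extractor claims it is on."""
    name: str
    value: str
    page: int  # 1-based


def supported(item: Extracted, pages: Dict[int, str]) -> bool:
    # Assumption: verbatim presence on the cited page counts as support.
    return item.value in pages.get(item.page, "")


def repair(item: Extracted, pages: Dict[int, str]) -> Optional[Extracted]:
    # Try to fix a wrong page pointer by finding where the value really is.
    for number in sorted(pages):
        if item.value in pages[number]:
            return replace(item, page=number)
    return None  # nothing in the source supports this value


def verify_all(
    items: List[Extracted], pages: Dict[int, str]
) -> Tuple[List[Extracted], List[Extracted]]:
    kept: List[Extracted] = []
    rejected: List[Extracted] = []
    for item in items:
        if supported(item, pages):
            kept.append(item)
            continue
        fixed = repair(item, pages)
        if fixed is not None:
            kept.append(fixed)
        else:
            rejected.append(item)
    return kept, rejected


# Illustrative data only.
pages = {1: "Plan summary", 2: "Annual deductible: $500"}
items = [
    Extracted("deductible", "$500", 1),  # right value, wrong page
    Extracted("copay", "$20", 2),        # value not in the source at all
]
kept, rejected = verify_all(items, pages)
```

In this run the deductible is kept with its page corrected to 2, while the copay is rejected because no page supports it. The design choice worth noting is that the checker consults only the source, not the extractor's reasoning, so it can disagree with the extractor.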

Retrieval then combines semantic and lexical search with reranking, which is well-trodden ground. The differentiating property was never the search. It is that the system can show its work, page by page, and that property is what moved customers from demos to production.
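
For readers unfamiliar with combining semantic and lexical results, one common approach is reciprocal rank fusion. The text above does not say which fusion method the system uses, so treat this as a stand-in technique rather than a description of it. The page identifiers are made up, and the reranking stage (typically a separate model applied to the fused list) is left out.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of ids into one.

    Each id scores 1 / (k + rank) per list it appears in; k = 60 is a
    conventional default that damps the influence of top ranks.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Illustrative rankings only: ids stand for source pages.
semantic = ["p12", "p3", "p7"]   # e.g. from an embedding search
lexical = ["p12", "p9", "p3"]    # e.g. from a keyword search
fused = reciprocal_rank_fusion([semantic, lexical])
```

Pages that both searches agree on rise to the top, and because the fused items are page identifiers, the result still points at something a reader can open and check.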