Authorization for answers

Picture an assistant over company data answering “how much parental leave do I get?” fluently, accurately, and for the wrong person. The industry has gotten serious about securing what AI systems do and has barely begun on what they may know. The proof is public: enterprise copilots that honor file permissions faithfully keep surfacing exactly what those permissions over-shared, and rollouts stall on it. Deciding who may receive an answer is its own problem, and it changes shape with the data underneath.

Everything depends on identity reaching the query. The cardinal anti-pattern authenticates the user at the front door and then reads everything as a service account, which demotes every downstream control to decoration. The identity the user actually proved has to resolve to who they are in the data and travel with the request into retrieval, and the path has to fail closed in the substrate: when identity goes missing, the row filters match zero rows, not all of them. I learned the embarrassing corollary firsthand. Row policies are inert if the connection runs as a role privileged enough to bypass them, so the load-bearing step is dropping privilege on every request.

Where data has shape, authorization has something to grip, and the problem is close to solved. A person sees their own records, their reporting subtree, what their role entitles them to; sensitive columns are gated separately from rows; entitlements join against the data’s own fields, identifier to identifier, never by fuzzy match. The engine enforces all of it under every query the agent generates. Enterprise search was trimming results against access lists decades ago; the database does the modern version better than anything bolted on above it.

Prose gives authorization nothing to grip. A paragraph has no tenant column and does not declare its audience, so the honest tool is coarser: document-level provenance tags, derived deterministically from a document’s origin, gating retrieval through the same fail-closed path. Coarse means a document is visible or it is not, with no redacting a paragraph, and in a real corpus most documents are legitimately shared and stay untagged. Tags should run advisory before they run enforced, because an auto-derived tag that gates access is itself a claim that needs auditing.

A trap hides here. Ingestion deduplicates by content hash, correctly; the same bytes should not become two documents. But the same bytes can arrive once for each of two audiences, leaving one row, and if its tags reflect only the first arrival, the second audience loses a document they were explicitly given. Fail-closed design makes the loss invisible: no error, no leak, just an answer that stops arriving for people entitled to it. So tags on deduplicated content must be the union of every path that delivered it. Failures of authorization are silent in exactly the way failures of retrieval are loud, which is why they survive testing.

Some questions the database cannot decide, because they are ambiguous about their own audience: an anonymous session, an answer that differs by company within one corpus, a boundary the asker does not know exists. Refusing over-blocks, and a refusal aimed at a boundary confirms the boundary is there. Answering leaks. What remains is for the agent to ask for the details that make the question answerable, which company, which plan, until it can respond from what this asker may know. Asking is disambiguation, not authentication: the reply narrows the question and never widens access, because the verified identity still gates the data underneath. And the agent must never echo the internal identifiers the authorization runs on.

The machinery should land only where the data demands it. A corpus with nothing personal and nothing segmented should pay none of this tax; who-may-know-what is a per-corpus policy.