Why your knowledge graph isn't helping your RAG

We came within a day of publishing the fashionable conclusion. We had built graph-augmented retrieval over a real industrial corpus, in the lineage of the published systems marrying entity graphs to personalized PageRank, and run five honest evaluations against plain vector retrieval. The graph tied every one. Half the industry has quietly decided GraphRAG is hype, and our data appeared to agree. We audited the system and found the published architecture carries assumptions no enterprise corpus satisfies.

The assumptions come from the benchmarks. The published systems are validated on multi-hop question sets distilled from Wikipedia, which quietly pre-solves the hardest sub-problem in the pipeline: deciding which mentions refer to the same thing. Its entities arrive canonicalized, one page per thing, described in clean prose, almost never in tables. An enterprise corpus is the inverse: scanned documents, OCR noise, decades of vendor phrasings for the same part or policy, and the densest facts in tables. Port the architecture and you port the assumption that disambiguation is nearly free. That assumption fails first.

The default resolver, the one extraction pipelines ship out of the box and we inherited like everyone does, merges any two mentions that share an alias, transitively. On a real corpus that rule chains hundreds of distinct entities into one node, because generic aliases bridge everything; the clustering literature has called this single-linkage chaining for fifty years, and a model emitting aliases generously industrializes it. The harder discovery was the case no rule can fix: two legally distinct but affiliated companies shared most of their surface forms. That they must stay separate is tribal knowledge, obvious to any employee and present nowhere in the text. Cosine similarity cannot rescue this, because real-corpus embeddings sit in a band so compressed that no threshold separates synonym from coincidence; at our threshold, two different people’s first names cleared the merge bar. What did discriminate was embedding a model-written definition of each term and having a model gate every merge, and that is expensive and still partial. The literature treats entity resolution as a preprocessing detail.
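
The chaining failure is easy to reproduce at toy scale. The sketch below is not the resolver from any particular pipeline; it is a minimal union-find rendering of the rule as described above (merge on any shared alias, transitively), and the function name, mention ids, and aliases are all invented for illustration.

```python
from collections import defaultdict


def merge_by_shared_alias(mentions):
    """Cluster mentions that share any alias, transitively (single linkage).

    `mentions` maps a mention id to its set of aliases. Hypothetical helper,
    written only to illustrate the merge rule described in the text.
    """
    parent = {m: m for m in mentions}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # The first mention seen for an alias anchors it; every later mention
    # carrying the same alias is merged into the anchor's cluster.
    first_seen = {}
    for mention, aliases in mentions.items():
        for alias in aliases:
            key = alias.lower()
            if key in first_seen:
                union(first_seen[key], mention)
            else:
                first_seen[key] = mention

    clusters = defaultdict(set)
    for mention in mentions:
        clusters[find(mention)].add(mention)
    return sorted(sorted(c) for c in clusters.values())


# Invented toy data: three distinct parts bridged by two generic aliases.
mentions = {
    "pump_A": {"P-100 pump", "the unit"},
    "valve_B": {"V-7 valve", "the unit", "the assembly"},
    "motor_C": {"M-3 motor", "the assembly"},
    "policy_D": {"travel policy"},
}

clusters = merge_by_shared_alias(mentions)
for cluster in clusters:
    print(cluster)
```

The pump and the motor share no alias with each other, yet they land in one cluster because the valve bridges them. Scale the generic aliases up to a real corpus and the same mechanism produces the hundreds-into-one node.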

The field’s evidence is contaminated from both directions. Re-evaluations keep finding that published GraphRAG gains shrink sharply once judge biases are removed, and the negative verdicts, including the one we nearly shipped, usually measure implementation artifacts. Ours is a clean specimen. Our fusion layer intersected the graph’s candidates with the vector pool: a passage received a graph vote only if vector search had already retrieved it. Surfacing the passage that shares no vocabulary with the query but connects to it through entities is the one capability a graph adds over embeddings, and the data flow made it impossible by construction. The architecture diagram said graph-augmented retrieval; the system was a re-ranker. The diagnostic is one count: how many passages in the final context arrived from the graph alone. Ours had always been zero. We have found no public writeup of this trap; we doubt we were alone in it. We had measured the bug and called it the concept.
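
To make the trap concrete, here is a minimal sketch of the two data flows and of the one-count diagnostic. It is not our production fusion layer: the function names, passage ids, and scores are invented, and simply adding a graph vote to a vector score is an assumption made for brevity (a real system would normalize the two scales first).

```python
def fuse_intersect(vector_hits, graph_hits, k):
    """Graph votes only re-rank passages that vector search already returned."""
    scored = dict(vector_hits)
    for pid, vote in graph_hits.items():
        if pid in scored:  # graph-only passages are silently dropped here
            scored[pid] += vote
    return sorted(scored, key=lambda p: (-scored[p], p))[:k]


def fuse_union(vector_hits, graph_hits, k):
    """Graph candidates can enter the final context on their own."""
    scored = dict(vector_hits)
    for pid, vote in graph_hits.items():
        scored[pid] = scored.get(pid, 0.0) + vote
    return sorted(scored, key=lambda p: (-scored[p], p))[:k]


def graph_only_count(final_context, vector_hits):
    """The diagnostic: passages in the final context that vector search never retrieved."""
    return sum(1 for pid in final_context if pid not in vector_hits)


# Invented toy scores: p9 is reachable only through the graph.
vector_hits = {"p1": 0.9, "p2": 0.8, "p3": 0.7}
graph_hits = {"p2": 0.5, "p9": 1.0}

reranked = fuse_intersect(vector_hits, graph_hits, k=3)
augmented = fuse_union(vector_hits, graph_hits, k=3)

print(reranked, graph_only_count(reranked, vector_hits))
print(augmented, graph_only_count(augmented, vector_hits))
```

Under the intersecting flow the count is zero for every query, whatever the graph contains, which is why it is worth logging before trusting any comparison between the two retrievers.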

After every repair, the graph got better at the thing graphs are for: recall on bridge questions, the ones that hop through an entity, roughly doubled. The overall scoreboard still tied, and this time the tie carried information. The corpus’s densest facts live in tables, invisible to prose-based extraction by construction, and the hardest questions were aggregations, counts and comparisons across everything: JOINs, not walks. The benchmark questions are multi-hop trivia, exactly the shape a graph traverses well. Enterprise questions mostly are not that shape, and the question distribution was the deciding variable.

We came out convinced that enterprise GraphRAG lives or dies on disambiguation and question shape, two problems the published evidence barely touches because its benchmarks define them away, and that any verdict on the approach, ours included, means nothing until someone traces where its passages came from.