In August 2025, Google DeepMind published a proof that should worry anyone running a production RAG pipeline: a single-vector similarity retriever cannot represent all task-relevant subsets of a growing memory store, no matter how large the embedding model or how good the training data.
That's not a tuning problem. It's a formal lower bound on the architecture behind Pinecone, Weaviate, Milvus, and Qdrant. And it explains why teams building LLM agents keep hitting the same wall: hybrid context storage, pairing vector databases with graph databases, is the pattern that's actually closing the gap.
TL;DR
- A 2025 DeepMind paper proves single-vector retrieval is provably lossy for growing memory stores. This is structural, not fixable with a better embedding model.
- Hybrid systems win on multi-hop reasoning: Microsoft GraphRAG wins LLM-judge comprehensiveness in 72, 83% of comparisons against vector-only RAG; HippoRAG reports +20% multi-hop accuracy while running 10, 30× cheaper per query than iterative RAG.
- The costs are real: GraphRAG indexing runs ~$500 per 1M-token corpus, and dual-store consistency is your problem to engineer.
- The 2024, 2025 wave of native vector indexes in graph engines (Neo4j 5.x, Memgraph, Neptune Analytics) changes the question from "vector or graph" to "one engine or two."
What is hybrid context storage?
Hybrid context storage stores dense embeddings alongside an explicit property graph and fuses vector similarity with graph traversal at query time. The agent gets the top-k chunks by semantic similarity plus everything reachable from the relevant entities in N hops, instead of whichever chunks happen to sit nearest the query in embedding space.
That second half matters more than it sounds. A vector store can't answer "give me everything two hops from this entity." It can only answer "give me things that look like this sentence."
Why do vector databases fail LLM agents?
Vector databases flatten document structure into a single point in embedding space, which destroys the relationship information that multi-hop questions depend on. When an answer requires chaining facts across two or more documents, similarity search retrieves the chunks nearest the query, not the chunks whose relationships are relevant.
Towards Data Science's "Vector Search Is Not All You Need" names the failure plainly: encoding flattens structure, so there's no chain of reasoning to follow. The LightRAG paper makes the same case, blaming flat representations for "fragmented answers" that can't capture inter-document relationships.
The problems compound at the systems level. Pinecone implements multi-tenancy as namespaces layered over the index with metadata filters. Weaviate's engineering blog calls out competitors directly for treating multi-tenancy as an afterthought, warning of data leakage and messy maintenance. And PlanetScale notes that HNSW needs the whole dataset in RAM and can't update incrementally, which is hostile to the tenant-partitioned, hierarchical corpora agents actually need.
None of this means vector databases are slow. Pinecone's dedicated read nodes serve 135M vectors at 600 QPS with p99 of 96 ms. They're operationally excellent at the thing they do. The thing they do is just narrower than agent workloads require.
The DeepMind result reframes the entire debate: multi-hop retrieval failure is not a tuning problem you can embed your way out of. It's a representational limit of single-vector similarity, and the fix has to come from the architecture.
How much better is hybrid retrieval? The benchmark evidence
Peer-reviewed multi-hop QA benchmarks consistently favor graph-augmented retrieval, with effect sizes that depend heavily on how strong your vector-only baseline is. The cleanest numbers come from three independent lines of work.
Microsoft GraphRAG extracts an entity-relation graph with an LLM, builds hierarchical community summaries, and answers corpus-wide questions by traversing them. On the paper's LLM-as-judge evaluation over a ~1M-token corpus, it wins comprehensiveness 72, 83% of the time and diversity 62, 82% against naive vector RAG, while compressing tokens 9, 43× at the root level.
Microsoft shipped it to GitHub in July 2024.
HippoRAG (NeurIPS 2024) takes a different route: open information extraction builds the graph, then Personalized PageRank retrieves at query time in a single algorithmic pass. The result is +20% accuracy on multi-hop QA versus iterative RAG, at 6, 13× faster query time and 10, 30× lower cost.
HippoRAG 2 (ICML 2025) averages 59.8 F1 across seven RAG benchmarks versus 57.0 for NV-Embed-v2, the strongest dense retriever tested.
Production deployments back this up. Lettria's Qdrant + Neo4j stack for regulated documents in pharma and aerospace indexes over 100M vectors at under 200 ms p95, with a reported accuracy gain of more than 20% over vector-only RAG. That's vendor-reported, but it's the cleanest published production reference for a dual-store hybrid at scale.
The honest caveat: GraphRAG's headline wins came against a naive baseline with no query rewriting, no reranking. Anthropic's contextual retrieval closes much of that gap without any graph at all.
A 67% reduction in failed retrievals from BM25 and reranking, with zero graph infrastructure, is the number every hybrid pitch has to beat.
The real costs: indexing dollars, dual-store consistency, scarce expertise
Hybrid architectures add five distinct costs: indexing-time LLM tokens, dual-store synchronization, cross-engine query planning, two observability surfaces, and expertise in both Cypher and ANN tuning. None is a showstopper alone. Together they raise the bar for small teams.
The headline cost is indexing. Full GraphRAG runs roughly $500 in LLM calls per 1M-token corpus with a GPT-4-class model, driven by entity extraction and community summarization. A PVLDB 2025 analysis independently confirms GraphRAG consumes about 10× more tokens than RAPTOR or HippoRAG at comparable accuracy.
The mitigations arrived fast:
- LazyGraphRAG (Microsoft Research, November 2024) defers graph construction to query time, cutting indexing cost to 0.1% of full GraphRAG at comparable quality on Microsoft's own benchmarks. Independent replication is still thin, so treat the parity claim as vendor-stated.
- LightRAG adds new facts as an incremental union of nodes and edges, avoiding the full reindex that pure vector pipelines need on fresh documents.
- Native single-engine hybrids eliminate the consistency problem outright. Neo4j 5.x, Memgraph, Amazon Neptune Analytics, and NebulaGraph v5.2 all fuse vector similarity and graph traversal in one query plan. Memgraph claims up to 132× higher mixed-workload throughput than Neo4j in its own Benchgraph suite, though that's a vendor stress test, not an independent benchmark.
The latency objection has a direct answer too. Three sequential round-trips (vector search, traversal, rerank) won't hit sub-100 ms p99. Fused single-plan queries or co-located ingest transactions will, and Lettria's sub-200 ms p95 over 100M+ vectors is the existence proof.
When you don't need a graph database at all
The most credible counterargument is the Postgres camp: for most production workloads, vector plus BM25 plus a reranker closes enough of the gap that a second database isn't justified yet. TigerData's 2026 position piece argues the real enterprise problem is seven-database sprawl, not missing graph features.
They have a point. Pinecone's Obviant case study reports a 30% relevance lift from sparse+dense retrieval alone, over 120M vectors at under 50 ms p50. Anthropic's contextual retrieval numbers above tell the same story. Weaviate's hybrid search fuses BM25 and HNSW in one engine and covers a large share of what people reach for graphs to fix.
But notice what these arguments don't refute: the DeepMind lower bound still holds. Hybrid keyword+vector fixes the jargon problem. It does not give you "everything two hops from this entity." If your agents answer questions that genuinely require traversing relationships, no amount of reranking substitutes for edges.
What this means for you
Match the architecture to the query shape, not the hype cycle:
| Workload | Recommended architecture | Why |
|---|---|---|
| Single-corpus semantic search (support bots, FAQ) | Vector + BM25 + reranker | 67% fewer failed retrievals, no graph overhead |
| Multi-hop QA and sensemaking on private corpora | GraphRAG on Neo4j, or LightRAG / HippoRAG | 72, 83% comprehensiveness wins, +2.8 F1 over best dense retriever |
| Frequently updated knowledge | LazyGraphRAG or LightRAG incremental, native single-engine hybrid | Avoids $500 reindex runs and dual-store drift |
| Already on Postgres | pgvector + pgvectorscale first | Credible baseline before adding any new store |
| Graph + vector at 100M+ scale | Qdrant + Neo4j with co-located query layer, or AWS Bedrock GraphRAG managed | Lettria's <200 ms p95 is the published reference |
Three practical moves for 2026. First, prototype on a native hybrid engine (Neo4j 5.x or Memgraph) before committing to a dual-store deployment; you may never need the second system.
Second, exhaust contextual retrieval and reranking before adding a graph, because that baseline is what your hybrid gains will be measured against. Third, if you do go hybrid, wire the graph in through the Neo4j MCP server so your agents invoke traversals as native tool calls instead of hand-rolled glue.
The field's direction is unambiguous. Graph engines grew vector indexes, vector engines grew sparse hybrid search, and managed stacks like Bedrock Knowledge Bases now ship GraphRAG as a checkbox. The question is no longer whether to combine similarity with structure. It's how many engines you want to operate while doing it.
Sources
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization (arXiv), GraphRAG win rates and token compression figures
- HippoRAG: Neurobiologically Inspired Long-Term Memory for LLMs (arXiv), multi-hop accuracy and cost numbers
- LightRAG: Simple and Fast Retrieval-Augmented Generation (arXiv), incremental graph updates and token efficiency
- Contextual Retrieval (Anthropic Engineering), 49%/67% failed-retrieval reductions
- New DeepMind research reveals a fundamental limit in vector embeddings (TechTalks), coverage of the single-vector lower bound proof
- Vector Search Is Not All You Need (Towards Data Science), multi-hop failure modes of dense retrieval
- GraphRAG: New tool for complex data discovery (Microsoft Research), GraphRAG open-source release
- Build GraphRAG applications using Amazon Bedrock Knowledge Bases (AWS), managed GraphRAG blueprint with Neptune Analytics
- Obviant case study (Pinecone), 30% relevance lift from sparse+dense, vendor-stated
- Why Postgres Wins for AI and Vector Workloads (TigerData), the just-use-Postgres counterargument
- Weaviate's Native Multi-Tenancy Architecture (Weaviate), multi-tenancy critique of namespace patterns
- Hybrid search (Weaviate docs), single-engine sparse+dense fusion
- PlanetScale vectors public beta (PlanetScale), HNSW memory and incremental-update limits
- Pinecone integrates AI inferencing with vector database (Blocks & Files), 135M-vector latency benchmark
- Unified Index for Range Filtered ANN (PVLDB 2025), independent token-cost comparison of GraphRAG variants
- Graph Database Performance Benchmark (Memgraph Benchgraph), vendor throughput stress test
- Model Context Protocol integrations for Neo4j (Neo4j), graph operations as MCP agent tools
