
Context Graphs: The Missing Layer Between Your Tools and Your Agents

Why flat RAG breaks agentic workflows, what a bi-temporal context graph actually is, and how to build one that holds up in production.

June 18, 2026 · 12 min read
Tags: context graph, agentic harness, context engineering

The average enterprise now runs 305 SaaS applications, up from 254 in 2023, according to Zylo's 2026 SaaS Management Index. Large enterprises average 473. Roughly 75% of employees acquire or build technology without IT oversight, and untracked apps add an estimated 30 to 40% on top of the managed footprint.

Now point an autonomous agent at that sprawl and ask it to plan a multi-step task. It fails, and it fails in ways flat retrieval can't fix.

That gap is why context engineering has converged on a new infrastructure layer. Perplexity CEO Aravind Srinivas put the claim bluntly: context graphs "will be the best way for businesses to enable and deploy agentic harnesses." This piece tests that claim against what actually ships in 2026.

TL;DR

A context graph is the temporal, decision-aware memory layer that sits between your tools and your agents. Flat RAG breaks on multi-hop questions, stale facts, and exception lineage.

A bi-temporal context graph fixes those by preserving relationships and tracking when each fact was true. The technique is real and productized; the loftier claims about "self-organizing god-mode" and "capturing tacit knowledge" are mostly aspirational.

Key takeaways

  • Flat vector RAG fails agentic workflows in three predictable ways: multi-hop blindness, chronological collisions, and missing exception lineage.
  • A context graph adds bi-temporal validity (valid time plus system time) and decision lineage on top of a knowledge graph.
  • It plugs into an agentic harness as the shared semantic substrate every execution pathway reads from.
  • On LongMemEval, a temporal graph (Zep/Graphiti) scores 63.8% on temporal reasoning versus 49.0% for a vector baseline, a 14.8-point edge.
  • Security must live inside the query pipeline. A "self-organizing" un-permissioned graph leaks regulated data.
  • Build for custom multi-hop and audit needs; buy for ACL sync and fast time-to-value.

What is a context graph, exactly?

A context graph is a dynamic, temporally aware knowledge representation that models semantic entities along with the operational logic, decision flows, and historical execution traces of a business domain. It is a continuously evolving memory substrate that maps processes over time, not just static states.

In February 2026, Gartner formalized the category, contrasting static "what/who" knowledge graphs with continuously evolving "how/why" context graphs that capture decision lineage, process auditability, and guardrailing. Forrester's February 2026 best-practices report framed it as a data fabric that ensures data is "not just accessible but understood and trusted."

The taxonomy is contested, and honestly so. Gartner analyst Afraz Jaffri argues that calling a graph a "context graph" is redundant, since graphs inherently hold context. Graphwise's Andreas Blumauer counters that the temporal validity and decision lineage make it a distinct evolutionary step.

Neo4j co-founder Emil Eifrem splits the difference: a context graph is the logical layer you get when operational decision traces are connected on top of a baseline knowledge graph.

The practical distinction that matters is in the table below.

| Layer | State model | Update mode | Best at |
| --- | --- | --- | --- |
| Vector RAG | Flat embeddings, stateless | Full chunk re-index | Semantic text match |
| Knowledge Graph | Semantic triples (S-P-O) | Periodic batch | Static domain entities |
| GraphRAG | Community summaries | Costly re-clustering | Global summarization of static corpora |
| Context Graph | Episodes + entities + communities, bi-temporal | Real-time incremental, non-lossy | Dynamic decision traces, point-in-time queries |

Why does flat RAG break agentic workflows?

Flat retrieval fails multi-step planning in three repeatable modes. Each one is a structural limitation, not a tuning problem.

Failure Mode A, the multi-hop blind spot. Ask "which employees worked on Project Alpha and contributed to the security audit that flagged the code issues Bob raised in last week's standup," and a vector DB returns disconnected chunks for "Project Alpha," "security audit," and "Bob's standup." It cannot preserve the links between them, so the model has to guess how the chunks relate and often infers the wrong connections.
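A minimal sketch of what the explicit links buy you on that question. The toy triples, the relation names (`RAISED`, `FLAGGED`, `CONTRIBUTED_TO`, `WORKED_ON`), and every person and ID below are invented for illustration; a real graph store would do the traversal in its own query language.

```python
from collections import defaultdict

# Hypothetical toy graph: (subject, relation, object) triples invented for illustration.
TRIPLES = [
    ("Bob", "RAISED", "issue-12"),
    ("audit-7", "FLAGGED", "issue-12"),
    ("Carol", "CONTRIBUTED_TO", "audit-7"),
    ("Dave", "CONTRIBUTED_TO", "audit-7"),
    ("Carol", "WORKED_ON", "Project Alpha"),
    ("Erin", "WORKED_ON", "Project Alpha"),
]

def index(triples):
    """Build forward and reverse lookups keyed by (node, relation)."""
    fwd, rev = defaultdict(set), defaultdict(set)
    for s, r, o in triples:
        fwd[(s, r)].add(o)
        rev[(o, r)].add(s)
    return fwd, rev

def alpha_staff_on_flagging_audit(triples, person, project):
    """Follow person -> issues -> audits -> contributors, then intersect with project staff."""
    fwd, rev = index(triples)
    issues = fwd[(person, "RAISED")]
    audits = set().union(*(rev[(i, "FLAGGED")] for i in issues))
    contributors = set().union(*(rev[(a, "CONTRIBUTED_TO")] for a in audits))
    return sorted(contributors & rev[(project, "WORKED_ON")])

print(alpha_staff_on_flagging_audit(TRIPLES, "Bob", "Project Alpha"))  # ['Carol']
```

Each hop is a lookup on a stored edge, so the answer follows from the data rather than from the model's guess about how three unrelated chunks fit together.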

Failure Mode B, chronological collision. A customer says on January 10 they prefer dark-colored products, then on March 15 says they only want white. A vector query retrieves both, because both are semantically similar. With no temporal boundaries in the structure, the LLM resolves the conflict by heuristic and often serves the stale preference.

Failure Mode C, the exception lineage deficit. Operational reality lives in Slack threads and emergency Jira updates. When an agent verifies an invoice against a contract, a flat DB retrieves the standard pricing doc but cannot reach the thread where a VP approved a temporary 15% discount. The agent applies the literal rule and triggers a billing dispute.
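Here is one way the invoice check could look once the approval is stored as a first-class record. This is a sketch under assumptions: the `PricingException` shape, the contract ID, the dates, and the `slack://hypothetical-thread` source string are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class PricingException:
    """A recorded override, e.g. extracted from a chat thread (hypothetical shape)."""
    contract_id: str
    discount_pct: float
    approved_by: str
    source: str                 # where the approval was recorded
    valid_from: date
    valid_to: Optional[date]    # None means still in force

def expected_amount(list_price, contract_id, on, exceptions):
    """Apply the exception active for this contract on the invoice date, if any."""
    active = [
        e for e in exceptions
        if e.contract_id == contract_id
        and e.valid_from <= on
        and (e.valid_to is None or on < e.valid_to)
    ]
    if not active:
        return list_price, None
    latest = max(active, key=lambda e: e.valid_from)
    return round(list_price * (1 - latest.discount_pct / 100), 2), latest

exceptions = [
    PricingException("C-100", 15.0, "VP Sales", "slack://hypothetical-thread",
                     date(2026, 3, 1), date(2026, 6, 1)),
]
amount, why = expected_amount(1000.0, "C-100", date(2026, 4, 10), exceptions)
print(amount, why.approved_by if why else "standard pricing")
```

Returning the exception alongside the amount keeps the lineage attached to the decision, so the agent can cite who approved the discount and where.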

These are the cases that turn a demo into an incident.

How a context graph plugs into an agentic harness

An agentic harness is the orchestration framework around a foundation model: tool interfaces, state management, safety guardrails, and deterministic routing. It turns a predictive LLM into a long-horizon operational processor.

The context graph plugs in as the shared, persistent semantic substrate that every pathway reads from. The state router, the policy guardrails (often OPA), and the tool-execution layer all query the same graph: its episode tracker, entity resolvers, and community summaries.

The adoption curve is steep. Gartner projects that by 2029, 80% of AI agent platforms using advanced cognitive models will feature aligned context layers, versus under 10% in early 2026. It also predicts that by 2028, more than half of enterprise agentic systems will use graph-based context.

Vendors are racing to own the layer. At the AWS Summit NYC on June 17, 2026, AWS announced Bedrock AgentCore and AWS Context, a managed knowledge-graph service that maps relationships across databases and business rules at runtime.

Its "active learning loop" observes which sources and join paths produce accurate results and refines mappings without manual re-curation, publishing metadata to open Apache Iceberg tables in S3. Rubrik wraps it with a deterministic policy engine that enforces security at the gateway, outside the agent's planning loop.

What bi-temporal modeling actually buys you

The technical core of a context graph is the bi-temporal schema. Every fact, node, and edge tracks two timelines.

Valid time is when a fact is true in the real world. System time is when the database ingested or modified the record. The leading open-source engine, Graphiti, which powers Zep's Context Lake, stamps every edge with valid_from, valid_to, and invalid_at.

When a fact is superseded, you set valid_to instead of deleting it. History is preserved. That property is what makes the temporal knowledge graph auditable.
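A toy in-memory version of that supersede-instead-of-delete rule, replaying the customer color preference from earlier. It is a sketch only: the `FactStore` class is hypothetical, the field names follow this article rather than any engine's actual API, and the year 2026 is assumed for the dates.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None   # None means currently true

class FactStore:
    """Toy in-memory store; field names follow the article, not a specific engine's API."""

    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, predicate, value, at):
        """Close the currently open fact for this key, then append the new one."""
        for f in self.facts:
            if f.subject == subject and f.predicate == predicate and f.valid_to is None:
                f.valid_to = at          # supersede: history stays in the store
        self.facts.append(Fact(subject, predicate, value, valid_from=at))

    def value_at(self, subject, predicate, at):
        """Return the value whose validity window contains `at`, or None."""
        for f in self.facts:
            if (f.subject == subject and f.predicate == predicate
                    and f.valid_from <= at
                    and (f.valid_to is None or at < f.valid_to)):
                return f.value
        return None

store = FactStore()
store.assert_fact("customer-1", "PREFERS_COLOR", "dark", datetime(2026, 1, 10))
store.assert_fact("customer-1", "PREFERS_COLOR", "white", datetime(2026, 3, 15))

print(store.value_at("customer-1", "PREFERS_COLOR", datetime(2026, 2, 1)))   # dark
print(store.value_at("customer-1", "PREFERS_COLOR", datetime(2026, 4, 1)))   # white
print(len(store.facts))                                                      # 2
```

Both preferences survive in the store, but only one is valid at any instant, so the conflict is resolved by the data model rather than by the LLM's heuristics.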

The compliance payoff is concrete. For GDPR Article 17 deletions, you set valid_to = NOW() and invalid_at = NOW(), which excludes data from active inference while keeping the audit trail. For the EU AI Act, the bi-temporal model lets an auditor reconstruct the graph's exact semantic state on a historical date and inspect the reasoning trace as it was.
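The two compliance moves can be sketched in a few lines. Assumptions to flag: `recorded_at` is a name invented here for the system-time ingestion stamp, the data is made up, and the snapshot uses a single instant for both timelines, which is a simplification of a full bi-temporal query.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Edge:
    source: str
    relation: str
    target: str
    valid_from: datetime                    # valid time: true in the world from here
    recorded_at: datetime                   # system time: when the store ingested it (hypothetical name)
    valid_to: Optional[datetime] = None     # valid time: no longer true from here
    invalid_at: Optional[datetime] = None   # system time: when the store retired the record

def visible_at(edge, t):
    """True if the edge was both recorded and in force at instant t."""
    known = edge.recorded_at <= t and (edge.invalid_at is None or t < edge.invalid_at)
    in_force = edge.valid_from <= t and (edge.valid_to is None or t < edge.valid_to)
    return known and in_force

def erase(edges, subject, now):
    """Soft-delete every open edge that starts at `subject`; rows are kept for the audit trail."""
    for e in edges:
        if e.source == subject and e.invalid_at is None:
            e.valid_to = now if e.valid_to is None else e.valid_to
            e.invalid_at = now

def snapshot(edges, t):
    """Reconstruct the set of edges an agent could have used at instant t."""
    return [(e.source, e.relation, e.target) for e in edges if visible_at(e, t)]

edges = [
    Edge("user-42", "LIVES_IN", "Berlin", datetime(2025, 5, 1), datetime(2025, 5, 2)),
    Edge("user-42", "OWNS", "account-9", datetime(2025, 6, 1), datetime(2025, 6, 1)),
]
erase(edges, "user-42", now=datetime(2026, 2, 1))

print(snapshot(edges, datetime(2026, 3, 1)))   # [] : excluded from active inference
print(snapshot(edges, datetime(2025, 7, 1)))   # both edges, as the graph stood then
print(len(edges))                              # 2 : audit trail retained
```

The same predicate serves both purposes: live retrieval calls it with the current time, and an audit calls it with the historical date under review.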

Graphiti models memory across three tiers: an episode subgraph of raw interaction logs, a semantic entity subgraph of resolved concepts with validity ranges, and a community subgraph that clusters entities into functional groups. The arXiv paper documents the full architecture.

Does the temporal layer actually measure better?

Yes, and the gap shows up on the benchmarks built for this. LongMemEval (Wu et al., ICLR 2025) stresses five dimensions, including temporal reasoning and knowledge updates, with problems up to 1.5M tokens.

The headline result: tracking when facts were true delivers a measurable edge on temporal questions, and the curated subgraph slashes token cost versus dumping the full transcript into context.

Figure: Temporal reasoning accuracy on LongMemEval. Vector RAG baseline (Mem0): 49%. Temporal graph (Zep/Graphiti): 63.8%.

The token math is just as important for production cost. A full-context baseline burns roughly 26,000 tokens per query at around 29 seconds of latency. A selective pipeline like Mem0's 2026 build cuts that by about 90% to roughly 1,800 tokens at sub-second latency, for a small accuracy trade.
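Worked through, the two token figures above come out to a reduction of roughly 93%, which the text rounds to "about 90%". The query volume and per-token price below are hypothetical placeholders, included only to show how the saving scales; substitute your own.

```python
def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100

full_context_tokens = 26_000   # figure quoted above
selective_tokens = 1_800       # figure quoted above

saving = reduction_pct(full_context_tokens, selective_tokens)
print(f"{saving:.1f}% fewer tokens per query")   # 93.1% fewer tokens per query

# Hypothetical workload and price, purely to show how the saving scales.
queries_per_day = 10_000
usd_per_million_input_tokens = 3.00

def daily_cost(tokens_per_query):
    """Input-token spend per day under the assumed volume and price."""
    return tokens_per_query * queries_per_day / 1_000_000 * usd_per_million_input_tokens

print(f"${daily_cost(full_context_tokens):.2f} vs ${daily_cost(selective_tokens):.2f} per day")
```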

Each bi-temporal graph hop adds 50 to 150 ms, but serving a curated subgraph instead of thousands of raw docs cuts end-to-end perceived latency by up to 90%.

One honest caveat on the numbers: most of these figures are self-reported by the engine vendors on overlapping but not identical setups. Treat them as directional, and re-run the evaluation on your own data before committing.

Which big claims are real, and which are aspirational?

Two of the loudest claims need a skeptical read.

"A self-organizing god-mode view." A unified, un-permissioned graph is a security liability, not a feature. Enterprise data is governed by overlapping ACLs, roles, and regional residency rules. Cyberhaven's 2026 report found 39.7% of enterprise AI interactions involve sensitive or regulated data, so a graph that ingests HR salaries, finance, and engineering and exposes one unified model will leak.

The fix is the Glean pattern: separate the logical graph from physical retrieval privileges. Connectors continuously ingest source-system ACLs, OAuth resolves the user principal, and the pipeline filters out nodes the user can't see in the source system before context ever reaches the model.
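The filtering step can be sketched as follows. This is not Glean's implementation: the `Node` shape, the group names, and the sample content are assumptions, and the sketch takes the caller's groups as already resolved by your identity provider.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: str
    text: str
    allowed_groups: frozenset   # mirrored from the source system's ACL by a connector

def filter_for_principal(nodes, principal_groups):
    """Keep only nodes the caller could open in the source system."""
    groups = frozenset(principal_groups)
    return [n for n in nodes if n.allowed_groups & groups]

def build_context(nodes, principal_groups):
    """Filter first, then serialize: unauthorized text never reaches the prompt."""
    visible = filter_for_principal(nodes, principal_groups)
    return "\n".join(f"[{n.node_id}] {n.text}" for n in visible)

retrieved = [
    Node("n1", "Q3 roadmap summary", frozenset({"engineering", "product"})),
    Node("n2", "Salary bands by level", frozenset({"hr"})),
    Node("n3", "Incident postmortem", frozenset({"engineering"})),
]

# Assume the identity provider resolved the caller to these groups.
print(build_context(retrieved, {"engineering"}))
```

The check is deny-by-default: a node with an empty `allowed_groups` set matches no caller, so a connector that fails to sync an ACL hides the node instead of exposing it.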

Security is an architectural constraint inside the query path, never something a "self-organizing" framework handles outside your IAM.

"Captures tacit knowledge." Polanyi's point was that people "know more than they can tell." Tacit knowledge is non-codifiable situated judgment. What a context graph actually captures is structured traces of explicit behavior: which tables get joined, which exception overrides were recorded in Slack, which workflows ran. That is a real advance over static retrieval. It is still a record of the explicit artifacts intuition leaves behind, not a digitization of the intuition.

Where context graphs still break in production

Four engineering bottlenecks bite hard past 100,000 nodes.

  1. Entity resolution collisions. A merge threshold set too low fuses "Project Alpha" with "Product Alpha" or two different John Smiths, corrupting every path through them. Set too high, it splits one real entity into duplicates and fragments related events.
  2. Temporal drift. Incremental label-propagation clustering runs in near-linear time, but over weeks the community assignments drift away from what a full run would produce, forcing periodic high-compute re-propagation that locks resources.
  3. Edge hallucinations. A weak extraction prompt parses "I wish our platform worked with Salesforce" as an INTEGRATES_WITH edge. Strict ontologies and validation are the only defense.
  4. Cascade invalidation failure. Terminating a contract must cascade to invalidate the derived plan, nested tasks, and permissions. Without ontology-driven cascade rules, stale derived facts stay active and contradict the truth.
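The fourth bottleneck is the easiest to make concrete. Below is a minimal sketch of a cascade over a derived-from map, assuming that map already exists; the fact IDs and the dictionary layout are hypothetical, and a production system would drive the walk from ontology rules rather than a hand-written dict.

```python
from collections import deque
from datetime import datetime

# Hypothetical dependency map: fact -> facts derived from it.
DERIVED = {
    "contract-7": ["plan-7"],
    "plan-7": ["task-7a", "task-7b", "perm-vendor-access"],
    "task-7a": [],
    "task-7b": [],
    "perm-vendor-access": [],
    "contract-8": ["plan-8"],
    "plan-8": [],
}

def cascade_invalidate(root, derived, valid_to, now):
    """Breadth-first walk that closes the root and everything derived from it."""
    queue, seen = deque([root]), {root}
    while queue:
        fact = queue.popleft()
        valid_to[fact] = now
        for child in derived.get(fact, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

valid_to = {fact: None for fact in DERIVED}   # None means still active
closed = cascade_invalidate("contract-7", DERIVED, valid_to, datetime(2026, 6, 1))

print(sorted(closed))
print([f for f, end in valid_to.items() if end is None])   # ['contract-8', 'plan-8']
```

The `seen` set keeps the walk safe if two facts derive from each other, and the unrelated contract stays active because nothing links it to the terminated one.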

Build or buy in 2026?

Match the decision to your constraints, not the hype.

| Pick this if... | Choose |
| --- | --- |
| Custom multi-hop over proprietary objects, on-prem independence | Build: Graphiti or Neo4j + MCP |
| Point-in-time auditing in healthcare or finance | Build: Graphiti (bi-temporal native) |
| Strict ACL sync across fragmented apps, fast time-to-value | Buy: Glean |
| Zero-integration on AWS, learns from agent traffic | Buy: AWS Context |
| Query distributed relational DBs in place | Buy: Promethium |

Graphiti is the strongest open-source starting point for temporal reasoning. Microsoft GraphRAG is excellent for global summarization over static corpora but has no real-time or point-in-time support, so don't reach for it for live agent memory.

A concrete build playbook

If you're building, here is the sequence that holds up.

  1. Ingest high-velocity channels first. Wire in Slack, Jira, and CRM events before static directories. The exception lineage lives in the conversational logs.
  2. Ground every extraction in a schema. Validate LLM output against a formal ontology, add Neo4j unique constraints on entity names, and index valid_to and invalid_at for fast active-edge filtering.
  3. Write back decision traces. Have agents record (:Agent)-[:EXECUTED]->(:Decision) edges with outcomes, and let an SME flag certified paths so they rank higher in retrieval.
  4. Filter at query time. Resolve user entitlements before context reaches the model, and traverse only active edges where valid_to IS NULL and invalid_at IS NULL.
  5. Write Cypher deterministically. Use pre-written parameterized Cypher for all writes. Never let an LLM generate Cypher at runtime, or you invite injection and schema corruption.
  6. Evaluate in production, not on static similarity. Track point-in-time query accuracy, revocation latency (target under 3 minutes from access removal to the agent no longer retrieving the revoked content), multi-hop Recall@5, and a hallucination rate flagged by an LLM-as-judge critic.
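Steps 2 and 5 combine naturally into one gate in front of every write. The sketch below is illustrative: the ontology, the node labels, and the `modality` field (which assumes your extractor reports whether a statement was asserted or merely wished for) are all hypothetical. The Cypher strings are fixed templates, and extracted values only ever appear as `$parameters`.

```python
# Hypothetical ontology: relation -> (allowed subject label, allowed object label).
ONTOLOGY = {
    "INTEGRATES_WITH": ("Product", "Product"),
    "EXECUTED": ("Agent", "Decision"),
}

# One pre-written statement per relation; values travel only as parameters.
WRITE_TEMPLATES = {
    "INTEGRATES_WITH": (
        "MATCH (s:Product {name: $subject}), (o:Product {name: $object}) "
        "MERGE (s)-[r:INTEGRATES_WITH]->(o) SET r.evidence = $evidence"
    ),
    "EXECUTED": (
        "MATCH (s:Agent {name: $subject}), (o:Decision {name: $object}) "
        "MERGE (s)-[r:EXECUTED]->(o) SET r.evidence = $evidence"
    ),
}

def prepare_write(candidate, labels):
    """Return (cypher, params) for a valid candidate edge, or raise ValueError."""
    relation = candidate["relation"]
    if relation not in ONTOLOGY:
        raise ValueError(f"unknown relation: {relation}")
    if candidate.get("modality") != "asserted":
        raise ValueError(f"not a stated fact: modality={candidate.get('modality')!r}")
    want_s, want_o = ONTOLOGY[relation]
    got_s, got_o = labels.get(candidate["subject"]), labels.get(candidate["object"])
    if (got_s, got_o) != (want_s, want_o):
        raise ValueError(f"{relation} needs {want_s}->{want_o}, got {got_s}->{got_o}")
    params = {k: candidate[k] for k in ("subject", "object", "evidence")}
    return WRITE_TEMPLATES[relation], params

labels = {"OurPlatform": "Product", "Salesforce": "Product"}
wish = {"subject": "OurPlatform", "relation": "INTEGRATES_WITH", "object": "Salesforce",
        "modality": "desired", "evidence": "I wish our platform worked with Salesforce"}
try:
    prepare_write(wish, labels)
except ValueError as err:
    print("rejected:", err)
```

A candidate that passes yields a template and a parameter dict that can be handed to a driver call such as the Neo4j Python driver's `session.run(query, params)`. Note that the type check alone would have accepted the Salesforce wish, since both ends are products; it is the modality check that stops it.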

What would change my mind

The strongest counterargument is cost. If frontier context windows keep getting cheaper and long-context accuracy closes its ~30-point gap to oracle retrieval, the case for a separate graph layer weakens for smaller deployments. Watch the LongMemEval long-context scores over the next two releases.

For now, the durable conclusion holds. Agents that act over fragmented enterprise data need a layer that preserves relationships and tracks time. Build the bi-temporal substrate, enforce ACLs inside the query pipeline, and treat the grander marketing claims as a roadmap rather than a spec.

Frequently asked questions

What is a context graph?

A context graph is a dynamic, time-aware knowledge representation that models not just entities and their relationships, but the decision flows, exception precedents, and historical execution traces of a business. Unlike a static knowledge graph, every fact carries validity windows that record when it was true and when it was superseded, so an agent can reason about state at any point in time.

How is a context graph different from RAG?

Flat vector RAG retrieves disconnected text chunks by semantic similarity and is stateless. A context graph preserves relational links between facts, tracks when each fact was valid, and serves a curated subgraph instead of raw documents. That makes it far better at multi-hop questions, conflicting updates, and audits.

What is a bi-temporal model and why does it matter?

Bi-temporal modeling tracks two timelines per fact: valid time (when it was true in the real world) and system time (when the database recorded it). This lets an agent answer point-in-time questions, reconstruct the graph's exact state on a past date for an EU AI Act audit, and handle GDPR deletion by setting validity windows instead of destroying history.

Can context graphs really capture tacit knowledge?

Not in the strict sense. They capture structured traces of explicit, recorded behavior: which tables get joined, which exceptions got approved in Slack, which workflows ran. That is a major advance over static retrieval, but it records the artifacts that intuition leaves behind rather than digitizing the judgment itself.

Should I build or buy a context graph in 2026?

Build with Graphiti or Neo4j if you need custom multi-hop reasoning over proprietary objects, strict point-in-time auditing, or on-prem independence. Buy from Glean, AWS Context, or Promethium if you need real-time ACL sync across fragmented apps, fast time-to-value with pre-built connectors, or federated queries over distributed databases.