The advertised context window is a marketing number. The engineering number, the largest window in which your model still solves the task, is often smaller by a factor of 4 to 16.
That gap is the single most important fact in building agents in 2026, and almost every architecture decision that follows is a response to it.
Context engineering is the discipline that grew up around that gap. It is the practice of constructing, curating, and routing the information that reaches a model's context window at inference time.
Anthropic's engineering team defined it in late 2025 as filling the context window with just the right information at each step. Prompt engineering was a piece of it.
Retrieval, memory, tool plumbing, and state management are the rest.
This is the pillar guide. It covers the six branches that will cause you the most pain, in order: the context window itself, memory architectures, storage backends, retrieval patterns, externalized state, and the Model Context Protocol.
Then it covers the live debate over whether "agent memory" is even the right abstraction, and ends with a decision framework you can use on Monday.
TL;DR
Effective context is much smaller than advertised, so the core skill is curation, not stuffing. Build the boring version that works: a working scratchpad, a tested retrieval pipeline (hybrid search plus a cross-encoder rerank), one managed memory product, MCP-served tools, and aggressive prompt caching.
Long context and RAG are complements, not rivals. Treat neural memory and 1M-token "effective" context as things to track monthly and adopt when they earn their weight.
Key takeaways
- Plan for an effective operating zone of 32K to 200K tokens, not your model's headline window. Measure it; don't read it off a spec sheet.
- Reranking is the highest-leverage retrieval fix. A cross-encoder lifts nDCG@10 by 10 to 40% for under 100ms. If you aren't running one, you're leaving accuracy on the table.
- Most production agents need Mem0 or LangMem plus a vector DB plus a scratchpad, with GraphRAG reserved for genuine cross-document multi-hop work.
- MCP is the tool and context plane; A2A is the agent-to-agent plane. They compose. By early 2026 there were roughly 37,000 public MCP servers.
- Prompt caching saves 50 to 90% on input cost after the first turn. Every long-system-prompt agent should have it on.
- Treat MCP servers and retrieved documents as untrusted input. Tool poisoning and prompt injection are real attack classes.
Why the context window is the real bottleneck
The original Needle-in-a-Haystack test asked one question: can the model find a single planted fact at a given depth? Pass that, and you could put "1M context" on the model card. Then the field started demanding multi-fact reasoning, multi-hop questions, and structured outputs across the whole range. The picture fell apart.
The canonical result is Liu et al.'s Lost in the Middle (2023). Performance on multi-document QA forms a U-shaped curve by document position. Facts at the start and end are recalled well; facts in the middle are reliably lost, with drops of 10 to 30 percentage points.
It has been reproduced across OpenAI, Anthropic, Google, and open-source models, and sensitivity to where a fact sits in the context is now treated as a fact of life.
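One mitigation that follows from the U-shaped curve is to reorder retrieved documents so the strongest ones sit at the edges of the prompt and the weakest land in the middle. The sketch below is an illustration of that idea, not something the paper prescribes; the function name and the best-first input convention are assumptions.

```python
def reorder_for_edges(docs_best_first):
    """Place the strongest documents at the start and end of the context.

    Input is ranked best-first. Items alternate between a front list and a
    back list, and the back list is reversed, so the lowest-ranked documents
    end up in the middle of the final ordering.
    """
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(reorder_for_edges(ranked))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

Whether this helps on a given model is an empirical question, so it belongs behind the same evaluation as any other retrieval change.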
NVIDIA's RULER (2024) pushed past single needles into multi-hop, multi-key, and variable-tracking tasks, and showed most long-context models degrade sharply well before their advertised limit. The llm-stats RULER leaderboard tracks effective length for frontier models.
STRING (ICLR 2025) tightened it further: even with perfect retrieval, just having the tokens in context can hurt downstream reasoning. An EMNLP 2025 finding replicated that context length alone hurts performance despite perfect retrieval.
What is context rot?
Context rot is the non-linear degradation of LLM performance as the context window fills. Chroma Research, the team behind the Chroma vector database, coined the term in 2025 after testing twelve state-of-the-art models from OpenAI, Anthropic, Google, and Meta.
All of them degraded. The size of the safe zone before degradation depended on the model, the task, and the distractors present, and it was not predictable from token count alone.
The informal name for the bad region is the "dumb zone": the part of the window where the model is technically attending but not reasoning reliably. Note that Chroma's headline numbers are reported by a vendor with a stake in retrieval, so treat the exact figures as directional rather than gospel.
The qualitative finding is corroborated by the independent academic work above.
Advertised versus effective context, mid-2026
| Model | Advertised | Effective (RULER-class) |
|---|---|---|
| Gemini 2.5 Pro | 1M (2M preview) | Degrades past ~128K, 256K on multi-hop |
| Claude Sonnet 4.5 | 1M beta (200K standard) | ~200K on RULER |
| GPT-4.1 | 1M | Strong on retrieval, weaker on long-horizon reasoning |
| Llama 4 Scout | 10M marketed (~1.5M practical) | Community testing suggests much lower |
| Mistral Large 2 / Codestral | 128K (Codestral 256K) | ~64K on hard multi-hop |
These effective figures are best-available synthesis from leaderboards and community testing, not vendor-certified numbers. The honest takeaway is the pattern: the headline is marketing, the engineering number is smaller, and you have to measure it on your task.
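Measuring on your own task can start with something as small as the harness sketched below. It plants a known fact at several depths, grows the haystack, and reports the largest tested size at which accuracy stays above a threshold. Everything here is an assumption for illustration: `ask_model` is a hypothetical callable wrapping whatever model you use, size is counted in filler sentences rather than tokens, and a single planted fact is the weakest possible probe, so a real harness should swap in your actual task.

```python
def build_haystack(needle, filler_sentences, n_sentences, depth):
    """Return n_sentences of filler with the needle inserted at a relative depth (0.0 to 1.0)."""
    body = [filler_sentences[i % len(filler_sentences)] for i in range(n_sentences)]
    position = int(depth * n_sentences)
    body.insert(position, needle)
    return " ".join(body)


def effective_length(ask_model, lengths, depths, needle, question, expected, threshold=0.9):
    """Largest tested length whose accuracy across all depths is at or above threshold.

    ask_model is a hypothetical callable: prompt string in, answer string out.
    Lengths are tried in ascending order and the search stops at the first failure.
    """
    filler = ["The sky was grey that morning.", "Trains ran late all week."]
    best = 0
    for n in sorted(lengths):
        hits = 0
        for d in depths:
            prompt = build_haystack(needle, filler, n, d) + "\n\n" + question
            if expected in ask_model(prompt):
                hits += 1
        if hits / len(depths) >= threshold:
            best = n
        else:
            break
    return best
```

Run it with your real prompts, your real distractors, and a grader you trust, then design around the number it returns rather than the one on the model card.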
The design discipline follows directly. Don't try to use the whole window. The MemGPT-style paged context idea was invented specifically to dodge this bottleneck.
Agent memory architectures, and where each one wins
Memory is the most active product area in agent infrastructure right now. The taxonomy is short-term versus long-term, then by substrate (vector, graph, document, neural), then by write policy. The named systems below make genuinely different bets, so this is not a pick-the-fastest decision.
MemGPT (Packer et al., 2023) is the ancestor of nearly everything that followed. It treats the LLM context as a virtual memory hierarchy: a "main context" the model sees, and an external store that it pages information into and out of via function calls such as core_memory_append, archival_memory_search, and conversation_search. The model pages information the way an OS pages between RAM and disk. The successor is Letta, which keeps the virtual-context design and adds cloud-managed state and a memory-blocks abstraction.
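The paging idea can be sketched in a few lines. This is a toy, not MemGPT's or Letta's implementation: it borrows two function names from the paragraph above for readability, simplifies their semantics, counts tokens by whitespace, and searches the archive by word overlap instead of embeddings. The `PagedContext` class is hypothetical.

```python
from collections import deque


class PagedContext:
    """Toy virtual context: a bounded main context plus an unbounded archive."""

    def __init__(self, max_main_tokens):
        self.max_main_tokens = max_main_tokens
        self.main = deque()   # what the model would see
        self.archive = []     # external store, searched on demand

    @staticmethod
    def _tokens(text):
        # Crude whitespace count; a real system would use the model's tokenizer.
        return len(text.split())

    def core_memory_append(self, text):
        """Add to main context, evicting the oldest entries to the archive when over budget."""
        self.main.append(text)
        # Always keep at least the newest entry, even if it alone exceeds the budget.
        while len(self.main) > 1 and sum(self._tokens(t) for t in self.main) > self.max_main_tokens:
            self.archive.append(self.main.popleft())

    def archival_memory_search(self, query, k=3):
        """Return up to k archived entries sharing the most words with the query."""
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), t) for t in self.archive]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [t for _, t in scored[:k]]
```

The design point is that eviction is not deletion: anything pushed out of the main context stays reachable through an explicit search call the model can choose to make.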
Titans (Google, 2025) is the most consequential neural memory paper. Instead of a database, it parameterizes memory: a learned long-term module wired into the attention path, test-time trained on the input stream using a surprise-based objective. It reports gains over Transformer baselines at multi-million-token contexts. Titans is a research result, not a product, but the pattern (test-time training, surprise gating, memory as a layer) is the active frontier.
Mem0 (2025) is the most-deployed production memory layer. At each turn an LLM extracts salient facts, updates or invalidates existing entries, and resolves contradictions, storing them in a vector DB with structured metadata and hybrid retrieval. It reports results on the LoCoMo benchmark and integrates with AutoGen, LangGraph, and CrewAI. The trade-off: it's a write policy on top of retrieval, so you're trusting the LLM to extract and update reliably.
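To make "write policy" concrete, here is a minimal sketch of the update step alone. It is not Mem0's API or data model. It assumes the LLM extraction has already produced a `(subject, attribute)` key and a value, and it resolves contradictions with the simplest possible rule: the newer statement wins and the older one is marked stale rather than deleted. The `Fact` and `MemoryStore` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    key: tuple            # (subject, attribute), e.g. ("user", "diet")
    value: str
    turn: int             # conversation turn at which the fact was written
    stale: bool = False   # True once a newer, conflicting fact supersedes it


@dataclass
class MemoryStore:
    facts: list = field(default_factory=list)

    def write(self, key, value, turn):
        """Add a fact; if a live fact with the same key disagrees, mark it stale."""
        for f in self.facts:
            if f.key == key and not f.stale:
                if f.value == value:
                    return f       # already known, nothing to write
                f.stale = True     # contradiction: newer statement wins
        new = Fact(key, value, turn)
        self.facts.append(new)
        return new

    def read(self, key):
        """Return the current (non-stale) value for a key, or None."""
        for f in self.facts:
            if f.key == key and not f.stale:
                return f.value
        return None
```

Keeping stale facts around instead of deleting them is what lets you audit a bad extraction later, which matters because the extraction step is the part most likely to fail silently.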
A-Mem applies a Zettelkasten model: every memory is a note with tags and LLM-generated links to existing notes, and retrieval blends similarity, recency, and graph distance. It suits research and knowledge agents where the value is in connections, not isolated facts.
LangMem is LangChain's integrated memory layer, separating thread-scoped short-term state, cross-thread long-term state, and a semantic knowledge view. It's the default for LangGraph users and the least opinionated option, which means you write more extraction logic yourself.
For relational and temporal memory, GraphRAG (Microsoft Research, 2024) extracts entities and relations, runs Leiden community detection, and summarizes each community. On roughly 1M-token corpora it reports a 72 to 83% comprehensiveness win rate over naive RAG, and reports that answering from root-level community summaries uses about 97% fewer tokens.
Microsoft followed with LazyGraphRAG, claiming a 700x cost reduction at parity quality. Zep, built on its open-source Graphiti engine, maintains a temporal knowledge graph, modeling that a user asked X on Monday and changed their mind on Wednesday.
| Architecture | Best for | Skip if |
|---|---|---|
| MemGPT / Letta | Long sessions, doc QA, paging context | ≤32K effective context, simple chat |
| Titans (neural) | Research, very long streams | You need to ship today |
| Mem0 | Multi-session user memory, personalization | You need explicit reasoning about storage |
| A-Mem | Research and writing agents | Simple retrieval is enough |
| LangMem | LangGraph stacks | You're not in the LangChain ecosystem |
| GraphRAG | Global sensemaking, cross-doc multi-hop | Single-fact lookups |
| Zep / Graphiti | Long-lived, evolving user state | Stateless workloads |
The real choice is rarely between these systems. It's whether to add any of them on top of retrieval plus a scratchpad. Most production agents in 2026 run Mem0 or LangMem in front of a vector DB plus a working scratchpad, and reserve GraphRAG for cases where relational multi-hop is the actual job.
Storage backends: vector, graph, and hybrid
The substrate has consolidated. Five vector databases cover almost every case, and they're more alike than vendor marketing suggests. All support HNSW indexes, metadata filtering, and some form of hybrid retrieval.
- Chroma 1.x is the developer default: in-process, SQLite-backed, the base for most LangChain tutorials.
- Weaviate 1.30+ is the production-features default: built-in BM25 plus vector hybrid search, multi-tenancy, replication.
- Qdrant 1.13+ is Rust-based with a strong latency profile and sparse vector support.
- Milvus 2.5+ is the scale-out option for billion-vector workloads with GPU acceleration.
- Pinecone Serverless v2 is the boring managed choice with auto-scaling sparse-dense hybrid search.
On the graph side, Neo4j 5.x remains the production standard, with Cypher and native vector indexes since 5.11. Memgraph is the in-memory alternative for low-latency fraud and recommendation work.
Hybrid retrieval means combining sparse (BM25, SPLADE) and dense vectors, usually via reciprocal rank fusion. Late-interaction models like ColBERTv2 and Jina-ColBERT-v2 are a third lane that captures token-level interactions and shines on zero-shot and multilingual retrieval.
The empirical rule for 2026: BM25 plus a dense embedding plus a cross-encoder rerank is still the best general-purpose stack, and adding late interaction pays off mainly when multilingual or zero-shot dominates.
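Reciprocal rank fusion is simple enough to show in full. Each document's fused score is the sum of 1 / (k + rank) over every ranking it appears in, so it needs only ranks, not comparable scores. The sketch below uses k = 60, the constant from the original RRF paper; the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of document IDs into one list.

    score(doc) = sum over rankings of 1 / (k + rank), with rank starting at 1.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["a", "b", "c"]    # sparse ranking
dense_hits = ["b", "c", "d"]   # dense ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # ['b', 'c', 'a', 'd']
```

Documents that both retrievers agree on rise to the top, which is why the fused list is a good candidate set to hand to a reranker.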
On embeddings, the MTEB leaderboard is canonical. Voyage 3.5 leads on English retrieval; OpenAI's text-embedding-3-large and Cohere's embed-v4 are strong proprietary options. The notable shift is that open-weight models, BGE-M3, Qwen3-Embedding, and Jina v3, are now within 1 to 2 points of the proprietary leaders, which is forcing real price pressure on the APIs.
Retrieval patterns: RAG, multi-hop, and reranking
Retrieval stopped being "embed the question, fetch top-5, stuff the prompt" a while ago. The 2024 RAG survey by Gao et al. gave us the taxonomy that stuck.
- Naive RAG: chunk, embed, single-vector top-k, concatenate. Cheap and often enough.
- Advanced RAG: adds pre-retrieval (query rewriting, HyDE) and post-retrieval (reranking, compression). The default production shape.
- Modular RAG: decomposes the pipeline into replaceable modules. The forerunner of agentic RAG.
For multi-hop reasoning, three techniques dominate. IRCoT interleaves chain-of-thought steps with retrieval calls and reported lifts of +21.83 R@10 and +28.2 EM averaged across HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. Self-Ask prompts the model to emit answerable follow-up questions. FLARE does forward-looking retrieval: predict the next sentence, and re-retrieve if any token's probability drops too low.
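The interleaving idea behind IRCoT can be sketched as a short loop: retrieve for the question, generate one reasoning step, retrieve again using that step as the query, and repeat until the model produces an answer. This is a simplified illustration, not the paper's implementation. `retrieve` and `next_thought` are hypothetical callables you would back with a real retriever and a real LLM call, and the "Answer:" prefix is an assumed stop convention.

```python
def interleaved_retrieval(question, retrieve, next_thought, max_steps=4):
    """Alternate reasoning steps with retrieval until an answer appears.

    retrieve(query) -> list of passages (hypothetical retriever).
    next_thought(question, evidence, thoughts) -> next reasoning sentence
    (hypothetical LLM call). A thought starting with "Answer:" ends the loop.
    """
    evidence = list(retrieve(question))
    thoughts = []
    for _ in range(max_steps):
        thought = next_thought(question, evidence, thoughts)
        thoughts.append(thought)
        if thought.lower().startswith("answer:"):
            break
        # Use the latest reasoning step as the next retrieval query.
        for passage in retrieve(thought):
            if passage not in evidence:
                evidence.append(passage)
    return thoughts, evidence
```

The step cap matters in production: without it, a model that never commits to an answer will keep retrieving until it fills the window.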
The general-purpose loop most agentic RAG systems instantiate is ReAct.
Self-correcting retrieval is the next layer. Self-RAG trains a model to emit reflection tokens that gate retrieval and critique its own output. CRAG adds a retrieval evaluator that classifies documents as correct, incorrect, or ambiguous and triggers web search on failures. Adaptive-RAG routes by query complexity, sending simple queries straight to the model and hard ones through multi-step retrieval.
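A CRAG-style gate can be approximated with two thresholds on a per-document relevance grade. The sketch below is a loose illustration of that routing, not the paper's trained evaluator. `retrieve`, `grade`, and `web_search` are hypothetical callables, and the 0.7 and 0.3 cut-offs are arbitrary placeholders you would tune.

```python
def corrective_action(scores, upper=0.7, lower=0.3):
    """Classify a retrieval result from per-document relevance scores in [0, 1]."""
    if not scores or max(scores) < lower:
        return "incorrect"   # nothing usable was retrieved
    if max(scores) >= upper:
        return "correct"     # at least one confidently relevant document
    return "ambiguous"


def gather_context(query, retrieve, grade, web_search):
    """Route between retrieved documents and a fallback search based on the verdict."""
    docs = retrieve(query)
    verdict = corrective_action([grade(query, d) for d in docs])
    if verdict == "correct":
        return verdict, docs
    if verdict == "incorrect":
        return verdict, web_search(query)
    # Ambiguous: keep what was retrieved and supplement it with the fallback.
    return verdict, docs + web_search(query)
```

The same shape covers Adaptive-RAG if you replace the document grader with a query-complexity classifier and the fallback with a multi-step pipeline.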
Reranking is your highest-leverage fix
If you change one thing this quarter, add a cross-encoder reranker. The typical lift is 10 to 40% nDCG@10 over embedding-only retrieval, with the biggest gains on hard multi-hop and out-of-distribution queries, for typically under 100ms at p99.
The mid-2026 lineup: BGE Reranker v2-m3 (568M params, 100+ languages, Apache-2.0, self-host), Cohere Rerank 3.5 (hosted, $0.001 per search) and the newer Rerank 4 Pro with 32K context, and Jina Reranker m0 (2B params, multilingual and multimodal). The decision rule is simple: not running a reranker is leaving accuracy on the table for a tiny cost.
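Structurally, a rerank stage is just "score every (query, candidate) pair jointly, then sort". The sketch below keeps the scorer pluggable so the shape is visible without downloading a model. `score_pair` is a hypothetical callable; in a real pipeline it would wrap one of the cross-encoders listed above, and `toy_overlap_score` is only a stand-in so the example runs.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Re-order candidates by a joint (query, document) relevance score.

    score_pair(query, doc) -> float is a hypothetical scorer; higher is better.
    """
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_overlap_score(query, doc):
    """Stand-in scorer: fraction of query words that appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)


candidates = ["billing faq", "how to reset your password by email", "password rules"]
print(rerank("reset password email", candidates, toy_overlap_score, top_n=2))
```

Because the scorer sees the query and the document together, it can catch relevance that independent embeddings miss, at the cost of one model call per candidate, which is why it runs on the top 50 to 100 hits rather than the whole corpus.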
For evaluation, RAGAS is the open-source standard (faithfulness, answer relevance, context precision and recall), ARES adds statistical rigor, TruLens covers production observability, and DeepEval is Pytest-native for CI/CD gates. One caveat worth taking seriously: no LLM-judged framework reliably tells a factually wrong context from a correct one.
A system can score 0.95 RAGAS faithfulness and still give the wrong business answer. Use these for focused checks, not as a substitute for ground truth.
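For the retrieval half of the pipeline, ground truth is cheap to use once you have labeled relevance judgments. The sketch below computes nDCG@k, the metric quoted for reranker lift above, from a list of graded relevance labels in the order your system returned them. It uses linear gain (some implementations use 2^rel - 1), and it assumes the ideal ordering is a permutation of the same returned list, which understates the ideal if relevant documents were never retrieved.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([3, 2, 1]))  # already in ideal order
print(ndcg_at_k([1, 3]))     # best document ranked second
```

Comparing this number before and after a pipeline change, on a fixed labeled query set, is the kind of focused check that does not depend on an LLM judge.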
Externalizing state: files, databases, and scratchpads
Not all state belongs in the model. The cleanest production designs push as much as possible outside the window.
File-based memory is the right default for static context: coding style, project conventions, persona. The conventions that emerged are AGENTS.md, a plain-text repo-root file now adopted by OpenAI Codex, GitHub Copilot, Cursor, and Cline; Cursor's .cursorrules; and Claude Code's CLAUDE.md. These are cheap, debuggable, and human-editable.
Scratchpads are in-context working memory. The pattern traces to ReAct's Thought/Action/Observation triples, extends through Reflexion's stored self-critiques, and generalizes to Tree of Thoughts. Keep the scratchpad focused and let it overflow into a structured note file when the trace gets long.
For durable state, production agents typically run three databases: Postgres or SQLite for transactional state and audit logs, Redis for fast scratch and recent-context cache, and a vector DB for retrieval. The LangGraph checkpointer ecosystem is the canonical example, with InMemorySaver for prototyping and PostgresSaver for production checkpointing with time-travel and branching.
At scale, the event-log view (every LLM call, tool call, and human step appended to an immutable log) is increasingly a hard requirement, because it's the only realistic way to debug long-running agents.
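An append-only log does not need special infrastructure to start with. The sketch below writes one JSON object per line and replays them in order. The `EventLog` class and its field names are assumptions for illustration, and immutability here is only a convention enforced by opening the file in append mode; a production system would add durable storage and access controls.

```python
import json
import time


class EventLog:
    """Minimal append-only JSONL event log for agent runs."""

    def __init__(self, path):
        self.path = path

    def append(self, kind, payload):
        """Append one event (e.g. kind='llm_call' or 'tool_call') and return it."""
        event = {"ts": time.time(), "kind": kind, "payload": payload}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        return event

    def replay(self, kind=None):
        """Yield events in the order they were written, optionally filtered by kind."""
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                if kind is None or event["kind"] == kind:
                    yield event
```

Logging every LLM call, tool call, and human step through one `append` call is what makes a failed run reproducible: you replay the log instead of guessing what the agent saw.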
| State | File | DB | Log |
|---|---|---|---|
| Static persona / rules | ✓ AGENTS.md | | |
| Working scratchpad | ✓ in-context | | |
| Cross-session user memory | | ✓ Mem0 / Letta | ✓ replay |
| Audit / debug / replay | | | ✓ immutable |
| Time-travel / branching | | ✓ LangGraph | |
The Model Context Protocol: the tool and context plane
MCP is the most consequential infrastructure development of mid-2026, and the rest of the stack increasingly assumes it.
The Model Context Protocol is an open JSON-RPC 2.0 protocol introduced by Anthropic on November 25, 2024, with launch partners Block, Apollo, Zed, Replit, Codeium, and Sourcegraph. Servers expose three primitives: resources (read-only context like files, rows, API responses), prompts (parameterized templates), and tools (model-controlled actions).
The host side adds sampling (the server can ask the host to run a completion) and roots (declared filesystem access), with elicitation added in 2025. The remote transport moved from HTTP with SSE to the newer Streamable HTTP, while stdio remains the transport for local servers.
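On the wire, all of this is ordinary JSON-RPC 2.0. The sketch below builds three request messages by hand to show the shape. The method names `tools/list`, `tools/call`, and `resources/read` come from the MCP specification; the `search_issues` tool, its arguments, and the file URI are hypothetical, and a real client would use an SDK and perform the initialization handshake first.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request IDs must be unique within a session


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Discover the tools a server exposes.
list_tools = jsonrpc_request("tools/list")

# Invoke a tool by name; "search_issues" and its arguments are made up.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "search_issues", "arguments": {"query": "flaky test"}},
)

# Read a resource by URI; the URI is illustrative.
read_resource = jsonrpc_request("resources/read", {"uri": "file:///repo/README.md"})

print(json.dumps(call_tool, indent=2))
```

Seeing the raw messages makes the security discussion easier to follow: tool names, descriptions, and resource bodies are all just strings supplied by the server.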
The common framing of MCP as "the tool protocol" sells it short. Resources make it a context plane: typed, discoverable context served on demand with the server handling filtering and pagination.
Sampling makes it a compute plane, because a server that can call the host's LLM is a small agent in its own right. The accurate description is the protocol for connecting LLMs to typed external state and capability.
Adoption has been steep. By early 2026 the Glama index listed roughly 37,000 servers, and the official MCP Registry launched in September 2025. Cross-vendor support followed fast: the OpenAI Agents SDK added MCP in March 2025, Microsoft shipped an official C# SDK in April, and GitHub's MCP server went to preview the same month.
Anthropic donated the spec to the Linux Foundation in December 2025.
Google's Agent2Agent (A2A) protocol, announced April 2025, is a sibling for agent-to-agent communication. The 2026 consensus is clean: MCP is the agent-to-tool plane, A2A is the agent-to-agent plane, and the two compose. An A2A agent can be wrapped behind an MCP server.
MCP security is a real attack surface
MCP is a JSON-RPC pipe with a "trust the servers you connect to" security model, which in an open registry is a serious problem. Documented attack classes by mid-2026 include tool poisoning (a malicious tool's description is itself a prompt that can manipulate the host), prompt injection through resource bodies, lookalike tools with clashing semantics, and supply-chain CVEs in server dependencies.
The hidden attack surface of the MCP ecosystem is now an active research area.
Defensive practice is concrete: pin server versions, scope roots narrowly, run servers in sandboxes, and treat every tool description and resource body as untrusted input. Anthropic's work on code execution with MCP points at more efficient and more controllable patterns than naively exposing dozens of tools.
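Two of those practices, pinning and root scoping, reduce to small checks you can run in the host before trusting a server. The sketch below is an assumption-laden illustration: the server name, the version string, and the paths are hypothetical, and neither function is part of any MCP SDK.

```python
from pathlib import Path

# Hypothetical allowlist: server name -> the exact version you have reviewed.
PINNED_SERVERS = {"example-server": "1.4.2"}


def is_pinned(name, version, pinned=PINNED_SERVERS):
    """True only if the server is on the allowlist at exactly the pinned version."""
    return pinned.get(name) == version


def within_roots(path, roots):
    """True if path resolves to a declared root or to something inside one.

    Resolving first normalizes '..' segments and symlinks, so a path like
    '/srv/project/../secrets' is not mistaken for being inside '/srv/project'.
    """
    target = Path(path).resolve()
    for root in roots:
        root_path = Path(root).resolve()
        if target == root_path or root_path in target.parents:
            return True
    return False
```

Checks like these do nothing about tool poisoning in descriptions, so they complement sandboxing and treating server-supplied text as untrusted rather than replacing them.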
Is "agent memory" even the right abstraction?
This is the live debate, and three serious people land in different places.
François Chollet, creator of ARC-AGI, argues the field over-indexes on memory and under-indexes on skill acquisition. Humans don't have great episodic memory; intelligence is the ability to compress experience into generalizable programs. ARC-AGI is built so no amount of retrieval solves it; you have to induce the rule.
For Chollet, the long-term bet is induction, not external stores.
Andrej Karpathy frames LLMs as Software 3.0 and as a new kind of operating system: the LLM is the CPU and the context window is its addressable working memory. His advice is to use the largest window you can afford and be ruthless about what goes in it.
The implication is that much of the memory zoo is doing a job the context window could do, and the better bet is longer, more reliable context and neural memory layers like Titans.
Simon Willison's definition is the most-cited: context engineering is providing all the context a task needs, in the exact format and quantity it needs. His critique of memory systems is practical.
Each one adds a write policy and a read policy, both of which can silently fail, and he prefers debuggable patterns like AGENTS.md and explicit function calls over heavy abstractions you can't inspect.
Anthropic's own engineering position is notably agnostic: keep the window small and high-signal using prompt caching, sub-agent architectures with clean contexts, tool-result truncation, and structured notes. Memory is one tool among many.
Long context versus RAG, settled for now
The loudest debate of 2025 to 2026 was whether million-token windows would obviate retrieval. The answer, supported by the RULER and STRING results above and the Databricks long-context RAG study, is no.
Effective context is much smaller than advertised, cost scales linearly with tokens, latency scales worse, and most of a long window gets wasted on distractors. Long context changes the layering rather than removing retrieval: use long context for global reasoning over a small high-signal set, and retrieval to pick that set from a large corpus.
They're complements.
For shipping in 2026, the pragmatist position is the right default. Short-term state is essential, external long-term memory is useful but not "intelligence," and the agent's primary cognitive act is curating context.
Use a scratchpad, a retrieval pipeline, and one memory product. The Chollet and Karpathy critiques are probably right about the long run, but the memory layer is the only game in town for products today.
A decision framework you can use Monday
Walk the tree top to bottom and stop at the first match.
Single-turn, single-doc, < 32K tokens?
-> Long context, no retrieval, no memory. Done.
Need to remember a user across sessions?
-> Add user memory (Mem0 or LangMem in front of a vector DB).
Corpus > 200K tokens with global / cross-doc questions?
-> GraphRAG (or LazyGraphRAG if cost-sensitive).
Multi-hop (>= 2 docs, chain of reasoning)?
-> IRCoT / ReAct / Self-Ask loop; add ColBERTv2 if zero-shot.
Otherwise: Advanced RAG.
query rewrite -> dense top-k -> BM25 hybrid -> cross-encoder rerank -> compression.
Add a scratchpad. Add prompt caching. Add Letta-style paging only if runs
regularly exceed effective context.
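The same tree can be written as a first-match function, which is handy if you want to encode the choice in a design doc or a test. The thresholds and branch order are taken from the tree above; the `Workload` fields and the `choose` function are hypothetical names, and the returned strings are labels, not configuration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    single_turn: bool = False
    single_doc: bool = False
    context_tokens: int = 0           # tokens needed in one call
    cross_session_user: bool = False  # must remember a user across sessions
    corpus_tokens: int = 0            # total corpus size
    global_questions: bool = False    # global / cross-document questions
    multi_hop: bool = False           # >= 2 docs, chain of reasoning
    cost_sensitive: bool = False
    zero_shot: bool = False


def choose(w):
    """Walk the tree top to bottom and return the first matching recommendation."""
    if w.single_turn and w.single_doc and w.context_tokens < 32_000:
        return "long context only"
    if w.cross_session_user:
        return "user memory (Mem0 or LangMem in front of a vector DB)"
    if w.corpus_tokens > 200_000 and w.global_questions:
        return "LazyGraphRAG" if w.cost_sensitive else "GraphRAG"
    if w.multi_hop:
        return "IRCoT / ReAct / Self-Ask loop" + (" + ColBERTv2" if w.zero_shot else "")
    return "Advanced RAG"


print(choose(Workload(multi_hop=True, zero_shot=True)))
```

Because the function stops at the first match, a workload that needs both user memory and multi-hop retrieval gets only the first answer, so treat the result as a starting point and layer the rest on deliberately.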
Patterns by use case:
- Chat assistant with user memory: long context, Mem0 or LangMem, AGENTS.md persona, prompt caching on the system prompt.
- Coding agent: long context for the current file, AGENTS.md or .cursorrules for project rules, semantic repo search, sub-agents for context hygiene, MCP servers for build/test/lint, state on the file system.
- Deep research agent: modular RAG over web/ArXiv/Wikipedia MCP, GraphRAG over the accumulated corpus, IRCoT multi-hop, cross-encoder rerank, RAGAS evaluation.
- Multi-agent system: A2A between agents, MCP for tools, per-agent dedicated context, shared memory for shared facts.
Cost, caching, and the numbers that move the bill
Input token pricing per 1M tokens, mid-2026:
DeepSeek V3 undercuts all of them at roughly $0.27 per 1M input tokens on a cache miss, which is part of the price pressure reshaping the market.
Prompt caching is the cheapest large win available. Anthropic caching costs 1.25x to write and 0.1x to read with a 5-minute TTL extendable to an hour. OpenAI applies an automatic 50% discount on cached input.
Gemini offers implicit caching at 0.25x cached input price. For any agent with a long system prompt or large tool schema, caching cuts input cost 50 to 90% on the second and later turns.
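The arithmetic behind that range is worth working through once. Using the Anthropic multipliers quoted above (1.25x to write, 0.1x to read), the sketch below computes what a cached prefix costs over a conversation relative to resending it uncached every turn. It assumes every turn lands inside the cache TTL and that only the shared prefix is cached; the function name is hypothetical.

```python
def cached_cost_ratio(turns, write_mult=1.25, read_mult=0.10):
    """Cost of a cached prefix over `turns` turns, relative to no caching.

    Turn 1 pays the cache-write price; every later turn pays the read price.
    Uncached, each turn would pay 1.0x for the same prefix.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    cached = write_mult + (turns - 1) * read_mult
    return cached / turns


for n in (1, 2, 5, 10):
    ratio = cached_cost_ratio(n)
    print(f"{n:>2} turns: prefix costs {ratio:.3f}x of uncached")
```

A single-turn call is 25% more expensive with caching on, two turns already come out ahead at 0.675x, and by ten turns the prefix costs 0.215x, a cumulative saving of about 78.5%. Each individual cache read is 90% cheaper, which is where the top of the quoted range comes from.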
Latency budgets to design against: vector search over 1M vectors runs 5 to 15ms at p50, a cross-encoder rerank over the top 100 adds 50 to 150ms, and a full Advanced RAG pipeline lands around 150 to 300ms at p50 and up to 800ms at p99. Reranking is cheap relative to its accuracy lift.
Anti-patterns to avoid
- Stuffing the full window. It degrades reasoning and burns budget.
- Premature memory architecture. Get naive RAG working before you reach for Mem0 or Letta.
- Retrieving and also stuffing the whole corpus. Worse than either alone.
- No prompt caching on long system prompts. That's 50 to 90% left on the table.
- Memory as a substitute for prompt quality. A good prompt plus good retrieval beats a fancy memory layer with sloppy prompts.
- Trusting untrusted MCP servers. Pin, scope, sandbox.
- Secrets in the context. The context is logged. Use a secret broker.
What this means for you
Build the boring version first. A working scratchpad, a tested Advanced RAG pipeline with a cross-encoder rerank, one managed memory product, MCP-served tools, and prompt caching on the system prompt will outperform most elaborate architectures, and you can debug it.
Measure your model's effective context before you design around it. Treat the headline window as a ceiling you will never safely reach.
Track the frontier monthly and adopt on evidence. Neural memory in a production model, genuinely usable 1M-token context, and stable MCP governance would each change the calculus. Until they ship and reproduce, the layered stack above is what wins.
Sources
- Effective context engineering for AI agents, Anthropic
- Lost in the Middle: How Language Models Use Long Contexts
- RULER: What's the Real Context Size of Your LLM?
- Chroma context rot study (analysis)
- Context Length Alone Hurts LLM Performance Despite Perfect Retrieval (EMNLP 2025)
- MemGPT: Towards LLMs as Operating Systems
- Titans: Learning to Memorize at Test Time
- Mem0: Building Production-Ready AI Agents
- From Local to Global: A GraphRAG Approach
- LazyGraphRAG, Microsoft Research
- Retrieval-Augmented Generation survey (Gao et al.)
- IRCoT: Interleaving Retrieval with Chain-of-Thought
- Self-RAG · CRAG · Adaptive-RAG
- RAGAS: Automated Evaluation of RAG
- What is the Model Context Protocol?
- MCP joins the Linux Foundation
- Announcing the Agent2Agent Protocol (A2A)
- MCP Ecosystem H1 2026 Retrospective
- Long Context RAG Performance of LLMs, Databricks
- The New Skill in AI is Context Engineering (Willison, via Schmid)
