The advertised context length of most long-context models is roughly 2 to 4 times larger than the length at which they actually work. That's the finding from NVIDIA's RULER benchmark, which tested 13 task types and found models falling below their short-context performance long before their windows fill up.
This gap is why the most interesting shift in AI architecture right now isn't bigger windows. It's modular context windows: systems that treat the prompt as a small, fast working surface backed by larger, slower memory tiers the agent pages in deliberately.
And the production evidence, from Shopify Sidekick to Cursor to Claude Code, says this is how AI agent reasoning actually scales.
TL;DR
- Monolithic long context degrades predictably. Lost in the Middle showed models under-attend to mid-prompt information; RULER showed real context is far smaller than advertised.
- Modular context composes five families: paged memory (MemGPT/Letta), attention-sink streaming (StreamingLLM), hybrid SSM-Transformer backbones (Jamba), retrieval-as-context (RAG), and tool-and-protocol context (MCP).
- Production agents already converged. Shopify, Cursor, Claude Code, and Devin all use tiered designs, not giant prompts.
- Monolithic still wins a narrow band: single-pass, well-bounded tasks with no distractors.
What Are Modular Context Windows?
A modular context window is a context layer where the tokens the model attends to are not one contiguous buffer, but a composition of smaller, semantically typed, independently managed modules. Instead of stuffing everything into a million-token prompt, the agent maintains a compact working set and pulls from memory tiers, retrieval indexes, and tools as needed.
Think virtual memory for language models. The MemGPT paper (Packer et al., October 2023) made the analogy explicit: a small in-context "core memory" plays the role of main memory (RAM), while a large out-of-context "archival memory" plays the role of disk, accessed through function calls the model issues itself.
The payoff is threefold. Effective context becomes unbounded without quadratic attention cost. Models reason better over a deliberately selected working set than over the same facts buried in a 200K prompt. And inference gets cheaper, since you stop paying full attention over tokens that don't matter this step.
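To make the paging idea concrete, here is a minimal sketch of a two-tier memory. It is not MemGPT's or Letta's implementation: the class name `PagedMemory`, the `core_limit` budget, and the keyword-overlap search are all assumptions chosen for illustration, and a real system would let the model issue the search as a function call against an embedding index.

```python
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class PagedMemory:
    """Toy two-tier memory: a small in-context core and an unbounded archive."""

    core_limit: int = 3  # hypothetical cap on facts kept in the prompt
    core: list[str] = field(default_factory=list)
    archive: list[str] = field(default_factory=list)

    def remember(self, fact: str) -> None:
        """Add a fact to core memory, evicting the oldest facts to the archive when full."""
        self.core.append(fact)
        while len(self.core) > self.core_limit:
            self.archive.append(self.core.pop(0))

    def archival_search(self, query: str, k: int = 2) -> list[str]:
        """Stand-in for the function call a model would issue to page facts back in.

        Ranks archived facts by naive keyword overlap with the query.
        """
        terms = set(query.lower().split())
        scored = [(len(terms & set(f.lower().split())), f) for f in self.archive]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [fact for score, fact in ranked if score > 0][:k]

    def render_prompt(self, question: str, paged_in: Sequence[str] = ()) -> str:
        """Build the text the model actually attends to on this step."""
        lines = ["[core memory]", *self.core]
        if paged_in:
            lines += ["[paged in from archive]", *paged_in]
        lines += ["[question]", question]
        return "\n".join(lines)


mem = PagedMemory(core_limit=2)
for fact in ["user prefers metric units", "user lives in Lisbon", "project deadline is Friday"]:
    mem.remember(fact)

question = "which units does the user prefer"
print(mem.render_prompt(question, mem.archival_search(question)))
```

The prompt stays bounded by `core_limit` plus whatever is paged in for the current step, while the archive can grow without touching attention cost.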
Why Context Length Limitations Broke the Monolithic Approach
Three empirical results ended the "just make the window bigger" era. Attention quality degrades non-uniformly, realistic distractors cause collapse, and cost grows roughly linearly with tokens processed per step.
The degradation result came first. Liu et al.'s Lost in the Middle showed that even GPT-4-class models perform 20 to 40 percentage points worse on information placed in the middle of long prompts than at the edges (a figure the research record supports directionally, though the exact magnitude varies by setup).
The distractor result is more damning. Summary of a Haystack (EMNLP 2024) found that Gemini 1.5 Pro's million-token context collapses to near-zero accuracy on multi-document summarization once more than roughly 30% of the context is noise. Real workloads are mostly noise.
And the pattern held into 2025. Retrieval Quality at Context Limit (November 2025) reports that even top-tier models lose 15 to 30 percentage points of recall on the middle third of a 200K-token prompt versus a 16K prompt of equivalent density. The paper's recommendation is blunt: for retrieval-grounded tasks, modular context outperforms monolithic long context at lower cost.
BABILong (NeurIPS 2024) closed the loop on reasoning. Testing facts distributed across up to 10M tokens, it found RAG-augmented systems match or beat pure long-context systems at the same effective budget, with the gap widest when distractors are present.
Monolithic long context does not win in production. A deliberately tiered, modular context does. That's the repeated lesson across every major agent system shipped in 2025 and 2026.
The Five Families of Modular Context
Every modular context implementation in the 2025 literature falls into five families, and production stacks compose several at once. None of them is exclusive; the best-evidenced agents layer a paged memory tier over a retrieval tier over an efficient backbone.
| Family | Canonical system | Mechanism | Best evidence |
|---|---|---|---|
| Paged / hierarchical memory | MemGPT / Letta | Core memory in-context, archival memory paged via function calls | Beats 200K monolithic baseline on document-grounded conversation at a fraction of token cost |
| Attention-sink streaming | StreamingLLM | Pin first few "sink" tokens, slide a window over the rest | 4M+ token streams at 22.2x the throughput of re-prefilling (per the paper) |
| Hybrid SSM-Transformer | AI21 Jamba 1.5 | Mamba layers for cheap long-range recurrence, sparsely interleaved attention layers for recall, 1:7 attention-to-Mamba ratio (one attention layer in every eight) | 256K context on a single 80GB GPU, per NVIDIA's writeup (vendor-stated) |
| Retrieval-as-context | Agentic RAG | Agent decides when and what to retrieve, per NVIDIA's framing | Stays roughly flat under distractors where long context collapses |
| Tool-and-protocol context | Model Context Protocol (MCP) | Model ingests typed tool schemas, not raw text | 14,000+ servers and 97M SDK downloads reported by late 2025; now under Linux Foundation governance |
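The attention-sink row in the table can be illustrated with a small sketch of the cache eviction rule: keep the first few positions permanently and slide a window over the most recent ones. The function name and the default sizes are assumptions for illustration. This models only which positions stay cached; it does not model the positional re-indexing inside the cache that the StreamingLLM paper also describes.

```python
def positions_to_keep(seq_len: int, n_sink: int = 4, window: int = 8) -> list[int]:
    """Return the token positions whose KV entries remain cached.

    The first `n_sink` positions are pinned as attention sinks; the last
    `window` positions form the sliding window. Everything between is evicted.
    """
    if seq_len <= n_sink + window:
        # Nothing to evict yet: the whole sequence fits in the cache.
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


# Cache size stays constant no matter how long the stream gets.
for length in (6, 20, 1_000_000):
    kept = positions_to_keep(length, n_sink=2, window=3)
    print(length, len(kept), kept[:2], kept[-3:])
```

The design point is that memory per step is bounded by `n_sink + window`, which is what lets a stream run far past the trained context length without re-prefilling.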
Letta's typed memory blocks deserve a specific callout. The framework (the rebranded MemGPT project) gives agents labeled blocks like persona, human, and facts, each with per-block read/write permissions. The Letta team's memory blocks post is the most-cited articulation of the abstraction, and it's where most context engineering practice is heading: the model and framework jointly decide which block gets edited when.
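A typed-block schema of this kind might look like the sketch below. This is not Letta's API: `MemoryBlock`, `BlockStore`, the `read_only` flag, and the character `limit` are hypothetical names used to show the shape of the abstraction, namely labeled blocks, per-block write permission, and a size cap that keeps the in-context portion small.

```python
from dataclasses import dataclass


class BlockPermissionError(Exception):
    """Raised when an agent tries to edit a block it may only read."""


@dataclass
class MemoryBlock:
    label: str
    value: str
    limit: int = 200         # hypothetical cap, in characters
    read_only: bool = False  # hypothetical per-block write permission


class BlockStore:
    def __init__(self, blocks: list[MemoryBlock]) -> None:
        self._blocks = {b.label: b for b in blocks}

    def get(self, label: str) -> str:
        return self._blocks[label].value

    def append(self, label: str, text: str) -> None:
        """Append text to a block, enforcing its permission and size limit."""
        block = self._blocks[label]
        if block.read_only:
            raise BlockPermissionError(f"block '{label}' is read-only")
        new_value = (block.value + "\n" + text).strip()
        if len(new_value) > block.limit:
            raise ValueError(f"block '{label}' would exceed {block.limit} characters")
        block.value = new_value

    def render(self) -> str:
        """Serialize all blocks into the section of the prompt the model sees."""
        return "\n".join(f"<{b.label}>\n{b.value}\n</{b.label}>" for b in self._blocks.values())


store = BlockStore([
    MemoryBlock("persona", "You are a concise assistant.", read_only=True),
    MemoryBlock("human", "Name: Ada."),
    MemoryBlock("facts", ""),
])
store.append("human", "Prefers email over phone.")
print(store.render())
```

Keeping the schema this plain is deliberate: a block is just a label, a value, and a policy, which makes it easy to move between frameworks.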
One caution flag belongs here. Magic.dev claimed a 100M-token context window in 2024 via long-term memory modules. As of mid-2026, no independent benchmark, peer review, or productized access at that scale has materialized. Treat vendor-sourced context-length claims accordingly.
What Production Agents Actually Do
The best-documented production systems all converged on the same tiered pattern, independently: a small in-context working set, a retrieval tier over the relevant corpus, a persistent memory tier for cross-session facts, and a tool layer.
Shopify's Sidekick is the clearest case study. The team's engineering writeup and ICML 2025 Expo talk describe typed tool calls, per-merchant paged memory, agentic RAG over catalogs and policies, and layered guardrails.
Their key operational finding: simply expanding context degraded both latency and reasoning quality. The right design was a deliberately composed working set per turn.
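One way to read "a deliberately composed working set per turn" is as a budgeting problem: candidates arrive from each tier with a relevance score, and only the best ones that fit a fixed token budget reach the prompt. The sketch below is an assumption-laden illustration, not Shopify's design. The `Candidate` type, the greedy selection, and the word-count token estimate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    tier: str         # e.g. "memory", "retrieval", "tool"
    text: str
    relevance: float  # assumed to come from a retriever or reranker, 0..1


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compose_working_set(candidates: list[Candidate], budget_tokens: int) -> list[Candidate]:
    """Greedily pick the most relevant candidates that fit within the budget."""
    chosen: list[Candidate] = []
    used = 0
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(cand.text)
        if used + cost <= budget_tokens:
            chosen.append(cand)
            used += cost
    return chosen


candidates = [
    Candidate("tool", "inventory snapshot lists four hundred SKUs across three warehouses", 0.3),
    Candidate("memory", "merchant ships only within Canada", 0.9),
    Candidate("retrieval", "refund policy allows returns within thirty days of delivery", 0.8),
]
for cand in compose_working_set(candidates, budget_tokens=15):
    print(cand.tier, "|", cand.text)
```

Greedy selection is the simplest policy that enforces a budget; per-tier quotas or a learned selector are natural refinements.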
Coding agents tell the same story. Cursor pairs a small active-edit budget with retrieval over a per-project index; Dropbox has reportedly used it against 550,000-file codebases. Claude Code maintains persistent per-project memory and supports multi-day refactors, with 30+ hour sessions reported in production use. Devin (Cognition) keeps long-lived project memory and has shipped 659 pull requests end-to-end against a SWE-bench-style harness, per Cognition's published evaluations.
None of these systems bet on a giant monolithic prompt. All of them bet on tiers.
When Should You Still Use a Monolithic Context Window?
Use monolithic long context when the task is single-pass, well-bounded, and free of distractors. Use modular context for everything multi-session, retrieval-grounded, or long-running. The benchmark evidence supports this split cleanly, and pretending either side wins everywhere is marketing.
| Workload | Recommended architecture |
|---|---|
| Single-pass summary of one bounded document | Monolithic long context (Claude, Gemini 2.5 Pro, GPT-4.1) |
| Multi-session assistant with persistent user facts | Letta/MemGPT-style paged memory |
| Q&A over a large, changing corpus | RAG + reranker + 32K to 128K backbone |
| Long-running autonomous agent (coding, ops) | Paged memory + repo RAG + tool layer |
| Multi-document research with adversarial noise | Full modular stack; monolithic collapses here |
| Cost-sensitive simple queries | Smallest viable context, no memory layer |
The last row matters more than it looks. Shopify's implicit message ("we didn't need long context") generalizes: adding a memory layer you don't need is pure cost. For a bounded customer-support query, a plain 32K model with good retrieval beats a five-tier stack on simplicity, latency, and on-call burden.
The Trade-offs Nobody Markets
Modularity is not free, and the failure modes are well documented. Several recur across the 2025 literature and production postmortems.
Coherence is the big one. A paged tier and a RAG tier don't automatically give the model a unified world view, and teams consistently report that prompt and memory-schema design dominate outcomes. A badly designed memory schema performs worse than a plain 200K prompt.
Orchestration latency is real: every module hop is a round-trip, and naive paging can be slower end-to-end than just putting the data in the prompt. Retrieval errors propagate, which is why rerankers became standard in 2025 stacks.
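The reason rerankers help can be shown with a two-stage sketch: a cheap first stage optimizes for recall and may rank a long, noisy document first, and a second stage re-scores only the shortlisted candidates. Both scoring functions here are toy stand-ins (term overlap and a precision-like coverage score), chosen as assumptions so the example runs without any model; a production reranker would typically be a learned cross-encoder.

```python
from typing import Callable


def tokenize(text: str) -> set[str]:
    return set(text.lower().split())


def first_stage(query: str, corpus: list[str], k: int) -> list[str]:
    """Cheap recall-oriented retrieval: rank by raw count of shared terms."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def coverage_score(query: str, doc: str) -> float:
    """Toy reranker: fraction of the document's terms that appear in the query."""
    q, d = tokenize(query), tokenize(doc)
    return len(q & d) / len(d) if d else 0.0


def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    k_candidates: int = 3,
    k_final: int = 1,
    scorer: Callable[[str, str], float] = coverage_score,
) -> list[str]:
    """Shortlist with the cheap stage, then reorder the shortlist with the scorer."""
    candidates = first_stage(query, corpus, k_candidates)
    return sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)[:k_final]


corpus = [
    "return policy for damaged items covers shipping labels refunds exchanges and store credit in all regions",
    "damaged items return policy",
    "holiday shipping schedule",
]
query = "return policy damaged items"
print("first stage only:", first_stage(query, corpus, 1))
print("with reranking:  ", retrieve_then_rerank(query, corpus, k_candidates=2, k_final=1))
```

The reranker adds a hop, which is the latency trade-off described above, but it only runs over the shortlist rather than the whole corpus.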
Cross-module reasoning failures (facts split between core memory, archival memory, and a tool result) are exactly the failure mode BABILong exposes.
And evaluation gets harder. The Memory in the LLM Era survey and the incremental multi-turn memory evaluation paper both argue the field still lacks standardized benchmarks for modular agent memory. You're shipping an architecture the industry doesn't yet know how to grade.
What This Means for You
If you're building agents in 2026, four moves follow from the evidence.
First, budget context like memory, not like a landfill. Measure your model's real context size with RULER-style probes rather than trusting the spec sheet, and keep the per-turn working set deliberately small.
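A probe in this spirit can be very small. The sketch below is not the RULER benchmark, which covers 13 task types; it is a single needle-retrieval probe under stated assumptions. `ask_model` is a hypothetical callable wrapping whatever model API you use, the filler text and needle are invented, and the 0.9 pass threshold is arbitrary. The `truncating_model` stub exists only so the example runs offline.

```python
from typing import Callable, Sequence

FILLER = "The sky was grey and the meeting ran long."
NEEDLE = "The access code is 7421."
QUESTION = "What is the access code?"


def build_prompt(n_filler: int, depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    sentences = [FILLER] * n_filler
    sentences.insert(int(depth * n_filler), NEEDLE)
    return " ".join(sentences) + "\n\n" + QUESTION


def effective_length(
    ask_model: Callable[[str], str],
    lengths: Sequence[int],
    depths: Sequence[float] = (0.0, 0.5, 1.0),
    threshold: float = 0.9,
) -> int:
    """Return the largest filler count at which accuracy across depths meets the threshold."""
    best = 0
    for n in sorted(lengths):
        hits = sum("7421" in ask_model(build_prompt(n, d)) for d in depths)
        if hits / len(depths) < threshold:
            break
        best = n
    return best


def truncating_model(prompt: str) -> str:
    """Offline stub: pretends the model only 'sees' the last 200 words."""
    visible = prompt.split()[-200:]
    return "7421" if any("7421" in word for word in visible) else "unknown"


print(effective_length(truncating_model, lengths=[10, 20, 50]))
```

Swapping the stub for a real model call and sweeping `lengths` toward the advertised window gives a rough, workload-agnostic estimate; probes built from your own documents and questions will be more informative.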
Second, adopt typed memory early. Letta-style memory blocks (or your own equivalent schema) cost little to add and are where OpenAI's context engineering cookbook and Anthropic's MCP work are both converging. The protocol layer is standardizing; the memory-block layer isn't yet, so keep yours portable.
Third, make retrieval quality your first metric. Independent 2025 enterprise work found retrieval recall and reranking quality determine agent task success more than the choice of LLM does.
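Retrieval recall is cheap to track once you have a small labeled set of queries with known relevant documents. The sketch below computes recall@k for one query and averages it over a set; the document IDs and the labeled data are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical labeled evaluation set: ranked results and ground-truth relevant IDs.
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d5", "d9", "d4", "d8"], {"d5"}),
]
print(mean_recall_at_k(runs, k=2))
```

Tracking this number before and after adding a reranker, or changing chunking, tells you whether the retrieval tier improved independently of the LLM.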
Fourth, don't over-build. Gartner forecasts 40% of enterprise applications will integrate task-specific agents by 2026, up from under 5% in 2025, while IDC predicts over 40% of agent deployments will be cancelled or scaled back by 2027 as engineering costs bite. The teams that survive that shakeout will be the ones that matched architecture to workload instead of stacking tiers for their own sake.
The context window stopped being a number on a spec sheet. It's now an architecture decision, and the modular side of that decision is where AI agent reasoning gets built for the next several years.
Demis Hassabis has named long-term memory and continual learning as the two main bottlenecks between current systems and AGI. Both live exactly here.
Sources
- RULER: What's the Real Context Size of Your Long-Context LLMs?, NVIDIA's benchmark showing advertised context is 2 to 4x the real usable length
- Lost in the Middle: How Language Models Use Long Contexts, foundational positional-degradation result
- Summary of a Haystack, EMNLP 2024 paper on long-context collapse under distractors
- BABILong: Testing the Limits of LLMs with Long Contexts, reasoning over facts spread across up to 10M tokens
- Retrieval Quality at Context Limit, 2025 re-evaluation of position bias at 200K tokens
- MemGPT: Towards LLMs as Operating Systems, the paged virtual context paper
- Memory Blocks: The Key to Agentic Context Management, Letta's typed memory-block abstraction
- Efficient Streaming Language Models with Attention Sinks, StreamingLLM, ICLR 2024
- The Jamba 1.5 Open Model Family, AI21's hybrid SSM-Transformer release (vendor-stated benchmarks)
- Jamba 1.5 on NVIDIA Developer Blog, 256K context on a single 80GB GPU
- Building Production-Ready Agentic Systems, Shopify Sidekick's architecture
- Traditional RAG vs. Agentic RAG, NVIDIA on retrieval as a callable tool
- 100M Token Context Windows, Magic.dev's unverified claim
- Gartner: 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, adoption forecast
- IDC FutureScape 2026, agentic AI deployment and pullback predictions
- Context Engineering for Personalization, OpenAI's cookbook on long-term state management
