What is a modular context window?

A modular context window is an architecture where the tokens a model attends to are not one contiguous buffer but a composition of smaller, typed, independently managed modules: a small always-visible working set, paged memory tiers, retrieval results, and tool schemas. The agent reads from and writes to these tiers deliberately instead of carrying everything in one prompt.

Do modular context windows beat long-context models?

On retrieval-grounded and multi-session agent tasks, yes. BABILong shows RAG-augmented systems match or beat pure long-context models at the same budget, and Summary of a Haystack shows monolithic context collapses under distractors. Monolithic context still wins on single-pass, well-bounded tasks like summarizing one long document with no noise.

Is modular context just RAG with extra steps?

Partly, and skeptics make exactly that argument. The difference is composition: production stacks like Shopify Sidekick combine paged memory (MemGPT/Letta-style), agentic RAG, and typed tool calls via protocols like MCP. Each piece exists in plain RAG systems; the deliberate tiering and write-back memory are what's new.

What are the main downsides of modular context architectures?

Engineering complexity, orchestration latency, and a harder evaluation surface. Each module hop is a round-trip, retrieval errors propagate, and a badly designed memory schema performs worse than a plain 200K monolithic prompt. Shopify's team explicitly calls out the staffing and on-call cost.

Which tools implement modular context windows today?

Letta (formerly MemGPT) for paged memory blocks, AI21's Jamba 1.5 for a hybrid SSM-Transformer backbone, StreamingLLM for attention-sink streaming in serving stacks like vLLM, standard RAG infrastructure for retrieval-as-context, and Anthropic's Model Context Protocol for typed tool-and-protocol context.

Modular Context Windows: The Future of AI Agent Reasoning

The advertised context length of most long-context models is roughly 2 to 4 times larger than the length at which they actually work. That's the finding from NVIDIA's RULER benchmark, which tested 13 task types and found models falling below their short-context performance long before their windows fill up.

This gap is why the most interesting shift in AI architecture right now isn't bigger windows. It's modular context windows: systems that treat the prompt as a small, fast working surface backed by larger, slower memory tiers the agent pages in deliberately.

And the production evidence, from Shopify Sidekick to Cursor to Claude Code, says this is how AI agent reasoning actually scales.

TL;DR

Monolithic long context degrades predictably. Lost in the Middle showed models under-attend to mid-prompt information; RULER showed real context is far smaller than advertised.
Modular context composes five families: paged memory (MemGPT/Letta), attention-sink streaming (StreamingLLM), hybrid SSM-Transformer backbones (Jamba), retrieval-as-context (RAG), and tool-and-protocol context (MCP).
Production agents already converged. Shopify, Cursor, Claude Code, and Devin all use tiered designs, not giant prompts.
Monolithic still wins a narrow band: single-pass, well-bounded tasks with no distractors.

What Are Modular Context Windows?

A modular context window is a context layer where the tokens the model attends to are not one contiguous buffer, but a composition of smaller, semantically typed, independently managed modules. Instead of stuffing everything into a million-token prompt, the agent maintains a compact working set and pulls from memory tiers, retrieval indexes, and tools as needed.

Think virtual memory for language models. The MemGPT paper (Packer et al., October 2023) made the analogy explicit: a small in-context "core memory" plays the role of main memory (RAM), while a large out-of-context "archival memory" plays the role of disk, accessed through function calls the model issues itself.
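The paging idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MemGPT or Letta API: the class name PagedMemory and its methods are hypothetical, eviction is simple oldest-first, and keyword overlap stands in for a real embedding search.

```python
from dataclasses import dataclass, field


@dataclass
class PagedMemory:
    """Toy two-tier memory: a bounded in-context core plus an unbounded archive."""
    core_limit: int = 3                               # max facts kept in the prompt (assumed)
    core: list[str] = field(default_factory=list)     # rendered into every prompt
    archive: list[str] = field(default_factory=list)  # out of context, searched on demand

    def remember(self, fact: str) -> None:
        """Add a fact to core; move the oldest core facts to the archive when full."""
        self.core.append(fact)
        while len(self.core) > self.core_limit:
            self.archive.append(self.core.pop(0))

    def archival_search(self, query: str, k: int = 1) -> list[str]:
        """Return up to k archived facts that share the most words with the query."""
        q = set(query.lower().split())
        scored = [(len(q & set(f.lower().split())), f) for f in self.archive]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [f for _, f in scored[:k]]

    def render_prompt(self) -> str:
        """Only core memory reaches the model's prompt."""
        return "\n".join(f"- {f}" for f in self.core)
```

In a real agent, archival_search would be exposed to the model as a callable function, so the model itself decides when to page something back in.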

The payoff is threefold. Effective context becomes unbounded without quadratic attention cost. Models reason better over a deliberately selected working set than over the same facts buried in a 200K prompt. And inference gets cheaper, since you stop paying full attention over tokens that don't matter this step.

Why Context Length Limitations Broke the Monolithic Approach

Three empirical results ended the "just make the window bigger" era. Attention quality degrades non-uniformly, realistic distractors cause collapse, and cost grows roughly linearly with tokens processed per step.

The degradation result came first. Liu et al.'s Lost in the Middle showed that even GPT-4-class models perform 20 to 40 percentage points worse on information placed in the middle of long prompts than at the edges (a figure the research record supports directionally, though the exact magnitude varies by setup).

The distractor result is more damning. Summary of a Haystack (EMNLP 2024) found that Gemini 1.5 Pro's million-token context collapses to near-zero accuracy on multi-document summarization once more than roughly 30% of the context is noise. Real workloads are mostly noise.

And the pattern held into 2025. Retrieval Quality at Context Limit (November 2025) reports that even top-tier models lose 15 to 30 percentage points of recall on the middle third of a 200K-token prompt versus a 16K prompt of equivalent density. The paper's recommendation is blunt: for retrieval-grounded tasks, modular context outperforms monolithic long context at lower cost.

BABILong (NeurIPS 2024) closed the loop on reasoning. Testing facts distributed across up to 10M tokens, it found RAG-augmented systems match or beat pure long-context systems at the same effective budget, with the gap widest when distractors are present.

Monolithic long context does not win in production. A deliberately tiered, modular context does. That's the repeated lesson across every major agent system shipped in 2025 and 2026.

The Five Families of Modular Context

Every modular context implementation in the 2025 literature falls into one of five families, and production stacks compose several at once. The families are not mutually exclusive; the best-evidenced agents layer a paged memory tier over a retrieval tier over an efficient backbone.

Family	Canonical system	Mechanism	Best evidence
Paged / hierarchical memory	MemGPT / Letta	Core memory in-context, archival memory paged via function calls	Beats 200K monolithic baseline on document-grounded conversation at a fraction of token cost
Attention-sink streaming	StreamingLLM	Pin first few "sink" tokens, slide a window over the rest	4M+ token streams, with up to 22.2x speedup over the sliding-window recomputation baseline (per the paper)
Hybrid SSM-Transformer	AI21 Jamba 1.5	Mamba layers for cheap long-range recurrence, interleaved attention layers for recall, 1:7 attention-to-Mamba layer ratio (one attention layer in every eight)	256K context on a single 80GB GPU, per NVIDIA's writeup (vendor-stated)
Retrieval-as-context	Agentic RAG	Agent decides when and what to retrieve, per NVIDIA's framing	Stays roughly flat under distractors where long context collapses
Tool-and-protocol context	Model Context Protocol (MCP)	Model ingests typed tool schemas, not raw text	14,000+ servers and 97M SDK downloads reported by late 2025; now under Linux Foundation governance
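The attention-sink row in the table above can be illustrated with a cache-eviction sketch. It is a simplification under stated assumptions: integer token positions stand in for key/value tensors, the class name SinkCache and the n_sink and window parameters are illustrative, and the position re-indexing that StreamingLLM performs inside the cache is omitted.

```python
from collections import deque


class SinkCache:
    """Keeps the first n_sink entries forever, plus a sliding window of recent ones."""

    def __init__(self, n_sink: int = 4, window: int = 8) -> None:
        self.n_sink = n_sink
        self.sinks: list[int] = []
        # A bounded deque drops its oldest entry automatically when full.
        self.recent: deque = deque(maxlen=window)

    def append(self, token_pos: int) -> None:
        """Route the earliest tokens to the pinned sinks, the rest to the window."""
        if len(self.sinks) < self.n_sink:
            self.sinks.append(token_pos)
        else:
            self.recent.append(token_pos)

    def visible(self) -> list[int]:
        """Positions the next token is allowed to attend to."""
        return self.sinks + list(self.recent)
```

The cache never grows past n_sink + window entries, which is why per-token cost stays flat no matter how long the stream runs.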

Letta's typed memory blocks deserve a specific callout. The framework (the rebranded MemGPT project) gives agents labeled blocks like persona, human, and facts, each with per-block read/write permissions. The Letta team's memory blocks post is the most-cited articulation of the abstraction, and it's where most context engineering practice is heading: the model and framework jointly decide which block gets edited when.
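A minimal sketch of the typed-block idea follows. It is not Letta's actual API: the MemoryBlock class, its field names, the character-based limit, and the tag-style rendering are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryBlock:
    label: str               # e.g. "persona", "human", "facts"
    value: str
    limit: int = 200         # max characters per block (assumed budget unit)
    read_only: bool = False  # per-block write permission

    def write(self, new_value: str) -> None:
        """Replace the block's contents, enforcing permission and size."""
        if self.read_only:
            raise PermissionError(f"block '{self.label}' is read-only")
        if len(new_value) > self.limit:
            raise ValueError(f"block '{self.label}' exceeds {self.limit} chars")
        self.value = new_value


def render(blocks: list[MemoryBlock]) -> str:
    """Serialize blocks into a labeled section of the prompt."""
    return "\n".join(f"<{b.label}>\n{b.value}\n</{b.label}>" for b in blocks)
```

Keeping the permission check inside write means a model-issued edit to a protected block fails loudly instead of silently rewriting the agent's persona.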

One caution flag belongs here. Magic.dev claimed a 100M-token context window in 2024 via long-term memory modules. As of mid-2026, no independent benchmark, peer review, or productized access at that scale has materialized. Treat vendor-sourced context-length claims accordingly.

What Production Agents Actually Do

The most documented production systems all converged on the same tiered pattern, independently. A small in-context working set, a retrieval tier over the relevant corpus, a persistent memory tier for cross-session facts, and a tool layer.

Shopify's Sidekick is the clearest case study. The team's engineering writeup and ICML 2025 Expo talk describe typed tool calls, per-merchant paged memory, agentic RAG over catalogs and policies, and layered guardrails.
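To make "typed tool calls" concrete, here is a hedged sketch of a tool definition and an argument check. The name/description/inputSchema shape follows how MCP describes tools, but the tool itself (get_order_status) is hypothetical, and validate_args covers only a tiny subset of JSON Schema; a production system would use a full validator.

```python
# Hypothetical tool definition in the name/description/inputSchema shape.
TOOL = {
    "name": "get_order_status",
    "description": "Look up the fulfillment status of one order.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "include_items": {"type": "boolean"},
        },
        "required": ["order_id"],
    },
}

# Maps the few JSON Schema types handled here to Python types.
PY_TYPES = {"string": str, "boolean": bool, "integer": int, "number": (int, float)}


def validate_args(tool: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-typed."""
    schema = tool["inputSchema"]
    problems = [
        f"missing required field: {key}"
        for key in schema.get("required", [])
        if key not in args
    ]
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unknown field: {key}")
        elif not isinstance(value, PY_TYPES[spec["type"]]):
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems
```

Rejecting a malformed call before it executes is the practical benefit of typed schemas over free-text tool descriptions.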

Their key operational finding: simply expanding context degraded both latency and reasoning quality. The right design was a deliberately composed working set per turn.
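One way to picture a "deliberately composed working set per turn" is a budgeted selection step. The sketch below is an assumption-laden illustration, not Shopify's implementation: the Snippet type, the priority scores, the greedy strategy, and the word-count token estimate are all hypothetical.

```python
from typing import NamedTuple


class Snippet(NamedTuple):
    source: str      # which tier produced it: "memory", "retrieval", or "tool"
    priority: float  # higher means more relevant to this turn
    text: str


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compose_working_set(snippets: list[Snippet], budget: int) -> list[Snippet]:
    """Greedily keep the highest-priority snippets that fit in the token budget."""
    chosen: list[Snippet] = []
    used = 0
    for s in sorted(snippets, key=lambda s: s.priority, reverse=True):
        cost = estimate_tokens(s.text)
        if used + cost <= budget:
            chosen.append(s)
            used += cost
    return chosen
```

With a budget of 8 estimated tokens, a 4-token memory fact and a 4-token tool result are kept while a lower-priority 9-token policy excerpt is left out of the prompt for that turn.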

Coding agents tell the same story. Cursor pairs a small active-edit budget with retrieval over a per-project index; Dropbox has reportedly used it against 550,000-file codebases. Claude Code maintains persistent per-project memory and supports multi-day refactors, with 30+ hour sessions reported in production use. Devin (Cognition) keeps long-lived project memory and has shipped 659 pull requests end-to-end against a SWE-bench-style harness, per Cognition's published evaluations.

None of these systems bet on a giant monolithic prompt. All of them bet on tiers.

When Should You Still Use a Monolithic Context Window?

Use monolithic long context when the task is single-pass, well-bounded, and free of distractors. Use modular context for everything multi-session, retrieval-grounded, or long-running. The benchmark evidence supports this split cleanly, and pretending either side wins everywhere is marketing.

Workload	Recommended architecture
Single-pass summary of one bounded document	Monolithic long context (Claude, Gemini 2.5 Pro, GPT-4.1)
Multi-session assistant with persistent user facts	Letta/MemGPT-style paged memory
Q&A over a large, changing corpus	RAG + reranker + 32K to 128K backbone
Long-running autonomous agent (coding, ops)	Paged memory + repo RAG + tool layer
Multi-document research with adversarial noise	Full modular stack; monolithic collapses here
Cost-sensitive simple queries	Smallest viable context, no memory layer

The last row matters more than it looks. Shopify's implicit message ("we didn't need long context") generalizes: adding a memory layer you don't need is pure cost. For a bounded customer-support query, a plain 32K model with good retrieval beats a five-tier stack on simplicity, latency, and on-call burden.
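The decision table above can be read as a simple routing rule. The function below is one possible encoding, offered as a sketch: the four boolean flags and the order of the checks are a simplification introduced here, not something the benchmarks or Shopify's writeup prescribe.

```python
def recommend_architecture(
    multi_session: bool,
    large_corpus: bool,
    long_running: bool,
    noisy_inputs: bool,
) -> str:
    """Map coarse workload traits to a starting architecture (heuristic only)."""
    if noisy_inputs and large_corpus:
        return "full modular stack"
    if long_running:
        return "paged memory + repo RAG + tool layer"
    if multi_session:
        return "paged memory"
    if large_corpus:
        return "RAG + reranker"
    # Single-pass, bounded, low-noise work: no memory layer needed.
    return "monolithic or smallest viable context"
```

The default branch is the important one: unless a workload trait demands a tier, the function returns the simplest option.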

The Trade-offs Nobody Markets

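Modularity is not free, and the failure modes are well documented. Five recur across the 2025 literature and production postmortems: incoherence across tiers, orchestration latency, retrieval-error propagation, cross-module reasoning failures, and harder evaluation.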

Coherence is the big one. A paged tier and a RAG tier don't automatically give the model a unified world view, and teams consistently report that prompt and memory-schema design dominate outcomes. A badly designed memory schema performs worse than a plain 200K prompt.

Orchestration latency is real: every module hop is a round-trip, and naive paging can be slower end-to-end than just putting the data in the prompt. Retrieval errors propagate, which is why rerankers became standard in 2025 stacks.
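The retrieve-then-rerank pattern can be sketched with two small functions. This is a toy under stated assumptions: word overlap stands in for a vector search, Jaccard similarity stands in for a learned reranker, and the function names are illustrative.

```python
def words(text: str) -> set[str]:
    """Lowercased bag of unique words."""
    return set(text.lower().split())


def retrieve(query: str, corpus: list[str], k: int) -> list[str]:
    """Stage 1: cheap and recall-oriented. Rank by raw count of shared words."""
    q = words(query)
    ranked = sorted(corpus, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]


def rerank(query: str, docs: list[str], top_n: int) -> list[str]:
    """Stage 2: precision-oriented. Jaccard similarity penalizes long, noisy docs."""
    q = words(query)

    def jaccard(doc: str) -> float:
        d = words(doc)
        return len(q & d) / len(q | d)

    return sorted(docs, key=jaccard, reverse=True)[:top_n]
```

The first stage tends to favor long documents that happen to contain many query words; the second stage reorders the shortlist so that a focused document can overtake a noisy one before anything reaches the prompt.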

Cross-module reasoning failures (facts split between core memory, archival memory, and a tool result) are exactly the failure mode BABILong exposes.

And evaluation gets harder. The Memory in the LLM Era survey and the incremental multi-turn memory evaluation paper both argue the field still lacks standardized benchmarks for modular agent memory. You're shipping an architecture the industry doesn't yet know how to grade.

What This Means for You

If you're building agents in 2026, four moves follow from the evidence.

First, budget context like memory, not like a landfill. Measure your model's real context size with RULER-style probes rather than trusting the spec sheet, and keep the per-turn working set deliberately small.
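A RULER-style probe can be approximated with a needle-in-a-haystack harness. The sketch below builds probes and summarizes results; it does not call any model, and the function names, the word-based length unit, and the pass threshold are assumptions rather than RULER's own code.

```python
def build_probe(needle: str, filler: str, total_words: int, depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    filler_words = filler.split()
    haystack = [filler_words[i % len(filler_words)] for i in range(total_words)]
    insert_at = int(depth * total_words)
    return " ".join(haystack[:insert_at] + [needle] + haystack[insert_at:])


def effective_length(results: dict[int, float], threshold: float) -> int:
    """Largest tested length where accuracy, at it and every shorter length, meets the threshold."""
    best = 0
    for length in sorted(results):
        if results[length] < threshold:
            break
        best = length
    return best
```

To use it, generate probes at several lengths and depths, score your model's answers yourself, and pass the per-length accuracy to effective_length. The number it returns is the context budget to plan around, rather than the spec-sheet figure.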

Second, adopt typed memory early. Letta-style memory blocks (or your own equivalent schema) cost little to add and are where OpenAI's context engineering cookbook and Anthropic's MCP work are both converging. The protocol layer is standardizing; the memory-block layer isn't yet, so keep yours portable.

Third, make retrieval quality your first metric. Independent 2025 enterprise work found retrieval recall and reranking quality determine agent task success more than the choice of LLM does.

Fourth, don't over-build. Gartner forecasts 40% of enterprise applications will integrate task-specific agents by 2026, up from under 5% in 2025, while IDC predicts over 40% of agent deployments will be cancelled or scaled back by 2027 as engineering costs bite. The teams that survive that shakeout will be the ones that matched architecture to workload instead of stacking tiers for their own sake.

The context window stopped being a number on a spec sheet. It's now an architecture decision, and the modular side of that decision is where AI agent reasoning gets built for the next several years.

Demis Hassabis has named long-term memory and continual learning as the two main bottlenecks between current systems and AGI. Both live exactly here.

Sources

RULER: What's the Real Context Size of Your Long-Context LLMs?, NVIDIA's benchmark showing advertised context is 2 to 4x the real usable length
Lost in the Middle: How Language Models Use Long Contexts, foundational positional-degradation result
Summary of a Haystack, EMNLP 2024 paper on long-context collapse under distractors
BABILong: Testing the Limits of LLMs with Long Contexts, reasoning over facts spread across up to 10M tokens
Retrieval Quality at Context Limit, 2025 re-evaluation of position bias at 200K tokens
MemGPT: Towards LLMs as Operating Systems, the paged virtual context paper
Memory Blocks: The Key to Agentic Context Management, Letta's typed memory-block abstraction
Efficient Streaming Language Models with Attention Sinks, StreamingLLM, ICLR 2024
The Jamba 1.5 Open Model Family, AI21's hybrid SSM-Transformer release (vendor-stated benchmarks)
Jamba 1.5 on NVIDIA Developer Blog, 256K context on a single 80GB GPU
Building Production-Ready Agentic Systems, Shopify Sidekick's architecture
Traditional RAG vs. Agentic RAG, NVIDIA on retrieval as a callable tool
100M Token Context Windows, Magic.dev's unverified claim
Gartner: 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, adoption forecast
IDC FutureScape 2026, agentic AI deployment and pullback predictions
Context Engineering for Personalization, OpenAI's cookbook on long-term state management
