Frontier models score just 26 to 58 percent on long-horizon, state-dependent tasks when they have no external memory, according to Microsoft's STATE-Bench (May 2026). The same models clear far higher marks on short-context work.
That gap is not a model problem. It's an architecture problem, and it's the clearest signal yet in the stateful vs. stateless agent architecture debate.
Here's the paradox every production system in 2026 is built on: the LLM is stateless by construction, yet almost every useful agent is stateful. What ships as a "stateful agent" is a compound system, a frozen model wrapped in a runtime that owns memory, identity, and orchestration.
TL;DR
- Stateless agents are competitive on short, atomic tasks. Under five turns and 8K tokens, the architectures are statistically indistinguishable on quality and cost.
- Stateful agents dominate long horizons. On memory-specific benchmarks, external-memory systems beat full-context-window baselines by 10 to 30 percentage points.
- Managed-runtime compute pricing has converged across AWS, Google, and Microsoft (within roughly 3.5%), so price per token or per vCPU-hour no longer decides the architecture. The real economic levers are prefix caching and state-store cost, both stateful by definition.
- Every major 2026 framework converged on the same hybrid: stateless model, stateful runtime, explicit memory store.
A stateless agent processes each request independently, with no context surviving between calls. A stateful agent retains session, memory, and tool state across calls and re-derives the active context per request. In 2026, the engineering question is no longer whether to keep state, but how much, where, and at what cost.
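To make the two definitions concrete, here is a minimal sketch. The names `stateless_agent`, `StatefulAgent`, and `echo_model` are illustrative assumptions, and the toy model stands in for a real LLM call; no framework's actual API is implied.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for a model call: takes a prompt, returns text.
Model = Callable[[str], str]


def stateless_agent(model: Model, request: str) -> str:
    """Each call sees only its own request; nothing survives afterwards."""
    return model(request)


@dataclass
class StatefulAgent:
    """Runtime that owns per-session history and rebuilds the context per call."""
    model: Model
    sessions: dict[str, list[str]] = field(default_factory=dict)

    def handle(self, session_id: str, request: str) -> str:
        history = self.sessions.setdefault(session_id, [])
        # Re-derive the active context from stored state plus the new request.
        prompt = "\n".join(history + [f"user: {request}"])
        reply = self.model(prompt)
        # Persist both sides of the turn so the next call can see them.
        history.extend([f"user: {request}", f"assistant: {reply}"])
        return reply


def echo_model(prompt: str) -> str:
    """Toy model: reports how many lines of context it was given."""
    return f"saw {len(prompt.splitlines())} line(s)"
```

The model function is identical in both cases. Only the wrapper decides whether the second call in a session sees the first.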
What's the actual difference between stateful and stateless agents?
The difference is a deployment pattern, not a model property. The 2026 survey AI Agent Systems: Architectures, Applications, and Evaluation formalizes the agent layer as a runtime that compensates for the model's statelessness by maintaining an external store and projecting it into each call.
Three memory tiers are now standard in the literature, per the Memory for Autonomous LLM Agents survey:
| Tier | What it holds | Typical backing |
|---|---|---|
| Working memory | Active context: message buffer, tool traces, system prompt | Re-derived per request |
| Episodic memory | Time-ordered log of past interactions | Vector store (pgvector, Pinecone, Qdrant) or KV store (Redis) |
| Semantic memory | Consolidated facts, preferences, profiles | Structured store or knowledge graph |
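One way to picture the three tiers in code is the sketch below. It assumes plain in-memory containers in place of the real backing stores, and `AgentMemory` and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Episodic tier: time-ordered log (a list stands in for a vector or KV store).
    episodic: list[str] = field(default_factory=list)
    # Semantic tier: consolidated facts (a dict stands in for a structured store).
    semantic: dict[str, str] = field(default_factory=dict)

    def record(self, event: str) -> None:
        """Append one interaction to the episodic log."""
        self.episodic.append(event)

    def learn(self, key: str, value: str) -> None:
        """Store or overwrite one consolidated fact."""
        self.semantic[key] = value

    def working_context(self, system_prompt: str, request: str, recent: int = 2) -> str:
        """Working tier: re-derived on every request, never stored itself."""
        facts = [f"{k}: {v}" for k, v in sorted(self.semantic.items())]
        tail = self.episodic[-recent:] if recent > 0 else []
        return "\n".join([system_prompt, *facts, *tail, request])
```

Working memory has no field of its own here. It is a projection of the other two tiers plus the incoming request, which matches the "re-derived per request" row in the table.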
Two more axes matter. Server-side state (Letta, Bedrock AgentCore Memory, Vertex Memory Bank) simplifies the developer experience but couples you to a vendor. Client-side state keeps the model swappable but pushes consistency and lifecycle problems into your application.
Memory in the Age of AI Agents notes the dominant 2026 pattern is agentic memory that learns to consolidate and write to its own store, not a fixed RAG pipeline.
Where do stateful agents win on LLM benchmarks?
The gap between architectures widens with task horizon. On short tool-use tasks, stateless call patterns hold their own. On long-horizon software engineering and multi-turn dialogue, the best stateful systems beat the best stateless ones by 20 to 40 percentage points in 2025-2026 evaluations, per the 2025 AI Agent Index and the agent evaluation survey.
| Benchmark | Horizon | Rewards persistent state? | Top scores |
|---|---|---|---|
| SWE-bench Verified | Long (full repo) | Yes | ~65-80% |
| SWE-bench Pro | Very long (professional repos) | Yes | ~23-45% |
| τ-bench / τ²-bench | Multi-turn dialogue + tools | Yes | ~50-65% airline, lower retail |
| STATE-Bench (Microsoft, 2026) | Long-term state recall | Entirely | 26-58% without external memory |
| MemoryAgentBench (ICLR 2026) | Incremental multi-turn | Yes | 30-60% full-context; 70%+ with external memory |
The most architecture-sensitive result is τ-bench, which simulates customer-service conversations where users lie and change their minds. A stateless system that re-derives everything from the transcript typically loses 15 to 25 points against a system maintaining an explicit belief state.
MemoryAgentBench was built specifically to defeat context-window recall: information dispersed across 20-40 turns with distractors. The 10-30 point gap between full-context and external-memory systems is, essentially, the measured contribution of state.
One honest caveat. The Saving SWE-Bench paper showed that trivial mutations (renaming a function, reordering a file) swing top systems' scores by 20 to 50 percent. So treat sub-5-point deltas as noise. The 15-point-plus gaps on memory-specific benchmarks are the credible evidence.
When stateless agents are still the right call
Stateless wins when the entire relevant context fits in one request and nothing has lasting identity. That covers more production traffic than the hype suggests: one 2025 measurement found the median enterprise support ticket resolves in under four turns, comfortably inside a single context window.
The strong stateless cases:
- Atomic single-call work. Summarization, extraction, classification. Lower latency floor, smaller failure surface.
- High-throughput batch pipelines. Moderation, embedding generation. Horizontally scalable workers with zero session overhead.
- Zero-retention compliance. Stateless inference is the only pattern that guarantees the runtime writes no conversation data to disk. For EU data-residency or zero-retention contracts, it's legally cleaner, not just cheaper.
- Edge and on-device. Running a vector store and orchestrator on a phone or in a vehicle is usually infeasible.
- Evaluation harnesses. Stateless calls are reproducible and cacheable, which is exactly why SWE-bench and AgentBench invoke agents statelessly.
And state isn't free of risk. A 2026 Frontiers in Computer Science survey catalogs incidents where persistent memory was exfiltrated via prompt injection, tool poisoning, or session hijacking. An agent that forgets everything has a smaller attack surface.
AI performance economics: cost per resolved task, not per token
Per-token price no longer decides the architecture. Managed runtime compute pricing has converged: Bedrock AgentCore at roughly $0.0895 per vCPU-hour, Vertex AI Agent Engine at roughly $0.0864, with Microsoft Foundry and Anthropic's managed agents (vendor-stated, public beta at ~$0.08 per session-hour) in the same band.
Two stateful mechanisms now dominate the economics.
Prefix caching is the biggest lever. Reusing the KV cache for a stable prompt prefix cuts time-to-first-token by 30-70% and per-call cost by 40-80%, per vendor and academic measurements. A truly stateless request that rebuilds its prefix from scratch forfeits this entirely, because the runtime needs to recognize that call n+1 shares a prefix with call n. That recognition is state.
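A toy sketch of why that recognition counts as state: the runtime has to remember which prefixes it has already processed. `PrefixCache`, `billable_tokens`, and the 50% `cached_discount` are assumptions for illustration, not any vendor's mechanism or price.

```python
import hashlib


class PrefixCache:
    """Toy runtime-side cache keyed by a hash of the stable prompt prefix."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.hits = 0
        self.misses = 0

    def lookup(self, prefix: str) -> bool:
        """Return True if this exact prefix was processed before (a cache hit)."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key in self._seen:
            self.hits += 1
            return True
        self._seen.add(key)
        self.misses += 1
        return False


def billable_tokens(prefix_tokens: int, suffix_tokens: int, hit: bool,
                    cached_discount: float = 0.5) -> float:
    """Effective input tokens billed; cached_discount is an assumed rate."""
    prefix_cost = prefix_tokens * (1 - cached_discount) if hit else prefix_tokens
    return prefix_cost + suffix_tokens
```

Delete the `_seen` set and every call becomes a miss. That set is the smallest possible piece of cross-call state, and it is what the discount depends on.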
State reduces round-trips. A stateful system with explicit memory typically resolves a multi-step support task in 3-6 LLM calls. A stateless system re-deriving everything from the transcript needs 8-15 calls for the same task. At current token prices, fewer calls beat leaner calls.
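The arithmetic behind that claim can be sketched as follows. The call counts sit inside the ranges quoted above; the token sizes, the price, and the function name `cost_per_resolved_task` are made-up assumptions, so treat the output as an illustration rather than a measurement.

```python
def cost_per_resolved_task(calls: int, tokens_per_call: int,
                           price_per_1k_tokens: float,
                           resolution_rate: float = 1.0) -> float:
    """Expected spend per successfully resolved task."""
    if not 0 < resolution_rate <= 1:
        raise ValueError("resolution_rate must be in (0, 1]")
    spend_per_attempt = calls * tokens_per_call / 1000 * price_per_1k_tokens
    # Failed attempts still cost money, so divide by the share that resolve.
    return spend_per_attempt / resolution_rate


# Illustrative inputs only: fewer but larger calls vs. more but smaller calls.
stateful = cost_per_resolved_task(calls=5, tokens_per_call=4000, price_per_1k_tokens=0.01)
stateless = cost_per_resolved_task(calls=12, tokens_per_call=3000, price_per_1k_tokens=0.01)
```

With these assumed inputs the stateful path comes to about $0.20 per task against about $0.36 for the stateless one, even though each stateful call carries more tokens.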
Mem0's published evaluation makes the point concretely: 26% relative improvement on long-term memory tasks and a vendor-stated 91% cost reduction versus a full-context baseline.
The counterweight is prompt bloat. State that isn't summarized or evicted grows the working context past the cost-optimal point, which is why LangGraph and the Microsoft Agent Framework both ship explicit summarization and eviction hooks.
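A minimal sketch of the idea, assuming a fixed message budget and a caller-supplied summarizer. `BoundedContext` and `truncating_summarizer` are hypothetical names and do not mirror the LangGraph or Microsoft Agent Framework hooks.

```python
from collections import deque
from typing import Callable


class BoundedContext:
    """Keeps the newest messages verbatim and folds evicted ones into a summary."""

    def __init__(self, max_messages: int, summarize: Callable[[str, str], str]) -> None:
        if max_messages < 1:
            raise ValueError("max_messages must be at least 1")
        self.max_messages = max_messages
        self.summarize = summarize
        self.summary = ""
        self.messages: deque[str] = deque()

    def append(self, message: str) -> None:
        self.messages.append(message)
        # Evict oldest-first until the buffer fits the budget again.
        while len(self.messages) > self.max_messages:
            evicted = self.messages.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def render(self) -> list[str]:
        """Context to send: running summary (if any) followed by recent messages."""
        head = [f"summary: {self.summary}"] if self.summary else []
        return head + list(self.messages)


def truncating_summarizer(summary: str, evicted: str) -> str:
    """Toy summarizer: keeps only the first word of each evicted message."""
    words = evicted.split()
    first_word = words[0] if words else ""
    return (summary + " " + first_word).strip()
```

In a real system the summarizer would itself be a model call, which is a cost to weigh against the tokens it saves.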
The hidden tax: context management and operational complexity
Statefulness costs roughly 2-3x the initial engineering effort of a stateless baseline when you own the store yourself, and about 1.5-2x on a managed runtime. The recurring costs are worse than the upfront ones.
Production incident analyses found that "stale state" and "lost state on retry" were the two most common root causes of customer-visible agent failures, ahead of model errors. Microsoft's Memory Contracts pattern (2026) exists precisely to make those failure modes typed and debuggable.
Three taxes to price in:
- Observability. Correlating an LLM call, a tool call, a vector retrieval, and a reflection step across spans is genuinely hard; OpenTelemetry's GenAI conventions are still maturing.
- State migration. Swap a model or restructure your memory schema, and every live session is potentially inconsistent. Stateless deployments simply don't have this problem.
- Concurrency. Multiple writers to shared state means optimistic concurrency control, idempotent retries, and partial-failure recovery.
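The concurrency tax can be sketched with a version-checked write plus a request ID for idempotent retries. This is an in-memory illustration under assumed names (`SessionStore`, `VersionConflict`); a production store would need the same checks to be atomic on the server side.

```python
class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""


class SessionStore:
    """In-memory stand-in for a state store with compare-and-set semantics."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[int, dict]] = {}
        self._applied: set[str] = set()

    def read(self, session_id: str) -> tuple[int, dict]:
        """Return (version, copy of state); unknown sessions start at version 0."""
        version, state = self._data.get(session_id, (0, {}))
        return version, dict(state)

    def write(self, session_id: str, expected_version: int,
              state: dict, request_id: str) -> int:
        current_version, _ = self._data.get(session_id, (0, {}))
        # Idempotent retry: this request was already applied, so do nothing.
        if request_id in self._applied:
            return current_version
        # Optimistic concurrency: reject writers that read an older version.
        if expected_version != current_version:
            raise VersionConflict(
                f"expected v{expected_version}, store is at v{current_version}"
            )
        self._data[session_id] = (current_version + 1, dict(state))
        self._applied.add(request_id)
        return current_version + 1
```

A writer that hits `VersionConflict` re-reads, reapplies its change, and tries again with a new expected version. A writer that merely timed out can resend the same `request_id` without double-applying.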
And state doesn't repeal long-context limits. Stuffing 200K tokens of retrieved memory into the window pays the same "lost in the middle" attention tax as stuffing in 200K tokens of raw transcript. State moves the problem; it doesn't solve it.
How should you choose? A five-axis decision framework
Score your workload on task horizon, persistence requirement, latency budget, scale, and team capability. Then apply the gates below. This is the conservative path, and it defaults to hybrid.
| Use case | Persistence | Recommended architecture |
|---|---|---|
| Document classification, extraction at scale | None | Stateless |
| RAG over a static corpus | Build-time only | Stateless (the index is a build artifact, not runtime state) |
| Multi-turn customer support | Cross-session | Hybrid on a managed runtime |
| Personalized assistant (days/weeks) | Long-term, per-user | Hybrid + Mem0 or Letta |
| Coding agent (SWE-bench-class) | Within-session | Hybrid (LangGraph, OpenAI Agents SDK) |
| Regulated workloads (audit, replay) | Regulated | Stateful with explicit trace store |
| Edge / on-device | Local only | Stateless with bounded context |
The gates, in order. If persistence is none and the horizon is short, build stateless and stop. If your latency budget is under 200ms, prefer stateless with a precomputed prefix cache unless you accept vendor-managed server-side state (the Realtime API path).
For everything cross-session and medium-to-long horizon, the default is the hybrid every major framework now implements: stateless LLM inside a stateful runtime.
Above roughly 100K concurrent sessions, managed runtime costs can exceed the engineering savings, and a client-side store on Kubernetes with Redis or pgvector becomes the better trade. The multi-agent orchestration survey flags shared-state protocols as the dominant open problem at that scale, so budget accordingly.
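The gates above can be written down as an ordered function. The thresholds (200ms, roughly 100K sessions) come from the text; the function name, parameter names, and return labels are hypothetical, and team capability is left out because the text gives no gate for it.

```python
def choose_architecture(persistence: str, horizon: str, latency_budget_ms: int,
                        concurrent_sessions: int,
                        accepts_vendor_state: bool = False) -> str:
    """Apply the gates in order and return a coarse architecture label."""
    # Gate 1: nothing to persist and a short horizon means stateless, full stop.
    if persistence == "none" and horizon == "short":
        return "stateless"
    # Gate 2: tight latency budgets favor stateless unless vendor-managed
    # server-side state is acceptable.
    if latency_budget_ms < 200 and not accepts_vendor_state:
        return "stateless with precomputed prefix cache"
    # Gate 3: at very high concurrency, owning the store becomes the better trade.
    if concurrent_sessions > 100_000:
        return "hybrid with client-side store"
    # Default: stateless LLM inside a managed stateful runtime.
    return "hybrid on a managed runtime"
```

For example, `choose_architecture("cross-session", "long", 1000, 5_000)` falls through every gate and lands on the managed hybrid default.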
What this means for you
Stop framing this as stateful versus stateless. Per the Agentic AI comprehensive survey, every shipped framework already converged on the hybrid; your decision is how much state, where it lives, and who pays for it.
Practical moves this quarter: model cost per resolved task, not per call (they differ by 2-5x on stateful workloads). Confirm prompt caching is actually enabled. Make your session boundary, eviction policy, and schema versioning explicit before launch, because those are where the postmortems come from.
And watch the durable-execution wave. Typed memory contracts and Temporal-style agent runtimes are on track to collapse most of the complexity gap between these architectures by 2027. The conservative bet for mid-2026: stateless core, managed stateful runtime, typed memory layer, OpenTelemetry instrumentation. Revisit when the next generation lands.
Sources
- AI Agent Systems: Architectures, Applications, and Evaluation (arXiv, 2026), formalizes the agent layer as a stateful runtime over a stateless model
- The 2025 AI Agent Index (arXiv), benchmark landscape and stateful/stateless performance gaps
- Evaluation and Benchmarking of LLM Agents: A Survey (arXiv), methodology and benchmark coverage
- Memory in the Age of AI Agents (arXiv, 2026), survey of agentic memory patterns
- Memory for Autonomous LLM Agents (arXiv), the three-tier memory taxonomy
- MemoryAgentBench (ICLR 2026, OpenReview), incremental multi-turn memory evaluation
- Saving SWE-Bench: A Benchmark Mutation Approach (arXiv), benchmark brittleness and 20-50% score swings
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory (arXiv), 26% memory improvement, 91% cost reduction figures
- Agentic AI: A Comprehensive Survey (arXiv), framework consolidation and contamination concerns
- The Orchestration of Multi-Agent Systems (arXiv, 2026), shared-state protocols as the open problem
- Letta (GitHub), the stateless-API-over-stateful-server pattern
- LangGraph documentation, checkpointers, summarization, and eviction hooks
- Mem0 documentation, add/update/delete memory primitives
