A 15-agent run, 5 Haiku searchers, 5 Haiku verifiers, and one Opus synthesizer, completed a full codebase analysis in about 572,000 subagent tokens and 3.5 minutes of wall-clock time. You physically can't stuff that same work into one context window.
That run is the argument in one shot. In 2026, the live question is where to draw the line between one mind and many.
Direct answer: keep write paths single-threaded, and run intelligence and verification in parallel. Decompose for context isolation, wall-clock speed, and token economics. A swarm for its own sake just burns budget.
TL;DR: The 2025 fight between Cognition ("don't build multi-agents") and Anthropic ("multi-agent research wins by 90.2%") resolved into a synthesis. Single agent writes, parallel agents verify and explore. Six orchestration patterns cover almost every real workload, so pick the framework that fits the pattern you actually run.
Key takeaways
- Decompose even with a 1M window: isolation, parallelism, and quadratic token growth justify it.
- Let one agent own the edits. Fan out search and review.
- Six patterns cover the field: orchestrator-worker, hierarchical, blackboard, swarm, planner-executor, debate.
- A2A (inter-agent) plus MCP (tools) are the coordination standards to build on.
- Hard caps and tiered memory are what keep coordination from cascading into cost blowups.
Why decompose when one model holds a million tokens?
Because a big window gives you room to work and nothing more. Treat the context window like RAM you have to manage, not durable storage you can pile into.
Open-ended tools dump raw, unstructured output straight into your primary context. Search results, scraped HTML, DB rows. That noise clobbers your system prompt, guardrails, and formatting rules.
Delegating exploration to isolated subagents keeps the orchestrator's context clean. The subagent absorbs the mess and returns distilled findings.
Then there's wall-clock. Serial execution is the latency bottleneck. Multi-agent buys double-layer parallelism: an orchestrator spawns concurrent subagents, each running parallel tool calls, cutting execution time by up to 90%.
And token economics. A monolithic multi-turn agent reprocesses its entire trace every turn, so cost grows quadratically. Partition the work into short, disposable subagent turns whose intermediate tokens get evicted on completion, and only the compressed result appends to the main thread.
Token usage explains roughly 80% of the variance in success at locating hard-to-find information.
The tradeoff is real: multi-agent systems use about 15x more tokens than chat and 4x more than a single agent. So decompose on purpose, with a reason you can name.
The debate, resolved
In June 2025, Cognition published Don't Build Multi-Agents. The core claim held up. Parallel agents executing writes make conflicting implicit choices about code style, edge cases, and patterns, and without shared continuous context they clash on integration. Their Flappy Bird example: one subagent builds a Super-Mario-style background while another builds a mismatched bird.
Anthropic ran the other way. For open-ended research, single agents bottleneck on sequential reasoning and context limits. Their orchestrator-worker design, a LeadResearcher managing plans with subagents acting as intelligent filters, hit a 90.2% performance improvement over the single-agent baseline.
Both were right about different workloads. Cognition's 2026 follow-up, Multi-Agents: What's Actually Working, draws the line cleanly: writes stay single-threaded, intelligence and verification go multi-threaded. Three production patterns survived:
- Generator-Verifier Loop. A writer (Devin) plus a review agent (Devin Review) iterate. The reviewer shares no context with the generator beforehand, operates on a clean diff, reasons backward from the implementation, and catches roughly 2 bugs per PR, 58% of them severe.
- Smart Friend. A smaller sub-frontier model drives and escalates hard tasks to a "smart friend" tool, handing over a complete fork of its context and asking broad strategic questions.
- Map-Reduce-and-Manage. A manager decomposes massive multi-PR or multi-service migrations, spawns child agents, and coordinates via MCP. Unstructured self-negotiating swarms got discarded as fragile.
The six orchestration patterns
Almost every workload maps to one of these.
| Pattern | Coordination | Best for | Main failure mode | Cost/latency |
|---|---|---|---|---|
| Orchestrator-Worker | Primary decomposes, workers return summaries | Deep research, comparative doc analysis | Orchestrator context bloat on wide fan-out | Moderate, linear tokens |
| Hierarchical/Recursive | Managers spawn child subagents in tiers | Large migrations, system-wide audits | Communication gaps between distant nodes | High |
| Blackboard/Shared-Memory | Agents read/write a central state board | Async long-running enterprise ops | Race conditions without atomic isolation | Low-moderate |
| Swarm/Handoff | Peers transfer control along a graph | Multi-domain routing, service desks | Infinite routing loops, context loss | Low, efficient |
| Planner-Executor | Split planning from sandboxed execution | Code gen, DB queries, FS work | Infinite repair loops on ambiguous errors | Moderate |
| Debate/Critic | Generator vs isolated critic, iterate | High-stakes SWE, security, finance | High cost per decision | High |
The orchestrator-worker pattern is the default for fan-out research because the primary stays in synthesis mode while workers do the dirty reads. The debate pattern earns its high cost differently.
An isolated critic auditing a candidate against a rubric is the most reliable quality lever available, which is why the generator-verifier loop is the one pattern everyone converged on. If you only adopt a single structure from this list, make it that one.
The 2026 framework matrix
The runtime should match the pattern. Here's how the field shakes out.
| Framework | Model | Strength | Watch for |
|---|---|---|---|
| Claude Code | Subagents, Skills, Agent Teams, Workflows | Hundreds of resumable agents via deterministic Workflows | Workflow runtime blocks Date.now()/Math.random() by design |
| OpenAI Agents SDK | Manager + Handoffs, provider-agnostic | Lightweight, Python and TS | SandboxAgent still Beta (v0.14+) |
| LangGraph | Cycle-oriented graphs | Deterministic routing, checkpointing, low latency | More wiring up front |
| Google ADK | Agent-as-a-Tool, Agent Protocol | Versioned Artifact memory, A/V streaming | Vertex-leaning |
| MAF 1.0 | Semantic Kernel + AutoGen | VM-isolated sandboxes, GA April 2026 | .NET/Python only |
| Strands | Model-driven, OSS by Amazon | Native MCP, Zod self-repair | Newer ecosystem |
| Mastra | Durable graph workflows | OpenTelemetry, automated Scorers | TypeScript-first |
Claude Code's dynamic Workflows release is the headline shift: agent() for a single git-worktree-isolated worker, parallel() for a concurrent barrier, and pipeline() for streaming items through stages. Failed runs pause and resume via local state journaling.
The most interesting performance story is Microsoft's CodeAct. Instead of a chatty multi-turn tool chain, the agent writes one Python script that runs all tool calls inside a fresh Hyperlight micro-VM and returns a single consolidated result.
That's 27.81s down to 13.23s, and 6,890 tokens down to 2,489. For shallow tool-calling work, one synthesized script now beats a multi-turn swarm on both latency and cost.
What breaks multi-agent systems
Four failure modes recur, and each has a fix.
Context fragmentation. Parallel workers with clean contexts make conflicting implicit decisions, invisible to peers, that collide on fan-in. Fix: keep write paths single-threaded and reserve parallel workers for read and verify. In practice that means one agent owns the edits to a file while others propose diffs the owner applies.
Lost-in-the-middle. Transformers still show a U-shaped recall curve in 2026, even on 1M-token models. Past 50% capacity, recall degrades with distance from the end, and mid-context safety constraints get overlooked.
Omission-constraint decay. The Yeran Gamage study across thousands of trials found "must not" rules decay faster than "must" rules with depth: ~98% compliance at turn 3, 73% at turn 5, 33% at turn 16. Fix: constraint pinning, dynamically re-injecting critical omission rules at the absolute top of the system prompt every turn so depth never buries them.
Cascading non-determinism. A minor variation cascades, and an infinite self-correction loop can drain a token budget in minutes. Fix: hard caps on subagent spawning, execution journaling, and schema-guaranteed outputs that force a clean exit.
How do you measure a multi-agent system?
On four axes: task success rate against verified ground truth, cost in tokens per task, wall-clock latency, and trajectory accuracy (the correct tool-call sequence and parameters).
Mastra's trajectory scorers split into code-based (deterministic match against a gold trajectory, strict or relaxed) and LLM-based semantic judges. Real-time trust scoring, scoring each intermediate step before it executes, cut failure rates up to 50% on complex multi-step customer-service tasks.
For leaderboards: Opus 4.7 leads SWE-bench Verified at 87.6%, Claude Sonnet 4.5 tops GAIA at 74.55% under Princeton HAL, and tau-bench measures dynamic dialogue under strict policy.
The build playbook
Three steps, in order.
Step 1, decide the shape. Highly integrated, style-sensitive writes? Single-threaded consolidated agent with a generator-verifier loop. Decomposable into non-conflicting parallel steps? Hierarchical or orchestrator-worker, mapping subagents to separate worktrees. Multi-domain routing? A swarm graph with explicit policy boundaries.
Step 2, tier your memory using the MemGPT/Letta OS model documented in the 2026 memory systems guide. Core Memory is a small hot buffer pinned at the top so it can't get buried. A Working Buffer holds the last 20-40 messages and summarizes on capacity. An Archival Store sits in an external vector DB that the agent must explicitly query.
Step 3, standardize comms via the A2A protocol, now under the Linux Foundation after merging with IBM's ACP in August 2025.
Agents publish Agent Cards at /.well-known/agent.json, expose Tasks with a stateful lifecycle (SUBMITTED through COMPLETED/FAILED), and bind over JSON-RPC 2.0. Agents stay opaque black boxes; the coordinator talks to them in natural language plus structured task states.
Pair it with MCP for tools and AG-UI for human-in-the-loop.
What this means for you
Start single-threaded. Add a parallel critic before you add anything else, because the generator-verifier loop is the highest-return move on this list. Reach for orchestrator-worker only when the work genuinely fans out into independent reads.
Put deterministic control flow (a graph state machine or programmatic workflow) in charge of routing, and use models for isolated thinking inside the nodes. Cap your spawns, journal your runs, and treat the context window as RAM you must actively manage.
What to watch next: as always-on adaptive reasoning deepens, expect more shallow specialist swarms to collapse back into single capable agents with high effort budgets, and program synthesis to keep absorbing the multi-turn tool chains we still hand-wire today.
Related guides
- Your Model Isn't the Agent. Your Agentic Harness Is.
- Long-Horizon Agents Run for Hours. Wield Them Safely
Sources
- Don't Build Multi-Agents (Cognition)
- Multi-Agents: What's Actually Working (Cognition)
- Building a multi-agent research system (Anthropic via ZenML)
- Building Effective AI Agents (Anthropic)
- Orchestrate subagents at scale with dynamic workflows (Claude Code Docs)
- Claude Code adds Dynamic Workflows (InfoQ)
- A Harness for Every Task (Towards Data Science)
- OpenAI Agents SDK (Python)
- OpenAI Agents SDK (JS/TS)
- How and when to build multi-agent systems (LangChain)
- Agent Development Kit (Google)
- Microsoft Agent Framework v1.0 (DevBlogs)
- Microsoft Agent Framework at BUILD 2026 / CodeAct
- Strands Agents
- Mastra
- Mastra trajectory-accuracy scorers
- Lost-in-the-Middle is still real in 2026 (DEV)
- Context window is RAM not storage (Mem0)
- AI agent memory systems guide 2026
- Agent2Agent (A2A) Project
- A2A specification
- AG-UI Protocol
- AI Agent Leaderboard 2026 (Rapid Claw)
- HAL GAIA Leaderboard (Princeton)
- tau-bench (Sierra)
