Should I build a multi-agent system if I have a 1M token context window?

Often yes, but for narrow reasons. Decomposition buys context isolation, wall-clock parallelism, and cheaper token economics, since a monolithic agent reprocesses its whole trace each turn. A big window does not remove the lost-in-the-middle recall problem or rule dilution from huge system prompts.

What is the orchestrator-worker pattern?

A capable primary agent decomposes an objective into parallel subtasks, spawns specialized workers with isolated contexts and narrow toolsets, then consolidates their structured summaries. Anthropic reported a 90.2% performance gain over a single-agent baseline using this for open-ended research.

Is the Cognition vs Anthropic multi-agent debate settled?

Yes, into a synthesis. Keep write paths single-threaded so parallel agents don't make conflicting implicit choices, while running intelligence and verification in parallel. Cognition's 2026 follow-up endorses generator-verifier loops, capability routing, and structured map-reduce.

What is the A2A protocol?

Agent2Agent is a Linux Foundation standard, merged with IBM's Agent Communication Protocol in August 2025, for inter-agent coordination. It defines Agent Cards, a stateful Task lifecycle, and JSON-RPC bindings so agents stay opaque while a coordinator talks to them via natural language and structured task states.

Which framework should I pick for agent orchestration in 2026?

Depends on the pattern you're running. LangGraph for deterministic graph routing and Mastra for durable graph workflows, Claude Code Workflows for hundreds of resumable agents, OpenAI Agents SDK for lightweight handoffs, and Microsoft Agent Framework's CodeAct when tool-call latency and token cost dominate.

One Mind or Many? The 2026 Subagent Systems Playbook

A 15-agent run, 5 Haiku searchers, 5 Haiku verifiers, and one Opus synthesizer, completed a full codebase analysis in about 572,000 subagent tokens and 3.5 minutes of wall-clock time. You physically can't stuff that same work into one context window.

That run is the argument in one shot. In 2026, the live question is where to draw the line between one mind and many.

Direct answer: keep write paths single-threaded, and run intelligence and verification in parallel. Decompose for context isolation, wall-clock speed, and token economics. A swarm for its own sake just burns budget.

TL;DR: The 2025 fight between Cognition ("don't build multi-agents") and Anthropic ("multi-agent research wins by 90.2%") resolved into a synthesis. Single agent writes, parallel agents verify and explore. Six orchestration patterns cover almost every real workload, so pick the framework that fits the pattern you actually run.

Key takeaways

Decompose even with a 1M window: isolation, parallelism, and quadratic token growth justify it.
Let one agent own the edits. Fan out search and review.
Six patterns cover the field: orchestrator-worker, hierarchical, blackboard, swarm, planner-executor, debate.
A2A (inter-agent) plus MCP (tools) are the coordination standards to build on.
Hard caps and tiered memory are what keep coordination from cascading into cost blowups.

Why decompose when one model holds a million tokens?

Because a big window gives you room to work and nothing more. Treat the context window like RAM you have to manage, not durable storage you can pile into.

Open-ended tools dump raw, unstructured output straight into your primary context. Search results, scraped HTML, DB rows. That noise clobbers your system prompt, guardrails, and formatting rules.

Delegating exploration to isolated subagents keeps the orchestrator's context clean. The subagent absorbs the mess and returns distilled findings.

Then there's wall-clock. Serial execution is the latency bottleneck. Multi-agent buys double-layer parallelism: an orchestrator spawns concurrent subagents, each running parallel tool calls, cutting execution time by up to 90%.

And token economics. A monolithic multi-turn agent reprocesses its entire trace every turn, so cost grows quadratically. Partition the work into short, disposable subagent turns whose intermediate tokens get evicted on completion, and only the compressed result appends to the main thread.

Token usage explains roughly 80% of the variance in success at locating hard-to-find information.

The tradeoff is real: multi-agent systems use about 15x more tokens than chat and 4x more than a single agent. So decompose on purpose, with a reason you can name.

The debate, resolved

In June 2025, Cognition published Don't Build Multi-Agents. The core claim held up. Parallel agents executing writes make conflicting implicit choices about code style, edge cases, and patterns, and without shared continuous context they clash on integration. Their Flappy Bird example: one subagent builds a Super-Mario-style background while another builds a mismatched bird.

Anthropic ran the other way. For open-ended research, single agents bottleneck on sequential reasoning and context limits. Their orchestrator-worker design, a LeadResearcher managing plans with subagents acting as intelligent filters, hit a 90.2% performance improvement over the single-agent baseline.

Both were right about different workloads. Cognition's 2026 follow-up, Multi-Agents: What's Actually Working, draws the line cleanly: writes stay single-threaded, intelligence and verification go multi-threaded. Three production patterns survived:

Generator-Verifier Loop. A writer (Devin) plus a review agent (Devin Review) iterate. The reviewer shares no context with the generator beforehand, operates on a clean diff, reasons backward from the implementation, and catches roughly 2 bugs per PR, 58% of them severe.
Smart Friend. A smaller sub-frontier model drives and escalates hard tasks to a "smart friend" tool, handing over a complete fork of its context and asking broad strategic questions.
Map-Reduce-and-Manage. A manager decomposes massive multi-PR or multi-service migrations, spawns child agents, and coordinates via MCP. Unstructured self-negotiating swarms got discarded as fragile.

The six orchestration patterns

Almost every workload maps to one of these.

Pattern	Coordination	Best for	Main failure mode	Cost/latency
Orchestrator-Worker	Primary decomposes, workers return summaries	Deep research, comparative doc analysis	Orchestrator context bloat on wide fan-out	Moderate, linear tokens
Hierarchical/Recursive	Managers spawn child subagents in tiers	Large migrations, system-wide audits	Communication gaps between distant nodes	High
Blackboard/Shared-Memory	Agents read/write a central state board	Async long-running enterprise ops	Race conditions without atomic isolation	Low-moderate
Swarm/Handoff	Peers transfer control along a graph	Multi-domain routing, service desks	Infinite routing loops, context loss	Low, efficient
Planner-Executor	Split planning from sandboxed execution	Code gen, DB queries, FS work	Infinite repair loops on ambiguous errors	Moderate
Debate/Critic	Generator vs isolated critic, iterate	High-stakes SWE, security, finance	High cost per decision	High

The orchestrator-worker pattern is the default for fan-out research because the primary stays in synthesis mode while workers do the dirty reads. The debate pattern earns its high cost differently.

An isolated critic auditing a candidate against a rubric is the most reliable quality lever available, which is why the generator-verifier loop is the one pattern everyone converged on. If you only adopt a single structure from this list, make it that one.

The 2026 framework matrix

The runtime should match the pattern. Here's how the field shakes out.

Framework	Model	Strength	Watch for
Claude Code	Subagents, Skills, Agent Teams, Workflows	Hundreds of resumable agents via deterministic Workflows	Workflow runtime blocks Date.now()/Math.random() by design
OpenAI Agents SDK	Manager + Handoffs, provider-agnostic	Lightweight, Python and TS	SandboxAgent still Beta (v0.14+)
LangGraph	Cycle-oriented graphs	Deterministic routing, checkpointing, low latency	More wiring up front
Google ADK	Agent-as-a-Tool, Agent Protocol	Versioned Artifact memory, A/V streaming	Vertex-leaning
MAF 1.0	Semantic Kernel + AutoGen	VM-isolated sandboxes, GA April 2026	.NET/Python only
Strands	Model-driven, OSS by Amazon	Native MCP, Zod self-repair	Newer ecosystem
Mastra	Durable graph workflows	OpenTelemetry, automated Scorers	TypeScript-first

Claude Code's dynamic Workflows release is the headline shift: agent() for a single git-worktree-isolated worker, parallel() for a concurrent barrier, and pipeline() for streaming items through stages. Failed runs pause and resume via local state journaling.

The most interesting performance story is Microsoft's CodeAct. Instead of a chatty multi-turn tool chain, the agent writes one Python script that runs all tool calls inside a fresh Hyperlight micro-VM and returns a single consolidated result.

CodeAct: collapsing the multi-turn tool chain

That's 27.81s down to 13.23s, and 6,890 tokens down to 2,489. For shallow tool-calling work, one synthesized script now beats a multi-turn swarm on both latency and cost.

What breaks multi-agent systems

Four failure modes recur, and each has a fix.

Context fragmentation. Parallel workers with clean contexts make conflicting implicit decisions, invisible to peers, that collide on fan-in. Fix: keep write paths single-threaded and reserve parallel workers for read and verify. In practice that means one agent owns the edits to a file while others propose diffs the owner applies.

Lost-in-the-middle. Transformers still show a U-shaped recall curve in 2026, even on 1M-token models. Past 50% capacity, recall degrades with distance from the end, and mid-context safety constraints get overlooked.

Omission-constraint decay. The Yeran Gamage study across thousands of trials found "must not" rules decay faster than "must" rules with depth: ~98% compliance at turn 3, 73% at turn 5, 33% at turn 16. Fix: constraint pinning, dynamically re-injecting critical omission rules at the absolute top of the system prompt every turn so depth never buries them.

Cascading non-determinism. A minor variation cascades, and an infinite self-correction loop can drain a token budget in minutes. Fix: hard caps on subagent spawning, execution journaling, and schema-guaranteed outputs that force a clean exit.

How do you measure a multi-agent system?

On four axes: task success rate against verified ground truth, cost in tokens per task, wall-clock latency, and trajectory accuracy (the correct tool-call sequence and parameters).

Mastra's trajectory scorers split into code-based (deterministic match against a gold trajectory, strict or relaxed) and LLM-based semantic judges. Real-time trust scoring, scoring each intermediate step before it executes, cut failure rates up to 50% on complex multi-step customer-service tasks.

For leaderboards: Opus 4.7 leads SWE-bench Verified at 87.6%, Claude Sonnet 4.5 tops GAIA at 74.55% under Princeton HAL, and tau-bench measures dynamic dialogue under strict policy.

The build playbook

Three steps, in order.

Step 1, decide the shape. Highly integrated, style-sensitive writes? Single-threaded consolidated agent with a generator-verifier loop. Decomposable into non-conflicting parallel steps? Hierarchical or orchestrator-worker, mapping subagents to separate worktrees. Multi-domain routing? A swarm graph with explicit policy boundaries.

Step 2, tier your memory using the MemGPT/Letta OS model documented in the 2026 memory systems guide. Core Memory is a small hot buffer pinned at the top so it can't get buried. A Working Buffer holds the last 20-40 messages and summarizes on capacity. An Archival Store sits in an external vector DB that the agent must explicitly query.

Step 3, standardize comms via the A2A protocol, now under the Linux Foundation after merging with IBM's ACP in August 2025.

Agents publish Agent Cards at /.well-known/agent.json, expose Tasks with a stateful lifecycle (SUBMITTED through COMPLETED/FAILED), and bind over JSON-RPC 2.0. Agents stay opaque black boxes; the coordinator talks to them in natural language plus structured task states.

Pair it with MCP for tools and AG-UI for human-in-the-loop.

What this means for you

Start single-threaded. Add a parallel critic before you add anything else, because the generator-verifier loop is the highest-return move on this list. Reach for orchestrator-worker only when the work genuinely fans out into independent reads.

Put deterministic control flow (a graph state machine or programmatic workflow) in charge of routing, and use models for isolated thinking inside the nodes. Cap your spawns, journal your runs, and treat the context window as RAM you must actively manage.

What to watch next: as always-on adaptive reasoning deepens, expect more shallow specialist swarms to collapse back into single capable agents with high effort budgets, and program synthesis to keep absorbing the multi-turn tool chains we still hand-wire today.

One Mind or Many? The 2026 Subagent Architecture Playbook