On SWE-bench Verified, the most-cited coding benchmark, Claude Code and OpenAI Codex are separated by 1.1 points (88.7% for GPT-5.5 + Codex, 87.6% for Opus 4.7 + Claude Code, per the SWE-bench Verified leaderboard). That gap is noise.
Look at the next benchmark over and the picture splits wide open. On SWE-bench Pro, the harder multi-file version, Claude Code leads by 10.6 points. On Terminal-Bench 2.1, Codex leads by 5.3. These two tools are not slightly-different versions of the same thing. They're optimized for different jobs.
This is the practitioner's guide to the Claude Code vs Codex decision as of June 2026: which agentic coding harness actually ships more code, on which kind of task, at what cost.
TL;DR
Codex wins short-horizon and terminal-style work; Claude Code wins long-horizon, multi-file, large-codebase work. As of June 2026, Codex (GPT-5.5) leads Terminal-Bench 2.1 at 83.4% and ships a true OS-level sandbox, making it the stronger pick for shell-driven and security-sensitive automation. Claude Code leads SWE-bench Pro at 69.2% (Opus 4.8) and MRCR at 1M tokens at 78.3% (Opus 4.6), backed by nested sub-agents and a four-layer config stack, making it the stronger pick for sprawling refactors. They've converged on MCP and AGENTS.md, so the real choice is harness ergonomics plus model fit, not protocol lock-in.
Key takeaways
- No single winner. The right tool depends on whether your work is sequential-and-long (Codex) or decomposable-and-wide (Claude Code).
- The benchmarks disagree by design. Codex: Terminal-Bench, SWE-bench Verified. Claude Code: SWE-bench Pro, OSWorld, long-context MRCR.
- The protocol war is over. Both are MCP clients; both read AGENTS.md. Pick on server availability and model, not "who has MCP."
- Pricing mirrors. $20 entry tiers, $25/seat teams. The differentiator is per-token API cost and harness depth, not subscription price.
- Versions rot in ~90 days. Both vendors ship every 6-8 weeks. Standardize on the durable abstraction (AGENTS.md + MCP), treat the harness as swappable.
What is the difference between Claude Code and Codex?
Claude Code is a single, deeply configurable harness: one CLI plus thin IDE wrappers. Codex is a distributed family of five surfaces sharing one SDK. That architectural split explains almost every downstream difference.
Claude Code ships as the CLI (v2.1.172 as of the week of June 8, 2026) with first-party VS Code, JetBrains, Cursor, Windsurf, and Zed integrations. The default model is Claude Sonnet 4.6 on Pro/Max 5x, and Claude Opus 4.6 on Max 20x and up, with Opus 4.7 and Opus 4.8 available as model choices.
OpenAI Codex is five surfaces: the Codex CLI (open source), the VS Code extension, the JetBrains/Xcode IDE extension, Codex Cloud, and the Codex desktop app for Windows, macOS, and iOS. The default model in June 2026 is GPT-5.5 (released April 23, 2026), with GPT-5.4 and the cheaper GPT-5.3-Codex still in rotation.
What changed in the last 60 days?
The field moves monthly, so the recency anchor matters. Here's what a June 2026 buyer should weigh.
Claude Code's update rhythm has been harness-heavy. The CLI went from 2.1.169 to 2.1.172 in a single week (June 8-15), adding a /cd command, nested sub-agents (depth 5 in background mode), a --safe-mode flag, and a fallbackModel config field. That sub-agent change is the most material harness shift of the month.
OpenAI's rhythm has been model-heavy. GPT-5.4 landed in March, GPT-5.5 in April, and on May 17 GPT-5.3-Codex became the base model for Copilot Business and Enterprise. The Codex line is now OpenAI's default coding model, not a niche variant.
One cautionary tale: Anthropic launched Claude Fable 5 on June 9 and pulled it from claude.ai on June 12 under a U.S. Commerce Department order, per Simon Willison's tracking. Claude Code itself was untouched, but anyone who hard-coded Fable 5 into CI lost a model overnight. "Shipped" can revert in days. Build for it.
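One way to "build for it" is to resolve the model at run time from an ordered preference list instead of pinning a single name. The sketch below is a minimal illustration, not either vendor's API: `resolve_model`, the `PREFERENCE` list, and the model identifier strings are all hypothetical, and the list of available models is assumed to come from whatever catalog your provider exposes.

```python
from typing import Iterable, Sequence


def resolve_model(preferred: Sequence[str], available: Iterable[str]) -> str:
    """Return the first preferred model that is currently available."""
    available_set = set(available)
    for name in preferred:
        if name in available_set:
            return name
    raise LookupError(f"none of {list(preferred)} is available")


# Hypothetical identifiers; substitute the names your provider actually lists.
PREFERENCE = ["fable-5", "opus-4.8", "sonnet-4.6"]

# Simulate a withdrawal: the first choice is missing from the catalog,
# so the job falls through to the next entry instead of failing.
print(resolve_model(PREFERENCE, ["opus-4.8", "sonnet-4.6"]))
```

A CI job that calls a helper like this degrades to the next model rather than breaking when one is pulled.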
The takeaway for planning: if your team chases the latest model benchmarks, Codex's model lineup refreshes faster. If you value harness ergonomics, Claude Code keeps stacking primitives.
CLAUDE.md vs AGENTS.md: who controls the agent better?
This is where the convergence story gets interesting. AGENTS.md is a community-maintained standard for repo-level agent instructions, now read natively by Claude Code, Codex CLI, Cursor, Aider, Devin, Cline, and twenty-plus other agent CLIs.
The 2026 development: Claude Code now reads AGENTS.md directly, alongside CLAUDE.md. So a team that standardizes on AGENTS.md gets it honored by both tools, and a team wanting Claude-specific behavior layers CLAUDE.md on top. Genuine convergence, not marketing.
Below the instruction file, the configuration stories diverge sharply.
Claude Code runs a four-layer stack documented at code.claude.com: CLAUDE.md memory, sub-agents (.claude/agents/*.md, each with its own model and tools), skills, and hooks on tool calls. Around that stack sit slash commands like /review and /security-review, three-layered memory, and git worktree isolation. It's the deepest config surface in the category.
Codex is more uniform but less composable: one config.toml covering model, approval policy (untrusted/on-failure/on-request/never), and sandbox mode, plus per-task Codex Cloud config. There is no native sub-agent primitive. Codex parallelizes externally by spawning tasks; Claude Code parallelizes internally with sub-agents.
The honest trade-off is contested. Some reviews argue Claude Code's depth is a liability for small teams who never customize past CLAUDE.md and still pay the cognitive cost. Others call the depth a clear win. The disagreement is real. Pick by team size and whether one engineer will own the harness.
Which coding agent is better for long, multi-step tasks?
Autonomy here means finishing a long task ("fix the bug breaking test #4,217, then open a PR") without a human babysitting it. Both tools approach it differently, and that difference is the cleanest way to choose.
Codex Cloud is built to "let it run." OpenAI's long-horizon guide documents multi-hour cloud runs against sandboxed environments. For a single sequential job ("migrate this monorepo to a new ORM"), that model has the edge.
Claude Code parallelizes instead. Nested sub-agents (v2.1.172) decompose work into tracks that each complete smaller steps and report back. The new fallbackModel field enables a tiered strategy: Opus 4.8 for the planner, Sonnet 4.6 for sub-agents, keeping cost and latency sane.
For a decomposable job ("triage failing tests across 12 packages"), the built-in parallelism wins.
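The decompose-and-report pattern can be sketched in plain Python, independent of either tool. This is an illustration of the fan-out shape only, not Claude Code's sub-agent mechanism: `triage_package` is a stand-in for one track, and the model identifiers are hypothetical labels for the planner and worker tiers.

```python
from concurrent.futures import ThreadPoolExecutor

PLANNER_MODEL = "opus-4.8"   # hypothetical identifier for the planning tier
WORKER_MODEL = "sonnet-4.6"  # hypothetical identifier for the worker tier


def triage_package(package: str, model: str) -> dict:
    """Stand-in for one track; a real one would invoke an agent."""
    return {"package": package, "model": model, "status": "triaged"}


def fan_out(packages: list[str], max_workers: int = 4) -> list[dict]:
    """Run one independent track per package; reports come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: triage_package(p, WORKER_MODEL), packages))


def run(packages: list[str]) -> dict:
    """Planner-level wrapper: dispatch the tracks, then gather their reports."""
    return {"planner": PLANNER_MODEL, "tracks": fan_out(packages)}


packages = [f"pkg-{i:02d}" for i in range(1, 13)]
result = run(packages)
print(len(result["tracks"]), result["tracks"][0]["package"])
```

The point of the shape is that each track is independent, so wall-clock time scales with the slowest package rather than the sum of all twelve.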
Context handling reinforces the split. Both flagships ship 1M-token windows (Opus 4.6 and GPT-5.5). But using that window is harder than having it. On MRCR, the long-context retrieval benchmark, Opus 4.6 scores 78.3% at 1M tokens versus GPT-5.5's 49.1%. That's the largest gap in this comparison.
One contested figure worth flagging: some community reports claim Opus 4.7's MRCR-at-1M collapsed to 32.2%, a 46-point regression. It is not on Anthropic's published leaderboard, it contradicts Anthropic's own positioning, and it's the single most disputed number in the 2026 corpus. Treat it as unverified.
On safety primitives, Codex has a structural advantage. Its OS-level sandbox (read-only, workspace-write, danger-full-access) is a real isolation boundary. Claude Code's --safe-mode and hooks are policy, not a sandbox.
For running untrusted agent-generated shell, the Codex sandbox is the stronger guarantee. For verification on trusted code, Claude Code's /review, /security-review, plan mode, and Claude Code Security are the richer toolkit, with Codex Security as the OpenAI counterpart.
Caveat worth stating plainly: as of June 16, 2026, no widely-cited independent study (METR, Princeton, Berkeley) ranks the two head-to-head on multi-step completion. The evidence is fragmented across vendor case studies and forum threads, including an active Codex Cloud stream-disconnect bug report. Expect the first real independent study in Q3 2026.
Claude Code vs Codex benchmarks (mid-2026)
Here's the scorecard. Model and harness are named together, because harness differences can swing scores 5-10 points.
| Benchmark | Best Claude Code | Best Codex | Source |
|---|---|---|---|
| SWE-bench Verified | 87.6% (Opus 4.7) | 88.7% (GPT-5.5) | llm-stats |
| SWE-bench Pro | 69.2% (Opus 4.8) | 58.6% (GPT-5.5) | evolink |
| Terminal-Bench 2.1 | 78.1% (Opus 4.7) | 83.4% (GPT-5.5) | morphllm |
| OSWorld-Verified | 82.8% (Opus 4.7) | 71.5% (GPT-5.5) | findskill |
| MRCR @ 1M tokens | 78.3% (Opus 4.6) | 49.1% (GPT-5.5) | learn-prompting |
The chart below shows the three benchmarks where the gap is decisive.
How to read this without getting fooled:
SWE-bench Verified is the cleanest apples-to-apples number, and it's a tie within noise. Do not let a 1.1-point gap drive a six-figure decision.
SWE-bench Pro is the harder version: 1,865 issues from professional repos. Claude Code's 10.6-point lead is the most-cited 2026 reason to pick it for real multi-file refactors.
Terminal-Bench 2.1 measures shell-session task completion. Codex's lead (submitted by OpenAI, June 9, 2026) matches how OpenAI positions GPT-5.5 as a long-horizon terminal agent. Independent replications are still thin, so verify before quoting.
OSWorld-Verified is desktop computer-use. Opus 4.7's 82.8% sits above the roughly 72% human baseline cited in the OSWorld paper, which is why Anthropic leaned into "computer use" positioning.
What none of these measure: "did the agent finish my real feature on the first try without supervision." The closest proxy is SWE-bench Pro, and it favors Claude Code.
But both tools can ace a benchmark and still hand you code that needs revision. The best signal in 2026 is your own eval harness on your own repo.
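The gaps quoted throughout this article can be recomputed from the scorecard table. The short script below uses only the figures in that table; the `gaps` helper and its output format are illustrative choices.

```python
# (Claude Code score, Codex score) per benchmark, copied from the scorecard.
SCORES = {
    "SWE-bench Verified": (87.6, 88.7),
    "SWE-bench Pro": (69.2, 58.6),
    "Terminal-Bench 2.1": (78.1, 83.4),
    "OSWorld-Verified": (82.8, 71.5),
    "MRCR @ 1M tokens": (78.3, 49.1),
}


def gaps(scores: dict) -> dict:
    """Claude Code minus Codex, in points; positive means Claude Code leads."""
    return {name: round(claude - codex, 1) for name, (claude, codex) in scores.items()}


for name, gap in gaps(SCORES).items():
    leader = "Claude Code" if gap > 0 else "Codex"
    print(f"{name:20s} {leader:12s} +{abs(gap):.1f}")
```

Keeping the numbers in one structure like this also makes it easy to swap in scores from your own eval runs.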
Ecosystem: MCP, IDEs, and pricing
The MCP advantage Claude Code held in 2025 is gone. Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation on December 9, 2025, with Block and OpenAI as co-founders. Both tools are MCP clients, so any MCP server works with either. Choose on server availability, not protocol.
IDE coverage splits on developer-relations reach. Codex has deeper JetBrains and Xcode support (first-party IDE extension). Claude Code has the broader VS Code, Cursor, and Windsurf footprint. The Zed integration is symmetric, both as "External Agents."
On pricing, the two are deliberate mirrors. Claude Pro and ChatGPT Plus are both $20/mo; Team Standard and Codex Business are both $25/seat, per Anthropic and OpenAI. The one new tier this window is OpenAI's $8 ChatGPT Go.
API pricing per million tokens (input/output) is where real money lives. Opus 4.6+ is $5/$25; GPT-5.5 is $5/$30. Input rates match, so for heavy code generation (long output) Claude is cheaper. The budget play is GPT-5.3-Codex at $1.75/$14, but with a 400K context instead of 1M.
Anthropic's Opus 4.6 launch cut the prior Opus rate by 67%, which is why it's now price-competitive.
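To see how the per-token rates play out, here is a rough cost calculation. It assumes each quoted pair is input price then output price per million tokens, and the workload of 200M input and 50M output tokens is a made-up example, not a measurement.

```python
# USD per million tokens as (input, output), taken from the rates quoted above.
RATES = {
    "Opus 4.6+": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.3-Codex": (1.75, 14.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """API cost in USD for a given token volume at the listed rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


# Hypothetical monthly workload: 200M input tokens, 50M output tokens.
for model in RATES:
    print(f"{model:14s} ${cost_usd(model, 200_000_000, 50_000_000):,.2f}")
```

At equal input rates the difference comes entirely from output tokens, which is why output-heavy workloads are where the Opus and GPT-5.5 bills diverge.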
One TCO risk to raise with your account team: Forbes reported "huge pricing issues with glitching Claude Code limits" on March 26, 2026. Anthropic acknowledged intermittent billing edge cases for Max-plan customers. It reads as glitchy limits rather than systematic overcharging, but flag it before a Max-plan migration.
What this means for you
Pick by the dimension you actually optimize for, not by the headline benchmark.
| Your situation | Pick | Why |
|---|---|---|
| Long-horizon, multi-file refactors | Claude Code | SWE-bench Pro 69.2%, MRCR 78.3%, sub-agents |
| Terminal/shell-driven automation | Codex | Terminal-Bench 83.4%, OS-level sandbox |
| Untrusted agent-generated shell | Codex | Real sandbox boundary, not policy |
| Deep custom harness (hooks, skills) | Claude Code | Four-layer config stack |
| Cross-vendor instruction portability | Both | Standardize on AGENTS.md |
| JetBrains / Xcode shop | Codex | First-party IDE extension |
| Cursor / Windsurf shop | Claude Code | First-party extensions |
| Non-engineer entry point | Codex | Desktop + mobile app, no CLI needed |
Two anti-patterns to avoid. First, don't choose on a single benchmark; the SWE-bench Verified gap is 1.1 points of noise. Second, don't equate "newest model" with "best for my code."
GPT-5.5 and Opus 4.8 shipped within 60 days of each other, both on a roughly monthly refresh. For a 6-month commitment, the durable question is which harness and configuration story fits your team.
The strategic move for any team with a 6-month-plus horizon: treat AGENTS.md plus MCP as your abstraction layer and the two harnesses as interchangeable behind it. The MCP donation to the Linux Foundation makes that a stable bet.
Then run a 30-day pilot of both on five identical tasks. The symmetric subscription pricing removes seat cost as a variable, so you measure what actually matters: which one ships more of your code, first try.
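Scoring such a pilot can be as simple as a first-try pass rate per tool. The sketch below assumes you log one record per tool and task; the `first_try_rate` helper, the record layout, and the outcomes in `PILOT` are placeholders to show the shape, not results from any real run.

```python
from collections import defaultdict


def first_try_rate(results) -> dict:
    """Share of tasks per tool that passed without revision.

    `results` is an iterable of (tool, task_id, passed_first_try) tuples.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tool, _task_id, passed in results:
        totals[tool] += 1
        passes[tool] += int(passed)
    return {tool: passes[tool] / totals[tool] for tool in totals}


# Placeholder outcomes only; replace with what your own pilot records.
PILOT = [
    ("claude-code", "task-1", True), ("codex", "task-1", False),
    ("claude-code", "task-2", False), ("codex", "task-2", True),
    ("claude-code", "task-3", True), ("codex", "task-3", True),
    ("claude-code", "task-4", False), ("codex", "task-4", True),
    ("claude-code", "task-5", True), ("codex", "task-5", False),
]

print(first_try_rate(PILOT))
```

Five tasks is a small sample, so treat the rate as a tiebreaker alongside reviewer notes rather than a verdict.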
Sources
- Claude Code overview, Anthropic
- Set up Claude Code, Claude Code Docs
- Codex CLI, OpenAI Developers
- Run long horizon tasks with Codex, OpenAI
- GPT-5.5 powers Codex, NVIDIA
- SWE-bench Verified Leaderboard, LLM Stats
- SWE-bench 2026: Claude vs GPT, Evolink
- Codex vs Claude Code benchmarks, Morphllm
- Claude Opus 4.6 1M context, Learnia/Learn-Prompting
- AGENTS.md standard
- Model Context Protocol, Codex
- Donating MCP to the Linux Foundation, Anthropic
- Sandbox Agents, OpenAI API
- Claude pricing, Anthropic
- Codex pricing, OpenAI
- Anthropic Claude Code pricing issues, Forbes
- GPT-5.3-Codex base model for Copilot, GitHub Changelog
