Anthropic shipped Claude Fable 5 on June 9, 2026, and the number on the box is real: 80.3% on SWE-bench Pro, paired with a 1M-token context window built specifically for long-running coding agents. For repo-wide refactors and multi-file work, it is the strongest model Anthropic has shipped.
The useful question is not whether the headline is true. It's where Fable 5 actually earns its keep against GPT-5.5, what it costs to run at that level, and how to prove the upgrade on your own codebase in about two weeks.
TL;DR: Claude Fable 5 posts 80.3% on SWE-bench Pro against GPT-5.5's 58.6%, and its strengths land exactly where that gap predicts: multi-file refactors, long-context retrieval, and agent runs with a stable prompt. The 90% cached-input discount is what makes its higher list price pay off. Decide with a two-week holdout run on your own repo, and add one engineering safeguard along the way: validate tool-call response bodies, because Fable 5 can return a 200 on a truncated payload.
Key takeaways:
- Fable 5 leads SWE-bench Pro at 80.3% to GPT-5.5's 58.6%, and independent reviewers confirm the direction: it's strongest on long-context and cross-file coding.
- The 1M-token window plus a 90% cached-input discount make it well suited to long-horizon agents with stable system prompts and tool schemas.
- List price is $10 / $50 per million tokens against GPT-5.5's $5 / $30. Your cache hit rate decides whether that premium disappears or stings.
- Add response-body validation to your tool calls. It's good hygiene across any model, and it covers a documented Fable 5 case where a truncated response still reports success.
- Choose with a holdout run on your own backlog, not the public leaderboard, and switch on a threshold you write down in advance.
What Claude Fable 5 is built for
Fable 5 is the public face of Anthropic's new Mythos-class tier. The restricted Mythos 5 sits behind Project Glasswing with roughly 50 vetted partners, and Fable 5 ships the same underlying capability with an added safety cage and tighter tool-use limits. The BBC's coverage describes the split as gated by safety profile rather than capability ceiling.
What that buys you in practice is a model tuned for long, stateful coding work. The 1M-token context (aggregator-reported) lets an agent hold a large slice of a repository, a long task history, and a full tool schema in working memory at the same time.
Simon Willison's June 9 review, the most-cited practitioner take so far, called the 1M window the headline capability and found Fable 5 clearly better than Opus 4.8 on long-context coding. That's the work it was designed for.
Claude Fable 5 vs GPT-5.5: the spec sheet
One naming trap first. The ChatGPT default since May 5 is GPT-5.5 Instant, a chat-latency SKU capped at a 256K context and 32K output. The model you actually want to benchmark coding against is frontier GPT-5.5, released April 23 per CNBC and Wikipedia. Benchmark against Instant and you'll measure the wrong model.
| | Claude Fable 5 | GPT-5.5 |
|---|---|---|
| Released | June 9, 2026 | April 23, 2026 |
| Context window | 1M tokens (aggregator-reported) | 1M tokens |
| Max output | 128K | 100K |
| Input / output price | $10 / $50 per MTok | $5 / $30 per MTok |
| Cached input | 90% discount | 50% discount |
| SWE-bench Pro | 80.3% (Anthropic harness) | 58.6% (both vendors) |
| Agent harness story | MCP, with an official C# SDK via Microsoft | Codex CLI, battle-tested |
| Access | Pro tier and up; free until June 22 | All tiers; OpenRouter mirror |
The one figure both vendors report identically is GPT-5.5 at 58.6% on SWE-bench Pro. Fable 5's 80.3% is Anthropic's first-party result on its own harness, which is normal for launch week. The reviews below are what tell you how much of that lead survives contact with real work.
Where Fable 5 pulls ahead
The independent signal lines up with the headline's direction. Vellum ran Fable 5 against its internal 200-issue set and found it beat GPT-5.5, with the gap concentrated in multi-file refactors and long-context retrieval.
Lushbinary's three-way comparison adds useful texture: Fable 5 takes Terminal-Bench 2.1, while GPT-5.5 wins OSWorld-Verified. So the picture is consistent rather than lopsided. Fable 5 is strongest where context length and cross-file reasoning dominate.
If your hardest work is repo-wide refactors and long agent runs, that's the lane where Fable 5 earns the upgrade.
How much does Claude Fable 5 cost per task?
List price puts Fable 5 at 2x GPT-5.5 on input and 1.67x on output. Cache behavior changes that math.
| Workload | Fable 5 | GPT-5.5 | Ratio |
|---|---|---|---|
| Single-file bug fix (~50K in, 5K out) | ~$0.75 | ~$0.40 | 1.88x |
| Multi-file feature (~300K in, 30K out) | ~$4.50 | ~$2.40 | 1.88x |
| 6-hour agent run, 90% cache hit (~10M in, 500K out) | ~$44 | ~$65* | flips |
*The GPT-5.5 figure is uncached list price (10M input at $5 plus 500K output at $30 per MTok). GPT-5.5's cache discount would lower it, but a cached rate comparable to Anthropic's 90% read discount wasn't posted at the API when we ran the numbers. Fable 5's cache-write and batch pricing also remain unpublished, so treat the agent-run row as directional.
A long-running agent with a stable system prompt and tool schema rides the 90% cached-input discount, and at high cache-hit rates it can land cheaper per task than GPT-5.5. A workload that re-reads a fresh repo on every call pays the full premium.
Euronews flagged the same dynamic on June 12: OpenAI has room to undercut, so treat Anthropic's 2x as a ceiling. The rule of thumb from the arithmetic: if you can't sustain cache hit rates above roughly 70%, Fable 5 is hard to justify at current list prices.
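The arithmetic behind the table and the 70% rule can be sketched in a few lines of Python. Treat it as a rough model under stated assumptions, not a billing calculator: cache reads are billed at list input price minus the discount, cache-write surcharges are ignored because they're unpublished, and GPT-5.5 is priced uncached to match the $65 figure above. The names `Pricing`, `task_cost`, and `break_even_hit_rate` are ours, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million uncached input tokens
    output_per_mtok: float  # USD per million output tokens
    cache_discount: float   # fraction taken off the input price for cache reads


# List prices from the spec sheet. GPT-5.5 is modeled uncached (discount 0.0)
# to match the table's $65 figure; that is an assumption, not a posted rate.
FABLE_5 = Pricing(10.0, 50.0, 0.90)
GPT_5_5_UNCACHED = Pricing(5.0, 30.0, 0.0)


def task_cost(p, input_tokens, output_tokens, cache_hit_rate=0.0):
    """USD cost of one task. Ignores cache-write surcharges (unpublished)."""
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    billable_input = fresh + cached * (1.0 - p.cache_discount)
    input_cost = billable_input * p.input_per_mtok / 1e6
    output_cost = output_tokens * p.output_per_mtok / 1e6
    return input_cost + output_cost


def break_even_hit_rate(a, b, input_tokens, output_tokens):
    """Cache hit rate at which model `a` costs the same as uncached model `b`."""
    target = task_cost(b, input_tokens, output_tokens)
    uncached = task_cost(a, input_tokens, output_tokens)
    # Cost falls linearly with hit rate; this is the saving at a 100% hit rate.
    full_saving = input_tokens * a.cache_discount * a.input_per_mtok / 1e6
    return (uncached - target) / full_saving
```

With these inputs, `break_even_hit_rate(FABLE_5, GPT_5_5_UNCACHED, 10_000_000, 500_000)` comes out near 0.67, which is where the roughly-70% rule comes from. Swap in your own token counts and measured hit rate before trusting either number.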
One safeguard worth adding to your agent
This is good agent hygiene regardless of which model you pick: validate what your tools return. A Towards AI reproduction from June 10 documented Fable 5 returning HTTP 200 with a truncated body on some HTTP-aware tool calls, then reporting success to the agent loop.
For a long-horizon agent, a green status on an incomplete result is the expensive kind of bug. The fix is a few lines: wrap tool responses with a length and shape check, and fail closed when a payload looks truncated. It costs nothing and protects you across every vendor.
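Here is a minimal sketch of that check in Python. It assumes your tool layer hands you the status code, the response headers, and the raw body bytes, and that the tool is supposed to return a JSON object. The function name, the `ToolResponseError` type, and the `required_keys` parameter are hypothetical; adapt the shape check to whatever schema your tools actually return.

```python
import json


class ToolResponseError(Exception):
    """Raised when a tool response fails validation, even on a success status."""


def validate_tool_response(status, headers, body, required_keys=()):
    """Return the parsed JSON object, or raise so the agent loop fails closed."""
    if not 200 <= status < 300:
        raise ToolResponseError(f"non-success status {status}")

    # Length check: a declared Content-Length must match the bytes received.
    # Assumes `body` is the raw, undecoded payload; skip this check if your
    # HTTP client transparently decompresses responses.
    lowered = {k.lower(): v for k, v in headers.items()}
    declared = lowered.get("content-length")
    if declared is not None and int(declared) != len(body):
        raise ToolResponseError(f"expected {declared} bytes, got {len(body)}")

    # Shape check: a body cut off mid-stream will usually fail to parse.
    try:
        parsed = json.loads(body)
    except ValueError as exc:
        raise ToolResponseError(f"body is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ToolResponseError("expected a JSON object")

    missing = [k for k in required_keys if k not in parsed]
    if missing:
        raise ToolResponseError(f"missing keys: {missing}")
    return parsed
```

The point of raising instead of returning a flag is that the agent loop cannot accidentally treat a truncated payload as success; it has to handle the exception or stop.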
How should you test Fable 5 against your own repo?
The leaderboard tells you the direction. Your own backlog tells you the answer. Here's the protocol we use, and it takes about two weeks.
- Build a holdout from your own backlog. Take 30 to 50 real issues from the last 90 days, resolved and unresolved. Exclude anything whose text or canonical fix is likely in either vendor's training data; most public GitHub issues from before mid-2025 are at risk.
- Fix the harness. Same scaffolding, same tool surface, same retry policy, same MCP server for both models. The harness is your measurement instrument; keep it identical across runs or the comparison is void.
- Measure three things. Pass@1 on your shipped test suite, wall-clock time to green, and total tokens including cache hits. Cost-adjusted pass rate drives the decision, not pass rate alone.
- Stratify. Single-line fixes, multi-file refactors, and long agent runs behave differently. Morph LLM's Codex vs Claude Code comparison from May is a useful baseline for how much harness choice alone moves agentic results.
- Pre-commit a threshold. Write it down before you see results: switch if the challenger is at least 10 points better on pass@1 at no more than 1.5x cost, on the stratum that matters most to you. Anything weaker is a wash.
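The measuring, stratifying, and threshold steps above can be sketched as a small scoring script. This is one possible shape, not a standard harness: the `Result` record, the stratum labels, and the function names are hypothetical, and the defaults simply encode the 10-point and 1.5x figures from the rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    stratum: str     # e.g. "single-line", "multi-file", "agent-run"
    passed: bool     # pass@1 on the shipped test suite
    cost_usd: float  # total spend for the attempt, cache hits included


def summarize(results, stratum):
    """Return (pass@1, mean cost per issue) for one stratum."""
    rows = [r for r in results if r.stratum == stratum]
    if not rows:
        raise ValueError(f"no results for stratum {stratum!r}")
    pass_rate = sum(r.passed for r in rows) / len(rows)
    mean_cost = sum(r.cost_usd for r in rows) / len(rows)
    return pass_rate, mean_cost


def should_switch(incumbent, challenger, stratum,
                  min_gain_points=10.0, max_cost_ratio=1.5):
    """Apply the pre-committed rule on the stratum that matters most."""
    base_pass, base_cost = summarize(incumbent, stratum)
    new_pass, new_cost = summarize(challenger, stratum)
    # Round so a gain of exactly 10 points is not lost to float error.
    gain_points = round((new_pass - base_pass) * 100.0, 6)
    cost_ratio = new_cost / base_cost
    return gain_points >= min_gain_points and cost_ratio <= max_cost_ratio
```

Fix `min_gain_points`, `max_cost_ratio`, and the stratum in version control before the first run, so the threshold can't drift once results start coming in.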
Then re-run quarterly. Both vendors ship frontier models every six to eight weeks, and a June 2026 decision is stale by Q4.
What this means for you
If you run long-horizon agents with stable prompts, heavy MCP use, or repo-wide refactors, Fable 5 is the model to evaluate first. The 1M context, the cache economics, and the multi-file edge Vellum measured all point the same way. Run your holdout before June 22, while the free access window makes it cheap, and ship the tool-call validation while you're in there.
If your workload is cost-dominated or latency-sensitive, like single-file fixes, bulk PR review, or chat-speed features, GPT-5.5 is still the pragmatic pick, and Anthropic's pricing may come down to meet you.
The 80.3% is Anthropic's first-party number, and the independent reviews so far back the shape of the claim. Your holdout run is the tiebreaker that counts, because it measures the only repository that pays your bills: yours.
Sources
- Claude Fable 5 and Claude Mythos 5 announcement (Anthropic)
- GPT-5.5 Instant (OpenAI)
- OpenAI announces GPT-5.5 (CNBC)
- Initial impressions of Claude Fable 5 (Simon Willison)
- GPT-5 benchmarks with Fable 5 addendum (Vellum)
- Claude Fable 5 fails with HTTP 200 (Towards AI)
- Fable 5 vs GPT-5.5 vs Gemini 3.1 Pro (Lushbinary)
- Claude Fable 5: benchmarks, the cage, and June 22 (FindSkill)
- What is Anthropic's Claude Mythos? (BBC)
- Is Fable 5 worth the price? (Euronews)
- Codex vs Claude Code (Morph LLM)
- Claude models in Microsoft Foundry (Microsoft Azure)
- GPT-5.5 pricing and benchmarks (OpenRouter)
