Claude Fable 5 vs GPT-5.5: what each is actually great at

Claude Fable 5 lands 80.3% on SWE-bench Pro with a 1M-token window built for agents. Here's where it beats GPT-5.5, what it costs, and how to pick for your codebase.

June 12, 2026 · 8 min read
Claude Fable 5 vs GPT-5.5: the coding benchmarks that actually matter

Anthropic shipped Claude Fable 5 on June 9, 2026, and the number on the box is real: 80.3% on SWE-bench Pro, paired with a 1M-token context window built specifically for long-running coding agents. For repo-wide refactors and multi-file work, it is the strongest model Anthropic has shipped.

The useful question is not whether the headline is true. It's where Fable 5 actually earns its keep against GPT-5.5, what it costs to run at that level, and how to prove the upgrade on your own codebase in about two weeks.

TL;DR: Claude Fable 5 posts 80.3% on SWE-bench Pro against GPT-5.5's 58.6%, and its strengths land exactly where that gap predicts: multi-file refactors, long-context retrieval, and agent runs with a stable prompt. The 90% cached-input discount is what makes its higher list price pay off. Decide with a two-week holdout run on your own repo, and add one engineering safeguard along the way: validate tool-call response bodies, because Fable 5 can return a 200 on a truncated payload.

Key takeaways:

  • Fable 5 leads SWE-bench Pro at 80.3% to GPT-5.5's 58.6%, and independent reviewers confirm the direction: it's strongest on long-context and cross-file coding.
  • The 1M-token window plus a 90% cached-input discount make it well suited to long-horizon agents with stable system prompts and tool schemas.
  • List price is $10 / $50 per million tokens against GPT-5.5's $5 / $30. Your cache hit rate decides whether that premium disappears or stings.
  • Add response-body validation to your tool calls. It's good hygiene across any model, and it covers a documented Fable 5 case where a truncated response still reports success.
  • Choose with a holdout run on your own backlog, not the public leaderboard, and switch on a threshold you write down in advance.

What Claude Fable 5 is built for

Fable 5 is the public face of Anthropic's new Mythos-class tier. The restricted Mythos 5 sits behind Project Glasswing with roughly 50 vetted partners, and Fable 5 ships the same underlying capability with an added safety cage and tighter tool-use limits. The BBC's coverage describes the split as gated by safety profile rather than capability ceiling.

What that buys you in practice is a model tuned for long, stateful coding work. The 1M-token context (aggregator-reported) lets an agent hold a large slice of a repository, a long task history, and a full tool schema in working memory at the same time.

Simon Willison's June 9 review, the most-cited practitioner take so far, called the 1M window the headline capability and found Fable 5 clearly better than Opus 4.8 on long-context coding. That's the work it was designed for.

Claude Fable 5 vs GPT-5.5: the spec sheet

One naming trap first. The ChatGPT default since May 5 is GPT-5.5 Instant, a chat-latency SKU capped at a 256K context and 32K output. The model you actually want to benchmark coding against is frontier GPT-5.5, released April 23 per CNBC and Wikipedia. Benchmark against Instant and you'll measure the wrong model.

| Spec | Claude Fable 5 | GPT-5.5 |
| --- | --- | --- |
| Released | June 9, 2026 | April 23, 2026 |
| Context window | 1M tokens (aggregator-reported) | 1M tokens |
| Max output | 128K | 100K |
| Input / output price | $10 / $50 per MTok | $5 / $30 per MTok |
| Cached input | 90% discount | 50% discount |
| SWE-bench Pro | 80.3% (Anthropic harness) | 58.6% (both vendors) |
| Agent harness story | MCP, with an official C# SDK via Microsoft | Codex CLI, battle-tested |
| Access | Pro tier and up; free until June 22 | All tiers; OpenRouter mirror |

The one figure both vendors report identically is GPT-5.5 at 58.6% on SWE-bench Pro. Fable 5's 80.3% is Anthropic's first-party result on its own harness, which is normal for launch week. The reviews below are what tell you how much of that lead survives contact with real work.

Where Fable 5 pulls ahead

The independent signal lines up with the headline's direction. Vellum ran Fable 5 against its internal 200-issue set and found it beat GPT-5.5, with the gap concentrated in multi-file refactors and long-context retrieval.

Lushbinary's three-way comparison adds useful texture: Fable 5 takes Terminal-Bench 2.1, while GPT-5.5 wins OSWorld-Verified. So the picture is consistent rather than lopsided. Fable 5 is strongest where context length and cross-file reasoning dominate.

If your hardest work is repo-wide refactors and long agent runs, that's the lane where Fable 5 earns the upgrade.

SWE-bench Pro scores at launch (June 9, 2026): Claude Fable 5 80.3%, Mythos Preview 77.8%, Opus 4.8 69.2%, GPT-5.5 58.6%, Gemini 3.1 Pro 54.2%.

How much does Claude Fable 5 cost per task?

List price puts Fable 5 at 2x GPT-5.5 on input and 1.67x on output. Cache behavior changes that math.

| Workload | Fable 5 | GPT-5.5 | Ratio |
| --- | --- | --- | --- |
| Single-file bug fix (~50K in, 5K out) | ~$0.75 | ~$0.40 | 1.88x |
| Multi-file feature (~300K in, 30K out) | ~$4.50 | ~$2.40 | 1.88x |
| 6-hour agent run, 90% cache hit (~10M in, 500K out) | ~$44 | ~$65* | ~0.68x (flips) |

*The GPT-5.5 agent-run figure is computed at uncached list price. GPT-5.5's cache discount applies, but a cached rate comparable to Anthropic's 90% read discount wasn't posted at the API when we ran the numbers. Fable 5's cache-write and batch pricing also remain unpublished, so treat the agent-run row as directional.
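
The per-task arithmetic is easy to rerun with your own token counts. The sketch below is a minimal estimator, assuming only the list prices and cached-input discounts quoted in the spec sheet; the `Pricing` class and `task_cost` function are illustrative names, not part of either vendor's SDK, and cache-write and batch pricing are ignored because they are unpublished.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million uncached input tokens
    output_per_mtok: float  # USD per million output tokens
    cache_discount: float   # fraction taken off cached input (0.9 = 90% off)


# List prices and cache discounts as quoted in this article's spec sheet.
FABLE_5 = Pricing(input_per_mtok=10.0, output_per_mtok=50.0, cache_discount=0.90)
GPT_5_5 = Pricing(input_per_mtok=5.0, output_per_mtok=30.0, cache_discount=0.50)


def task_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
              cache_hit_rate: float = 0.0) -> float:
    """Estimate the USD cost of one task, rounded to cents.

    Assumption: cached tokens are billed at the discounted input rate and
    everything else at list price. Cache-write and batch pricing are ignored.
    """
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    billable_input = uncached + cached * (1.0 - pricing.cache_discount)
    total = (billable_input * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return round(total / 1_000_000, 2)


if __name__ == "__main__":
    # Single-file bug fix: 50K in, 5K out, no caching.
    print(task_cost(FABLE_5, 50_000, 5_000))   # 0.75
    print(task_cost(GPT_5_5, 50_000, 5_000))   # 0.4
    # 6-hour agent run: 10M in, 500K out.
    print(task_cost(FABLE_5, 10_000_000, 500_000, cache_hit_rate=0.9))  # 44.0
    print(task_cost(GPT_5_5, 10_000_000, 500_000))                      # 65.0
```

Under these assumptions the agent run comes to about $44 on Fable 5 against $65 for uncached GPT-5.5. If the 50% discount from the spec sheet applied to GPT-5.5 at the same 90% hit rate, the same formula gives about $42.50, which is why the agent-run comparison is best read as directional.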

A long-running agent with a stable system prompt and tool schema rides the 90% cached-input discount, and at high cache-hit rates it can land cheaper per task than GPT-5.5. A workload that re-reads a fresh repo on every call pays the full premium.

Euronews flagged the same dynamic on June 12: OpenAI has room to undercut, which puts pressure on Anthropic's pricing, so treat today's 2x list premium as a ceiling rather than a fixed cost. The rule of thumb from the arithmetic: if you can't sustain cache hit rates above roughly 70%, Fable 5 is hard to justify at current list prices.
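
The break-even hit rate behind that rule of thumb can be solved directly. Here is a minimal sketch, assuming the cost model used in this article (cached input billed at the discounted rate, everything else at list price) and a rival cost you supply yourself; `break_even_hit_rate` is a hypothetical helper name.

```python
from typing import Optional


def break_even_hit_rate(input_mtok: float, output_mtok: float,
                        in_price: float, out_price: float,
                        cache_discount: float,
                        rival_cost: float) -> Optional[float]:
    """Cache hit rate at which this model's task cost equals rival_cost.

    Cost model (an assumption, matching the article's arithmetic):
        cost(h) = input_mtok * in_price * (1 - h * cache_discount)
                  + output_mtok * out_price
    Returns 0.0 if the model is already cheaper with no caching, and None
    if even a 100% hit rate cannot reach parity.
    """
    uncached_total = input_mtok * in_price + output_mtok * out_price
    if rival_cost >= uncached_total:
        return 0.0
    max_saving = input_mtok * in_price * cache_discount
    hit_rate = (uncached_total - rival_cost) / max_saving
    return hit_rate if hit_rate <= 1.0 else None


if __name__ == "__main__":
    # Agent run: 10M in, 0.5M out on Fable 5 list prices, against GPT-5.5
    # at its uncached list cost of $65 for the same token counts.
    print(break_even_hit_rate(10, 0.5, 10.0, 50.0, 0.90, rival_cost=65.0))
```

For the agent-run workload this lands at a hit rate of about two thirds, which is where the "roughly 70%" figure comes from. A cheaper rival cost, such as one that includes a cache discount on the GPT-5.5 side, pushes the required hit rate higher or out of reach.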

One safeguard worth adding to your agent

This is good agent hygiene regardless of which model you pick: validate what your tools return. A Towards AI reproduction from June 10 documented Fable 5 returning HTTP 200 with a truncated body on some HTTP-aware tool calls, then reporting success to the agent loop.

For a long-horizon agent, a green status on an incomplete result is the expensive kind of bug. The fix is a few lines: wrap tool responses with a length and shape check, and fail closed when a payload looks truncated. It costs nothing and protects you across every vendor.
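
A length and shape check of that kind might look like the sketch below. It assumes your tool layer hands you the status code, response headers, and raw body bytes, and that the tool returns a JSON object; `ToolResponseError`, `validate_tool_response`, and the `required_keys` parameter are hypothetical names for illustration, not part of any vendor SDK.

```python
import json
from typing import Any, Iterable, Mapping


class ToolResponseError(RuntimeError):
    """Raised when a tool response cannot be trusted. Treat it as a failed call."""


def validate_tool_response(status: int, headers: Mapping[str, str], body: bytes,
                           required_keys: Iterable[str] = ()) -> dict[str, Any]:
    """Return the parsed JSON payload, or raise if it looks truncated or malformed."""
    if not 200 <= status < 300:
        raise ToolResponseError(f"non-success status {status}")

    # Length check: if the server declared a Content-Length, the body must match.
    # Assumption: `body` is the raw, undecoded payload as received on the wire.
    declared = next((value for name, value in headers.items()
                     if name.lower() == "content-length"), None)
    if declared is not None and declared.strip().isdigit():
        if int(declared) != len(body):
            raise ToolResponseError(
                f"body is {len(body)} bytes, Content-Length says {declared}")

    # Shape check: a truncated JSON document fails to parse.
    try:
        payload = json.loads(body)
    except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
        raise ToolResponseError("body is not valid JSON") from exc

    if not isinstance(payload, dict):
        raise ToolResponseError("expected a JSON object at the top level")

    missing = [key for key in required_keys if key not in payload]
    if missing:
        raise ToolResponseError(f"missing expected keys: {missing}")
    return payload
```

Raising instead of returning a partial result is the fail-closed part: the agent loop sees an error it must handle, not a green status on an incomplete payload.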

How should you test Fable 5 against your own repo?

The leaderboard tells you the direction. Your own backlog tells you the answer. Here's the protocol we use. It typically takes two weeks and can stretch to four.

  1. Build a holdout from your own backlog. Take 30 to 50 real issues from the last 90 days, resolved and unresolved. Exclude anything whose text or canonical fix is likely in either vendor's training data; most public GitHub issues from before mid-2025 are at risk.
  2. Fix the harness. Same scaffolding, same tool surface, same retry policy, same MCP server for both models. The harness is your measurement instrument; keep it identical across runs or the comparison is void.
  3. Measure three things. Pass@1 on your shipped test suite, wall-clock time to green, and total tokens including cache hits. Cost-adjusted pass rate drives the decision, not pass rate alone.
  4. Stratify. Single-line fixes, multi-file refactors, and long agent runs behave differently. Morph LLM's Codex vs Claude Code comparison from May is a useful baseline for how much harness choice alone moves agentic results.
  5. Pre-commit a threshold. Write it down before you see results: switch if the challenger is at least 10 points better on pass@1 at no more than 1.5x cost, on the stratum that matters most to you. Anything weaker is a wash.

Then re-run quarterly. Both vendors ship frontier models every six to eight weeks, and a June 2026 decision is stale by Q4.
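
Steps 3 and 5 can be written down as code before the run starts, so the threshold can't drift after you see results. This is a minimal sketch, assuming both models ran the same issues in a stratum and that "cost" means total spend on that stratum; `StratumResult` and `should_switch` are hypothetical names, and cost per passing issue is one plausible reading of "cost-adjusted pass rate", not the only one.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class StratumResult:
    passed: int            # issues whose shipped test suite went green on the first attempt
    attempted: int         # issues in this stratum of the holdout
    total_cost_usd: float  # total spend on the stratum, cache hits included

    @property
    def pass_rate(self) -> Fraction:
        # Exact fraction, so a gain of exactly 10 points is not lost to float rounding.
        return Fraction(self.passed, self.attempted)

    @property
    def cost_per_pass(self) -> float:
        # One way to cost-adjust the pass rate: dollars per passing issue.
        return self.total_cost_usd / self.passed if self.passed else float("inf")


def should_switch(incumbent: StratumResult, challenger: StratumResult,
                  min_gain_points: int = 10, max_cost_ratio: float = 1.5) -> bool:
    """Apply the pre-committed rule: enough pass@1 gain at an acceptable cost."""
    gain_points = (challenger.pass_rate - incumbent.pass_rate) * 100
    within_budget = (challenger.total_cost_usd
                     <= max_cost_ratio * incumbent.total_cost_usd)
    return gain_points >= min_gain_points and within_budget


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    incumbent = StratumResult(passed=20, attempted=40, total_cost_usd=100.0)
    challenger = StratumResult(passed=26, attempted=40, total_cost_usd=140.0)
    print(should_switch(incumbent, challenger))  # True: +15 points at 1.4x cost
```

Run the check once per stratum and let the stratum that matters most to you decide, as step 5 describes.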

What this means for you

If you run long-horizon agents with stable prompts, heavy MCP use, or repo-wide refactors, Fable 5 is the model to evaluate first. The 1M context, the cache economics, and the multi-file edge Vellum measured all point the same way. Start your holdout before June 22, while the free access window makes it cheap, and ship the tool-call validation while you're in there.

If your workload is cost-dominated or latency-sensitive, like single-file fixes, bulk PR review, or chat-speed features, GPT-5.5 is still the pragmatic pick, and Anthropic's pricing may come down to meet you.

The 80.3% is Anthropic's first-party number, and the independent reviews so far back the shape of the claim. Your holdout run is the tiebreaker that counts, because it measures the only repository that pays your bills: yours.

Frequently asked questions

What is Claude Fable 5's SWE-bench Pro score?

Anthropic's June 9, 2026 announcement lists Claude Fable 5 at 80.3% on SWE-bench Pro, versus 77.8% for Mythos Preview, 69.2% for Opus 4.8, and 58.6% for GPT-5.5. The figure is first-party, produced on Anthropic's own harness, and has not yet been independently reproduced.

Is Claude Fable 5 better than GPT-5.5 for coding?

On the numbers published so far, yes: Fable 5 leads on SWE-bench Pro (Anthropic's first-party result) and on Terminal-Bench 2.1 in Lushbinary's comparison, and third-party testing by Vellum found its edge concentrated in multi-file refactors and long-context retrieval. GPT-5.5 wins on OSWorld-Verified and costs 50% to 60% as much per token at list price, so the answer depends on your workload.

How much does the Claude Fable 5 API cost?

Fable 5 is priced at $10 per million input tokens and $50 per million output tokens, with a 90% discount on cached input. GPT-5.5 costs $5/$30 with a 50% cache discount. Fable 5 is free to try until June 22, 2026, then requires a paid tier.

What is a Mythos-class model?

Mythos-class is Anthropic's new tier for models it judges too dangerous for unrestricted release. Mythos 5 is gated behind Project Glasswing (roughly 50 vetted partners), while Fable 5 ships the same underlying capability on the public API with additional safety constraints and tool-use restrictions.

Is SWE-bench Pro still a reliable benchmark?

It's more discriminating than SWE-bench Verified, which OpenAI publicly retired in February 2026 as saturated. But the June 2026 Fable 5 and GPT-5.5 numbers were produced on different vendor harnesses with no jointly run, contamination-controlled comparison, so cross-vendor gaps should be treated as marketing until third parties reproduce them.