In February 2026, OpenAI published "Why SWE-bench Verified no longer measures frontier coding" and walked away from the benchmark it created. Its recommendation: use SWE-bench Pro for public comparison, and build a private suite for anything that matters.
This guide shows you how to build an eval harness for LLM agents, with real numbers for cost and sample size.
TL;DR: SWE-bench Verified is contaminated, gamed, and saturated. The replacement is a two-tier strategy: SWE-bench Pro (1,865 tasks, 46% top reported score) for public claims, plus a private suite mined from your own repo's closed issues. A defensible harness needs hash-pinned tasks, three graders, sandbox isolation, and a cost ledger. A 500-task run on GPT-5.5 costs about $2,900 uncached.
Key takeaways
- OpenAI cites three failure modes for Verified: training-data contamination, leaderboard gaming, and score compression at the top.
- The strongest reported SWE-bench Pro score is around 46%, some 35 points below saturated Verified scores, so Pro still discriminates between frontier models.
- Mine eval tasks from issues closed with merged PRs in the 60 to 90 days before each model snapshot, then deduplicate against public benchmarks.
- Combine test-based, diff-based, and LLM-judge grading. Tests gate the hard pass; the judge handles open-ended tasks.
- Harness noise can move pass rates 5 to 10 points, per Anthropic's infrastructure-noise research. Measure it before you trust any model delta.
Why did OpenAI drop SWE-bench Verified?
OpenAI's February 2026 post makes a three-part case: contamination, gaming, and ceiling effects. The public Verified task statements have diffused so widely into training corpora that the metric can no longer separate frontier models from each other.
The irony is sharp. When OpenAI introduced SWE-bench Verified in August 2024, it was fixing a broken benchmark: the original set included "some SWE-bench tasks which may be hard or impossible to solve," which left it "systematically underestimating models' autonomous software engineering capabilities."
Eighteen months later, the fix had worked too well. Scores compressed above 65% for several frontier models, per BenchLM's 2026 tracking, into a band where deltas stopped meaning anything.
Gaming compounds the problem. Every lab optimizes against the same 500 issues, producing benchmark-shaped agents that ace the harness's grammar and stumble on equivalent real-world work outside it.
SWE-bench Verified vs SWE-bench Pro: what actually changed?
SWE-bench Pro, maintained by Scale AI, is bigger, harder, and partially held out. The arXiv preprint (2509.16941) describes 1,865 issues from 41 professional repositories, and the public split lives on Scale's leaderboard.
| Property | SWE-bench Verified | SWE-bench Pro |
|---|---|---|
| Maintainer | OpenAI | Scale AI |
| Tasks | 500 | 1,865 |
| Repos | 12 public Python repos | 41 enterprise-grade repos |
| Languages | Python only | Python, Go, plus TypeScript and Java additions |
| Contamination control | Public, widely crawled | Held-out private split drives the leaderboard |
| Top score (2026) | 65%+ for several models | ~46% (reported by Morph LLM) |
One honesty note: that 46% figure comes from Morph LLM's third-party leaderboard analysis, a blog rather than a first-party Scale publication. Treat it as reported rather than established. The qualitative signal holds either way: models that cluster within a few points on Verified spread out meaningfully on Pro.
How do you build an eval harness for LLM agents?
Start with task sourcing, because everything downstream inherits its quality. Pull issues from your target repo that were closed with a merged PR, in a 60 to 90 day window before each model's training snapshot.
Filter for bounded work: reject anything whose closing PR touched more than about 5 files or 1,000 lines. Then deduplicate against both public SWE-bench task lists, Verified and the public split of Pro.
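A minimal sketch of that filter, assuming you have already pulled closed issues and their merged PR stats into memory. The `ClosedIssue` record, its field names, and the `(repo, issue_number)` dedup key are hypothetical choices for illustration; only the 5-file and 1,000-line thresholds come from the text above.

```python
from dataclasses import dataclass

MAX_FILES = 5       # "about 5 files", from the text
MAX_LINES = 1_000   # "1,000 lines", from the text


@dataclass(frozen=True)
class ClosedIssue:
    """Hypothetical record for an issue closed by a merged PR."""
    repo: str
    issue_number: int
    pr_number: int
    files_changed: int
    lines_changed: int  # additions plus deletions in the closing PR


def is_bounded(issue: ClosedIssue) -> bool:
    """True when the closing PR is small enough to count as bounded work."""
    return issue.files_changed <= MAX_FILES and issue.lines_changed <= MAX_LINES


def select_tasks(issues, public_keys):
    """Keep bounded issues whose (repo, issue_number) is not in a public benchmark.

    `public_keys` is an assumed set of (repo, issue_number) pairs built from
    the public SWE-bench task lists.
    """
    return [
        issue for issue in issues
        if is_bounded(issue) and (issue.repo, issue.issue_number) not in public_keys
    ]
```

The thresholds are constants on purpose: tightening them later changes the task pool, and that change should show up in version control.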
One pitfall worth flagging: issues often describe a different fix than the PR delivered. Your golden answer is the merged PR's diff, with its tests, rather than the issue's prose.
Store each task as (repo, issue_number, pr_number, base_sha, head_sha, task_description, golden_diff) and content-hash it. The hash is what lets you prove an old pass and a new pass refer to the same task.
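One way to content-hash that record is to serialize it to canonical JSON and take a SHA-256 digest. The field names below follow the tuple above; the choice of SHA-256 and sorted-key JSON is an assumption of this sketch, not something the approach mandates.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Task:
    repo: str
    issue_number: int
    pr_number: int
    base_sha: str
    head_sha: str
    task_description: str
    golden_diff: str


def task_hash(task: Task) -> str:
    """SHA-256 over a canonical JSON encoding of the task.

    Sorted keys and fixed separators keep the bytes stable across runs,
    so the same task always produces the same digest.
    """
    canonical = json.dumps(
        asdict(task), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Store the digest next to each JSONL entry and key every result on it. Any edit to the description or the golden diff then produces a new hash, which is exactly the signal you want.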
What grading strategy should you use?
Use three graders and require agreement only where it counts. The OpenAI Cookbook's evals guide covers the canonical test-based approach.
- Test-based (the gate). Replay the maintainer's test suite. Hard pass means every fail-to-pass test goes green and no pass-to-pass test breaks.
- Diff-based (the flag). Similarity against the golden diff catches near-misses worth a human look. It's brittle alone, since a functionally equivalent but syntactically different fix scores zero.
- LLM-as-judge (the fallback). A cheap model like Claude Haiku 4.5 grades open-ended tasks against a rubric, at roughly $0.25 per task.
Track soft passes separately. A patch that fixes the bug but breaks an unrelated test is a classic over-modification signal, and it's invisible in a single headline number.
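The gate and the flag can be sketched as two small functions. The `SuiteOutcome` shape, the verdict labels, and the use of `difflib` for diff similarity are assumptions of this sketch; any line-level similarity measure would fill the same role.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteOutcome:
    """Hypothetical test results: test name -> passed after the agent's patch."""
    fail_to_pass: dict[str, bool]  # tests the golden PR is meant to fix
    pass_to_pass: dict[str, bool]  # tests that were green before the patch


def grade_tests(outcome: SuiteOutcome) -> str:
    """Return 'hard_pass', 'soft_pass', or 'fail'.

    An empty fail_to_pass set is graded as a fail, since there is
    nothing to prove the bug was fixed.
    """
    fixed = bool(outcome.fail_to_pass) and all(outcome.fail_to_pass.values())
    intact = all(outcome.pass_to_pass.values())
    if fixed and intact:
        return "hard_pass"
    if fixed:
        return "soft_pass"  # bug fixed, but something unrelated broke
    return "fail"


def diff_similarity(candidate_diff: str, golden_diff: str) -> float:
    """Line-level similarity in [0, 1]; used as a flag for review, never as a gate."""
    return difflib.SequenceMatcher(
        None, candidate_diff.splitlines(), golden_diff.splitlines()
    ).ratio()
```

Reporting `soft_pass` as its own bucket keeps the over-modification signal visible instead of folding it into the failure count.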
How do you isolate and track runs?
Every task runs in a fresh sandbox. For SWE-bench-shaped Python tasks, Modal (about $0.05 per task at 2 vCPUs for 10 minutes) and E2B (around $0.00003 per second) are the practical defaults. Docker on your own infrastructure wins when tasks need real network access or a database.
For regression tracking, three primitives are non-negotiable: hash-pinned tasks, pinned model strings with snapshot dates (gpt-5.5-2026-04-23, claude-fable-5-2026-06-09, both confirmed against OpenAI's and Anthropic's release pages), and a nightly CI run that alerts on regression.
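The nightly check itself can be small. This sketch assumes each run is stored as a mapping from task hash to pass/fail; the function names and the zero-tolerance default are illustrative, not a prescribed policy.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Task hashes that passed in the previous run and fail in the current one.

    Tasks missing from the current run are ignored here; a real harness
    would probably want to report those separately.
    """
    return sorted(
        task_hash
        for task_hash, passed in previous.items()
        if passed and current.get(task_hash) is False
    )


def should_alert(previous: dict[str, bool], current: dict[str, bool],
                 max_regressions: int = 0) -> bool:
    """True when more tasks flipped from pass to fail than the tolerance allows."""
    return len(find_regressions(previous, current)) > max_regressions
```

Because the keys are content hashes, a task whose description or golden diff changed simply shows up as a new key and never masquerades as a regression.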
The whole harness fits in one small repo: a task pool of hash-pinned JSONL entries, three grader modules, sandbox runners, a cost ledger, and a stats module computing Wilson confidence intervals on the pass rate.
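For the stats module, the Wilson score interval is a standard formula and fits in a few lines. The function name and the default z = 1.96 (a 95% interval) are choices made for this sketch.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of `passes` out of `n` tasks."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

At 35 passes out of 50, the interval runs from roughly 56% to 81%, which is a useful reminder of how wide the error bars are on a small suite.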
How much does a private eval suite cost to run?
Token spend dominates everything else. At GPT-5.5's $5/M input and $30/M output pricing, with roughly 500K tokens per SWE-bench-shaped task, a 500-task run with a Haiku 4.5 judge and no input caching comes to about $2,900, or roughly $5.80 per task. Smaller suites scale linearly: about $290 for 50 tasks and $580 for 100.
Input caching cuts those totals to roughly $165, $330, and $1,650 for 50, 100, and 500 tasks. Switching the main model to Claude Haiku 4.5, priced at $1/$5 per million tokens, drops them another 5x. Sandbox and judge costs are rounding errors by comparison.
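The arithmetic behind a per-task cost can be sketched as a tiny ledger function. The prices, judge cost, and sandbox cost are the figures quoted in this guide; the 380K input / 120K output split is my assumption, picked because it reproduces the roughly $2,900 figure for a 500-task run. The guide itself only states the 500K total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


GPT_5_5 = Pricing(input_per_million=5.0, output_per_million=30.0)


def task_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
              judge_cost: float = 0.25, sandbox_cost: float = 0.05) -> float:
    """Uncached cost of one task: model tokens plus judge and sandbox."""
    model = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000
    return model + judge_cost + sandbox_cost


def suite_cost(n_tasks: int, per_task: float) -> float:
    """Total for a run, assuming every task costs about the same."""
    return n_tasks * per_task


# Assumed split of the ~500K tokens per task (hypothetical, see above).
per_task = task_cost(GPT_5_5, input_tokens=380_000, output_tokens=120_000)
```

Logging the real input and output token counts per task replaces the assumed split with measured values, which is the whole point of keeping a ledger.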
A useful sanity check: if you're spending over $5 per task on a suite of fewer than 50 tasks, you're over-investing relative to the variance the eval can resolve.
How many tasks make a result statistically meaningful?
Fewer than most teams assume, for trends. Far more than most teams assume, for ship decisions. Detecting a 5-point pass-rate improvement (70% to 75%) with a two-proportion z-test at 80% power requires about 1,237 tasks per group.
The commonly quoted "~619 per group" figure comes from a simplified formula that drops a variance term and understates the requirement by half.
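To check figures like these yourself, here is a sketch of the standard normal-approximation sample size for comparing two proportions. It assumes a two-sided test at α = 0.05, which the text does not state explicitly, and uses unpooled variance. With those choices the result for 70% vs 75% lands near 1,250 per group, within about 1% of the figure quoted above; the exact value shifts slightly with the variance convention and rounding. The second function shows one plausible reading of the "dropped variance term" shortcut, not a reconstruction of any specific source.

```python
import math
from statistics import NormalDist


def tasks_per_group(p1: float, p2: float, alpha: float = 0.05,
                    power: float = 0.80) -> int:
    """Tasks needed in each group to detect p1 vs p2 (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # one term per group
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


def understated_tasks_per_group(p1: float, p2: float, alpha: float = 0.05,
                                power: float = 0.80) -> int:
    """Same formula with a single variance term, which roughly halves the answer."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    variance = p_bar * (1 - p_bar)  # only one group's worth of variance
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)
```

Calling `tasks_per_group(0.70, 0.75)` and `understated_tasks_per_group(0.70, 0.75)` side by side makes the factor-of-two gap concrete.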
The practical rule of thumb, popularized in a talk on eval maturity by Braintrust's Phil Hetzel: about 30 tasks for trend detection, 100 for "this is a real change," and over 1,000 for a ship decision.
And sample size has a floor beneath it. Anthropic's infrastructure-noise post found that flaky tests, slow boots, and network timeouts can move pass rates by 5 to 10 percentage points on their own.
Harness noise sets the resolution limit of your eval; sample size only helps once that floor is measured. Re-run a 10% subset weekly on a frozen model and subtract the observed drift before reporting any model delta.
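One conservative way to read "subtract the observed drift" is to treat the spread of the frozen-model reruns as a noise floor and shrink any model delta toward zero by that amount. The function names, the seeded sampling, and the max-minus-min definition of the floor are all assumptions of this sketch; a standard deviation across reruns would be a reasonable alternative.

```python
import math
import random


def pick_noise_subset(task_hashes, fraction: float = 0.10, seed: int = 0):
    """Deterministically sample a fixed fraction of tasks for the weekly rerun."""
    ordered = sorted(task_hashes)
    if not ordered:
        return []
    k = max(1, round(len(ordered) * fraction))
    return random.Random(seed).sample(ordered, k)


def noise_floor(frozen_pass_rates) -> float:
    """Spread of pass rates across reruns of the same frozen model."""
    rates = list(frozen_pass_rates)
    if len(rates) < 2:
        raise ValueError("need at least two reruns to measure drift")
    return max(rates) - min(rates)


def adjusted_delta(baseline_rate: float, candidate_rate: float, floor: float) -> float:
    """Model delta with the noise floor removed; zero if the delta is within noise."""
    raw = candidate_rate - baseline_rate
    shrunk = max(0.0, abs(raw) - floor)
    return math.copysign(shrunk, raw)
```

With a measured floor of 5 points, an 8-point raw gain reports as 3 points, and a 3-point gain reports as zero, which matches the idea that the floor sets the resolution limit.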
Braintrust vs LangSmith vs Phoenix vs Langfuse: which platform fits?
For a private coding eval, the deciding feature is contamination control: keeping your tasks out of any provider's training data.
| Platform | Cheapest paid tier | Model | Contamination story |
|---|---|---|---|
| Braintrust | $249/mo Pro | SaaS, on-prem at Enterprise | Dataset-as-code; canonical tasks stay in your repo, only results stream out |
| LangSmith | $39/seat/mo Plus | SaaS with hybrid deployment | Data plane in your VPC, metadata only to LangChain |
| Arize Phoenix | $50/mo AX Pro | ELv2 open source | Self-host everything; you own storage and retention |
| Langfuse | $29/mo Core | OSS plus managed cloud | Self-host by default, dataset and prompt versioning |
Braintrust is the strongest pure evaluation product, and its February 2026 $80M Series B signals consolidation. But weigh the May 2026 incident in which Braintrust told every customer to rotate sensitive keys after a breach.
Contamination control is a security topic as much as a product feature. If your eval tasks would leak competitive information, the self-hosted options deserve a hard look.
What this means for you
Treat the private eval as your release gate and the public benchmark as your market claim. That's the synthesis the practitioner community has converged on, and even the skeptics agree on the split. Nathan Lambert's "evals are marketing" essay argues that without fair comparison "the numbers are marketing, not science," and METR's January 2026 limitations note reports historical error bars of roughly 2x in each direction.
Both are arguments for keeping a public number alongside your private one.
The vendor announcements this month make the same point from the other side. Cursor's CursorBench (87 internal tasks, Claude Fable 5 at 72.9%) and Cognition's FrontierCode claims for SWE 1.6 are both unaudited internal numbers. Useful as signals, but you should reproduce before you rely on either.
This week's concrete plan: mine 30 to 50 tasks from your repo's merged PRs, wire up test-based grading in Modal or Docker, hash-pin everything, and run your current model as the baseline. That's enough for trend detection at a cost of a few hundred dollars.
Grow toward 100+ tasks and a weekly noise measurement before you let the suite gate a model upgrade.
Sources
- Why SWE-bench Verified no longer measures frontier coding (OpenAI)
- Introducing SWE-bench Verified (OpenAI, August 2024)
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Tasks? (arXiv)
- SWE-Bench Pro public dataset (Scale AI)
- SWE-Bench Pro Leaderboard 2026 (Morph LLM)
- Quantifying infrastructure noise in agentic coding evals (Anthropic)
- Getting started with OpenAI Evals (OpenAI Cookbook)
- OpenAI API pricing
- Claude Fable 5 and Claude Mythos 5 (Anthropic)
- Braintrust plans and limits
- LangSmith pricing and hybrid deployment docs
- Langfuse pricing
- Braintrust breach disclosure (TechCrunch, May 2026)
- Big Tech's LLM evals are just marketing (Nathan Lambert, Interconnects)
- Clarifying limitations of time horizon (METR)
- Introducing SWE 1.6 (Cognition)
