SWE-bench is dead: build your own LLM eval harness in 2026

OpenAI retired SWE-bench Verified in February 2026. Here is the step-by-step playbook for a private eval suite you can ship this week.

June 12, 2026 · 10 min read

In February 2026, OpenAI published "Why SWE-bench Verified no longer measures frontier coding" and walked away from the benchmark it created. Its recommendation: use SWE-bench Pro for public comparison, and build a private suite for anything that matters.

This guide shows you how to build an eval harness for LLM agents, with real numbers for cost and sample size.

TL;DR: SWE-bench Verified is contaminated, gamed, and saturated. The replacement is a two-tier strategy: SWE-bench Pro (1,865 tasks, 46% top reported score) for public claims, plus a private suite mined from your own repo's closed issues. A defensible harness needs hash-pinned tasks, three graders, sandbox isolation, and a cost ledger. A 500-task run on GPT-5.5 costs about $2,900 uncached.

Key takeaways

  • OpenAI cites three failure modes for Verified: training-data contamination, leaderboard gaming, and score compression at the top.
  • The strongest reported SWE-bench Pro score is around 46%, well below the 65%+ band where Verified scores have saturated, so Pro still discriminates between frontier models.
  • Mine eval tasks from issues closed with merged PRs in the 60 to 90 days before each model snapshot, then deduplicate against public benchmarks.
  • Combine test-based, diff-based, and LLM-judge grading. Tests gate the hard pass; the judge handles open-ended tasks.
  • Harness noise can move pass rates 5 to 10 points, per Anthropic's infrastructure-noise research. Measure it before you trust any model delta.

Why did OpenAI drop SWE-bench Verified?

OpenAI's February 2026 post makes a three-part case: contamination, gaming, and ceiling effects. The public Verified task statements have diffused so widely into training corpora that the metric can no longer separate frontier models from each other.

The irony is sharp. When OpenAI introduced SWE-bench Verified in August 2024, it was fixing a broken benchmark: the original set included "some SWE-bench tasks which may be hard or impossible to solve," which left it "systematically underestimating models' autonomous software engineering capabilities."

Eighteen months later the fix worked too well. Scores compressed above 65% for several frontier models, per BenchLM's 2026 tracking, into a band where deltas stopped meaning anything.

Gaming compounds the problem. Every lab optimizes against the same 500 issues, producing benchmark-shaped agents that ace the harness's grammar and stumble on equivalent real-world work outside it.

SWE-bench Verified vs SWE-bench Pro: what actually changed?

SWE-bench Pro, maintained by Scale AI, is bigger, harder, and partially held out. The arXiv preprint (2509.16941) describes 1,865 issues from 41 professional repositories, and the public split lives on Scale's leaderboard.

Property | SWE-bench Verified | SWE-bench Pro
Maintainer | OpenAI | Scale AI
Tasks | 500 | 1,865
Repos | 12 public Python repos | 41 enterprise-grade repos
Languages | Python only | Python, Go, plus TypeScript and Java additions
Contamination control | Public, widely crawled | Held-out private split drives the leaderboard
Top score (2026) | 65%+ for several models | ~46% (reported by Morph LLM)

One honesty note: that 46% figure comes from Morph LLM's third-party leaderboard analysis, a blog rather than a first-party Scale publication. Treat it as reported rather than established. The qualitative signal holds either way: models that cluster within a few points on Verified spread out meaningfully on Pro.

How do you build an eval harness for LLM agents?

Start with task sourcing, because everything downstream inherits its quality. Pull issues from your target repo that were closed with a merged PR, in a 60 to 90 day window before each model's training snapshot.

Filter for bounded work: reject anything whose closing PR touched more than about 5 files or 1,000 lines. Then deduplicate against both public SWE-bench task lists.
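A minimal sketch of that filter, assuming each candidate issue has already been fetched into a plain record. The `Candidate` fields and the `public_ids` set are hypothetical names used for illustration, not part of any SWE-bench tooling; the 5-file and 1,000-line thresholds are the ones suggested above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One closed issue plus size stats from its merged PR (hypothetical schema)."""
    repo: str
    issue_number: int
    files_changed: int
    lines_changed: int  # additions + deletions in the closing PR


def is_bounded(c: Candidate, max_files: int = 5, max_lines: int = 1000) -> bool:
    """Keep only tasks whose closing PR stayed small."""
    return c.files_changed <= max_files and c.lines_changed <= max_lines


def select_tasks(candidates, public_ids):
    """Drop oversized PRs and anything already present in a public benchmark.

    public_ids is a set of (repo, issue_number) pairs gathered from the
    public SWE-bench task lists; how that set is built is left open here.
    """
    return [
        c for c in candidates
        if is_bounded(c) and (c.repo, c.issue_number) not in public_ids
    ]
```

Matching on `(repo, issue_number)` is the simplest possible deduplication key; a stricter pass might also compare issue text.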

One pitfall worth flagging: issues often describe a different fix than the PR delivered. Your golden answer is the merged PR's diff, with its tests, rather than the issue's prose.

Store each task as (repo, issue_number, pr_number, base_sha, head_sha, task_description, golden_diff) and content-hash it. The hash is what lets you prove an old pass and a new pass refer to the same task.
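One way to sketch the task record and its content hash is shown here. The field names come from the tuple above; the choice of SHA-256 over canonical JSON and the `task_hash` key are assumptions for illustration, not a prescribed format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalTask:
    repo: str
    issue_number: int
    pr_number: int
    base_sha: str
    head_sha: str
    task_description: str
    golden_diff: str

    def content_hash(self) -> str:
        """SHA-256 over a canonical JSON encoding of every field."""
        # Sorted keys and fixed separators keep the encoding byte-stable.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_jsonl_line(task: EvalTask) -> str:
    """Serialize a task together with its hash, one JSON object per line."""
    record = asdict(task)
    record["task_hash"] = task.content_hash()  # assumed key name
    return json.dumps(record, sort_keys=True)
```

Because every field feeds the hash, editing a task description or swapping a golden diff produces a new hash, so results recorded against the old hash cannot be silently compared with the new task.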

What grading strategy should you use?

Use three graders and require agreement only where it counts. The OpenAI Cookbook's evals guide covers the canonical test-based approach.

  1. Test-based (the gate). Replay the maintainer's test suite. Hard pass means every fail-to-pass test goes green and no pass-to-pass test breaks.
  2. Diff-based (the flag). Similarity against the golden diff catches near-misses worth a human look. It's brittle alone, since a functionally equivalent but syntactically different fix scores zero.
  3. LLM-as-judge (the fallback). A cheap model like Claude Haiku 4.5 grades open-ended tasks against a rubric, at roughly $0.25 per task.

Track soft passes separately. A patch that fixes the bug but breaks an unrelated test is a classic over-modification signal, and it's invisible in a single headline number.
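The test gate, the soft-pass bucket, and the diff flag can be sketched as follows. This assumes test results have already been collected as dicts mapping test id to pass/fail; the `Verdict` names, the line-level `difflib` similarity, and the 0.8 review threshold are illustrative assumptions rather than a standard. The LLM judge is left out because it depends on a provider API.

```python
import difflib
from enum import Enum


class Verdict(Enum):
    HARD_PASS = "hard_pass"
    SOFT_PASS = "soft_pass"  # bug fixed, but a previously passing test broke
    FAIL = "fail"


def grade_tests(fail_to_pass: dict, pass_to_pass: dict) -> Verdict:
    """Gate on test outcomes; each dict maps test id -> True if it passed."""
    # Require at least one fail-to-pass test so an empty set cannot pass.
    fixed = bool(fail_to_pass) and all(fail_to_pass.values())
    intact = all(pass_to_pass.values())
    if fixed and intact:
        return Verdict.HARD_PASS
    if fixed:
        return Verdict.SOFT_PASS
    return Verdict.FAIL


def diff_similarity(candidate_diff: str, golden_diff: str) -> float:
    """Line-level similarity in [0, 1] between two diffs."""
    matcher = difflib.SequenceMatcher(
        a=candidate_diff.splitlines(),
        b=golden_diff.splitlines(),
        autojunk=False,
    )
    return matcher.ratio()


def needs_human_look(verdict: Verdict, similarity: float, threshold: float = 0.8) -> bool:
    """Flag near-misses: not a hard pass, yet close to the golden diff."""
    return verdict is not Verdict.HARD_PASS and similarity >= threshold
```

Keeping `SOFT_PASS` as its own verdict means over-modification shows up as a separate count instead of being folded into failures.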

How do you isolate and track runs?

Every task runs in a fresh sandbox. For SWE-bench-shaped Python tasks, Modal (about $0.05 per task at 2 vCPUs for 10 minutes) and E2B (around $0.00003 per second) are the practical defaults. Docker on your own infrastructure wins when tasks need real network access or a database.

For regression tracking, three primitives are non-negotiable: hash-pinned tasks, pinned model strings with snapshot dates (gpt-5.5-2026-04-23, claude-fable-5-2026-06-09, both confirmed against OpenAI's and Anthropic's release pages), and a nightly CI run that alerts on regression.

The whole harness fits in one small repo: a task pool of hash-pinned JSONL entries, three grader modules, sandbox runners, a cost ledger, and a stats module computing Wilson confidence intervals on the pass rate.
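For the stats module, a Wilson score interval on the pass rate can be computed with the standard library alone. This is a minimal sketch; the function name and the default z = 1.96 (a two-sided 95% interval) are choices made here.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial pass rate.

    Returns (low, high) as fractions in [0, 1].
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")
    p = passes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to guard against tiny floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

On a 50-task suite with 35 passes, the interval runs from roughly 56% to 81%, which is why small suites are suited to trends rather than close comparisons.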

How much does a private eval suite cost to run?

Token spend dominates everything else. At GPT-5.5's $5/M input and $30/M output pricing, with roughly 500K tokens per SWE-bench-shaped task, here are worked totals with a Haiku 4.5 judge and no input caching:

Private eval run cost by suite size (GPT-5.5, uncached)
Suite size | Cost (USD)
50 tasks | 290
100 tasks | 580
500 tasks | 2,900

Input caching cuts those totals to roughly $165, $330, and $1,650. Switching the main model to Claude Haiku 4.5, priced at $1/$5 per million tokens, drops them another 5x. Sandbox and judge costs are rounding errors by comparison.
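The uncached totals can be sanity-checked with a small ledger function. The prices are the GPT-5.5 figures quoted above; the 380K input / 120K output split is an assumption, chosen because it reproduces the $5.80 per task implied by the totals, since only the combined figure of roughly 500K tokens is given. The $0.25 judge and $0.05 sandbox costs are the per-task estimates from earlier sections. Caching is not modeled.

```python
GPT_5_5_INPUT_USD_PER_MTOK = 5.0
GPT_5_5_OUTPUT_USD_PER_MTOK = 30.0


def task_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price: float = GPT_5_5_INPUT_USD_PER_MTOK,
    output_price: float = GPT_5_5_OUTPUT_USD_PER_MTOK,
    judge_usd: float = 0.25,
    sandbox_usd: float = 0.05,
) -> float:
    """Cost of one task: token spend plus flat judge and sandbox fees."""
    token_usd = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return token_usd + judge_usd + sandbox_usd


def run_cost_usd(
    n_tasks: int,
    input_tokens: int = 380_000,   # assumed split, see lead-in
    output_tokens: int = 120_000,  # assumed split, see lead-in
) -> float:
    """Total for a run where every task uses the same token budget."""
    return n_tasks * task_cost_usd(input_tokens, output_tokens)
```

With that split, `run_cost_usd(50)`, `run_cost_usd(100)`, and `run_cost_usd(500)` come to about $290, $580, and $2,900. Output tokens dominate at these prices, so a different split moves the totals noticeably.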

A useful sanity check: if you're spending over $5 per task on a suite of fewer than 50 tasks, you're over-investing relative to the variance the eval can resolve.

How many tasks make a result statistically meaningful?

Fewer than most teams assume, for trends. Far more than most teams assume, for ship decisions. Detecting a 5-point pass-rate improvement (70% to 75%) with a two-proportion z-test at 80% power requires about 1,237 tasks per group.

The commonly quoted "~619 per group" figure comes from a simplified formula that drops a variance term and understates the requirement by half.
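The per-group requirement can be reproduced with the textbook normal-approximation formula for comparing two proportions. This sketch uses the unpooled variance form and a two-sided alpha of 0.05, both assumptions, since the exact variant behind the figure above is not stated; other variants shift the answer by a few tasks.

```python
import math
from statistics import NormalDist


def tasks_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per group for a two-sided two-proportion z-test."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Each group contributes its own binomial variance term.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)
```

For 70% versus 75%, this lands at roughly 1,250 tasks per group, within about 1% of the figure above. Keeping only one group's variance term roughly halves the result, which is how estimates in the low 600s arise.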

The practical rule of thumb, popularized in a talk on eval maturity by Braintrust's Phil Hetzel: about 30 tasks for trend detection, 100 for "this is a real change," and over 1,000 for a ship decision.

And sample size has a floor beneath it. Anthropic's infrastructure-noise post found that flaky tests, slow boots, and network timeouts can move pass rates by 5 to 10 percentage points on their own.

Harness noise sets the resolution limit of your eval; sample size only helps once that floor is measured. Re-run a 10% subset weekly on a frozen model and subtract the observed drift before reporting any model delta.
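A minimal sketch of that drift correction, assuming the frozen model's pass rate on the subset is recorded each week. The function names are hypothetical, and treating the max-minus-min spread as the noise band is a simplification chosen here, not a method taken from Anthropic's post.

```python
def frozen_drift(baseline_rate: float, current_rate: float) -> float:
    """Change in the frozen model's pass rate, attributable to the harness."""
    return current_rate - baseline_rate


def adjusted_delta(old_model_rate: float, new_model_rate: float, drift: float) -> float:
    """Model delta with the harness drift over the same period subtracted."""
    return (new_model_rate - old_model_rate) - drift


def noise_band(frozen_rates) -> float:
    """Spread of the frozen model's weekly pass rates (max minus min)."""
    rates = list(frozen_rates)
    if not rates:
        raise ValueError("need at least one frozen run")
    return max(rates) - min(rates)
```

If the frozen model moved from 68% to 71% while the candidate model scored 76% against an earlier 70%, the raw 6-point gain shrinks to about 3 points after correction. A delta smaller than `noise_band` over recent weeks is hard to distinguish from harness noise.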

Braintrust vs LangSmith vs Phoenix vs Langfuse: which platform fits?

For a private coding eval, the deciding feature is contamination control: keeping your tasks out of any provider's training data.

Platform | Cheapest paid tier | Model | Contamination story
Braintrust | $249/mo Pro | SaaS, on-prem at Enterprise | Dataset-as-code; canonical tasks stay in your repo, only results stream out
LangSmith | $39/seat/mo Plus | SaaS with hybrid deployment | Data plane in your VPC, metadata only to LangChain
Arize Phoenix | $50/mo AX Pro | ELv2 open source | Self-host everything; you own storage and retention
Langfuse | $29/mo Core | OSS plus managed cloud | Self-host by default, dataset and prompt versioning

Braintrust is the strongest pure evaluation product, and its February 2026 $80M Series B signals consolidation. But weigh the May 2026 incident in which Braintrust told every customer to rotate sensitive keys after a breach.

Contamination control is a security topic as much as a product feature. If your eval tasks would leak competitive information, the self-hosted options deserve a hard look.

What this means for you

Treat the private eval as your release gate and the public benchmark as your market claim. That's the synthesis the practitioner community has converged on, and even the skeptics agree on the split: Nathan Lambert's "evals are marketing" essay argues that without fair comparison "the numbers are marketing, not science," and METR's January 2026 limitations note reports historical error bars of roughly 2x in each direction.

Both are arguments for keeping a public number alongside your private one.

The vendor announcements this month make the same point from the other side. Cursor's CursorBench (87 internal tasks, Claude Fable 5 at 72.9%) and Cognition's FrontierCode claims for SWE 1.6 are both unaudited internal numbers. Both are useful as signals, but reproduce the results before you rely on either.

This week's concrete plan: mine 30 to 50 tasks from your repo's merged PRs, wire up test-based grading in Modal or Docker, hash-pin everything, and run your current model as the baseline. That's enough for trend detection at a cost of a few hundred dollars.

Grow toward 100+ tasks and a weekly noise measurement before you let the suite gate a model upgrade.

Frequently asked questions

Why did OpenAI drop SWE-bench Verified?

In a February 2026 post, OpenAI said public Verified tasks had diffused into training corpora, labs were optimizing against the same 500 issues, and top scores had compressed into a band too narrow to separate frontier models. It now recommends SWE-bench Pro for cross-lab comparison and private suites for high-stakes decisions.

What is the difference between SWE-bench Verified and SWE-bench Pro?

Verified is 500 human-checked Python issues from 12 public repos, introduced by OpenAI in August 2024. Pro, maintained by Scale AI, has 1,865 issues from 41 enterprise-grade repos across Python and Go, with a held-out private split that resists contamination. Top reported scores sit around 46% on Pro versus 65%+ on Verified.

How many tasks does a private eval suite need?

Roughly 30 tasks for trend detection, 100 for confirming a real change, and 1,000+ for a ship decision. A statistically rigorous 5-point pass-rate comparison at 80% power needs about 1,237 tasks per group, so most teams should report confidence intervals instead of chasing significance.

How much does it cost to run a private coding eval?

Token spend dominates. A 500-task run on GPT-5.5 costs roughly $2,900 without input caching and about $1,650 with it. Sandbox compute (Modal or E2B) adds around $25 and an LLM judge about $125. Switching to Claude Haiku 4.5 cuts totals by roughly 5x.

Should I use Braintrust, LangSmith, Phoenix, or Langfuse?

Braintrust is the strongest pure evaluation platform with dataset-as-code; LangSmith's hybrid deployment keeps eval data in your VPC; Arize Phoenix and Langfuse are open source, so you own contamination control by self-hosting. Pick based on whether managed convenience or data custody matters more.