On June 16, 2026, Artificial Analysis shipped Intelligence Index v4.1 and quietly dropped MMLU-Pro, AIME 2025, and LiveCodeBench from the scoring mix. In their place: multi-step agentic workloads like GDPval-AA v2, 𝜏³-Banking, Terminal-Bench v2.1, and SciCode.
The stated reason was saturation: the top of the field had compressed into a statistical tie, and knowledge-recall scores in the 90s had stopped predicting enterprise performance.
The same week, a 2026 MMLU contamination re-evaluation landed the second blow: stripping leaked items from the test set dropped one model's ranking by 17 points and reordered the leaderboard.
So if you want to build a custom LLM eval in 2026, the off-the-shelf leaderboard is no longer your starting point. Public benchmarks now tell you the field moved. Only a private harness tells you whether your model improved on your task.
TL;DR
A custom LLM eval harness is a versioned, contamination-aware test suite that scores your model on your actual workload, gated in CI like unit tests. Public benchmarks saturate and leak; a private golden set plus a calibrated LLM-as-judge gives you a number that tracks your production error rate. This is the 7-step build.
Key takeaways
- MMLU rankings shifted 17 points after decontamination in 2026; treat it as field signal, not an acceptance test.
- Build a versioned golden set: 20-50 smoke, ~200 regression, 500+ release gate, with a 10% freshness budget.
- Use deterministic metrics where you can; reserve LLM-as-judge for properties that require reading.
- Calibrate the judge against human labels and gate on Cohen's kappa; below 0.6 it is noise.
- Defend against contamination with canary strings, fresh held-out data, and n-gram overlap checks.
- Standing up the first suite costs roughly 0.5-1.0 FTE-weeks, then ~5% of an engineer's quarter to maintain.
Why custom LLM evals beat public benchmarks in 2026
The honest summary of where benchmarks stand: public sets are screening tools, not acceptance tests. MMLU's own history makes the pattern clear: the MMLU-CF benchmark from Microsoft Research had to rewrite test items wholesale just to remove contamination risk, formalizing at scale what your golden set should do in miniature.
Contamination also crosses language barriers, so a leaked English set can poison multilingual corpora downstream. And the judge models everyone now leans on carry their own biases, well documented in the Judging the Judges study.
None of this means benchmarks are useless. It means they answer a different question than the one you ship against.
Step 1: Define success in observable terms
A harness scores behavior, not vibes. For each user-facing flow, write down the input contract, the action the model takes, and the observable outcome you can verify.
For a code-review bot: input is a unified diff; action is a structured JSON {severity, file, line, message, suggested_fix}; outcome is whether the comment is correct, in-place, and actionable.
Resist scoring "helpfulness" in the abstract. Score a checklist of properties: correctness, severity calibration, comment locality, no false positives, p95 latency under 2s. Pin that checklist in a versioned YAML file so reviewers argue about it in code review.
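To make the code-review contract concrete, here is a minimal sketch of the output side in Python. The field names come from the example above; the `SEVERITIES` values and the `parse_comment` helper are assumptions for illustration, not part of any library.

```python
import json
from dataclasses import dataclass

# Hypothetical severity levels; the contract above does not define the enum.
SEVERITIES = {"info", "warning", "error"}
REQUIRED_FIELDS = ("severity", "file", "line", "message", "suggested_fix")


@dataclass(frozen=True)
class ReviewComment:
    severity: str
    file: str
    line: int
    message: str
    suggested_fix: str


def parse_comment(raw: str) -> ReviewComment:
    """Parse one model output; raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    severity = data["severity"]
    if not isinstance(severity, str) or severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    line = data["line"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(line, bool) or not isinstance(line, int) or line < 1:
        raise ValueError("line must be a positive integer")
    return ReviewComment(**{f: data[f] for f in REQUIRED_FIELDS})
```

Anything that fails to parse counts as a contract violation before any quality metric runs, which keeps "the model returned garbage" separate from "the model returned a bad review."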
Step 2: Build a golden set with provenance and a freshness budget
A golden set is a fixed, versioned snapshot of inputs plus expected outputs that never enters training. 2026 practice runs roughly 20-50 examples for a smoke test, 200 for a regression suite, and 500+ for a release gate, per Inference.net's regression-testing guide.
Each item needs provenance, a difficulty tag, and a date stamp. Store it in DVC or Git LFS and never edit in place. Release golden-v4-2026-06 and supersede it; don't mutate it.
One practical rule keeps memorization at bay: at least 10% of items should come from the trailing 30 days. A model trained on last quarter's data cannot have seen this week's failures.
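The freshness rule is easy to enforce mechanically. A minimal sketch, assuming each golden-set item carries a date stamp; the function names are hypothetical, and the 30-day window and 10% budget are the figures from the rule above.

```python
from datetime import date, timedelta


def freshness_ratio(item_dates: list[date], today: date, window_days: int = 30) -> float:
    """Fraction of golden-set items dated within the trailing window."""
    if not item_dates:
        raise ValueError("golden set is empty")
    cutoff = today - timedelta(days=window_days)
    fresh = sum(1 for d in item_dates if cutoff <= d <= today)
    return fresh / len(item_dates)


def meets_freshness_budget(item_dates: list[date], today: date, budget: float = 0.10) -> bool:
    """True when at least `budget` of the items come from the trailing 30 days."""
    return freshness_ratio(item_dates, today) >= budget
```

Running this when you cut a new golden-set version turns the budget into a check rather than a convention.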
Step 3: Choose metrics, code-based or judge-based
For anything you can check programmatically (compiles, parses, exact match, regex, schema validation), use a deterministic metric. Fast, cheap, unambiguous.
For properties that require reading (clarity, tone, factual grounding, severity calibration), use an LLM-as-judge. Reliable judging needs three things: a written rubric in the prompt, structured output (1-5 or JSON), and a calibration set of ~50 cases where you also hold the human label.
The failure modes are well mapped: position bias, verbosity bias, self-preference, and "rating roulette" where the same judge scores the same input differently across runs. A 2026 audit, Bias in the Loop, found that weaker judges systematically fail to evaluate stronger models, with agreement collapsing as the capability gap widens.
Don't let a small judge grade a frontier model on tasks it can't solve itself.
Step 4: Calibrate and gate the judge
Two controls are non-negotiable.
First, position-swap. For every pairwise comparison, run it twice with the candidates in opposite slots and average the result.
Second, use multiple judges. Route each example to two or three judge models and require agreement within a tolerance, say a delta of 1 or less on a 5-point scale, and flag disagreements for human review.
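Both controls fit in a few lines. The sketch below assumes a hypothetical judge interface, a callable that returns its preference for the candidate in the first slot as a number in [0, 1]; real judge clients will look different, so treat the signatures as placeholders.

```python
from typing import Callable

# Hypothetical judge interface: returns the preference for the candidate
# shown in the FIRST slot, as a number in [0, 1].
PairwiseJudge = Callable[[str, str], float]


def position_swapped_preference(judge: PairwiseJudge, a: str, b: str) -> float:
    """Average preference for `a` over both slot orders to cancel position bias."""
    a_first = judge(a, b)          # a shown in slot 1
    a_second = 1.0 - judge(b, a)   # a shown in slot 2, flipped back to "preference for a"
    return (a_first + a_second) / 2.0


def panel_verdict(scores: list[int], tolerance: int = 1) -> tuple[float, bool]:
    """Return (mean score, needs_human_review) for one example scored by several judges."""
    if len(scores) < 2:
        raise ValueError("a panel needs at least two judges")
    spread = max(scores) - min(scores)
    return sum(scores) / len(scores), spread > tolerance
```

A judge that always favors slot 1 comes out at 0.5 after the swap, which is the point: pure position bias averages away instead of leaking into the ranking.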
Spot-check at least 5% of judge decisions against a human label weekly and recompute Cohen's kappa. Inference.net recommends a 0.7 floor for production judgments. If kappa falls below 0.6, the judge has stopped being a measurement instrument.
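Cohen's kappa is simple enough to compute without a dependency. This is a minimal sketch of the standard formula, observed agreement corrected for chance agreement; the status labels in `judge_status` are hypothetical names for the 0.7 and 0.6 thresholds mentioned above.

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention, not a computed value.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def judge_status(kappa: float) -> str:
    """Map kappa onto the thresholds above (labels are illustrative)."""
    if kappa >= 0.7:
        return "ok"
    if kappa >= 0.6:
        return "recalibrate"
    return "untrusted"
```

If scikit-learn is already a dependency, its `cohen_kappa_score` should give the same number; the hand-rolled version just keeps the weekly spot-check job free of extra packages.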
Step 5: Defend against benchmark contamination
Contamination is the failure mode that invalidated MMLU, so build four defenses in by default.
Canary strings. Inject unique, human-unlikely GUIDs (for example MINION-42-aurelius-7f3c) into golden-set items. If a model emits that string unprompted on an unrelated input, your data leaked.
Date-bounded tasks. Require a fact or API state that post-dates known training cutoffs, like "summarize the changelog merged yesterday."
Held-out fresh data. Keep a private set authored after the most recent training cutoff and never published. This is your only true contamination test.
Overlap checks. Compute 13-gram overlap and a minhash or embedding similarity between your golden set and any public benchmark you suspect the model has seen, then reject items above a threshold. The 17-point leaderboard shift is the size of the error you make by skipping this, per the 2026 re-evaluation coverage.
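Two of these defenses are cheap to sketch: the canary scan and the 13-gram overlap check. The version below uses whitespace tokenization and exact n-gram matching as simplifying assumptions; a production check would likely add the minhash or embedding similarity mentioned above, and the rejection threshold is yours to pick.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item: str, reference: str, n: int = 13) -> float:
    """Fraction of the item's n-grams that also occur in the reference text."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, so there is nothing to compare
    return len(item_grams & ngrams(reference, n)) / len(item_grams)


def leaked_canaries(model_output: str, canaries: set[str]) -> set[str]:
    """Canary strings that appear verbatim in a model output."""
    return {c for c in canaries if c in model_output}
```

Run `leaked_canaries` over outputs for inputs that never contained the canary; any non-empty result means the golden set reached a training corpus.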
Step 6: Wire eval gates into CI
Three tiers, matched to merge cost.
On every pull request, run the 20-50 item smoke set as a fast unit test, under five minutes, deterministic metrics only.
Pre-merge to main, run the 200-item regression set with both code metrics and the judge, 15-30 minutes, p95 latency budget enforced.
Pre-release, run the 500+ deep set with multi-judge panels and human spot-checks.
The gate is not "score above X." It is "no statistically significant regression versus the last green build." Particula's 2026 guidance is to compare confidence intervals, not point estimates, and to replicate each eval 5-10 times when stakes are high.
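One way to express "no statistically significant regression" is a paired bootstrap over per-item scores. This is a sketch under assumptions: both builds were scored on the same golden-set items in the same order, and a percentile bootstrap is an acceptable interval for your metric. It is not taken from the Particula guidance itself.

```python
import random


def bootstrap_diff_ci(baseline: list[float], candidate: list[float],
                      n_resamples: int = 2000, alpha: float = 0.05,
                      seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for mean(candidate - baseline) on paired items."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need paired, non-empty score lists")
    rng = random.Random(seed)  # fixed seed keeps the CI gate reproducible
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_significant_regression(baseline: list[float], candidate: list[float]) -> bool:
    """Fail the gate only when the whole interval sits below zero."""
    _, hi = bootstrap_diff_ci(baseline, candidate)
    return hi < 0.0
```

A small dip whose interval still straddles zero passes; the build fails only when the data rules out "no change."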
The DeepEval CI/CD guide suggests a 0.5 threshold floor for general use and 0.7 for production gates. In GitHub Actions, this is a job that calls deepeval test run or inspect eval and fails the build when the regression threshold trips.
Step 7: Version, monitor, and re-baseline
Evals drift. New model versions shift behavior; new prompt versions shift it more.
Tag every run with the prompt SHA, the model version, and the golden-set version. Store results in a time-series store (Braintrust, LangSmith, MLflow 3.14.0, all shipped Q2 2026) and dashboard the trailing-30-day mean plus a CUSUM chart for sudden regressions.
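For the sudden-regression signal, a one-sided CUSUM is a few lines of Python. This is a minimal sketch of the textbook downward CUSUM; `target`, `slack`, and `threshold` are parameters you would tune on your own run history, and the values in the usage note are illustrative only.

```python
def cusum_drop_alarms(scores: list[float], target: float,
                      slack: float, threshold: float) -> list[int]:
    """Indices where the downward CUSUM statistic exceeds `threshold`.

    Each run adds (target - score - slack) to a running sum that is floored
    at zero, so small noise decays while a sustained drop accumulates.
    """
    alarms = []
    s = 0.0
    for i, x in enumerate(scores):
        s = max(0.0, s + (target - x - slack))
        if s > threshold:
            alarms.append(i)
            s = 0.0  # restart accumulation after raising an alarm
    return alarms
```

With `target=0.84`, `slack=0.01`, and `threshold=0.12`, five runs at 0.84 followed by runs at 0.78 raise the first alarm on the third low run, while a single noisy run on its own would not.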
Re-baseline quarterly. Retire items the model gets right 100% of the time, add items drawn from real production failures, and refresh the freshness budget.
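Finding retirement candidates can be automated from stored run results. A small sketch, assuming a hypothetical `history` mapping from item ID to pass/fail results across runs; the `min_runs` guard is an added assumption so that an item is not retired on the strength of one or two lucky passes.

```python
def items_to_retire(history: dict[str, list[bool]], min_runs: int = 5) -> list[str]:
    """IDs of items passed in every recorded run, given at least `min_runs` runs."""
    return sorted(
        item_id
        for item_id, results in history.items()
        if len(results) >= min_runs and all(results)
    )
```

Retired items do not disappear; they simply stop counting in the next golden-set version, which supersedes the old one.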
Worked example: a code-review harness
Evaluate two frontier candidates, Model A and Model B, on 200 real PR diffs, with a different model as the judge. (As of mid-2026, the Claude Opus 4.x line, the GPT-5.x line, and Gemini 3.5 are all shipping.)
- Success rubric: comment is correct, in-place (line within ±2), actionable, and non-duplicative.
- Golden set: 200 diffs from the last 90 days, 30 from the last 7, human comments as ground truth, frozen as golden-v4-2026-06-17.
- Code metrics: valid_json, severity_in_enum, line_within_range, no_duplicate_comments, all under 1s per item.
- Judge: rubric plus diff plus candidate comment, output {correct, in_place, actionable, severity_match}, position-swapped and averaged, requiring the judge's own calibration accuracy at 0.7 kappa or better.
- Contamination: canary GUID in every diff header, n-gram check against public corpora, 30 fresh held-out diffs.
- CI gate: smoke on PR, full 200 pre-merge, 500 pre-release; fail if any metric drops more than 2 points versus the last green main.
- Decision: Model A scores 0.84 (95% CI 0.79-0.88), Model B scores 0.81 (0.76-0.85). The CIs overlap, so it's a tie. Pick on cost or latency.
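The four code metrics in this example could look something like the sketch below. The metric names match the list above; the severity values and the definition of "duplicate" (same file, same line, same normalized message) are assumptions made for illustration.

```python
import json

# Hypothetical severity levels; the worked example does not spell out the enum.
SEVERITIES = {"info", "warning", "error"}


def valid_json(raw: str) -> bool:
    """True if the raw model output parses as JSON."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return False
    return True


def severity_in_enum(comment: dict) -> bool:
    """True if the comment's severity is one of the allowed values."""
    severity = comment.get("severity")
    return isinstance(severity, str) and severity in SEVERITIES


def line_within_range(comment: dict, expected_line: int, tolerance: int = 2) -> bool:
    """True if the comment lands within ±tolerance lines of the human comment."""
    line = comment.get("line")
    if isinstance(line, bool) or not isinstance(line, int):
        return False
    return abs(line - expected_line) <= tolerance


def no_duplicate_comments(comments: list[dict]) -> bool:
    """True if no two comments share file, line, and normalized message."""
    keys = [
        (c.get("file"), c.get("line"), str(c.get("message", "")).strip().lower())
        for c in comments
    ]
    return len(keys) == len(set(keys))
```

All four are pure functions over parsed output, so they fit comfortably inside the one-second-per-item budget and can run on every pull request.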
Choosing your approach: a comparison table
| Approach | Cost to stand up | Relevance to your task | Contamination-resistant | Pick when |
|---|---|---|---|---|
| Public benchmarks (MMLU, GPQA) | $0 | Low | No (17-pt shift in 2026) | Field-level signal only |
| LM Arena | $0 | Medium | Medium (gaming risk) | Human-preference signal at the margin |
| Vendor evals (Braintrust) | Low | High for vendor tasks | Depends on data | Already on the vendor's stack |
| Custom harness (this guide) | Medium-high | High | Yes | Production model selection and gating |
| Hybrid (public + custom gate) | Medium | High | Mostly | Default for most teams shipping in 2026 |
What's current in eval tooling (June 2026)
As of mid-June 2026: Inspect AI v0.3.240 (UK AISI, MIT, June 15), Promptfoo v0.121.12 (MIT, June 13, with OpenAI having announced intent to acquire it), DeepEval 4.0.0 (Apache-2.0, May 8), and MLflow 3.14.0 (Apache-2.0, June 17) are all current.
OpenAI Evals was deprecated on June 3, 2026, with a shutdown date of November 30, 2026. If you're on it, migrate now.
What this means for you
The custom harness has a real price. Expect 0.5-1.0 FTE-weeks to stand up the first suite, then about 5% of an eval engineer's quarter to maintain it.
What that buys is a number that actually correlates with your production error rate, which no public benchmark can promise. For multi-turn agent evals, add a coherence check across turns; agentic workloads fail in ways single-shot metrics never see.
The durable part of this workflow outlives any version on this page. Define observable success, freeze a dated golden set, calibrate your judge against humans, defend against contamination, and gate in CI. Swap the model names every quarter; keep the harness.
Sources
- Artificial Analysis Intelligence Index v4.1
- Benchmark contamination broke MMLU: the 17-point drop
- The Batch, DeepLearning.AI
- MMLU-CF: A Contamination-free Multi-task Benchmark (ACL)
- Judging the Judges: position and bias in LLM judges (arXiv)
- Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering (arXiv)
- Regression Testing Non-Deterministic AI With LLM-as-Judge (Particula)
- AI Regression Testing for LLM Apps (Inference.net)
- Regression Testing LLM Systems in CI/CD (DeepEval)
