LLM-as-Judge

LLM-as-Judge is the practice of using a strong language model, prompted with a rubric, to score or compare the outputs of another model at scale, standing in for the human raters who would otherwise grade each response. It emerged as a practical eval method around 2023 as teams sought a cheaper, faster alternative to manual annotation for open-ended tasks like summarization, chat quality, and instruction following, where exact-match scoring fails. A judge can return an absolute score against criteria, a pairwise preference between two candidates, or a pass/fail on a checklist. The approach scales to thousands of examples per run and turns fuzzy quality dimensions—helpfulness, factuality, tone—into repeatable numbers. It is not a source of ground truth: judge verdicts carry known biases, drift with model updates, and require validation against human labels before they can be trusted to gate a release.

How it works

You write a rubric that names the criteria and a scoring scale, then prompt the judge model with the task input, the candidate output, and often a reference answer. The judge is instructed to reason before scoring, since chain-of-thought prompting has been reported to improve agreement with human raters, and to emit a structured verdict such as a JSON score or an A/B choice. Pairwise comparison tends to be more reliable than absolute scoring because choosing the better of two answers is easier than calibrating a number on a scale. Per-example verdicts are then aggregated across a dataset to produce a metric you can track across model versions.
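
The rubric, structured verdict, and aggregation steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the rubric wording, the 1-5 scale, and the names build_prompt, parse_verdict, mean_score, and fake_judge are all assumptions made for the example, and fake_judge is a stub standing in for a real model call.

```python
import json
from statistics import mean
from typing import Callable, Optional

# Assumed rubric: the criteria and the 1-5 scale are illustrative only.
RUBRIC = (
    "Score the candidate from 1 (poor) to 5 (excellent) on helpfulness and factuality.\n"
    "Reason step by step first, then end with one line of JSON: "
    '{"score": <integer 1-5>}'
)

def build_prompt(task_input: str, candidate: str, reference: Optional[str] = None) -> str:
    """Assemble the judge prompt: rubric, task input, candidate, optional reference."""
    parts = [RUBRIC, f"Task input:\n{task_input}", f"Candidate output:\n{candidate}"]
    if reference is not None:
        parts.append(f"Reference answer:\n{reference}")
    return "\n\n".join(parts)

def parse_verdict(raw: str) -> int:
    """Read the JSON verdict from the last non-empty line; the reasoning above it is ignored."""
    lines = raw.strip().splitlines()
    if not lines:
        raise ValueError("empty judge response")
    score = json.loads(lines[-1])["score"]
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score!r}")
    return score

def mean_score(examples: list[dict], call_judge: Callable[[str], str]) -> float:
    """Judge every example and aggregate the scores into one dataset-level metric."""
    scores = [
        parse_verdict(call_judge(build_prompt(ex["input"], ex["candidate"], ex.get("reference"))))
        for ex in examples
    ]
    return mean(scores)

def fake_judge(prompt: str) -> str:
    """Hypothetical stub: a real judge would send the prompt to a model instead."""
    candidate = prompt.split("Candidate output:\n")[1].split("\n\n")[0]
    score = 5 if "Paris" in candidate else 2
    return "Reasoning: compared the candidate with the reference.\n" + json.dumps({"score": score})

examples = [
    {"input": "What is the capital of France?", "candidate": "Paris.", "reference": "Paris"},
    {"input": "What is the capital of France?", "candidate": "Lyon.", "reference": "Paris"},
]
print(mean_score(examples, fake_judge))  # 3.5 with the stub judge above
```

Keeping the verdict on its own final line lets the judge reason freely first while the parser stays strict, so a malformed response raises an error instead of silently skewing the metric.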

Why it matters for AI engineers

A judge turns evaluation from a bottleneck into a CI step: thousands of graded comparisons for the price of inference, fast enough to run on every prompt change. But the biases are real and large enough to change conclusions. Position bias makes judges favor an answer because of where it appears in the prompt, often the first slot, so you should run each pair in both orders and average the results. Self-preference bias makes a model rate outputs from its own model family higher, so avoid using one vendor's model as the judge when choosing between that vendor and others. Always calibrate the judge against a human-labeled sample and report agreement before letting it gate a ship decision.
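
Two of these safeguards, order swapping and calibration against human labels, can be sketched as below. The judge signature (task input, first answer, second answer, returning "first" or "second") and the stub judges are hypothetical, chosen only to make the example runnable. Cohen's kappa is included as one common chance-corrected agreement measure; the text above does not prescribe a specific statistic.

```python
from collections import Counter
from typing import Callable, Sequence

# Assumed judge signature: (task_input, first_answer, second_answer) -> "first" or "second".
PairJudge = Callable[[str, str, str], str]

def debiased_preference(judge: PairJudge, task_input: str, a: str, b: str) -> float:
    """Judge the pair in both orders and return the share of runs that `a` wins.

    1.0 or 0.0 means the verdict survived the swap; 0.5 means it flipped with position.
    """
    a_wins_when_first = judge(task_input, a, b) == "first"
    a_wins_when_second = judge(task_input, b, a) == "second"
    return (a_wins_when_first + a_wins_when_second) / 2

def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items on which the judge and the human rater gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

def cohen_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge_labels)
    p_o = agreement_rate(judge_labels, human_labels)
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    p_e = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label; kappa is undefined, treated as 1.0 here
    return (p_o - p_e) / (1 - p_e)

# Hypothetical stub judges for illustration only.
def always_first(task_input: str, first: str, second: str) -> str:
    return "first"  # a purely position-biased judge

def prefers_longer(task_input: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"

print(debiased_preference(always_first, "q", "answer A", "answer B"))            # 0.5
print(debiased_preference(prefers_longer, "q", "a much longer answer", "short"))  # 1.0

judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
print(agreement_rate(judge_labels, human_labels))  # 0.75
print(cohen_kappa(judge_labels, human_labels))     # 0.5
```

A purely position-driven judge collapses to 0.5 under the swap, which makes the bias visible instead of letting it pass as a preference. Reporting kappa next to raw agreement shows how much of the match could arise by chance when one label dominates.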

LLM-as-Judge vs. alternatives

Method              | Cost per item | Best for                               | Main weakness
LLM-as-Judge        | Low           | Open-ended quality at scale            | Position and self-preference bias
Human eval          | High          | Ground truth, nuance, safety           | Slow, hard to scale, rater variance
Programmatic checks | Near zero     | Exact match, schema, regex, unit tests | Blind to semantics and style
