Multimodal Evals Are Now the Hardest Part of the Stack

Text benchmarks have saturated, so differentiation moved to vision, audio, video, and real-time duplex tasks where evaluation is still immature and gameable.

June 26, 2026 · 10 min read

Every frontier model now scores between 89% and 92% on MMLU. That cluster, confirmed across multiple independent aggregators as of April 2026, has made the canonical text benchmark useless for telling models apart. Differentiation has moved to vision, audio, video, and real-time duplex tasks, where the evaluation methodology is immature, fragmented, and often gameable.

Multimodal LLM evaluation is the genuine 2026 bottleneck: building a capable vision-language or omni-modal model is commoditized, but measuring whether it works in production remains unsolved. The practical answer is a layered approach, using public benchmarks as a floor, product-specific rubric-graded evals as the signal, and LLM-as-judge only on narrow, human-calibrated tasks.

TL;DR

Text benchmarks saturated, so the field pivoted to multimodal and real-time evaluation, where methods are younger and noisier. Cross-modal grounding, temporal reasoning, and modality-specific hallucination require fundamentally different evals than text-only accuracy. No leaderboard substitutes for a 100-to-200-example product-representative suite with human ground truth and rubric-based grading.

Key takeaways

  • MMLU is saturated at 89 to 92% for all frontier models; MMLU-Pro is approaching saturation within 18 months of its release.
  • Vision-language benchmarks are themselves compromised: GeminiPro hits 42.9% on MMMU without seeing the image, per the MMStar audit.
  • Audio-visual (AV-Odyssey), video (Video-MME, TempCompass), and real-time duplex (Omni-DuplexEval) benchmarks are the active frontier, but each measures a narrow slice.
  • LLM-as-judge costs roughly $78 versus $5,000 to $7,500 for human annotation, but carries self-preference, perceptual, length, and position biases that are worse for multimodal outputs.
  • The only reliable signal is a private, product-specific eval with rubric grading and a Cohen's kappa of at least 0.61 against human raters.

Why MMLU saturation forced the pivot to multimodal

MMLU, introduced by Hendrycks et al. in 2021, was the canonical measure of AI progress for four years. As of April 2026, that era is over. GPT-5.4 reportedly leads at about 92%, Claude Opus 4.6 sits near 91 to 92.3%, Gemini 3.1 Pro around 90 to 91.8%, and DeepSeek V4 near 89%, according to aggregators tracked by TokenMix and Artificial Analysis.

MMLU-Pro, released mid-2024, was the designated successor. It expanded to roughly 12,000 questions across 14 disciplines with 10 answer choices instead of 4, cutting guessing accuracy from 25% to 10% and restoring discriminating power, as the original paper reports.

By June 2026 it is also saturating: top models cluster within one to two points of each other on a given leaderboard, although the absolute figures depend on the aggregator, with Gemini 3.1 Pro Preview reported near 83.8% and Claude Opus 4.5 near 89.5%, per Presenc AI and bracai.

The saturation has forced three parallel shifts: domain-specific benchmarks like GPQA Diamond and SWE-Bench, modality expansion into vision, audio, and video, and dynamic or agentic benchmarks like LiveCodeBench that update continuously to dodge contamination. As DataVLab puts it, a single cherry-picked MMLU or SWE-bench figure can be technically true and still completely misleading.

Figure: Frontier model MMLU scores, April 2026. GPT-5.4: 92%; Claude Opus 4.6: 92%; Gemini 3.1 Pro: 91%; DeepSeek V4: 89%.

How do you evaluate multimodal models in 2026?

You start with public benchmarks as a coarse filter, then build product-specific evals that capture your actual failure modes. Public multimodal benchmarks are necessary but insufficient because they measure aggregate capability on distributions that rarely match production data, and because several are compromised by visual dispensability and contamination.

The established vision-language core is MMMU, with 11,500 questions across 30 college-level subjects. Gemini 3.1 Pro Preview leads at about 83.8% as of June 2026, up roughly 27 points from the GPT-4V baseline over two years.

MMMU-Pro hardens it with 3,460 questions, a 10-option format, and vision-only input requirements. MMBench adds bilingual coverage across 20 ability dimensions with a CircularEval strategy that resists data leakage.
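The CircularEval idea can be sketched in a few lines. This is a minimal illustration, not MMBench's own harness: `ask_model` is a hypothetical callable that takes a question and a list of options and returns the index of the option it picks. A question counts as passed only if the model is right under every circular shift of the options, which penalizes answers driven by option position rather than content.

```python
from typing import Callable, Sequence


def circular_eval(
    question: str,
    options: Sequence[str],
    correct_index: int,
    ask_model: Callable[[str, Sequence[str]], int],  # hypothetical model wrapper
) -> bool:
    """Return True only if the model answers correctly under every circular shift."""
    n = len(options)
    for shift in range(n):
        # Rotate the options so that each option visits each position once.
        rotated = [options[(i + shift) % n] for i in range(n)]
        # Index of the correct option after the rotation.
        rotated_correct = (correct_index - shift) % n
        if ask_model(question, rotated) != rotated_correct:
            return False
    return True
```

A model that always picks the same position passes one of the shifts and fails the rest, so it scores zero on the question instead of getting lucky credit.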

MathVista targets mathematical reasoning in visual contexts.

Audio-visual evaluation is newer and thinner. AV-Odyssey is the most comprehensive dedicated audio-visual benchmark, with 4,555 multiple-choice questions on sound counting, loudness comparison, and source identification, built by researchers at CUHK MMLab with collaborators from Stanford, Berkeley, and Yale. MultiFinBen is the first multilingual, multimodal financial benchmark, covering five languages and three modalities.

Video gets Video-MME (CVPR 2025), with 300 expert videos and 900 questions across six disciplines and three cognitive stages, plus TempCompass for temporal reasoning. The newest category is real-time duplex: Omni-DuplexEval, released May 17, 2026, with 660 annotated videos across nine tasks split into Real-Time Description and Proactive Reminder scenarios, is the only benchmark specifically targeting full-duplex omni-modal interaction.

The benchmark map at a glance

Benchmark | Modality | Task focus | Status (June 2026)
MMMU | Image-text | College-level reasoning, 30 subjects | Saturated near 84%
MMMU-Pro | Image-text | 10-option, vision-required | Active, narrowing
MMBench | Image-text | 20 ability dimensions, bilingual | Active
MathVista | Image-text | Math reasoning in visuals | Active
AV-Odyssey | Audio-visual | Sound counting, loudness, source ID | Active
MultiFinBen | Text-vision-audio | Multilingual financial QA | Active
Video-MME | Video-text | 6 disciplines, 3 cognitive stages | Active
TempCompass | Video-text | Temporal reasoning | Active
Omni-DuplexEval | Full-duplex | Real-time description, proactive reminder | Newest (May 2026)

What failure modes must multimodal evals catch?

Text benchmarks are blind to the failure classes that make multimodal models dangerous in production. Three matter most.

Cross-modal grounding. The model must align visual evidence with textual claims. The MMStar audit (NeurIPS 2024) found two systemic problems: visual content is unnecessary for many samples, and unintentional data leakage inflates scores. GeminiPro reaches 42.9% on MMMU without any image input, beating the random baseline by 24% on average across six benchmarks, and Sphinx-X-MoE hits 43.6% on MMMU without images, surpassing its own LLM backbone by 17.9%. Video-VER (NeurIPS 2025) introduces a Visual Evidence Reward that explicitly measures whether claims are grounded in pixels rather than world knowledge.
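A blind-versus-sighted audit in the spirit of the MMStar findings can be run on your own data. The sketch below assumes you have already graded every item twice, once with the image and once with the image withheld; the `AuditItem` fields and the report keys are illustrative names, not part of any published tool.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class AuditItem:
    item_id: str
    correct_with_image: bool
    correct_without_image: bool  # same question, image withheld


def visual_dispensability_report(items: Iterable[AuditItem], chance: float) -> Dict[str, object]:
    """Compare accuracy with and without the image, and list suspect items."""
    items = list(items)
    if not items:
        raise ValueError("no items to audit")
    n = len(items)
    sighted = sum(i.correct_with_image for i in items) / n
    blind = sum(i.correct_without_image for i in items) / n
    return {
        "sighted_accuracy": sighted,
        "blind_accuracy": blind,
        # How far image-free answers beat random guessing.
        "blind_gain_over_chance": blind - chance,
        # How much the image actually contributes.
        "vision_gain": sighted - blind,
        # Items answerable without the image: candidates for removal or rewrite.
        "suspect_item_ids": [i.item_id for i in items if i.correct_without_image],
    }
```

A large `blind_gain_over_chance` combined with a small `vision_gain` suggests the set is measuring language priors or leakage more than vision.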

Temporal reasoning. Understanding event sequence, duration, and causality is inherently multimodal and invisible to text evals. TempCompass targets it directly. Video-MME tests it across Perception, Comprehension, and Adaptation stages. Watch-Remember-Reason (June 2026) surveys temporal grounding across video MLLMs, and Deep Temporal Reasoning in Video Language Models (ACL 2025) evaluates action duration and completion cross-linguistically.

Cross-modal hallucination. Multimodal hallucination is qualitatively different from text hallucination. POPE polls object existence with yes-or-no questions. MMHal-Bench spans 96 image-question pairs across 12 object topics and 8 error categories for granular modality-misalignment analysis. HallusionBench disentangles language hallucination from visual illusion; the current best reported score on it is 0.700, from Qwen3.5-27B.
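Yes-or-no object polling of the kind POPE uses reduces to a binary classification score. The sketch below assumes predictions and labels are already parsed into booleans (True meaning "yes, the object is present"); the function name and the returned keys are illustrative. The yes-ratio is worth tracking alongside F1 because a model that says "yes" to everything gets perfect recall while hallucinating objects.

```python
from typing import Dict, Sequence


def yes_no_metrics(predictions: Sequence[bool], labels: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1, and yes-ratio for yes/no object polling."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    pairs = list(zip(predictions, labels))
    tp = sum(p and l for p, l in pairs)          # said yes, object present
    fp = sum(p and not l for p, l in pairs)      # said yes, object absent (hallucination)
    fn = sum(not p and l for p, l in pairs)      # said no, object present
    tn = sum(not p and not l for p, l in pairs)  # said no, object absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
    }
```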

Why public multimodal benchmarks are gameable

Three structural weaknesses make public multimodal scores unreliable as a sole signal.

First, contamination. The MMStar findings on image-free scoring are contamination in disguise: models have seen enough text-only fragments of multimodal benchmarks to answer without the modality. TS-Guessing protocols find commercial LLMs fill in absent benchmark data with surprising accuracy, jumping from 22.28% to 42.19% for Claude-instant-1 with added metadata.

Second, scaffold dependence. SWE-Bench scores vary by 25 percentage points depending on scaffolding, per DataVLab. The same holds for multimodal harnesses: prompting strategy, image preprocessing, and answer extraction can swing a reported score by double digits.
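One way to make scaffold dependence visible is to report the spread across harness configurations rather than a single number. The sketch below assumes you have run the same model and benchmark under several configurations and collected one score per configuration; the configuration names in the usage note are made up for illustration.

```python
from typing import Dict, Mapping


def harness_spread(scores_by_config: Mapping[str, float]) -> Dict[str, object]:
    """Summarize how much one model's score moves across harness configurations."""
    if not scores_by_config:
        raise ValueError("need at least one configuration")
    best = max(scores_by_config, key=scores_by_config.get)
    worst = min(scores_by_config, key=scores_by_config.get)
    return {
        "best_config": best,
        "worst_config": worst,
        # Spread in percentage points between the best and worst configuration.
        "spread_points": scores_by_config[best] - scores_by_config[worst],
    }
```

For example, `harness_spread({"zero_shot_raw": 61.0, "cot_resized": 72.5, "cot_regex_extract": 68.0})` reports a spread of 11.5 points, which is the number to publish next to any headline score.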

Third, selective reporting. Vendors cherry-pick favorable benchmarks. A model that leads on MMMU may fail catastrophically on OCR in degraded documents, medical imaging with specific acquisition artifacts, or multilingual code-switching in video calls, none of which MMMU measures.

LiveCodeBench is the strongest contamination-free alternative in text, continuously pulling from live competitive programming so test content is unavailable at training time. No equivalent has been widely adopted for multimodal benchmarks, leaving vision-language evaluation particularly exposed.

How should you use LLM-as-judge for multimodal outputs?

LLM-as-judge is seductive on cost: roughly $77.81 versus $5,000 to $7,500 for equivalent human annotation. For multimodal outputs it is also the most error-prone.

Yang et al. (2026) formalized Self-Preference Bias across 20 LLMs, finding that advanced capability is uncorrelated with reduced bias. More damaging for multimodal work is Perceptual Judgment Bias: when visual evidence conflicts with textual cues, multimodal judges anchor on response text and reward plausible narratives over perceptually correct answers.

The Perceptually Perturbed Judgment Dataset constructs minimally edited counterfactual responses to isolate this, and training on it improves perceptual fidelity and human alignment.

Add the usual length, recency, and provenance biases, plus the PolyVis finding that judge performance varies sharply across 12 languages and four task objectives, and an uncalibrated multimodal judge cannot be trusted as a sole grader. MJ-Bench (NeurIPS 2025) is the most directly relevant benchmark for multimodal reward models. It shows that closed-source VLMs like GPT-4o give better feedback than open-source alternatives, and that CLIP-style scorers win on alignment and quality while VLMs win on safety and bias.

Practical rules:

  • Use narrow rubrics, because broad rubrics degrade judge consistency.
  • Calibrate against a held-out human-annotated set.
  • Ensemble multiple judges.
  • Randomize order and control for length.
  • Restrict LLM-as-judge to preference ranking and style, never factual visual grounding.
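Two of these rules, order control and judge ensembling, can be sketched together. The code below assumes each judge is a hypothetical callable that takes a prompt and two responses and returns "first" or "second". Instead of randomizing the order once, it queries each judge in both orders and discards verdicts that flip, a stricter variant of order randomization. It does not address length bias, which needs separate control.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical judge signature: (prompt, first_response, second_response) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(prompt: str, response_a: str, response_b: str, judge: Judge) -> Optional[str]:
    """Ask the judge in both orders; return 'a', 'b', or None if the verdict flips."""
    pick_ab = "a" if judge(prompt, response_a, response_b) == "first" else "b"
    pick_ba = "b" if judge(prompt, response_b, response_a) == "first" else "a"
    return pick_ab if pick_ab == pick_ba else None


def ensemble_preference(prompt: str, response_a: str, response_b: str,
                        judges: Sequence[Judge]) -> Optional[str]:
    """Majority vote over judges whose verdicts survive the order swap."""
    votes = Counter()
    for judge in judges:
        pick = debiased_preference(prompt, response_a, response_b, judge)
        if pick is not None:
            votes[pick] += 1
    if votes["a"] == votes["b"]:
        return None  # tie or no consistent votes: escalate to a human rater
    return "a" if votes["a"] > votes["b"] else "b"
```

The share of comparisons that come back as `None` is itself a useful diagnostic: a high rate means the judge is reacting to position more than content.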

Building a production-representative multimodal eval suite

The only reliable signal is an eval you control. The pattern that has converged across teams at Google DeepMind, Kili, and the open VLMEvalKit ecosystem is rubric-based grading on a small, product-mirrored set.

Start with 100 to 200 examples that reflect your real distribution and failure modes, not MMMU's college-exam format. Define explicit rubrics per dimension: content accuracy, visual grounding, coherence, and helpfulness. Google DeepMind's FACTS Multimodal is the reference design here, with each example an image plus textual material scored against an explicit rubric.
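A rubric with explicit dimensions maps naturally onto a small data structure. The sketch below uses the four dimensions named above; the 0-to-2 scale, the class name, and the normalization are assumptions for illustration and are not taken from FACTS Multimodal.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Dimensions from the text; the 0..2 scale is an assumed convention.
DIMENSIONS = ("content_accuracy", "visual_grounding", "coherence", "helpfulness")
MAX_SCORE = 2


@dataclass(frozen=True)
class RubricGrade:
    example_id: str
    scores: Dict[str, int]  # dimension name -> integer score in 0..MAX_SCORE

    def __post_init__(self) -> None:
        # Reject grades that skip a dimension or fall outside the scale.
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")
        for dim in DIMENSIONS:
            if not 0 <= self.scores[dim] <= MAX_SCORE:
                raise ValueError(f"{dim} score out of range: {self.scores[dim]}")


def summarize(grades: List[RubricGrade]) -> Dict[str, float]:
    """Mean score per dimension, normalized to the range 0..1."""
    if not grades:
        raise ValueError("no grades to summarize")
    return {dim: mean(g.scores[dim] for g in grades) / MAX_SCORE for dim in DIMENSIONS}
```

Reporting each dimension separately, rather than a single blended score, keeps a strong coherence score from hiding a weak visual-grounding score.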

Bring in human raters and hold them to inter-rater reliability thresholds: a Cohen's kappa of at least 0.61 or a Krippendorff's alpha of at least 0.667. Run a 10-to-20-example pilot, measure IRR, and refine guidelines before scaling. LLM-human agreement can exceed kappa 0.6 on narrow rubrics but degrades on domain-general constructs, so keep the rubric tight.
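Cohen's kappa for two raters is short enough to compute without a library: it is observed agreement corrected for the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain implementation for categorical labels; the function name and the handling of the degenerate case are choices made here, not a standard.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treated here as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

On a 10-item pilot where two raters disagree on one item, for example `[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]` against `[1, 1, 1, 1, 0, 0, 0, 0, 0, 0]`, observed agreement is 0.9, chance agreement is 0.5, and kappa comes out at 0.8, which clears the 0.61 threshold.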

Prefer pairwise over scalar grading where possible. Asking which of two outputs is better reduces calibration drift and is more robust to annotator inconsistency than absolute scoring. For running many models, VLMEvalKit supports 220-plus multimodal models and 80-plus benchmarks as an open harness.
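Pairwise verdicts still need to be aggregated into something comparable across models. The simplest option is a win rate with ties counted as half a win, sketched below; the tuple layout is an assumption, and teams with many models may prefer a rating model over raw win rates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Each comparison: (model_x, model_y, winner), where winner is one of the two names or None for a tie.
Comparison = Tuple[str, str, Optional[str]]


def win_rates(comparisons: Iterable[Comparison]) -> Dict[str, float]:
    """Per-model win rate over pairwise comparisons; a tie counts as half a win for each side."""
    wins: Dict[str, float] = defaultdict(float)
    games: Dict[str, int] = defaultdict(int)
    for model_x, model_y, winner in comparisons:
        games[model_x] += 1
        games[model_y] += 1
        if winner is None:
            wins[model_x] += 0.5
            wins[model_y] += 0.5
        elif winner in (model_x, model_y):
            wins[winner] += 1.0
        else:
            raise ValueError(f"winner {winner!r} is not part of the comparison")
    return {model: wins[model] / games[model] for model in games}
```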

Reserve 10 to 20% of your product test data as a private holdout, never used in development or public reporting. Update eval sets quarterly to prevent gaming, and track benchmark trends over time rather than absolute scores, correlating them with production metrics to calibrate what your public-benchmark movement actually predicts.
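A private holdout is easier to keep private when membership is decided by a stable rule rather than a file someone can reshuffle. The sketch below assigns examples to the holdout by hashing their IDs; the salt value, the 15% default, and the function name are illustrative assumptions.

```python
import hashlib


def is_holdout(example_id: str, fraction: float = 0.15, salt: str = "eval-v1") -> bool:
    """Deterministically assign roughly `fraction` of examples to the private holdout."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets using integer arithmetic only.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(fraction * 10_000)
```

Because the assignment depends only on the ID and the salt, an example never drifts between the development set and the holdout when the dataset is reloaded or reordered.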

What this means for you

For model selection, treat MMMU and MMMU-Pro as entry filters requiring above 75% to be competitive, layer domain benchmarks like MultiFinBen or Video-MME for fit, add AV-Odyssey if audio-visual matters, and reach for Omni-DuplexEval only if you ship real-time interaction.

For eval suite construction, the checklist is short and non-negotiable:

  • Collect 100 to 200 product-representative examples with human ground truth.
  • Write explicit, hierarchical rubrics covering accuracy, grounding, coherence, and helpfulness.
  • Calibrate LLM judges against human ratings; require a Cohen's kappa of at least 0.61.
  • Audit for visual indispensability before trusting any vision-language score.
  • Keep a private holdout set untouched by development or reporting.
  • Refresh the set quarterly and track trends, not snapshots.

Evaluation is now the bottleneck, not because building multimodal models is easy, but because measuring whether they work in production contexts remains genuinely hard. Until contamination auditing, perceptual judge fidelity, and duplex evaluation methodologies mature, layered evaluation combining public benchmarks, custom product evals, and human judgment is the only defensible approach.

Frequently asked questions

Why is multimodal LLM evaluation harder than text evaluation?

Multimodal evals must catch cross-modal grounding, temporal reasoning, and modality-specific hallucination that text benchmarks cannot measure. They also suffer from visual dispensability, where models answer correctly without looking at the image, and from immature, fragmented judge methodologies.

Which multimodal benchmarks should I use for model selection in 2026?

Use MMMU and MMMU-Pro as entry filters, MMBench and MathVista for capability breadth, AV-Odyssey for audio-visual tasks, Video-MME and TempCompass for video, and Omni-DuplexEval for real-time duplex interaction. Treat any single score as a floor, not a verdict.

Is LLM-as-judge reliable for multimodal outputs?

Only with narrow rubrics and human calibration. Multimodal judges show self-preference bias, perceptual judgment bias where they reward plausible text over visual evidence, plus length and position biases. Use them for preference and style, not factual visual grounding.

How do I build a production-representative multimodal eval suite?

Collect 100 to 200 examples that mirror your real failure modes, define explicit rubrics, calibrate LLM judges against human ratings with a Cohen's kappa of at least 0.61, and keep a private holdout set never used in development or public reporting.

Has MMLU saturated for frontier models?

Yes. As of April 2026 every frontier model clusters between 89 and 92 percent on MMLU, and MMLU-Pro is approaching saturation within 18 months of release, forcing differentiation into multimodal and agentic benchmarks.