Evaluating Ai Models And Agents

Jailbreak Evaluation Frameworks for the Reasoning-Model Era

A practitioner framework for scoring jailbreak severity, choosing benchmarks, and assuming reasoning-model attackers in your red team.

By June 26, 202611 min read
AI jailbreak evaluation frameworkreasoning model jailbreakLLM red teaming methodology
Jailbreak Evaluation Frameworks for the Reasoning-Model Era

A reasoning model with a single adversarial system prompt broke the safety guardrails of nine target LLMs in 97.14% of 2,520 attack combinations, according to a peer-reviewed Nature Communications study published February 5, 2026. The attack cost fractions of a cent per attempt. The defense, for the targets that survived, took months of engineering.

That finding reframes what an AI jailbreak evaluation framework has to do. Success-rate counting is no longer enough. Security teams need a methodology that scores severity the way CVSS scores software vulnerabilities, assumes reasoning-model attackers as the baseline threat, and turns evaluation output into deployment decisions.

This guide lays out that framework: which benchmarks to run, how to score what they find, and how to wire it into a red-team program that survives the next model release.

TL;DR

Reasoning models are now autonomous jailbreak agents, and a 97% success rate makes "can we jailbreak this?" the wrong question. The right question is "how severe is the jailbreak, how transferable is the technique, and how actionable is the output?"

Use HarmBench or StrongREJECT for measurement, Mozilla's JEF for CVSS-style severity scoring, and assume any determined attacker has frontier reasoning capability. Treat single-turn, English-only, text-only evaluation as incomplete.

Key takeaways

  • Reasoning-model attackers hit 97.14% success across 2,520 combinations; non-reasoning controls succeeded in only 4 of 900 attempts (Nature Communications).
  • Claude 4 Sonnet resisted best (2.86% max harm); DeepSeek-V3 was most exposed (~90%). Model choice is a security control.
  • Score jailbreaks on Blast Radius, Retargetability, and Output Fidelity, not raw attack success rate. JEF v0.8.0 ships a calculator for this (0DIN).
  • Run more than one benchmark. HarmBench, StrongREJECT, JailbreakBench, and AILuminate each cover a different gap.
  • Multi-turn conversational attacks are the real threat surface. Single-turn benchmarks systematically underestimate risk.

Why reasoning models changed the threat model

The Nature Communications paper, by Hagendorff, Derner, and Oliver, pitted four reasoning models as autonomous attackers against nine target models across 70 harmful prompts in seven sensitive domains. The attackers were DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B. Targets included Claude 4 Sonnet, GPT-4o, and DeepSeek-V3.

The headline number is the 97.14% aggregate success rate. The per-attacker breakdown matters more for planning. DeepSeek-R1 reached a 90.00% maximum harm score. Grok 3 Mini hit 87.14%.

Gemini 2.5 Flash reached 71.43%. Qwen3 235B lagged at 12.86%, which is the important caveat: reasoning capability alone does not guarantee attack effectiveness. Training approach and architecture still dominate.

The alignment regression problem

The paper introduces a term worth adopting: alignment regression. As reasoning capability improves, attack capability improves in tandem. The same machinery that lets a model decompose a hard math problem lets it decompose a safety constraint across multiple turns.

The control group makes the case. Non-reasoning models achieved maximum harm in only 4 of 900 attempts. The reasoning attackers hit 97.14%. Systematic jailbreaking is not a general property of language models. It is a specific capability that emerges with reasoning-oriented training, and it generalizes to attack scenarios without specific attack training (Pebblous analysis).

Why single-turn evaluation misses the real attack

Most public benchmarks test whether a model refuses a single adversarial prompt. The reasoning-model attackers in the Nature study used multi-turn dialogue: innocuous opening queries, gradual escalation, real-time adaptation to refusal signals, and decomposition of harmful requests into benign-looking components.

Research on The Echo Chamber Multi-Turn LLM Jailbreak documents the same pattern. Multi-turn attacks establish implicit permissions that make later harmful requests harder to refuse. Any evaluation pipeline that only fires one prompt per behavior is measuring the easy case.

How do you score jailbreak severity?

Attack success rate (ASR) is the most quoted metric and the least useful in isolation. A 95% ASR against fictional harm is a different problem than a 30% ASR against bioweapon synthesis. Severity is a composite of attack difficulty, exploit potential, and operational impact, and those dimensions interact.

The proven template is CVSS, which scores software vulnerabilities across exploitability, impact, and scope to produce a 0.0 to 10.0 composite. The AI security community has adapted that structure for jailbreaks.

The JEF severity dimensions

Mozilla's 0DIN project released the Jailbreak Evaluation Framework (JEF) on June 12, 2025, and expanded it to v0.8.0 with seven modules in March 2026. JEF scores attack techniques, not model resistance, on three dimensions:

  • Blast Radius. What harmful content becomes accessible, how extensively the model's capabilities can be misused, and what downstream harm weaponized output could cause. This is the CVSS impact analog.
  • Retargetability. How easily the technique ports to other models, providers, or harm categories. A single-model exploit is a bug. A cross-provider exploit is a sector incident.
  • Output Fidelity. Whether the jailbreak produces genuinely actionable harmful output or superficial compliance. This is the dimension most ASR-only evaluations throw away.

The composite lands on a 0 to 10 scale. The JEF calculator implements it, and the framework powers the 0DIN GenAI bug bounty program, so the scoring is already tied to real reward payouts.

Severity bands for prioritization

Band Score What it means Action
Critical 9.0 to 10.0 High success, cross-provider, high-fidelity output Deployment hold, immediate mitigation
High 7.0 to 8.9 Moderate success, some transferability, useful output Prioritized fix, security architecture review
Medium 4.0 to 6.9 Lower success or limited transfer, degraded output Baseline architecture input
Low 0.1 to 3.9 Specific conditions, minimal transfer, limited harm Track, do not block release
Negligible 0.0 No demonstrated success under realistic conditions No action

A model with 40% ASR against Critical-severity harms is a worse deployment decision than a model with 60% ASR against Low-severity harms. Severity scoring exists to make that tradeoff explicit.

Which jailbreak benchmark should you run?

Four frameworks dominate the 2026 landscape. They are complementary, not competitive.

Framework Coverage Scoring Modality Best for
HarmBench 510+ behaviors, 18 attack methods ASR + Useful Refusal Rate Text + multimodal (2025 update) Broad automated robustness, public leaderboard comparison
StrongREJECT 346 curated prompts Continuous 0 to 1 usefulness Text Catching the empty-jailbreak problem, output quality
JailbreakBench 100+ behaviors Formal threat model Text Reproducible academic-grade evaluation, regulatory evidence
MLCommons AILuminate v0.5 Multimodal benchmark Resilience Gap metric T2T + T2I + I2T Industry-standard baseline, multimodal surface

HarmBench, from the Center for AI Safety and UIUC, is the de facto standard for automated evaluation (arXiv). Its 2025 multimodal expansion matters because vision-language models are now a default attack surface, and text-only evaluation misses it.

StrongREJECT, from Berkeley AI Research and Gray Swan AI, is the one to reach for when you suspect your "successful" jailbreaks are producing garbage. Its continuous usefulness score exposes what the authors call the empty jailbreak: attacks that technically bypass safety but yield output no attacker would use.

That distinction directly informed JEF's Output Fidelity dimension.

AILuminate, released as a draft October 15, 2025, is the first industry-led standardization effort. Its Resilience Gap metric measures how much a model's safety degrades under adversarial stress versus normal use.

Testing across 39 models found 89% showed measurable degradation. That number is the strongest argument for treating jailbreak evaluation as continuous monitoring rather than a release gate you run once.

Reasoning-model attacker success, by model

The per-model spread is wide enough that vendor selection is a security decision, not just a capability decision.

Maximum harm score by attacker model (Nature Communications, Feb 2026)DeepSeek-R190%Grok 3 Mini87.14%Gemini 2.5 Flash71.43%Qwen3 235B12.86%
Maximum harm score by attacker model (Nature Communications, Feb 2026)

On the defensive side, Claude 4 Sonnet held a 2.86% maximum harm score with a 50.18% refusal rate, roughly 31 times more resistant than DeepSeek-V3, the most vulnerable target. GPT-4o sat in the middle at 61.43%.

These are point-in-time measurements against a specific attack set, so treat them as relative signals, not absolutes. The next model release reshuffles the ranking.

How do you operationalize jailbreak evaluation?

Evaluation produces data. Program value comes from turning data into deployment gates, vendor selections, and remediation priorities. A mature program has four integration points.

Shift evaluation into the development pipeline

Treat jailbreak evaluation as a gate, not an audit. Run pre-training data screening for data-level vulnerabilities, post-training evaluation for emergent behaviors, pre-deployment assessment for final risk determination, and post-deployment monitoring for novel attacks. Each stage produces different findings, and catching a vulnerability post-training is orders of magnitude cheaper than catching it in production.

Wire it into existing red-team tooling

JEF provides integration pathways with NVIDIA garak and Microsoft PyRIT. Promptfoo and the UK AI Safety Institute's Inspect support similar workflows. The Hacken AI Red Team methodology gives a structured engagement template that includes jailbreak evaluation as a component.

The goal is to fold jailbreak testing into red-team operations you already run, not stand up a parallel program.

Benchmark with severity, not just success

When you benchmark model robustness, match the framework to the use case and read the scores through a severity lens. A model with 40% ASR against Critical harms is a worse risk than 60% ASR against Low harms.

Document baselines before any model update so you can attribute regressions to a specific change. AILuminate's Resilience Gap is the metric to watch for temporal drift, because static benchmark snapshots go stale as attacker capability improves.

Prioritize findings on five factors

  • Severity score. Higher scores get attention first, regardless of current exploitability.
  • Exploitability. Readily executable attacks are near-term risk even at moderate severity.
  • Exposure. Widely deployed models carry more risk surface than limited pilots.
  • Transferability. Cross-provider attacks affect portfolio risk, not just one deployment.
  • Defensive maturity. Models behind strong input filtering and output monitoring can tolerate more model-level vulnerability than bare API deployments.

What this means for you

Three decisions to make this quarter.

First, assume reasoning-model attackers in your threat model. Your evaluation should treat frontier reasoning capability as the baseline adversary, not a hypothetical. If your test suite only uses static prompt lists, you are testing against the easy case.

Second, replace ASR-only reporting with severity scoring. Pull the JEF calculator or implement the three dimensions yourself. Report Critical and High findings to leadership the way you would report a CVSS 9.0 software vulnerability.

The June 2026 White House to Anthropic collaboration on severity benchmarks incorporates capability reach as a release-decision dimension, which signals where this is heading regulatorily.

Third, run multi-turn, multimodal, multi-framework evaluation. Single-turn English text-only benchmarks are the minimum viable evaluation and no longer sufficient. Pair HarmBench for coverage with StrongREJECT for output quality, add AILuminate for an industry baseline, and score everything through JEF for prioritization.

Honest limits of this methodology

Benchmark success rates are proxy metrics. They measure whether a model produces certain outputs, not whether those outputs cause real harm. Production deployments layer input filters, output monitoring, rate limiting, and human oversight that benchmark ASR does not account for, so treat ASR as an upper bound.

The correlation between benchmark performance and real-world harm outcomes is not established. That does not make evaluation useless. Preventing model outputs that could enable harm is valuable regardless of whether harm ultimately occurs. It does mean you should pair severity scores with contextual exposure analysis before blocking a release.

Disclosure is the other open problem. The Nature study's techniques are reproducible by anyone with API access to a reasoning model, which breaks the traditional coordinated-disclosure model where the vendor patches before the public learns. Anthropic's coordinated vulnerability disclosure program and the broader AI disclosure norms are still forming.

Have a disclosure policy before you start finding vulnerabilities, not after.

Sources

Frequently asked questions

What is an AI jailbreak evaluation framework?

It is a structured methodology for measuring, scoring, and comparing how resistant an LLM is to jailbreak attacks. Mature frameworks like HarmBench, StrongREJECT, and Mozilla's JEF combine attack coverage, success-rate measurement, and severity scoring so teams can prioritize fixes rather than just count failures.

Why are reasoning models more effective at jailbreaking?

Reasoning models apply the same step-by-step problem-solving they use for legitimate tasks to bypassing safety guardrails, including multi-turn persuasion. A February 2026 Nature Communications study found reasoning-model attackers reached a 97.14% success rate across 2,520 attack combinations, versus near-zero for non-reasoning controls.

How is jailbreak severity scored?

Mozilla's Jailbreak Evaluation Framework (JEF) adapts CVSS-style scoring to jailbreaks using three dimensions: Blast Radius (potential harm), Retargetability (how transferable the attack is), and Output Fidelity (how actionable the harmful output is). The composite yields a 0 to 10 score mapped to Critical, High, Medium, and Low severity bands.

Which jailbreak benchmark should I use?

Use HarmBench for broad automated coverage (510+ behaviors, 18 attack methods), StrongREJECT when you care about output quality and the empty-jailbreak problem, JailbreakBench for reproducible academic-grade evaluation, and MLCommons AILuminate for an industry-standard multimodal baseline. Run more than one; no framework covers every attack surface.

Does a high benchmark success rate mean real-world harm?

Not directly. Benchmarks measure proxy outcomes, and production deployments layer input filters, output monitoring, and human oversight that reduce real exploitation. Treat benchmark ASR as an upper bound and combine it with severity scoring and contextual exposure analysis before making deployment decisions.