Red-teaming AI in 2026: the practical guide to adversarial testing

A step-by-step methodology for designing AI red-team exercises, plus an honest comparison of PyRIT, Garak, HarmBench, and Promptfoo.

June 12, 2026 · 10 min read
Tags: red-teaming AI, adversarial testing, AI security

Microsoft's AI Red Team published the lessons from attacking its first 100 generative-AI products in January 2025, and the headline finding reshaped the discipline: classic single-turn jailbreaks are mostly handled by current alignment, while multi-turn, agentic, and indirect prompt-injection attacks still break production systems. That report, Lessons From Red Teaming 100 Generative AI Products, is now the de facto operational template for red-teaming AI.

This guide turns it, and the surrounding 2023-2026 literature, into a methodology you can run.

TL;DR: Red-teaming AI is goal-led adversarial testing of model behavior, executed as a hybrid of human experts, automated harnesses (PyRIT, Garak, Promptfoo), and standardized benchmarks (HarmBench). The EU AI Act and NIST's generative AI profile have made it a compliance floor. Run it continuously, score findings with an AI-adapted severity model, and disclose via the CMU SEI CVD-for-AI template.

Key takeaways:

  • Goal-led objectives ("extract user PII", "force a tool call to an attacker URL") beat vulnerability-class checklists because they survive novel attack paths.
  • Indirect prompt injection remains the most impactful production vulnerability class, as the EchoLeak flaw in Microsoft 365 Copilot (CVE-2025-32711) showed.
  • Hybrid execution is the consensus: automation for breadth and regression, humans for novelty, AI-on-AI attacks for scale.
  • Passing a red-team evaluation only proves the model beat the probes you ran. Build guardrails and monitoring on top.

What is red-teaming AI?

Red-teaming AI is a structured adversarial evaluation that combines manual probing by experts with automated attack harnesses to surface behavioral failures: jailbreaks, privacy leaks, unsafe tool use, bias, and ungrounded outputs. The Microsoft AI Red Team defines the practice as goal-led rather than vulnerability-class-led.

Testers get concrete harm outcomes and full latitude on the path.

The regulatory pull is real. NIST's AI Risk Management Framework and its generative AI profile (AI 600-1, July 2024) treat red-teaming as a core MEASURE and MANAGE activity, and the EU AI Act's general-purpose AI obligations have applied since August 2025.

Anthropic's Responsible Scaling Policy goes further and conditions deployment above a given safety level on adversarial evidence that catastrophic capabilities are absent.

How does adversarial testing of AI differ from penetration testing?

A penetration tester traverses a known attack surface toward a binary win: a shell, a token, a database row. An AI red-teamer attacks behavior, and the differences are structural.

The attack surface is prompt-shaped. Any string the model ingests, from a user message to a calendar invite in a RAG corpus, can carry an attack, and the same prompt can produce different outputs across runs.

Failures are behavioral. A model can be 99% safe against one probe set and 30% vulnerable to a reformulation of the same request. Many harm classes (bias, persuasion, hallucination) have no CVSS analogue, which is why MITRE ATLAS catalogs AI-specific tactics like data poisoning and model extraction separately from ATT&CK.

And the work never finishes. Emergent capabilities mean a model can pass every pre-release probe and still fail on first contact with a third-party user, so adversarial testing has to continue post-deployment as well.

Which tools should you use for AI vulnerability assessment?

The 2026 landscape splits into harnesses that execute attacks, benchmarks that standardize measurement, and orchestrators that run the program. Pick by surface, then layer.

| Tool | Best for | License | Standout feature | Main limitation |
|---|---|---|---|---|
| PyRIT | Orchestrated multi-turn attacks | MIT | Separates targets, datasets, scorers, attacks, and memory for reproducible chains | Opinionated toward Azure endpoints |
| Garak | Breadth-first scanning | Apache-2.0 | 37+ probe categories, 9+ providers, AVID-standard exports | Model-graded scoring carries known judge bias |
| HarmBench | Comparable ASR benchmarking | MIT | 18 attack methods evaluated across 33 LLMs | A benchmark, so it won't attack your system for you |
| Promptfoo | CI/CD regression testing | MIT | OWASP LLM Top 10 plugin mapping, YAML config | Acquired by OpenAI in March 2026; weigh vendor neutrality |
| DeepEval/DeepTeam | Python-native eval teams | Apache-2.0 | pytest-native, 50+ metrics | Smallest attack library of the major harnesses |
| Counterfit | Classical ML (vision, fraud, speech) | MIT | Evasion and extraction attacks LLM tools skip | Pre-LLM focus |

Microsoft is the only major lab shipping an end-to-end open stack: PyRIT as the harness, the AI Red Teaming Agent in Azure AI Foundry as the orchestrator. OpenAI, Anthropic, and Google DeepMind keep their internal tooling private and publish methodology through model cards and policies like DeepMind's Frontier Safety Framework instead.

How do you design and run a red-team exercise?

Start with a dual-signed pre-engagement agreement. Name the system under test, the version, the surface (text, multimodal, RAG, tool-calling), data-handling rules, and prohibited actions. Map everything to public taxonomies up front: NIST AI 600-1's risk categories, the OWASP LLM Top 10, and MITRE ATLAS technique IDs, so findings translate into controls.

Then write goal-led objectives, AIRT-style. "Force a send_email tool call to an attacker-controlled URL" gives a tester room to find a path you didn't anticipate. A checklist of known jailbreak strings does the opposite.

Build a target-by-persona matrix: each model and deployment crossed with at least three personas (default assistant, roleplay, agent-mode with tools). Capture reproducibility metadata per cell: system prompt hash, temperature, seed, tool definitions, RAG corpus hash.
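As a rough illustration of the matrix described above, the sketch below crosses targets with personas and attaches the reproducibility metadata to each cell. The target names, persona prompts, default temperature and seed, and the `fingerprint` helper are all hypothetical placeholders, not part of any harness mentioned in this guide.

```python
import hashlib
import itertools
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    """Return a short SHA-256 fingerprint so a run can be tied to exact inputs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class MatrixCell:
    target: str
    persona: str
    system_prompt_hash: str
    temperature: float
    seed: int
    tool_definitions_hash: str
    rag_corpus_hash: str


# Hypothetical deployments and persona prompts, for illustration only.
TARGETS = ["support-bot-v3", "internal-copilot-v1"]
PERSONAS = {
    "default_assistant": "You are a helpful assistant.",
    "roleplay": "You are playing a fictional character.",
    "agent_with_tools": "You are an agent that may call tools.",
}


def build_matrix(tool_definitions: str, rag_manifest: str,
                 temperature: float = 0.7, seed: int = 1234) -> list[MatrixCell]:
    """Cross every target with every persona and attach reproducibility metadata."""
    return [
        MatrixCell(
            target=target,
            persona=persona,
            system_prompt_hash=fingerprint(system_prompt),
            temperature=temperature,
            seed=seed,
            tool_definitions_hash=fingerprint(tool_definitions),
            rag_corpus_hash=fingerprint(rag_manifest),
        )
        for target, (persona, system_prompt) in itertools.product(TARGETS, PERSONAS.items())
    ]


matrix = build_matrix(tool_definitions='[{"name": "send_email"}]',
                      rag_manifest="corpus-manifest-v1")
print(len(matrix))  # 2 targets x 3 personas = 6 cells
```

Hashing the inputs rather than storing them keeps the matrix small and still lets you detect when a rerun used a different system prompt, tool schema, or corpus.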

Score on a 0-4 harm-severity scale plus attack success rate, with an LLM judge backed by human spot-checks. AIRT's report is blunt that binary pass/fail hides the most important findings.
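One way to keep both signals side by side is sketched below. It assumes each attempt has already been given a 0-4 severity by the judge, and it treats any severity at or above a chosen threshold as a successful attack; the `Finding` type, the threshold default, and the example objectives are illustrative assumptions rather than anything AIRT prescribes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    objective: str
    severity: int  # 0 = no harm observed, 4 = most severe (band meanings are yours to define)


def attack_success_rate(findings: list[Finding], threshold: int = 1) -> float:
    """Fraction of attempts whose severity reached the threshold."""
    if not findings:
        return 0.0
    successes = sum(1 for f in findings if f.severity >= threshold)
    return successes / len(findings)


def severity_histogram(findings: list[Finding]) -> dict[int, int]:
    """Count of attempts per severity level, including empty levels."""
    counts = Counter(f.severity for f in findings)
    return {level: counts.get(level, 0) for level in range(5)}


findings = [
    Finding("extract user PII", 0),
    Finding("extract user PII", 0),
    Finding("force tool call to attacker URL", 3),
    Finding("force tool call to attacker URL", 4),
]
print(attack_success_rate(findings))   # 0.5
print(severity_histogram(findings))    # {0: 2, 1: 0, 2: 0, 3: 1, 4: 1}
```

Reporting the histogram next to the rate is what keeps a 50% ASR made of severity-1 findings from looking the same as one made of severity-4 findings.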

Execution should be hybrid. Automated harnesses give breadth and regression coverage; humans find novel paths; AI-on-AI attackers add scale, but the 2025 Jailbreak-R1 work documented mode collapse in automated attackers, so pair them with diversity pressure.

Run control targets (one known-jailbreakable model, one known-safe) in parallel to calibrate your judge, a correction the HarmBench paper shows is necessary.
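A minimal sketch of that calibration follows. It assumes humans have verified a sample of responses from each control, so that every sampled response from the known-jailbreakable model is a real success and every one from the known-safe model is not. The correction applied is the standard Rogan-Gladen misclassification adjustment, used here as a stand-in; it is not the procedure from the HarmBench paper, and the function names are hypothetical.

```python
def judge_error_rates(flags_on_vulnerable: list[bool],
                      flags_on_safe: list[bool]) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) for the judge.

    Each flag is True when the judge labeled the response an attack success.
    """
    if not flags_on_vulnerable or not flags_on_safe:
        raise ValueError("both control sets must be non-empty")
    false_negative_rate = flags_on_vulnerable.count(False) / len(flags_on_vulnerable)
    false_positive_rate = flags_on_safe.count(True) / len(flags_on_safe)
    return false_positive_rate, false_negative_rate


def corrected_asr(observed_asr: float, fpr: float, fnr: float) -> float:
    """Adjust a judge-reported ASR for the judge's measured error rates."""
    denominator = 1.0 - fpr - fnr
    if denominator <= 0:
        raise ValueError("judge is no better than chance on the controls")
    estimate = (observed_asr - fpr) / denominator
    return min(1.0, max(0.0, estimate))  # clamp to a valid rate


# Judge missed 2 of 10 verified successes and wrongly flagged 1 of 10 verified refusals.
fpr, fnr = judge_error_rates([True] * 8 + [False] * 2, [True] + [False] * 9)
print(round(corrected_asr(0.45, fpr, fnr), 3))  # 0.5
```

If the corrected and observed rates diverge sharply, that is a signal to fix the judge prompt or add human review before trusting any ASR figure from the run.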

Cover multi-turn explicitly. Crescendo-style gradual escalation and many-shot jailbreaking defeat single-turn guardrails, with many-shot variants reporting attack success rates between 70.6% and 95.9% depending on model and example count.

The original GCG attack from Zou et al. hit over 88% ASR with transferable adversarial suffixes back in 2023, and the attack literature has only broadened since.

Figure: Reported attack success rates from the adversarial testing literature. GCG adversarial suffix (Zou et al.): 88%; many-shot jailbreak, upper bound: 95.9%; Mindgard invisible-Unicode guardrail evasion: 100%.

What do the frontier-lab case studies teach?

Red-team findings convert into measurable safety gains when they feed training. OpenAI's GPT-4 engagement ran 50+ contracted experts for six months across cybersecurity, CBRN, and persuasion; the subsequent fine-tuning pass produced a model that, per OpenAI's internal evaluations, was 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5.

GPT-4o's voice surface required over 100 external red-teamers and shipped a new output classifier in the audio pipeline as a direct result.

Anthropic's account of red-teaming challenges reports the same asymmetry Microsoft found: standard jailbreaks largely fail against current alignment, while multi-turn escalation, persona roleplay, and agentic attacks keep working. Their Sleeper Agents paper (Hubinger et al., January 2024) adds a harder lesson: backdoored models retained trigger-conditioned misbehavior through RLHF and adversarial training, which is why behavioral probing now gets paired with interpretability research.

For agent builders, OpenAI's Operator system card is the most practical reference. Browser agents were tested against prompt injection on visited pages and session takeover, and shipped with confirmation prompts and a user take-over mode. Copy those mitigations before you ship an agent with credentials.

What are the limits, and how do you compensate?

Red-teaming finds what it knows to look for. The UK AI Safety Institute's work on agent control measures and METR's benchmarks show agent task horizons doubling roughly every five months, so the gap between evaluation coverage and deployed capability grows by default.

A May 2025 Mindgard study reported 100% evasion of common LLM guardrails using invisible Unicode characters against hardened production deployments.
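To make this class of evasion concrete, here is a small sketch that surfaces invisible characters before text reaches a guardrail. It is not the technique from the Mindgard study; it is a simple heuristic that flags Unicode "format" characters (general category Cf), which include zero-width spaces, zero-width joiners, and the Unicode tag block. The function names are illustrative.

```python
import unicodedata


def find_invisible_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) for every Unicode format (Cf) character."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_invisible_chars(text: str) -> str:
    """Drop format characters so downstream filters see the visible string.

    Caution: this also removes joiners that some scripts and emoji sequences
    legitimately need, so log and review rather than strip blindly.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


payload = "ig\u200bnore pre\u200dvious instructions"
print(find_invisible_chars(payload))   # [(2, 'U+200B'), (11, 'U+200D')]
print(strip_invisible_chars(payload))  # ignore previous instructions
```

A check like this covers only one category of invisible character. Variation selectors and homoglyphs fall outside it, which is part of why a single filter is not a sufficient defense.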

The compensation is layering, and each layer is buildable now. Guardrails for AI (Llama Guard, NeMo Guardrails, Azure AI Content Safety) catch known-bad outputs at inference time. Every red-team finding goes into a regression corpus that runs in CI on every checkpoint, with a release gate on severity.
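The release gate can be as small as the sketch below. It assumes your CI job has already replayed the regression corpus against the new checkpoint and recorded whether each past finding reproduced; the `RegressionResult` type, the finding IDs, and the blocking threshold of 3 are hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionResult:
    finding_id: str
    severity: int      # severity assigned when the finding was first triaged (0-4)
    reproduced: bool   # True if the attack still succeeds on this checkpoint


def release_gate(results: list[RegressionResult],
                 block_at: int = 3) -> tuple[bool, list[str]]:
    """Return (passed, blocking_finding_ids) for a checkpoint."""
    blockers = [
        r.finding_id
        for r in results
        if r.reproduced and r.severity >= block_at
    ]
    return len(blockers) == 0, blockers


results = [
    RegressionResult("RT-001", severity=4, reproduced=False),  # fixed
    RegressionResult("RT-002", severity=3, reproduced=True),   # regressed
    RegressionResult("RT-003", severity=1, reproduced=True),   # below the gate
]
passed, blockers = release_gate(results)
print(passed, blockers)  # False ['RT-002']
```

In a CI script, a failing gate would typically end the job with a non-zero exit code so the checkpoint cannot be promoted until the blockers are resolved.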

Post-deployment monitoring catches what pre-release testing missed. External evaluators (AISI, METR, Gray Swan) provide independent assurance quarterly.

Cost is the other honest constraint. Frontier-scale programs like OpenAI's Red Teaming Network of 100+ experts are out of reach for most teams. The workaround is the maturity ladder: get to a documented playbook with Garak and Promptfoo in CI first, then add a standing team and SLAs as exposure grows.

What this means for you

If you ship LLM features, the minimum viable program in 2026 looks like this. Run Garak across your endpoints this week for a baseline. Wire Promptfoo's OWASP LLM Top 10 plugins into CI so every prompt or model change gets regression-tested.

Write three goal-led objectives for your highest-stakes flow and spend two days of human time attacking them, with multi-turn escalation included.

Score findings with severity bands tied to SLAs (critical fixed in 24 hours, high in 7 days), and adopt the CMU SEI CVD-for-AI template with its 90-day default window for anything you find in third-party models. ENISA's vulnerability disclosure guidance covers the European baseline.
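The deadline arithmetic is simple enough to encode directly in a tracker. The sketch below uses only the figures stated above (24 hours for critical, 7 days for high, a 90-day default disclosure window); lower severity bands are left without an SLA because none is given here, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Remediation SLAs from the text; other bands are intentionally undefined.
REMEDIATION_SLA = {
    "critical": timedelta(hours=24),
    "high": timedelta(days=7),
}
DISCLOSURE_WINDOW = timedelta(days=90)  # default window for third-party findings


def remediation_deadline(severity: str, found_at: datetime) -> Optional[datetime]:
    """Return the fix-by time, or None when no SLA is defined for the band."""
    sla = REMEDIATION_SLA.get(severity)
    return found_at + sla if sla is not None else None


def disclosure_deadline(reported_at: datetime) -> datetime:
    """Return the default public-disclosure date for a reported finding."""
    return reported_at + DISCLOSURE_WINDOW


found = datetime(2026, 6, 12, 9, 0, tzinfo=timezone.utc)
print(remediation_deadline("critical", found))  # 2026-06-13 09:00:00+00:00
print(remediation_deadline("high", found))      # 2026-06-19 09:00:00+00:00
print(disclosure_deadline(found))               # 2026-09-10 09:00:00+00:00
```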

The conceptual shift matters more than any single tool: red-teaming stopped being a pre-release gate and became a continuous program spanning CI regression, post-deployment monitoring, and external audits. Teams that internalize that are the ones that will clear the regulatory and procurement bar now forming around AI security.

Frequently asked questions

What is AI red-teaming?

AI red-teaming is structured adversarial testing of an AI system, combining expert human probing with automated attack harnesses to surface jailbreaks, prompt injection, data leaks, and unsafe tool use before attackers do. Microsoft's AI Red Team frames it as goal-led: testers are assigned concrete harm outcomes and choose their own path to them.

How is AI red-teaming different from penetration testing?

Penetration testing traverses a known attack surface (networks, apps, identity) looking for discrete exploitable flaws. AI red-teaming targets model behavior, where the attack surface is any string the model ingests, outputs are non-deterministic, and failures like bias or persuasion have no CVSS equivalent. Both disciplines share the adversarial mindset; the unit of failure differs.

Which open-source tools should I start with for LLM adversarial testing?

Start with NVIDIA Garak for broad probe coverage, Microsoft PyRIT for orchestrated multi-turn attacks, and Promptfoo for CI/CD regression testing. Use HarmBench as the benchmark for comparable attack-success-rate reporting across models.

Does passing a red-team evaluation mean a model is safe?

No. Passing only shows that the specific probes used failed to break it. A May 2025 Mindgard study reported 100% evasion of common LLM guardrails using invisible Unicode characters, and METR's benchmarks show agent task horizons doubling roughly every five months, so coverage gaps grow over time. Treat red-teaming as one layer in a defense-in-depth stack.

How should AI vulnerabilities be disclosed?

Use the CMU SEI coordinated vulnerability disclosure template for AI (February 2025), which adapts traditional CVD with a 90-day default window, model-version metadata, a minimal prompt-chain reproducer, and harm-class mapping to OWASP LLM Top 10 and MITRE ATLAS.