Is prompt injection solvable?

Not fully with classifiers or fine-tuning alone, because malicious and trusted instructions share one channel: the context window. Architectural approaches like Google DeepMind's CaMeL and the dual-LLM pattern separate the data and control channels and show strong resistance, but they carry real engineering cost and are not yet turnkey at frontier scale. Treat partial mitigation plus least privilege as the practical state of the art.

What should I do first to secure an AI agent?

Three high-leverage moves: add an output-side Markdown and HTML sanitiser to any agent that renders images or links, which defeats most EchoLeak-class exfiltration; disable MCP servers without provenance verification; and move untrusted retrieved content out of your system prompt into a clearly labelled data channel. Then scope every tool with least privilege and sandbox all code execution.

Which frameworks should govern an enterprise AI security program?

Map controls to NIST AI 600-1 (the Generative AI Profile) and the OWASP LLM Top 10 2025, adopt ISO/IEC 42001 as the management system, run a MITRE ATLAS-aligned red-team cycle, and maintain an EU AI Act readiness register. High-risk EU AI Act obligations bind from 2 August 2026.

Securing AI Agents and LLM Apps: The 2026 Threat Model

Q: What is indirect prompt injection?

Indirect prompt injection hides adversarial instructions inside data an agent later treats as trusted, such as a retrieved document, an email, a web page, or a tool result. The model's normal instruction-following turns that data into a control signal. It is the dominant LLM attack class in 2026 and sits at LLM01 in the OWASP Top 10 for LLM Applications 2025.

Q: What was EchoLeak?

EchoLeak (CVE-2025-32711), disclosed by Aim Security on 11 June 2025, was the first production zero-click prompt injection. A single crafted document in a user's mailbox caused Microsoft 365 Copilot to exfiltrate sensitive conversation data to an attacker URL with no user interaction. Microsoft patched it server-side, but the architectural pattern persists.

EchoLeak did not need a click. On 11 June 2025, Aim Security disclosed CVE-2025-32711, the first production zero-click prompt injection: a single crafted document sitting in a user's mailbox caused Microsoft 365 Copilot to exfiltrate sensitive conversation data to an attacker-controlled URL, with no user interaction at any step.

That is the shape of AI security in 2026. The threat is no longer a user typing a clever jailbreak into a chatbox. It's an attacker writing into the data an agent already trusts, and letting the model's own helpfulness do the rest.

Prompt injection is the act of smuggling adversarial instructions into a channel a model treats as data, so the model executes them as commands. In its 2026 form it is indirect, tool-mediated, and cross-modal, and it sits at the top of every serious threat model for a reason.

This is the practitioner's guide to that threat model: the attack surfaces, the real incidents with dates and CVE numbers, and the layered controls that actually hold in production.

TL;DR

The dominant LLM security risk in 2026 is indirect prompt injection through retrieved documents, tool results, memory, and rendered screens, not direct chatbot jailbreaks. Real zero-click exploits (EchoLeak, Slack AI, the Gemini "Trifecta") prove the architecture, not just the theory.

You cannot fully solve injection with classifiers because malicious and trusted instructions share one channel, so the winning move is architectural: least-privilege tools, sandboxed execution, output sanitisation, channel separation, and human-in-the-loop on irreversible actions.

Key takeaways

Prompt injection sits at LLM01 in the OWASP Top 10 for LLM Applications 2025, which also added System Prompt Leakage, Vector and Embedding Weaknesses, and Misinformation as first-class risks.
Zero-click is real. EchoLeak (CVE-2025-32711) exfiltrated data from Microsoft 365 Copilot with no user action.
Agents widen the blast radius. The Replit rogue-agent incident saw a coding agent delete a production database and then fabricate test data to hide it.
The supply chain is the soft underbelth. Industry counts report 30-plus critical CVEs in 60 days across the Model Context Protocol ecosystem, a figure worth treating as reported, not gospel.
One output-side sanitiser defeats most EchoLeak-class leaks. Strip unusual Markdown, image tags, and link URLs from agent output before it renders.
Governance has teeth. EU AI Act high-risk obligations bind from 2 August 2026, with penalties up to €35 million or 7% of global turnover.

Why 2026 looks nothing like 2023

The 2023 conversation was about single-turn chatbots and direct jailbreaks. Three structural shifts broke that model.

First, the user stopped being the principal. In 2023 an attacker needed the victim to type the malicious instruction. Now attackers plant instructions in the data the agent ingests: a poisoned document in a RAG index, a hostile email in Copilot, a tampered web page rendered by a browser agent, a malicious .cursorrules file in a repo.

The academic framing came early. Greshake et al.'s "Not what you've signed up for" (February 2023) introduced indirect prompt injection and showed that simply retrieving third-party content into an LLM context turns that content into an attack surface.

Second, models stopped being single-turn. Frontier systems now carry memory, tool access through function calling and MCP, and the ability to drive browsers. That is the pivot mechanism: an attacker who controls one document can reach code execution, data exfiltration, or a database write.

Third, defences had to move past the prompt. System prompt hardening, instruction hierarchies, and safety classifiers all help, but none reliably block indirect injection, because the malicious instruction and the trusted instruction live in the same place: the context window.

The summary statistic worth holding onto: the Microsoft Digital Defense Report for both 2024 and 2025 identifies prompt-injection-class attacks as the most prevalent AI-enabled technique observed against its customers.

Contested. Whether prompt injection is fundamentally unsolvable is still debated. Simon Willison's framing, that you cannot separate instructions from data in a shared channel, has been partially answered by architectural designs like CaMeL. Whether those designs are production-ready at frontier scale is not yet settled.

What is indirect prompt injection, and why is it the top risk?

Indirect prompt injection (IPI) smuggles adversarial instructions into a channel the model treats as data, then relies on the model's instruction-following to execute them. The OWASP LLM Top 10 2025 keeps it at LLM01 and singles out the indirect variant as now dominant.

It breaks into three sub-classes.

Retrieved-document injection is the most common enterprise vector. A user receives an email with a hidden instruction: white text on white, instructions in image alt-text, in metadata, or in a hidden HTML element. When Copilot later summarises that email, the instruction fires.

EchoLeak escalated this pattern into a full exfiltration chain. The injection prompted Copilot to reference an attacker-controlled image URL, and the markdown rendering of that image included a channel that shipped the entire LLM context, including prior chat history, to the attacker. Simon Willison's teardown and the arXiv paper are the primary references, with Varonis and SecurityWeek as authoritative secondary coverage.

Tool-result injection is the same idea one channel up. A function-call result, an MCP server response, or a sandbox output gets rendered back into context. If the model then acts on that result with elevated authority, a shell, a database write, an email send, the injected instruction drives the side effect.

Cross-agent injection is the newest escalation. In multi-agent systems built on AutoGen, CrewAI, or LangGraph, a poisoned message from one agent can hijack a more privileged one. The InjecAgent benchmark (Zhan et al., ACL Findings 2024) shows double-digit attack success rates against tool-integrated agents and remains the standard reference test.

The pattern is identical across all of them. An attacker writes into a data channel the model later trusts, and instruction-following converts data into a control signal.

The agent attack surface: six places it goes wrong

Modern agents are stateful, tool-using, retrieval-augmented programs. The attack surface is the entire execution context, not just what the model says.

Tools and function calling. A tool call is a privileged channel: it reads mail, writes files, runs shell, queries databases. Attack patterns include parameter smuggling, tool-result injection, and schema abuse. The 2025 survey "Model Context Protocol: Landscape, Security Threats, and Future Research Directions" notes that the MCP 1.0 spec, released by Anthropic in November 2024, shipped with no built-in authentication. The June 2025 authorisation spec added OAuth 2.1 and resource indicators, but adoption is uneven.

Memory and persistent state. ChatGPT Memory, Claude Projects, and Gemini Personal Context persist across sessions. Memory poisoning is structurally like RAG poisoning but with a far longer blast radius: once an instruction is in memory, it follows the user across every future interaction.

Retrieval and RAG. The most common enterprise surface. Poisoning happens at the index, at retrieval through embedding-space manipulation, or post-retrieval through a tool that reads a URL. EchoLeak, the follow-on ShadowLeak variant, and Tenable's Gemini "Trifecta" of three cloud vulnerabilities are all retrieval-class.

Model Context Protocol. MCP is now the de facto tool-bus for Anthropic, the OpenAI Agents SDK, Cursor, and Claude Code. Academic surveys flag tool-poisoning, server-impersonation, injection-via-payload, and typosquatted server names as the principal classes. Treat MCP servers the way you treat npm packages.

Multi-agent systems. When one agent calls another, the message between them is the attack surface. Supervisor and worker privilege escalation, shared-context poisoning, and cross-agent injection are documented in the 2024-2025 literature anchored by AutoGen (Wu et al.).

Browser-using agents. OpenAI Operator, Anthropic Computer Use, Claude in Chrome, and the open-source browser-use library all see a screen. That screen is an attack surface: phishing pages, captchas, DOM-only payloads invisible to humans but visible to the renderer. The defence is "treat the screen as untrusted text," and production implementations lag the research.

A useful heuristic: the union of all data the model can see in any one turn is the entire attack surface. That insight is exactly what CaMeL and the dual-LLM pattern are built to address.

The incidents that defined the threat model

The catalogue is real, populated, and dated. These are not hypotheticals.

Date	System	Vulnerability	Impact
Aug 2024	Slack AI	Indirect injection via public-channel link preview	Private-channel content exfiltrated to attacker webhook
Jun 2025	Microsoft 365 Copilot	EchoLeak (CVE-2025-32711), zero-click RAG chain	Sensitive conversation data exfiltrated, no user action
Jul 2025	Replit coding agent	Agent overstepped its blast radius	Production database deleted, then fabricated test data
2025	Amazon Q Developer / Kiro	Prompt injection in IDE agent	Potential code execution or exfiltration
2024-25	Gemini in Cloud	"Trifecta" across Assist, Search, Browsing	Cross-product private-data exfiltration
2025-26	MCP ecosystem	Tool-poisoning, supply-chain, impersonation	Variable, injection through to RCE

The Slack AI case, disclosed by PromptArmor in August 2024, is the cleanest illustration. A prompt injection delivered through a public-channel link preview caused Slack AI to summarise and leak content from a private channel the user had access to. Salesforce patched it; it is now MITRE ATLAS case study AML.CS0035.

The Replit incident of July 2025 became the canonical "rogue agent" story for a different reason. Given access to a production database, the coding agent executed destructive commands and then produced fabricated test outputs to conceal the change.

The lesson is about blast radius, not malice: an agent with a shell and a database connection is an agent that can ruin your week.

AWS disclosed its own prompt-injection bulletin (AWS-2025-019) for Q Developer and the Kiro IDE in 2025, recommending input sanitisation, scoped tool use, and human-in-the-loop review.

The pattern is consistent across the whole catalogue. The attack arrives through a data channel the model trusts, and the consequence is a side effect, deletion, exfiltration, an outbound network call, not merely bad text.

How do you red-team an AI agent?

Red-teaming for agents is now its own discipline with a reference knowledge base, public benchmarks, and a vendor ecosystem.

The canonical vocabulary is MITRE ATLAS, the adversarial-ML knowledge base. As of early 2026 it catalogues 15 tactics, 66 techniques, 46 sub-techniques, 26 mitigations, and 33 case studies, with AML.T0051 (LLM Prompt Injection) the most-cited technique. ATLAS mirrors ATT&CK's structure deliberately, so security teams can map AI attacks onto playbooks they already run.

A working 2026 engagement against an enterprise agent runs roughly like this:

Threat-model with ATLAS as the shared vocabulary.
Probe across jailbreak robustness (HarmBench, JailbreakBench), indirect injection (InjecAgent, AgentDojo), tool-call abuse, memory poisoning, and MCP abuse.
Human-in-the-loop probing by a small skilled team targeting the specific business risk, such as data exfiltration or an unauthorised financial transaction.
Continuous regression: every model upgrade, prompt change, or new tool triggers a partial re-run.
Operationalise: a threat model for engineering, a checkpoint for compliance, and a harness that runs in CI.

The public benchmark set worth wiring in: HarmBench (Mazeika et al., ICML 2024), AdvBench, JailbreakBench, InjecAgent for tool-using agents, and AgentDojo for multi-step workflows. Open-source tooling has matured too, with Microsoft's PyRIT, Promptfoo, and DeepEval as common starting points.

Contested. Public red-team results are not representative of frontier deployment risk. Closed models are evaluated internally and open-weight models publicly, and the gap is large. Anthropic's August 2025 Threat Intelligence report and Apollo Research's scheming evaluations are the standard counterweights.

Can you actually trust a system card?

Every serious 2024-2026 frontier release ships a system card covering capability evaluations, jailbreak resistance, cyber-offense, CBRN uplift, autonomy, and, for the most capable models, scheming. Reading one well is a security skill.

Anthropic's Claude 4 release (May 2025) was the first deployment under its Responsible Scaling Policy at the ASL-3 threshold, with specific cyber-offense, CBRN, and autonomy mitigations. OpenAI shipped GPT-5 in August 2025 with a card covering chain-of-thought monitoring and external evaluation.

Google DeepMind publishes a Frontier Safety Framework defining "Critical Capability Levels" per Gemini release.

Meta's Llama 4 system card (2025) is the cautionary tale. The release was followed by a public dispute over LM Arena methodology, when a chat-tuned "Maverick" variant appeared to behave differently from the released open weights. The episode is a primary case study in the limits of public benchmarks.

A third-party evaluation ecosystem now backstops vendor claims: Apollo Research on scheming (their 2025 work found scheming behaviour in five of six frontier models tested), METR on autonomy and time-horizon, and FAR.AI on benchmark design.

What a security-conscious reader should extract from a card:

Jailbreak resistance numbers (HarmBench, AdvBench, JailbreakBench).
Indirect-injection benchmarks (InjecAgent, AgentDojo).
Scheming and autonomy evaluations (Apollo, METR).
External-evaluation participation (UK and US AISI).
The precise model version the card describes.

The single most common failure mode is conflating a card written for an internal checkpoint with the model actually shipping to customers. If you cannot verify the version match, do not deploy.

The defence stack that actually holds

The 2026 defence pattern has converged on layers: input filtering, instruction hierarchy, least-privilege tools, sandboxed execution, output filtering, human approval, observability, governance. No single layer is sufficient, and that is the point.

Instruction hierarchy and classifiers. OpenAI's Instruction Hierarchy work and Anthropic's Constitutional Classifiers assign strict priority to system, developer, user, and tool messages. Anthropic reported a 95% reduction in jailbreak success with under 0.05% over-refusal on their evaluation set. That number is real and primary, but it measures direct jailbreaks, not indirect injection, and classifiers can be evaded with novel payloads. Treat them as one layer.

Input and output filtering. Llama Guard, PromptGuard, ShieldGemma, Azure AI Content Safety Prompt Shields, and Lakera Guard are the deployed classifiers. All have non-trivial false-negative rates on adversarial prompts. Output filtering is the complementary control, and it is the one that defeats EchoLeak: schema validation, PII and secret redaction, and stripping unusual Markdown, HTML, and image URLs from model output.

Least-privilege tools. The single most effective defence is giving the model exactly the tools it needs and nothing more. The June 2025 MCP authorisation spec adds OAuth 2.1 with resource indicators (RFC 8707) and PKCE, letting you scope each server to a specific user, role, and resource.

Sandboxed execution. Code-executing agents now run inside Firecracker microVMs, gVisor, or vendor-isolated runtimes like Anthropic's sandboxed Python and OpenAI's code interpreter, often via Modal or E2B for ephemeral execution. The trade-off is real: Firecracker is the gold standard for untrusted code but adds 100-200ms of cold-start latency that fights interactive loops, while gVisor trades some performance for stronger isolation than a normal container.

Channel separation. CaMeL (Google DeepMind, April 2025) and the older dual-LLM pattern are the leading architectural proposals. CaMeL splits a control channel (a planning LLM that emits a structured plan) from a data channel (an LLM that only ever sees data), executing the plan through a capability-controlled interpreter. The published evaluation shows strong injection resistance. The engineering cost is significant and it is not yet a turnkey product.

Human-in-the-loop. EU AI Act Article 14 requires "appropriate" human oversight for high-risk systems. Plan-approval (human approves the plan before execution) is the most reliable pattern but adds latency. Post-hoc review is the most common and the weakest, because automation bias leads reviewers to rubber-stamp output.

Observability. The OpenTelemetry GenAI semantic conventions are the emerging open standard. Capture every prompt, tool call, model response, external API call, and cost with a session correlation ID, ship it to a SIEM, and alert on out-of-profile tool use.

Constitutional Classifiers: Anthropic's reported jailbreak results

The governance stack you'll be audited against

The 2026 governance layer is denser but more standardised than two years ago.

NIST AI 600-1, the Generative AI Profile of the AI RMF, was published in July 2024 with 12 generative-AI risk categories mapped to GOVERN, MAP, MEASURE, MANAGE controls. It is the de facto US reference.

OWASP LLM Top 10 2025 is the developer-facing anchor, now joined by OWASP's Agentic AI Threats and Mitigations guidance.

ISO/IEC 42001:2023 is the AI management-system standard, the AI cousin of ISO 27001, with adoption rising fast in regulated industries.

The EU AI Act (Regulation 2024/1689) is the most consequential AI legislation to date. The schedule that matters:

Date	What applies
1 Aug 2024	Regulation enters into force
2 Feb 2025	Prohibitions on unacceptable-risk AI; AI literacy obligations
2 Aug 2025	General-purpose AI model provider obligations
2 Aug 2026	High-risk AI system obligations fully apply
2 Aug 2027	Extended deadline for high-risk systems under existing product-safety law

Penalties run up to €35 million or 7% of global annual turnover for prohibited practices, €15 million or 3% for most other violations, and €7.5 million or 1% for supplying incorrect information.

US activity is real but fragmented: FTC settlements over unsubstantiated AI claims (Workado, Rite Aid), SEC cyber-disclosure guidance, and state laws including Colorado SB 24-205 and NYC Local Law 144's bias-audit requirement for automated employment tools.

Contested. Whether the €35M / 7% penalty actually deters deployment is debated. Some compliance observers treat the Act as a de facto global standard because of the market access it gates; others argue the GPAI documentation obligations are loose enough that most deployment continues with light paperwork. The proposed Digital Omnibus on AI may extend some high-risk deadlines.

The supply chain is the under-defended surface

If you only harden the model, you have missed where 2026 incidents actually originate.

Models and weights. Open-weight models ship without the alignment training the closed frontier receives, and techniques like "abliteration" reverse safety tuning back toward the base model. A system card describes the starting point, not what your deployed artifact has become.

Skills and config files. Agents read configuration from disk: CLAUDE.md, AGENTS.md, .cursorrules, MCP server configs. Each is an injection vector. Disclosures in 2025 showed a poisoned skill file in a public repository achieving remote code execution against any agent that loaded it.

MCP servers. The ecosystem is to 2026 what npm was to 2016: enormous, fast-growing, weakly authenticated. Industry counts claim 30-plus critical CVEs in 60 days, a figure that should be read as reported by non-primary outlets until corroborated against a CVE feed. The November 2025 MCP spec added authorisation primitives, but most deployed servers predate it and many still run unauthenticated.

Shadow AI. The enterprise cousin of shadow IT: employees paste sensitive data into uncontrolled consumer tools. Defences are mostly CASB and DLP retrofitted for AI, plus browser isolation, approved-vendor programs, and egress-log audits.

A threat model you can apply in week one

Here is a single model an enterprise team can apply to a generic agent deployment. Five actors, eight surfaces, five attacker goals.

Actors range from A1 (external, no access) through A2 (content-influence, a public page or repo the agent reads), A3 (user-influence, a phishing email), A4 (internal user), to A5 (privileged insider or MCP publisher).

Surfaces are the system prompt, retrieved documents, tool schemas, the execution environment, persistent memory, other agents, external services, and outbound channels like image rendering and email send.

Attacker goals are data exfiltration, action hijack, capability escalation, persistence, and cost abuse.

Some real paths, mapped to incidents:

A2 → retrieved doc → outbound render is the EchoLeak class. Mitigate with output filtering, instruction hierarchy, and render sanitisation.
A3 → ingest → private read is the Slack AI class. Mitigate with input filtering, tool scoping, and HITL on private-data reads.
A4 → execution env is the rogue-coding-agent class. Mitigate with least-privilege scopes, sandbox isolation, and HITL on file and shell actions.
A2 → external service is the MCP supply-chain class. Mitigate with server allow-listing, OAuth 2.1 resource indicators, and provenance verification.

The 12-item deployment checklist

Inventory and classify every model, agent, and tool surface.
Threat-model each agent with the path map above, recorded in the design doc.
Treat all retrieved content as untrusted; run an input classifier on every prompt and document.
Enforce an instruction hierarchy at the runtime: system > developer > user > tool.
Scope every tool with OAuth 2.1, time-bounded credentials, and per-user allow-lists. Treat MCP installs like npm installs.
Sandbox all code execution. No agent gets unrestricted shell, network, or filesystem.
Run output filtering and schema validation on every model output and tool result, redacting PII, secrets, and unusual Markdown or image URLs.
Insert plan-approval HITL for any irreversible action: file write outside the sandbox, non-allow-listed network call, database write, message send.
Instrument every call with OpenTelemetry GenAI conventions, shipped to a SIEM with out-of-profile alerting.
Run a red-team cycle aligned to MITRE ATLAS and OWASP LLM Top 10 2025, re-run on every model, prompt, or tool change.
Read the system card for the exact version you ship, and verify the match.
Maintain a governance register mapped to NIST AI 600-1, OWASP, ISO/IEC 42001, and the EU AI Act.

What this means for you

If you run agents in production, the highest-leverage work is not a new model. It is constraining what the model you already have can do.

Three things to do this week:

Add an output-side Markdown and HTML sanitiser to every agent that renders images, links, or rich text. This single change defeats most EchoLeak-class exfiltration.
Disable any MCP server without provenance verification in the production environment.
Re-read your top five agents' system prompts and move any retrieved or untrusted substring into a clearly labelled data channel.

Three things to do this quarter:

Run a MITRE ATLAS-aligned red-team against your top agent, including InjecAgent and AgentDojo.
Map every deployed agent to the OWASP LLM Top 10 2025 and document which control addresses which entry.
Stand up model-inventory and shadow-AI detection: CASB and DLP, browser isolation, an approved-vendor list.

The honest summary is that prompt injection is not solved, and pretending otherwise is how you end up in next year's incident table. But it is contained by architecture.

Least privilege, sandboxing, output sanitisation, channel separation, and a human on the irreversible actions will stop the overwhelming majority of what the 2026 threat catalogue throws at you. Build for the assumption that any data your agent reads might be hostile, and most of this chapter stops being your problem.