cluster

How much does an AI agent cost in production? The 2026 per-run math

The same 15-step coding task costs $0.77 on Gemini 3.5 Flash and $19.01 on Claude Fable 5 once retries hit. Here is the full unit-economics breakdown.

June 12, 202610 min read
ai agent cost in productioncost per agent runllm api cost optimization
How much does an AI agent cost in production? The 2026 per-run math

Uber's engineers burned through the company's $3.4 billion 2026 R&D budget in four months on Claude Code and Cursor, forcing a $1,500 per-tool monthly cap, per Simon Willison's June 3 writeup and TechCrunch's reporting. Microsoft's Engineering & Development org went further and cancelled its Anthropic licenses outright after a pilot bill ran out in spring 2026.

So the question every team building agents should be able to answer precisely, and almost none can: what does one completed agent run actually cost?

We did the math. The answer to "ai agent cost in production" in mid-2026: for a standard 15-turn, three-file coding task, expect $4.75 to $19.01 per completed run on Claude Fable 5, $1.28 to $5.14 on GPT-5.4, and $0.77 to $3.08 on Gemini 3.5 Flash.

The spread within each model comes from two multipliers most teams never model: cache hit rate and retry overhead.

TL;DR: Headline per-token rates explain less than half your agent bill. Cache hit rate (60-92% in production), retry overhead (1.5-6x), and routing discipline determine whether the same task costs under a dollar or twenty. Model your cost as a range, cap your retries, and treat caching as a default with a known breakeven.

Key takeaways

  • The reference 15-step coding task costs $4.75 on Claude Fable 5 with a 62% cache hit rate, and $19.01 once a realistic 4x retry multiplier is applied.
  • Frontier input pricing spans 5x: Gemini 3.1 Pro at $2/MTok up to Claude Fable 5 at $10/MTok.
  • Prompt caching cuts production bills 60-90%, but only above a breakeven of 1.25 reads per write on Anthropic's 5-minute TTL.
  • Routing premium models to planning only saves 78% versus all-premium, per the worked example below.
  • Budget-tier prices keep collapsing; frontier prices have flattened, and the IPO filings from OpenAI and Anthropic suggest they'll firm rather than fall.

What does one agent run actually cost?

A production coding-agent run costs between roughly $0.77 and $19.01 in June 2026, depending on model tier, cache behavior, and retries. That figure comes from a reproducible reference task: a SWE-bench-Lite-style multi-file fix spanning 15 LLM turns, three files, about 600 lines of code, with a 25K-token system prompt cached after turn one.

Here is the per-run math across three tiers, priced from first-party rate cards on June 12, 2026 (Anthropic, OpenAI, Google):

Model (tier) Base (no cache) With cache (62% hit) With 4x retry Net per completed run
Claude Fable 5 (premium, $10/$50) $7.84 $4.75 $19.01 $19.01
GPT-5.4 (mid, $2.50/$15) $2.06 $1.28 $5.14 $5.14
Gemini 3.5 Flash (budget, $1.50/$9) $1.23 $0.77 $3.08 $3.08
Cost per completed agent run, 15-step coding task (62% cache hit, 4x retry)Claude Fable 519.01$GPT-5.45.14$Gemini 3.5 Flash3.08$Routed (Fable+GPT+Gemini)4.17$
Cost per completed agent run, 15-step coding task (62% cache hit, 4x retry)

The pull-quote version: your agent bill is a multiplication problem, and the per-token rate is the smallest factor in it.

Why do the multipliers matter more than the rate card?

Because agents re-send everything, every turn. The accumulated context (system prompt, tool definitions, prior tool results, file diffs) goes back through the meter on each of those 15 turns, which is why input tokens are 95-98% of a multi-step agent's bill. Anthropic's own engineering post on building Claude Code says the harness is architected around prompt caching and the team runs SEV alerts on cache hit rate.

Token volume per task varies wildly by harness. Morph LLM's February 2026 three-way benchmark on six months of production traces found Aider averaged 105,000 tokens per task at 71% first-pass success, Cursor 104,000 at 68%, and Claude Code 479,000 at 78%. The 4.2x token gap buys a 7-point success-rate gain, which is often a good trade.

Failed runs are the expensive tail: the Stanford/MIT/UMich "How Do AI Agents Spend Your Money?" study found failed bug-fix attempts routinely exceed a million tokens and cost $30 to $100 per attempt.

Retries are the single biggest lever in the sensitivity analysis. Moving from 4x to 2x retry overhead halves the bill. Moving cache hit rate from 0% to 90% cuts it 58%. No other axis comes close.

How much does prompt caching actually save?

Production deployments report 60-90% cost reductions from caching, with hard numbers attached. Notion cut its Claude bill 90% while reducing latency up to 85% across 30+ concurrent agent tasks. HolySheep dropped its OpenAI bill from $4,200 to $680 a month, an 84% cut. Culprit cut Haiku 4.5 input cost 90% with a two-segment trick: one cached segment for the stable system prompt, one for recent conversation history.

But caching has a breakeven, and Anthropic quietly moved it. The default cache TTL dropped from one hour to five minutes in early March 2026. Tanay Shah's production analysis puts the thresholds at 1.25 cache reads per write on the 5-minute TTL and roughly 11 reads per write on the 1-hour TTL.

Below those, caching adds 0-30% to your bill, because cache writes cost 1.25-2x base input.

All three vendors now converge on 90% off cache reads. The differences live in the fine print: Anthropic has explicit TTLs and write surcharges, OpenAI auto-manages with no stated TTL, and Google charges a $1.00 per million tokens per hour storage fee that gets material for long-lived caches.

Is model routing worth it?

Routing saves 78% versus running everything on the premium model, and it has become a corporate default. CNBC reported on June 5 that model routing is now standard practice for managing AI overspend, and the academic grounding goes back to RouteLLM, which showed 2x+ cost cuts with no quality loss.

Applied to our reference task: Fable 5 handles the planning step only, GPT-5.4 takes the five code-editing steps, and Gemini 3.5 Flash runs the other nine (reads, tests, lint, format, finalize). The routed run lands at $1.04 with cache and $4.17 with 4x retry, against $19.01 for all-Fable 5.

One honest caveat: routing does not beat simply running everything on Gemini 3.5 Flash at $3.08. The reason most teams can't do that is the failure tail. Complex planning and multi-file refactors need a frontier model to succeed on the first attempt, and the 3-30x failed-attempt multiplier swamps the savings from a cheaper planner.

Route premium to the steps where failure is expensive, budget to the steps where it isn't.

Context management is the third lever and it compounds with both. Chroma's context-rot research found all 18 frontier models tested degrade 20-50% in accuracy as context grows from 10K to 100K+ tokens. A bloated context window costs you twice: linearly in tokens and again in retries triggered by degraded accuracy.

Rolling windows, summarization at turn boundaries, and ranked retrieval with hard chunk caps are the standard playbook.

Will token prices keep falling?

Not at the frontier. The famous 10x-per-year inference cost decline has held for budget tiers (Epoch AI measured a 40x drop in Q1 2026 alone) but frontier pricing has gone essentially flat, with TokenCost measuring only 12x over three years at the top tier.

The Claude Fable 5 launch on June 9 raised the public ceiling to $10/$50, double the prior Opus 4.x rate.

The IPO calendar reinforces this. OpenAI and Anthropic both filed confidential S-1s in spring 2026, per NYT reporting, and both need a gross-margin expansion story for public investors.

Anthropic just posted its first operating profit ($559M on $10.9B revenue in Q2 2026, per the WSJ). Cutting frontier rates 50% would undercut the exact narrative both companies are selling.

Expect budget tiers to keep compressing toward the DeepSeek floor ($0.14/MTok input on V3) while the budget-to-frontier spread widens.

Also watch the surcharges nobody reads: OpenAI bills 2x input and 1.5x output on any session exceeding 272K input tokens, applied to the entire session. For an agent stuffing context with file diffs, that's the most aggressive penalty in the market.

What this means for you

Five moves, in priority order:

  1. Model cost as a range. Best case (cached, no retry) to worst case (uncached, 4x retry) spans 3-4x. Budget for the range.
  2. Cap retries and add loop detection. The canonical failure is the Pubby case: one customer session cost $47 over 8 hours from an unbounded 30-iteration planner loop.
  3. Turn on caching, then verify the breakeven. Confirm you exceed 1.25 reads per write on a 5-minute TTL before celebrating.
  4. Route by step, premium for planning only. 78% savings versus all-premium in the worked example.
  5. Keep context lean. Every token you don't re-send is paid for 15 times over.

And keep volume in perspective. A team running 50 agent tasks a day on Fable 5 spends maybe $1,000 a month at the worst case, noise next to a $300K+ fully-loaded engineer.

A team running 1,000 tasks a day has a $570K-a-month problem and should treat this math as core unit economics. Simon Willison, a single engineer, now spends $2,180 a month in tokens across two $200 subscriptions, and Cognition's $492M ARR (up from $37M a year prior) shows where aggregate demand is heading.

The bills are real. The good news is that the four multipliers that drive them are all under your control.

Sources

Frequently asked questions

How much does an AI agent cost per run in production in 2026?

For a typical 15-turn multi-file coding task, a completed run costs $4.75 to $19.01 on Claude Fable 5, $1.28 to $5.14 on GPT-5.4, and $0.77 to $3.08 on Gemini 3.5 Flash. The range depends on cache hit rate and retry overhead, which together swing the bill 4x.

What is Claude Fable 5 API pricing?

Claude Fable 5, GA since June 9, 2026, costs $10 per million input tokens and $50 per million output tokens, with cache reads at $1 and cache writes at $12.50 per million. It set the new public frontier ceiling, replacing the prior Opus 4.x rate of $5/$25.

Does prompt caching always reduce agent costs?

No. With Anthropic's default 5-minute cache TTL you need at least 1.25 cache reads per write to break even, and roughly 11 reads per write on the 1-hour TTL. Below those thresholds caching adds 0-30% to your bill. Above them, production teams report 60-90% cost reductions.

Is model routing worth the engineering effort?

Routing saves 78% versus running an entire 15-step task on a premium model, and production deployments report 40-75% savings. It is worth it when your baseline is all-frontier. It is not worth it if the whole workload can run on a budget tier, which routing cannot beat.

Will LLM token prices keep falling 10x per year?

Only on budget tiers. Epoch AI data shows a 40x cost drop in Q1 2026 for budget models but essentially flat frontier pricing, and TokenCost measured just 12x over three years at the frontier. With OpenAI and Anthropic both filing for IPOs, frontier rates are more likely to firm than fall.