Uber's engineers burned through the company's $3.4 billion 2026 R&D budget in four months on Claude Code and Cursor, forcing a $1,500 per-tool monthly cap, per Simon Willison's June 3 writeup and TechCrunch's reporting. Microsoft's Engineering & Development org went further and cancelled its Anthropic licenses outright after its pilot budget ran out in spring 2026.
So the question every team building agents should be able to answer precisely, and almost none can: what does one completed agent run actually cost?
We did the math. In mid-2026, for a standard 15-turn, three-file coding task, expect $4.75 to $19.01 per completed run on Claude Fable 5, $1.28 to $5.14 on GPT-5.4, and $0.77 to $3.08 on Gemini 3.5 Flash.
The spread within each model comes from two multipliers most teams never model: cache hit rate and retry overhead.
TL;DR: Headline per-token rates explain less than half your agent bill. Cache hit rate (60-92% in production), retry overhead (1.5-6x), and routing discipline determine whether the same task costs under a dollar or twenty. Model your cost as a range, cap your retries, and treat caching as a default with a known breakeven.
Key takeaways
- The reference 15-step coding task costs $4.75 on Claude Fable 5 with a 62% cache hit rate, and $19.01 once a realistic 4x retry multiplier is applied.
- Frontier input pricing spans 5x: Gemini 3.1 Pro at $2/MTok up to Claude Fable 5 at $10/MTok.
- Prompt caching cuts production bills 60-90%, but only above a breakeven of 1.25 reads per write on Anthropic's 5-minute TTL.
- Routing premium models to planning only saves 78% versus all-premium, per the worked example below.
- Budget-tier prices keep collapsing; frontier prices have flattened, and the IPO filings from OpenAI and Anthropic suggest they'll firm rather than fall.
What does one agent run actually cost?
A production coding-agent run costs between roughly $0.77 and $19.01 in June 2026, depending on model tier, cache behavior, and retries. That figure comes from a reproducible reference task: a SWE-bench-Lite-style multi-file fix spanning 15 LLM turns, three files, about 600 lines of code, with a 25K-token system prompt cached after turn one.
Here is the per-run math across three tiers, priced from first-party rate cards on June 12, 2026 (Anthropic, OpenAI, Google):
| Model (tier) | Base (no cache) | With cache (62% hit) | With 4x retry | Net per completed run |
|---|---|---|---|---|
| Claude Fable 5 (premium, $10/$50) | $7.84 | $4.75 | $19.01 | $19.01 |
| GPT-5.4 (mid, $2.50/$15) | $2.06 | $1.28 | $5.14 | $5.14 |
| Gemini 3.5 Flash (budget, $1.50/$9) | $1.23 | $0.77 | $3.08 | $3.08 |
The pull-quote version: your agent bill is a multiplication problem, and the per-token rate is the smallest factor in it.
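The last two columns of the table are a single multiplication. A minimal sketch, taking the cached per-run figures from the table as given; the function name is illustrative, and because the inputs are already rounded to the cent, the products can differ from the published column by a cent or two.

```python
# Cached per-run cost (62% hit rate), copied from the table above, in USD.
CACHED_COST = {
    "Claude Fable 5": 4.75,
    "GPT-5.4": 1.28,
    "Gemini 3.5 Flash": 0.77,
}


def net_cost_per_completed_run(cached_cost: float, retry_multiplier: float) -> float:
    """Expected spend per completed run: each retry re-pays the cached cost."""
    if retry_multiplier < 1:
        raise ValueError("retry_multiplier must be >= 1 (1 means no retries)")
    return cached_cost * retry_multiplier


for model, cost in CACHED_COST.items():
    low = net_cost_per_completed_run(cost, 1.0)   # cached, no retries
    high = net_cost_per_completed_run(cost, 4.0)  # cached, 4x retry overhead
    print(f"{model}: ${low:.2f} to ${high:.2f} per completed run")
```

Swapping 4.0 for your own measured retry overhead turns the table into a range for your workload.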
Why do the multipliers matter more than the rate card?
Because agents re-send everything, every turn. The accumulated context (system prompt, tool definitions, prior tool results, file diffs) goes back through the meter on each of those 15 turns, which is why input tokens are 95-98% of a multi-step agent's bill. Anthropic's own engineering post on building Claude Code says the harness is architected around prompt caching and the team runs SEV alerts on cache hit rate.
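The re-send effect is easy to see in a toy model. The sketch below assumes the context grows by a fixed amount each turn; the 25K-token system prompt and the 15 turns come from the reference task, while the 3,000 tokens of growth per turn is a made-up illustrative figure, not a measurement.

```python
def cumulative_input_tokens(system_tokens: int, growth_per_turn: int, turns: int) -> int:
    """Total input tokens billed when every turn re-sends the full context so far."""
    total = 0
    context = system_tokens
    for _ in range(turns):
        total += context            # the whole context goes through the meter again
        context += growth_per_turn  # tool results and diffs accumulate
    return total


# 25K system prompt and 15 turns are from the reference task;
# 3,000 tokens of growth per turn is a hypothetical value.
total = cumulative_input_tokens(25_000, 3_000, 15)
print(f"{total:,} input tokens billed across the run")
```

Under these assumptions the run bills 690,000 input tokens even though the context sent on any single turn never exceeds 67,000. The sketch counts tokens only; it does not apply cache discounts or output pricing.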
Token volume per task varies wildly by harness. Morph LLM's February 2026 three-way benchmark on six months of production traces found Aider averaged 105,000 tokens per task at 71% first-pass success, Cursor 104,000 at 68%, and Claude Code 479,000 at 78%. The roughly 4.6x token gap between Claude Code and Aider buys a 7-point success-rate gain, which is often a good trade.
Failed runs are the expensive tail: the Stanford/MIT/UMich "How Do AI Agents Spend Your Money?" study found failed bug-fix attempts routinely exceed a million tokens and cost $30 to $100 per attempt.
Retries are the single biggest lever in the sensitivity analysis. Moving from 4x to 2x retry overhead halves the bill. Moving cache hit rate from 0% to 90% cuts it 58%. No other axis comes close.
How much does prompt caching actually save?
Production deployments report 60-90% cost reductions from caching, with hard numbers attached. Notion cut its Claude bill 90% while reducing latency up to 85% across 30+ concurrent agent tasks. HolySheep dropped its OpenAI bill from $4,200 to $680 a month, an 84% cut. Culprit cut Haiku 4.5 input cost 90% with a two-segment trick: one cached segment for the stable system prompt, one for recent conversation history.
But caching has a breakeven, and Anthropic quietly moved it. The default cache TTL dropped from one hour to five minutes in early March 2026. Tanay Shah's production analysis puts the thresholds at 1.25 cache reads per write on the 5-minute TTL and roughly 11 reads per write on the 1-hour TTL.
Below those, caching adds 0-30% to your bill, because cache writes cost 1.25-2x base input.
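Checking where you sit against those thresholds takes two counters from your own logs. A minimal sketch: the threshold values are the ones quoted above, while the function names and the idea of counting cache reads and writes per run are assumptions about how your telemetry is shaped.

```python
# Breakeven thresholds (cache reads per cache write) as quoted in the text above.
BREAKEVEN_READS_PER_WRITE = {"5m": 1.25, "1h": 11.0}


def reads_per_write(cache_reads: int, cache_writes: int) -> float:
    """Ratio of cache reads to cache writes; no writes and no reads counts as 0."""
    if cache_writes == 0:
        return float("inf") if cache_reads > 0 else 0.0
    return cache_reads / cache_writes


def caching_pays_off(cache_reads: int, cache_writes: int, ttl: str = "5m") -> bool:
    """True when the observed ratio clears the quoted breakeven for this TTL."""
    return reads_per_write(cache_reads, cache_writes) > BREAKEVEN_READS_PER_WRITE[ttl]


# Reference task: prompt cached after turn one, then read on the other 14 turns,
# assuming every turn lands inside the TTL window.
print(caching_pays_off(cache_reads=14, cache_writes=1, ttl="5m"))
```

If slow tool calls push turns past the TTL, the same run produces more writes and fewer reads, and the ratio drops quickly.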
All three vendors now converge on 90% off cache reads. The differences live in the fine print: Anthropic has explicit TTLs and write surcharges, OpenAI auto-manages with no stated TTL, and Google charges a $1.00 per million tokens per hour storage fee that gets material for long-lived caches.
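To see when the storage fee starts to matter, it helps to write the arithmetic down. A minimal sketch using the $1.00 per million tokens per hour figure quoted above; the function name is illustrative, and the sketch assumes the fee scales linearly with both token count and hours held.

```python
STORAGE_FEE_PER_MTOK_HOUR = 1.00  # USD, as quoted in the text above


def cache_storage_fee(cached_tokens: int, hours: float) -> float:
    """Storage cost of keeping a cached prefix alive, assuming linear billing."""
    return cached_tokens / 1_000_000 * STORAGE_FEE_PER_MTOK_HOUR * hours


# The reference task's 25K-token system prompt, held for one day.
print(f"${cache_storage_fee(25_000, 24):.2f}")
```

Under this formula, the 25K-token prefix costs $0.60 to hold for a day and $18 to hold for 720 hours (30 days), per cached prefix, before any read savings are counted.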
Is model routing worth it?
Routing saves 78% versus running everything on the premium model, and it has become a corporate default. CNBC reported on June 5 that model routing is now standard practice for managing AI overspend, and the academic grounding goes back to RouteLLM, which showed 2x+ cost cuts with no quality loss.
Applied to our reference task: Fable 5 handles the planning step only, GPT-5.4 takes the five code-editing steps, and Gemini 3.5 Flash runs the other nine (reads, tests, lint, format, finalize). The routed run lands at $1.04 with cache and $4.17 with 4x retry, against $19.01 for all-Fable 5.
One honest caveat: routing does not beat simply running everything on Gemini 3.5 Flash at $3.08. The reason most teams can't do that is the failure tail. Complex planning and multi-file refactors need a frontier model to succeed on the first attempt, and the 3-30x failed-attempt multiplier swamps the savings from a cheaper planner.
Route premium to the steps where failure is expensive, budget to the steps where it isn't.
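The routing rule in the worked example fits in a lookup table. A minimal sketch: the model assignments follow the example above, while the step labels and the split of the nine budget steps into three reads, three tests, lint, format, and finalize are assumptions made for illustration.

```python
from collections import Counter

# Step kind -> model, following the worked example; anything else goes to budget.
ROUTES = {
    "plan": "Claude Fable 5",
    "edit": "GPT-5.4",
}
DEFAULT_MODEL = "Gemini 3.5 Flash"


def route(step_kind: str) -> str:
    """Pick a model for a step; unknown step kinds fall through to the budget tier."""
    return ROUTES.get(step_kind, DEFAULT_MODEL)


def savings_vs(baseline_cost: float, routed_cost: float) -> float:
    """Fractional saving of the routed run against the baseline run."""
    return 1 - routed_cost / baseline_cost


# Hypothetical labels for the 15 steps of the reference task.
REFERENCE_STEPS = (
    ["plan"] + ["edit"] * 5 + ["read"] * 3 + ["test"] * 3 + ["lint", "format", "finalize"]
)

usage = Counter(route(step) for step in REFERENCE_STEPS)
print(dict(usage))
print(f"{savings_vs(19.01, 4.17):.0%} saved versus all-premium")
```

Defaulting unknown step kinds to the budget tier is a design choice; a team worried about the failure tail might default to the mid tier instead.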
Context management is the third lever, and it compounds with both caching and routing. Chroma's context-rot research found all 18 frontier models tested degrade 20-50% in accuracy as context grows from 10K to 100K+ tokens. A bloated context window costs you twice: linearly in tokens and again in retries triggered by degraded accuracy.
Rolling windows, summarization at turn boundaries, and ranked retrieval with hard chunk caps are the standard playbook.
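Two of those three techniques, the rolling window and the hard chunk cap, can be sketched in a few lines. This is an illustration under stated assumptions, not a production trimmer: the four-characters-per-token estimate is a rough heuristic standing in for a real tokenizer, and all names are hypothetical. Summarization is left out because it needs a model call.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about four characters per token."""
    return max(1, len(text) // 4)


def trim_context(
    system_prompt: str,
    messages: list[str],
    max_tokens: int,
    max_chunk_tokens: int,
) -> list[str]:
    """Keep the system prompt, then the newest messages that fit the token budget."""
    budget = max_tokens - estimate_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(messages):  # walk from newest to oldest
        # Hard cap: truncate any single oversized message (e.g. a huge diff).
        capped = message[: max_chunk_tokens * 4]
        cost = estimate_tokens(capped)
        if cost > budget:
            break  # older messages no longer fit; drop them
        kept.append(capped)
        budget -= cost
    return [system_prompt] + list(reversed(kept))


history = ["a" * 40, "b" * 40, "c" * 40]  # three messages of ~10 tokens each
print(len(trim_context("s" * 40, history, max_tokens=30, max_chunk_tokens=50)))
```

Keeping the system prompt first and unchanged also keeps the cacheable prefix stable, so trimming does not invalidate the cache.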
Will token prices keep falling?
Not at the frontier. The famous 10x-per-year inference cost decline has held for budget tiers (Epoch AI measured a 40x drop in Q1 2026 alone) but frontier pricing has gone essentially flat, with TokenCost measuring only 12x over three years at the top tier.
The Claude Fable 5 launch on June 9 raised the public ceiling to $10/$50, double the prior Opus 4.x rate.
The IPO calendar reinforces this. OpenAI and Anthropic both filed confidential S-1s in spring 2026, per NYT reporting, and both need a gross-margin expansion story for public investors.
Anthropic just posted its first operating profit ($559M on $10.9B revenue in Q2 2026, per the WSJ). Cutting frontier rates 50% would undercut the exact narrative both companies are selling.
Expect budget tiers to keep compressing toward the DeepSeek floor ($0.14/MTok input on V3) while the budget-to-frontier spread widens.
Also watch the surcharges nobody reads: OpenAI bills 2x input and 1.5x output on any session exceeding 272K input tokens, applied to the entire session. For an agent stuffing context with file diffs, that's the most aggressive penalty in the market.
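Because the surcharge applies to the whole session, one oversized request reprices everything around it. A minimal sketch using the GPT-5.4 rates from the table and the multipliers quoted above. It assumes the trigger is any single request whose input exceeds 272K tokens, which is one reading of the rule, and it ignores cache discounts; check the rate card before relying on it.

```python
INPUT_RATE = 2.50    # USD per million input tokens (GPT-5.4, from the table)
OUTPUT_RATE = 15.00  # USD per million output tokens
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens
INPUT_SURCHARGE = 2.0
OUTPUT_SURCHARGE = 1.5


def session_cost(turns: list[tuple[int, int]]) -> float:
    """Cost of a session given (input_tokens, output_tokens) per request.

    Assumption: one request over the threshold reprices the entire session.
    """
    over = any(inp > LONG_CONTEXT_THRESHOLD for inp, _ in turns)
    in_mult = INPUT_SURCHARGE if over else 1.0
    out_mult = OUTPUT_SURCHARGE if over else 1.0
    total_in = sum(inp for inp, _ in turns)
    total_out = sum(out for _, out in turns)
    return (total_in * INPUT_RATE * in_mult + total_out * OUTPUT_RATE * out_mult) / 1_000_000


normal = session_cost([(100_000, 1_000), (250_000, 1_000)])
tripped = session_cost([(100_000, 1_000), (300_000, 1_000)])
print(f"under threshold: ${normal:.3f}, over threshold: ${tripped:.3f}")
```

A guard that trims or splits any request approaching the threshold is cheap insurance against repricing the rest of the session.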
What this means for you
Five moves, in priority order:
- Model cost as a range. In the reference table, best case (cached, no retry) to the cached 4x-retry case spans 4x, and an uncached 4x-retry run stretches that to roughly 6.5x. Budget for the range.
- Cap retries and add loop detection. The canonical failure is the Pubby case: one customer session cost $47 over 8 hours from an unbounded 30-iteration planner loop.
- Turn on caching, then verify the breakeven. Confirm you exceed 1.25 reads per write on a 5-minute TTL before celebrating.
- Route by step, premium for planning only. 78% savings versus all-premium in the worked example.
- Keep context lean. In a 15-turn run, every token you leave in the context is paid for up to 15 times over.
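The retry cap and loop detection from the second move can be sketched as a thin wrapper around the agent loop. This is a minimal illustration with hypothetical names: `step` stands in for one planner iteration that returns a string describing its action, and treating a repeated identical action as a loop is a simplifying assumption.

```python
from typing import Callable, Optional


def run_with_limits(
    step: Callable[[int], str],
    is_done: Callable[[str], bool],
    max_iterations: int = 10,
    max_repeats: int = 2,
) -> tuple[Optional[str], str]:
    """Run an agent loop, stopping on success, an iteration cap, or a repeated action."""
    seen: dict[str, int] = {}
    for i in range(max_iterations):
        action = step(i)
        if is_done(action):
            return action, "done"
        seen[action] = seen.get(action, 0) + 1
        if seen[action] > max_repeats:
            return None, "loop_detected"  # same action again and again: bail out
    return None, "iteration_cap"


# A planner stuck proposing the same action is cut off after three attempts.
print(run_with_limits(lambda i: "re-plan", lambda action: False))
```

A dollar budget checked inside the same loop is a natural third stop condition, since iteration counts alone say nothing about how large each turn's context has grown.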
And keep volume in perspective. A team running 50 agent tasks a month on Fable 5 spends maybe $1,000 a month at the worst case (50 × $19.01 ≈ $950), noise next to a $300K+ fully-loaded engineer.
A team running 1,000 tasks a day has a $570K-a-month problem and should treat this math as core unit economics. Simon Willison, a single engineer, now spends $2,180 a month in tokens across two $200 subscriptions, and Cognition's $492M ARR (up from $37M a year prior) shows where aggregate demand is heading.
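The high-volume figure is the worst-case per-run cost scaled by volume. A minimal sketch, assuming a 30-day month and the $19.01 worst-case Fable 5 run from the reference table; the function name is illustrative.

```python
def monthly_spend(tasks_per_day: int, cost_per_run: float, days_per_month: int = 30) -> float:
    """Monthly agent spend, assuming every run costs the same and runs happen daily."""
    return tasks_per_day * cost_per_run * days_per_month


# 1,000 tasks a day at the worst-case $19.01 per completed run.
print(f"${monthly_spend(1_000, 19.01):,.0f} per month")
```

Plugging in your own task volume and the low and high ends of your per-run range gives the budget band to plan against.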
The bills are real. The good news is that the four multipliers that drive them (cache hit rate, retries, routing, and context size) are all under your control.
Sources
- Claude Fable 5 and Claude Mythos 5 announcement (Anthropic)
- Claude API pricing (Anthropic)
- OpenAI API pricing
- Gemini Developer API pricing (Google)
- Uber caps usage of AI tools like Claude Code (Simon Willison)
- Uber caps employee AI spending (TechCrunch)
- Lessons from building Claude Code: prompt caching is everything (Anthropic)
- Notion customer story (Anthropic)
- Anthropic prompt caching cut our RCA cost by 90% (Culprit)
- Anthropic prompt cache production patterns (Tanay Shah)
- Aider vs Cursor vs Claude Code token benchmark (Morph LLM)
- Context Rot: how increasing input tokens impacts LLM performance (Chroma)
- RouteLLM: learning to route LLMs with preference data (arXiv)
- Model routing is a problem for OpenAI and Anthropic (CNBC)
- I think Anthropic and OpenAI have found product-market fit (Simon Willison)
- More Devins in More Places (Cognition)
