Boris Cherny and Cat Wu, the creators of Claude Code, say they personally merge 10 to 30 agent-generated pull requests per day, roughly 5–10× the median throughput of a senior engineer. Anthropic engineers place the share of the company's production code that is now AI-authored at 80–90% or more. Meanwhile, the most rigorous randomized trial ever run on this question, METR's study of 16 experienced developers working a mature 1M+ line open-source repository, found that developers using AI tools were 19% slower than those without them, even as they believed they were faster.
Both of those things are true at the same time, and the distance between them is where every AI coding business case either survives or dies. This piece is the money math: what agents actually cost per PR, which ROI figures circulating in pitch decks are fabricated or garbled, what the productivity evidence really shows, and why the local-first stack is the most under-reported cost lever available to engineering leaders in mid-2026.
TL;DR: AI coding agents cost $1–$30 in tokens per PR at list prices (not the mythical $37.50), deliver ~10% median productivity gains (not 2–3×), and generate measurable quality debt: churn nearly doubled and refactoring fell 60%, per GitClear. The 79%-adoption headline masks an 11% production reality, and a self-hosted open-weights stack beats cloud APIs by 8–24× on cost past 5–10M tokens/month. The winners are the organizations treating this as a governance and measurement problem, not a tool-selection problem.
Key takeaways:
- Plan against the 11% production baseline, not the 79% adoption headline; the gap is governance, security, and integration work.
- Discount vendor productivity claims by 50–70%; the independent data converges on 10–30% individual gains, 5–15% at the team level.
- Budget compute separately from seats: $1–$30 in tokens per PR, with a long tail where 10% of tasks drive 50% of cost.
- The "$37.50 per PR" and "4:1 ROI" figures do not trace to primary sources. Do not put them in a business case.
- Local-first models break even against cloud APIs at 5–10M tokens/month and solve the compliance problem outright for regulated workloads.
- Code quality metrics (churn, refactor rate, change failure rate) are now first-class cost lines, not engineering hygiene.
The market is real; the forecast is a Rorschach test
The most-quoted 2026 number for the agentic AI market ($10.21B today, $388.30B by 2036 at a 43.80% CAGR) comes from a Vantage Market Research report dated June 1, 2026. It sits inside a remarkably wide analyst band. Precedence Research puts 2026 at $10.86B and 2034 at $199.05B; Fortune Business Insights puts 2026 at $9.14B and 2034 at just $139.19B. That $59.9B spread at the 2034 endpoint, a 1.43× ratio between two reputable houses, is not noise. Fortune counts only fully autonomous systems; Precedence includes assistants and tool-using LLMs. They are measuring different things and calling them by the same name.
A 43.8% CAGR implies a 38× expansion in a decade. Only cloud computing (2008–2018) and the smartphone app economy (2009–2014) have achieved anything comparable at this scale, and the forecast quietly assumes hundreds of millions of seats paying $19–$200/month, a step-function change in how the world buys software. The sane planning posture: treat the 2036 figure as a market-creation upper bound, and plan against the 2030–2032 horizon, where the analyst houses actually cluster and the base case is 5–10× the 2026 market, not 38×.
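The compounding behind that 38× figure is easy to check. The sketch below uses only the forecast endpoints quoted above; the function names are illustrative, not taken from any analyst report.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total expansion factor implied by a compound annual growth rate."""
    return (1.0 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end / start) ** (1.0 / years) - 1.0


# Forecast endpoints quoted above, in billions of dollars.
multiple = growth_multiple(0.4380, 10)
cagr = implied_cagr(10.21, 388.30, 10)
print(f"{multiple:.1f}x over 10 years; endpoint-implied CAGR {cagr:.1%}")
```

Running the same two functions on any other forecast in the analyst band is a quick way to see how much of the disagreement is growth rate and how much is starting definition.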
The coding-agent sub-segment is consolidating fast. Cursor (Anysphere) sits at roughly $2B ARR with 7 million monthly active users as of February 2026. Anthropic's run-rate revenue, commonly cited in the $10–14B range, implies coding-agent revenue in the low single billions. GitHub Copilot crossed 1.8 million paid seats per Microsoft's Q2 FY2026 disclosure, and in February 2026 GitHub began reselling Claude and Codex to Copilot Business and Pro users, the clearest sign yet that distribution, not models, is becoming the moat. The long tail (Devin, Replit, Sourcegraph, Augment, Poolside, Tabnine) is being absorbed through partnerships, which means the number of standalone pricing decisions you actually face is shrinking.
79% adoption, 11% production: the gap is the whole story

The Databricks 2026 State of AI Agents report, drawn from telemetry across 20,000+ organizations including more than 60% of the Fortune 500, found 79% of enterprises actively using AI agents but only ~11% running them in production. Crucially, that 11% is an operational measurement of where workloads actually execute, not a survey response. Other sources bracket it: IDC, with AWS, puts "full production" near 7%; McKinsey's State of AI 2025 puts "regularly using agentic AI in at least one function" at 45%; Menlo Ventures gets to 87% if a customer-service chatbot counts. The 7%–87% range isn't contradiction; it's definitional drift. For coding agents handling real production work, anchor on the low end.
The blockers are well documented and they are not about model quality. Per Databricks: 57% of organizations cite governance as the primary friction, and organizations with mature AI governance are 12× more likely to get projects into production. 45% of enterprises reported an AI-related data leak in the past 12 months, and 67% of those leaks came from unapproved tools, not sanctioned ones. 46% cite integration with existing CI/CD, identity, and observability stacks.
Gartner captures the institutional schizophrenia in two of its own publications: an August 2025 forecast that 40% of enterprise applications would integrate task-specific agents by 2026 (up from under 5%), and a September 2025 survey finding just 15% of IT application leaders even considering fully autonomous agents. High deployment intent, low production tolerance. That paradox is the gap.
The productivity evidence: 10% is the honest number

The bullish data is real and worth taking seriously. DX's research, the most-cited independent source, documented a jump from 1.4 to 2.3 PRs per developer per week and ~3.6 hours of weekly time saved. Vinted reported a 58% PR-throughput increase. These are not vendor-supplied numbers.
But the same firm published the corrective, and it got far less attention. DX's "AI productivity gains: more modest than expected" analysis concluded that real-world gains sit closer to 10% than the 2–3× promised. The reconciliation is simple and damning: the 60% throughput figure describes the upper quartile of agent users; the 10% figure describes the median engineer. A small set of high-leverage users pulls the average up while most of the organization sees incremental improvement. That is the AI productivity paradox compressed into a single statistical artifact.
The quality dimension makes the headline numbers worse. LinearB's 2026 benchmarks, drawn from thousands of organizations, found AI-assisted PRs merge at roughly half the rate of human-written PRs. So "2.3 vs 1.4 PRs per week" measures PRs opened, while merge rates pull in the opposite direction. The 2025 DORA report completes the picture: 90% of organizations now use AI in the SDLC, yet the core DORA metrics (deployment frequency, lead time, change failure rate) show only modest improvement for AI-heavy organizations, and AI adoption correlates with a small but measurable increase in change failure rate in some segments.
Then there is METR. Sixteen experienced developers, 246 tasks, one mature open-source repository (22,000+ stars, 1M+ lines), randomized assignment to AI tools (Cursor with Claude) or none. Result: 19% slower with AI, with developers simultaneously believing they were faster. The authors attribute the slowdown to time spent prompting, reviewing, and debugging output that doesn't match codebase conventions. The study's generalizability is bounded (one repo, one tool, expert developers on familiar code), but it is the strongest controlled evidence to date that 2–3× claims do not survive rigorous measurement. The contrast with GitHub/Microsoft's large-sample study (n≈4,800, +55% task completion on a controlled benchmark, a contested ~13.5% time reduction) tells you what's going on: AI looks transformative on greenfield benchmark tasks and marginal-to-negative on mature production codebases. Most of your codebase is the second thing.
Planning math: expect 10–30% throughput gains at the individual level, substantially less at the team level, and treat the gap between felt productivity and measured productivity as a problem your metrics stack must solve.
Cost-per-PR: killing the $37.50 myth and building the real number
Two figures circulate in business cases that should not survive contact with a CFO, and it's worth dismantling them precisely because they're so widely quoted.
The "4:1 ROI on Claude Code Max" figure does not trace to a primary publication. The closest source, Faros AI's "Measuring Claude Code ROI" post, describes positive ROI but asserts no 4:1 ratio. Faros's own AI Productivity Paradox research actively undermines a clean ratio: it measured +98% PRs merged alongside +91% review time, +154% PR size, +91% code churn, and +9% bug volume. Any ROI claim that doesn't net out review and rework costs is measuring the numerator and ignoring the denominator.
The "$37.50 per incremental PR" figure is a units error. $37.50 is Anthropic's per-million-token output rate for Opus 4.6+ when context exceeds 200K tokens, a long-context tier introduced in mid-2026. Apply that rate to a realistic 2,000-token output per PR and you get about $0.075, a factor of 500 below the quoted figure. The most plausible origin of the garbled number is someone dropping the "/M tokens" denominator. The real all-in figure, per Faros's engineering benchmarks: $20–$80 per PR for a typical Claude Code Max deployment, with wide variance by task complexity and agent turns.
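The units error is easiest to see when the per-million denominator is written out. This is a minimal sketch of the arithmetic in the paragraph above; the helper name is illustrative, and the 2,000-token output size is the article's own working assumption.

```python
def token_cost_usd(tokens: int, rate_per_million_usd: float) -> float:
    """Dollar cost of a token count at a price quoted per million tokens."""
    return tokens / 1_000_000 * rate_per_million_usd


# $37.50 is a rate per million output tokens, not a cost per PR.
output_cost = token_cost_usd(2_000, 37.50)
overstatement = 37.50 / output_cost

print(f"output cost per PR: ${output_cost:.3f}")
print(f"quoted figure overstates it by {overstatement:.0f}x")
```

The same helper applies to input tokens and to the larger agentic runs discussed later; only the token count and the rate change.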
Here is the actual cost stack, which has four layers:
| Cost layer | Mid-2026 reality | Notes |
|---|---|---|
| Per-seat subscription | Claude Code: $20 Pro / $100 Max 5× / $200 Max 20×. Copilot: $19–$39/seat. Cursor: $20–$40. Devin: $20 Team / $500 Enterprise | Median seat has held at $19–$40 IC, $30–$60 enterprise |
| Token/API usage | $1–$30 per production PR at Opus 4.5 list price (200K–2M tokens per agentic run) | Long-context output tiers ($30–$75/M) are where the race is, not input tokens |
| Orchestration overhead | ~4× token amplification for multi-agent vs. single-shot | Per Anthropic's own engineering write-up |
| Review, rework, revert | +91% review time, +91% churn, +9% bugs (Faros) | The layer naive ROI models omit entirely |
The practical monthly cost per engineer: a Claude Code Max 5× seat with average usage runs $200–$400 (plan plus overage); a Copilot Business or Cursor Pro seat runs $40–$80; a Devin Enterprise seat runs $500–$1,500 depending on workload. Power users on Max 20× plans are documented in Faros telemetry as consistently hitting monthly caps with $200–$500 in overages.
The multi-agent tax and the long tail
Anthropic's multi-agent research write-up quantified what every orchestration architect suspects: routing a task through an orchestrator with sub-agents costs ~4× the tokens of a single-shot call (~$0.04 → ~$0.16 for a typical Sonnet task), because the orchestrator issues its own planning, synthesis, and verification calls, and each sub-agent starts with a fresh, uncompressed context. Anthropic's SWE-bench performance work put the full 500-problem suite at $1,900–$2,400 in API fees, or $3.80–$4.80 per task.
METR's independent cost analysis adds the distribution shape, and the shape matters more than the median: the median SWE-bench-style task costs $0.46 in tokens, the 90th percentile costs $3.20 and 45 minutes of wall time, and the 99th percentile costs $22+ and 4+ hours. For a team running 50–100 agent PRs per week, that implies $500–$2,000/week in the median scenario, scaling to $5,000–$10,000/week for high-complexity work. Roughly 10% of tasks drive 50% of the cost. Budget the long tail explicitly or it will budget itself.
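One way to find out whether your own workload has this shape is to measure the top-decile share directly from per-task billing data. The sketch below is illustrative only: the function name is hypothetical, and the sample costs are invented to show the calculation, not drawn from METR or any other source.

```python
def tail_cost_share(costs: list[float], top_fraction: float = 0.10) -> float:
    """Share of total spend attributable to the most expensive tasks."""
    if not costs:
        raise ValueError("costs must not be empty")
    ranked = sorted(costs, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Invented per-task token costs in dollars, for illustration only.
sample_costs = [0.30, 0.35, 0.40, 0.46, 0.46, 0.50, 0.60, 0.90, 1.50, 5.50]
share = tail_cost_share(sample_costs)
print(f"top 10% of tasks account for {share:.0%} of spend")
```

If the share computed from real billing exports sits well above the task fraction, a median-based forecast will understate the bill.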
The frontier labs are the existence proof, and the selection-bias warning
The labs themselves are the most aggressive adopters on Earth, and their disclosures set the ceiling for what's possible. Anthropic, via the Anthropic Institute and executive interviews from Dario Amodei and Mike Krieger, has placed AI-authored production code at 80–90% or more as of mid-2026, with the Claude Code founders publicly describing 10–30 agent PRs merged per engineer per day. Sundar Pichai told Alphabet's Q1 2026 earnings call that "well over 30%" of new code at Google is AI-generated. Satya Nadella has put Microsoft's figure at 20–30%. Meta and OpenAI have stayed directional rather than numerical.
Read these claims for what they are: primary-source executive testimony, not independent measurement, from organizations whose engineers are by selection the heaviest power users of their own tools, exactly the upper-quartile population DX found driving the 60% number. The labs prove the ceiling exists. They tell you nothing about where your median engineer, on your fifteen-year-old codebase, will land. The honest synthesis: the frontier has moved from "AI is a tool" to "AI writes most of our code," and the gap between them and the median enterprise is not model access (everyone has the same models) but governance, test infrastructure, and review capacity.
The quality bill is coming due
The skeptical case is now as data-driven as the bullish one, and it is the binding constraint on production deployment.
GitClear's longitudinal analysis is the structural evidence. Across 211 million changed lines (2025 edition), code churn (lines changed within two weeks of commit) nearly doubled from 3.1% to 5.7%; the share of code classified as refactored collapsed from 24.1% to 9.5%, a ~60% drop; copy/paste clones rose 48%. A CMU difference-in-differences study presented at ICSE 2025 found AI adoption correlates with 30% more static-analysis warnings and 40% higher cyclomatic complexity, controlling for repository, language, and developer experience. These figures are not contested in the academic literature; the open question is whether they describe a transition phase or an equilibrium.
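The churn definition above (lines changed within two weeks of commit) is simple enough to track in-house. The following is a simplified sketch of that definition, not GitClear's methodology; the `LineRecord` shape and the way revision dates would be extracted from version control are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class LineRecord:
    """Hypothetical record for one committed line of code."""
    committed: date
    revised: Optional[date] = None  # None if the line was never changed again


def churn_rate(lines: list[LineRecord],
               window: timedelta = timedelta(days=14)) -> float:
    """Fraction of lines revised within `window` of their original commit."""
    if not lines:
        return 0.0
    churned = sum(
        1 for line in lines
        if line.revised is not None and line.revised - line.committed <= window
    )
    return churned / len(lines)


# Illustrative records only: two of these four lines churn inside 14 days.
records = [
    LineRecord(date(2025, 1, 1), date(2025, 1, 10)),
    LineRecord(date(2025, 1, 1), date(2025, 3, 1)),
    LineRecord(date(2025, 1, 1)),
    LineRecord(date(2025, 1, 2), date(2025, 1, 16)),
]
print(f"churn rate: {churn_rate(records):.1%}")
```

Splitting the same calculation by whether a PR was agent-authored is what turns churn from a hygiene number into a cost line.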
Security compounds the picture. Perry et al.'s IEEE S&P study, still the most-cited academic reference in the space, found developers using AI assistants produced more security vulnerabilities while being more confident their code was secure. Veracode's generative-AI coding reports found roughly 40% of generated snippets contained an OWASP Top 10 vulnerability. Snyk's 2025 report found a 67% year-over-year increase in AI-introduced vulnerabilities in customer repositories, with a median time-to-fix 31 days longer than human-introduced bugs.
Connect the threads and the mechanism is obvious: agents produce code faster than humans can review it, the refactoring that historically paid down debt has fallen 60%, and security debt is accumulating faster than review capacity. The GitClear refactor decline isn't a separate finding from the Snyk fix-time finding; it's the same finding measured twice.
The local-first edge: where the cost math flips
Here is the most under-reported development of 2026: the open-weights coding stack got good enough, and the cost math is no longer close.
The models are production-grade. Qwen3-Coder-30B runs on a single 24GB consumer GPU at 46–55% on SWE-bench Verified. Devstral 24B (Mistral) hits 46% on a 16GB GPU. OpenAI's gpt-oss-20B runs on a 16GB laptop. For context, these scores would have led the global SWE-bench leaderboard not long ago, and they now run air-gapped on hardware that costs less than three months of a Devin Enterprise seat.
The unit economics are lopsided. A self-hosted H100 running an open-weights coding model achieves roughly $0.62 per million tokens of effective output, against $5/M for GPT-4o and $15/M for Claude Sonnet 4.5, an 8× to 24× cost advantage that breaks even against cloud APIs at around 5–10M tokens per month. A single engineer running serious agentic workloads clears that threshold easily; a team clears it in days.
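The ratio and the break-even point can be modeled in a few lines. This sketch treats the $0.62/M figure as the marginal cost of local tokens, which is an assumption, and the fixed monthly cost is a placeholder chosen only to make the example run; substitute your own amortized hardware and operations cost.

```python
def cost_advantage(cloud_rate: float, local_rate: float) -> float:
    """How many times cheaper the local rate is, both in $ per million tokens."""
    return cloud_rate / local_rate


def break_even_million_tokens(fixed_monthly_usd: float,
                              cloud_rate: float,
                              local_rate: float) -> float:
    """Monthly volume (millions of tokens) at which self-hosting matches cloud spend."""
    if cloud_rate <= local_rate:
        raise ValueError("cloud rate must exceed local marginal rate")
    return fixed_monthly_usd / (cloud_rate - local_rate)


LOCAL_RATE = 0.62          # $/M tokens, from the paragraph above
FIXED_MONTHLY_USD = 100.0  # hypothetical placeholder, not a sourced figure

for name, rate in [("GPT-4o", 5.0), ("Claude Sonnet 4.5", 15.0)]:
    print(f"{name}: {cost_advantage(rate, LOCAL_RATE):.1f}x cheaper locally")

volume = break_even_million_tokens(FIXED_MONTHLY_USD, 15.0, LOCAL_RATE)
print(f"break-even vs $15/M at ${FIXED_MONTHLY_USD:.0f}/month fixed: {volume:.1f}M tokens")
```

The break-even volume scales linearly with the fixed cost, so the threshold is only as good as the hardware and staffing number fed into it.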
The tooling is mature. Tabby (self-hosted, SQLite-embedded state, IDE plugins), Aider (terminal agent whose repo-map feature meaningfully cuts token costs on large codebases), Continue.dev, Cline and Roo Code, OpenHands with sandboxed execution, and Zed AI all support full agentic loops against any local or remote model, deployable air-gapped against on-prem Git and CI/CD.
The killer pattern is test-in-a-loop. When the agent writes code, runs the test suite locally (at zero token cost), reads the failures, and iterates, you remove the most expensive turns from the paid loop entirely. A SQLite-embedded test database resets across hundreds of agent iterations; a Docker test environment cycles in seconds. The pattern cuts per-PR token cost by 30–60% versus a cloud-only deployment where every test run burns a paid model turn. This works with Claude Code and cloud models too, but combined with a local model, the marginal cost of an agent iteration approaches the electricity bill.
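In outline, test-in-a-loop looks like the sketch below. It is a minimal illustration, not any particular tool's implementation: `fix_until_green` and `propose_fix` are hypothetical names, the model call is left as an injected callable, and the default runner assumes pytest is installed in the current environment.

```python
import subprocess
import sys
from typing import Callable, Tuple

TestResult = Tuple[bool, str]  # (passed, combined test output)


def run_pytest(path: str = ".") -> TestResult:
    """Run the suite locally; this step spends no model tokens."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", path],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def fix_until_green(propose_fix: Callable[[str], None],
                    run_tests: Callable[[], TestResult] = run_pytest,
                    max_iterations: int = 5) -> Tuple[bool, int]:
    """Alternate local test runs with model edits; return (passed, model_calls)."""
    model_calls = 0
    for _ in range(max_iterations):
        passed, report = run_tests()
        if passed:
            return True, model_calls
        propose_fix(report)  # the only step that costs tokens
        model_calls += 1
    passed, _ = run_tests()
    return passed, model_calls


# Stubbed demonstration: two failing tests, each model call fixes one.
state = {"failing": 2}


def stub_tests() -> TestResult:
    return state["failing"] == 0, f"{state['failing']} failing"


def stub_fix(report: str) -> None:
    state["failing"] -= 1


print(fix_until_green(stub_fix, stub_tests))
```

The design point is that the test runner sits outside the model: the loop pays for one model turn per failed run and nothing for the runs themselves, and `max_iterations` doubles as a hard cap on spend per task.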
And then there's the part that isn't about money. In financial services, healthcare, defense, and government, "the model sees your proprietary code" versus "the model does not" is the difference between a permissible deployment and a regulatory violation. With the EU AI Act's general-purpose AI enforcement phase live since 2025 and the U.S. AI safety executive order extended through 2026, the local-first option isn't a cost optimization for the privacy-sensitive 30–40% of the engineering workload; it's the only option. The cloud frontier models still win on raw capability for the hardest long-horizon tasks; the correct architecture in 2026 is a portfolio, not a religion. Route high-complexity, low-sensitivity work to frontier APIs; route high-volume, privacy-sensitive, and test-heavy work to the local stack.
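That routing rule can be written down as a small policy function. This is a sketch under stated assumptions: the `Task` fields and `Backend` names are hypothetical, and sending everything that is neither sensitive nor complex to the local stack is a cost-driven default chosen here, not something the sources above prescribe.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    LOCAL = "local"
    FRONTIER_API = "frontier_api"


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor used only for this illustration."""
    sensitive: bool        # touches code or data that must stay on-prem
    high_complexity: bool  # hard, long-horizon work


def route(task: Task) -> Backend:
    """Pick a backend: sensitivity wins first, then complexity, then cost."""
    if task.sensitive:
        return Backend.LOCAL         # compliance overrides capability
    if task.high_complexity:
        return Backend.FRONTIER_API  # pay for capability where it matters
    return Backend.LOCAL             # assumed default: cheapest adequate model


print(route(Task(sensitive=False, high_complexity=True)))
print(route(Task(sensitive=True, high_complexity=True)))
```

Checking sensitivity before complexity is the deliberate choice: a hard task on regulated code still stays local, even if that means a weaker model.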
What this means for you
If you own an engineering budget, the 2026 playbook compresses to this:
- Build the business case on 10%, not 2×. The DX, LinearB, DORA, and METR evidence converges. If the case only works at 60% throughput gains, you don't have a case; you have a pitch deck.
- Strike the $37.50/PR and 4:1 figures from any document you sign. Neither traces to a primary source. Use first-party token math: 200K–2M tokens per agentic PR, $1–$30 at list price, $20–$80 all-in.
- Forecast compute separately from seats, and model the long tail. The p99 task costs ~50× the median. A hard usage cap with a defined unit and overage rate is the single most valuable procurement clause this year.
- Stand up the local stack for the regulated 30–40% of your workload. Break-even arrives at 5–10M tokens/month; the compliance benefit arrives on day one. Start with test-in-a-loop, which pays off even before you swap in a local model.
- Instrument churn and change-failure-rate as cost metrics, not hygiene metrics. The GitClear and Faros data say the review/rework layer is where ROI actually leaks. Mandatory review for AI-generated PRs, paired with a DORA dashboard that treats churn as a first-class metric, is the cheapest insurance available.
- Close the governance gap before scaling seats. Organizations with mature AI governance are 12× more likely to reach production. The 68-point gap between adoption (79%) and production (11%) is not waiting for a better model.
The agents are not going away; at the frontier, they already write most of the code. But the economics reward the organizations that measure honestly, route workloads to the cheapest adequate model, and treat quality debt as a line item rather than a surprise. In a market where everyone has access to the same models at the same prices, the discipline is the edge.
