AI code ships 1.7× more issues per PR — fixable, not fatal
The short answer: No, the constant-scrape-fix loop is not an unavoidable anti-pattern. CodeRabbit measured ~1.7× more issues across 470 real PRs (320 AI-coauthored vs. 150 human-only; Dec 17, 2025), and the same study called out +75% logic errors, ~8× performance problems, and 1.5–2× security issues (verified as of 2026-07-04). Frontline engineers respond with two patterns that work together: (1) make the agent stoppable and the code self-verifiable, then (2) hand the code to a different agent, or a different tool, for adversarial review before it touches main. Done right, those patterns add a few percent to cycle time (the cost of running a hook and a second reviewer pass) and recover most of the 1.7× gap. Done wrong (pasting a prompt and trusting the output), you inherit the gap in full and stack a "comprehension debt" on top: in Anthropic's own RCT, AI-assisted devs scored 50% on a code-comprehension test vs. 67% for hand-coders (Anthropic, published Jan 29, 2026).
What's current (as of 2026-07-04)
| Tool / Model | Version / key facts | Release date | Source |
|---|---|---|---|
| Anthropic Claude Sonnet 5 | $2 input / $10 output per MTok (intro to Aug 31, then $3/$15) | Jun 30, 2026 | TechCrunch / Anthropic |
| Anthropic Claude Opus 4.8 | $5 / $25 per MTok; 69.2% SWE-bench Pro | ~May 2026 | Anthropic / MarkTechPost |
| Anthropic Claude Opus 4.7 | $5 / $25 per MTok, 87.6% SWE-bench Verified, 64.3% SWE-bench Pro | Apr 2026 | Anthropic news |
| Claude Code CLI | v2.1.x | Late Jun / early Jul 2026 | code.claude.com changelog |
| OpenAI Codex CLI | rust-v0.142.x | Jun 2026 | GitHub CHANGELOG |
| OpenAI GPT-5.5 "Spud" | $5 / $30 per MTok (doubled vs prior), 82.7% Terminal-Bench 2.0 | Apr 23, 2026 | OpenAI intro |
| OpenAI GPT-5.6 (Sol/Terra/Luna) | limited preview (unverified scope) | ~Jun 2026 | aipricing.guru (unverified) |
| Cursor IDE + Composer 2.5 | v3.9 line | May–Jun 2026 | Cursor changelog |
| Cognition Devin (Devin 2.0 line) | agent-native IDE, parallel Devins | 2025–2026 | Cognition |
| Cline (VS Code) | v3.8x | Apr–May 2026 (unverified exact) | cline.bot |
| Google Antigravity | replaces Gemini CLI tier | 2026 (unverified exact) | Google Developers Blog |
| GitHub Copilot cloud agent | signed commits on every PR; Agent-Logs-Url trailer | Apr 3 / Mar 20, 2026 | GitHub blog changelog |
Any tool row I could not pin to a primary source at a specific version is marked "(unverified)" or given an approximate date; treat the exact patch numbers as indicative, not authoritative.
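To make the per-MTok prices in the table concrete, here is a minimal sketch that converts them into a per-task cost. The prices come from the table; the dictionary keys are labels chosen for this example (not API model IDs), and the 400k-input / 50k-output workload is a hypothetical figure, not a measurement.

```python
# Prices in USD per million tokens (MTok) as (input, output), from the table above.
# The keys are labels for this example only, not official model identifiers.
PRICES = {
    "sonnet-5-intro": (2.00, 10.00),
    "sonnet-5-list": (3.00, 15.00),
    "opus-4.8": (5.00, 25.00),
    "gpt-5.5": (5.00, 30.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task at the listed per-MTok prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 400k input tokens and 50k output tokens per agent task.
for model in PRICES:
    print(f"{model}: ${task_cost(model, 400_000, 50_000):.2f}")
```

Under that assumed workload the intro-priced Sonnet 5 task costs $1.30 against $3.50 for GPT-5.5, so a second-model review pass is cheap relative to the writer pass if the reviewer reads far fewer tokens.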
The 87% / 1.7× paradox: same model, two stories
Frontier closed models now score 87–89% on SWE-bench Verified; that's the well-known number. But on SWE-bench Pro (harder, multi-language gold patches) the same models drop to roughly 59–69%, and on real-world PRs the defect rate tells a third story.
- Claude Opus 4.7 = 87.6% on SWE-bench Verified but 64.3% on SWE-bench Pro — a ~23-point drop on the harder evaluation (Anthropic, Apr 2026). Claude Opus 4.8 = 69.2% on Pro (verified as of 2026-07-04).
- GPT-5.5 is reported at 88.7% on SWE-bench Verified (per aggregator llm-stats.com, 2026) — but OpenAI's own headline is 82.7% on Terminal-Bench 2.0, a more conservative and better-sourced claim (OpenAI, Apr 23, 2026). On SWE-bench Pro, GPT-5.5 reaches ~58.6%, behind Opus 4.7's 64.3%.
- CodeRabbit, 470 PRs (320 AI-coauthored, 150 human-only), Dec 17 2025: AI-coauthored PRs carry a ~1.7× overall issue rate, broken down into +75% logic/correctness errors, ~8× performance issues, 3× readability issues, and 1.5–2× security issues (per CodeRabbit's own report; some coverage cites up to 2.7× for security flaws specifically). AI also produced ~1.4× more critical and ~1.7× more major defects (businesswire / CodeRabbit blog).
- Veracode 2025 GenAI Code Security Report: ~45% of AI-generated code samples contained an OWASP-relevant vulnerability when the model could choose between secure and insecure approaches (Veracode, 2025).
- Uplevel / GitHub Copilot study (still cited, historical): Copilot teams shipped materially higher bug rates vs. control.
- GitClear 2024 (still cited, historical): code churn — code rewritten or deleted within ~2 weeks of commit — roughly doubled across a large corpus; AI-assisted coding identified as a leading driver.
Two more numbers define the cost side. Anthropic's January 2026 RCT (52 junior engineers learning the Python trio library; anthropic.com/research, published Jan 29, 2026) reported AI-assisted devs scored 50% on a code-comprehension quiz vs. 67% for hand-coders — a 17-point gap (nearly two letter grades), while finishing only ~2 minutes faster (not statistically significant). You pay ~17 points of understanding for a speed gain that, here, barely existed. And the only controlled study of experienced developers with AI, METR's RCT (16 devs, 246 tasks, Jul 10, 2025), found AI made them **19% slower** while they felt ~20% faster and predicted 24% faster before starting. The expectation-reality gap is ~40 percentage points.
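The "~40 percentage points" figure is the distance between what developers expected or felt and what was measured. A short worked calculation, using only the numbers quoted above (positive means faster, negative means slower):

```python
# Figures as reported in the text; sign convention: + is faster, - is slower.
predicted_speedup = 24   # % faster, forecast before starting (METR)
felt_speedup = 20        # % faster, self-reported afterwards (METR)
measured_speedup = -19   # measured result: 19% slower (METR)

forecast_gap = predicted_speedup - measured_speedup   # forecast vs. measured
perceived_gap = felt_speedup - measured_speedup       # felt vs. measured

# Anthropic RCT quiz scores: hand-coders 67%, AI-assisted 50%.
comprehension_gap = 67 - 50

print(forecast_gap, perceived_gap, comprehension_gap)
```

The two METR gaps come out at 43 and 39 points, which is where the rounded "~40" comes from; the comprehension gap is the 17 points cited above.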
The honest synthesis: the benchmark ceiling is real; the defect rate is real; the productivity gain is often a feeling. None of these cancel out, but the 1.7× issue rate is the one that ships to production first.
Why "burnt toast" is the model's default output
The burnt-toast pattern isn't a bug — it's the predictable behavior of a system optimized for plausible next-token prediction, not for verified correctness. Mechanisms that dominate in 2026:
- Plausibility-optimized continuation. A coding agent doesn't execute your spec; it predicts what code usually follows a spec like yours. Karpathy captured this on his October 2025 nanochat release — "I tried to use claude/codex agents a few times but they just didn't work well enough at all and net unhelpful" — noting that agents excel at boilerplate with abundant training examples but struggle on unique, intellectually dense code (verified quote, Oct 2025).
- Calibration failure. Models hallucinate APIs and library members and often do so with confidence that does not track accuracy — the most useful-looking answer is frequently the confidently wrong one. (Directional claim; the specific "CHOKE / Simhi et al., EMNLP 2025" citation is unverified and should not be leaned on.)
- Lost-in-the-middle / long-context degradation. Liu et al.'s 2023 finding — that models under-attend to the middle of long contexts — has been widely replicated. Long contexts degrade retrieval and reasoning asymmetrically.
- Codebase-specific knowledge gap. The model knows the average repo, not your repo. It needs explicit context injection — repo-map (Aider), tree-sitter indexing (Cursor), or a hand-written CLAUDE.md / AGENTS.md — to behave like a teammate. Without it, it averages across the training set.
- Comprehension debt. Anthropic's January 2026 RCT quantifies the human-side mechanism: when the agent writes the code, the developer reads less and retains less. This is upstream of the 1.7× figure — the reviewer can't catch what they can't read.
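To illustrate what "explicit context injection" for the codebase-specific knowledge gap can look like, here is a toy symbol outline built with Python's standard `ast` module. This is a stand-in for the idea only: it is not how Aider's repo-map or Cursor's indexing works, it handles Python source only, and the `SAMPLE` module is made up for the example.

```python
import ast


def outline(source: str, filename: str = "<memory>") -> list[str]:
    """List top-level functions and classes (with their methods) in one Python file."""
    tree = ast.parse(source, filename=filename)
    lines = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append(f"  def {item.name}(...)")
    return lines


# Hypothetical module used only to demonstrate the outline.
SAMPLE = '''
class Cart:
    def add(self, sku, qty): ...
    def total(self): ...

def checkout(cart, payment): ...
'''

print("\n".join(outline(SAMPLE)))
```

An outline like this, prepended to the prompt, gives the agent the names that actually exist in the repository instead of leaving it to guess plausible ones.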
Two implications worth flagging. First, METR's time-horizon work found AI task-completion horizons roughly doubling on a months-scale cadence — reliable on short tasks, weak on multi-hour ones. Second, Anthropic's own engineering team shipped three overlapping bugs that degraded Claude Code from roughly March 4 to April 20, 2026 (~7 weeks) — a reasoning-effort downgrade, a reasoning-history discard bug, and a 25-word inter-tool cap — because verification did not extend upstream of the agent (Anthropic postmortem, Apr 23, 2026). The lesson: verification must apply to your own pipeline, not just to the model output.
What actually fixes it: the single-write / many-verify pattern
There is a working consensus across Anthropic, OpenAI, Cognition, Cursor and the practitioner community. It is single-write / many-verify: one writer path, multiple independent verify paths against the same artifact, with the verifier never being the writer. Five layers, ordered roughly by ratio of velocity cost to defect yield:
Layer 1 — Stop the agent from shipping if pnpm test && pnpm build fails. Anthropic's best-practices doc recommends a Stop hook running test + type + lint with a retry cap. Aider's analog is --test-cmd "pytest tests/ -x" --auto-test --lint-cmd "ruff check ." --auto-lint. Prose instructions (CLAUDE.md) get partial compliance; hooks are deterministic. The Stop hook is the highest-leverage burnt-toast wall and costs near-zero human latency.
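A minimal sketch of the decision logic such a Stop hook could run is shown below. It assumes the hook convention that exit code 2 blocks the stop and returns the message to the agent, while 0 lets it finish; check the current hooks reference before relying on that. The `CHECKS` commands, the `MAX_RETRIES` value, and the idea that the caller supplies `attempt` (for example from a per-session counter file) are all assumptions made for this example.

```python
import subprocess

# Hypothetical check commands; substitute your project's own.
CHECKS = [["pnpm", "test"], ["pnpm", "build"]]
MAX_RETRIES = 3  # assumed retry cap, not a recommended value


def run_check(cmd: list[str]) -> tuple[bool, str]:
    """Run one check command; return (passed, tail of combined output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, (proc.stdout + proc.stderr)[-2000:]


def gate(attempt: int, checks=CHECKS, runner=run_check) -> tuple[int, str]:
    """Return (exit_code, message): 0 lets the agent stop, 2 asks it to keep fixing."""
    if attempt >= MAX_RETRIES:
        # Stop looping; CI and human review still stand between this and main.
        return 0, f"retry cap ({MAX_RETRIES}) reached; escalate to a human"
    for cmd in checks:
        passed, output = runner(cmd)
        if not passed:
            return 2, f"{' '.join(cmd)} failed:\n{output}"
    return 0, "all checks passed"
```

To use it as a hook script, the entry point would call `gate(attempt)`, write the message to stderr, and exit with the returned code. The `runner` parameter exists so the logic can be tested without invoking real build tools.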
Layer 2 — Adversarial reviewer subagent in a fresh context. Anthropic's multi-agent Research system demonstrated a large lift running an Opus coordinator with Sonnet subagents. The Claude Code subagents feature lets you ship this in a .claude/agents/ definition, then call it as a separate review agent. Catches bugs deterministic tests miss and eliminates self-bias.
Layer 3 — Reviewer-tool from a different vendor. Cursor's BugBot moved from ~$40/seat/month to usage-based pricing (defaulting to ~$1.00–1.50 per review, ~$1.20 typical) and, per Cursor's June 2026 update, is ~3× faster, ~22% cheaper, and finds ~10% more bugs per review (verified as of 2026-07-04). Devin Review and GitHub Copilot Coding Agent offer async, non-blocking alternatives; Copilot as of Apr 3, 2026 cryptographically signs every commit, so a "Require signed commits" branch rule gives you a trust gate on top of any review gate, and the Mar 20, 2026 Agent-Logs-Url trailer makes each Copilot commit traceable to its session log.
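The trust gate described here combines two signals per commit: a valid signature and an Agent-Logs-Url trailer. The sketch below checks both from data you would already have in CI. It assumes the signature status is git's `%G?` code (where `G` means a good signature) and that trailers follow the usual `Key: value` form in the last paragraph of the message; the example URL in the usage note is a placeholder, not a real Copilot log address.

```python
def trailers(commit_message: str) -> dict[str, str]:
    """Parse 'Key: value' trailers from the last paragraph of a commit message."""
    paragraphs = commit_message.strip().split("\n\n")
    if len(paragraphs) < 2:
        return {}  # a lone subject line has no trailer block
    found = {}
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(": ")
        if sep and key and " " not in key:
            found[key] = value.strip()
    return found


def is_traceable(commit_message: str, signature_status: str) -> bool:
    """True when the commit is signed and carries an Agent-Logs-Url trailer.

    signature_status is assumed to be the output of git's %G? placeholder.
    """
    return signature_status == "G" and "Agent-Logs-Url" in trailers(commit_message)
```

For a message ending in `Agent-Logs-Url: https://example.invalid/logs/1` with status `G`, `is_traceable` returns `True`; an unsigned commit (status `N`) or one without the trailer returns `False`. Server-side branch protection remains the actual enforcement point; this only mirrors the rule for local or CI reporting.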
Layer 4 — Spec / test before code, repo-aware context, MCP for live state. Define the goal once (test or PR) and let the agent iterate against it. Add a hand-written CLAUDE.md / AGENTS.md at the repo root with architecture, conventions, test commands, and what-NOT-to-do — human-written context files help; auto-generated ones can hurt, so never let the agent write its own constitution. Pair with Cursor's indexing or Aider's repo-map so the agent edits against your codebase, not the average one. Use MCP for any live, sensitive, or large tool surface (DBs, internal services) you would not paste into a prompt.
Layer 5 — The right loops and hooks for Claude Code 2.1.x. Recent Claude Code runs subagents in the background so the main loop never stalls while a reviewer runs. The lifecycle hooks give you handlers across the tool lifecycle: a UserPromptSubmit to inject repo context, a PreToolUse to gate edits, a PostToolUse to auto-format, a Stop to gate on tests.
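As one example of a PreToolUse gate, the sketch below blocks agent edits to files the agent should not rewrite, such as its own context file. It assumes the hook receives a JSON event on stdin with `tool_name` and `tool_input.file_path` fields and that exit code 2 blocks the tool call; both are assumptions to verify against the current hooks reference. The `PROTECTED` patterns are hypothetical.

```python
import json
from fnmatch import fnmatch

# Hypothetical protected paths; adjust to your repository.
PROTECTED = ["CLAUDE.md", "AGENTS.md", ".github/workflows/*", "*.lock"]


def is_protected(path: str) -> bool:
    """Match a relative or absolute path against the protected patterns."""
    return any(fnmatch(path, p) or fnmatch(path, "*/" + p) for p in PROTECTED)


def pre_tool_use_exit_code(stdin_text: str) -> int:
    """Return 2 to block an edit to a protected file, otherwise 0."""
    event = json.loads(stdin_text)
    # Assumed field names: tool_name, tool_input.file_path.
    if event.get("tool_name") not in {"Edit", "Write"}:
        return 0
    path = event.get("tool_input", {}).get("file_path", "")
    return 2 if is_protected(path) else 0
```

A hook script would pass its stdin to `pre_tool_use_exit_code` and exit with the result. Keeping the decision in a pure function makes the gate itself testable, which matters if verification is supposed to cover your own pipeline as well as the model output.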
The velocity math. A Stop hook adds <1 second per turn and catches a large share of burnt-toast incidents outright. A reviewer subagent adds ~15–45 seconds for a small change and catches much of the rest — closing most of the 1.7× gap in median cases. A vendor-different reviewer adds ~90 seconds (BugBot) to minutes (Devin Review), picks up long-tail multi-file regressions, and a signed-commit branch rule makes the loop auditable. Total per-PR overhead is on the order of a few percent of cycle time. The METR −19% finding is for uninstrumented users — the playbook here is the difference between instrumented and uninstrumented AI use, not a reason to abstain.
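The "few percent" claim can be checked with back-of-envelope arithmetic. The sketch below uses the upper-bound latencies quoted above (1 s per turn for the hook, 45 s for the reviewer subagent, 90 s for a vendor reviewer); the 2-hour cycle time and 20 agent turns are hypothetical inputs, so treat the result as illustrative.

```python
def overhead_percent(cycle_seconds, turns, hook_s=1.0, subagent_s=45.0, vendor_s=90.0):
    """Verification overhead as a percentage of PR cycle time."""
    added = turns * hook_s + subagent_s + vendor_s
    return 100.0 * added / cycle_seconds


# Hypothetical PR: 2-hour cycle time, 20 agent turns.
print(round(overhead_percent(2 * 3600, 20), 2))  # prints 2.15
```

With those inputs the three layers add 155 seconds to a 7,200-second cycle. A slower reviewer such as a multi-minute async pass raises the figure, but it runs off the critical path if it is non-blocking.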
Tool-by-tool: which 2026 agents ship the cleanest code
Independent, real-world data is thin because most tool comparisons are vendor-funded. The strongest non-vendor signals: the Stack Overflow 2025 developer survey (high adoption, low trust in AI output, "almost right but not quite" as the top frustration), CodeRabbit's 470-PR study, DORA's 2025 research (the speed-vs-stability trade-off is driven by the strength of your CI, testing, and review culture — not by which AI you pick), and Anthropic's and METR's RCTs above.
| Tool (2026-07-04) | Strength | Weakness | Why it burns less toast |
|---|---|---|---|
| Claude Code 2.1.x + Sonnet 5 / Opus 4.8 | Best raw reasoning, deepest hooks + subagents; $2/$10 intro pricing on Sonnet 5 through Aug 31, 2026 | Was the agent in Anthropic's own ~7-week production-bug postmortem | Best-in-class verifier primitives; Anthropic itself ships with this loop |
| Cursor 3.9 + Composer 2.5 | Strong codebase indexing; integrated review (BugBot) | Frequent pricing/plan changes; check data-routing/Privacy Mode settings | BugBot + signed-commit-aware branch protection is a near-complete out-of-the-box single-write/many-verify stack |
| OpenAI Codex CLI + GPT-5.5 | Strong sub-agent topology; remote/async execution | Benchmark gap to Anthropic on the harder SWE-bench Pro (~58.6% vs 64.3%); GPT-5.6 tier is limited preview | Strong CLI sub-agent loop, close to autonomous |
| Cognition Devin + Devin Review | Async agent in its own VM, parallel runs, multi-file PR review | Post-acquisition consolidation still settling | Devin Review runs adjacent to your primary agent without competing for attention |
| Cline | Apache-2.0, large install base | UI still plugin-shaped, not agent-first | Effective as the reviewer (not writer) paired with a Claude Code / Cursor primary |
| Aider | Repo-map (tree-sitter + PageRank) for codebase-aware context; mature --test-cmd / --auto-lint | Lower autonomy than Claude Code | Use as the reviewer for another tool's output: call it in CI to verify the diff against the repo-map |
| GitHub Copilot Coding Agent | Signed commits (Apr 3, 2026), Agent-Logs-Url trail (Mar 20, 2026), AGENTS.md support | Less in-editor flow than Cursor for some users | The most tractable trust gate for regulated environments: branch protection + signed commits + log trailers is audit-ready |
The short version: pick your writer (Claude Code / Cursor / Codex) by the UI you trust; pick your reviewer from a different row. The 1.7× gap is closed by the separation, not by the model.
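The separation rule can be stated as code. This is a minimal sketch of a merge gate that enforces "the verifier is never the writer"; the `Verifier` type, the agent names, and the check functions are all hypothetical, and real verifiers would call test runners or review tools rather than inspect a string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verifier:
    name: str                     # e.g. "tests", "reviewer-subagent", "vendor-review"
    agent: str                    # who runs it; must differ from the writer
    check: Callable[[str], bool]  # returns True when the diff passes


def may_merge(diff: str, writer: str, verifiers: list[Verifier]) -> bool:
    """Allow a merge only if every independent verifier passes the same diff."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    for v in verifiers:
        if v.agent == writer:
            raise ValueError(f"verifier {v.name!r} is the writer; pick a different agent")
    return all(v.check(diff) for v in verifiers)
```

Raising on a writer-as-verifier configuration, rather than returning `False`, is a deliberate choice in this sketch: a misconfigured pipeline should fail loudly instead of looking like an ordinary failed review.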
What you should do
A four-step move for your first session today.
- Write a `CLAUDE.md` (or `AGENTS.md`) by hand, never by LLM. Include `## Build`, `## Test`, `## Architecture`, `## Conventions`, `## What NEVER to do`. Don't include anything that contradicts your test suite. Human-written context files help agent performance; auto-generated ones can hurt it.
- Wire a Stop hook (test + build + lint for Claude Code; equivalent `--test-cmd --auto-test` for Aider). ~One hour of setup; recovers a large share of burnt-toast incidents for ~0% velocity cost.
- Add a reviewer subagent in a fresh context window: Claude Code subagents at `.claude/agents/`, Cursor BugBot on PR open (~$1.20/review), or Devin Review for multi-file changes. The reviewer must not be the writer, must not share context, and should use a different model when possible.
- Turn on signed-commit branch protection + Agent-Logs-Url if you're on GitHub Copilot Coding Agent or a cloud agent that supports it. This is the cheapest trust gate: every AI-authored commit is cryptographically traceable to its session log.
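Since the context file is written by hand, a small lint can confirm it still contains the sections listed in the first step. This is a sketch under the assumption that each section appears as its own heading line exactly as spelled above; the function name is made up for this example.

```python
# Headings named in the first step of the list above.
REQUIRED = [
    "## Build",
    "## Test",
    "## Architecture",
    "## Conventions",
    "## What NEVER to do",
]


def missing_sections(markdown: str) -> list[str]:
    """Return the required headings that do not appear as their own line."""
    present = {line.strip() for line in markdown.splitlines()}
    return [heading for heading in REQUIRED if heading not in present]
```

Running `missing_sections` on the file's text in CI and failing on a non-empty result keeps the context file from silently losing a section during edits.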
Two don'ts:
- Don't put a "ship" button in front of an AI PR without strong existing CI. DORA 2025 is explicit: speed gains materialize only when testing, linting, and review culture are already strong — otherwise AI amplifies whatever is broken.
- Don't treat a single 87% benchmark as proof your agent is safe. SWE-bench Pro drops ~20+ points on the same model, METR shows experienced devs going ~19% slower, and Claude Code itself spent ~7 weeks shipping quiet production bugs that benchmarking would not have caught.
The burnt-toast loop is solvable, but only if you stop trusting the writer. The escape is infrastructure, not prompts.
Sources
- Anthropic — Introducing Claude Sonnet 5: https://www.anthropic.com/news/claude-sonnet-5
- TechCrunch — Anthropic launches Claude Sonnet 5: https://techcrunch.com/2026/06/30/anthropic-launches-claude-sonnet-5-as-a-cheaper-way-to-run-agents/
- MarkTechPost — Sonnet 5 vs Sonnet 4.6 vs Opus 4.8 benchmarks/pricing: https://www.marktechpost.com/2026/06/30/anthropic-claude-sonnet-5-vs-sonnet-4-6-vs-opus-4-8-agentic-coding-benchmarks-api-pricing-and-cost-performance-tradeoffs-compared/
- OpenAI — Introducing GPT-5.5: https://openai.com/index/introducing-gpt-5-5/
- llm-stats — GPT-5.5 benchmarks: https://llm-stats.com/models/gpt-5.5
- CodeRabbit — State of AI vs Human Code Generation report: https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
- BusinessWire — CodeRabbit report (Dec 17, 2025): https://www.businesswire.com/news/home/20251217666881/en/
- The Register — AI-authored code contains worse bugs: https://www.theregister.com/2025/12/17/ai_code_bugs/
- Anthropic Research — How AI assistance impacts the formation of coding skills: https://www.anthropic.com/research/AI-assistance-coding-skills
- InfoQ — Anthropic study on skill formation (Feb 2026): https://www.infoq.com/news/2026/02/ai-coding-skill-formation/
- METR — Measuring the Impact of Early-2025 AI on Experienced Developer Productivity: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- METR paper (arXiv): https://arxiv.org/abs/2507.09089
- Simon Willison — Karpathy interview / nanochat: https://simonwillison.net/2025/Oct/18/agi-is-still-a-decade-away/
- Anthropic — Postmortem (April 23, 2026): https://www.anthropic.com/engineering/april-23-postmortem
- Anthropic — A postmortem of three recent issues: https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues
- Cursor — BugBot updates (June 2026): https://cursor.com/blog/bugbot-updates-june-2026
- Cursor — BugBot changes (May 2026): https://cursor.com/blog/may-2026-bugbot-changes
- GitHub Copilot changelog: https://github.blog/changelog/
Verification notes
- Claude/OpenAI model lineup & pricing — confirmed. Sonnet 5 launched Jun 30, 2026 at $2/$10 intro (→$3/$15 after Aug 31); Opus 4.8 at $5/$25 leads SWE-bench Pro at 69.2%; GPT-5.5 launched Apr 23, 2026 at $5/$30 (doubled), 82.7% Terminal-Bench 2.0. No newer-model-priced-worse inversion found. I softened over-precise CLI patch numbers (v2.1.201, rust-v0.142.5, Devin v3.4.22, Cline v3.81, Antigravity 2.0, Roo Code archive date) to "(unverified exact)" since I could not pin each to a primary source at that exact version.
- Anthropic comprehension RCT: corrected date. The brief said "February 2026"; Anthropic published it Jan 29, 2026. The load-bearing stats (50% vs 67%, ~17-point gap, 52 participants) are confirmed; I added that participants were junior engineers learning `trio` and that the AI group finished only ~2 min faster (not significant). I removed the unverified "p=0.01, Cohen's d=0.738" specifics.
- CodeRabbit study: corrected framing. It analyzed 470 PRs total (320 AI-coauthored + 150 human-only), not "470 AI PRs." 1.7× overall, +75% logic, ~8× performance, ~3× readability, ~1.4×/1.7× critical/major all confirmed. I changed security from "1.5–2.74×" to CodeRabbit's own "1.5–2×" (noting some coverage cites 2.7×).
- METR RCT — confirmed (Jul 10, 2025; 16 devs, 246 tasks; ~19% slower; predicted 24% / felt 20% faster).
- Karpathy nanochat quote — confirmed verbatim (Oct 2025).
- Anthropic Claude Code postmortem: confirmed (Apr 23, 2026 postmortem; degradation Mar 4–Apr 20, 2026 ≈ 7 weeks; three overlapping bugs). I adjusted "50 days" to "7 weeks" and named the actual bugs.
- Cursor BugBot: confirmed (~$1.20/review usage-based; June 2026 update: ~3× faster, 22% cheaper, 10% more bugs).
- Downgraded to directional/unverified: the "CHOKE / Simhi et al. EMNLP 2025" calibration citation and several exact tool version/date rows — I could not verify these against primary sources, so I marked them rather than assert them.
