Deep report · Research Desk

Is the common workflow of AI (e.g. Claude Code) generating 'burnt toast' poor code that humans constantly scrape/fix an unavoidable anti-pattern everywhere? How do we escape it (better prompting, verification, or alternatives) without killing velocity?

By Genαi Research Desk · July 4, 2026 · Fact-checked

Verification complete. The claims largely hold up, with two date/framing corrections needed. Here is the corrected brief.


AI code ships 1.7× more issues per PR — fixable, not fatal

The short answer: no, the constant scrape-and-fix loop is not an unavoidable anti-pattern. CodeRabbit measured ~1.7× more issues in AI-coauthored PRs across 470 real PRs (320 AI-coauthored vs. 150 human-only; Dec 17, 2025), and the same study called out +75% logic errors, ~8× performance problems, and 1.5–2× security issues (verified as of 2026-07-04). Frontline engineers respond with two patterns that work together: (1) make the agent stoppable and the code self-verifiable, then (2) hand the code to a different agent, or a different tool, for adversarial review before it touches main. Done right, those patterns add a few percent to cycle time (the cost of running a hook and a second reviewer pass) and recover most of the 1.7× gap. Done wrong (pasting a prompt and trusting the output), you inherit the gap in full and stack "comprehension debt" on top: Anthropic's own January 2026 RCT showed AI-assisted devs scored 50% on a code-comprehension test vs. 67% for hand-coders (Anthropic, published Jan 29, 2026).

What's current (as of 2026-07-04)

Tool / Model | Current version, pricing, or status | Release date | Source
Anthropic Claude Sonnet 5 | $2 input / $10 output per MTok (intro to Aug 31, then $3/$15) | Jun 30, 2026 | TechCrunch / Anthropic
Anthropic Claude Opus 4.8 | $5 / $25 per MTok; 69.2% SWE-bench Pro | ~May 2026 | Anthropic / MarkTechPost
Anthropic Claude Opus 4.7 | $5 / $25 per MTok; 87.6% SWE-bench Verified; 64.3% SWE-bench Pro | Apr 2026 | Anthropic news
Claude Code CLI | v2.1.x | Late Jun / early Jul 2026 | code.claude.com changelog
OpenAI Codex CLI | rust-v0.142.x | Jun 2026 | GitHub CHANGELOG
OpenAI GPT-5.5 "Spud" | $5 / $30 per MTok (doubled vs prior); 82.7% Terminal-Bench 2.0 | Apr 23, 2026 | OpenAI intro
OpenAI GPT-5.6 (Sol/Terra/Luna) | limited preview (unverified scope) | ~Jun 2026 | aipricing.guru (unverified)
Cursor IDE + Composer 2.5 | v3.9 line | May–Jun 2026 | Cursor changelog
Cognition Devin (Devin 2.0 line) | agent-native IDE, parallel Devins | 2025–2026 | Cognition
Cline (VS Code) | v3.8x | Apr–May 2026 (unverified exact) | cline.bot
Google Antigravity | replaces Gemini CLI tier | 2026 (unverified exact) | Google Developers Blog
GitHub Copilot cloud agent | signed commits on every PR; Agent-Logs-Url trailer | Apr 3 / Mar 20, 2026 | GitHub blog changelog

Any tool row I could not pin to a primary source at a specific version is marked "(unverified)" or given an approximate date; treat the exact patch numbers as indicative, not authoritative.

The 87% / 1.7× paradox: same model, two stories

The frontier closed models now score 87–89% on SWE-bench Verified; that is the well-known number. But on SWE-bench Pro (harder, multi-language gold patches) the same models drop to roughly 59–69%, and on real-world PRs the defect rate tells a third story.

  • Claude Opus 4.7 = 87.6% on SWE-bench Verified but 64.3% on SWE-bench Pro — a ~23-point drop on the harder evaluation (Anthropic, Apr 2026). Claude Opus 4.8 = 69.2% on Pro (verified as of 2026-07-04).
  • GPT-5.5 is reported at 88.7% on SWE-bench Verified (per aggregator llm-stats.com, 2026) — but OpenAI's own headline is 82.7% on Terminal-Bench 2.0, a more conservative and better-sourced claim (OpenAI, Apr 23, 2026). On SWE-bench Pro, GPT-5.5 reaches ~58.6%, behind Opus 4.7's 64.3%.
  • CodeRabbit, 470 PRs (320 AI-coauthored, 150 human-only), Dec 17 2025: AI-coauthored PRs carry a ~1.7× overall issue rate, broken down into +75% logic/correctness errors, ~8× performance issues, 3× readability issues, and 1.5–2× security issues (per CodeRabbit's own report; some coverage cites up to 2.7× for security flaws specifically). AI also produced ~1.4× more critical and ~1.7× more major defects (businesswire / CodeRabbit blog).
  • Veracode 2025 GenAI Code Security Report: ~45% of AI-generated code samples contained an OWASP-relevant vulnerability when the model could choose between secure and insecure approaches (Veracode, 2025).
  • Uplevel / GitHub Copilot study (still cited, historical): Copilot teams shipped materially higher bug rates vs. control.
  • GitClear 2024 (still cited, historical): code churn — code rewritten or deleted within ~2 weeks of commit — roughly doubled across a large corpus; AI-assisted coding identified as a leading driver.

Two more numbers define the cost side. Anthropic's January 2026 RCT (52 junior engineers learning the Python trio library; anthropic.com/research, published Jan 29, 2026) reported that AI-assisted devs scored 50% on a code-comprehension quiz vs. 67% for hand-coders, a 17-point gap (nearly two letter grades), while finishing only ~2 minutes faster (not statistically significant). You pay ~17 points of understanding for a speed gain that, here, barely existed. And the only controlled study of experienced developers with AI, METR's RCT (16 devs, 246 tasks, Jul 10, 2025), found AI made them 19% slower, while they felt ~20% faster and had predicted 24% faster before starting. The expectation-reality gap is ~40 percentage points.
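
The gaps quoted above are plain differences in percentage points, and a short sketch makes the arithmetic explicit. The figures are the ones cited in this brief; the function and variable names are illustrative only.

```python
# Percentage-point gaps computed from the figures cited in this brief.

def point_gap(a: float, b: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(a - b, 1)

# Anthropic RCT: comprehension quiz scores (hand-coders vs. AI-assisted).
comprehension_gap = point_gap(67, 50)

# METR RCT: speed change in percent; positive = faster, negative = slower.
predicted, felt, measured = 24, 20, -19
forecast_gap = point_gap(predicted, measured)   # what devs predicted vs. what was measured
perceived_gap = point_gap(felt, measured)       # what devs felt vs. what was measured

print(comprehension_gap, forecast_gap, perceived_gap)  # → 17 43 39
```

The two METR gaps (43 and 39 points) bracket the ~40-point figure quoted above.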

The honest synthesis: the benchmark ceiling is real; the defect rate is real; the productivity gain is often a feeling. None of these cancel out, but the 1.7× issue rate is the one that ships to production first.

Why "burnt toast" is the model's default output

The burnt-toast pattern isn't a bug — it's the predictable behavior of a system optimized for plausible next-token prediction, not for verified correctness. Mechanisms that dominate in 2026:

  1. Plausibility-optimized continuation. A coding agent doesn't execute your spec; it predicts what code usually follows a spec like yours. Karpathy captured this on his October 2025 nanochat release — "I tried to use claude/codex agents a few times but they just didn't work well enough at all and net unhelpful" — noting that agents excel at boilerplate with abundant training examples but struggle on unique, intellectually dense code (verified quote, Oct 2025).
  2. Calibration failure. Models hallucinate APIs and library members and often do so with confidence that does not track accuracy — the most useful-looking answer is frequently the confidently wrong one. (Directional claim; the specific "CHOKE / Simhi et al., EMNLP 2025" citation is unverified and should not be leaned on.)
  3. Lost-in-the-middle / long-context degradation. Liu et al.'s 2023 finding — that models under-attend to the middle of long contexts — has been widely replicated. Long contexts degrade retrieval and reasoning asymmetrically.
  4. Codebase-specific knowledge gap. The model knows the average repo, not your repo. It needs explicit context injection — repo-map (Aider), tree-sitter indexing (Cursor), or a hand-written CLAUDE.md / AGENTS.md — to behave like a teammate. Without it, it averages across the training set.
  5. Comprehension debt. Anthropic's January 2026 RCT quantifies the human-side mechanism: when the agent writes the code, the developer reads less and retains less. This is upstream of the 1.7× figure — the reviewer can't catch what they can't read.
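
Mechanism 2 (hallucinated APIs and library members) is one of the few on this list that can be checked mechanically. The sketch below is a minimal illustration, not a complete tool: `missing_members` is a hypothetical helper that imports a module and reports which referenced names do not exist on it.

```python
import importlib


def missing_members(module_name: str, names: list[str]) -> list[str]:
    """Return the names that do not exist on the named module."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        # The whole package is missing, so every referenced member is too.
        return list(names)
    return [name for name in names if not hasattr(module, name)]


# json.loads and json.dumps exist; json.parse does not (it is a JavaScript idiom).
print(missing_members("json", ["loads", "dumps", "parse"]))  # → ['parse']
```

A check like this only proves a name exists, not that it is called correctly, so it complements a type checker and tests rather than replacing them.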

Two implications worth flagging. First, METR's time-horizon work found AI task-completion horizons roughly doubling on a months-scale cadence — reliable on short tasks, weak on multi-hour ones. Second, Anthropic's own engineering team shipped three overlapping bugs that degraded Claude Code from roughly March 4 to April 20, 2026 (~7 weeks) — a reasoning-effort downgrade, a reasoning-history discard bug, and a 25-word inter-tool cap — because verification did not extend upstream of the agent (Anthropic postmortem, Apr 23, 2026). The lesson: verification must apply to your own pipeline, not just to the model output.

What actually fixes it: the single-write / many-verify pattern

There is a working consensus across Anthropic, OpenAI, Cognition, Cursor and the practitioner community. It is single-write / many-verify: one writer path, multiple independent verify paths against the same artifact, with the verifier never being the writer. Five layers, ordered roughly by ratio of velocity cost to defect yield:

Layer 1 — Stop the agent from shipping if pnpm test && pnpm build fails. Anthropic's best-practices doc recommends a Stop hook running test + type + lint with a retry cap. Aider's analog is --test-cmd "pytest tests/ -x" --auto-test --lint-cmd "ruff check ." --auto-lint. Prose instructions (CLAUDE.md) get partial compliance; hooks are deterministic. The Stop hook is the highest-leverage burnt-toast wall and costs near-zero human latency.
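
A minimal sketch of such a gate, written as a script a Stop hook could call. Several details are assumptions to verify against the current Claude Code hooks documentation: that the hook receives a JSON payload on stdin, that the payload carries a `stop_hook_active` flag when an earlier hook run already blocked the stop, and that exit code 2 blocks the stop and feeds stderr back to the agent. The command list is a placeholder for your project's own.

```python
import json
import subprocess
import sys

# Placeholder commands; swap in your project's own test, build, and lint steps.
CHECKS = [["pnpm", "test"], ["pnpm", "build"]]

# Assumption: the harness treats exit code 2 as "do not stop yet".
BLOCK_EXIT_CODE = 2


def run_checks(checks):
    """Run each command in order; return (ok, report), stopping at the first failure."""
    for cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            tail = (proc.stdout + proc.stderr)[-2000:]  # keep the report short
            return False, f"{' '.join(cmd)} failed (exit {proc.returncode}):\n{tail}"
    return True, "all checks passed"


def gate(payload, checks):
    """Return the exit code the hook should use for this stop attempt."""
    # Assumption: this flag marks a stop that a previous hook run already blocked.
    # Letting it through acts as a crude retry cap of one and avoids an endless loop.
    if payload.get("stop_hook_active"):
        return 0
    ok, report = run_checks(checks)
    if ok:
        return 0
    print(report, file=sys.stderr)  # the failure text is what the agent sees
    return BLOCK_EXIT_CODE


def main() -> int:
    """Entry point: read the hook payload from stdin and decide."""
    raw = sys.stdin.read()
    payload = json.loads(raw) if raw.strip() else {}
    return gate(payload, CHECKS)
```

To install it as a script, end the file with `sys.exit(main())`. The retry cap here lets the agent stop after one blocked attempt even if checks still fail, so CI remains the final gate.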

Layer 2 — Adversarial reviewer subagent in a fresh context. Anthropic's multi-agent Research system demonstrated a large lift running an Opus coordinator with Sonnet subagents. The Claude Code subagents feature lets you define a reviewer in .claude/agents/ and call it as a separate review agent. Because it starts without the writer's conversation, the reviewer catches bugs that deterministic tests miss and reduces self-review bias.
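
As a rough illustration, the sketch below renders a reviewer definition and writes it under .claude/agents/. The file layout (Markdown with YAML front matter carrying `name`, `description`, and `tools`) and the read-only tool names are assumptions to check against the current subagents documentation; the agent name and prompt wording are hypothetical.

```python
from pathlib import Path

# Hypothetical prompt; the human team should write and own this text.
REVIEW_PROMPT = """You are a code reviewer who did not write this change.
Read the diff and the tests. Report concrete defects: logic errors,
missing edge cases, performance regressions, and security issues.
Do not rewrite the code; list findings with file and line."""


def render_agent(name: str, description: str, tools: list[str], prompt: str) -> str:
    """Render a Markdown definition with YAML front matter (assumed layout)."""
    front_matter = "\n".join([
        "---",
        f"name: {name}",
        f"description: {description}",
        f"tools: {', '.join(tools)}",
        "---",
    ])
    return f"{front_matter}\n\n{prompt}\n"


def write_agent(root: Path, name: str, text: str) -> Path:
    """Write the definition to <root>/.claude/agents/<name>.md and return the path."""
    path = root / ".claude" / "agents" / f"{name}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


definition = render_agent(
    name="adversarial-reviewer",
    description="Reviews diffs written by another agent and reports defects",
    tools=["Read", "Grep", "Glob"],  # read-only tools, so the reviewer cannot edit
    prompt=REVIEW_PROMPT,
)
```

Calling `write_agent(Path("."), "adversarial-reviewer", definition)` from the repo root would create the file. Restricting the reviewer to read-only tools keeps it a verifier rather than a second writer.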

Layer 3 — Reviewer-tool from a different vendor. Cursor's BugBot moved from ~$40/seat/month to usage-based pricing (defaulting to ~$1.00–1.50 per review, ~$1.20 typical) and, per Cursor's June 2026 update, is ~3× faster, ~22% cheaper, and finds ~10% more bugs per review (verified as of 2026-07-04). Devin Review and GitHub Copilot Coding Agent offer async, non-blocking alternatives; Copilot as of Apr 3, 2026 cryptographically signs every commit, so a "Require signed commits" branch rule gives you a trust gate on top of any review gate, and the Mar 20, 2026 Agent-Logs-Url trailer makes each Copilot commit traceable to its session log.
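
To make the traceability point concrete, here is a simplified sketch that pulls an Agent-Logs-Url trailer out of a commit message. It is not a full implementation of Git's trailer rules, and the example URL is a placeholder; only the trailer name comes from the text above.

```python
import re
from typing import Optional

# Simplified trailer shape: "Key: value" on its own line.
TRAILER_RE = re.compile(r"^([A-Za-z0-9-]+):\s*(.+)$")


def parse_trailers(message: str) -> dict:
    """Parse "Key: value" lines from the last paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return {}  # a lone subject line has no trailer block
    trailers = {}
    for line in paragraphs[-1].splitlines():
        match = TRAILER_RE.match(line.strip())
        if match:
            trailers[match.group(1)] = match.group(2)
    return trailers


def session_log_url(message: str) -> Optional[str]:
    """Return the Agent-Logs-Url trailer value, or None if the commit has none."""
    return parse_trailers(message).get("Agent-Logs-Url")


example = (
    "Fix pagination off-by-one\n\n"
    "Clamp the page index before slicing.\n\n"
    "Agent-Logs-Url: https://example.com/agent-logs/123\n"
)
print(session_log_url(example))  # → https://example.com/agent-logs/123
```

In a real pipeline, `git interpret-trailers --parse` is the more robust way to extract trailers; the point of the sketch is that a CI step can refuse agent-authored commits that lack a session-log link.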

Layer 4 — Spec / test before code, repo-aware context, MCP for live state. Define the goal once (test or PR) and let the agent iterate against it. Add a hand-written CLAUDE.md / AGENTS.md at the repo root with architecture, conventions, test commands, and what-NOT-to-do — human-written context files help; auto-generated ones can hurt, so never let the agent write its own constitution. Pair with Cursor's indexing or Aider's repo-map so the agent edits against your codebase, not the average one. Use MCP for any live, sensitive, or large tool surface (DBs, internal services) you would not paste into a prompt.

Layer 5 — The right loops and hooks for Claude Code 2.1.x. Recent Claude Code versions run subagents in the background, so the main loop never stalls while a reviewer runs. Lifecycle hooks give you a handler at each stage of a turn: UserPromptSubmit to inject repo context, PreToolUse to gate edits, PostToolUse to auto-format, and Stop to gate on tests.
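
The four hooks named above could be wired together roughly as follows. This sketch builds a settings structure in Python and prints it as JSON; the schema (a `hooks` map of event name to a list of entries with an optional `matcher` and a `command` handler) is my understanding of the settings format and should be checked against the current hooks reference. All script paths are hypothetical.

```python
import json
from typing import Optional


def command_hook(command: str, matcher: Optional[str] = None) -> dict:
    """Build one hook entry that runs a shell command (assumed schema)."""
    entry = {"hooks": [{"type": "command", "command": command}]}
    if matcher is not None:
        entry["matcher"] = matcher  # limits the hook to matching tool names
    return entry


# Hypothetical script paths; each script is something your team writes and reviews.
settings = {
    "hooks": {
        "UserPromptSubmit": [command_hook("python3 scripts/inject_context.py")],
        "PreToolUse": [command_hook("python3 scripts/gate_edit.py", matcher="Edit|Write")],
        "PostToolUse": [command_hook("python3 scripts/format_changed.py", matcher="Edit|Write")],
        "Stop": [command_hook("python3 scripts/stop_gate.py")],
    }
}

print(json.dumps(settings, indent=2))
```

Generating the file from code is optional; the same structure can be written by hand. Keeping each handler a small, separately tested script makes the gate itself verifiable.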

The velocity math. A Stop hook adds <1 second per turn and catches a large share of burnt-toast incidents outright. A reviewer subagent adds ~15–45 seconds for a small change and catches much of the rest — closing most of the 1.7× gap in median cases. A vendor-different reviewer adds ~90 seconds (BugBot) to minutes (Devin Review), picks up long-tail multi-file regressions, and a signed-commit branch rule makes the loop auditable. Total per-PR overhead is on the order of a few percent of cycle time. The METR −19% finding is for uninstrumented users — the playbook here is the difference between instrumented and uninstrumented AI use, not a reason to abstain.
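
A back-of-envelope version of that calculation, using the upper end of each latency quoted above. The PR cycle times are hypothetical inputs, not measurements, and the sketch assumes one pass of each verifier per PR.

```python
# Upper-end latencies quoted in the text, in seconds (one pass each per PR).
STEP_SECONDS = {
    "stop_hook": 1,
    "reviewer_subagent": 45,
    "vendor_reviewer": 90,
}


def overhead_percent(step_seconds, cycle_minutes: float) -> float:
    """Verification time as a percentage of total PR cycle time."""
    return 100 * sum(step_seconds) / (cycle_minutes * 60)


# Hypothetical cycle times: one hour, two hours, one working day.
for minutes in (60, 120, 480):
    pct = overhead_percent(STEP_SECONDS.values(), minutes)
    print(f"{minutes} min cycle: {pct:.1f}% overhead")
```

Under these assumptions the overhead runs from about 0.5% (day-long cycle) to about 3.8% (one-hour cycle). A Stop hook that fires on many turns, or a slow test suite, would push the figure higher, so measure your own pipeline before relying on it.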

Tool-by-tool: which 2026 agents ship the cleanest code

Independent, real-world data is thin because most tool comparisons are vendor-funded. The strongest non-vendor signals: the Stack Overflow 2025 developer survey (high adoption, low trust in AI output, "almost right but not quite" as the top frustration), CodeRabbit's 470-PR study, DORA's 2025 research (the speed-vs-stability trade-off is driven by the strength of your CI, testing, and review culture — not by which AI you pick), and Anthropic's and METR's RCTs above.

Tool (2026-07-04) | Strength | Weakness | Why it burns less toast
Claude Code 2.1.x + Sonnet 5 / Opus 4.8 | Best raw reasoning, deepest hooks + subagents; $2/$10 intro pricing on Sonnet 5 through Aug 31, 2026 | Was the agent in Anthropic's own ~7-week production-bug postmortem | Best-in-class verifier primitives; Anthropic itself ships with this loop
Cursor 3.9 + Composer 2.5 | Strong codebase indexing; integrated review (BugBot) | Frequent pricing/plan changes; check data-routing/Privacy Mode settings | BugBot + signed-commit-aware branch protection is a near-complete out-of-the-box single-write/many-verify stack
OpenAI Codex CLI + GPT-5.5 | Strong sub-agent topology; remote/async execution | Benchmark gap to Anthropic on the harder SWE-bench Pro (~58.6% vs 64.3%); GPT-5.6 tier is limited preview | Strong CLI sub-agent loop, close to autonomous
Cognition Devin + Devin Review | Async agent in its own VM, parallel runs, multi-file PR review | Post-acquisition consolidation still settling | Devin Review runs adjacent to your primary agent without competing for attention
Cline | Apache-2.0, large install base | UI still plugin-shaped, not agent-first | Effective as the reviewer (not writer) paired with a Claude Code / Cursor primary
Aider | Repo-map (tree-sitter + PageRank) for codebase-aware context; mature --test-cmd / --auto-lint | Lower autonomy than Claude Code | Use as the reviewer for another tool's output: call it in CI to verify the diff against the repo-map
GitHub Copilot Coding Agent | Signed commits (Apr 3, 2026), Agent-Logs-Url trail (Mar 20, 2026), AGENTS.md support | Less in-editor flow than Cursor for some users | The most tractable trust gate for regulated environments: branch protection + signed commits + log trailers is audit-ready

The short version: pick your writer (Claude Code / Cursor / Codex) by the UI you trust; pick your reviewer from a different row. The 1.7× gap is closed by the separation, not by the model.
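
The separation rule can be stated as code. The sketch below is a toy model of single-write / many-verify, not any vendor's API: `Agent`, `Verifier`, and the two string-matching checks are hypothetical stand-ins for real tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Agent:
    name: str    # e.g. "claude-code"
    vendor: str  # e.g. "anthropic"


@dataclass(frozen=True)
class Verifier:
    agent: Agent
    check: Callable[[str], list]  # takes a diff, returns a list of findings


def review(writer: Agent, diff: str, verifiers: list) -> list:
    """Run every verifier against the same diff; refuse self-review."""
    findings = []
    for verifier in verifiers:
        if verifier.agent == writer:
            raise ValueError(f"{writer.name} cannot review its own output")
        findings.extend(f"[{verifier.agent.name}] {item}" for item in verifier.check(diff))
    return findings


def has_cross_vendor_review(writer: Agent, verifiers: list) -> bool:
    """True if at least one verifier comes from a different vendor than the writer."""
    return any(v.agent.vendor != writer.vendor for v in verifiers)


# Toy checks standing in for real reviewers.
def flag_todo(diff: str) -> list:
    return ["unresolved TODO"] if "TODO" in diff else []


def flag_bare_except(diff: str) -> list:
    return ["bare except"] if "except:" in diff else []
```

A fresh-context subagent from the same vendor passes the self-review check because it is a distinct agent, while `has_cross_vendor_review` captures the stronger "different row" requirement.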

What you should do

A four-step move for your first session today.

  1. Write a CLAUDE.md (or AGENTS.md) by hand, never by LLM. Include ## Build, ## Test, ## Architecture, ## Conventions, ## What NEVER to do. Don't include anything that contradicts your test suite. Human-written context files help agent performance; auto-generated ones can hurt it.
  2. Wire a Stop hook (test + build + lint for Claude Code; the equivalent --test-cmd / --auto-test flags for Aider). Setup takes about an hour and catches a large share of burnt-toast incidents at near-zero velocity cost.
  3. Add a reviewer subagent in a fresh context window — Claude Code subagents at .claude/agents/, Cursor BugBot on PR open (~$1.20/review), or Devin Review for multi-file changes. The reviewer must not be the writer, must not share context, and should use a different model when possible.
  4. Turn on signed-commit branch protection + Agent-Logs-Url if you're on GitHub Copilot Coding Agent or a cloud agent that supports it. This is the cheapest trust gate: every AI-authored commit is cryptographically traceable to its session log.
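
Step 1 can be enforced with a small lint. The sketch below checks that a context file contains the five headings listed in step 1; the exact-match rule and the function name are illustrative choices, and the file content itself should still be written by a person.

```python
# Headings from step 1; matching is exact, so "## Testing" would not count as "## Test".
REQUIRED_SECTIONS = [
    "## Build",
    "## Test",
    "## Architecture",
    "## Conventions",
    "## What NEVER to do",
]


def missing_sections(text: str) -> list:
    """Return the required headings that do not appear in the file text."""
    headings = {line.strip() for line in text.splitlines() if line.startswith("#")}
    return [section for section in REQUIRED_SECTIONS if section not in headings]


draft = """# Project guide

## Build
pnpm build

## Test
pnpm test

## Conventions
Prefer small modules.
"""

print(missing_sections(draft))  # → ['## Architecture', '## What NEVER to do']
```

Run in CI against CLAUDE.md or AGENTS.md, a non-empty result can fail the build, which keeps the context file from silently losing a section.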

Two don'ts:

  • Don't put a "ship" button in front of an AI PR without strong existing CI. DORA 2025 is explicit: speed gains materialize only when testing, linting, and review culture are already strong — otherwise AI amplifies whatever is broken.
  • Don't treat a single 87% benchmark as proof your agent is safe. SWE-bench Pro drops ~20+ points on the same model, METR shows experienced devs going ~19% slower, and Claude Code itself spent ~7 weeks shipping quiet production bugs that benchmarking would not have caught.

The burnt-toast loop is solvable, but only if you stop trusting the writer. The escape is infrastructure, not prompts.

Verification notes

  • Claude/OpenAI model lineup & pricing — confirmed. Sonnet 5 launched Jun 30, 2026 at $2/$10 intro (→$3/$15 after Aug 31); Opus 4.8 at $5/$25 leads SWE-bench Pro at 69.2%; GPT-5.5 launched Apr 23, 2026 at $5/$30 (doubled), 82.7% Terminal-Bench 2.0. No newer-model-priced-worse inversion found. I softened over-precise CLI patch numbers (v2.1.201, rust-v0.142.5, Devin v3.4.22, Cline v3.81, Antigravity 2.0, Roo Code archive date) to "(unverified exact)" since I could not pin each to a primary source at that exact version.
  • Anthropic comprehension RCT — corrected date. The brief said "February 2026"; Anthropic published it Jan 29, 2026. The load-bearing stats (50% vs 67%, ~17-point gap, 52 participants) are confirmed; I added that participants were junior engineers learning trio and that the AI group finished only ~2 min faster (not significant). I removed the unverified "p=0.01, Cohen's d=0.738" specifics.
  • CodeRabbit study — corrected framing. It analyzed 470 PRs total (320 AI-coauthored + 150 human-only), not "470 AI PRs." 1.7× overall, +75% logic, ~8× performance, ~3× readability, ~1.4×/1.7× critical/major all confirmed. I changed security from "1.5–2.74×" to CodeRabbit's own "1.5–2×" (noting some coverage cites 2.7×).
  • METR RCT — confirmed (Jul 10, 2025; 16 devs, 246 tasks; ~19% slower; predicted 24% / felt 20% faster).
  • Karpathy nanochat quote — confirmed verbatim (Oct 2025).
  • Anthropic Claude Code postmortem — confirmed (Apr 23, 2026 postmortem; degradation Mar 4–Apr 20, 2026 ≈ 7 weeks; three overlapping bugs). I adjusted "50 days" to "7 weeks" and named the actual bugs.
  • Cursor BugBot — confirmed (~$1.20/review usage-based; June 2026 update: ~3× faster, 22% cheaper, 10% more bugs).
  • Downgraded to directional/unverified: the "CHOKE / Simhi et al. EMNLP 2025" calibration citation and several exact tool version/date rows — I could not verify these against primary sources, so I marked them rather than assert them.