
The Ralph Wiggum Loop: Why Stateless Agents Beat Smart Ones

Clearing an agent's context window every single iteration sounds idiotic — and it's one of the most reliable ways to run a coding agent for hundreds of turns.

June 10, 2026 · 9 min read
Tags: Ralph Wiggum loop, Geoffrey Huntley, stateless agent, context rotation

The dumbest agent harness in production software is one line of bash: while :; do cat PROMPT.md | claude-code; done. No memory, no orchestration framework, no steering — the same prompt fed to a brand-new agent process, forever, with the context window wiped to zero every single iteration. Geoffrey Huntley published it in July 2025 under the name "Ralph Wiggum as a 'software engineer'", and within a year it had an official Anthropic plugin, a Vercel Labs implementation, and 40+ community variants across Cursor, Gemini CLI, Codex, and Copilot.

TL;DR: The Ralph Wiggum loop re-runs a stateless agent against a static prompt, externalizing all memory to files and git commits. It trades context accumulation and mid-task steering for two things that matter more in long-running automation: immunity to context rot, and failure modes you can actually inspect. On tasks with a deterministic "done" signal, it routinely outlasts smarter, stateful agents.

Key takeaways

  • The loop's core invariant is a fresh process per iteration — the agent never sees its own conversation history, only the files and git log it left behind.
  • Huntley's design principle: "It's better to fail predictably than succeed unpredictably." Ten failed iterations leave ten inspectable commits; one lucky success in a long session leaves nothing reusable.
  • State lives in three artifacts: a prd.json task list with passes flags, a progress.txt log with a curated "Codebase Patterns" section, and git commits keyed to story IDs.
  • Ralph wins on TDD loops, spec-driven greenfield builds, migrations, and test-covered refactors. It loses on exploratory design, ambiguous specs, and conversational debugging.
  • The cost figures circulating ($3.47 sessions, "$50K contract for $297") are practitioner-reported and uncorroborated — but the structural cost argument (short contexts are cheap; cost scales with iteration count, not context length) is sound.

The loop nobody believed in

Figure 1: The Ralph Wiggum Loop: Why Stateless Agents Beat Smart Ones

The name is a double joke. Ralph Wiggum is The Simpsons' lovably oblivious nine-year-old who keeps going despite every setback; "ralph" is also Australian slang for vomiting, which Huntley uses as a dark-humor gloss on the volume of messy output the loop produces before converging. Huntley frames the operating model as a child and a playground:

"It begins with no playground, and Ralph is given instructions to construct one. Ralph is very good at making playgrounds, but he comes home bruised because he fell off the slide, so one then tunes Ralph by adding a sign next to the slide saying 'SLIDE DOWN, DON'T JUMP, LOOK AROUND,' and Ralph is more likely to look and see the sign."

That's the entire tuning model. You don't intervene in the agent's reasoning. You adjust the signs — the prompt, the guardrail files, the acceptance criteria — and run it again. The agent stays dumb; the environment gets smarter.

The philosophical core is a quote from the original post:

"The technique is deterministically bad in an undeterministic world. It's better to fail predictably than succeed unpredictably."

This inverts how most agent systems are evaluated. A continuous-conversation agent that succeeds once is a black box — the success may be luck, and the next task in the same session can fail for unrelated reasons buried in a 200k-token transcript. A Ralph loop that fails ten times before succeeding has left ten commits, ten logs, and ten state snapshots you can bisect. Predictable failure is a substrate you can engineer against. Unpredictable success is not.

How the mechanism actually works

Figure 2: The Ralph Wiggum Loop: Why Stateless Agents Beat Smart Ones

Fresh context, every time

The counterintuitive move is the refusal to use the context window as memory. Conventional agent wisdom says accumulate: keep the relevant history in context, let the model build understanding across turns. Ralph does the opposite — every iteration starts at zero tokens, and anything that must persist gets written to disk and re-read.

The justification is twofold. First, context rot: long sessions accumulate dead-end reasoning, failed tool-call transcripts, and noise the model must attend to on every subsequent turn. Resetting removes that tax entirely. Second, state legibility: when the agent is forced to externalize its memory, the operator can read it, edit it, and version it. The mikeyobrien/ralph-orchestrator README codifies this as "Fresh Context Is Reliability" — any state important enough to survive an iteration is too important to leave to the model's memory.

The three-file state model

Implementations converge on a small, consistent state model. The most thoroughly documented is Ryan Carson's snarktank/ralph, the most-starred community implementation. Its three state artifacts, plus the static instruction file the agent reads alongside them, are:

| Artifact | Role | Lifecycle |
| --- | --- | --- |
| prd.json | The plan: user stories with acceptanceCriteria, priority, and a passes: false flag | Human writes once; agent flips flags as stories complete |
| progress.txt | Iteration log, topped by a sticky-note ## Codebase Patterns section | Agent reads the top section every iteration, appends a block after |
| Git commits | The only durable cross-iteration memory; messages encode story IDs (feat: US-003 - Add login endpoint) | One commit per completed story |
| AGENTS.md / CLAUDE.md | Project conventions, build commands, do-not-touch lists | Static; human-tuned between runs |
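As a rough illustration of the plan file, the sketch below writes a sample prd.json into a scratch directory and counts the stories still open. Only acceptanceCriteria, priority, and passes come from the description above; the userStories wrapper, the id and title fields, and the story text are assumptions made for illustration, not the snarktank schema.

```shell
# Scratch directory so the demo never touches a real project.
demo_dir=$(mktemp -d)

# Hypothetical prd.json: every field name other than acceptanceCriteria,
# priority, and passes is an assumption made for illustration.
cat > "$demo_dir/prd.json" <<'EOF'
{
  "userStories": [
    {
      "id": "US-003",
      "title": "Add login endpoint",
      "priority": 1,
      "acceptanceCriteria": ["POST /login returns 200 for valid credentials"],
      "passes": true
    },
    {
      "id": "US-004",
      "title": "Add logout endpoint",
      "priority": 2,
      "acceptanceCriteria": ["POST /logout clears the session cookie"],
      "passes": false
    }
  ]
}
EOF

# Count stories still open; the job is finished when this reaches zero.
# grep exits non-zero on a count of 0, hence the "|| true".
remaining=$(grep -c '"passes": false' "$demo_dir/prd.json" || true)
echo "stories remaining: $remaining"
```

The grep depends on the exact `"passes": false` spacing, which is fine for a demo; a real harness would more likely read the file with a JSON parser such as jq.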

The ## Codebase Patterns section is the pattern's most distinctive idea. It's not the raw transcript; it's a curated digest of what worked, which the agent re-reads at the start of every iteration. As the Anthropic plugin README puts it: "Each iteration sees modified files and git history. Claude autonomously improves by reading its own past work in files." The agent gets memory, but only memory that someone (the agent itself, in a previous iteration) decided was worth keeping. That's context rotation as an engineering discipline: the transcript is disposable; the digest is versioned.
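To make the "read the digest, append a block" lifecycle concrete, here is a minimal sketch using awk. The file contents, the "## Iteration N" heading style, and the read_patterns helper are all hypothetical; the only detail taken from the text is that a ## Codebase Patterns section sits at the top of progress.txt.

```shell
demo_dir=$(mktemp -d)
progress="$demo_dir/progress.txt"

# Hypothetical progress.txt: the digest sits on top, iteration blocks follow.
cat > "$progress" <<'EOF'
## Codebase Patterns
- Run `make test` before committing
- Migrations live in db/migrate

## Iteration 1
Implemented US-003; login tests green.
EOF

# Print only the digest: from its heading up to, but not including,
# the next "## " heading.
read_patterns() {
  awk '/^## Codebase Patterns$/ {on=1; print; next}
       /^## /                  {on=0}
       on' "$1"
}

# Append a new iteration block at the end; the digest is left untouched.
printf '\n## Iteration 2\nImplemented US-004; logout tests green.\n' >> "$progress"

read_patterns "$progress"
```

Reading just the top section keeps the per-iteration input small even as the log underneath grows, which is the property the digest exists to protect.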

Git does the rest. A run that crashes at iteration 47 of 50 resumes from git log, because every prior story is committed and every prior iteration's reasoning is reconstructible from the diff. mikeyobrien's orchestrator calls this "git-based asynchronous checkpointing for state recovery."
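A resume step along these lines could recover the list of finished stories from commit subjects. This is a sketch, assuming the "feat: US-003 - ..." message convention quoted above; completed_stories is a hypothetical helper, and the sample subjects stand in for a real log so the snippet runs outside a repository.

```shell
# Extract the unique story IDs found in commit subjects read on stdin.
completed_stories() {
  grep -oE 'US-[0-9]+' | sort -u
}

# In a real repository: git log --format=%s | completed_stories
# Here, sample subjects stand in for the log so the sketch runs anywhere.
done_ids=$(printf '%s\n' \
  'feat: US-003 - Add login endpoint' \
  'chore: bump dependencies' \
  'feat: US-001 - Create users table' \
  'fix: US-003 - Handle empty password' | completed_stories)

echo "$done_ids"
```

Comparing this list against the passes flags in prd.json is one way to spot a run that crashed between committing a story and updating the plan.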

The minimal runnable version

A production-leaning Ralph loop is about fifteen lines:

```shell
#!/usr/bin/env bash
# ralph.sh: minimal Ralph loop
MAX_ITERATIONS=${MAX_ITERATIONS:-10}

for i in $(seq 1 "$MAX_ITERATIONS"); do
  echo "=== Iteration $i of $MAX_ITERATIONS ==="
  # Fresh agent process on every pass; its only input is the static prompt.
  cat prompt.md | claude --print 2>&1 | tee last_iteration.log
  # Stop as soon as the agent emits the completion sentinel.
  if grep -q "<promise>COMPLETE</promise>" last_iteration.log; then
    echo "All stories complete."
    exit 0
  fi
done
echo "Hit MAX_ITERATIONS without completion sentinel."
exit 1
```

The prompt tells the agent to read prd.json, pick the highest-priority story with passes: false, implement it against its acceptance criteria, run verification, commit with the story ID, update progress.txt, and emit the sentinel only when every story passes. That's it. The snarktank version wraps this in 113 lines with a structured PRD workflow; the Anthropic plugin (renamed from ralph-wiggum in PR #142) replaces the bash loop with a Stop hook that exits with code 2 to block session termination and re-feed the prompt. The implementation is interchangeable; the four invariants (fresh invocation, static prompt, deterministic sentinel, externalized state) are not.
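The prompt described above might look something like the following. The wording is purely illustrative and not taken from any of the implementations mentioned; the only fixed point is the sentinel string, which must match what the loop greps for.

```shell
demo_dir=$(mktemp -d)

# Hypothetical prompt.md; the step wording is an assumption, not a quoted prompt.
cat > "$demo_dir/prompt.md" <<'EOF'
1. Read prd.json and the "## Codebase Patterns" section of progress.txt.
2. Pick the highest-priority story whose passes flag is false.
3. Implement it against its acceptance criteria, then run verification.
4. If verification passes: set passes to true, commit with the story ID
   in the message, and append an iteration block to progress.txt.
5. Only when every story passes, print <promise>COMPLETE</promise>.
EOF

# The harness and the prompt must agree on the sentinel string exactly.
grep -c '<promise>COMPLETE</promise>' "$demo_dir/prompt.md"
```

Because the prompt itself contains the sentinel, the loop should grep only the agent's output, never a log that also echoes the prompt back.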

Portability comes from AGENTS.md, the project-instruction convention OpenAI released in August 2025 and transferred to the Linux Foundation's Agentic AI Foundation, now used by 60,000+ open-source projects. Because the loop's only coupling is to the file layout, the agent at the bottom is swappable — hence Ralph variants for Cursor, Gemini CLI, and beyond.

Why stateless beats smart — and where it doesn't

The honest framing is that Ralph is a trade, not a free lunch. It buys predictability and bounded cost by selling context accumulation and mid-execution steering.

| Dimension | Continuous-conversation agent | Ralph loop |
| --- | --- | --- |
| State | In-context, long, accumulating | On disk + git; durable across iterations |
| Failure mode | Silent drift as context degrades | Predictable, inspectable per-iteration |
| Steering | Conversational, mid-task | Pre-execution only, via files and guardrails |
| Cost | Unbounded; long context = long bills | Capped by MAX_ITERATIONS |
| Best for | Exploration, design, debugging | Greenfield from spec, refactors, migrations, TDD |

Ralph wins when you can write the sentence "I will know this is done when X passes." A failing test suite, a green migration, a fully-specified PRD — these are deterministic acceptance signals, and the loop will iterate toward them indefinitely without a human in the chair. Huntley's most-cited validation case is cursed, a Gen Z programming language built end-to-end inside a Ralph loop.

Ralph loses when the spec itself is the deliverable. If the prd.json is wrong, the loop will faithfully build the wrong thing ten times. Open-ended design has no acceptance signal; conversational debugging needs exactly the accumulated context the reset throws away; multi-file architectural judgment calls suffer when the model can't hold competing options in mind across iterations. Marc Puig's critique, "Ralph Loop Is Innovative. I Wouldn't Use It for Anything That Matters", is accurate for tasks like these. The Ralph camp's rebuttal is that those aren't Ralph tasks: the loop is what you reach for after you've decided what the spec is, not the tool for deciding.

On cost, be skeptical of specific numbers. Huntley's repo tagline promises costs "less than a fast food worker's wage"; one practitioner blog reports a $3.47 session for a completed PRD; The Register framed it at $10.42/hour in January 2026; and a viral "$50,000 contract for $297" claim has no findable primary source. None of these are benchmarks. What is defensible is the structure: empty-context iterations are cheap per call, so total cost scales with iteration count rather than with an ever-growing transcript — the opposite of long-session economics.
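The structural claim can be made concrete with a toy calculation. Every number below is an invented assumption rather than a measurement, and the model deliberately ignores output tokens, prompt caching, and context growth inside a single iteration; it only contrasts a transcript that is re-sent and grows each turn with a fixed context that is re-read each iteration.

```shell
# Toy model; every number here is a made-up assumption, not a measurement.
turns=50            # total agent turns (treated as one turn per iteration)
growth=2000         # tokens each turn adds to a continuous transcript
fresh_context=6000  # tokens a fresh iteration re-reads (prompt + state files)

# Continuous session: turn k re-sends k * growth tokens, so the total is
# growth * (1 + 2 + ... + turns) = growth * turns * (turns + 1) / 2.
continuous=$(( growth * turns * (turns + 1) / 2 ))

# Stateless loop: every iteration pays the same fixed context.
stateless=$(( fresh_context * turns ))

echo "continuous input tokens: $continuous"
echo "stateless input tokens:  $stateless"
```

Under these assumptions the continuous session reads 2,550,000 input tokens against 300,000 for the loop. The gap comes from quadratic versus linear growth in the number of turns, not from the particular values chosen.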

What this means for you

If you run coding agents for more than a handful of turns, four pieces of Ralph are worth adopting even if you never run the loop itself:

  1. Steal the state model first. prd.json with testable acceptance criteria + a curated progress digest + story-keyed git commits works under any agent CLI. It's the durable contribution, and it makes any harness — looped or conversational — resumable and auditable.
  2. Write the acceptance signal before the prompt. The loop's real discipline is forcing you to define "done" as something a script can grep for. If you can't, that's a signal the task needs a human conversation, not more iterations.
  3. Cap and checkpoint. Every serious implementation bounds iterations (MAX_ITERATIONS=10 is the snarktank default), runs in a fresh worktree, and logs each iteration to disk. The loop is safe because of its configuration, not despite it.
  4. Use it for the work you'd be embarrassed to do by hand. Migrations, framework upgrades, making red tests green, porting modules. Keep design, debugging, and ambiguity in interactive sessions.
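Items 2 and 3 can be combined in the harness: rather than trusting the agent's sentinel alone, the loop can re-run the acceptance command itself before stopping. The sketch below is one possible hardening and is not taken from any of the implementations above; accept_run and its arguments are hypothetical names, and `true` and `false` stand in for a real test command.

```shell
# accept_run LOG CMD...: succeed only if LOG contains the sentinel
# AND the acceptance command CMD exits 0 when re-run by the harness.
accept_run() {
  local log=$1
  shift
  grep -q '<promise>COMPLETE</promise>' "$log" && "$@" >/dev/null 2>&1
}

# Demo with stand-in commands: `true` plays a passing test suite,
# `false` plays a failing one.
demo_log=$(mktemp)
echo 'All done. <promise>COMPLETE</promise>' > "$demo_log"

accept_run "$demo_log" true  && echo "accepted"
accept_run "$demo_log" false || echo "rejected: sentinel present but checks fail"
```

In the minimal loop shown earlier, a check like this would replace the bare grep, so that a premature sentinel costs one more iteration instead of ending the run.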

One caution for anyone reading derivative write-ups: there is no canonical Huntley-published "five phases" of the Ralph loop. Primary sources diverge — snarktank runs a pick→implement→commit micro-cycle, mikeyobrien uses a planner/builder/reviewer "hats" architecture, and the Anthropic plugin has no phase structure at all. Phase taxonomies you'll find elsewhere are practitioner abstractions, not specification.

The deeper lesson outlasts the meme. Huntley's January 2026 follow-up, "everything is a ralph loop", generalizes the pattern to any task with a deterministic acceptance signal — and the generalization is the point. The Ralph Wiggum loop works not because the agent is smart but because the harness makes failure cheap, legible, and recoverable. Context rotation, externalized state, and deterministic stop conditions are not workarounds for weak models; they're what engineering an agent actually looks like. The Simpsons reference is a joke. The pattern underneath it is not.

Frequently asked questions

What is the Ralph Wiggum loop?

It's a pattern for running AI coding agents, coined by Geoffrey Huntley in July 2025: a bash while-loop pipes the same static prompt into a brand-new agent process on every iteration. The agent starts with zero context each time, and all persistent state lives in files (a PRD, a progress log) and git commits rather than the model's context window.

Why would you deliberately clear the agent's context every iteration?

Two reasons: context rot and legibility. Long-running sessions accumulate dead-end reasoning and failed tool calls that degrade every subsequent turn, and resetting sidesteps that entirely. Forcing state onto disk also means the operator can inspect, edit, and version the agent's memory directly instead of treating it as an opaque transcript.

When does the Ralph loop beat a continuous-conversation agent?

When the task has a deterministic acceptance signal and a bounded state space — TDD loops, greenfield builds from a written spec, migrations, and large refactors covered by tests. It loses badly on exploratory design, ambiguous requirements, and debugging that needs conversational back-and-forth, because the per-iteration reset discards exactly the context that work requires.

Is the Ralph loop production-ready?

It's going mainstream, but its scope is narrow. Anthropic ships it as an official Claude Code plugin and Vercel Labs maintains an AI SDK implementation, with 40+ community variants catalogued by mid-2026. Canonical implementations gate risk with iteration caps, git checkpointing, and guardrail files, but the technique only works on tasks where 'done' can be written down as a testable condition.

How much does a Ralph loop session cost?

Reported figures range from a few dollars for a small PRD to journalistic framings around $10/hour, but none of these are benchmarks — they're uncorroborated practitioner reports. The structural point holds, though: fresh-context iterations are cheap, so total cost is driven by iteration count rather than ballooning context length.