How is harness engineering different from context engineering?

OpenAI's own contrast is the cleanest answer: context engineering asks what the agent should see, while harness engineering asks what the system should prevent, measure, and correct. Context engineering covers one pillar (inform); harness engineering adds constrain, verify, and correct on top of it.

Does harness engineering measurably improve AI agent reliability?

No primary vendor has published a reliability percentage attributable to harness changes alone. OpenAI's post reports throughput figures (about 1M lines of code, ~1,500 PRs in five months) rather than success rates. The closest number is Daniel Vaughan's April 2026 blog finding that the same model scored 16 points higher in a different tool, which should be treated as a blog claim, not a benchmark.

What is AGENTS.md and how should I use it?

AGENTS.md is a short instruction file in the repository root, loaded at the start of every agent session. OpenAI's guidance is to keep it around 100 lines and treat it as a map pointing to deeper versioned docs, since anything the agent cannot access in context effectively does not exist.

What should a team build first for a production agent harness?

In order: a short AGENTS.md, typed tools exposed as MCP servers, sandboxing around every command, observability the agent can read on demand, a test-and-retry verification loop, and a pull-request review gate. Teams that ship those six have a working harness; eval gates, persistent memory, and rollback plans turn it into a system that survives production.

Harness Engineering: Why Agent Reliability Beats Model IQ

A team of three OpenAI engineers (later seven) shipped an internal product over roughly five months with zero lines of manually written code: about one million lines shipped, roughly 1,500 pull requests merged, averaging 3.5 PRs per engineer per day. Those figures come from OpenAI's "Harness engineering" post by Ryan Lopopolo, published February 11, 2026, and they're the founding evidence for a discipline that's quickly become the most important layer in production agents.

Harness engineering is the discipline of designing the system that wraps an AI agent: constraining what it can do, informing it with the right context, verifying its work, and correcting it when it fails.

TL;DR: The marginal gain in production agents now comes from the harness around the model, at least as much as from the model itself. OpenAI's Codex team formalized the practice as four verbs (constrain, inform, verify, correct), and Anthropic, Microsoft, and GitHub have converged on the same architecture under different names. No vendor has published a reliability delta for harness work yet, but the build order is now clear and reproducible.

Key takeaways:

OpenAI's harness engineering post is a process report, with real throughput numbers (1M LOC, ~1,500 PRs, an estimated 1/10th of the calendar time hand-writing would have taken) and no published reliability percentage.
Harness engineering subsumes context engineering. Prompt engineering ⊂ context engineering ⊂ harness engineering.
MCP is the cross-vendor substrate: OpenAI Codex, Anthropic Claude, GitHub Copilot, and Microsoft Foundry all support MCP servers as the typed tool surface.
A minimal production harness is six components: AGENTS.md, MCP tools, sandbox, observability, retry loop, PR gate.

What is harness engineering?

Harness engineering names the system around the agent rather than the agent itself. OpenAI's post organizes it as four pillars: constrain what the agent can do, inform it with the right context, verify that the work is correct, and correct it when it goes wrong.

The post's sharpest line is its job description: "the engineer's primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work."

The model writes the code; the harness decides whether it ships.

The term itself crossed vendors within weeks. Anthropic published "Harness design for long-running application development" in March 2026 using the same word for the same closed-loop system, and Martin Fowler's bliki entry treats it as an established practice area. OpenAI supplied the name; the architecture was already converging.

How does harness engineering differ from context engineering?

OpenAI draws the line directly in the post: "Context engineering asks: what should the agent see? Harness engineering asks: what should the system prevent, measure, and correct?"

That makes context engineering, the term popularized by Shopify's Tobi Lütke in June 2025 and developed in Anthropic's engineering writing, one pillar of four. It covers inform. The harness adds constrain, verify, and correct on top.

Term	Origin	Scope relative to harness engineering
Prompt engineering	Community, 2022-2023	The instruction text only; a strict subset
Context engineering	Tobi Lütke (Jun 2025), Anthropic	The full information state the agent sees; the "inform" pillar
Scaffolding	LangChain / ReAct lineage	Tools and support environment; lacks the verify/correct loop
Agent orchestration	Microsoft Semantic Kernel, AutoGen	A multi-agent pattern that runs inside a harness
Harness engineering	OpenAI (Feb 2026)	The closed loop: constrain, inform, verify, correct

One clarification worth making, because tooling pages still blur it: agent orchestration is a technique, and harness engineering is the system it runs in. OpenAI's Codex experiment was a single powerful agent in a heavy harness rather than a swarm.

The Hugging Face agent glossary is the best independent reference for keeping these terms straight.

What did OpenAI actually build around Codex agents?

Five techniques carried the experiment, and each maps to a pillar.

A short AGENTS.md as a map. The team kept a roughly 100-line AGENTS.md in the repo root, loaded at the start of every session, pointing to deeper versioned design docs. The post's instruction, verbatim: "give Codex a map, not a 1,000-page instruction manual." And its hardest rule: "Anything the agent cannot access in-context does not exist."
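One way to keep the map honest is to lint it in CI. The sketch below is an illustration, not anything the OpenAI post describes: the function name `lint_agents_md`, the 100-line default, and the Markdown-link convention for pointers are all assumptions. It checks that AGENTS.md stays within a line budget and that every relative link resolves to a file in the repository.

```python
import re
from pathlib import Path

MAX_LINES = 100  # assumed budget, taken from the ~100-line figure above
# Markdown links such as [Design](docs/design.md); pure #anchors are ignored.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)#\s]+)")


def lint_agents_md(repo_root: Path, max_lines: int = MAX_LINES) -> list[str]:
    """Return a list of problems; an empty list means the map is healthy."""
    agents_file = repo_root / "AGENTS.md"
    if not agents_file.is_file():
        return ["AGENTS.md is missing from the repository root"]

    text = agents_file.read_text(encoding="utf-8")
    problems = []

    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"AGENTS.md has {line_count} lines (budget: {max_lines})")

    # Every relative pointer should lead to something the agent can open.
    for target in LINK_PATTERN.findall(text):
        if "://" in target:
            continue  # external URLs are out of scope for this check
        if not (repo_root / target).exists():
            problems.append(f"broken pointer: {target}")

    return problems
```

Failing the build on a non-empty result keeps the file short and its pointers live, which is the property the "map" framing depends on.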

Repository knowledge as the system of record. Design decisions, execution plans, and quality grades live in the repo next to the code, versioned and reviewed through PRs. The agent's persistent memory is the repository, and the context window is just its working set.

Sandboxing on every command. The Codex CLI uses macOS Seatbelt and Linux Landlock plus seccomp, per Simon Willison's analysis, and OpenAI documented its Windows sandbox separately in May 2026. Constraint bounds the blast radius of every other failure.
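OS-level sandboxes such as Seatbelt or Landlock cannot be reproduced in a few lines, but the deny-by-default idea behind them can be illustrated with a policy check that runs before any command does. This is a hypothetical sketch: `CommandPolicy` and its fields are invented for illustration, and it is a policy layer in front of a real sandbox, not a substitute for one.

```python
import shlex
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CommandPolicy:
    """Deny-by-default gate; a policy layer, not an OS-level sandbox."""

    workspace: Path
    allowed_programs: frozenset

    def check(self, command: str) -> tuple[bool, str]:
        try:
            argv = shlex.split(command)
        except ValueError:
            return False, "unparseable command"
        if not argv:
            return False, "empty command"
        if argv[0] not in self.allowed_programs:
            return False, f"program not allowlisted: {argv[0]}"

        root = self.workspace.resolve()
        for arg in argv[1:]:
            if arg.startswith("-"):
                continue  # flags are skipped here; a real gate would parse them too
            # Treat every other argument as a path and refuse anything outside the workspace.
            if not (root / arg).resolve().is_relative_to(root):
                return False, f"path escapes workspace: {arg}"
        return True, "ok"
```

An approved command should then be executed as an argument list without a shell, so that tokens such as `&&` are never interpreted.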

Dynamic observability. The application boots per worktree, and logs, metrics, and screenshots are exposed on demand. The team wired Chrome DevTools Protocol into the runtime so agents could see the UI they were modifying, pulling current state instead of pre-loading it.
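The "pull current state on demand" idea can be sketched as a small query function over structured logs. Everything here is an assumption for illustration: the JSON-lines format, the `level` and `message` field names, and the `query_logs` helper itself. The point is that the agent asks for a narrow slice instead of having the whole log pre-loaded into context.

```python
import json
from typing import Iterable, Optional


def query_logs(
    lines: Iterable[str],
    level: Optional[str] = None,
    contains: Optional[str] = None,
    limit: int = 20,
) -> list[dict]:
    """Return the most recent matching records from JSON-lines log output."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise rather than failing the query
        if not isinstance(record, dict):
            continue
        if level and record.get("level") != level:
            continue
        if contains and contains not in str(record.get("message", "")):
            continue
        matches.append(record)
    # Cap the result so a noisy service cannot flood the context window.
    return matches[-limit:] if limit > 0 else []
```

The `limit` cap is the design choice that matters: observability for an agent has to be bounded, or it recreates the context problem it was meant to solve.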

Verification through CI and PR review. Codex's default behavior is to iteratively run tests until they pass. Human steering happens at the pull request, with an agent-to-agent review pass before the human sees the change. The reviewer's job shifts to confirming the agent's self-tests actually exercise the change.
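The run-tests-until-green behavior can be sketched as a bounded loop. This is a minimal illustration under stated assumptions, not Codex's implementation: `verify_loop`, `command_runner`, and the `attempt_fix` callback (standing in for "hand the failure output back to the agent") are hypothetical names.

```python
import subprocess
from typing import Callable, Sequence, Tuple

TestRunner = Callable[[], Tuple[bool, str]]


def command_runner(argv: Sequence[str]) -> TestRunner:
    """Wrap a test command so it reports (passed, combined output)."""

    def run() -> Tuple[bool, str]:
        result = subprocess.run(list(argv), capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

    return run


def verify_loop(
    run_tests: TestRunner,
    attempt_fix: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Run tests up to max_attempts times, feeding each failure back to the agent."""
    for attempt in range(1, max_attempts + 1):
        passed, output = run_tests()
        if passed:
            return True  # green: the change can move on to review
        if attempt < max_attempts:
            attempt_fix(output)  # the next attempt is informed by the failure
    return False  # budget exhausted: escalate to a human instead of looping
```

A caller might pass `command_runner(["pytest", "-q"])` as `run_tests`. The attempt cap matters as much as the retry: an unbounded loop hides failures that a human should see.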

Is the industry converging on the same harness?

Yes, and the connective tissue is MCP. As of mid-2026, OpenAI Codex, Anthropic Claude, GitHub Copilot, and Microsoft Foundry all support Model Context Protocol servers as the typed, discoverable, auditable tool surface.

Microsoft and Anthropic even co-shipped an official C# SDK for MCP, and Microsoft Build 2026 (June 2-3) leaned heavily on MCP across Agent Framework 1.0 and Foundry.

The vocabulary diverges while the architecture converges. Anthropic packages agent instructions as Skills: a folder with a SKILL.md plus assets, lazy-loaded so only the name and description sit in context until needed.

That solves the same problem as AGENTS.md plus deep docs, at a finer granularity. GitHub Copilot added Agent Skills support on December 18, 2025.
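The lazy-loading pattern can be sketched as a two-step lookup: index only the name and description, then read the full file when a skill is actually selected. The sketch assumes a simple `key: value` front-matter block between `---` lines at the top of each SKILL.md; that layout and the helper names are assumptions for illustration, not a vendor specification.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SkillSummary:
    name: str
    description: str
    path: Path  # where the full instructions live; not read until needed


def read_front_matter(skill_file: Path) -> dict[str, str]:
    """Parse simple `key: value` lines between the leading '---' fences."""
    fields: dict[str, str] = {}
    with skill_file.open(encoding="utf-8") as handle:
        if handle.readline().strip() != "---":
            return fields
        for line in handle:
            if line.strip() == "---":
                break  # stop before the body
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


def index_skills(skills_dir: Path) -> list[SkillSummary]:
    """Build the small always-in-context index: one name and description per skill."""
    summaries = []
    for skill_file in sorted(skills_dir.glob("*/SKILL.md")):
        meta = read_front_matter(skill_file)
        if "name" in meta and "description" in meta:
            summaries.append(SkillSummary(meta["name"], meta["description"], skill_file))
    return summaries


def load_skill(summary: SkillSummary) -> str:
    """Read the full SKILL.md only when the agent decides it needs this skill."""
    return summary.path.read_text(encoding="utf-8")
```

Only the output of `index_skills` needs to sit in context by default; `load_skill` is the on-demand step.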

The deeper reason all of this exists is a failure mode the community calls context rot, usually attributed to Andrej Karpathy: model recall and instruction-following degrade as the context window fills. The academic anchor is Liu et al.'s "Lost in the Middle" (2023), which showed models systematically under-recall information placed mid-context.

Every harness technique above pushes work out of the context window and into the repository, CI, and runtime, where it can't rot.

Does harness engineering improve AI agent reliability, measurably?

Honest answer: nobody has published the number yet. OpenAI's post reports throughput (1M LOC, ~1,500 PRs, and a self-reported estimate of 1/10th the calendar time hand-writing would have taken) and contains no task-completion rate or success percentage. Any "X% reliability improvement" attributed to the post is fabricated.

The closest figure in the wild comes from Daniel Vaughan's "The Harness Effect" (April 19, 2026), which reports the same model scoring 16 points higher in a different tool. That's one practitioner's blog measurement, worth taking seriously as a directional signal and nothing more until a primary issuer reproduces it.

The absence of a published delta tells you what kind of artifact the OpenAI post is: a discipline-framing piece backed by process numbers. If your team wants the reliability evidence, you'll have to generate it on your own evals, and Anthropic's "Demystifying evals for AI agents" is the methodology reference for doing that credibly.

There's also a fair critique to absorb: most of the harness is rebranded software engineering. Sandboxing, CI, code review, and versioned design docs have been best practice for decades.

A defensible synthesis is roughly 80% rebrand plus a genuinely novel 20%. The novel part is the agent-loop surface: AGENTS.md as a map, MCP tool schemas, agent-readable observability, agent-to-agent review, and eval-gated merges. The rebranded 80% explains the skepticism.

The novel 20% justifies the name, and it's the part you probably haven't built yet.

A minimal production harness, in build order

For a team shipping its first production agent, the cross-vendor evidence supports this sequence. Each item depends on the one before it.

1. AGENTS.md (~100 lines) in the repo root, written for the agent, pointing to deeper versioned docs.
2. Typed tools as MCP servers: JSON-schema functions with names, descriptions, and return shapes, discoverable by any MCP client.
3. Sandbox every command: filesystem, network, and process spawn restricted by default.
4. Agent-readable observability: structured logs the agent can grep, OpenTelemetry traces, and an endpoint that returns a screenshot of current state.
5. A retry/verify loop: tests run until green; external actions return a "did it work" signal (return codes, screenshot diffs, eval scores).
6. A PR review gate: agent code lands on a branch and goes through the same review as human code. Agents don't self-correct on architecture or security.
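The typed-tools step in the list above can be sketched without any SDK. The example mirrors the name, description, and JSON Schema input shape that MCP tool definitions carry, but it is a plain-Python illustration: the `Tool` class, the `tail_lines` example tool, and the validator are hypothetical, and the validator covers only a small subset of JSON Schema. A real harness would use an MCP SDK and a full schema validator.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Only a small subset of JSON Schema types is checked in this sketch.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def _matches(value: Any, json_type: Any) -> bool:
    expected = JSON_TYPES.get(json_type)
    if expected is None:
        return True  # types outside the subset are not checked here
    if isinstance(value, bool):
        return json_type == "boolean"  # bool subclasses int, so rule it out first
    return isinstance(value, expected)


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    input_schema: dict[str, Any]
    handler: Callable[..., Any]

    def validate(self, arguments: dict[str, Any]) -> list[str]:
        properties = self.input_schema.get("properties", {})
        errors = [
            f"missing required argument: {key}"
            for key in self.input_schema.get("required", [])
            if key not in arguments
        ]
        for key, value in arguments.items():
            if key not in properties:
                errors.append(f"unknown argument: {key}")
            elif not _matches(value, properties[key].get("type")):
                errors.append(f"argument {key} should be {properties[key]['type']}")
        return errors

    def call(self, arguments: dict[str, Any]) -> dict[str, Any]:
        """Reject bad input with a structured error instead of raising mid-run."""
        errors = self.validate(arguments)
        if errors:
            return {"ok": False, "errors": errors}
        return {"ok": True, "result": self.handler(**arguments)}


def _tail(text: str, limit: int) -> list[str]:
    return text.splitlines()[-limit:] if limit > 0 else []


# Hypothetical example tool: returns the last `limit` lines of `text`.
tail_lines = Tool(
    name="tail_lines",
    description="Return the last `limit` lines of `text`.",
    input_schema={
        "type": "object",
        "properties": {"text": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["text", "limit"],
    },
    handler=_tail,
)
```

Returning a structured error gives the agent something it can read and correct on the next attempt, which is the practical difference between a typed tool and a free-form shell call.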

Items 7 through 10 separate a demo from a system that survives six months: an automated eval gate before merge, file-based memory with per-session context resets (per Anthropic's long-running harness design), a branch-based rollback plan, and PR review for the harness itself, because the agent's instructions are code.
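The automated eval gate mentioned above can be as small as a pass-rate comparison against a stored baseline. The sketch below is one hypothetical shape for it: `EvalResult`, `eval_gate`, and the 2-point regression tolerance are assumptions chosen for illustration, not a published methodology.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvalResult:
    task_id: str
    passed: bool


def eval_gate(
    results: Sequence[EvalResult],
    baseline_pass_rate: float,
    max_regression: float = 0.02,  # assumed tolerance; tune against your own eval noise
) -> tuple[bool, str]:
    """Decide whether a change may merge, based on eval pass rate versus baseline."""
    if not results:
        return False, "no eval results: refusing to merge blind"

    pass_rate = sum(r.passed for r in results) / len(results)
    floor = baseline_pass_rate - max_regression
    if pass_rate < floor:
        return False, f"pass rate {pass_rate:.2%} is below floor {floor:.2%}"
    return True, f"pass rate {pass_rate:.2%} meets floor {floor:.2%}"
```

The same gate should run on changes to AGENTS.md and tool schemas, since the paragraph above treats the agent's instructions as code.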

What this means for you

If you're running coding agents today, your highest-ROI work this quarter is probably items 1 through 6 above, in that order, rather than waiting for the next model release. The same model in a better harness behaves like a better model; Vaughan's 16-point report is the early evidence, and OpenAI's million-line experiment is the existence proof.

Start by shrinking your agent instructions, since a bloated CLAUDE.md or AGENTS.md actively fights you via context rot. Then make your tools typed and your runtime legible. Measure on your own evals, because nobody else's reliability numbers exist yet.

As Simon Willison put it in his Codex analysis, the interesting question has shifted from what the model can do to what the system around it lets it do reliably.
