The most rigorous study of AI coding assistants to date found that experienced developers were 19% slower with the tool than without it.
That number comes from a METR randomized controlled trial published in July 2025, run with 16 seasoned open-source developers working in their own long-lived repos. The same developers believed they'd been sped up. They were measurably wrong.
So why write a power-user playbook for OpenAI Codex at all? Because the same body of evidence shows the gains are real on a specific, bounded class of work, and the engineers who compound output are the ones who configure Codex to operate only on that class.
The leverage isn't the model. It's the setup.
This is a practitioner's guide to that setup: AGENTS.md, the Codex CLI, sandbox profiles, cloud delegation, and code review, with the honest numbers attached.
TL;DR
OpenAI Codex in mid-2026 is one system with three execution contexts: the local Codex CLI, the container-per-task Codex Cloud, and the IDE extensions. They share one model line and one config grammar (~/.codex/config.toml, AGENTS.md, and prompt directives). Power users get outsized results by writing AGENTS.md as imperative agent state, pinning a profile per task class, delegating dependency-heavy bootstrap to cloud tasks, and gating every Codex pull request through the team's existing CI. The productivity payoff is narrow but real: a throughput gain of roughly 20 to 40% on the task class it fits, and near zero elsewhere.
Key takeaways
- Treat the CLI, Cloud, and IDE as three contexts of one configurable system, not three tools.
- Write AGENTS.md as imperatives and exact commands. Prose like "our project uses React" is ignored.
- Pin a profile per task class (review, refactor, cloud) to kill ad-hoc flag strings.
- Offload install/build/test bootstrap to a cloud task; iterate locally with the CLI.
- Codex PRs have higher variance than human PRs. Run the same CI on them, and read the agent's transcript, not just the diff.
What is OpenAI Codex in 2026, exactly?
OpenAI Codex is a family of three coordinated coding-agent products that share one model lineage and one configuration grammar. The Codex CLI is a Rust binary you run locally. Codex Cloud is a hosted, container-per-task agent that opens pull requests. The IDE extensions cover VS Code, Cursor, JetBrains, and Neovim.
They descend from the 2021 Codex completion model OpenAI retired in 2023, but the architecture is entirely different now. As of mid-2026 the current generation runs on GPT-5.5, OpenAI's first fully retrained base model since GPT-4.5, exposed through the Codex-specific gpt-5.5-codex line on the API (distinct from the generic GPT-5.5 chat models). OpenAI has been shipping a new Codex model every few weeks, so treat the exact version as a moving target. The workflow below is the part that lasts.
The single mental model that matters: configure all three surfaces through one layered spec. A global ~/.codex/config.toml, an AGENTS.md in the repo, and prompt-level directives in the request itself.
AGENTS.md is agent state, not documentation
AGENTS.md is Codex's persistent memory, and it's the highest-leverage file in the whole system. Per OpenAI's docs, the runtime walks from the repo root down to the current working directory, collects every AGENTS.md it finds along the way, and concatenates them.
The nearest file is appended last, so a tighter AGENTS.md in services/payments/ overrides the org-wide rules at the repo root without duplicating them.
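The layering can be sketched in a few lines of Python. This is an illustration of the merge order only, not Codex's actual implementation: the function name `collect_agents_md` and the blank-line join are assumptions made for the example.

```python
from pathlib import Path


def collect_agents_md(repo_root: Path, target_dir: Path) -> str:
    """Concatenate AGENTS.md files from repo_root down to target_dir.

    Files nearer to target_dir are appended later, so their rules
    come last in the combined text. (Hypothetical helper, for illustration.)
    """
    repo_root = repo_root.resolve()
    target_dir = target_dir.resolve()

    # Build the directory chain, repo root first, target directory last.
    # relative_to raises ValueError if target_dir is outside repo_root.
    chain = [repo_root]
    for part in target_dir.relative_to(repo_root).parts:
        chain.append(chain[-1] / part)

    sections = []
    for directory in chain:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            sections.append(candidate.read_text(encoding="utf-8").strip())
    return "\n\n".join(sections)
```

Calling `collect_agents_md(Path("."), Path("services/payments"))` would return the root rules followed by the payments rules, which is why the tighter file gets the last word.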
The most common mistake is writing it like a README. Narrative lines get ignored. Imperatives and exact commands get followed.
Compare these two:
# Ignored
Our project uses pnpm and we care about test coverage.
# Followed
- Run `pnpm install --frozen-lockfile` then `pnpm test` before declaring a task done.
- Never disable a lint rule inline. Add an exception in `.eslintrc.cjs` and justify it in the PR.
- Prefer named exports. Currency is stored as `bigint` minor units, never `number`.
Power users version AGENTS.md in git and review it like source. Several open-source projects now run a CI check that fails if AGENTS.md references a script that no longer exists. That one guard prevents the most common drift failure, where the agent dutifully runs a test command the team retired three sprints ago.
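A drift guard of that kind can be small. The sketch below is one hypothetical way to write it, assuming a pnpm project whose AGENTS.md quotes commands in backticks; the regex, the `PNPM_BUILTINS` set, and the function name are illustrative choices, not taken from any of the projects mentioned.

```python
import json
import re

# pnpm subcommands that are not package.json scripts, so they are skipped.
# (Deliberately partial list; extend it for your own repo.)
PNPM_BUILTINS = {"install", "add", "remove", "update", "exec", "dlx", "run"}

# Matches a backtick-quoted command that starts with `pnpm <name>` or `pnpm run <name>`.
SCRIPT_REF = re.compile(r"`pnpm (?:run )?([A-Za-z0-9:_-]+)")


def missing_scripts(agents_md: str, package_json: str) -> list[str]:
    """Return script names that AGENTS.md references but package.json lacks."""
    defined = set(json.loads(package_json).get("scripts", {}))
    missing = []
    for name in SCRIPT_REF.findall(agents_md):
        if name in PNPM_BUILTINS or name in defined:
            continue
        if name not in missing:  # report each stale script once
            missing.append(name)
    return missing
```

In CI, read AGENTS.md and package.json from the checkout, call `missing_scripts`, and exit non-zero when the list is not empty. The pattern only catches commands that begin with pnpm, so chained commands such as `cd app && pnpm test` would need a broader match.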
Codex also reads ~/.codex/AGENTS.md as a global default, and it supports an agent_skills mechanism that lets a repo advertise reusable named workflows, such as a security-audit skill.
How do sandbox modes and approvals work in the Codex CLI?
Codex runs every tool call inside an OS-level sandbox, and the sandbox is a separate axis from approvals. Getting both right is what makes autonomous coding safe enough to leave running.
| Sandbox flag | What the agent can do |
|---|---|
| --sandbox read-only | Read files, run network-less commands. No writes. |
| --sandbox workspace-write | Read and write inside the working tree. Default for TUI sessions. |
| --sandbox danger-full-access | No sandbox. Full filesystem and network. |
On macOS the sandbox is built on Seatbelt (the same layer Safari uses); on Linux it's Landlock plus seccomp, which only works on kernels 5.13 and newer. That kernel floor bites older CI runners, so check it before you wonder why a job ran unsandboxed.
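A pre-flight check for that kernel floor might look like the following. It is a minimal sketch: the function name is made up for the example, and it compares version numbers only. A new enough kernel can still have Landlock disabled in its build or boot configuration, so treat a pass as necessary, not sufficient.

```python
import re

LANDLOCK_MIN_KERNEL = (5, 13)  # Landlock was merged in Linux 5.13


def kernel_supports_landlock(release: str) -> bool:
    """Return True if a release string like '5.15.0-91-generic' is at least 5.13."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles both major and minor in one step.
    return (major, minor) >= LANDLOCK_MIN_KERNEL
```

On a runner, pass it the output of `platform.release()` (or `uname -r`) and fail the job early when it returns False.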
Approvals are the second axis, set with --ask-for-approval: untrusted (approve every command), on-failure (run freely, pause only on failure), or on-request (the agent asks when it judges a command risky).
The power-user move is to layer the two by context. For a greenfield refactor on a clean feature branch, run --sandbox workspace-write --ask-for-approval on-request. For an interactive session on main, tighten to --sandbox read-only --ask-for-approval untrusted so nothing lands without an explicit yes.
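If you launch Codex from a wrapper script, that layering rule is easy to encode. The helper below is a hypothetical sketch: `codex_flags` and the set of protected branch names are assumptions, while the flag values are the ones described above.

```python
def codex_flags(
    branch: str,
    protected: frozenset[str] = frozenset({"main", "master"}),
) -> list[str]:
    """Pick sandbox and approval flags from the current branch name."""
    if branch in protected:
        # Protected branch: no writes, and every command needs approval.
        return ["--sandbox", "read-only", "--ask-for-approval", "untrusted"]
    # Feature branch: writes stay inside the working tree, agent asks when unsure.
    return ["--sandbox", "workspace-write", "--ask-for-approval", "on-request"]
```

A wrapper would read the branch with `git rev-parse --abbrev-ref HEAD` and append the returned list to the `codex` command line.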
Profiles: the one change that pays for itself
Typing flag strings by hand is where time leaks. Profiles fix it. Define named profiles in config.toml, then dispatch with codex --profile review.
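# ~/.codex/config.toml
# Read-only review: the agent can inspect the tree but not write to it.
[profiles.review]
model = "gpt-5.5-codex"
sandbox_mode = "read-only"
approval_policy = "on-request"

# Cloud-style runs: writes allowed inside the working tree, pause only on failure.
[profiles.cloud]
model = "gpt-5-codex"
sandbox_mode = "workspace-write"
approval_policy = "on-failure"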
Check config.toml into the repo so every contributor runs the same sandbox and approval defaults. A review profile and a refactor profile, both committed, eliminate the ad-hoc decision and make it cheap to send the right kind of help to the right context.
Across the practitioner write-ups, this is the change that most reduces wasted time.
Codex also speaks the Model Context Protocol, so one mcp_servers block wires in tools from Jira, Sentry, or an internal database. One caveat worth knowing: stdio MCP subprocesses are only reaped when Codex exits, which matters if a CI job spawns many short-lived subagents.
Codex Cloud: delegate the bootstrap, keep the iteration
Codex Cloud reached general availability on 6 October 2025. Each task runs in an isolated Ubuntu 24.04 container with the repo checked out and a per-task worktree, and the environment is torn down on completion unless you pin a snapshot.
That ephemerality is the point. A cloud task can install dependencies, run the full suite, and open a draft pull request without touching your laptop.
Environment snapshots are the biggest cost-and-time lever here. A snapshot freezes the environment after dependency install, so later tasks skip the install and compress a 6 to 10 minute bootstrap into seconds.
Key the snapshot to your lockfile hash so it rebuilds only when pnpm-lock.yaml changes. OpenAI's own docs warn that forgotten long-lived snapshots are a top source of cost overruns, and recommend a weekly review.
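Deriving a key from the lockfile is a generic cache-key computation. How Codex Cloud names or keys snapshots is not specified here, so the `snapshot_key` function, the `deps` prefix, and the 16-character truncation below are all assumptions for illustration.

```python
import hashlib
from pathlib import Path


def snapshot_key(lockfile: Path, prefix: str = "deps") -> str:
    """Return a key that changes only when the lockfile's bytes change."""
    digest = hashlib.sha256(lockfile.read_bytes()).hexdigest()
    # A truncated digest keeps the key readable; 16 hex chars is an arbitrary choice.
    return f"{prefix}-{digest[:16]}"
```

Compare `snapshot_key(Path("pnpm-lock.yaml"))` with the key stored alongside the current snapshot, and rebuild only when the two differ.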
The split that practitioners report works best: cloud tasks for the dependency-heavy bootstrap, CLI for the local iteration. The bootstrap step sees a reported 2 to 4x speed-up. The local iteration step shows no measurable change.
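It helps to work out what a bootstrap-only speed-up means end to end. The function below applies Amdahl's law; the function name is illustrative, and the share of a task spent in bootstrap is a number you have to measure for your own repo.

```python
def end_to_end_speedup(bootstrap_fraction: float, bootstrap_speedup: float) -> float:
    """Overall speed-up when only the bootstrap share of a task gets faster."""
    if not 0.0 <= bootstrap_fraction <= 1.0:
        raise ValueError("bootstrap_fraction must be between 0 and 1")
    if bootstrap_speedup <= 0:
        raise ValueError("bootstrap_speedup must be positive")
    # Time remaining = untouched local iteration + accelerated bootstrap.
    remaining = (1.0 - bootstrap_fraction) + bootstrap_fraction / bootstrap_speedup
    return 1.0 / remaining
```

With a hypothetical 30% of task time in bootstrap, a 3x bootstrap speed-up gives 1 / (0.7 + 0.1) = 1.25x overall. The larger the bootstrap share, the more the cloud split pays off.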
When a cloud task finishes it pushes a branch and opens a pull request that stays in draft, not marked ready for review, so a human inspects the diff and transcript first. You can fire tasks from the ChatGPT sidebar, the chatgpt.com/codex board, or by adding @Codex to a Slack thread.
Code review with Codex, in CI and on GitHub
Code review is where Codex earns trust fastest, because the output is bounded and easy to verify. The CLI ships a dedicated codex review subcommand that reads a diff against the merge base and emits a structured review. The same surface is GA on GitHub as a @codex review PR mention.
For repeatable automation, the openai/codex-action runs the CLI in headless codex exec mode inside a workflow:
- uses: openai/codex-action@v1
  with:
    profile: review
    codex_args: 'codex review --base ${{ github.base_ref }}'
Keep the CI invocation narrow enough that a human verifies the result in under five minutes: autofix lint, generate a migration, add tests for a labelled issue. Codex in CI is a pre-step that produces a candidate diff. The team's existing CI is still the gate.
One discipline holds the whole thing together. A Codex PR has higher variance than a human PR, so read the agent's transcript, not just the diff. The transcript is where the "passed pytest but skipped a migration" pattern surfaces.
Treat @codex review as a second opinion that combines with a human, not as the primary reviewer. Ramp's engineering team reports the combined Codex-plus-human review is often faster than human-only.
What the productivity numbers actually say
Marketing promises multipliers. The empirical record is more disciplined, and knowing it keeps you pointed at the work where Codex pays off.
The METR follow-up in February 2026 softened the original to roughly a 4% slowdown and flagged a selection effect in the first study, then announced a redesign. DORA's 2025 report (N≈5,000) found AI correlated with higher throughput for the first time while stability moved the other way, framed as "AI amplifies what's already there." Microsoft Research's Copilot RCT found developers finished 55% faster, but on a single tightly scoped task, not end-to-end delivery.
Meanwhile Stack Overflow's 2025 survey (49,000+ respondents) put AI adoption at 84% but distrust at 46%, up from 31% a year earlier, with 66% citing "almost right" output as a source of rework.
The honest synthesis: positive on well-scoped, verifiable tasks, near zero on open-ended work in a live codebase. Your engineering practices remain the dominant factor. Codex amplifies them in both directions.
How Codex compares to the alternatives
No tool wins on every axis. Codex's edge in mid-2026 is the cloud-task lifecycle and the GitHub-shaped workflow, and OpenAI's June 2026 acquisition of cloud platform Ona signals it's doubling down there.
| Tool | Where it wins | Where it trails |
|---|---|---|
| Codex CLI / Cloud | Container-per-task runs, first-class AGENTS.md, MCP-native, JetBrains first-party | Subagent cost control still maturing |
| Claude Code | Long-horizon refactors, very large context | Less mature cloud story |
| Cursor | Inline edits, tightest in-editor loop | Cloud-task PR workflow isn't the design center |
| GitHub Copilot | First-party GitHub, AGENTS.md support since Aug 2025, GPT-5.5-Codex | Sandbox/approval model less explicit |
| Aider | Diff hygiene, git-native, cheap tokens | No cloud agent |
Codex is the strongest choice when the workload is task-shaped (one bounded job, PR-shaped output) and the workflow is GitHub-shaped. For inline editing, Cursor still leads. For very long refactors, Claude Code is the most-cited alternative.
What this means for you
Start with three commits to your repo this week.
- Write a short, imperative AGENTS.md with exact build and test commands, and add a CI check that fails when it references a missing script.
- Check in two profiles, review and cloud, so nobody types raw flag strings again.
- Wire one narrow codex-action job, lint autofix or codex review, gated by your real CI.
Then aim Codex at the work the evidence supports: bounded, verifiable, PR-shaped tasks. Delegate the bootstrap to a cloud container, iterate locally, and read every transcript before you merge. The 10x isn't a model capability you enable. It's the compounding of a few configuration choices applied with discipline.
Sources
- Measuring the Impact of Early-2025 AI on Experienced Developers (METR)
- METR study preprint (arXiv)
- Codex CLI, OpenAI Developers
- Custom instructions with AGENTS.md, OpenAI Developers
- Sandbox, OpenAI Developers
- Codex Web / Cloud, OpenAI Developers
- Cloud environments, OpenAI Developers
- Code review in GitHub, OpenAI Developers
- How Ramp engineers accelerate code review with Codex
- Announcing the 2025 DORA Report (Google Cloud)
- Quantifying GitHub Copilot's impact (GitHub)
- Stack Overflow 2025 survey, via McKinsey interview
- Introducing GPT-5.5 (OpenAI)
- OpenAI to Acquire Cloud Platform Ona (Bloomberg)
