2026-06-11 · skills

Chapter 7: The Codex side

Rebuild the same PR reviewer in OpenAI Codex and pull out the vendor-neutral lessons that transfer between agents.

Xiaoman · The Far Shore

Xiaoman has always assumed its way is the only way. Today it meets a distant cousin.

Draft chapter. First cut to prove the format; it will be hardened before it is indexed.

What you’ll build

You will rebuild the PR reviewer in OpenAI Codex and run it against the same diff you used in earlier chapters. The point is not to pick a winner. It is to separate what is portable (the concepts) from what is vendor-specific (the config), so that switching tools later costs you a day of relearning syntax instead of rewriting your whole design. By the end you will have three things: the reviewer running on Codex’s own config, a side-by-side comparison table, and a short list of the design pieces that didn’t change when you moved.

Why is this chapter in the middle of the course, before evals and observability? Because getting locked into one vendor is a problem in itself. If your reviewer only works because of one provider’s exact prompt syntax, your judgment is now tied to that vendor’s config. Doing the same job twice, on two tools, forces you to see which parts are the idea and which are just implementation detail.

Prerequisites

  • Chapters 4 to 6 finished: a working PR reviewer in your Claude-based agent, with a defined scope, least-privilege tools, and (optionally) the orchestration from Chapter 6.
  • Access to OpenAI Codex and its docs. Codex ships as a CLI (@openai/codex), an IDE extension, a cloud/web mode, and integrations. See the official docs.
  • The same PR or diff you reviewed earlier, so the comparison is fair. Comparing on different inputs tells you nothing.

Steps

1. Restate the reviewer’s intent, tool-free

Write the review job in plain language, with no reference to any vendor. This statement is the portable core: if it mentions a specific config key or prompt-block name, it is not portable yet. Cut it down to goal, checks, inputs, and output shape.

INTENT (vendor-neutral):
  GOAL:   Review a PR diff and report defects a senior engineer would block on.
  CHECKS: security (injection, authz, secrets),
          performance (N+1, blocking I/O on hot paths),
          tests (untested branches, deleted assertions).
  INPUT:  unified diff + changed-file contents (read-only).
  OUTPUT: JSON list {file, line, severity, finding, suggested_fix}.
  LIMITS: read-only on the repo; no network; no writing files.
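One way to keep the statement honest is to lint it for vendor vocabulary. The sketch below is an illustration under assumptions, not part of either tool: `VENDOR_TERMS` and `vendor_terms_in` are hypothetical names, and the deny-list is seeded only with terms that appear in this chapter.

```python
import re

# Hypothetical deny-list: vendor-specific names used in this chapter.
# Extend it with whatever config keys your own setup relies on.
VENDOR_TERMS = [
    "CLAUDE.md", "AGENTS.md", "settings.json", "config.toml",
    "approval_policy", "sandbox_mode", "codex exec",
]

INTENT = """\
GOAL:   Review a PR diff and report defects a senior engineer would block on.
CHECKS: security (injection, authz, secrets),
        performance (N+1, blocking I/O on hot paths),
        tests (untested branches, deleted assertions).
INPUT:  unified diff + changed-file contents (read-only).
OUTPUT: JSON list {file, line, severity, finding, suggested_fix}.
LIMITS: read-only on the repo; no network; no writing files.
"""


def vendor_terms_in(text: str) -> list[str]:
    """Return every deny-listed term found in text (case-insensitive)."""
    return [
        term for term in VENDOR_TERMS
        if re.search(re.escape(term), text, re.IGNORECASE)
    ]


if __name__ == "__main__":
    leaks = vendor_terms_in(INTENT)
    print("portable" if not leaks else f"not portable yet: {leaks}")
```

An empty result does not prove the statement is portable; it only catches the obvious leaks, so the deny-list is worth growing as you meet new vendor terms.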

2. Map each concept to its Codex equivalent

For every piece of your Claude-based design, find the Codex counterpart in its docs, and note which ones line up and which don’t. The two tools have more in common than not: both have project instructions, a permission/approval model, a sandbox, MCP for external tools, and a non-interactive headless mode. The names and file formats differ, and so do some of the defaults, so check what each setting does as well as what it is called.

CONCEPT                  Claude Code              Codex
project instructions     CLAUDE.md                AGENTS.md
packaged capability      Skill (SKILL.md +        custom prompt / "skill"-style
                         scripts)                 reusable instruction + scripts
config surface           settings.json            config.toml (model, approvals,
                                                   sandbox, mcp_servers, profiles)
permission model         allow/ask tool rules     approval_policy + sandbox_mode
external tools           MCP servers              MCP servers
headless / CI run        SDK / -p one-shot        codex exec (non-interactive)
sub-tasks                subagents                subagents

Note: treat the exact Codex config keys as illustrative and confirm names and values against the official docs, since they change between releases.

3. Implement the reviewer in Codex, natively

Express the intent using Codex’s own files, not a word-for-word copy of your Claude config. Put the long-lived instructions in AGENTS.md at the repo root, set the model and the safety settings in config.toml, and keep the reviewer read-only by picking a non-writing sandbox and an approval policy that never auto-edits. Here’s the trap: copying your CLAUDE.md verbatim into AGENTS.md. They do the same job, but each one’s surrounding defaults differ, so port the intent and let Codex’s own conventions fill in the rest.

# config.toml  (illustrative; verify key names in the official Codex docs)
model = "<a-codex-supported-model>"
approval_policy = "on-request"   # never auto-apply edits during a review
sandbox_mode = "read-only"       # the reviewer reads the diff, writes nothing

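# Optional MCP server for PR metadata. It talks to the GitHub API, so leave it
# out if you want to honor the intent's "no network" limit strictly.
[mcp_servers.github]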
command = "npx"
args = ["-y", "@modelcontextprotocol/server-github"]

<!-- AGENTS.md  (project instructions Codex reads automatically) -->
# PR reviewer
When asked to review a diff, act as a senior reviewer.
Check security, performance, and test coverage only.
Output a JSON list of {file, line, severity, finding, suggested_fix}.
Do not modify files. Do not access the network.

4. Run the same diff, headless

Use Codex’s non-interactive mode so the run is scriptable and repeatable, just like you would in CI. Feed it the identical PR and save the JSON output. Keeping the invocation headless and the output structured is what makes the next chapter’s evals possible: a review you can’t save as data, you can’t score.

# illustrative; confirm the exact subcommand and flags in the official docs
codex exec "Review the diff in pr-1234.diff against the rules in AGENTS.md. \
  Output only the JSON list." > codex-review.json
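A saved review is only useful as data if it parses and every finding has the agreed fields. The sketch below validates the output shape from the step-1 intent; `parse_findings` and the sample finding (file name, line, and text) are invented for illustration, and in practice you would pass it the contents of codex-review.json.

```python
import json

# The five fields named in the vendor-neutral OUTPUT line of the intent.
REQUIRED_KEYS = {"file", "line", "severity", "finding", "suggested_fix"}

# Invented sample; replace with the text of your saved review file.
SAMPLE = """
[{"file": "app/db.py", "line": 12, "severity": "high",
  "finding": "SQL injection via string concatenation",
  "suggested_fix": "use bound parameters"}]
"""


def parse_findings(raw: str) -> list[dict]:
    """Parse review output and reject anything that is not the agreed shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError if prose leaked in
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of findings")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"finding {index} is not a JSON object")
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"finding {index} is missing {sorted(missing)}")
    return data


if __name__ == "__main__":
    print(len(parse_findings(SAMPLE)), "finding(s) in the expected shape")
```

Failing loudly here is a design choice: a review that silently drops a field would make the two tools look more different, or more alike, than they are.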

5. Diff the two reviews

Put the Codex findings next to your Claude findings on the same PR. Compare four things: coverage (did both hit every check your intent named?), agreement (same lines flagged?), tone, and effort. The disagreements are the interesting part: they usually come from prompt phrasing or a difference in defaults (for example, one tool’s sandbox let it read a file the other couldn’t) rather than one model being smarter.

                       Claude Code           Codex
flagged SQL injection  yes (line 12)         yes (line 12)
flagged the N+1        yes                   yes
flagged missing test   yes                   partial (named file, not branch)
tone                   terse, prescriptive   more explanatory
divergence root cause  prompt phrasing, not model capability
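The agreement column can be computed rather than eyeballed. This is a minimal sketch that matches findings on (file, line); the two sample reviews are invented stand-ins for the two saved JSON files, and `flagged` and `compare` are hypothetical helper names.

```python
# Invented sample findings; in practice load them from the two saved reviews.
CLAUDE_REVIEW = [
    {"file": "app/db.py", "line": 12, "finding": "SQL injection"},
    {"file": "app/views.py", "line": 40, "finding": "N+1 query"},
    {"file": "tests/test_db.py", "line": 7, "finding": "deleted assertion"},
]
CODEX_REVIEW = [
    {"file": "app/db.py", "line": 12, "finding": "SQL injection"},
    {"file": "app/views.py", "line": 40, "finding": "N+1 query"},
]


def flagged(findings: list[dict]) -> set[tuple[str, int]]:
    """Reduce a review to the set of (file, line) locations it flagged."""
    return {(f["file"], f["line"]) for f in findings}


def compare(review_a: list[dict], review_b: list[dict]) -> dict[str, list]:
    """Split flagged locations into shared and one-sided groups."""
    a, b = flagged(review_a), flagged(review_b)
    return {
        "both": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


if __name__ == "__main__":
    for group, locations in compare(CLAUDE_REVIEW, CODEX_REVIEW).items():
        print(group, locations)
```

Exact line matching is strict on purpose. A finding that names the right file but a neighboring line shows up as one-sided, which is a prompt to look at it by hand rather than a verdict.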

Learned: seeing a second ecosystem
Xiaoman built the same reviewer again inside Codex and found the same primitives on both sides (instructions, a sandbox, approvals, MCP, a headless mode), just under different names and file formats, so it can now tell the idea apart from one vendor's config.

How to verify

  • Confirm the Codex reviewer covers every check your step-1 intent statement named. A missing check is a porting bug, not a model limitation.
  • Mark which concepts mapped one-to-one (instructions, MCP, headless run) and which needed rework (config format, approval semantics). The parts that didn’t line up are exactly the vendor-specific ones.
  • Confirm the portable core produced comparable reviews on both sides. If the two reviews diverge wildly, the cause is almost always in your prompt or your sandbox defaults, not the model. Track it down before you blame the vendor.
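The first check in the list above can be roughed out in code. This sketch assumes the shared PR is known to contain at least one defect per step-1 check (as the comparison table suggests), and it uses a crude, hypothetical keyword map to guess which check a finding belongs to, since the output schema has no category field.

```python
# Hypothetical keyword map from each step-1 check to words a finding might use.
# It is a heuristic only; adjust it to your reviewer's actual phrasing.
CHECK_KEYWORDS = {
    "security": ("injection", "authz", "secret"),
    "performance": ("n+1", "blocking"),
    "tests": ("test", "assertion"),
}

# Invented sample review that misses the test-coverage check.
SAMPLE_REVIEW = [
    {"finding": "SQL injection via string concatenation"},
    {"finding": "N+1 query inside the render loop"},
]


def uncovered_checks(findings: list[dict]) -> list[str]:
    """Return the step-1 checks that no finding appears to address."""
    text = " ".join(f["finding"].lower() for f in findings)
    return [
        check for check, words in CHECK_KEYWORDS.items()
        if not any(word in text for word in words)
    ]


if __name__ == "__main__":
    print("uncovered checks:", uncovered_checks(SAMPLE_REVIEW))
```

Run it on both saved reviews. A check that is uncovered on one side only is the porting bug the first bullet describes; a check uncovered on both sides points back at the intent statement or the diff.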

Learned: a gap is not always the model
You can now put both reviews side by side on the same PR, confirm the vendor-neutral intent core is covered on each, and see that when they diverge the cause usually sits in your prompt phrasing or sandbox defaults, not in one model being smarter.

Why it works

Both agents are built from the same primitives, because the job (read code, reason, emit findings under a permission boundary) is the same no matter who ships the runtime. Project instructions, a sandbox, an approval gate, MCP tools, and a headless mode are not Claude features or Codex features; they are agent features. Once you see your reviewer as “intent plus a permission boundary plus a tool surface,” the vendor’s file format is just one way of writing it down. That is why the intent statement in step 1 carried over without a single edit while the config file had to be rewritten: one is the idea, the other is the config.

Recap

What lasts are the concepts: a clear job statement, scoped instructions, least-privilege tools, a read-only sandbox, and a headless invocation. Those crossed from Claude Code to Codex with nothing more than a rename. The config (CLAUDE.md versus AGENTS.md, settings.json versus config.toml, allow-rules versus approval_policy) differs per vendor and is the cheap part to relearn. Build around the concepts and any agent runtime becomes a deployment target instead of a rewrite. This closes Part 2: you can now ship the same reviewer design across Claude Code and Codex. Part 3 makes it trustworthy, starting with evals.

Common pitfalls

  • Porting syntax instead of intent. Copying one tool’s config verbatim fails because the defaults around it differ. Port the concept from step 1, then express it natively.
  • Assuming features and defaults match. Sandbox behavior, approval semantics, and model support differ. Check each against the Codex docs, not memory, and treat any config keys in this chapter as illustrative.
  • Skipping the same-input test. Comparing on different diffs tells you nothing. Use one shared PR so divergences mean something.
  • Blaming the model when the defaults don’t match. Most cross-vendor divergence comes from prompt phrasing or sandbox permissions, not raw capability. Check those first.

Xiaoman meets Codex and finds its way is not the only one. An easy stop, but you notice something: it has started to compare, and to prefer. The Far Shore lights up.


But can you trust it? How would you know it is not bluffing you? Next: the Proving Grounds.

Sources

  1. OpenAI Codex documentation · official
  2. OpenAI Codex CLI (open source) · official