🪞 Uota学 · 🧠 阿头学

让两个 AI 互相吵架,比你自己 review 代码靠谱多了

用 Claude Code skill 把计划发给 Codex 对抗评审,3 轮迭代抓出 14 个问题——核心洞见不是"两个模型比一个好",而是"对抗张力"才是 AI 辅助开发的缺失环节。

2026-02-22
阅读简报
双语对照
完整翻译
原文
讨论归档

核心观点

  • 单模型盲区是结构性的,不是能力问题。同一个模型既规划又评审,不会跟自己争辩。这不是 Claude 不够聪明,而是"规划者=评审者"这个结构天然缺少对抗张力。这个观察比具体实现重要得多。
  • session resume 是这个方案的关键设计。Codex CLI 支持恢复会话,意味着第 2 轮评审时 Codex 记得第 1 轮说过什么,能验证修复是否到位而不是重新发现同一批问题。没有这个能力,迭代评审就退化成重复评审。
  • VERDICT 协议解决了 agent 循环的停机问题。让 Codex 在每轮末尾输出 APPROVED/REVISE,给循环一个明确的终止信号。这个模式可以泛化——任何多 agent 协作都需要一个显式的"共识协议",否则要么无限循环要么过早终止。
  • 整个方案就是一个 Markdown 文件,零基础设施。没有外部服务、没有编排层,就是一个 SKILL.md 告诉 Claude 怎么调 Codex CLI。这种"用 prompt 编排 agent"的模式极其轻量,但也意味着可靠性完全依赖模型的指令遵循能力。
  • 适用场景有明确边界:高风险计划才值得。作者自己说了——简单 bug 修复不需要。这个判断很清醒。对抗评审的成本是 3 轮 API 调用 + 等待时间,只有涉及鉴权、数据模型、并发等"错了代价很大"的计划才值得投入。

跟我们的关联

🪞Uota:这个"对抗评审"模式直接可以借鉴到 Uota 的 skill 设计里。当前 Uota 的简报生成、翻译等都是单模型单 pass——如果在关键决策点(比如简报质量自检)引入第二个模型做对抗评审,质量上限会显著提高。具体动作:考虑在 reading-pipeline 的简报步骤加一个可选的 cross-review 环节。

🧠Neta:团队的 AI 辅助开发流程可以直接用这个 skill。20 人团队里 code review 是瓶颈——如果高风险 PR 先过一轮 AI 对抗评审再进人工 review,能大幅减少人工 review 的认知负担。

讨论引子

💭 如果对抗评审能抓出 14 个人工 review 漏掉的问题,那我们现在的 code review 流程里有多少"假安全感"?是不是应该把 AI 对抗评审作为 PR 合并的前置条件?

💭 这个模式的天花板在哪里——当两个模型都有相同的知识盲区时(比如都不了解你的业务上下文),对抗评审还能抓出什么?是不是需要一个"带业务上下文的评审 agent"?

我让 Claude 和 Codex 争辩,直到我的代码计划真正靠谱

Aseem Shrey


我让 Claude 和 Codex 争辩,直到我的代码计划真正靠谱

作者:Aseem Shrey,发表于 2026 年 2 月 20 日

AI · 开发者工具 · Claude Code · Codex

一个 Claude Code 技能:让 Claude 与 Codex 在迭代评审循环里对垒。3 轮,抓出 14 个问题,零人工投入。

TL;DR

一个 Claude Code 技能,会把你的实现计划发给 OpenAI Codex 评审。它们来回争论——Claude 修订,Codex 复审——直到 Codex 点头批准。通常 2–3 轮就会收敛。

获取 → GitHub Gist 上的 SKILL.md

安装:
  1. 安装 Codex CLI:npm install -g @openai/codex
  2. 把 SKILL.md 放到你项目里的 .claude/skills/codex-review/
  3. 当你有计划了,在 Claude Code 里运行 /codex-review

适用场景:涉及鉴权、数据模型、并发,或任何需要花好几天实现的计划。 不适用场景:简单 bug 修复、小改动,或当速度比彻底性更重要时。

结果:3 轮揪出 14 个问题(鉴权缺陷、shell 引号问题、schema 冲突、缺失并发处理等),把一份粗糙草案打磨成可上线的规格说明——我这边零人工评审。

我每天都在用 Claude Code。它规划功能、写代码、发 PR。但我一直遇到同一个问题:Claude 的计划是不错,可没人“挑战”它。没有第二意见。没有人去顶着它的架构盲区、遗漏的边界情况或安全缺口提出反驳。

所以我做了一个系统:让 Codex 来评审 Claude 的计划——并且双方来回交锋,直到 Codex 批准为止。

问题:单模型盲区

当你用同一个 AI 模型既规划又执行时,你会得到一份连贯却缺乏挑战的输出。模型不会跟自己争辩。它不会说“其实这个鉴权模型不完整”,也不会说“你的 shell 引号写炸了”。

我一次又一次地观察到这种模式:

  • 看起来很扎实的计划,却根本没有鉴权模型

  • Shell 脚本里有引号 bug,会悄无声息地产生错误 payload

  • Schema 设计里存在相互冲突的状态字段

  • 多智能体场景下完全没有并发处理

这些并不是 Claude 找不到的 bug——而是因为“规划者”和“评审者”是同一个实体。缺少对抗张力。

解决方案:/codex-review

我做了一个 Claude Code 技能——一个斜杠命令,用来触发 Claude 与 OpenAI 的 Codex CLI 之间的迭代评审回路。

工作原理

You: /codex-review

Round 1: Claude writes plan to temp file
         → sends to Codex (gpt-5.3-codex, read-only sandbox)
         → Codex reviews → VERDICT: REVISE (8 issues found)

Round 2: Claude revises plan addressing all feedback
         → resumes Codex session (preserves context)
         → Codex re-reviews → VERDICT: REVISE (6 items left)

Round 3: Claude revises again
         → resumes Codex session
         → Codex → VERDICT: APPROVED

关键洞见:Codex 运行在只读模式——它能读取你的代码库来获取上下文,但不能修改任何东西。而且因为我们在每一轮之间都恢复同一个 Codex 会话,它会记得自己之前说过什么,于是能核验那些问题到底有没有真的修掉。

前后对比

之前:单轮规划

我会让 Claude 规划一个功能。Claude 给出一个计划。我读一遍,可能会抓到一些点,然后就批准。但我不可能每次都发现 shell 脚本示例里的每个引号 bug,也不一定会注意到 schema 里同时存在 status 和 column 字段会导致漂移。

例子:我让 Claude 规划一个面向多智能体 swarm 的 “Mission Control” 仪表盘。初版计划里有:

  • Agent 写入端点没有任何鉴权

  • Shell 脚本里把 "$1" 放在单引号里(变量展开失效)

  • tasks 上同时有 status 和 column 字段(可能漂移)

  • 内嵌 comment 数组无限增长(写入争用)

  • 任务认领没有并发处理

  • 测试完全靠手工

这些都是真实问题,实施时会烧掉好几个小时。

之后:跨模型迭代评审

同一份计划,但用 /codex-review 跑一遍:

第 1 轮——Codex 在鉴权、schema 设计、工具链和测试上找出 8 个问题。它引用了具体行号,并给出可执行的修复方案。

第 2 轮——Claude 修订后,Codex 还剩 6 项:租约认领不是原子操作、ACL 过宽、密钥轮换描述不充分、状态模型不一致。

第 3 轮——全部解决。Codex 批准。最终计划包含:

  • 逐 agent 的 API key 鉴权,并在服务端推导身份

  • 用强类型 CLI 工具替代脆弱的 shell 脚本

  • 带乐观并发的原子租约认领

  • 明确的 ACL 权限矩阵

  • 带宽限期(grace period)的完整密钥轮换生命周期

  • 复合索引支撑的两阶段搜索策略

  • 全面的集成测试与安全测试

3 轮自动化。我的人工评审投入为零。计划从“演示级”变成了“可生产上线的规格说明”。

┌─────────────────────────────────────────────────────┐
│              Before vs After                        │
├──────────────────────┬──────────────────────────────┤
│  BEFORE              │  AFTER                       │
│  Single-pass plan    │  3-round iterative review    │
├──────────────────────┼──────────────────────────────┤
│  No auth model       │  Per-agent API keys + ACL    │
│  Broken shell scripts│  Typed CLI with retries      │
│  Conflicting schema  │  Single source of truth      │
│  No concurrency      │  Atomic claims + versioning  │
│  Manual testing only │  Integration + security tests│
│  0 issues caught     │  14 issues caught + fixed    │
└──────────────────────┴──────────────────────────────┘

设计决策

为什么用技能(skill),而不是 hook?

我不希望每个计划都被评审。大多数小改动不需要第二意见。技能(斜杠命令)是按需触发的——只有当风险足够高时我才会用它。

为什么要迭代,而不是一次性评审?

一次性评审能找出问题,但无法验证修复是否到位。迭代回路意味着:

  • Codex 找问题

  • Claude 修问题

  • Codex 验证修复确实有效

  • 重复直到干净为止

这能抓住那类“修了一个又引入另一个”的问题。

为什么要恢复会话(session resume)?

Codex CLI 支持 codex exec resume <session-id>,可以在保留完整上下文的情况下延续之前的对话。这意味着 Codex 在第 2 轮评审时会记得自己第 1 轮说过什么。它不会重新发现同一批问题——而是检查它们是否被解决了。

为什么要用 VERDICT 协议?

这个循环需要知道什么时候停。Codex 会在每次评审末尾用 VERDICT: APPROVED 或 VERDICT: REVISE 收尾。这样 Claude 就能明确地知道要继续修订,还是可以提交最终结果。出于安全考虑,最多 5 轮封顶。

并发安全

每次评审都会为临时文件生成一个 UUID(/tmp/claude-plan-<uuid>.md),并用显式 ID 跟踪 Codex 会话。多个 Claude Code 实例可以同时运行 /codex-review,而不会互相踩踏。

技术实现

整个东西就是一个 Markdown 文件:.claude/skills/codex-review/SKILL.md。没有外部服务,没有基础设施。它只是一些指令,告诉 Claude 如何:

  • 把计划写入临时文件

  • 用评审提示词运行 codex exec -m gpt-5.3-codex -s read-only

  • 解析 verdict

  • 如果是 REVISE:更新计划,运行 codex exec resume <session-id>

  • 循环直到 APPROVED 或达到最多 5 轮

  • 清理临时文件

这个技能复用现成能力——Claude Code 的工具执行,以及 Codex CLI 的会话管理。

我踩到的坑

  • codex exec resume 不支持 -o ——恢复时不能把输出重定向到文件。要改为捕获 stdout。

  • --last 并发不安全 ——恢复时要用显式 session ID,而不是 --last,否则并行评审会恢复到错误的会话。

  • Codex 需要明确的 verdict 指令 ——如果不给 VERDICT: APPROVED/REVISE 的明确要求,Codex 会给出细腻反馈,但不会给循环一个清晰停机信号。

何时使用

  • 规划涉及鉴权、数据模型或多服务协同的功能

  • 在批准一份需要花好几天实现的计划之前

  • 想在写代码前做安全评审时

  • 计划涉及并发、分布式系统或数据迁移时

不适用场景:

  • 简单 bug 修复或小改动

  • 你已经验证过方案的计划

  • 当速度比彻底性更重要时

下一步

我在考虑:

  • 按评审类型选择模型——安全评审用一个模型,架构评审用另一个

  • 评审模板——不同类型的计划用不同提示词(API 设计 vs. 前端 vs. 基础设施)

  • 评审历史——把评审结果持久化,方便回溯过往决策

但说实话,简化版就已经比我预期抓出更多问题了。两个前沿模型来回三轮,就能产出我真正敢信的计划。

这个技能就是一个文件,你可以丢进任何 Claude Code 项目里。去 GitHub Gist 把它拿下来,放到 .claude/skills/codex-review/SKILL.md,然后下次遇到一份值得被挑战的计划,就运行 /codex-review。

P.S.——是的,这篇博文也是用 Claude Code 写的。我让 AI 写了一篇关于让 AI 互相争辩的文章。各位,我们已经穿过镜子,来到了另一边。

嘿!我是 Aseem Shrey。

const about_me = {
    loves: ["CyberSec", "Creating Stuff"],
    currently_reading: "Dopamine Nation: Finding Balance in the Age of Indulgence",
    other_interests: [
        "Reading 📚",
        "IoT Projects 💡",
        "Running 🏃 (Aim to run a full marathon)",
        "Swimming 🏊‍♂️"
    ],
    online_presence: ["HackingSimplified", "AseemShrey"]
};

想聊什么都可以随时来找我。


相关笔记

I Made Claude and Codex Argue Until My Code Plan Was Actually Good

我让 Claude 和 Codex 争辩,直到我的代码计划真正靠谱

By Aseem Shrey on February 20, 2026

作者:Aseem Shrey,发表于 2026 年 2 月 20 日

A Claude Code skill that pits Claude against Codex in an iterative review loop. 3 rounds, 14 issues caught, zero manual effort.

一个 Claude Code 技能:让 Claude 与 Codex 在迭代评审循环里对垒。3 轮,抓出 14 个问题,零人工投入。

TL;DR

TL;DR

A Claude Code skill that sends your implementation plans to OpenAI Codex for review. They argue back and forth — Claude revises, Codex re-reviews — until Codex approves. Typically converges in 2-3 rounds.

一个 Claude Code 技能,会把你的实现计划发给 OpenAI Codex 评审。它们来回争论——Claude 修订,Codex 复审——直到 Codex 点头批准。通常 2–3 轮就会收敛。

Setup:
  1. Install Codex CLI: npm install -g @openai/codex
  2. Drop SKILL.md into .claude/skills/codex-review/ in your project
  3. Run /codex-review in Claude Code when you have a plan ready

安装:
  1. 安装 Codex CLI:npm install -g @openai/codex
  2. 把 SKILL.md 放到你项目里的 .claude/skills/codex-review/
  3. 当你有计划了,在 Claude Code 里运行 /codex-review

Use it for: Plans touching auth, data models, concurrency, or anything that will take days to implement. Skip it for: Simple bug fixes, small changes, or when speed matters more than thoroughness.

适用场景:涉及鉴权、数据模型、并发,或任何需要花好几天实现的计划。 不适用场景:简单 bug 修复、小改动,或当速度比彻底性更重要时。

Result: 3 rounds caught 14 issues (broken auth, shell quoting bugs, schema conflicts, missing concurrency handling) and turned a rough draft into a production-grade spec — zero manual review effort.

结果:3 轮揪出 14 个问题(鉴权缺陷、shell 引号问题、schema 冲突、缺失并发处理等),把一份粗糙草案打磨成可上线的规格说明——我这边零人工评审。

I use Claude Code daily. It plans features, writes code, and ships PRs. But I kept running into the same problem: Claude's plans were good, but not challenged. There was no second opinion. No one pushing back on architectural blind spots, missing edge cases, or security gaps.

我每天都在用 Claude Code。它规划功能、写代码、发 PR。但我一直遇到同一个问题:Claude 的计划是不错,可没人“挑战”它。没有第二意见。没有人去顶着它的架构盲区、遗漏的边界情况或安全缺口提出反驳。

So I built a system where Codex reviews Claude's plans — and they go back-and-forth until Codex approves.

所以我做了一个系统:让 Codex 来评审 Claude 的计划——并且双方来回交锋,直到 Codex 批准为止。

The Problem: Single-Model Blindness

问题:单模型盲区

When you use a single AI model to plan and execute, you get a coherent but unchallenged output. The model doesn't argue with itself. It won't say "actually, this auth model is incomplete" or "your shell quoting is broken."

当你用同一个 AI 模型既规划又执行时,你会得到一份连贯却缺乏挑战的输出。模型不会跟自己争辩。它不会说“其实这个鉴权模型不完整”,也不会说“你的 shell 引号写炸了”。

I noticed this pattern repeatedly:

我一次又一次地观察到这种模式:

  • Plans that looked solid but had no authentication model
  • 看起来很扎实的计划,却根本没有鉴权模型
  • Shell scripts with quoting bugs that would silently produce bad payloads
  • Shell 脚本里有引号 bug,会悄无声息地产生错误 payload
  • Schema designs with conflicting state fields
  • Schema 设计里存在相互冲突的状态字段
  • No concurrency handling for multi-agent scenarios
  • 多智能体场景下完全没有并发处理

These aren't bugs Claude can't find — it's that the planner and the reviewer are the same entity. There's no adversarial tension.

这些并不是 Claude 找不到的 bug——而是因为“规划者”和“评审者”是同一个实体。缺少对抗张力。

The Solution: /codex-review

解决方案:/codex-review

I built a Claude Code skill — a slash command that triggers an iterative review loop between Claude and OpenAI's Codex CLI.

我做了一个 Claude Code 技能——一个斜杠命令,用来触发 Claude 与 OpenAI 的 Codex CLI 之间的迭代评审回路。

How it works

工作原理

You: /codex-review

Round 1: Claude writes plan to temp file
         → sends to Codex (gpt-5.3-codex, read-only sandbox)
         → Codex reviews → VERDICT: REVISE (8 issues found)

Round 2: Claude revises plan addressing all feedback
         → resumes Codex session (preserves context)
         → Codex re-reviews → VERDICT: REVISE (6 items left)

Round 3: Claude revises again
         → resumes Codex session
         → Codex → VERDICT: APPROVED

The key insight: Codex runs in read-only mode — it can read your codebase for context but can't modify anything. And because we resume the Codex session between rounds, it remembers what it said before and can verify whether issues were actually fixed.

关键洞见:Codex 运行在只读模式——它能读取你的代码库来获取上下文,但不能修改任何东西。而且因为我们在每一轮之间都恢复同一个 Codex 会话,它会记得自己之前说过什么,于是能核验那些问题到底有没有真的修掉。

Before and After

前后对比

Before: Single-pass planning

之前:单轮规划

I'd ask Claude to plan a feature. Claude would produce a plan. I'd read it, maybe catch a few things, approve it. But I'm not going to catch every quoting bug in a shell script example or notice that a status and column field in a schema can drift.

我会让 Claude 规划一个功能。Claude 给出一个计划。我读一遍,可能会抓到一些点,然后就批准。但我不可能每次都发现 shell 脚本示例里的每个引号 bug,也不一定会注意到 schema 里同时存在 status 和 column 字段会导致漂移。

Example: I asked Claude to plan a "Mission Control" dashboard for a multi-agent swarm. The initial plan had:

例子:我让 Claude 规划一个面向多智能体 swarm 的 “Mission Control” 仪表盘。初版计划里有:

  • No authentication on agent write endpoints
  • Agent 写入端点没有任何鉴权
  • Shell scripts with "$1" inside single quotes (broken expansion)
  • Shell 脚本里把 "$1" 放在单引号里(变量展开失效)
  • Both status and column fields on tasks (can drift)
  • tasks 上同时有 status 和 column 字段(可能漂移)
  • Unbounded embedded comment arrays (write contention)
  • 内嵌 comment 数组无限增长(写入争用)
  • No concurrency handling for task claiming
  • 任务认领没有并发处理
  • Manual-only testing
  • 测试完全靠手工

All of these are real problems that would have burned hours during implementation.

这些都是真实问题,实施时会烧掉好几个小时。

After: Iterative cross-model review

之后:跨模型迭代评审

Same plan, but run through /codex-review:

同一份计划,但用 /codex-review 跑一遍:

Round 1 — Codex found 8 issues across auth, schema design, tooling, and testing. It cited specific line numbers and gave actionable fixes.

第 1 轮——Codex 在鉴权、schema 设计、工具链和测试上找出 8 个问题。它引用了具体行号,并给出可执行的修复方案。

Round 2 — After Claude revised, Codex found 6 remaining items: the lease claim wasn't atomic, the ACL was too broad, key rotation was underspecified, and the status model was inconsistent.

第 2 轮——Claude 修订后,Codex 还剩 6 项:租约认领不是原子操作、ACL 过宽、密钥轮换描述不充分、状态模型不一致。

Round 3 — All addressed. Codex approved. The final plan had:

第 3 轮——全部解决。Codex 批准。最终计划包含:

  • Per-agent API key auth with server-side identity derivation
  • 逐 agent 的 API key 鉴权,并在服务端推导身份
  • A typed CLI tool instead of fragile shell scripts
  • 用强类型 CLI 工具替代脆弱的 shell 脚本
  • Atomic lease claims with optimistic concurrency
  • 带乐观并发的原子租约认领
  • An explicit ACL permission matrix
  • 明确的 ACL 权限矩阵
  • Full key rotation lifecycle with grace periods
  • 带宽限期(grace period)的完整密钥轮换生命周期
  • Two-phase search strategy with compound indexes
  • 复合索引支撑的两阶段搜索策略
  • Comprehensive integration and security tests
  • 全面的集成测试与安全测试

3 automated rounds. Zero manual review effort on my part. The plan went from "demo quality" to "production-grade spec."

3 轮自动化。我的人工评审投入为零。计划从“演示级”变成了“可生产上线的规格说明”。

┌─────────────────────────────────────────────────────┐
│              Before vs After                        │
├──────────────────────┬──────────────────────────────┤
│  BEFORE              │  AFTER                       │
│  Single-pass plan    │  3-round iterative review    │
├──────────────────────┼──────────────────────────────┤
│  No auth model       │  Per-agent API keys + ACL    │
│  Broken shell scripts│  Typed CLI with retries      │
│  Conflicting schema  │  Single source of truth      │
│  No concurrency      │  Atomic claims + versioning  │
│  Manual testing only │  Integration + security tests│
│  0 issues caught     │  14 issues caught + fixed    │
└──────────────────────┴──────────────────────────────┘

Design Decisions

设计决策

Why a skill, not a hook?

为什么用技能(skill),而不是 hook?

I don't want every plan reviewed. Most small changes don't need a second opinion. A skill (slash command) is on-demand — I trigger it only when the stakes are high enough.

我不希望每个计划都被评审。大多数小改动不需要第二意见。技能(斜杠命令)是按需触发的——只有当风险足够高时我才会用它。

Why iterative, not one-shot?

为什么要迭代,而不是一次性评审?

A one-shot review finds problems but doesn't verify fixes. The iterative loop means:

一次性评审能找出问题,但无法验证修复是否到位。迭代回路意味着:

  • Codex finds issues
  • Codex 找问题
  • Claude fixes them
  • Claude 修问题
  • Codex verifies the fixes actually work
  • Codex 验证修复确实有效
  • Repeat until clean
  • 重复直到干净为止

This catches the "fixed one thing but broke another" class of problems.

这能抓住那类“修了一个又引入另一个”的问题。

Why session resume?

为什么要恢复会话(session resume)?

Codex CLI supports codex exec resume <session-id>, which continues a previous conversation with full context. This means Codex remembers what it said in Round 1 when reviewing Round 2. It doesn't re-discover the same issues — it checks whether they were addressed.

Codex CLI 支持 codex exec resume <session-id>,可以在保留完整上下文的情况下延续之前的对话。这意味着 Codex 在第 2 轮评审时会记得自己第 1 轮说过什么。它不会重新发现同一批问题——而是检查它们是否被解决了。

Why the VERDICT protocol?

为什么要用 VERDICT 协议?

The loop needs to know when to stop. Codex ends each review with either VERDICT: APPROVED or VERDICT: REVISE. This gives Claude a clear signal to either continue revising or present the final result. Max 5 rounds as a safety cap.

这个循环需要知道什么时候停。Codex 会在每次评审末尾用 VERDICT: APPROVED 或 VERDICT: REVISE 收尾。这样 Claude 就能明确地知道要继续修订,还是可以提交最终结果。出于安全考虑,最多 5 轮封顶。

Concurrency safety

并发安全

Each review generates a UUID for temp files (/tmp/claude-plan-<uuid>.md) and tracks the Codex session by its explicit ID. Multiple Claude Code instances can run /codex-review simultaneously without stepping on each other.

每次评审都会为临时文件生成一个 UUID(/tmp/claude-plan-<uuid>.md),并用显式 ID 跟踪 Codex 会话。多个 Claude Code 实例可以同时运行 /codex-review,而不会互相踩踏。

Technical Implementation

技术实现

The entire thing is a single Markdown file at .claude/skills/codex-review/SKILL.md. No external services, no infrastructure. It's instructions that tell Claude how to:

整个东西就是一个 Markdown 文件:.claude/skills/codex-review/SKILL.md。没有外部服务,没有基础设施。它只是一些指令,告诉 Claude 如何:

  • Write the plan to a temp file
  • 把计划写入临时文件
  • Run codex exec -m gpt-5.3-codex -s read-only with a review prompt
  • 用评审提示词运行 codex exec -m gpt-5.3-codex -s read-only
  • Parse the verdict
  • 解析 verdict
  • If REVISE: update the plan, run codex exec resume <session-id>
  • 如果是 REVISE:更新计划,运行 codex exec resume <session-id>
  • Loop until APPROVED or max 5 rounds
  • 循环直到 APPROVED 或达到最多 5 轮
  • Clean up temp files
  • 清理临时文件

The skill leverages what's already there — Claude Code's tool execution and Codex CLI's session management.

这个技能复用现成能力——Claude Code 的工具执行,以及 Codex CLI 的会话管理。

Gotchas I Found

我踩到的坑

  • codex exec resume doesn't support -o — You can't redirect output to a file on resume. Capture stdout instead.
  • codex exec resume 不支持 -o ——恢复时不能把输出重定向到文件。要改为捕获 stdout。
  • --last is not concurrency-safe — Always use the explicit session ID, not --last, to resume. Otherwise parallel reviews grab the wrong session.
  • --last 并发不安全 ——恢复时要用显式 session ID,而不是 --last,否则并行评审会恢复到错误的会话。
  • Codex needs explicit verdict instructions — Without the VERDICT: APPROVED/REVISE instruction, Codex gives nuanced feedback but no clear signal for the loop.
  • Codex 需要明确的 verdict 指令 ——如果不给 VERDICT: APPROVED/REVISE 的明确要求,Codex 会给出细腻反馈,但不会给循环一个清晰停机信号。

When to Use This

何时使用

  • Planning a feature that touches auth, data models, or multi-service coordination
  • 规划涉及鉴权、数据模型或多服务协同的功能
  • Before approving a plan that will take days to implement
  • 在批准一份需要花好几天实现的计划之前
  • When you want security review before writing code
  • 想在写代码前做安全评审时
  • When the plan involves concurrency, distributed systems, or data migrations
  • 计划涉及并发、分布式系统或数据迁移时

When NOT to use it:

不适用场景:

  • Simple bug fixes or small changes
  • 简单 bug 修复或小改动
  • Plans where you've already validated the approach
  • 你已经验证过方案的计划
  • When speed matters more than thoroughness
  • 当速度比彻底性更重要时

What's Next

下一步

I'm thinking about:

我在考虑:

  • Model choice per review type — security reviews with one model, architecture reviews with another
  • 按评审类型选择模型——安全评审用一个模型,架构评审用另一个
  • Review templates — different review prompts for different plan types (API design vs. frontend vs. infrastructure)
  • 评审模板——不同类型的计划用不同提示词(API 设计 vs. 前端 vs. 基础设施)
  • Review history — persisting review results so you can reference past decisions
  • 评审历史——把评审结果持久化,方便回溯过往决策

But honestly, the simple version already catches more issues than I expected. Three rounds of back-and-forth between two frontier models produces plans I actually trust.

但说实话,简化版就已经比我预期抓出更多问题了。两个前沿模型来回三轮,就能产出我真正敢信的计划。

The skill is a single file you can drop into any Claude Code project. Grab it from the GitHub Gist, drop it at .claude/skills/codex-review/SKILL.md, and run /codex-review next time you have a plan worth challenging.

这个技能就是一个文件,你可以丢进任何 Claude Code 项目里。去 GitHub Gist 把它拿下来,放到 .claude/skills/codex-review/SKILL.md,然后下次遇到一份值得被挑战的计划,就运行 /codex-review。

P.S. — Yes, this blog post was also written with Claude Code. I made an AI write about making AIs argue with each other. We're through the looking glass, people.

P.S.——是的,这篇博文也是用 Claude Code 写的。我让 AI 写了一篇关于让 AI 互相争辩的文章。各位,我们已经穿过镜子,来到了另一边。

Hey! I'm Aseem Shrey.

嘿!我是 Aseem Shrey。

const about_me = {
    loves: ["CyberSec", "Creating Stuff"],
    currently_reading: "Dopamine Nation: Finding Balance in the Age of Indulgence",
    other_interests: [
        "Reading 📚",
        "IoT Projects 💡",
        "Running 🏃 (Aim to run a full marathon)",
        "Swimming 🏊‍♂️"
    ],
    online_presence: ["HackingSimplified", "AseemShrey"]
};

Ping me up if you wanna talk about anything.

相关笔记

I Made Claude and Codex Argue Until My Code Plan Was Actually Good

  • Source: https://aseemshrey.in/blog/claude-codex-iterative-plan-review/
  • Published:
  • Saved: 2026-02-22

Content

Aseem Shrey


I Made Claude and Codex Argue Until My Code Plan Was Actually Good

By Aseem Shrey on February 20, 2026

AI · Developer Tools · Claude Code · Codex

A Claude Code skill that pits Claude against Codex in an iterative review loop. 3 rounds, 14 issues caught, zero manual effort.

TL;DR

A Claude Code skill that sends your implementation plans to OpenAI Codex for review. They argue back and forth — Claude revises, Codex re-reviews — until Codex approves. Typically converges in 2-3 rounds.

Get it → SKILL.md on GitHub Gist

Setup:
  1. Install Codex CLI: npm install -g @openai/codex
  2. Drop SKILL.md into .claude/skills/codex-review/ in your project
  3. Run /codex-review in Claude Code when you have a plan ready

Use it for: Plans touching auth, data models, concurrency, or anything that will take days to implement. Skip it for: Simple bug fixes, small changes, or when speed matters more than thoroughness.

Result: 3 rounds caught 14 issues (broken auth, shell quoting bugs, schema conflicts, missing concurrency handling) and turned a rough draft into a production-grade spec — zero manual review effort.

I use Claude Code daily. It plans features, writes code, and ships PRs. But I kept running into the same problem: Claude's plans were good, but not challenged. There was no second opinion. No one pushing back on architectural blind spots, missing edge cases, or security gaps.

So I built a system where Codex reviews Claude's plans — and they go back-and-forth until Codex approves.

The Problem: Single-Model Blindness

When you use a single AI model to plan and execute, you get a coherent but unchallenged output. The model doesn't argue with itself. It won't say "actually, this auth model is incomplete" or "your shell quoting is broken."

I noticed this pattern repeatedly:

  • Plans that looked solid but had no authentication model

  • Shell scripts with quoting bugs that would silently produce bad payloads

  • Schema designs with conflicting state fields

  • No concurrency handling for multi-agent scenarios

These aren't bugs Claude can't find — it's that the planner and the reviewer are the same entity. There's no adversarial tension.

The Solution: /codex-review

I built a Claude Code skill — a slash command that triggers an iterative review loop between Claude and OpenAI's Codex CLI.

How it works

You: /codex-review

Round 1: Claude writes plan to temp file
         → sends to Codex (gpt-5.3-codex, read-only sandbox)
         → Codex reviews → VERDICT: REVISE (8 issues found)

Round 2: Claude revises plan addressing all feedback
         → resumes Codex session (preserves context)
         → Codex re-reviews → VERDICT: REVISE (6 items left)

Round 3: Claude revises again
         → resumes Codex session
         → Codex → VERDICT: APPROVED

The key insight: Codex runs in read-only mode — it can read your codebase for context but can't modify anything. And because we resume the Codex session between rounds, it remembers what it said before and can verify whether issues were actually fixed.

Before and After

Before: Single-pass planning

I'd ask Claude to plan a feature. Claude would produce a plan. I'd read it, maybe catch a few things, approve it. But I'm not going to catch every quoting bug in a shell script example or notice that a status and column field in a schema can drift.

Example: I asked Claude to plan a "Mission Control" dashboard for a multi-agent swarm. The initial plan had:

  • No authentication on agent write endpoints

  • Shell scripts with "$1" inside single quotes (broken expansion)

  • Both status and column fields on tasks (can drift)

  • Unbounded embedded comment arrays (write contention)

  • No concurrency handling for task claiming

  • Manual-only testing

All of these are real problems that would have burned hours during implementation.

After: Iterative cross-model review

Same plan, but run through /codex-review:

Round 1 — Codex found 8 issues across auth, schema design, tooling, and testing. It cited specific line numbers and gave actionable fixes.

Round 2 — After Claude revised, Codex found 6 remaining items: the lease claim wasn't atomic, the ACL was too broad, key rotation was underspecified, and the status model was inconsistent.

Round 3 — All addressed. Codex approved. The final plan had:

  • Per-agent API key auth with server-side identity derivation

  • A typed CLI tool instead of fragile shell scripts

  • Atomic lease claims with optimistic concurrency

  • An explicit ACL permission matrix

  • Full key rotation lifecycle with grace periods

  • Two-phase search strategy with compound indexes

  • Comprehensive integration and security tests

3 automated rounds. Zero manual review effort on my part. The plan went from "demo quality" to "production-grade spec."

┌─────────────────────────────────────────────────────┐
│              Before vs After                        │
├──────────────────────┬──────────────────────────────┤
│  BEFORE              │  AFTER                       │
│  Single-pass plan    │  3-round iterative review    │
├──────────────────────┼──────────────────────────────┤
│  No auth model       │  Per-agent API keys + ACL    │
│  Broken shell scripts│  Typed CLI with retries      │
│  Conflicting schema  │  Single source of truth      │
│  No concurrency      │  Atomic claims + versioning  │
│  Manual testing only │  Integration + security tests│
│  0 issues caught     │  14 issues caught + fixed    │
└──────────────────────┴──────────────────────────────┘

Design Decisions

Why a skill, not a hook?

I don't want every plan reviewed. Most small changes don't need a second opinion. A skill (slash command) is on-demand — I trigger it only when the stakes are high enough.

Why iterative, not one-shot?

A one-shot review finds problems but doesn't verify fixes. The iterative loop means:

  • Codex finds issues

  • Claude fixes them

  • Codex verifies the fixes actually work

  • Repeat until clean

This catches the "fixed one thing but broke another" class of problems.

Why session resume?

Codex CLI supports codex exec resume <session-id>, which continues a previous conversation with full context. This means Codex remembers what it said in Round 1 when reviewing Round 2. It doesn't re-discover the same issues — it checks whether they were addressed.

Why the VERDICT protocol?

The loop needs to know when to stop. Codex ends each review with either VERDICT: APPROVED or VERDICT: REVISE. This gives Claude a clear signal to either continue revising or present the final result. Max 5 rounds as a safety cap.

Concurrency safety

Each review generates a UUID for temp files (/tmp/claude-plan-<uuid>.md) and tracks the Codex session by its explicit ID. Multiple Claude Code instances can run /codex-review simultaneously without stepping on each other.

Technical Implementation

The entire thing is a single Markdown file at .claude/skills/codex-review/SKILL.md. No external services, no infrastructure. It's instructions that tell Claude how to:

  • Write the plan to a temp file

  • Run codex exec -m gpt-5.3-codex -s read-only with a review prompt

  • Parse the verdict

  • If REVISE: update the plan, run codex exec resume <session-id>

  • Loop until APPROVED or max 5 rounds

  • Clean up temp files

The skill leverages what's already there — Claude Code's tool execution and Codex CLI's session management.

Gotchas I Found

  • codex exec resume doesn't support -o — You can't redirect output to a file on resume. Capture stdout instead.

  • --last is not concurrency-safe — Always use the explicit session ID, not --last, to resume. Otherwise parallel reviews grab the wrong session.

  • Codex needs explicit verdict instructions — Without the VERDICT: APPROVED/REVISE instruction, Codex gives nuanced feedback but no clear signal for the loop.

When to Use This

  • Planning a feature that touches auth, data models, or multi-service coordination

  • Before approving a plan that will take days to implement

  • When you want security review before writing code

  • When the plan involves concurrency, distributed systems, or data migrations

When NOT to use it:

  • Simple bug fixes or small changes

  • Plans where you've already validated the approach

  • When speed matters more than thoroughness

What's Next

I'm thinking about:

  • Model choice per review type — security reviews with one model, architecture reviews with another

  • Review templates — different review prompts for different plan types (API design vs. frontend vs. infrastructure)

  • Review history — persisting review results so you can reference past decisions

But honestly, the simple version already catches more issues than I expected. Three rounds of back-and-forth between two frontier models produces plans I actually trust.

The skill is a single file you can drop into any Claude Code project. Grab it from the GitHub Gist, drop it at .claude/skills/codex-review/SKILL.md, and run /codex-review next time you have a plan worth challenging.

P.S. — Yes, this blog post was also written with Claude Code. I made an AI write about making AIs argue with each other. We're through the looking glass, people.

Hey! I'm Aseem Shrey.

const about_me = {
    loves: ["CyberSec", "Creating Stuff"],
    currently_reading: "Dopamine Nation: Finding Balance in the Age of Indulgence",
    other_interests: [
        "Reading 📚",
        "IoT Projects 💡",
        "Running 🏃 (Aim to run a full marathon)",
        "Swimming 🏊‍♂️"
    ],
    online_presence: ["HackingSimplified", "AseemShrey"]
};

Ping me up if you wanna talk about anything.


📋 讨论归档

讨论进行中…