
Claude Code vs. Codex:权威对比指南

这篇文章本质是在说:Claude Code 和 Codex 水平接近,但在“生态 + 价格梯度 + 使用体感”上,Claude 当前略占上风。

2026-03-11

核心观点

  • 模型任务上限:Claude 的 Opus 4.6 对“长难任务”更扛打

基于 METR 的 Task-Completion Time Horizon,对比 Opus 4.6 和 GPT-5.3-Codex:在 50% 成功率下,Opus 能稳定完成“人类 12 小时级别”的任务,而 Codex 约为 5 小时 50 分钟;在 80% 成功率下差距缩小。这说明 Claude 在长链条、高复杂度任务上理论上更稳,但这些指标未必直接等价于日常开发体验。

  • 速度不是关键,质量和“谁负责善后”才是

Claude Code 通常更快,但作者强调:真正重要的是你是否需要花大量时间帮它收拾烂摊子。如果一个代理慢一点但结果更稳、需要更少人工 debug,总体开发效率反而更高。也就是说,不能被“推理速度”这种表层指标带节奏。

  • 技术实现风格不同:Claude 偏“干活型”,Codex 偏“工程型”

两者在 RAG 案例中选择了类似的总体方案(同款 embedding、Top-K 等),但在实现细节上风格迥异:Claude Code 选择 ChromaDB、递归字符分块、函数式且相对扁平的结构,注重“先跑通”;Codex 选择 FAISS、句子级分块、带 dataclasses 和 CLI 的 OOP 管线,注重工程化、可配置、可维护。前者像“帮你快速搞定活儿的资深同事”,后者像“交付有规范、结构化不错的外包”。

  • 社区热度与生态:当前 Claude 更像“有拉力的整合生态”

作者明显感受到 Anthropic 的生态黏性:Claude Chat、Cowork、Code 逐渐形成一个闭环,体验接近“Apple 式生态”,加上中间 $100/月档位(Max 5x)使得“成本/体验比”更可控。而 OpenAI 这边,作者感知不到清晰的产品故事:除了 Codex,其它产品被描述为“零散拼图”,竞争力有限。VS Code 安装量和 GitHub star 也显示两者热度相近且 Claude 略微领先。

  • 基准与实验结果:性能接近,但 token 经济与行为模式有差异

第三方评测显示:在同样任务上,Claude Code 的 token 消耗是 Codex 的 3.2–4.2 倍,意味着更容易触及订阅上限。从作者亲自搭建 RAG 管线的实验看,100 道问题中 Claude 赢 42、Codex 赢 33、25 平局,差距不算碾压;Claude 在“跑通一条可用 pipeline 并顺手 debug 环境”的行为上更主动,而 Codex 更倾向于把部分安装/环境工作留给用户。

跟我们的关联

1. 对 ATou:如何选主力编码代理与付费梯度?

  • 意味着什么:如果你重度依赖 AI 编程,选择不仅是“谁更聪明”,而是“谁的生态与你的日常工作更契合 + 哪个价格档位能稳定覆盖你的 token 需求”。Claude 的 Max 5x($100/月)是一个现实可用的中档选择,而 Codex 则是 20→200 的“跳档”。
  • 下一步怎么用:
  • 先用 $20 档做 1–2 周“真实任务验证”:选 3–5 个可量化的项目(如 RAG、小工具、内部脚本重构),对比谁在“总交付时间(含 debug)/结果质量/心智负担”上更好。
  • 如果明显偏向 Claude,再评估是否需要升级到 Max 5x 而不是直接 200;在预算规划里把 token 经济(消耗多、但可能更高效)写进决策说明。

2. 对 Neta:如何设计和评估我们自己的代理产品?

  • 意味着什么:文章强调“体感”“长任务稳定性”“工程架构 vs 快速跑通”的差异,这些其实是我们设计代理产品和评测内部模型时应该显式量化和展示的维度,而不仅仅是“编码基准分数”。
  • 下一步怎么用:
  • 在内部评测中引入类似 Task-Completion Time Horizon 的指标(按“人类小时数 + 成功率”来表达),替换掉一部分抽象 Benchmark。
  • 在产品设计上明确我们的“人格定位”:是“资深开发者型”还是“外包承包商型”?围绕这个定位设计交互模式(提问/解释/自动环境操作的默认策略),并在文档中讲清楚,降低用户预期错配。

3. 对 Uota:如何组织团队对 AI 工具做“真实场景对比”?

  • 意味着什么:作者的 RAG 实战案例给了一个可复制的评测模板——不是看“谁写个 demo 漂亮”,而是定义一个有明确指标的真实任务(RAG / 训练 / 性能优化),然后让不同代理各自从 0 到 1 完成,再用统一评审标准比较。
  • 下一步怎么用:
  • 组织一次小型“工具对决”:选一个对团队有用的任务(如:搭一个内部知识库 RAG、写监控脚本、重构一段遗留代码),分别用 Claude Code 和 Codex(或我们当前候选工具)完成。
  • 在评审时刻意记录:
  • 代理是否自动跑通与测试?
  • 环境配置工作是自己干还是代理帮你干?
  • 产出的代码结构是否利于后续维护?
  • 总结出一套“我们团队自己的评估指标”,未来换工具时可重复复用。

4. 对 Neta / ATou:对第三方评测和“社区口碑”的使用方式

  • 意味着什么:文章引用了 METR 和 Morph 等评测,同时又强调“体感”和频繁迭代,这提醒我们:外部评测只能作为粗筛,无法替代在本团队工作流中的 A/B 实测。
  • 下一步怎么用:
  • 统一一个策略:
  • 第一步用第三方评测做“候选集筛选”(淘汰明显落后的工具);
  • 第二步在团队高频场景中做针对性试用(每个候选工具选 1–2 个“深度试用人”);
  • 第三步根据“工具与生态”的耦合度和未来路线(比如我们偏 Anthropic 生态还是 OpenAI 生态)做长期决策。
  • 对外沟通时,少用“某某基准高 X%”的宣传,多用“在某类任务的完成时长/稳定性/维护成本上的真实对比案例”。

讨论引子

1. 如果我们只允许订阅一个重度编码代理(Claude Code 或 Codex),在当前我们的实际任务结构中(长链路研发 vs 短平快需求),哪个会在“综合成本(钱+时间+维护)”下更划算?

2. 对我们团队来说,“快速跑通、自动处理环境问题”重要,还是“代码架构更规范、配置化更好”更重要?哪一个更符合我们的长期技术债策略?

3. 我们现在在评估 AI 工具时,是否过于相信 Benchmark / 社区口碑,而忽略了“我们自己的真实 A/B 实验”?要不要制定一套标准的“工具试用实验设计模板”?

我用了几个月的 Claude Code,后来转到 Codex。最近我又切回了 Claude,原因和基准测试无关。我也让它们在同一个任务上做了对比。

本文将讨论:

  • Claude Code 与 Codex 的不同侧面;

  • 驱动它们的两款旗舰模型之间的差异:Opus 4.6 vs. GPT-5.3-Codex;

  • 真正会改变你 AI 编程体验的因素;

  • 一个小型案例研究:我如何让它们在同一个任务(搭建 RAG 管线)上“同台竞技”。

先给个善意提醒:本文阅读约需 12 分钟。如果你打算每月为其中任何一个付 $200,我认为这段时间非常值得投入。

Opus 4.6 vs. GPT-5.3-Codex:任务完成时间跨度(Task-Completion Time Horizon)

要比较 Codex 和 Claude Code,一个相当可靠的维度,是它们各自的旗舰模型,以及 Completion Time Horizon(任务完成时间跨度),你可以在 https://metr.org/time-horizons/ 查看。

这个对比问的是:这个模型能可靠完成多长的任务? 任务完成时间跨度,指的是在某个任务时长(以人类专家完成该任务所需时间衡量)下,模型被预测能以某种可靠性成功完成任务。比如,一个“在 50% 成功率下时间跨度为 2 小时”的模型,意思是:给它一个熟练人类需要 2 小时完成的任务,AI 大约有一半概率能成功。
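这个定义可以用一个极简的草图来说明:对一组“(任务时长, 成功/失败)”的观测,拟合一条“成功率随任务时长下降”的曲线,曲线穿过 50% 的位置就是 50% 时间跨度。以下纯 Python 示意中的数据是虚构的,拟合也只用粗糙的网格搜索代替真正的优化器,与 METR 的实际方法并不等同:

```python
import math

# 虚构的观测:(人类耗时/小时, 是否成功)。仅用于演示指标的含义。
tasks = [(0.5, 1), (1, 1), (2, 1), (2, 0), (4, 1),
         (4, 0), (8, 1), (8, 0), (16, 0), (16, 0)]

def nll(a, b):
    """P(成功) = sigmoid(a - b*log2(耗时)) 的负对数似然。"""
    total = 0.0
    for hours, ok in tasks:
        p = 1 / (1 + math.exp(-(a - b * math.log2(hours))))
        p = min(max(p, 1e-6), 1 - 1e-6)
        total -= ok * math.log(p) + (1 - ok) * math.log(1 - p)
    return total

# 在网格上找负对数似然最小的 (a, b)
a, b = min(((i / 10, j / 10) for i in range(-50, 51) for j in range(1, 51)),
           key=lambda ab: nll(*ab))

# 拟合曲线在 p = 0.5 处对应的任务时长,即“50% 成功率的时间跨度”
horizon_50 = 2 ** (a / b)
print(f"50% time horizon ≈ {horizon_50:.1f} 人类小时")
```

80% 成功率下的跨度则对应曲线穿过 p = 0.8 的位置,自然会比 50% 跨度更短。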

在这项研究里,他们为每个模型都使用了合适的脚手架(scaffold),包括 Claude Code 和 Codex。因此,虽然重点在模型而非脚手架,我们也能顺带对脚手架的可靠性有个直观感受:它告诉我们,这两个编码代理里,谁更能扛住 更长、更难 的任务。

从图表里你可以看到,Opus 4.6 与 GPT-5.3-Codex 之间存在巨大的差距:在 50% 成功率下,Opus 4.6 的任务完成时长为 12 小时;而 GPT-5.3-Codex 的数字是 5 小时 50 分钟。在 80% 成功率下,两者的差距会缩小。

这清楚地表明,两款模型之间存在差距,因此 Claude Code 与 Codex 在应对困难、挑战性任务方面也会有差距。但这未必能直接映射到你日常使用它们的任务类型上——这一点需要记在心里。

Claude Code 更快,但速度其实没那么重要

Claude 的速度确实出了名地快过 Codex,但和编码代理一起工作,本质上是一个 长期 过程。

如果一个代理把任务用一半时间做完,却让你额外花 10 分钟去 debug 那破玩意;相比之下,另一个代理实现得更慢一点,但之后你几乎不用盯着它收拾残局——那多花的时间绝对 100% 值得。

并不是说 Claude Code 或 Codex 更容易犯错;我只是想给你一个“评估代理时放在脑后”的思维框架:不要只听别人炫耀他们的代理写代码有多快。

任务类型对代理很重要

Codex 和 Claude Code 的表现,会随着所做的编码任务而变化:在某个 AI Engineering 任务里,一个可能胜过另一个;但在 web 开发任务里,同一个模型可能会被彻底打爆。

哪些编码任务更适合 Codex,哪些更适合 Claude Code? 这方面的研究并不充分。

例如,在底层编程里该用哪个,并不明确。理想情况下,你应该在一个简单、可验证的设置里把两者都测一遍,再决定全力投入。但对大多数人来说,同时付 $300–$400 订阅两个产品并不现实。

在各种编码任务上系统评测这两种代理会很有意思,但也并不简单,因为这些代理以及背后的模型每隔几个月就会发生巨大变化。

它们各自如何诞生

Claude Code 最初是 Anthropic 的 @bcherny 做的一个副项目:他搭了个终端原型,能调用 Claude API、读取文件、运行一些 bash 命令。

到了第五天,内部团队里就有一半人开始用它。随后,Claude Code 于 2025 年 2 月 24 日以 research preview 形式发布,使用的是 Claude 3.7 Sonnet。它被开发者大规模采用花了一些时间;后来,Anthropic 也为它发布了 VS Code 扩展。

另一方面,OpenAI 曾宣布最初的 Codex 模型:一个在 GitHub 代码上微调的 120 亿参数 GPT-3 模型,最终驱动了 GitHub Copilot 的第一版。不过,现在的 Codex 是一个全新的产品。

Codex CLI 于 2025 年 4 月 16 日先行发布,作为一个终端代理,并且此后持续演进,配上了更强的模型。最新的 GPT-5.3-Codex(2026 年 2 月 5 日)被 OpenAI 描述为“第一个帮助创造了自己的模型”。

https://newsletter.pragmaticengineer.com/p/how-codex-is-built

@GergelyOrosz 还和 Claude Code 与 Codex 的开发者分别做了两次非常有意思的访谈,聊他们的技术栈、研发方式,以及各自最初是怎么启动的。看完这两篇你能学到很多。

👉 Codex 是如何构建的

技术栈与驱动模型

Claude Code 用 TypeScript 编写,终端 UI 渲染采用 React + Ink。它以单个 Bun 可执行文件的形式发布(Anthropic 在 2025 年 12 月收购 Bun,也正是出于这个原因)。它使用的 Opus 和 Sonnet 模型还支持 100 万 token 的上下文窗口。

Codex CLI 则用 Rust 编写,强调性能、正确性与可移植性。OpenAI 甚至把 Ratatui(一个 Rust TUI 库)的维护者招进了团队。

这两个 CLI 工具,本质上都是通过 API 调用所用模型的轻量封装。我在使用 Claude Code CLI 时注意到一些小“毛刺”,而在 Codex 上几乎没怎么遇到;考虑到它们的技术栈差异,这也算情理之中。

不过,这些毛刺最多只是略微烦人;它们并不会真正影响你的编程体验。

基准测试很接近,但有细微差别:Token 经济学

最大的性能差异不在准确率,而在 token 效率。一份由 Morph 做的 Opus vs. Codex 综合评测,展示了一个有趣的差距。

https://www.morphllm.com/best-ai-model-for-coding

在完全相同的任务上,Claude Code 的 token 使用量是 Codex 的 3.2–4.2 倍。在一次 Figma 插件构建中,Codex 消耗了 150 万 token,而 Claude 消耗了 620 万。

如果这点属实,那么在支付同样的订阅费用时,你用 Claude Code 更可能触及 token 上限。
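这笔账可以粗略算一下。下面的草图里,唯一来自原文的数据是 Figma 插件案例的 1.5M vs 6.2M 单任务消耗;月度配额 100M 纯属假设,仅用于演示倍数如何转化为“能跑多少个任务”:

```python
# 粗略算术示意:同样的 token 配额下,单任务消耗如何影响可完成的任务数。
# 注意:quota 是假设值;1.5M / 6.2M 来自文中的 Figma 插件案例。
quota = 100_000_000
codex_tokens_per_task = 1_500_000
claude_tokens_per_task = 6_200_000

codex_tasks = quota // codex_tokens_per_task    # 配额内 Codex 能跑完的任务数
claude_tasks = quota // claude_tokens_per_task  # 配额内 Claude 能跑完的任务数

# 该案例的消耗倍数 ≈ 4.1,落在文中 3.2–4.2x 的区间内
ratio = claude_tokens_per_task / codex_tokens_per_task
print(codex_tasks, claude_tasks, round(ratio, 1))
```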

体感最重要

Claude 给人的感觉像是一位资深开发者在替你做事;而 Codex 更像一个你把任务交出去、过一会儿再回来验收结果的外包承包商。

这是开发者最常用来描述两者差异的方式。

据说 Claude Code 的交互感更强,也更具深度推理的质感——这很符合 Opus 的定位。它会问你问题、展示推理过程、解释它的做法。尽管在我那次单次对比实验里并不明显,但基于我连续数月使用 Claude 的经验,我可以确认这点确实成立。

Codex 则以“直观任务一次就写对”的能力著称,但代价是实现速度会略微慢一些。

话虽如此,当你在 AGENTS.md 里把诉求写得足够明确时,这种行为差异会显著缩小。比如你要求模型在“火力全开”之前先和你核对实现计划——无论你用的是“资深开发者”式代理,还是“承包商”式代理,它都会照做。
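比如,一个最小化的 AGENTS.md 可能长这样(内容仅为示意,小节与措辞都由你自定):

```markdown
# AGENTS.md(示意)

## 工作流程
- 动手写代码之前,先给出实现计划,等我确认后再开始。
- 实现完成后,自己运行并测试脚本,把报错修掉再交付。

## 代码风格
- 优先简单、扁平的结构;除非我明确要求,不要引入额外抽象。
```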

这并不是说它们不一样——它们确实不一样。

只是没有你在 X 上常听到的那么夸张。

快速数据

在 VS Code Marketplace 上,Claude Code 有 610 万次安装、评分 4/5;Codex 有 540 万次安装、评分 3.5/5。

在 GitHub 上,Claude Code 大约有 6.5–7.2 万星标,Codex 约有 6.4 万星标。

我为什么暂时又回到 Claude Code

Anthropic 的生态拉力很强

选择 Codex 还是 Claude Code,不只是“写代码”这件事。订阅其中任何一个,本质上也是在订阅 Anthropic/OpenAI 的整个生态——这一点你可能需要纳入考虑。


我个人觉得,Claude 正在形成一个类似 Apple 的高热度生态:现在有 Claude Cowork、Claude Chat、以及 Claude Code。看起来 Anthropic 也在借助 Claude app,慢慢搭建一个更安全、更温和版本的 OpenClaw(你的主动型个人代理),并把相关的小模块逐步上线。

而在 OpenAI 这边,就目前而言我还没看到什么特别诱人的东西。除了 Codex 以外,其他看起来都很乏味。我感受到的不是“生态”,而是一些零散拼图——而且外面还有更好的替代品。

我已经更常用 Claude Chat 而不是 ChatGPT;对我来说,相比 Opus,ChatGPT 在这个阶段几乎处于“勉强可用”的边缘。UI、聊天的语气、以及模型选择,这些都并不能真正鼓励我去用 ChatGPT。

因此,当我已经在高频使用 Claude Chat、也打算继续折腾 cowork,而又看不到从 Claude Code → Codex 的迁移带来任何“足以改变一切”的提升时,回到 Claude Code、把每月 $200 的订阅从钱包里省掉,几乎成了一个毫不费力的决定。

这已经变成影响我回到 Claude 的关键因素之一,并且显著左右了我的选择。

定价

Claude Code 和 Codex 的定价基本一致:

入门:两者都是 $20/月

重度用户(Power User):Claude Code 有 $100/月的 Max 5x

极重度用户(Heavy User):两者都是 $200/月

Claude Code 真正占优的地方在于:它提供了 $100/月这一档中间层级,而不是从 $20 直接疯跳到 $200。我认为 Max 5x($100/月)对大多数开发者来说已经足够。

因此,从某种意义上说,Claude Code 在实际使用中更“便宜”:它允许你选择适合自己的更低档位,而不是逼你一路爬完整条价格阶梯。

Skills 与插件:开发者生态

由于 skills 在 Claude Code 与 Codex 之间是兼容的,所以无论你用哪一个,基本都不会在这方面感到差异。不过,大多数 skill hub 和仓库都以 Claude Code 命名,这可能会带来一点困惑。

很多其他现象也一样:你在 Reddit、X 或各类博客里看到的关于编码代理的讨论,更多是在聊 Claude Code 而不是 Codex——尽管这些原则对二者都适用。这在某种程度上也说明了它们的热度与社区体量差异。

Codex 对 skills 与插件的支持上线得比 Claude Code 晚得多。但插件不像 skills 那样兼容;而且 Codex 的插件支持是最近才开始的,可用插件也没那么多。

话说回来,很多开发者(包括我)根本不用插件。所以,除非你确实需要各类插件支持,否则这点不必担心,也不该作为决策依据。

RAG 管线:一个案例研究

为了做对比,我选择了一个可以量化评估的任务。比如做一个 landing page 的问题在于,它是个定性任务:一个人觉得页面很酷,另一个人可能觉得那不过是“紫色渐变的糊弄之作”。

因此我选择搭建一个简单的 RAG 管线,因为生成答案的准确性可以用数字来衡量。

如果你也想做类似对比,其他不错的任务还包括:训练一个视觉模型、微调一个 LLM,或者测量一个底层程序的性能。

构建检索管线是 AI 工程师的常见工作,你可能会在工作中用 Claude Code 或 Codex 来做这类事情。我让这两个编码代理给我搭一个面向研究论文的 RAG 问答管线,流程很简单:

  1. 取若干论文并抽取其文本。

  2. 将内容切分为更小的块(chunk)。

  3. 把每个 chunk 嵌入到向量空间。

  4. 当用户提问时,找到与问题 embedding 最接近的 chunk embeddings。

  5. 以原始形式检索这些相近的 chunk(而不是它们的 embeddings)。

  6. 用这些上下文来回答用户的问题。

这个任务简单到可以在一次会话里实现,但其中的细节会极大影响输出,比如:

  • 用什么分块策略;

  • 如何对分块做 embedding;

  • 选择哪种向量存储;

  • 如何处理“哪个分块更接近查询”的置信度;

  • 是否要改写用户查询以便找到更多相似分块,等等。
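上面的流程可以压缩成一个极简的骨架。下面的草图用词袋(bag-of-words)向量代替真实的 embedding 模型、用内存列表代替 ChromaDB/FAISS,只为展示“分块 → 索引 → 检索 → 兜底”这条链路,并非两位代理的实际实现:

```python
import math
from collections import Counter

def embed(text):
    # 示意:用词频向量代替 sentence-transformers 这类真实 embedding 模型
    return Counter(text.lower().split())

def cosine(a, b):
    # 两个词频向量的余弦相似度
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_index(docs, chunk_words=50):
    # 步骤 1–3:取文本(此处直接给字符串)、分块、嵌入
    index = []
    for doc in docs:
        words = doc.split()
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i:i + chunk_words])
            index.append((chunk, embed(chunk)))
    return index

def retrieve(index, question, k=5, min_score=0.1):
    # 步骤 4–5:找与问题最接近的 chunk,按原文形式返回
    q = embed(question)
    ranked = sorted(((cosine(q, vec), text) for text, vec in index), reverse=True)
    top = ranked[:k]
    # 证据不足时返回 None,让上层走兜底回答,而不是让 LLM 胡编
    if not top or top[0][0] < min_score:
        return None
    return [text for score, text in top]

docs = ["retrieval augmented generation grounds a language model "
        "in documents fetched by a search step"]
index = build_index(docs)
print(retrieve(index, "what grounds the language model?"))
print(retrieve(index, "unrelated query about cooking pasta"))
```

真实管线里,embed 换成向量模型、index 换成持久化向量库、min_score 换成针对具体距离度量调过的阈值,骨架不变。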

实验设置

我从过去一周 @huggingface 的 daily papers 里选了 5 篇研究论文,并创建了一个测试数据集(size = 100),包含问题与对应的标准答案(ground truth),用于之后评估 Claude 或 Codex 的实现质量。

对两个编码代理,我都提出了以下要求:

  • 构建一个 Python RAG 管线

  • 用 PyMuPDF 处理所有 PDF

  • 为该用例选择一个好的分块策略

  • 创建 embeddings,并建立一个可持久化的本地向量索引(由你选择)

  • 用 llama-3.1-8b-instant 生成最终答案

  • 如果找不到足够证据,不要胡编;返回一个兜底回答

对于 Codex 和 Claude Code,我都使用了各自最强、最流行、且默认可用的模型:gpt-5.3-codexOpus 4.6,并都设置为 High effort(推理强度)。两者都没有配置 AGENTS.md。

它们如何实现这条管线

我并没有观察到它们在“如何思考任务”方面有什么明显差异,除了 Codex 在解释它的计划和接下来要做什么时更啰嗦。Claude 则更像是直接写文件、执行命令,不太多说。

Codex 完成任务所花的时间也比 Claude 更长。

更重要的是,Claude 会把脚本端到端跑通,确保管线可以直接使用。

而 Codex 完成实现后并没有测试或运行程序,只是让我去 pip install 依赖并运行脚本。结果我自然跑出了一个错误,之后 Codex 才修复它。Claude 的脚本则从一开始就毫无问题。

我在 Codex 身上也常见到这种模式:它会把不少劳动或环境设置留给你去做,而不是自己顺手就做完。

当然,Codex 会告诉你哪里有环境问题或实现难点,并采取行动;但 Claude 往往会直接“自作主张”把它修了——这取决于你的偏好,可能是好事,也可能是坏事。

我还注意到,Codex 在新会话里输出第一个 token 的初始响应时间,有时会高达一分钟;而 Claude Code 这方面短得多。

Claude Code vs. Codex:实现对比

两位编码代理采用的思路惊人地相似:

  • embedding 模型都选了同一个 all-MiniLM-L6-v2

  • Top-K 检索都选了 k=5

  • 都在系统提示词中限制 LLM 只能使用提供的上下文

它们在这些地方走了不同路线:

  • 向量存储: Claude Code 选择了 ChromaDB 作为向量数据库;Codex 则选择了 FAISS,这是一个更底层的相似度检索库,更省内存也更快。

  • 分块(Chunking): Claude Code 使用递归字符切分:先尝试 \n\n,再 \n,再 ".",再 " ";目标是 1000 字符、200 字符重叠。Codex 使用句子级的词数切分,把 chunk 填充到最多 220 个词,并采用 40 词重叠。Claude Code 按结构(段落 → 行 → 句子 → 词)逐级拆分,并以字符计量;Codex 先按句子切,再把句子打包进“词数预算”的桶里。Codex 的做法尊重句子边界,避免把句子从中间切开;但 220 词对学术文本来说可能偏小。

  • 检索: 两者都选 Top-5 chunk。Claude Code 返回原始的 L2 距离;Codex 返回内积(余弦)分数。

  • 置信度: Claude Code 对最佳 L2 距离使用单一阈值(>1.2 = 不相关),随后再用平均距离判断“证据薄弱” vs. “证据扎实”。Codex 用多指标、三档分级:强、中、弱(不足)。

  • 代码架构:
    Claude Code: 扁平函数式结构、每个模块各自放常量、不校验模型一致性的输入。
    Codex: 面向对象的管线类、集中式配置、dataclasses、argparse CLI、以及模型一致性校验。
    Codex 的工程化与可配置性明显更强;在更大、更严肃的代码库里,这一点非常关键。
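两种分块思路可以各自用简化的草图表示。注意这只是按文中描述做的近似重建:char_chunks 用固定字符窗口代替真正的“递归字符切分”(后者会优先在段落/行/句子边界断开),sentence_chunks 则示意“按句子切、再装进词数预算的桶”这一做法:

```python
import re

def char_chunks(text, size=1000, overlap=200):
    # 简化版“字符切分”:固定窗口 + 重叠。
    # 真正的递归字符切分会先尝试 \n\n、\n、句号、空格,避免切碎单词。
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def sentence_chunks(text, max_words=220, overlap_words=40):
    # Codex 风格示意:先按句子切,再把句子装进“词数预算”的桶,带词级重叠。
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        current.append(sent)
        if sum(len(s.split()) for s in current) >= max_words:
            chunks.append(" ".join(current))
            carry, budget = [], 0   # 把末尾约 overlap_words 个词带进下一个块
            for s in reversed(current):
                carry.insert(0, s)
                budget += len(s.split())
                if budget >= overlap_words:
                    break
            current = carry
    if current and (not chunks or not chunks[-1].endswith(" ".join(current))):
        chunks.append(" ".join(current))   # 收尾:剩余句子作为最后一个块
    return chunks

demo = "This is a ten word sentence used for the demo. " * 100
print(len(char_chunks(demo)), len(sentence_chunks(demo)))
```

两者的取舍在草图里也能看出来:句子打包永远不会把句子切成两半,但 220 词的预算对学术长段落可能偏紧;字符窗口的块更大,却可能在任意位置断开。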

结果

我用 gpt-5.4 作为“LLM 裁判”,从四个维度对两条管线的回答进行比较:正确性(Correctness)、完整性(Completeness)、相关性(Relevance)、简洁性(Conciseness)。

在 100 个问题中,Claude Code 赢了 42 个,Codex 赢了 33 个,25 个平局。 Claude 的胜出主要源于它更宽松的置信度门控;另外它的生成温度也略高(Claude 为 0.2,Codex 为 0.1)。
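“LLM 裁判”的骨架大致如下。提示词的具体措辞和 judge 模型的调用方式取决于你的评测框架,这里只示意提示词结构与计分部分;42/33/25 来自文中结果:

```python
# LLM 裁判的提示词模板(示意):逐题比较两条管线的回答
JUDGE_PROMPT = """比较两条 RAG 管线对同一问题的回答,只输出 A、B 或 TIE。
问题:{question}
标准答案:{ground_truth}
回答 A:{answer_a}
回答 B:{answer_b}
评判维度:正确性、完整性、相关性、简洁性。"""

def tally(verdicts):
    # 汇总逐题判定:返回 (A 胜场, B 胜场, 平局数)
    return verdicts.count("A"), verdicts.count("B"), verdicts.count("TIE")

# 文中结果的形状:100 题中 Claude(A)赢 42、Codex(B)赢 33、平 25
print(tally(["A"] * 42 + ["B"] * 33 + ["TIE"] * 25))
```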

一点保留意见

这只是一个非常简单的设置;我主要是好奇两个编码代理在实现同一个封闭式任务时,会采取怎样不同的路径。在专业场景里,整体架构的关键决策通常由开发者自己拍板:分块方法、向量库、检索策略等。而且在专业场景中,开发此类系统需要更多测试与迭代改进,配合更可靠的测试集与验证流程。

不过,对于不太有 RAG 管线经验的初级开发者来说,把这些决策交给 AI 来做,确实很常见、也很符合预期。

随便选一个就行

我不认为在 Claude Code 和 Codex 之间做选择,会出现什么“根本性错误”的决定。相较于当前的整体生态,两者都提供了很强的模型,并且在相近程度上把事情做成。

对我来说,两个最重要的因素是:Anthropic 的生态,以及 $100/月的定价档位。即便我将来不得不把这个档位加到 $200/月,单就生态这一点,我也仍然会选择 Anthropic 的 Claude Code。

最重要的是:你用这些脚手架做什么,以及你怎么用。

这比任何基准测试都更能决定哪一个更适合你;而且除了你亲自测试之后的“体感”,没有更明确的答案。

有些开发者(比如 @steipete)对 Codex 深信不疑;也有一群人坚信 Opus 对 OpenAI 的模型而言几乎是无可匹敌的。

我认为两者同时都可能是对的——只是因为他们使用这些编码代理的工作流不同,“口味”也不同。

如果你拿不准该选哪个,我建议你在与你相关的编程领域里,分别试用两者的 $20/月 版本,并尽量用多个可验证的任务来测试。

最后,请记住:和 AI 相关的一切一样,这个领域每隔几个月就会剧烈变化。你现在可能更喜欢其中一个,但三个月后,代理的行为可能漂移,或者市场上又出现了新的模型。

AI 里真正有“放之四海皆准的定论”的问题并不多,这个话题也不是其中之一 ;)

链接: http://x.com/i/article/2030946053629915136

I've used Claude Code for months, then moved to Codex. I just switched back to Claude and the reason has nothing to do with benchmarks. I also tested both on the same task.

我用了几个月的 Claude Code,后来转到 Codex。最近我又切回了 Claude,原因和基准测试无关。我也让它们在同一个任务上做了对比。

In this article: I will discuss the different aspects of Claude Code and Codex, the difference between the two flagship models powering them Opus 4.6 vs. GPT-5.3-Codex, what really changes your AI coding experience, and discuss a small case study where I have used both of them for the same task of building a RAG pipeline.

本文将讨论: - Claude Code 与 Codex 的不同侧面; - 驱动它们的两款旗舰模型之间的差异:Opus 4.6 vs. GPT-5.3-Codex; - 真正会改变你 AI 编程体验的因素; - 一个小型案例研究:我如何让它们在同一个任务(搭建 RAG 管线)上“同台竞技”。

Just to give you a fair warning, this article takes ~12 minutes to read, and I think that's a time well-invested if you are going to commit to spending $200/month for either of them.

先给个善意提醒:本文阅读约需 12 分钟。如果你打算每月为其中任何一个付 $200,我认为这段时间非常值得投入。

Opus 4.6 vs. GPT-5.3-Codex: Task-Completion Time Horizon

Opus 4.6 vs. GPT-5.3-Codex:任务完成时间跨度(Task-Completion Time Horizon)

One reliable comparison between Codex vs. Claude Code is about their underlying flagship models and the Completion Time Horizon, which you can check out here.

要比较 Codex 和 Claude Code,一个相当可靠的维度,是它们各自的旗舰模型,以及 Completion Time Horizon(任务完成时间跨度),你可以在这里查看。

This comparison asks: how long of a task can this model reliably complete? The task-completion time horizon is the task duration (measured by human expert completion time) at which the model is predicted to succeed with a level of reliability. So a model with a "2-hour time horizon at 50%" means: give it a task that would take a skilled human 2 hours, and the AI succeeds about half the time.

这个对比问的是:这个模型能可靠完成多长的任务? 任务完成时间跨度,指的是在某个任务时长(以人类专家完成该任务所需时间衡量)下,模型被预测能以某种可靠性成功完成任务。比如,一个“在 50% 成功率下时间跨度为 2 小时”的模型,意思是:给它一个熟练人类需要 2 小时完成的任务,AI 大约有一半概率能成功。

For this study, they use the appropriate scaffold for each model, including Claude Code and Codex. So while the focus is on the model, and not on the scaffold, we can get an idea of how reliable the scaffolds are as well. It tells us which one of these coding agents can handle longer, harder tasks.

在这项研究里,他们为每个模型都使用了合适的脚手架(scaffold),包括 Claude Code 和 Codex。因此,虽然重点在模型而非脚手架,我们也能顺带对脚手架的可靠性有个直观感受:它告诉我们,这两个编码代理里,谁更能扛住 更长、更难 的任务。

As you can see in the chart, there is a BIG gap between Opus 4.6 and GPT-5.3-Codex. Opus 4.6 has a 12 hour task completion length at 50% success while for GPT-5.3-Codex, this number is 5 hours and 50 minutes. This gap closes at 80% success between the two models.

从图表里你可以看到,Opus 4.6GPT-5.3-Codex 之间存在巨大的差距:在 50% 成功率下,Opus 4.6 的任务完成时长为 12 小时;而 GPT-5.3-Codex 的数字是 5 小时 50 分钟。在 80% 成功率下,两者的差距会缩小。

This is a clear indication of a gap between these two models and, consequently, between Claude Code and Codex, between how well they can tackle difficult and challenging tasks. It might not directly translate well to the type of tasks you use them for, so keep that in mind.

这清楚地表明,两款模型之间存在差距,因此 Claude Code 与 Codex 在应对困难、挑战性任务方面也会有差距。但这未必能直接映射到你日常使用它们的任务类型上——这一点需要记在心里。

Claude Code is Faster, but Speed Doesn't Matter That Much

Claude Code 更快,但速度其实没那么重要

Claude is famously faster than Codex, but working with coding agents is a long-term process.

Claude 的速度确实出了名地快过 Codex,但和编码代理一起工作,本质上是一个 长期 过程。

If an agent finishes the task in half the time, and then requires you to spend 10 minutes debugging the damn thing, as opposed to spending more time with implementation and not requiring you to babysit it afterwards, that extra time is 100% worth it.

如果一个代理把任务用一半时间做完,却让你额外花 10 分钟去 debug 那破玩意;相比之下,另一个代理实现得更慢一点,但之后你几乎不用盯着它收拾残局——那多花的时间绝对 100% 值得

This is NOT to say that Claude Code or Codex makes more mistakes, but a general idea to have in the back of your mind when evaluating the agents yourself or hearing people talk flex their agent's coding speed.

并不是说 Claude Code 或 Codex 更容易犯错;我只是想给你一个“评估代理时放在脑后”的思维框架:不要只听别人炫耀他们的代理写代码有多快。

The Task Matters For Agents

任务类型对代理很重要

Codex and Claude Code perform differently based on the coding task they're used in. In an AI Engineering task, one might outperform the other, while in a web development task, that same model would be obliterated.

Codex 和 Claude Code 的表现,会随着所做的编码任务而变化:在某个 AI Engineering 任务里,一个可能胜过另一个;但在 web 开发任务里,同一个模型可能会被彻底打爆。

Which coding tasks are better for Codex or Claude Code? This is not studied well.

哪些编码任务更适合 Codex,哪些更适合 Claude Code? 这方面的研究并不充分。

For example, it's not clear which one to use in low-level programming. Ideally, you'd test both in a simple and verifiable setup before going all-in. But spending $300-$400 for both is not feasible for most people.

例如,在底层编程里该用哪个,并不明确。理想情况下,你应该在一个简单、可验证的设置里把两者都测一遍,再决定全力投入。但对大多数人来说,同时付 $300–$400 订阅两个产品并不现实。

It's an interesting area of research to fully review both agents in a variety of coding tasks, but it's also not trivial since these agents and the model powering them change drastically every few months.

在各种编码任务上系统评测这两种代理会很有意思,但也并不简单,因为这些代理以及背后的模型每隔几个月就会发生巨大变化。

How Each Came to Exist

它们各自如何诞生

Claude Code initially started as a side project by @bcherny at Anthropic, who built a terminal prototype that could interact with the Claude API, read files, and run some bash commands.

Claude Code 最初是 Anthropic 的 @bcherny 做的一个副项目:他搭了个终端原型,能调用 Claude API、读取文件、运行一些 bash 命令。

Half the internal team started using it by day five. Then Claude Code was released as a research preview on February 24, 2025, using Claude 3.7 Sonnet. It took some time to be mass-adopted by developers, and over time, Anthropic released a VS Code extension for it as well.

到了第五天,内部团队里就有一半人开始用它。随后,Claude Code 于 2025 年 2 月 24 日以 research preview 形式发布,使用的是 Claude 3.7 Sonnet。它被开发者大规模采用花了一些时间;后来,Anthropic 也为它发布了 VS Code 扩展。

OpenAI on the other hand, announced the original Codex model as a 12B GPT-3 model fine-tuned on GitHub code, which eventually powered the first version of GitHub Copilot. The new Codex is an entirely new product though.

另一方面,OpenAI 曾宣布最初的 Codex 模型:一个在 GitHub 代码上微调的 120 亿参数 GPT-3 模型,最终驱动了 GitHub Copilot 的第一版。不过,现在的 Codex 是一个完全全新的产品。

Codex CLI launched first on April 16, 2025, as a terminal agent, and has evolved with better models even since. The latest GPT-5.3-Codex (February 5, 2026) is described by OpenAI as "the first model that helped create itself."

Codex CLI 于 2025 年 4 月 16 日先行发布,作为一个终端代理,并且此后持续演进,配上了更强的模型。最新的 GPT-5.3-Codex(2026 年 2 月 5 日)被 OpenAI 描述为“第一个帮助创造了自己的模型”。

@GergelyOrosz has two very interesting interviews with the developers of Claude Code and Codex, about their tech stack, how they develop them, and also how each one started initially. You can learn a lot from these two interviews.

@GergelyOrosz 还和 Claude Code 与 Codex 的开发者分别做了两次非常有意思的访谈,聊他们的技术栈、研发方式,以及各自最初是怎么启动的。看完这两篇你能学到很多。

👉 How Codex is Built

👉 Codex 是如何构建的

Tech stacks and Powering Models

技术栈与驱动模型

Claude Code is written in TypeScript, using React with Ink for terminal UI rendering. It ships as a single Bun executable (Anthropic acquired Bun in December 2025 for this reason). The Opus and Sonnet models used by it also support a 1M-token context window.

Claude Code 用 TypeScript 编写,终端 UI 渲染采用 React + Ink。它以单个 Bun 可执行文件的形式发布(Anthropic 在 2025 年 12 月收购 Bun,也正是出于这个原因)。它使用的 Opus 和 Sonnet 模型还支持 100 万 token 的上下文窗口。

The Codex CLI is written in Rust, for its performance, correctness, and portability. OpenAI even hired the maintainer of Ratatui (a Rust TUI library) for the team.

Codex CLI 则用 Rust 编写,强调性能、正确性与可移植性。OpenAI 甚至把 Ratatui(一个 Rust TUI 库)的维护者招进了团队。

Both CLI tools are thin wrappers around the model that they use through the API. I've noticed some small "glitches" when working with the Claude Code CLI that I didn't really notice with Codex, and I think that might be expected given their tech stack.

这两个 CLI 工具,本质上都是通过 API 调用所用模型的轻量封装。我在使用 Claude Code CLI 时注意到一些小“毛刺”,而在 Codex 上几乎没怎么遇到;考虑到它们的技术栈差异,这也算情理之中。

However, these glitches are nothing more than mildly annoying things; they really don't affect your coding experience.

不过,这些毛刺最多只是略微烦人;它们并不会真正影响你的编程体验。

Benchmarks are Close, But with Nuances: Token Economics

基准测试很接近,但有细微差别:Token 经济学

The biggest performance difference isn't accuracy, but token efficiency. A comprehensive review on the Opus vs. Codex done by Morph shows an interesting gap.

最大的性能差异不在准确率,而在 token 效率。一份由 Morph 做的 Opus vs. Codex 综合评测,展示了一个有趣的差距。

Claude Code uses 3.2–4.2x more tokens than Codex on identical tasks. On a Figma plugin build, Codex consumed 1.5M tokens compared to Claude's 6.2M.

在完全相同的任务上,Claude Code 的 token 使用量是 Codex 的 3.2–4.2 倍。在一次 Figma 插件构建中,Codex 消耗了 150 万 token,而 Claude 消耗了 620 万。

If this is true, it means for paying the same money for a Claude Code subscription, you're more likely to hit token limits.

如果这点属实,那么在支付同样的订阅费用时,你用 Claude Code 更可能触及 token 上限。

The Feeling Matters the Most

体感最重要

Claude feels like a senior developer doing work for you, and Codex is a contractor you hand off tasks to and then come back to pick up the results.

Claude 给人的感觉像是一位资深开发者在替你做事;而 Codex 更像一个你把任务交出去、过一会儿再回来验收结果的外包承包商。

This is the common way developers describe the difference.

这是开发者最常用来描述两者差异的方式。

Claude Code reportedly has a strong interactive feel to it, and also a deep reasoning quality, which is expected of Opus. It asks you questions, shows you the reasoning, and explains its approach. Even though this was not the case in my single comparison experiment, I can confirm this is true, from my many-months experience of using Claude.

据说 Claude Code 的交互感更强,也更具深度推理的质感——这很符合 Opus 的定位。它会问你问题、展示推理过程、解释它的做法。尽管在我那次单次对比实验里并不明显,但基于我连续数月使用 Claude 的经验,我可以确认这点确实成立。

Codex is famous for its first-attempt accuracy on straightforward tasks, which comes at the cost of a slight decrease in implementation speed.

Codex 则以“直观任务一次就写对”的能力著称,但代价是实现速度会略微慢一些。

With all that being said, the difference in the behavior really diminishes as you lay out specifically what you want in the AGENTS.md. If you specify that you need the model to check the implementation plan with you before going off guns blazing, the model will do that, regardless of which one you use, the "senior developer" agent or the "contractor" agent.

话虽如此,当你在 AGENTS.md 里把诉求写得足够明确时,这种行为差异会显著缩小。比如你要求模型在“火力全开”之前先和你核对实现计划——无论你用的是“资深开发者”式代理,还是“承包商”式代理,它都会照做。

This isn't to say that agents aren't actually different, THEY ARE.

这并不是说它们不一样——它们确实不一样。

But not as exaggerated as you commonly hear on X.

只是没有你在 X 上常听到的那么夸张。

Quick Numbers

快速数据

On VS Code Marketplace, Claude Code has 6.1M installs with a 4/5 rating, while Codex has 5.4M installs with a 3.5/5 rating.

在 VS Code Marketplace 上,Claude Code 有 610 万次安装、评分 4/5;Codex 有 540 万次安装、评分 3.5/5。

On GitHub, Claude Code has approximately 65–72K stars and Codex has ~64K stars.

在 GitHub 上,Claude Code 大约有 6.5–7.2 万星标,Codex 约有 6.4 万星标

Why I'm Moving Back to Claude Code for Now

我为什么暂时又回到 Claude Code

Anthropic's Ecosystem Pulls Hard

Anthropic 的生态拉力很强

Choosing whether to go for Codex or Claude Code isn't just about coding. A subscription to each of them is a subscription to the whole ecosystem of Anthropic/OpenAI and this is something you might want to consider.

选择 Codex 还是 Claude Code,不只是“写代码”这件事。订阅其中任何一个,本质上也是在订阅 Anthropic/OpenAI 的整个生态——这一点你可能需要纳入考虑。

I personally believe that Claude is becoming a very hot ecosystem similar to Apple, now with Claude Cowork, the Claude Chat, and the Claude Code. It seems Anthropic is also slowly building a safer and tamer version of OpenClaw (your proactive personal agent) with the Claude app, and the small bits and pieces for it are being rolled out gradually.

我个人觉得,Claude 正在形成一个类似 Apple 的高热度生态:现在有 Claude Cowork、Claude Chat、以及 Claude Code。看起来 Anthropic 也在借助 Claude app,慢慢搭建一个更安全、更温和版本的 OpenClaw(你的主动型个人代理),并把相关的小模块逐步上线。

On OpenAI's front, I'm not seeing anything enticing at the moment. Aside from Codex, everything else seems dull. I don't feel an ecosystem, but fragmented bits and pieces with better alternatives out there.

而在 OpenAI 这边,就目前而言我还没看到什么特别诱人的东西。除了 Codex 以外,其他看起来都很乏味。我感受到的不是“生态”,而是一些零散拼图——而且外面还有更好的替代品。

I've already been using Claude chat rather than ChatGPT, as for me, ChatGPT is borderline unusable at this point compared to Opus. The UI, the tone of the chat, and the model selection, none of them really encourage me to use ChatGPT.

我已经更常用 Claude chat 而不是 ChatGPT;对我来说,相比 Opus,ChatGPT 在这个阶段几乎处于“勉强可用”的边缘。UI、聊天的语气、以及模型选择,这些都并不能真正鼓励我去用 ChatGPT。

So, at the point of which I'm using Claude Chat frequently, I'm planning to tinker with cowork, and I don't see any deal-breaking improvement from Claude Code → Codex migration at the moment, the decision to go back to Claude Code and cut $200/month subscription price out of my pocket really seemed like an easy choice to make.

因此,当我已经在高频使用 Claude Chat、也打算继续折腾 cowork,而又看不到从 Claude Code → Codex 的迁移带来任何“足以改变一切”的提升时,回到 Claude Code、把每月 $200 的订阅从钱包里省掉,几乎成了一个毫不费力的决定。

This has become a major factor for me, and one that drastically influenced my decision to move back to Claude.

这已经变成影响我回到 Claude 的关键因素之一,并且显著左右了我的选择。

Pricing

定价

The pricing for both Claude Code and Codex is basically the same:

Claude Code 和 Codex 的定价基本一致:

Entry: $20/month for both

入门:两者都是 $20/月

Power User: Claude Code has a Max 5x priced at $100/month

重度用户(Power User):Claude Code 有 $100/月的 Max 5x

Heavy User: $200/month for both

极重度用户(Heavy User):两者都是 $200/月

Where Claude Code really shines is that it offers a mid-tier $100/month, rather than a crazy jump from $20 to $200 subscriptions, and I believe the Max 5x plan ($100/month) is really adequate for most developers.

Claude Code 真正占优的地方在于:它提供了 $100/月这一档中间层级,而不是从 $20 直接疯跳到 $200。我认为 Max 5x($100/月)对大多数开发者来说已经足够。

So in a way, you could say Claude Code is cheaper in practice, because it allows you to select a cheaper plan that works for you rather than forcing you to climb the pricing ladder.

因此,从某种意义上说,Claude Code 在实际使用中更“便宜”:它允许你选择适合自己的更低档位,而不是逼你一路爬完整条价格阶梯。

Skills and Plugins: The Developer Ecosystem

Skills 与插件:开发者生态

As skills are compatible between Claude Code and Codex, you won't notice a difference regardless of which one you use. However, most skill hubs and repos are named after Claude Code, which might be a little confusing.

由于 skills 在 Claude Code 与 Codex 之间是兼容的,所以无论你用哪一个,基本都不会在这方面感到差异。不过,大多数 skill hub 和仓库都以 Claude Code 命名,这可能会带来一点困惑。

This is the case with most other things as well. Many of the posts you see on Reddit, X, or blog posts about coding agents are about Claude Code rather than Codex, even though the same principles apply to both of them, which really tells you something about the popularity and community size.

很多其他现象也一样:你在 Reddit、X 或各类博客里看到的关于编码代理的讨论,更多是在聊 Claude Code 而不是 Codex——尽管这些原则对二者都适用。这在某种程度上也说明了它们的热度与社区体量差异。

Codex has launched support for both skills and plugins much later than Claude Code. But plugins aren't as compatible as skills. And as plugin support for Codex started just recently, there's not so many available.

Codex 对 skills 与插件的支持上线得比 Claude Code 晚得多。但插件不像 skills 那样兼容;而且 Codex 的插件支持是最近才开始的,可用插件也没那么多。

All this to say that many developers, including me, don't use plugins at all. So unless you specifically need the support for various plugins, this is not something to worry about or base your decision on.

话说回来,很多开发者(包括我)根本不用插件。所以,除非你确实需要各类插件支持,否则这点不必担心,也不该作为决策依据。

RAG Pipeline: A Case Study

RAG 管线:一个案例研究

For the comparison, I chose to go with a task that can be quantitatively assessed. The problem with creating a landing page, for example, is that it's a qualitative task: one might think a landing page is cool looking while the other calls it purple-gradient slop.

为了做对比,我选择了一个可以量化评估的任务。比如做一个 landing page 的问题在于,它是个定性任务:一个人觉得页面很酷,另一个人可能觉得那不过是“紫色渐变的糊弄之作”。

So I chose the task of building a simple RAG pipeline, since the accuracy of the generated answers can be determined in numbers.

因此我选择搭建一个简单的 RAG 管线,因为生成答案的准确性可以用数字来衡量。

Other good ideas if you want to do a similar comparison yourself, could be training a vision model or fine-tuning an LLM, or measuring the performance of a low-level program.

如果你也想做类似对比,其他不错的任务还包括:训练一个视觉模型、微调一个 LLM,或者测量一个底层程序的性能。

Building a retrieval pipeline is a common task of an AI engineer, potentially something you'd use Claude Code or Codex in your job. I tasked both of these coding agents to build me a RAG Q&A pipeline for research papers. The workflow is simple:

构建检索管线是 AI 工程师的常见工作,你可能会在工作中用 Claude Code 或 Codex 来做这类事情。我让这两个编码代理给我搭一个面向研究论文的 RAG 问答管线,流程很简单:

  1. Take a number of papers and extract their text.
  1. 取若干论文并抽取其文本。
  1. Chunk the contents into smaller bits.
  1. 将内容切分为更小的块(chunk)。
  1. Embed each chunk into a vector space.
  1. 把每个 chunk 嵌入到向量空间。
  1. When a user asks a question, find the closest chunk embeddings to the embedding of the question.
  1. 当用户提问时,找到与问题 embedding 最接近的 chunk embeddings。
  1. Retrieve the close chunks in their original form (not their embeddings).
  1. 以原始形式检索这些相近的 chunk(而不是它们的 embeddings)。
  1. Use that context to answer the user's question.
  1. 用这些上下文来回答用户的问题。

This is a task simple enough to be implemented in one session, but it has intricate details that massively influence the output: - what chunking strategy to use - how to embed the chunks - what vector storage to go for - how to handle the confidence of which chunk is closer to the query - whether to rephrase the user's query to help find more similar chunks, etc.

这个任务简单到可以在一次会话里实现,但其中的细节会极大影响输出,比如: - 用什么分块策略; - 如何对分块做 embedding; - 选择哪种向量存储; - 如何处理“哪个分块更接近查询”的置信度; - 是否要改写用户查询以便找到更多相似分块,等等。

The Experiment Setup

实验设置

I took 5 research papers from the @huggingface daily papers of the past week, and created a test dataset (size = 100) of questions and ground truth answers, which I would later use for testing how good the implementation of Claude or Codex is.

我从过去一周 @huggingface 的 daily papers 里选了 5 篇研究论文,并创建了一个测试数据集(size = 100),包含问题与对应的标准答案(ground truth),用于之后评估 Claude 或 Codex 的实现质量。

For both coding agents, I specified the following:

对两个编码代理,我都提出了以下要求:

  • Build a Python RAG pipeline
  • 构建一个 Python RAG 管线
  • Process all PDFs using PyMuPDF
  • PyMuPDF 处理所有 PDF
  • Choose a good chunking strategy for this use case
  • 为该用例选择一个好的分块策略
  • Create embeddings and a persistent local vector index (your choice)
  • 创建 embeddings,并建立一个可持久化的本地向量索引(由你选择)
  • generate final answers with **llama-3.1-8b-instant**.
  • llama-3.1-8b-instant 生成最终答案
  • If no sufficient evidence is found, do not hallucinate. return a fallback response
  • 如果找不到足够证据,不要胡编;返回一个兜底回答

For both Codex and Claude Code, I used the best most popular and default available models: gpt-5.3-codex and Opus 4.6, both with High effort (the degree of reasoning). None had an AGENTS.md.

对于 Codex 和 Claude Code,我都使用了各自最强、最流行、且默认可用的模型:gpt-5.3-codexOpus 4.6,并都设置为 High effort(推理强度)。两者都没有配置 AGENTS.md。

How They Implemented The Pipeline

它们如何实现这条管线

I didn't notice any noticeable difference in how each agent thinks about the task, other than the fact that Codex is more verbose in explaining its plan and what it's going to do. Claude simply writes the files and executes the commands without talking so much about it.

我并没有观察到它们在“如何思考任务”方面有什么明显差异,除了 Codex 在解释它的计划和接下来要做什么时更啰嗦。Claude 则更像是直接写文件、执行命令,不太多说。

Codex also took longer to finish the task compared to Claude.

Codex 完成任务所花的时间也比 Claude 更长。

More importantly, Claude tested the script end-to-end and made sure the pipeline is ready to use.

更重要的是,Claude 会把脚本端到端跑通,确保管线可以直接使用。

Codex, on the other hand, finished the implementation but didn't test or run the program, and instructed me to pip install the requirements and run the script. Naturally, I hit an error in running the script, which Codex fixed. Claude's script worked with no problems whatsoever.

而 Codex 完成实现后并没有测试或运行程序,只是让我去 pip install 依赖并运行脚本。结果我自然跑出了一个错误,之后 Codex 才修复它。Claude 的脚本则从一开始就毫无问题。

I've noticed this pattern with Codex, that it leaves many of the labor or setups for you to do rather than simply doing it itself.

我在 Codex 身上也常见到这种模式:它会把不少劳动或环境设置留给你去做,而不是自己顺手就做完。

While Codex would let you know and take action for an env problem or implementation difficulty, Claude takes the liberty of fixing it, which depending on your preference, can be a good/bad thing.

当然,Codex 会告诉你哪里有环境问题或实现难点,并采取行动;但 Claude 往往会直接“自作主张”把它修了——这取决于你的偏好,可能是好事,也可能是坏事。

I've also noticed that the initial time-to-response for the first token in a new session for Codex can go as high as a minute, while this is much shorter for Claude Code.

我还注意到,Codex 在新会话里输出第一个 token 的初始响应时间,有时会高达一分钟;而 Claude Code 这方面短得多。

Claude Code vs. Codex Implementation

Claude Code vs. Codex:实现对比

Both coding agents went for surprisingly similar approaches:

两位编码代理采用的思路惊人地相似:

  • they both went for the same all-MiniLM-L6-v2 as the embedding model
  • embedding 模型都选了同一个 all-MiniLM-L6-v2
  • they selected k=5 for the Top-K retrieval
  • Top-K 检索都选了 k=5
  • both restricted the LLM in the system prompt to only use the provided context
  • 都在系统提示词中限制 LLM 只能使用提供的上下文

This is where they went with separate approaches:

它们在这些地方走了不同路线:

  • Vector Storage: Claude Code chose ChromaDB for the vector DB, and Codex went for FAISS, which is a lower-level similarity search library, more memory-efficient and faster.
  • 向量存储: Claude Code 选择了 ChromaDB 作为向量数据库;Codex 则选择了 FAISS,这是一个更底层的相似度检索库,更省内存也更快。
  • Chunking: Claude Code went for recursive character splitting. It tried \n\n first, then \n, then ".", then " ". The target is 1000 chars with 200-char overlap. Codex went for sentence-level word splitting, filling chunks up to 220 words with a 40-word overlap. Claude Code splits by structure (paragraphs → lines → sentences → words) and measures in characters; Codex splits by sentences first, then packs them into word-budget bins. Codex's approach respects sentence boundaries and avoids mid-sentence cuts, but a 220-word budget may be too small for this context (academic text).
  • 分块(Chunking): Claude Code 使用递归字符切分:先尝试 \n\n,再 \n,再 ".",再 " ";目标是 1000 字符、200 字符重叠。Codex 使用句子级的词数切分,把 chunk 填充到最多 220 个词,并采用 40 词重叠。Claude Code 按结构(段落 → 行 → 句子 → 词)逐级拆分,并以字符计量;Codex 先按句子切,再把句子打包进“词数预算”的桶里。Codex 的做法尊重句子边界,避免把句子从中间切开;但 220 词对学术文本来说可能偏小。
  • Retrieval: Both chose Top-5 chunks. Claude Code returns raw L2 distances and Codex returns inner-product (cosine) scores.
  • 检索: 两者都选 Top-5 chunk。Claude Code 返回原始的 L2 距离;Codex 返回内积(余弦)分数。
  • Confidence: Claude Code used a single threshold on the best L2 distance (>1.2 = irrelevant), then checks the average distance to distinguish weakly grounded from well-grounded chunks. Codex uses multiple criteria with three tiers: strong, moderate, and insufficient.
  • 置信度: Claude Code 对最佳 L2 距离使用单一阈值(>1.2 = 不相关),随后再用平均距离判断“证据薄弱” vs. “证据扎实”。Codex 用多指标、三档分级:强、中、弱(不足)。
  • Code Architecture: Claude Code: Flat functions, constants in each module, no input validation on model consistency. Codex: OOP pipeline class, centralized config, dataclasses, argparse CLI, model consistency validation. Codex is clearly better engineered and more configurable. In large and more serious codebases, this is critical.
  • 代码架构:
    Claude Code: 扁平函数式结构、每个模块各自放常量、不校验模型一致性的输入。
    Codex: 面向对象的管线类、集中式配置、dataclasses、argparse CLI、以及模型一致性校验。
    Codex 的工程化与可配置性明显更强;在更大、更严肃的代码库里,这一点非常关键。
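As a concrete illustration of the chunking difference above, Claude's recursive character splitting can be sketched roughly like this. The function name and implementation details are assumptions for illustration, not the code Claude actually generated:

```python
# Illustrative sketch of recursive character splitting: try coarse separators
# first (paragraphs), fall back to finer ones (lines, sentences, words).
# ~1000-char chunks with a 200-char overlap, as described in the article.

def recursive_split(text, separators=("\n\n", "\n", ".", " "),
                    chunk_size=1000, overlap=200):
    """Split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator not present; try a finer one
        chunks, current = [], ""
        for part in parts:
            piece = part + sep  # approximate: re-appends the separator
            if current and len(current) + len(piece) > chunk_size:
                chunks.append(current.strip())
                # carry the tail forward as overlap between chunks
                current = current[-overlap:] if overlap > 0 else ""
            current += piece
        if current.strip():
            chunks.append(current.strip())
        return chunks
    # no separator matched at all: hard character cut
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Note how this measures in characters, not words, which is the key contrast with Codex's sentence-packing approach.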

Results

结果

Using gpt-5.4 as the LLM-as-a-judge, the answers from both pipelines are compared on four criteria: Correctness, Completeness, Relevance, Conciseness.

我用 gpt-5.4 作为“LLM 裁判”,从四个维度对两条管线的回答进行比较:正确性(Correctness)、完整性(Completeness)、相关性(Relevance)、简洁性(Conciseness)。
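The judging step presumably looks something like the sketch below. The article doesn't show the actual prompt, so the wording and the verdict format here are assumptions:

```python
# Hypothetical shape of the pairwise LLM-as-a-judge setup described above.
# The real prompt sent to gpt-5.4 is not shown in the article.

JUDGE_PROMPT = """You are an impartial judge. Given a question, a ground-truth
answer, and two candidate answers (A and B), decide which candidate is better
on four criteria: Correctness, Completeness, Relevance, Conciseness.
Reply with exactly one word: A, B, or TIE.

Question: {question}
Ground truth: {ground_truth}
Answer A: {answer_a}
Answer B: {answer_b}
"""

def tally(verdicts):
    """Aggregate per-question verdicts ('A', 'B', 'TIE') into win counts."""
    return {v: verdicts.count(v) for v in ("A", "B", "TIE")}
```

With this scheme, `tally(["A"] * 42 + ["B"] * 33 + ["TIE"] * 25)` reproduces the reported 42/33/25 split.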

Among the 100 questions, Claude Code won 42, Codex won 33, and 25 were ties. Claude won mostly due to its looser confidence gating, and maybe a slightly higher generation temperature (0.2 vs 0.1 in Codex's pipeline).

在 100 个问题中,Claude Code 赢了 42 个,Codex 赢了 33 个,25 个平局。 Claude 主要因为它的置信度门控更宽松;另外它的生成温度可能也略高一些(0.2 vs 0.1,Codex 管线为 0.1)。

A Pinch of Salt

一点“盐”

Now, this was a very simple setup, and I was mostly curious to see the different approaches the two coding agents take in implementing the same closed-ended task. In a professional setting, it's the developer who makes the calls on the overall architecture: the chunking method, the vector DB, the retrieval strategy, etc. Professional development of such systems also requires much more testing and iterative improvement, with more reliable test sets and verification.

这只是一个非常简单的设置;我主要是好奇两个编码代理在实现同一个封闭式任务时,会采取怎样不同的路径。在专业场景里,整体架构的关键决策通常由开发者自己拍板:分块方法、向量库、检索策略等。而且在专业场景中,开发此类系统需要更多测试与迭代改进,配合更可靠的测试集与验证流程。

However, it's entirely expected that a junior developer without much experience building RAG pipelines would leave these decisions to the AI.

不过,对于不太有 RAG 管线经验的初级开发者来说,把这些决策交给 AI 来做,确实很常见、也很符合预期。

Just Pick One

随便选一个就行

I don't think there is any fatally wrong decision in choosing between Claude Code and Codex. Both offer strong models relative to the existing landscape and get the job done to a similar degree.

我不认为在 Claude Code 和 Codex 之间做选择,会出现什么“根本性错误”的决定。相较于当前的整体生态,两者都提供了很强的模型,并且在相近程度上把事情做成。

Two major factors for me have been: the Anthropic ecosystem, and the $100/month pricing tier. Even if I have to bump up that tier to the $200/month pricing, I would still stick to Anthropic's Claude Code for the former reason.

对我来说,两个最重要的因素是:Anthropic 的生态,以及 $100/月的定价档位。即便我将来不得不把这个档位加到 $200/月,单就生态这一点,我也仍然会选择 Anthropic 的 Claude Code。

The most important thing is what you use these scaffolds for and how you use them.

最重要的是:你用这些脚手架做什么,以及你怎么用。

This determines which one is better for you more than any benchmark does, and there's no clearer answer than your gut telling you which one feels better after you test both.

这比任何基准测试都更能决定哪一个更适合你;而且除了你亲自测试之后的“体感”,没有更明确的答案。

There are developers like @steipete who swear by Codex, and there is a community that believes Opus is just unrivaled by OpenAI models.

有些开发者(比如 @steipete)对 Codex 深信不疑;也有一群人坚信 Opus 对 OpenAI 的模型而言几乎是无可匹敌的。

I think both can be right at the same time, simply because their workflows with these coding agents, and their "taste", are different.

我认为两者同时都可能是对的——只是因为他们使用这些编码代理的工作流不同,“口味”也不同。

If you're unsure which one to go with, I suggest trying the $20/month version of both in the kind of programming work that's relevant to you, preferably on several verifiable tasks.

如果你拿不准该选哪个,我建议你在与你相关的编程领域里,分别试用两者的 $20/月 版本,并尽量用多个可验证的任务来测试。

Finally, keep in mind that similar to anything else related to AI, the landscape changes drastically every few months. While you might like one of them now, three months later, the agent's behavior might drift, or a new model might hit the market.

最后,请记住:和 AI 相关的一切一样,这个领域每隔几个月就会剧烈变化。你现在可能更喜欢其中一个,但三个月后,代理的行为可能漂移,或者市场上又出现了新的模型。

There are very few things in AI with definitive global answers, and this subject is not one of them ;)

AI 里真正有“放之四海皆准的定论”的问题并不多,这个话题也不是其中之一 ;)

Link: http://x.com/i/article/2030946053629915136

链接: http://x.com/i/article/2030946053629915136

相关笔记

I've used Claude Code for months, then moved to Codex. I just switched back to Claude and the reason has nothing to do with benchmarks. I also tested both on the same task.

In this article I will discuss the different aspects of Claude Code and Codex, the difference between the two flagship models powering them (Opus 4.6 vs. GPT-5.3-Codex), what really changes your AI coding experience, and a small case study in which I used both for the same task of building a RAG pipeline.

Just to give you fair warning, this article takes ~12 minutes to read, and I think that's time well invested if you're going to commit to spending $200/month on either of them.

Opus 4.6 vs. GPT-5.3-Codex: Task-Completion Time Horizon

One reliable comparison of Codex vs. Claude Code looks at their underlying flagship models through the Task-Completion Time Horizon, which you can check out here.

This comparison asks: how long a task can this model reliably complete? The task-completion time horizon is the task duration (measured by human expert completion time) at which the model is predicted to succeed with a given level of reliability. So a model with a "2-hour time horizon at 50%" means: give it a task that would take a skilled human 2 hours, and the AI succeeds about half the time.

For this study, they use the appropriate scaffold for each model, including Claude Code and Codex. So while the focus is on the model, and not on the scaffold, we can get an idea of how reliable the scaffolds are as well. It tells us which one of these coding agents can handle longer, harder tasks.

As you can see in the chart, there is a BIG gap between Opus 4.6 and GPT-5.3-Codex. Opus 4.6 has a 12-hour task-completion horizon at 50% success, while for GPT-5.3-Codex the figure is 5 hours and 50 minutes. The gap narrows at 80% success.

This is a clear indication of a gap between these two models and, consequently, between Claude Code and Codex, in how well they can tackle difficult and challenging tasks. It might not translate directly to the type of tasks you use them for, so keep that in mind.
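To make the metric concrete, here is a toy model of a time horizon, assuming success probability falls off logistically in log task duration. METR's actual fitting procedure differs in its details; this only illustrates what "50% horizon" means:

```python
# Toy illustration of a task-completion time horizon (assumed
# logistic-in-log-duration model, not METR's exact fit).
import math

def success_prob(task_minutes, horizon50_minutes, slope=1.0):
    """P(success) for a task of given length; exactly 0.5 at the 50% horizon."""
    x = slope * (math.log(horizon50_minutes) - math.log(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))
```

Under this model, a 12-hour-horizon model (720 min) beats a ~5h50m-horizon model (350 min) at any fixed task length, and both converge toward certainty on very short tasks, which is consistent with the gap narrowing at the 80% reliability level.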

Claude Code is Faster, but Speed Doesn't Matter That Much

Claude is famously faster than Codex, but working with coding agents is a long-term process.

If an agent finishes the task in half the time but then requires you to spend 10 minutes debugging the damn thing, the agent that spends more time on the implementation and doesn't require you to babysit it afterwards is the better deal; that extra time is 100% worth it.

This is NOT to say that Claude Code or Codex makes more mistakes, but a general idea to keep in the back of your mind when evaluating the agents yourself or hearing people flex their agent's coding speed.

The Task Matters For Agents

Codex and Claude Code perform differently depending on the coding task. In an AI engineering task one might outperform the other, while in a web development task that same agent might be obliterated.

Which coding tasks are better for Codex or Claude Code? This is not studied well.

For example, it's not clear which one to use in low-level programming. Ideally, you'd test both in a simple and verifiable setup before going all-in. But spending $300-$400 for both is not feasible for most people.

It's an interesting area of research to fully review both agents in a variety of coding tasks, but it's also not trivial since these agents and the model powering them change drastically every few months.

How Each Came to Exist

Claude Code initially started as a side project by @bcherny at Anthropic, who built a terminal prototype that could interact with the Claude API, read files, and run some bash commands.

Half the internal team started using it by day five. Then Claude Code was released as a research preview on February 24, 2025, using Claude 3.7 Sonnet. It took some time to be mass-adopted by developers, and over time, Anthropic released a VS Code extension for it as well.

OpenAI on the other hand, announced the original Codex model as a 12B GPT-3 model fine-tuned on GitHub code, which eventually powered the first version of GitHub Copilot. The new Codex is an entirely new product though.

Codex CLI launched first on April 16, 2025, as a terminal agent, and has evolved with better models ever since. The latest GPT-5.3-Codex (February 5, 2026) is described by OpenAI as "the first model that helped create itself."

https://newsletter.pragmaticengineer.com/p/how-codex-is-built

@GergelyOrosz has two very interesting interviews with the developers of Claude Code and Codex, about their tech stack, how they develop them, and also how each one started initially. You can learn a lot from these two interviews.

👉 How Codex is Built

Tech stacks and Powering Models

Claude Code is written in TypeScript, using React with Ink for terminal UI rendering. It ships as a single Bun executable (Anthropic acquired Bun in December 2025 for this reason). The Opus and Sonnet models it uses also support a 1M-token context window.

The Codex CLI is written in Rust, for its performance, correctness, and portability. OpenAI even hired the maintainer of Ratatui (a Rust TUI library) for the team.

Both CLI tools are thin wrappers around the models they use through the API. I've noticed some small "glitches" when working with the Claude Code CLI that I didn't really notice with Codex, which might be expected given their tech stacks.

However, these glitches are nothing more than mildly annoying things; they really don't affect your coding experience.

Benchmarks are Close, But with Nuances: Token Economics

The biggest performance difference isn't accuracy, but token efficiency. A comprehensive review of Opus vs. Codex by Morph shows an interesting gap.

https://www.morphllm.com/best-ai-model-for-coding

Claude Code uses 3.2–4.2x more tokens than Codex on identical tasks. On a Figma plugin build, Codex consumed 1.5M tokens compared to Claude's 6.2M.

If this is true, it means that for the same subscription money, you're more likely to hit token limits with Claude Code.

The Feeling Matters the Most

Claude feels like a senior developer doing work for you, and Codex is a contractor you hand off tasks to and then come back to pick up the results.

This is the common way developers describe the difference.

Claude Code reportedly has a strong interactive feel, along with the deep reasoning quality you'd expect of Opus. It asks you questions, shows you its reasoning, and explains its approach. Even though this didn't show up in my single comparison experiment, I can confirm it from many months of using Claude.

Codex is famous for its first-attempt accuracy on straightforward tasks, which comes at the cost of a slight decrease in implementation speed.

With all that said, the difference in behavior really diminishes once you lay out specifically what you want in the AGENTS.md. If you specify that the model must check the implementation plan with you before going off guns blazing, it will do that, regardless of whether you use the "senior developer" agent or the "contractor" agent.

This isn't to say the agents aren't actually different. THEY ARE.

But not as exaggerated as you commonly hear on X.

Quick Numbers

On VS Code Marketplace, Claude Code has 6.1M installs with a 4/5 rating, while Codex has 5.4M installs with a 3.5/5 rating.

On GitHub, Claude Code has approximately 65–72K stars and Codex has ~64K stars.

Why I'm Moving Back to Claude Code for Now

Anthropic's Ecosystem Pulls Hard

Choosing whether to go for Codex or Claude Code isn't just about coding. A subscription to either is a subscription to the whole Anthropic or OpenAI ecosystem, and this is something you might want to consider.

https://metr.org/time-horizons/

I personally believe that Claude is becoming a very hot ecosystem similar to Apple, now with Claude Cowork, the Claude Chat, and the Claude Code. It seems Anthropic is also slowly building a safer and tamer version of OpenClaw (your proactive personal agent) with the Claude app, and the small bits and pieces for it are being rolled out gradually.

On OpenAI's front, I'm not seeing anything enticing at the moment. Aside from Codex, everything else seems dull. I don't feel an ecosystem, but fragmented bits and pieces with better alternatives out there.

I've already been using Claude Chat rather than ChatGPT; for me, ChatGPT is borderline unusable at this point compared to Opus. The UI, the tone of the chat, the model selection: none of it really encourages me to use ChatGPT.

So, given that I'm already using Claude Chat frequently, I'm planning to tinker with Cowork, and I don't see any deal-breaking improvement in a Claude Code → Codex migration at the moment, going back to Claude Code and cutting the $200/month subscription out of my pocket seemed like an easy choice to make.

This has become a major factor for me, and one that drastically influenced my decision to move back to Claude.

Pricing

The pricing for both Claude Code and Codex is basically the same:

Entry: $20/month for both

Power User: Claude Code has a Max 5x priced at $100/month

Heavy User: $200/month for both

Where Claude Code really shines is that it offers a $100/month mid-tier rather than a crazy jump from a $20 to a $200 subscription, and I believe the Max 5x plan ($100/month) is adequate for most developers.

So in a way, you could say Claude Code is cheaper in practice, because it allows you to select a cheaper plan that works for you rather than forcing you to climb the pricing ladder.

Skills and Plugins: The Developer Ecosystem

As skills are compatible between Claude Code and Codex, you won't notice a difference regardless of which one you use. However, most skill hubs and repos are named after Claude Code, which might be a little confusing.

This is the case with most other things as well. Many of the posts you see on Reddit, X, or blog posts about coding agents are about Claude Code rather than Codex, even though the same principles apply to both of them, which really tells you something about the popularity and community size.

Codex launched support for both skills and plugins much later than Claude Code. Plugins aren't as portable as skills, and since plugin support in Codex started only recently, there aren't many available.

All this to say that many developers, including me, don't use plugins at all. So unless you specifically need the support for various plugins, this is not something to worry about or base your decision on.

RAG Pipeline: A Case Study

For the comparison, I chose a task that can be quantitatively assessed. The problem with creating a landing page, for example, is that it's a qualitative task: one person might think a landing page looks cool while another calls it purple-gradient slop.

So I chose the task of building a simple RAG pipeline, since the accuracy of the generated answers can be determined in numbers.

Other good ideas, if you want to do a similar comparison yourself, could be training a vision model, fine-tuning an LLM, or measuring the performance of a low-level program.

Building a retrieval pipeline is a common task for an AI engineer, potentially something you'd use Claude Code or Codex for in your job. I tasked both coding agents with building me a RAG Q&A pipeline for research papers. The workflow is simple:

  1. Take a number of papers and extract their text.

  2. Chunk the contents into smaller bits.

  3. Embed each chunk into a vector space.

  4. When a user asks a question, find the closest chunk embeddings to the embedding of the question.

  5. Retrieve the close chunks in their original form (not their embeddings).

  6. Use that context to answer the user's question.

This is a task simple enough to be implemented in one session, but it has intricate details that massively influence the output:

  • what chunking strategy to use

  • how to embed the chunks

  • what vector storage to go for

  • how to handle the confidence of which chunk is closer to the query

  • whether to rephrase the user's query to help find more similar chunks, etc.
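The six steps above can be sketched end-to-end with a toy hashing "embedder" standing in for a real model such as all-MiniLM-L6-v2. Everything here is illustrative and not either agent's output:

```python
# Toy retrieval core: embed chunks and a query, then return the Top-K
# most similar chunks by cosine similarity. A real pipeline would use a
# sentence-transformer model and a vector store instead of this hash trick.
import math
from collections import Counter

DIM = 64  # toy embedding dimension

def embed(text):
    """Toy embedding: hash words into a fixed-size, L2-normalized count vector."""
    vec = [0.0] * DIM
    for word, count in Counter(text.lower().split()).items():
        vec[hash(word) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_k(query, chunks, k=5):
    """Steps 4-5: rank chunks by cosine similarity to the query embedding."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(c))), c) for c in chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:k]]
```

The retrieved chunks (step 5) would then be pasted into the LLM's context to answer the question (step 6).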

The Experiment Setup

I took 5 research papers from the @huggingface daily papers of the past week, and created a test dataset (size = 100) of questions and ground truth answers, which I would later use for testing how good the implementation of Claude or Codex is.

For both coding agents, I specified the following:

  • Build a Python RAG pipeline

  • Process all PDFs using PyMuPDF

  • Choose a good chunking strategy for this use case

  • Create embeddings and a persistent local vector index (your choice)

  • Generate final answers with llama-3.1-8b-instant

  • If no sufficient evidence is found, do not hallucinate; return a fallback response
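The last requirement, refusing instead of hallucinating, might be gated roughly like this. The threshold, names, and fallback wording are all assumptions, not what either agent produced:

```python
# Hedged sketch of confidence gating before generation: skip the LLM call
# entirely when retrieval scores are too weak to ground an answer.

SYSTEM_PROMPT = ("Answer using ONLY the provided context. "
                 "If the context does not contain the answer, say so.")
FALLBACK = ("I could not find sufficient evidence in the indexed papers "
            "to answer this question.")

def answer_or_fallback(question, retrieved, min_score=0.35):
    """`retrieved` is a list of (similarity, chunk_text) pairs."""
    strong = [(s, c) for s, c in retrieved if s >= min_score]
    if not strong:
        return FALLBACK
    context = "\n\n".join(c for _, c in strong)
    # a real pipeline would now call llama-3.1-8b-instant with
    # SYSTEM_PROMPT, `context`, and `question`; placeholder below
    return "ANSWER_FROM_LLM(%s, %d chunks)" % (question, len(strong))
```

Both agents implemented a variant of this idea, with Claude using a single L2-distance threshold and Codex a three-tier scheme.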

For both Codex and Claude Code, I used the most popular default models available: gpt-5.3-codex and Opus 4.6, both with High effort (the degree of reasoning). Neither had an AGENTS.md.

How They Implemented The Pipeline

I didn't notice any meaningful difference in how each agent thinks about the task, other than that Codex is more verbose in explaining its plan and what it's going to do. Claude simply writes the files and executes the commands without much commentary.

Codex also took longer to finish the task compared to Claude.

More importantly, Claude tested the script end-to-end and made sure the pipeline was ready to use.

Codex, on the other hand, finished the implementation but didn't test or run the program, instructing me to pip install the requirements and run the script myself. Naturally, I hit an error when running it, which Codex then fixed. Claude's script worked with no problems whatsoever.

I've noticed this pattern with Codex: it leaves much of the labor and setup for you to do rather than simply doing it itself.

While Codex will flag an environment problem or implementation difficulty and then act on it, Claude takes the liberty of fixing it outright, which, depending on your preference, can be a good or a bad thing.

I've also noticed that the time to first token in a new session with Codex can be as long as a minute, while it's much shorter for Claude Code.

Claude Code vs. Codex Implementation

Both coding agents went for surprisingly similar approaches:

  • they both went for the same all-MiniLM-L6-v2 as the embedding model

  • they selected k=5 for the Top-K retrieval

  • both restricted the LLM in the system prompt to only use the provided context

This is where they went with separate approaches:

  • Vector Storage: Claude Code chose ChromaDB for the vector DB, and Codex went for FAISS, which is a lower-level similarity search library, more memory-efficient and faster.

  • Chunking: Claude Code went for recursive character splitting. It tried \n\n first, then \n, then ".", then " ". The target is 1000 chars with 200-char overlap. Codex went for sentence-level word splitting, filling chunks up to 220 words with a 40-word overlap. Claude Code splits by structure (paragraphs → lines → sentences → words) and measures in characters; Codex splits by sentences first, then packs them into word-budget bins. Codex's approach respects sentence boundaries and avoids mid-sentence cuts, but a 220-word budget may be too small for this context (academic text).

  • Retrieval: Both chose Top-5 chunks. Claude Code returns raw L2 distances and Codex returns inner-product (cosine) scores.

  • Confidence: Claude Code used a single threshold on the best L2 distance (>1.2 = irrelevant), then checks the average distance to distinguish weakly grounded from well-grounded chunks. Codex uses multiple criteria with three tiers: strong, moderate, and insufficient.

  • Code Architecture: Claude Code: Flat functions, constants in each module, no input validation on model consistency. Codex: OOP pipeline class, centralized config, dataclasses, argparse CLI, model consistency validation. Codex is clearly better engineered and more configurable. In large and more serious codebases, this is critical.
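For contrast with Claude's character-based splitter, Codex's sentence-level packing described above can be sketched as follows. The sentence regex and function shape are assumptions, not Codex's actual code:

```python
# Sketch of sentence-level chunking: split text into sentences, then pack
# them into ~220-word chunks with a 40-word overlap, so chunks never cut
# a sentence in the middle (except inside the carried-over overlap).
import re

def sentence_chunks(text, max_words=220, overlap_words=40):
    # naive sentence splitter: break after ., !, or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []  # `current` is a list of words
    for sent in sentences:
        words = sent.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_words:]  # word-level overlap
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Measuring in words rather than characters is the structural difference from the recursive splitter, and it is why the author worries a 220-word budget may be tight for academic text.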

Results

Using gpt-5.4 as the LLM-as-a-judge, the answers from both pipelines are compared on four criteria: Correctness, Completeness, Relevance, Conciseness.

Among the 100 questions, Claude Code won 42, Codex won 33, and 25 were ties. Claude won mostly due to its looser confidence gating, and maybe a slightly higher generation temperature (0.2 vs 0.1 in Codex's pipeline).

A Pinch of Salt

Now, this was a very simple setup, and I was mostly curious to see the different approaches the two coding agents take in implementing the same closed-ended task. In a professional setting, it's the developer who makes the calls on the overall architecture: the chunking method, the vector DB, the retrieval strategy, etc. Professional development of such systems also requires much more testing and iterative improvement, with more reliable test sets and verification.

However, it's entirely expected that a junior developer without much experience building RAG pipelines would leave these decisions to the AI.

Just Pick One

I don't think there is any fatally wrong decision in choosing between Claude Code and Codex. Both offer strong models relative to the existing landscape and get the job done to a similar degree.

Two major factors for me have been: the Anthropic ecosystem, and the $100/month pricing tier. Even if I have to bump up that tier to the $200/month pricing, I would still stick to Anthropic's Claude Code for the former reason.

The most important thing is what you use these scaffolds for and how you use them.

This determines which one is better for you more than any benchmark does, and there's no clearer answer than your gut telling you which one feels better after you test both.

There are developers like @steipete who swear by Codex, and there is a community that believes Opus is just unrivaled by OpenAI models.

I think both can be right at the same time, simply because their workflows with these coding agents, and their "taste", are different.

If you're unsure which one to go with, I suggest trying the $20/month version of both in the kind of programming work that's relevant to you, preferably on several verifiable tasks.

Finally, keep in mind that similar to anything else related to AI, the landscape changes drastically every few months. While you might like one of them now, three months later, the agent's behavior might drift, or a new model might hit the market.

There are very few things in AI with definitive global answers, and this subject is not one of them ;)

Link: http://x.com/i/article/2030946053629915136

📋 讨论归档

讨论进行中…