
以推理速度交付

当 AI 编程代理能力突破临界点后,个人开发者的生产力瓶颈从"写代码"转移到"系统架构决策"和"推理时间";但这套极速工作流本质上只适用于独立开发者,对团队协作、安全合规、长期维护等现实约束视而不见。

2026-03-13

核心观点

  • 代码阅读权的彻底让渡 作者已经几乎不读代码,只看输出流和系统结构,核心竞争力完全转移到"技术选型(语言/生态/依赖)"和"系统设计(数据流向)"。这打破了"必须亲自写代码才能把控质量"的传统偏见,但也隐含了一个风险:当 AI 出错时,缺乏代码心智模型的开发者将完全受制于 AI 的调试能力。
  • "为模型设计代码库"是新范式 作者明确指出不再为人类浏览优化代码库,而是为 AI 高效工作而设计——统一的目录结构、显眼的文档、跨项目复用模式。这部分洞察确实捕捉到未来工程的真实趋势,但前提是你的所有资产都已经"AI 可消费化",这对大多数现存企业代码库是巨大的重构成本。
  • 单体强模型 > 复杂多代理编排 作者放弃了多代理系统、复杂任务编排、issue tracker,理由是真正的瓶颈在人类决策,不在工具编排。这个观点对"当前模型能力"成立,但他同时承认这套流程"在更大的团队里显然行不通"——这个限定条件被轻描淡写地掩盖了。
  • 工作流的极端个人化掩盖了系统工程的真实复杂性 直接提交 main、不读代码、不用回滚、同时推进 8 个项目、依赖 OpenAI 生态的多个高级功能(Pro、skills、/compact、web_search)。这套方案在"solo hacker"场景下确实高效,但对需要代码审查、安全审计、多人交接、生产系统的团队完全不适用,文中几乎没有对这些约束的实质性讨论。
  • 隐性成本与资源门槛被"交付速度"的光环掩盖 让 GPT-5.2 Pro/codex 连续运行数小时、使用多台 Mac 并发执行、大量 API Token 消耗。这种工作流背后的成本(金钱、算力、试错)被忽略,只强调了"快"这一维度,属于典型的"只报喜不报忧"。

跟我们的关联

  • 对 ATou 意味着什么 如果你是独立开发者或小团队,这套工作流可以直接参考:从 CLI 开始、为模型设计代码库结构、用文档驱动上下文、跨项目复用模式。但如果你在企业或需要多人协作,这篇文章的大部分建议会制造混乱——你需要的是"如何在团队约束下用 AI",而不是"如何最大化个人产能"。下一步:先评估自己的工作环境是否真的是"solo",再决定照搬哪些部分。
  • 对 Neta 意味着什么 这篇文章是 OpenAI 生态的深度营销,虽然没有直接推销产品,但密集植入了个人项目、工具链、社交账号。它展现了"GPT-5.2 + codex + Pro + skills"这套栈的真实能力上限,但同时也暴露了锁定风险——一旦 OpenAI 改价格、改 API、改能力,这套工作流的整个基础就会动摇。下一步:如果你要投资 AI 工具链,要考虑多模型支持和本地备选方案。
  • 对 Uota 意味着什么 文中对"大多数软件不需要深度思考"的论断,在 Uota 的语境里需要打个大问号。真正难的往往是业务规则、组织约束、跨团队接口、法律合规,而不是单个代码文件。作者的"快速交付"在这些维度上可能是"快速踩坑"。下一步:如果要用 AI 加速开发,先确保你的业务逻辑、合规要求、团队流程已经清晰,再让 AI 去执行;不要指望 AI 能替你做架构决策。
  • 对通用开发者意味着什么 这篇文章最有价值的部分是"为模型设计资产"的思路——无论你用什么模型、什么工作流,把代码库、文档、示例都设计成"AI 可消费"的形式,都会大幅提升协作效率。但不要被"以推理速度交付"的光环迷惑,以为这就是未来的全部——真正的未来是"人类做决策,AI 做执行",而不是"AI 做一切,人类只看输出"。下一步:开始审视自己的代码库、文档、工程流程,看哪些地方可以优化成"AI 友好"的形式。

讨论引子

1. 当开发者不再需要读代码、只依赖 AI 的调试能力时,如何防止系统性的技术债和隐藏的安全漏洞?特别是在生产环境或涉及用户数据的场景下,这种"信任 AI"的工作流是否真的可持续?

2. 文中的工作流高度依赖 OpenAI 的生态(GPT-5.2、codex、Pro、skills、/compact)。如果 OpenAI 改价格、改能力或出现服务中断,这套"以推理速度交付"的模式会如何崩塌?我们应该如何为多模型、多供应商的未来做准备?

3. 作者把"大多数软件不需要深度思考"作为前提,但这个判断是基于他自己做的 CLI、小工具、个人自动化项目。对于需要复杂业务建模、多人协作、长期维护的企业系统,这个前提还成立吗?我们如何区分"真的不需要深度思考"和"我只是没看到复杂性"?


自五月以来的变化

今年“vibe coding”的进步之大,令人难以置信。大约在五月时,我还会惊讶于某些提示词竟能生成开箱即用的代码;如今,这已经成了我的默认预期。我现在交付代码的速度快到不真实。自那以后我烧掉了大量 token。是时候更新一下了。

这些代理的工作方式挺有意思。几周前有人争论说,必须亲自写代码才能“感受到”糟糕的架构,而使用代理会让人产生隔阂——我完全不同意。只要和代理相处得够久,你就会非常清楚某件事应该花多久;当 codex 回来却没能一次搞定时,我立刻就会起疑。

我如今能产出的软件量,主要受限于推理时间和真正的深度思考。而说实话——大多数软件根本不需要深度思考。多数应用只是把数据从一个表单搬到另一个表单,可能存一下,然后以某种方式展示给用户。最简单的呈现就是文本,所以默认情况下,不管我想做什么,都会先从 CLI 开始。代理可以直接调用它并验证输出——形成闭环。
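"CLI 优先、让代理闭环验证输出"这一思路可以用一个极简草图来表达。下面是一个假想的 Go 示例(统计功能纯属示意,不对应原文中的任何具体工具):核心逻辑是纯函数,输出格式确定,代理可以直接执行二进制、比对 stdout 来验证结果。

```go
// 最小的"CLI 优先"骨架:核心逻辑是纯函数,输出确定,
// 代理调用二进制后可以直接比对 stdout,闭环验证。
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// Summarize 统计输入的总行数与非空行数,输出格式固定,便于机器校验。
func Summarize(input string) string {
	lines := strings.Split(strings.TrimRight(input, "\n"), "\n")
	nonEmpty := 0
	for _, l := range lines {
		if strings.TrimSpace(l) != "" {
			nonEmpty++
		}
	}
	return fmt.Sprintf("lines=%d non_empty=%d", len(lines), nonEmpty)
}

func main() {
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(Summarize(string(data)))
}
```

因为所有逻辑都在 `Summarize` 这个纯函数里,代理既可以跑二进制验证,也可以直接针对函数写测试。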

模型的转变

真正让我实现像工厂一样造东西的,是 GPT 5。发布后我过了几周才看清这一点——一方面要等 codex 把 Claude Code 已经有的功能补齐,另一方面也需要一点时间学习并理解差异;但随后我对模型的信任与日俱增。如今我几乎不怎么读代码了。我会看它的输出流,有时扫一眼关键部分,但老实说——大多数代码我不读。我知道各个组件放在哪、结构如何组织、整体系统如何设计,这通常就够了。

现在真正重要的决定是语言/生态和依赖选型。我常用的语言是:做 Web 用 TypeScript;做 CLI 用 Go;如果需要用到 macOS 能力或做 UI,就用 Swift。几个月前我甚至没怎么考虑过 Go,但后来折腾了一阵发现,代理写 Go 特别在行,而它简洁的类型系统也让 lint 很快。

做 Mac 或 iOS 的朋友:你现在几乎不需要 Xcode 了。我甚至不用 xcodeproj 文件。如今 Swift 的构建基础设施对大多数事情都足够。codex 知道怎么运行 iOS App,也知道怎么和 Simulator 打交道。不需要任何特殊配置或 MCP。

codex 对比 Opus

我写这篇文章的时候,codex 正在啃一个庞大的、要跑好几个小时的重构,把 Opus 4.0 早期留下的烂摊子清理干净。Twitter 上经常有人问我:Opus 和 codex 的差别到底在哪?基准测试那么接近,这差别有什么意义?在我看来,基准测试越来越不值得信任——你得两个都用过,才能真正理解。无论 OpenAI 在后训练里做了什么,codex 显然被训练成:在动手写代码之前先读大量代码。

有时它会默默读文件 10、15 分钟,才开始写任何代码。一方面挺烦,另一方面也很惊艳——因为这大大提高了它修到“正确地方”的概率。相反,Opus 更急着动手:做小改动很棒;但做大功能或重构就不太行——它经常不把整份文件读完,或漏掉部分内容,结果要么实现得低效,要么干脆漏掉点什么。我发现即便在可比任务上 codex 有时比 Opus 慢 4 倍,我最终往往更快,因为我不需要回头去“修修复”,而这在我还用 Claude Code 的时候几乎是家常便饭。

codex 也让我把很多在 Claude Code 时代不得不演的"把戏"彻底戒掉了。不再需要所谓的 "plan mode",我只是和模型开始一段对话,问个问题,让它去 Google、探索代码、一起制定计划;当我对它呈现的东西满意时,我就输入 "build",或者 "write plan to docs/*.md and build this"。Plan mode 更像是为早期那代不太会遵循提示词的模型打的补丁——于是我们不得不拿走它们的编辑工具。我的一条经常被误解的推文至今还在被转来转去,也让我意识到:大多数人并不明白,plan mode 并不是什么魔法。

Oracle

从 GPT 5/5.1 到 5.2 的跨越非常巨大。大约一个月前我做了 oracle 🧿——这是一个 CLI,让代理可以跑 GPT 5 Pro,上传文件加提示词,并管理会话,方便之后再取回答案。我之所以做它,是因为很多时候代理卡住了,我会让它把所有东西写进一个 Markdown 文件,然后自己去查询;这既重复又浪费时间——也正好是一个把闭环补上的机会。使用说明在我的全局 AGENTS.MD 里,模型有时在卡住时还会自己触发 oracle。我每天会用它好几次。它是个巨大的解锁。Pro 在"冲刺式"扫完约 50 个网站后再认真思考这件事上强得离谱,几乎每次都能正中要害。有时它很快,10 分钟就完事;但我也遇到过跑了超过一小时的情况。

现在 GPT 5.2 出来了,我需要它的场景少了很多。我自己有时仍会用 Pro 做研究,但我让模型去“ask the oracle”的次数,从每天多次降到了每周几次。我并不为此懊恼——做 oracle 特别有趣,我学到了不少浏览器自动化、Windows 相关的东西,也终于抽时间研究了 skills(此前我很长一段时间都把这个想法否掉了)。这恰恰说明:5.2 在许多真实世界的编程任务上强了多少。几乎我扔给它的任何东西,它都能一发命中。

另一个巨大的优势是知识截止日期。GPT 5.2 的知识更新到 8 月底,而 Opus 停在 3 月中旬——差不多差了 5 个月。想用最新工具时,这一点非常关键。

一个具体例子:VibeTunnel

再给你一个例子,看看模型到底进步了多少。我早期一个投入很深的项目是 VibeTunnel。它是一个终端复用器(terminal multiplexer),让你在路上也能写代码。今年年初我几乎把所有时间都砸在它上面,两个月后它好用到我在和朋友外出时都忍不住用手机写代码……于是我决定该停一停——更多是为了心理健康。那会儿我想把复用器的一段核心逻辑从 TypeScript 重写出去,老一代模型一直帮不上忙。我试过 Rust、Go……甚至(老天保佑)zig。当然我也可以把这个重构做完,但它需要大量手工活,所以在我把项目封存之前一直没真正完成。上周我把它从灰尘里翻出来,只给 codex 两句话提示,让它把整个转发系统改成 zig,它跑了 5 个多小时、经历了多次 compaction,最后一次就交付了可用的转换结果。

你可能会问:我为什么又把它翻出来?我目前的重心是 Clawdis,一个 AI 助手,可以完全访问我所有电脑上的一切:消息、邮件、家庭自动化、摄像头、灯光、音乐,见鬼,它甚至能控制我床的温度。当然,它也有自己的声音、发推用的 CLI,以及自己的 clawd.bot。

Clawd 能看到并控制我的屏幕,有时还会阴阳怪气几句;但我还想给他一个能力:去查看我的代理的进度。相比看图片,直接拿到字符流要高效得多……这事能不能成,走着瞧!

我的工作流

我知道……你是来学怎么更快做东西的,而我却像是在给 OpenAI 写营销软文。我也希望 Anthropic 正在憋 Opus 5,然后风向再变一次。竞争是好事!同时,我也很喜欢把 Opus 当作通用模型。我的 AI 代理要是跑在 GPT 5 上,乐趣至少少一半。Opus 有种特别的东西,让它用起来很愉悦。我把它用于大多数电脑自动化任务,当然 Clawd🦞 也是它驱动的。

和我十月那次总结相比,我的工作流并没有太大变化。

我通常会同时做多个项目。视复杂度而定,可能是 3–8 个。频繁切换上下文会很累,我基本只能在家、安静且专注时才做得到。要在脑子里来回切换一堆心智模型。好在大多数软件都很无聊。比如做一个 CLI 来查看外卖配送进度,并不需要太多思考。通常我的注意力会放在一个大项目上,旁边再有一些“卫星项目”慢慢推进。做得足够多的 agentic engineering 之后,你会形成一种直觉:什么会很容易、模型大概率会在哪些地方卡壳。所以很多时候我只要丢一个提示词,codex 跑上 30 分钟,我就拿到了需要的东西。有时需要一点点折腾或创意,但多数情况都很直接。

我大量使用 codex 的排队功能——有了新想法就塞进流水线。我看到很多人在尝试各种多代理编排、邮件驱动或自动化任务管理系统——但到目前为止,我并不觉得很有必要——通常瓶颈在我。我的软件构建方式非常迭代:做点东西、玩一玩、看看“手感”如何,然后冒出新想法再打磨它。我很少在脑子里一开始就有完整的成品图景。确实会有个粗略方向,但在探索问题域时它经常会发生巨大变化。所以那种把“完整想法”当输入、然后直接吐出结果的系统对我并不适用。我需要去玩它、摸它、感受它、看见它——我就是这样把它演化出来的。

我基本从不回滚或用 checkpoint。如果某个东西不是我想要的样子,我就让模型改。codex 有时会把文件重置,但更多时候它会直接撤销或修改已有改动;很少需要我彻底退回,而是我们换个方向继续走。做软件就像爬山:你不会笔直往上冲,而是绕着山转、不断拐弯;有时走偏了需要退回来一点,过程并不完美,但最终你会到达你要去的地方。

我就直接往 main 提交。有时 codex 会觉得太乱了,自动建一个 worktree 再把改动合并回来,但这种情况很少,我也只在极少数场景下会主动提示它这么做。我觉得为了在项目里维护多个状态而增加的认知负担没必要,更喜欢线性演进。大任务我会留到自己分心的时候做——比如我写这篇文章时,同时在 4 个项目上跑重构,每个大约要 1–2 小时才能完成。当然也可以放在 worktree 里做,但那只会带来大量合并冲突和不够理想的重构。注意:我通常是独自工作;如果你在更大的团队里,这套流程显然行不通。

前面我已经提过我规划功能的方式。我会不断交叉引用项目,尤其是当我知道某个问题在别处已经解决过时,我会让 codex 去看 ../project-folder,这通常就足够它从上下文推断该去哪找。这在节省提示词上非常有用。我甚至可以直接写:“看看 ../vibetunnel,然后对 Sparkle changelogs 做同样的事”,因为那边已经解决过了,几乎有 99% 的把握它能正确复制并适配到新项目。这也是我搭脚手架启动新项目的方式。

我见过不少系统,专门给想回看历史会话的人用。另一件我几乎从不需要、也不用的东西。我会在每个项目里用一个 docs 文件夹维护子系统和功能文档,并在全局 AGENTS 文件里用一个脚本加上一些指令强制模型在特定主题上先读文档。项目越大,这个做法越划算,所以我不会到处都用;但它对保持文档及时更新、为任务构造更好的上下文非常有帮助。

说到上下文。我以前对"新任务就重开会话"这件事非常认真。到了 GPT 5.2,这已经不再需要。即便上下文更满,性能也极好;而且常常还能更快,因为模型在已经加载了很多文件后工作得更快。当然,这只在你把任务串行化,或者把改动隔得足够远、让两个会话几乎不碰同一片代码时才好用。codex 不像 Claude Code 那样有"这个文件变了"的系统事件,所以你得更小心;但反过来,codex 的上下文管理能力就是强太多了——我感觉一个 codex 会话里能做的事,比用 claude 多 5 倍。这不仅仅是客观上上下文更大,还有别的因素在起作用。我猜 codex 在内部会把思考压缩得很紧凑以节省 token,而 Opus 则非常话痨。有时模型会出错,把内部思考流泄露给用户——这种情况我已经见过不少次。说真的,我觉得 codex 的文笔莫名挺有娱乐性。

提示词。我以前会用语音输入写很长、很精细的提示词。用上 codex 后,我的提示词短了很多;我又开始更多地打字,而且经常会加图片,尤其是在迭代 UI(或者 CLI 的文案)时。如果你把问题“展示”给模型看,往往只要几句话就能让它做出你想要的效果。对,我就是那种把某个 UI 组件的截图拖进来,然后写一句“fix padding”或“redesign”的人;很多时候这要么直接解决问题,要么至少能推进到一个相当不错的程度。我以前会引用 Markdown 文件,但有了我的 docs:list 脚本后就不必了。

Markdown。很多时候我会写“write docs to docs/*.md”,然后让模型自己挑文件名。你为模型的训练习惯把结构设计得越“显而易见”,你的工作就会越轻松。毕竟,我不是把代码库设计得方便我自己浏览,而是把它工程化成代理能高效工作的样子。和模型对着干,往往只是浪费时间和 token。

工具与基础设施

还有哪些依然很难?挑对依赖和框架是我会投入不少时间的事:它维护得好吗?peer dependencies 怎么样?够不够流行——也就是“世界知识”够不够多,让代理更容易上手?同样还有系统设计:我们用 web sockets 通信吗?还是 HTML?哪些放服务器,哪些放客户端?数据具体怎么流、从哪到哪?这些往往更难对模型解释清楚,也是研究与思考更值得投入的地方。

因为我管理很多项目,我经常让一个代理就跑在项目目录里;当我琢磨出一个新模式时,就让它“find all my recent go projects and implement this change there too + update changelog”。我的每个项目都会在那个文件里把 patch 版本号抬一下;等我下次回来看时,就已经有些改进在等我测试了。

当然我把一切都自动化了。有一个 skill 用来注册域名和改 DNS;有一个用来写好前端;我的 AGENTS 文件里还有一段关于 tailscale 网络的说明,所以我可以直接说“go to my mac studio and update xxx”。

说到多台 Mac。我通常用两台 Mac 工作:大屏上是我的 MacBook Pro,另一块屏幕开一个 Jump Desktop 连接到我的 Mac Studio。有些项目在那边“炖”,有些在这边。有时我会在两台机器上编辑同一个项目的不同部分,再用 git 同步。比 worktree 简单,因为 main 上的漂移很容易对齐。还有个额外好处:任何需要 UI 或浏览器自动化的任务我都可以扔给 Studio,那边的弹窗就不会来烦我。(是的,Playwright 有 headless 模式,但总有不少场景用不了)

另一个好处是任务会一直在那边跑,所以我出门时,远程机就成了我的主工作站;即便我把 Mac 合上,任务也照样继续。我过去也试过真正的异步代理,比如 codex 或 Cursor web,但我会怀念那种“可操控性”;而且最终工作会变成 pull request,这又给我的 setup 增加了复杂度。我更喜欢终端的简单。

我以前也玩过 slash commands,但一直没觉得特别有用。Skills 替代了其中一部分;其余的我还是会写“commit/push”,因为它和 /commit 一样快,而且永远能用。

过去我常常会专门腾出几天来重构、清理项目;现在更多是随手做。只要提示词开始跑得太久,或者我在代码流里看到什么丑东西飞过去,我就会当场处理掉。

我试过 Linear 或其他 issue tracker,但没有一个真正坚持下来。重要想法我会立刻去试;其他的要么我会记得,要么它就没那么重要。当然,对于使用我开源代码的人报的 bug,我会有公开的 bug tracker;但当我自己发现一个 bug 时,我会立刻提示模型去修——比写下来、然后以后再切回上下文处理要快得多。

不管你要做什么,都先从模型和 CLI 开始。我脑子里有个做一个 Chrome 扩展来总结 YouTube 视频的想法很久了。上周我开始做 summarize:一个 CLI,能把任何东西转成 Markdown,再把它喂给模型做总结。我先把核心打磨对了;核心跑得很稳之后,我用一天把整个扩展做完了。我非常喜欢它:可跑本地模型,也可跑免费或付费模型;能在本地转写视频或音频;还能跟本地 daemon 通信,所以非常快。试试看吧!

我常用的模型是 gpt-5.2-codex high。还是那句话,KISS。xhigh 除了更慢以外几乎没什么好处,我也不想花时间在不同模式或“ultrathink”上纠结。所以几乎所有东西都跑在 high。GPT 5.2 和 codex 已经足够接近,换模型没什么意义,我就一直用这个。

我的配置

这是我的 ~/.codex/config.toml:

model = "gpt-5.2-codex"
model_reasoning_effort = "high"
tool_output_token_limit = 25000
# Leave room for native compaction near the 272–273k context window.
# Formula: 273000 - (tool_output_token_limit + 15000)
# With tool_output_token_limit=25000 ⇒ 273000 - (25000 + 15000) = 233000
model_auto_compact_token_limit = 233000
[features]
ghost_commit = false
unified_exec = true
apply_patch_freeform = true
web_search_request = true
skills = true
shell_snapshot = true

[projects."/Users/steipete/Projects"]
trust_level = "trusted"

调大 tool_output_token_limit 能让模型一次读更多内容;默认值有点小,会限制它能看见什么。更糟的是它会"静默失败",很折磨——他们迟早会修。还有,web search 竟然还不是默认开启?unified_exec 替代了 tmux 和我以前的 runner 脚本,其他功能也挺不错。以及别害怕 compaction——自从 OpenAI 切到新的 /compact endpoint 之后,这件事已经足够可靠:任务可以跨多次 compaction 继续跑,最后也能完成。它会让事情变慢,但往往也像一次复审;模型在重新看代码时还会顺手揪出 bug。
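配置注释里的那条公式可以直接算出来验证。下面是一个小草图,其中 273k 的上下文窗口与 15k 的预留余量均取自上面的配置注释,属于当下的经验值,模型版本更新后可能变化:

```go
// 按配置注释中的公式计算 model_auto_compact_token_limit。
// contextWindow 与 reserve 取自上文配置注释,是经验值而非官方承诺。
package main

import "fmt"

const (
	contextWindow = 273000 // 注释中提到的 272–273k 上下文窗口
	reserve       = 15000  // 为原生 compaction 预留的余量
)

// AutoCompactLimit 返回给定 tool_output_token_limit 下的自动 compaction 阈值。
func AutoCompactLimit(toolOutputLimit int) int {
	return contextWindow - (toolOutputLimit + reserve)
}

func main() {
	fmt.Println(AutoCompactLimit(25000)) // 与配置中的 233000 一致
}
```

如果调整 tool_output_token_limit,应同步重算并更新 model_auto_compact_token_limit,两者共享同一个窗口预算。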

目前就这些。我打算之后多写点东西,脑子里也攒了不少想法,只是最近做东西玩得太开心了。如果你想听我在这个新世界里更多的碎碎念和构建点子,在 Twitter 上关注我。


Shipping at Inference-Speed | Peter Steinberger


Shipping at Inference-Speed

Published: 28 Dec, 2025

• 18 min read



What Changed Since May

It’s incredible how far “vibe coding” has come this year. Whereas in ~May I was amazed that some prompts produced code that worked out of the box, this is now my expectation. I can ship code now at a speed that seems unreal. I burned a lot of tokens since then. Time for an update.

It’s funny how these agents work. There’s been this argument a few weeks ago that one needs to write code in order to feel bad architecture and that using agents creates a disconnection - and I couldn’t disagree more. When you spend enough time with agents, you know exactly how long sth should take, and when codex comes back and hasn’t solved it in one shot, I already get suspicious.

The amount of software I can create is now mostly limited by inference time and hard thinking. And let’s be honest - most software does not require hard thinking. Most apps shove data from one form to another, maybe store it somewhere, and then show it to the user in some way or another. The simplest form is text, so by default, whatever I wanna build, it starts as CLI. Agents can call it directly and verify output - closing the loop.


The Model Shift

The real unlock into building like a factory was GPT 5. It took me a few weeks after the release to see it - and for codex to catch up on features that claude code had, and a bit to learn and understand the differences, but then I started trusting the model more and more. These days I don’t read much code anymore. I watch the stream and sometimes look at key parts, but I gotta be honest - most code I don’t read. I do know where which components are and how things are structured and how the overall system is designed, and that’s usually all that’s needed.

The important decisions these days are language/ecosystem and dependencies. My go-to languages are TypeScript for web stuff, Go for CLIs and Swift if it needs to use macOS stuff or has UI. Go wasn’t something I gave even the slightest thought even a few months ago, but eventually I played around and found that agents are really great at writing it, and its simple type system makes linting fast.

Folks building Mac or iOS stuff: You don’t need Xcode much anymore. I don’t even use xcodeproj files. Swift’s build infra is good enough for most things these days. codex knows how to run iOS apps and how to deal with the Simulator. No special stuff or MCPs needed.


codex vs Opus

I’m writing this post here while codex crunches through a huge, multi-hour refactor and un-slops older crimes of Opus 4.0. People on Twitter often ask me what’s the big difference between Opus and codex and why it even matters because the benchmarks are so close. IMO it’s getting harder and harder to trust benchmarks - you need to try both to really understand. Whatever OpenAI did in post-training, codex has been trained to read LOTS of code before starting.

Sometimes it just silently reads files for 10, 15 minutes before starting to write any code. On the one hand that’s annoying, on the other hand that’s amazing because it greatly increases the chance that it fixes the right thing. Opus on the other hand is much more eager - great for smaller edits - not so good for larger features or refactors, it often doesn’t read the whole file or misses parts and then delivers inefficient outcomes or misses sth. I noticed that even tho codex sometimes takes 4x longer than Opus for comparable tasks, I’m often faster because I don’t have to go back and fix the fix, sth that felt quite normal when I was still using Claude Code.

codex also allowed me to unlearn lots of charades that were necessary with Claude Code. Instead of “plan mode”, I simply start a conversation with the model, ask a question, let it google, explore code, create a plan together, and when I’m happy with what I see, I write “build” or “write plan to docs/*.md and build this”. Plan mode feels like a hack that was necessary for older generations of models that were not great at adhering to prompts, so we had to take away their edit tools. There’s a highly misunderstood tweet of mine that’s still circling around that showed me that most people don’t get that plan mode is not magic.


Oracle

The step from GPT 5/5.1 to 5.2 was massive. I built oracle 🧿 about a month ago - it’s a CLI that allows the agent to run GPT 5 Pro and upload files + a prompt and manages sessions so answers can be retrieved later. I did this because many times when agents were stuck, I asked it to write everything into a markdown file and then did the query myself, and that felt like a repetitive waste of time - and an opportunity to close the loop. The instructions are in my global AGENTS.MD file and the model sometimes by itself triggered oracle when it got stuck. I used this multiple times per day. It was a massive unlock. Pro is insanely good at doing a speedrun across ~50 websites and then thinking really hard at it and in almost every case nailed the response. Sometimes it’s fast and takes 10 minutes, but I had runs that took more than an hour.

Now that GPT 5.2 is out, I have far fewer situations where I need it. I do use Pro myself sometimes for research, but the cases where I asked the model to “ask the oracle” went from multiple times per day to a few times per week. I’m not mad about this - building oracle was super fun and I learned lots about browser automation, Windows and finally took my time to look into skills, after dismissing that idea for quite some time. What it does show is how much better 5.2 got for many real-life coding tasks. It one-shots almost anything I throw at it.

Another massive win is the knowledge cutoff date. GPT 5.2 goes till end of August whereas Opus is stuck in mid-March - that’s about 5 months. Which is significant when you wanna use the latest available tools.


A Concrete Example: VibeTunnel

To give you another example on how far models have come. One of my early intense projects was VibeTunnel. A terminal-multiplexer so you can code on-the-go. I poured pretty much all my time into this earlier this year, and after 2 months it was so good that I caught myself coding from my phone while out with friends… and decided that this is something I should stop, more for mental health than anything. Back then I tried to rewrite a core part of the multiplexer away from TypeScript, and the older models consistently failed me. I tried Rust, Go… god forbid, even zig. Of course I could have finished this refactor, but it would have required lots of manual work, so I never got around completing this before I put it to rest. Last week I un-dusted this and gave codex a two sentence prompt to convert the whole forwarding-system to zig, and it ran over 5h and multiple compactions and delivered a working conversion in one shot.

Why did I even un-dust it, you ask? My current focus is Clawdis, an AI assistant that has full access to everything on all my computers, messages, emails, home automation, cameras, lights, music, heck it can even control the temperature of my bed. Ofc it also has its own voice, a CLI to tweet and its own clawd.bot.

Clawd can see and control my screen and sometimes makes snarky remarks, but I also wanted to give him the ability to check on my agents, and getting a character stream is just far more efficient than looking at images… if this will work out, we’ll see!


My Workflow

I know… you came here to learn how to build faster, and I’m just writing a marketing-pitch for OpenAI. I hope Anthropic is cooking Opus 5 and the tides turn again. Competition is good! At the same time, I love Opus as general purpose model. My AI agent wouldn’t be half as fun running on GPT 5. Opus has something special that makes it a delight to work with. I use it for most of my computer automation tasks and ofc it powers Clawd🦞.

I haven’t changed my workflow all that much from my last take at it in October.

I usually work on multiple projects at the same time. Depending on complexity that can be between 3-8. The context switching can be tiresome, I really only can do that when I’m working at home, in silence and concentrated. It’s a lot of mental models to shuffle. Luckily most software is boring. Creating a CLI to check up on your food delivery doesn’t need a lot of thinking. Usually my focus is on one big project and satellite projects that chug along. When you do enough agentic engineering, you develop a feeling for what’s gonna be easy and where the model likely will struggle, so often I just put in a prompt, codex will chug along for 30 minutes and I have what I need. Sometimes it takes a little fiddling or creativity, but often things are straightforward.

I extensively use the queueing feature of codex - as I get a new idea, I add it to the pipeline. I see many folks experimenting with various systems of multi-agent orchestration, emails or automatic task management - so far I don’t see much need for this - usually I’m the bottleneck. My approach to building software is very iterative. I build sth, play with it, see how it “feels”, and then get new ideas to refine it. Rarely do I have a complete picture of what I want in my head. Sure, I have a rough idea, but often that drastically changes as I explore the problem domain. So systems that take the complete idea as input and then deliver output wouldn’t work well for me. I need to play with it, touch it, feel it, see it, that’s how I evolve it.

I basically never revert or use checkpointing. If something isn’t how I like it, I ask the model to change it. codex sometimes then resets a file, but often it simply reverts or modifies the edits, very rare that I have to back completely, and instead we just travel into a different direction. Building software is like walking up a mountain. You don’t go straight up, you circle around it and take turns, sometimes you get off path and have to walk a bit back, and it’s imperfect, but eventually you get to where you need to be.

I simply commit to main. Sometimes codex decides that it’s too messy and automatically creates a worktree and then merges changes back, but it’s rare and I only prompt that in exceptional cases. I find the added cognitive load of having to think of different states in my projects unnecessary and prefer to evolve it linearly. Bigger tasks I keep for moments where I’m distracted - for example while writing this, I run refactors on 4 projects here that will take around 1-2h each to complete. Ofc I could do that in a worktree, but that would just cause lots of merge conflicts and suboptimal refactors. Caveat: I usually work alone, if you work in a bigger team that workflow obv won’t fly.

I’ve already mentioned my way of planning a feature. I cross-reference projects all the time, esp if I know that I already solved sth somewhere else, I ask codex to look in ../project-folder and that’s usually enough for it to infer from context where to look. This is extremely useful to save on prompts. I can just write “look at ../vibetunnel and do the same for Sparkle changelogs”, because it’s already solved there and with a 99% guarantee it’ll correctly copy things over and adapt to the new project. That’s how I scaffold new projects as well.

I’ve seen plenty of systems for folks wanting to refer to past sessions. Another thing I never need or use. I maintain docs for subsystems and features in a docs folder in each project, and use a script + some instructions in my global AGENTS file to force the model to read docs on certain topics. This pays off more the larger the project is, so I don’t use it everywhere, but it is of great help to keep docs up-to-date and engineer a better context for my tasks.

Apropos context. I used to be really diligent to restart a session for new tasks. With GPT 5.2 this is no longer needed. Performance is extremely good even when the context is fuller, and often it helps with speed since the model works faster when it already has loaded plenty files. Obviously that only works well when you serialize your tasks or keep the changes so far apart that two sessions don’t touch each other much. codex has no system events for “this file changed”, unlike claude code, so you need to be more careful - on the flip side, codex is just FAR better at context management, I feel I get 5x more done on one codex session than with claude. This is more than just the objectively larger context size, there’s other things at work. My guess is that codex internally thinks really condensed to save tokens, whereas Opus is very wordy. Sometimes the model messes up and its internal thinking stream leaks to the user, so I’ve seen this quite a few times. Really, codex has a way with words I find strangely entertaining.

Prompts. I used to write long, elaborate prompts with voice dictation. With codex, my prompts gotten much shorter, I often type again, and many times I add images, especially when iterating on UI (or text copies with CLIs). If you show the model what’s wrong, just a few words are enough to make it do what you want. Yes, I’m that person that drags in a clipped image of some UI component with “fix padding” or “redesign”, many times that either solves my issue or gets me reasonably far. I used to refer to markdown files, but with my docs:list script that’s no longer necessary.

Markdowns. Many times I write “write docs to docs/*.md” and simply let the model pick a filename. The more obvious you design the structure for what the model is trained on, the easier your work will be. After all, I don’t design codebases to be easy to navigate for me, I engineer them so agents can work in it efficiently. Fighting the model is often a waste of time and tokens.


Tooling Infrastructure

What’s still hard? Picking the right dependency and framework to set on is something I invest quite some time on. Is this well-maintained? How about peer dependencies? Is it popular = will have enough world knowledge so agents have an easy time? Equally, system design. Will we communicate via web sockets? HTML? What do I put into the server and what into the client? How and which data flows where to where? Often these are things that are a bit harder to explain to a model and where research and thinking pays off.

Since I manage lots of projects, often I let an agent simply run in my project folder and when I figure out a new pattern, I ask it to “find all my recent go projects and implement this change there too + update changelog”. Each of my project has a raised patch version in that file and when I revisit it, some improvements are already waiting for me to test.

Ofc I automate everything. There’s a skill to register domains and change DNS. One to write good frontends. There’s a note in my AGENTS file about my tailscale network so I can just say “go to my mac studio and update xxx”.

Apropos multiple Macs. I usually work on two Macs. My MacBook Pro on the big screen, and a Jump Desktop session to my Mac Studio on another screen. Some projects are cooking there, some here. Sometimes I edit different parts of the same project on each machine and sync via git. Simpler than worktrees because drifts on main are easy to reconcile. Has the added benefit that anything that needs UI or browser automation I can move to my Studio and it won’t annoy me with popups. (Yes, Playwright has headless mode but there’s enough situations where that won’t work)

Another benefit is that tasks keep running there, so whenever I travel, remote becomes my main workstation and tasks simply keep running even if I close my Mac. I did experiment with real async agents like codex or Cursor web in the past, but I miss the steerability, and ultimately the work ends up as pull request, which again adds complexity to my setup. I much prefer the simplicity of the terminal.

I used to play with slash commands, but just never found them too useful. Skills replaced some of it, and for the rest I keep writing “commit/push” because it takes the same time as /commit and always works.

In the past I often took dedicated days to refactor and clean up projects, I do this much more ad-hoc now. Whenever prompts start taking too long or I see sth ugly flying by in the code stream, I’ll deal with it right away.

I tried linear or other issue trackers, but nothing did stick. Important ideas I try right away, and everything else I’ll either remember or it wasn’t important. Of course I have public bug trackers for bugs for folks that use my open source code, but when I find a bug, I’ll immediately prompt it - much faster than writing it down and then later having to switch context back to it.

Whatever you build, start with the model and a CLI first. I had this idea of a Chrome extension to summarize YouTube vids in my head for a long time. Last week I started working on summarize, a CLI that converts anything to markdown and then feeds that to a model for summarization. First I got the core right, and once that worked great I built the whole extension in a day. I’m quite in love with it. Runs on local, free or paid models. Transcribes video or audio locally. Talks to a local daemon so it’s super fast. Give it a go!

My go-to model is gpt-5.2-codex high. Again, KISS. There’s very little benefit to xhigh other than it being far slower, and I don’t wanna spend time thinking about different modes or “ultrathink”. So pretty much everything runs on high. GPT 5.2 and codex are close enough that changing models makes no sense, so I just use that.

工具与基础设施

还有哪些依然很难?挑对依赖和框架是我会投入不少时间的事:它维护得好吗?peer dependencies 怎么样?够不够流行——也就是“世界知识”够不够多,让代理更容易上手?同样还有系统设计:我们用 web sockets 通信吗?还是 HTML?哪些放服务器,哪些放客户端?数据具体怎么流、从哪到哪?这些往往更难对模型解释清楚,也是研究与思考更值得投入的地方。

因为我管理很多项目,我经常让一个代理就跑在项目目录里;当我琢磨出一个新模式时,就让它“find all my recent go projects and implement this change there too + update changelog”。我的每个项目都会在那个文件里把 patch 版本号抬一下;等我下次回来看时,就已经有些改进在等我测试了。

当然我把一切都自动化了。有一个 skill 用来注册域名和改 DNS;有一个用来写好前端;我的 AGENTS 文件里还有一段关于 tailscale 网络的说明,所以我可以直接说“go to my mac studio and update xxx”。

说到多台 Mac。我通常用两台 Mac 工作:大屏上是我的 MacBook Pro,另一块屏幕开一个 Jump Desktop 连接到我的 Mac Studio。有些项目在那边“炖”,有些在这边。有时我会在两台机器上编辑同一个项目的不同部分,再用 git 同步。比 worktree 简单,因为 main 上的漂移很容易对齐。还有个额外好处:任何需要 UI 或浏览器自动化的任务我都可以扔给 Studio,那边的弹窗就不会来烦我。(是的,Playwright 有 headless 模式,但总有不少场景用不了)

另一个好处是任务会一直在那边跑,所以我出门时,远程机就成了我的主工作站;即便我把 Mac 合上,任务也照样继续。我过去也试过真正的异步代理,比如 codex 或 Cursor web,但我会怀念那种“可操控性”;而且最终工作会变成 pull request,这又给我的 setup 增加了复杂度。我更喜欢终端的简单。

我以前也玩过 slash commands,但一直没觉得特别有用。Skills 替代了其中一部分;其余的我还是会写“commit/push”,因为它和 /commit 一样快,而且永远能用。

过去我常常会专门腾出几天来重构、清理项目;现在更多是随手做。只要提示词开始跑得太久,或者我在代码流里看到什么丑东西飞过去,我就会当场处理掉。

我试过 Linear 或其他 issue tracker,但没有一个真正坚持下来。重要想法我会立刻去试;其他的要么我会记得,要么它就没那么重要。当然,对于使用我开源代码的人报的 bug,我会有公开的 bug tracker;但当我自己发现一个 bug 时,我会立刻提示模型去修——比写下来、然后以后再切回上下文处理要快得多。

不管你要做什么,都先从模型和 CLI 开始。我脑子里有个做一个 Chrome 扩展来总结 YouTube 视频的想法很久了。上周我开始做 summarize:一个 CLI,能把任何东西转成 Markdown,再把它喂给模型做总结。我先把核心打磨对了;核心跑得很稳之后,我用一天把整个扩展做完了。我非常喜欢它:可跑本地模型,也可跑免费或付费模型;能在本地转写视频或音频;还能跟本地 daemon 通信,所以非常快。试试看吧!

我常用的模型是 gpt-5.2-codex high。还是那句话,KISS。xhigh 除了更慢以外几乎没什么好处,我也不想花时间在不同模式或“ultrathink”上纠结。所以几乎所有东西都跑在 high。GPT 5.2 和 codex 已经足够接近,换模型没什么意义,我就一直用这个。

My Config

This is my ~/.codex/config.toml:

model = "gpt-5.2-codex"
model_reasoning_effort = "high"
tool_output_token_limit = 25000
# Leave room for native compaction near the 272–273k context window.
# Formula: 273000 - (tool_output_token_limit + 15000)
# With tool_output_token_limit=25000 ⇒ 273000 - (25000 + 15000) = 233000
model_auto_compact_token_limit = 233000
[features]
ghost_commit = false
unified_exec = true
apply_patch_freeform = true
web_search_request = true
skills = true
shell_snapshot = true

[projects."/Users/steipete/Projects"]
trust_level = "trusted"

This allows the model to read more in one go; the defaults are a bit small and can limit what it sees. It fails silently, which is a pain and something they’ll eventually fix. Also, web search is still not on by default? unified_exec replaced tmux and my old runner script; the rest is neat too. And don’t be scared of compaction: ever since OpenAI switched to their new /compact endpoint, it works well enough that tasks can run across many compactions and still get finished. It makes things slower, but often acts like a review, and the model will find bugs when it looks at the code again.
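The headroom formula from the config comments, spelled out. The 15000-token reserve is the author’s own slack number, not an official constant:

```shell
#!/bin/sh
# Compute model_auto_compact_token_limit from the comment's formula:
#   limit = context_window - (tool_output_token_limit + reserve)
context_window=273000          # approximate codex context window per the comment
tool_output_token_limit=25000  # matches the config above
reserve=15000                  # extra slack assumed by the author
echo $(( context_window - (tool_output_token_limit + reserve) ))   # prints 233000
```

If you raise tool_output_token_limit, lower model_auto_compact_token_limit by the same amount so compaction still triggers before the window fills.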

That’s it, for now. I plan on writing more again and have quite a backlog of ideas in my head; I’m just having too much fun building things. If you wanna hear more ramblings and ideas on how to build in this new world, follow me on Twitter.


Steal this post ➜ CC BY 4.0 · Code MIT

我的配置

这是我的 ~/.codex/config.toml:

model = "gpt-5.2-codex"
model_reasoning_effort = "high"
tool_output_token_limit = 25000
# Leave room for native compaction near the 272–273k context window.
# Formula: 273000 - (tool_output_token_limit + 15000)
# With tool_output_token_limit=25000 ⇒ 273000 - (25000 + 15000) = 233000
model_auto_compact_token_limit = 233000
[features]
ghost_commit = false
unified_exec = true
apply_patch_freeform = true
web_search_request = true
skills = true
shell_snapshot = true

[projects."/Users/steipete/Projects"]
trust_level = "trusted"

这能让模型一次读更多内容;默认值有点小,会限制它能看见什么。更糟的是它会“静默失败”,很折磨——他们迟早会修。还有,web search 竟然还不是默认开启?unified_exec 替代了 tmux 和我以前的 runner 脚本,其他也挺不错。以及别害怕 compaction——自从 OpenAI 切到新的 /compact endpoint 之后,这件事已经足够可靠:任务可以跨多次 compaction 继续跑,最后也能完成。它会让事情变慢,但往往也像一次复审;模型在重新看代码时还会顺手揪出 bug。

目前就这些。我打算之后多写点东西,脑子里也攒了不少想法,只是最近做东西玩得太开心了。如果你想听我在这个新世界里更多的碎碎念和构建点子,在 Twitter 上关注我。


相关笔记


What Changed Since May

It’s incredible how far “vibe coding” has come this year. Whereas in ~May I was amazed that some prompts produced code that worked out of the box, this is now my expectation. I can ship code now at a speed that seems unreal. I burned a lot of tokens since then. Time for an update.

It’s funny how these agents work. There was this argument a few weeks ago that one needs to write code in order to feel bad architecture, and that using agents creates a disconnect - and I couldn’t disagree more. When you spend enough time with agents, you know exactly how long sth should take, and when codex comes back and hasn’t solved it in one shot, I already get suspicious.

The amount of software I can create is now mostly limited by inference time and hard thinking. And let’s be honest - most software does not require hard thinking. Most apps shove data from one form to another, maybe store it somewhere, and then show it to the user in some way or another. The simplest form is text, so by default, whatever I wanna build, it starts as CLI. Agents can call it directly and verify output - closing the loop.

The Model Shift

The real unlock into building like a factory was GPT 5. It took me a few weeks after the release to see it - and for codex to catch up on features that claude code had - and a bit longer to learn and understand the differences, but then I started trusting the model more and more. These days I don’t read much code anymore. I watch the stream and sometimes look at key parts, but I gotta be honest: most code I don’t read. I do know where the components are, how things are structured, and how the overall system is designed, and that’s usually all that’s needed.

The important decisions these days are language/ecosystem and dependencies. My go-to languages are TypeScript for web stuff, Go for CLIs, and Swift if it needs to use macOS stuff or has UI. Go wasn’t something I gave even the slightest thought a few months ago, but eventually I played around and found that agents are really great at writing it, and its simple type system makes linting fast.

Folks building Mac or iOS stuff: You don’t need Xcode much anymore. I don’t even use xcodeproj files. Swift’s build infra is good enough for most things these days. codex knows how to run iOS apps and how to deal with the Simulator. No special stuff or MCPs needed.

codex vs Opus

I’m writing this post here while codex crunches through a huge, multi-hour refactor and un-slops older crimes of Opus 4.0. People on Twitter often ask me what’s the big difference between Opus and codex and why it even matters because the benchmarks are so close. IMO it’s getting harder and harder to trust benchmarks - you need to try both to really understand. Whatever OpenAI did in post-training, codex has been trained to read LOTS of code before starting.

Sometimes it just silently reads files for 10 or 15 minutes before starting to write any code. On the one hand that’s annoying; on the other hand that’s amazing, because it greatly increases the chance that it fixes the right thing. Opus, on the other hand, is much more eager - great for smaller edits, not so good for larger features or refactors: it often doesn’t read the whole file or misses parts, and then delivers inefficient outcomes or misses sth. I noticed that even tho codex sometimes takes 4x longer than Opus for comparable tasks, I’m often faster because I don’t have to go back and fix the fix, sth that felt quite normal when I was still using Claude Code.

codex also allowed me to unlearn lots of charades that were necessary with Claude Code. Instead of “plan mode”, I simply start a conversation with the model, ask a question, let it google, explore code, and create a plan together, and when I’m happy with what I see, I write “build” or “write plan to docs/*.md and build this”. Plan mode feels like a hack that was necessary for older generations of models that were not great at adhering to prompts, so we had to take away their edit tools. There’s a highly misunderstood tweet of mine still circulating that showed me most people don’t get that plan mode is not magic.

Oracle

The step from GPT 5/5.1 to 5.2 was massive. I built oracle 🧿 about a month ago - it’s a CLI that lets the agent run GPT 5 Pro, upload files plus a prompt, and manage sessions so answers can be retrieved later. I did this because many times when agents were stuck, I asked them to write everything into a markdown file and then did the query myself, and that felt like a repetitive waste of time - and an opportunity to close the loop. The instructions are in my global AGENTS.MD file, and the model sometimes triggered oracle by itself when it got stuck. I used this multiple times per day. It was a massive unlock. Pro is insanely good at doing a speedrun across ~50 websites and then thinking really hard about it, and in almost every case it nailed the response. Sometimes it’s fast and takes 10 minutes, but I had runs that took more than an hour.

Now that GPT 5.2 is out, I have far fewer situations where I need it. I do use Pro myself sometimes for research, but the cases where I asked the model to “ask the oracle” went from multiple times per day to a few times per week. I’m not mad about this - building oracle was super fun and I learned lots about browser automation, Windows and finally took my time to look into skills, after dismissing that idea for quite some time. What it does show is how much better 5.2 got for many real-life coding tasks. It one-shots almost anything I throw at it.

Another massive win is the knowledge cutoff date. GPT 5.2 goes till the end of August whereas Opus is stuck in mid-March - that’s about 5 months, which is significant when you wanna use the latest available tools.

A Concrete Example: VibeTunnel

To give you another example of how far models have come: one of my early intense projects was VibeTunnel, a terminal multiplexer so you can code on-the-go. I poured pretty much all my time into this earlier this year, and after 2 months it was so good that I caught myself coding from my phone while out with friends… and decided that this is something I should stop, more for mental health than anything. Back then I tried to rewrite a core part of the multiplexer away from TypeScript, and the older models consistently failed me. I tried Rust, Go… god forbid, even zig. Of course I could have finished this refactor, but it would have required lots of manual work, so I never got around to completing it before I put the project to rest. Last week I un-dusted this and gave codex a two-sentence prompt to convert the whole forwarding system to zig, and it ran for over 5h across multiple compactions and delivered a working conversion in one shot.

Why did I even un-dust it, you ask? My current focus is Clawdis, an AI assistant that has full access to everything on all my computers, messages, emails, home automation, cameras, lights, music, heck it can even control the temperature of my bed. Ofc it also has its own voice, a CLI to tweet and its own clawd.bot.

Clawd can see and control my screen and sometimes makes snarky remarks, but I also wanted to give him the ability to check on my agents, and getting a character stream is just far more efficient than looking at images… whether this will work out, we’ll see!

My Workflow

I know… you came here to learn how to build faster, and I’m just writing a marketing-pitch for OpenAI. I hope Anthropic is cooking Opus 5 and the tides turn again. Competition is good! At the same time, I love Opus as general purpose model. My AI agent wouldn’t be half as fun running on GPT 5. Opus has something special that makes it a delight to work with. I use it for most of my computer automation tasks and ofc it powers Clawd🦞.

I haven’t changed my workflow all that much from my last take at it in October.

I usually work on multiple projects at the same time. Depending on complexity that can be between 3-8. The context switching can be tiresome; I can really only do that when I’m working at home, in silence and concentrated. It’s a lot of mental models to shuffle. Luckily most software is boring. Creating a CLI to check up on your food delivery doesn’t need a lot of thinking. Usually my focus is on one big project plus satellite projects that chug along. When you do enough agentic engineering, you develop a feeling for what’s gonna be easy and where the model will likely struggle, so often I just put in a prompt, codex chugs along for 30 minutes, and I have what I need. Sometimes it takes a little fiddling or creativity, but often things are straightforward.

I extensively use the queueing feature of codex - as I get a new idea, I add it to the pipeline. I see many folks experimenting with various systems of multi-agent orchestration, emails or automatic task management - so far I don’t see much need for this - usually I’m the bottleneck. My approach to building software is very iterative. I build sth, play with it, see how it “feels”, and then get new ideas to refine it. Rarely do I have a complete picture of what I want in my head. Sure, I have a rough idea, but often that drastically changes as I explore the problem domain. So systems that take the complete idea as input and then deliver output wouldn’t work well for me. I need to play with it, touch it, feel it, see it, that’s how I evolve it.

I basically never revert or use checkpointing. If something isn’t how I like it, I ask the model to change it. codex sometimes then resets a file, but often it simply reverts or modifies the edits; it’s very rare that I have to back out completely, and instead we just travel in a different direction. Building software is like walking up a mountain. You don’t go straight up; you circle around it and take turns, sometimes you get off path and have to walk a bit back, and it’s imperfect, but eventually you get to where you need to be.

I simply commit to main. Sometimes codex decides that it’s too messy and automatically creates a worktree and then merges changes back, but that’s rare and I only prompt for it in exceptional cases. I find the added cognitive load of having to think about different states in my projects unnecessary and prefer to evolve them linearly. Bigger tasks I keep for moments when I’m distracted - for example, while writing this, I’m running refactors on 4 projects that will take around 1-2h each to complete. Ofc I could do that in a worktree, but that would just cause lots of merge conflicts and suboptimal refactors. Caveat: I usually work alone; if you work in a bigger team, that workflow obv won’t fly.

I’ve already mentioned my way of planning a feature. I cross-reference projects all the time, esp if I know that I already solved sth somewhere else, I ask codex to look in ../project-folder and that’s usually enough for it to infer from context where to look. This is extremely useful to save on prompts. I can just write “look at ../vibetunnel and do the same for Sparkle changelogs”, because it’s already solved there and with a 99% guarantee it’ll correctly copy things over and adapt to the new project. That’s how I scaffold new projects as well.

I’ve seen plenty of systems for folks wanting to refer to past sessions. Another thing I never need or use. I maintain docs for subsystems and features in a docs folder in each project, and use a script + some instructions in my global AGENTS file to force the model to read docs on certain topics. This pays off more the larger the project is, so I don’t use it everywhere, but it is of great help to keep docs up-to-date and engineer a better context for my tasks.
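The author’s actual docs script isn’t shown; as a hedged sketch, a helper like this could list each project’s docs so the agent knows what exists before reading. The docs/ layout and the topic-from-filename convention are assumptions:

```shell
#!/bin/sh
# Hypothetical docs-listing helper: prints "topic: path" for every markdown
# file under a project's docs/ folder, so an agent can pick what to read.
docs_list() {
  project_dir=$1
  for f in "$project_dir"/docs/*.md; do
    [ -e "$f" ] || continue   # no docs folder or no matches: print nothing
    printf '%s: %s\n' "$(basename "$f" .md)" "$f"
  done
}
```

A global AGENTS instruction like “run docs_list and read any doc relevant to the task before editing” then turns this into cheap, targeted context engineering.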

Apropos context. I used to be really diligent about restarting a session for new tasks. With GPT 5.2 this is no longer needed. Performance is extremely good even when the context is fuller, and often it helps with speed, since the model works faster when it has already loaded plenty of files. Obviously that only works well when you serialize your tasks or keep the changes so far apart that two sessions don’t touch each other much. codex has no system events for “this file changed”, unlike claude code, so you need to be more careful. On the flip side, codex is just FAR better at context management; I feel I get 5x more done in one codex session than with claude. This is more than just the objectively larger context size; there are other things at work. My guess is that codex internally thinks really condensed to save tokens, whereas Opus is very wordy. Sometimes the model messes up and its internal thinking stream leaks to the user, and I’ve seen this quite a few times. Really, codex has a way with words I find strangely entertaining.

Prompts. I used to write long, elaborate prompts with voice dictation. With codex, my prompts have gotten much shorter; I often type again, and many times I add images, especially when iterating on UI (or text copies with CLIs). If you show the model what’s wrong, just a few words are enough to make it do what you want. Yes, I’m that person that drags in a clipped image of some UI component with “fix padding” or “redesign”; many times that either solves my issue or gets me reasonably far. I used to refer to markdown files, but with my docs:list script that’s no longer necessary.

Markdowns. Many times I write “write docs to docs/*.md” and simply let the model pick a filename. The more obvious you design the structure for what the model is trained on, the easier your work will be. After all, I don’t design codebases to be easy to navigate for me, I engineer them so agents can work in it efficiently. Fighting the model is often a waste of time and tokens.

Tooling Infrastructure

What’s still hard? Picking the right dependency and framework to settle on is something I invest quite some time in. Is it well-maintained? How about peer dependencies? Is it popular, i.e. is there enough world knowledge that agents will have an easy time? Equally, system design. Will we communicate via web sockets? HTML? What do I put into the server and what into the client? Which data flows where, and how? These are often things that are a bit harder to explain to a model, and where research and thinking pay off.

Since I manage lots of projects, I often let an agent simply run in my projects folder, and when I figure out a new pattern, I ask it to “find all my recent go projects and implement this change there too + update changelog”. Each of my projects has a raised patch version in that file, and when I revisit one, some improvements are already waiting for me to test.

Ofc I automate everything. There’s a skill to register domains and change DNS. One to write good frontends. There’s a note in my AGENTS file about my tailscale network so I can just say “go to my mac studio and update xxx”.

Apropos multiple Macs. I usually work on two Macs: my MacBook Pro on the big screen, and a Jump Desktop session to my Mac Studio on another screen. Some projects are cooking there, some here. Sometimes I edit different parts of the same project on each machine and sync via git. It’s simpler than worktrees because drift on main is easy to reconcile. It has the added benefit that anything that needs UI or browser automation I can move to my Studio, where it won’t annoy me with popups. (Yes, Playwright has headless mode, but there are enough situations where that won’t work.)

Another benefit is that tasks keep running there, so whenever I travel, the remote machine becomes my main workstation and tasks simply keep running even if I close my Mac. I did experiment with real async agents like codex or Cursor web in the past, but I miss the steerability, and ultimately the work ends up as a pull request, which again adds complexity to my setup. I much prefer the simplicity of the terminal.

I used to play with slash commands, but never found them very useful. Skills replaced some of them, and for the rest I keep typing “commit/push” because it takes the same time as /commit and always works.

In the past I often took dedicated days to refactor and clean up projects; now I do this much more ad hoc. Whenever prompts start taking too long or I see sth ugly flying by in the code stream, I deal with it right away.

I tried Linear and other issue trackers, but nothing stuck. Important ideas I try right away; everything else I’ll either remember or it wasn’t important. Of course I have public bug trackers for folks that use my open-source code, but when I find a bug myself, I’ll immediately prompt it - much faster than writing it down and later having to switch context back to it.

Whatever you build, start with the model and a CLI first. I had this idea of a Chrome extension to summarize YouTube vids in my head for a long time. Last week I started working on summarize, a CLI that converts anything to markdown and then feeds that to a model for summarization. First I got the core right, and once that worked great I built the whole extension in a day. I’m quite in love with it. Runs on local, free or paid models. Transcribes video or audio locally. Talks to a local daemon so it’s super fast. Give it a go!

My go-to model is gpt-5.2-codex high. Again, KISS. There’s very little benefit to xhigh other than it being far slower, and I don’t wanna spend time thinking about different modes or “ultrathink”. So pretty much everything runs on high. GPT 5.2 and codex are close enough that changing models makes no sense, so I just use that.

My Config

This is my ~/.codex/config.toml:

model = "gpt-5.2-codex"
model_reasoning_effort = "high"
tool_output_token_limit = 25000
# Leave room for native compaction near the 272–273k context window.
# Formula: 273000 - (tool_output_token_limit + 15000)
# With tool_output_token_limit=25000 ⇒ 273000 - (25000 + 15000) = 233000
model_auto_compact_token_limit = 233000
[features]
ghost_commit = false
unified_exec = true
apply_patch_freeform = true
web_search_request = true
skills = true
shell_snapshot = true

[projects."/Users/steipete/Projects"]
trust_level = "trusted"

This allows the model to read more in one go; the defaults are a bit small and can limit what it sees. It fails silently, which is a pain and something they’ll eventually fix. Also, web search is still not on by default? unified_exec replaced tmux and my old runner script; the rest is neat too. And don’t be scared of compaction: ever since OpenAI switched to their new /compact endpoint, it works well enough that tasks can run across many compactions and still get finished. It makes things slower, but often acts like a review, and the model will find bugs when it looks at the code again.

That’s it, for now. I plan on writing more again and have quite a backlog of ideas in my head; I’m just having too much fun building things. If you wanna hear more ramblings and ideas on how to build in this new world, follow me on Twitter.


📋 讨论归档

讨论进行中…