
不要再亲手写代码了——OpenAI 的“Agent 优先”软件工程实验

OpenAI 仅用 5 个月时间,依靠 7 名工程师驱动 Codex 产出了近百万行代码,并发布了一个真实的 Beta 产品,证明了“Agent 优先”的软件工程核心不在于写代码,而在于设计让 Agent 能够自主工作的“脚手架”。

2026-03-02

核心观点

  • 工程师角色的根本转变:工程师的首要任务不再是实现逻辑,而是定义环境、明确意图,并建立反馈回路。如果 Agent 任务失败,工程师的反应不应该是“手动修复”,而是追问“系统缺失了什么能力让 Agent 无法完成?”,然后通过编写工具或文档补齐该能力。
  • Agent 可读性(Agent Legibility)是新标准:代码库的架构、文档和可观测性数据必须针对 Agent 的上下文窗口进行优化。这意味着需要一份结构化的“地图”(如 AGENTS.md 索引)而非冗长的说明书,且所有知识必须版本化存在于仓库内,否则对 Agent 来说就是不存在的。
  • 高吞吐量下的“热动力学”管理:在 Agent 产出速度远超人类评审速度的情况下,传统的阻塞性合并门禁变得低效。OpenAI 采取了“快速合并、持续重构”的策略,通过自动化的“垃圾回收”任务(后台 Agent 扫描并修复偏离模式的 PR)来维持系统熵减,而非依赖前期的小心翼翼。
  • 刚性架构是自动化的前提:Agent 在边界模糊的环境下效率极低。OpenAI 强制执行了一套极度刚性的分层架构(Types → Service → UI),并通过自定义 linter 机械化地确保依赖方向。这种在人类看来可能“迂腐”的约束,反而是 Agent 能够大规模、无偏差交付的倍增器。

跟我们的关联

  • 🧠Neta:Neta 的效能系统应从“辅助写代码”转向“建设 Agent 运行环境”。我们需要把现有的产品逻辑、架构约定和 Slack 里的零散讨论全部“仓库化/Markdown 化”。如果 Uota 进不到我们的仓库、看不到我们的 Tracing,它就永远只能做边角料任务。
  • 🪞Uota:Uota 需要一个属于它的“本地可观测性栈”。目前 Uota 只能看文件,如果它能像 OpenAI 的实验一样,直接通过 CDP (Chrome DevTools Protocol) 驱动 Neta 页面、通过 LogQL 查日志、通过 PromQL 看指标,它就能自主完成“复现 Bug -> 修复 -> 验证”的全闭环。
  • 👤ATou:在追求“指挥 AI 的 top 0.0001%”目标时,这篇文章给出了明确的路径:不仅要会写 Prompt,更要会设计“系统”。要把个人的决策逻辑、品味偏好“编码化”,让影子(Uota)能在这些约束下自主奔跑。

讨论引子

1. 如果我们规定 Neta 的某个新功能模块“人类严禁手写代码”,我们现在缺失的最关键脚手架是什么?
2. 在“Agent 生成”的代码库中,我们如何定义和衡量“人类品味”的注入点?是 Code Review 还是 Linter 定义?
3. 面对 Agent 带来的百万行代码吞吐量,我们现有的 QA 和部署流程会如何崩塌?我们敢不敢尝试“垃圾回收式”的熵减管理?

Harness 工程:在 Agent 优先的世界中发挥 Codex 的力量

作者:Ryan Lopopolo,技术人员(Member of the Technical Staff)

过去五个月里,我们的团队一直在做一个实验:在 0 行手写代码的前提下,构建并发布一款软件产品的内部 Beta。

这款产品有内部的日活用户,也有外部的 Alpha 测试者。它会发布、部署、出故障、再被修复。不同之处在于,每一行代码——应用逻辑、测试、CI 配置、文档、可观测性,以及内部工具——都由 Codex 写成。我们估算,完成这些工作的时间大约只有手写代码所需时间的 1/10。

人类掌舵,Agent 执行。

我们刻意选择这一约束,是为了迫使自己只去构建那些能让工程效率提升几个数量级所必需的东西。我们只有几周时间,却最终交付了接近百万行代码。要做到这一点,我们必须理解:当一个软件工程团队的首要工作不再是写代码,而是设计环境、明确意图,并建立让 Codex Agent 可靠工作的反馈回路时,一切会发生怎样的变化。

这篇文章总结了我们与一支 Agent 团队一起从零打造全新产品时学到的东西——哪些地方坏了、哪些问题会叠加,以及如何最大化我们唯一真正稀缺的资源:人类的时间与注意力。

我们从一个空的 git 仓库开始

对一个空仓库的第一次提交发生在 2025 年 8 月下旬。

最初的脚手架——仓库结构、CI 配置、格式化规则、包管理器设置与应用框架——由 Codex CLI 使用 GPT‑5 生成,并由少量现成模板提供引导。甚至连最初用来指示 Agent 如何在仓库中工作的 AGENTS.md 文件,也是 Codex 自己写的。

系统里没有任何既有的人类手写代码可以作为“锚点”。从一开始,仓库就由 Agent 塑造。

五个月后,这个仓库在应用逻辑、基础设施、工具、文档与内部开发者工具等方面累积了近百万行代码。在这段时间里,一个只有三名工程师驱动 Codex 的小团队打开并合并了约 1,500 个 Pull Request。折算下来,人均每天约 3.5 个 PR;更令人惊讶的是,随着团队扩展到如今的七名工程师,这个吞吐量还在提升。重要的是,这并非为了产出而产出:该产品已被数百名内部用户使用,其中包括每天都在使用的内部重度用户。

在整个开发过程中,人类从未直接贡献任何代码。这逐渐成为团队的核心理念:不写手工代码。

重新定义工程师的角色

不再由人类亲自写代码,引入了一种不同的工程工作形态,更关注系统、脚手架与杠杆。

早期进展比我们预期更慢,并不是因为 Codex 不行,而是因为环境的规格说明不够完整。Agent 缺少朝高层目标推进所必需的工具、抽象与内部结构。我们工程团队的首要工作变成:让 Agent 能够做出有用的工作。

在实践中,这意味着以“深度优先”的方式推进:把更大的目标拆成更小的积木块(设计、编码、评审、测试等),提示 Agent 构建这些积木,再用它们去解锁更复杂的任务。当某件事失败时,补救几乎从来不是“再努力一点”。因为推进的唯一方式就是让 Codex 去完成工作,所以人类工程师总会介入并追问:“缺失的能力是什么?我们怎样让它对 Agent 来说既可读、又可强制执行?”

人类几乎完全通过提示词与系统交互:工程师描述任务,运行 Agent,让它开一个 Pull Request。为了把一个 PR 推到完成,我们会指示 Codex 在本地自审改动、在本地与云端请求额外且具体的 Agent 评审、响应人类或 Agent 的反馈,并循环迭代直到所有 Agent 评审者满意(本质上这就是一个 Ralph Wiggum Loop)。Codex 会直接使用我们的标准开发工具(gh、本地脚本与仓库内嵌技能)来收集上下文,无需人类把内容复制粘贴进 CLI。
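
上面描述的评审循环可以用一小段控制流来示意。下面的 TypeScript 草图是假设性的:runAgent 与 collectReviews 都是虚构接口(原文并未公开任何内部 API),仅用来表达“循环迭代直到所有 Agent 评审者满意”这一结构:

```typescript
// 假设性示意:PR 迭代循环(即文中的 “Ralph Wiggum Loop”)。
type Review = { approved: boolean; feedback: string };

async function driveToCompletion(
  runAgent: (prompt: string) => Promise<void>,
  collectReviews: () => Promise<Review[]>,
  maxRounds = 10,
): Promise<boolean> {
  for (let round = 0; round < maxRounds; round++) {
    const reviews = await collectReviews();
    const pending = reviews.filter((r) => !r.approved);
    if (pending.length === 0) return true; // 所有 Agent 评审者满意,循环结束
    // 把未解决的反馈拼成下一轮提示词,让 Agent 继续修改
    await runAgent(`请响应以下评审意见:\n${pending.map((r) => r.feedback).join("\n")}`);
  }
  return false; // 超过轮次上限,升级给人类处理
}
```

实际系统中,“升级给人类”与“合并变更”也会编码进这个循环的出口分支。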

人类可以评审 Pull Request,但并非必须。随着时间推移,我们把几乎全部的评审工作都推向了 Agent 与 Agent 之间完成。

提升应用的可读性

随着代码吞吐量提升,我们的瓶颈变成了人类 QA 的容量。由于硬约束始终是人类的时间与注意力,我们努力通过让应用的 UI、日志与应用指标本身对 Codex 直接“可读”,来为 Agent 增加更多能力。

例如,我们让应用可以按 git worktree 启动,这样 Codex 能够针对每一次变更各自启动并驱动一个实例。我们还把 Chrome DevTools Protocol 接入到 Agent 运行时,并创建了用于处理 DOM 快照、截图与导航的技能。这让 Codex 能够复现 bug、验证修复,并直接推理 UI 行为。

我们也对可观测性工具做了同样的事。日志、指标与追踪通过一个本地可观测性栈暴露给 Codex;对于任意 worktree,这个栈都是临时的。Codex 在应用的一个完全隔离版本上工作——包括该应用的日志与指标;一旦任务完成,这些都会被拆除。Agent 可以用 LogQL 查询日志、用 PromQL 查询指标。有了这些上下文之后,“确保服务启动在 800ms 内完成”或“这四条关键用户路径中的任何一个 span 都不超过两秒”之类的提示就变得可行。
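
文中这类提示,最终会落到 Agent 能机械验证的查询上。下面给出两条示意查询,其中的标签名(app、worktree)与指标名(startup_duration_seconds_bucket)均为假设,仅展示 LogQL 与 PromQL 断言的一般形式:

```
# LogQL:查询某个 worktree 实例最近的错误日志(标签名为假设)
{app="demo", worktree="feature-x"} |= "error"

# PromQL:检查服务启动耗时的 p95 是否低于 800ms(指标名为假设)
histogram_quantile(0.95, sum by (le) (rate(startup_duration_seconds_bucket[5m]))) < 0.8
```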

我们经常看到单次 Codex 运行在一个任务上连续工作超过六小时(往往发生在人类睡觉的时候)。

我们把仓库知识作为权威记录

在让 Agent 能够高效处理大规模复杂任务时,上下文管理是最大的挑战之一。我们最早学到的一条经验很简单:给 Codex 一张地图,而不是一本 1,000 页的说明书。

我们尝试过“一个巨大的 AGENTS.md”方案。它以可预期的方式失败了:

  • 上下文是一种稀缺资源。巨大的说明文件会挤占任务、代码与相关文档的空间——于是 Agent 要么漏掉关键约束,要么开始优化错误的目标。

  • 指导过多等于不指导。当一切都“重要”时,就没有什么重要。Agent 最终会在局部做模式匹配,而不是有意识地导航。

  • 它会立刻腐烂。单体手册会变成陈旧规则的坟场。Agent 无法判断哪些还有效,人类也不再维护,于是文件悄无声息地变成一个“看起来很有用、实际上很误导”的陷阱。

  • 难以验证。一个大块文本不利于做机械化检查(覆盖度、时效性、归属、交叉链接),因此漂移不可避免。

所以,我们不把 AGENTS.md 当作百科全书,而是把它当作目录。

仓库的知识库放在结构化的 docs/ 目录中,并被视为权威记录。一个简短的 AGENTS.md(约 100 行)会被注入到上下文里,主要作为地图使用,指向其他更深层的真实来源。

AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md

仓库内知识库的目录布局。

设计文档被编目并建立索引,其中包含验证状态,以及定义“Agent 优先”操作原则的一组核心信念。架构文档提供了域划分与包分层的顶层地图。一份质量文档会为每个产品域与架构层打分,并随时间跟踪缺口。

计划被视为一等产物。小改动使用临时、轻量的计划;复杂工作则记录在带有进度与决策日志、并提交进仓库的 执行计划 中。活跃计划、已完成计划与已知技术债会被统一版本化并放在一起,使 Agent 不必依赖外部上下文也能运转。

这实现了“渐进式披露”:Agent 从一个小而稳定的入口开始,被教会下一步该去哪里找,而不是一开始就被信息淹没。

我们用机械化方式强制执行这些约束。专用的 linter 与 CI 任务会验证知识库是否保持最新、是否有正确的交叉链接与结构。一个周期性运行的“文档园艺”Agent 会扫描那些陈旧或过时、与真实代码行为不一致的文档,并发起修复用的 Pull Request。
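
“机械化强制执行”可以具体到很小的检查。下面用 TypeScript 写一个假设性的示意:验证 docs/ 下的每个 Markdown 文件都被某个索引文件交叉链接到,未被引用的文件即视为漂移候选。目录结构与判定规则均为假设,并非原文的实际实现:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// 返回 docs/ 下未被任何索引文件引用的 Markdown 文件列表
function findUnlinkedDocs(docsDir: string, indexFiles: string[]): string[] {
  // 把所有索引文件内容拼接成一个待搜索的大字符串
  const indexed = indexFiles.map((f) => fs.readFileSync(f, "utf8")).join("\n");
  const unlinked: string[] = [];
  for (const entry of fs.readdirSync(docsDir, { recursive: true, encoding: "utf8" })) {
    if (!entry.endsWith(".md")) continue;
    const name = path.basename(entry);
    if (!indexed.includes(name)) unlinked.push(entry); // 未被引用,视为漂移候选
  }
  return unlinked;
}
```

这类检查跑在 CI 里只负责“报警”;真正的修复由文中提到的“文档园艺”Agent 发 Pull Request 完成。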

目标是让 Agent 可读

随着代码库演进,Codex 用于做设计决策的框架也必须随之演进。

因为整个仓库完全由 Agent 生成,所以它首先针对 Codex 的可读性进行优化。就像团队会为了新入职工程师提升代码的可导航性一样,我们人类工程师的目标,是让 Agent 能够仅凭仓库本身就直接推理完整的业务域。

从 Agent 的视角看,它在运行时无法在上下文中访问到的东西,实际上就等于不存在。存在于 Google Docs、聊天线程或人脑里的知识,都无法被系统访问。它能看到的只有仓库本地且已版本化的产物(例如代码、Markdown、schema、可执行计划)。

我们意识到,随着时间推移,必须把越来越多的上下文推入仓库。那场在 Slack 上把团队对齐到某种架构模式的讨论?如果 Agent 无法发现它,它就不可读——就像三个月后入职的新同事也同样不会知道一样。

为 Codex 提供更多上下文,意味着要组织并暴露“恰到好处”的信息,让 Agent 能够基于这些信息推理,而不是用零散的指令把它压垮。就像你会向新队友介绍产品原则、工程规范与团队文化(包括 emoji 偏好)一样,把这些信息给到 Agent,会让输出更一致、更贴合。

这种视角澄清了许多取舍。我们倾向选择那些可以在仓库内被完全内化与推理的依赖与抽象。那些常被称作“无聊”的技术,往往因为可组合性、API 稳定性,以及在训练集中的充分表征,而更容易被 Agent 建模。在一些情况下,与其绕过公共库里不透明的上游行为,不如让 Agent 重新实现一部分功能更便宜。比如,我们没有引入通用的 p-limit 之类包,而是实现了自己的 map-with-concurrency 辅助函数:它与我们的 OpenTelemetry 埋点紧密集成,测试覆盖率 100%,行为也完全符合我们的运行时预期。
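
原文没有给出 map-with-concurrency 的实现。下面是按同样思路手写的一个最小 TypeScript 示意(不包含文中提到的 OpenTelemetry 集成,函数名与签名均为假设):

```typescript
// 带并发上限的 map:最多同时运行 limit 个异步任务,结果按原顺序返回
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // 每个 worker 不断领取下一个待处理下标,直到任务耗尽
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker()),
  );
  return results;
}
```

这里的取舍正如文中所说:比起引入不透明的通用依赖,一段可以被 Agent 完整检查与测试的仓库内实现,更便于集中维护不变量。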

把更多系统拉进一种 Agent 可以直接检查、验证与修改的形态,会提升杠杆——不仅对 Codex,对同样在这个代码库上工作的其他 Agent(例如 Aardvark)也一样。

强制架构与品味

仅靠文档并不足以让一个完全由 Agent 生成的代码库保持一致。通过强制不变量,而不是微观管理实现细节,我们让 Agent 能够快速交付,同时不破坏地基。例如,我们要求 Codex 在边界处解析数据形状,但并不规定具体怎么做(模型似乎很喜欢 Zod,但我们并没有指定必须用它)。
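
“在边界处解析数据形状”不依赖特定库也能表达。下面是一个不引入 Zod 的最小 TypeScript 示意,类型名与字段均为虚构:

```typescript
// 虚构的事件类型,仅用于演示边界校验
type UsageEvent = { userId: string; durationMs: number };

// 任何进入系统的 unknown 数据都先通过这道边界;
// 校验失败立刻抛错,避免 Agent 在臆测的字段形状上继续搭建。
function parseUsageEvent(raw: unknown): UsageEvent {
  if (typeof raw !== "object" || raw === null) throw new Error("payload 不是对象");
  const o = raw as Record<string, unknown>;
  if (typeof o.userId !== "string") throw new Error("userId 必须是 string");
  if (typeof o.durationMs !== "number") throw new Error("durationMs 必须是 number");
  return { userId: o.userId, durationMs: o.durationMs };
}
```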

Agent 在边界严格、结构可预测的环境里最有效,所以我们围绕一种刚性的架构模型来构建应用。每个业务域被划分为一组固定层次,并对依赖方向进行严格验证,同时限制可允许的边。我们通过自定义 linter(当然也是 Codex 生成的!)和结构化测试以机械化方式强制这些约束。

下图展示了规则:在每个业务域内(例如 App Settings),代码只能沿着固定的层次集合“向前”依赖(Types → Config → Repo → Service → Runtime → UI)。跨域关注点(auth、connectors、telemetry、feature flags)只能通过一个明确的单一接口进入:Providers。其他任何方式都不允许,并通过机械化手段强制执行。

这种架构通常要等到你拥有数百名工程师时才会去做。使用编码 Agent 时,它是早期的先决条件:这些约束让速度得以在不产生腐化或架构漂移的前提下持续成立。

在实践中,我们通过自定义 linter 与结构测试来执行这些规则,并辅以一小组“品味不变量”。例如,我们用自定义 lint 静态强制结构化日志、schema 与类型的命名约定、文件大小限制,以及针对平台的可靠性要求。因为 lint 是定制的,我们会把错误信息写得能够向 Agent 上下文注入整改指引。
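
这类自定义 lint 的关键点,是把整改指引直接写进报错信息。下面的 TypeScript 草图按原文给出的层次顺序(Types → Config → Repo → Service → Runtime → UI)做一个假设性的方向检查;规则名与报错文案均为假设,报错文字即是要注入 Agent 上下文的整改建议:

```typescript
// 层次顺序取自原文;同层或更靠前(更基础)的依赖视为“向前”,允许通过
const LAYERS = ["types", "config", "repo", "service", "runtime", "ui"] as const;
type Layer = (typeof LAYERS)[number];

// 返回 null 表示依赖合法;否则返回一条带整改指引的错误信息
function checkDependency(from: Layer, to: Layer): string | null {
  if (LAYERS.indexOf(to) <= LAYERS.indexOf(from)) return null; // 只允许依赖更基础的层
  return (
    `layer-direction: ${from} 层不允许依赖 ${to} 层。` +
    `整改建议:把共享逻辑下沉到 ${from} 可见的更基础层,` +
    `或改由 Providers 接口暴露该能力。`
  );
}
```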

在人类优先的工作流里,这些规则可能显得迂腐或束缚。但对 Agent 来说,它们会成为倍增器:一旦编码进系统,就能一次性作用于所有地方。

同时,我们也明确哪些约束重要、哪些不重要。这类似于带领一个大型工程平台组织:中心化强制边界,本地允许自治。你会非常在意边界、正确性与可复现性。在这些边界之内,你允许团队——或 Agent——在表达解决方案的方式上拥有很大的自由。

最终产出的代码不一定符合人类的审美偏好,这没关系。只要输出正确、可维护,并且对未来的 Agent 运行仍然可读,就达标。

人类品味会被持续反馈回系统。评审意见、重构用 Pull Request,以及面向用户的 bug,会被沉淀为文档更新或直接编码到工具里。当文档不足以约束时,我们就把规则提升为代码。

吞吐量改变合并哲学

随着 Codex 的吞吐量提升,许多传统的工程规范反而变得适得其反。

这个仓库以尽可能少的阻塞性合并门禁运行。Pull Request 生命周期很短。测试的偶发性失败往往通过后续运行来处理,而不是无限期地阻塞推进。在一个 Agent 吞吐量远高于人类注意力的系统里,修正很便宜,而等待很昂贵。

在低吞吐量环境里,这样做会很不负责任。但在这里,它往往是正确的权衡。

“Agent 生成”到底意味着什么

当我们说代码库由 Codex Agent 生成,我们指的是代码库里的所有东西。

Agent 会产出:

  • 产品代码与测试

  • CI 配置与发布工具

  • 内部开发者工具

  • 文档与设计历史

  • 评估 harness

  • 评审意见与回复

  • 管理仓库本身的脚本

  • 生产环境仪表盘的定义文件

人类始终在环,但工作在一个与过去不同的抽象层。我们负责确定优先级、把用户反馈翻译成验收标准,并验证结果。当 Agent 卡住时,我们把它当作信号:找出缺失的东西——工具、护栏、文档——并反馈回仓库,而且始终让 Codex 自己把修复写出来。

Agent 会直接使用我们的标准开发工具:拉取评审反馈、在行内回复、推送更新,并且常常自己 squash 并合并自己的 Pull Request。

不断提高的自主程度

随着越来越多的开发闭环被直接编码进系统——测试、验证、评审、反馈处理与恢复——这个仓库最近跨过了一个重要阈值:Codex 已经能够端到端驱动一个新功能的交付。

给它一个提示词,Agent 现在就能:

  • 验证代码库的当前状态

  • 复现一个已报告的 bug

  • 录制一段视频展示失败

  • 实现修复

  • 通过驱动应用来验证修复

  • 再录制第二段视频展示修复结果

  • 打开一个 Pull Request

  • 响应 Agent 与人类反馈

  • 检测并修复构建失败

  • 只有在需要判断力时才升级给人类

  • 合并变更

这种行为高度依赖这个仓库的具体结构与工具链;在没有类似投入的情况下,不应假设它能直接泛化——至少现在还不行。

熵与垃圾回收

完全的 Agent 自主也带来了新的问题。Codex 会复刻仓库里已经存在的模式——哪怕这些模式并不均衡或并不最优。随着时间推移,这不可避免地导致漂移。

起初,人类靠手动处理。我们的团队过去每周五都会花一天(占一周的 20%)去清理“AI 糊弄产物”。不出所料,这无法扩展。

于是,我们开始把所谓的“黄金原则”直接编码进仓库,并建立一个周期性的清理流程。这些原则是带有主观倾向、但可机械执行的规则,用于让代码库在未来的 Agent 运行中保持可读与一致。例如:(1) 我们更偏好共享的工具包,而不是手写小助手函数,以便把不变量集中化;(2) 我们不以 “YOLO” 的方式探测数据——要么验证边界,要么依赖带类型的 SDK,这样 Agent 就不会不小心在臆测的字段形状上继续搭建。按照固定节奏,我们会运行一组后台 Codex 任务来扫描偏离、更新质量评分,并打开有针对性的重构 Pull Request。大多数 PR 在一分钟内就能审完并自动合并。

这就像垃圾回收。技术债就像高利贷:几乎总是持续以小步快跑的方式偿还更划算,而不是任其复利累积、最后痛苦爆发式清偿。人类品味只需要捕捉一次,然后就能在每一行代码上持续强制执行。这也让我们能够每天发现并修复坏模式,而不是让它们在代码库里扩散数天甚至数周。

我们仍在学习什么

到目前为止,这套策略在 OpenAI 的内部发布与采用阶段表现良好。为真实用户构建真实产品,帮助我们把投入锚定在现实中,并引导我们走向长期可维护性。

我们尚不清楚的是:在一个完全由 Agent 生成的系统里,架构一致性会在多年尺度上如何演化。我们仍在学习人类判断力在哪些地方最能产生杠杆,以及如何把这种判断编码进去,让它产生复利效应。我们也不知道,随着模型能力在未来持续提升,这套系统会如何演进。

越来越清晰的一点是:构建软件依然需要纪律,但这种纪律更多体现在脚手架上,而不是代码上。用来保持代码库一致性的工具、抽象与反馈回路,正变得越来越重要。

我们当下最难的挑战,集中在设计环境、反馈回路与控制系统上,以帮助 Agent 完成我们的目标:以规模化方式构建并维护复杂而可靠的软件。

随着像 Codex 这样的 Agent 接管软件生命周期中更大的一部分,这些问题会变得更重要。我们希望分享的这些早期经验,能帮助你判断该把精力投到哪里,从而你可以只管去构建。

作者

Ryan Lopopolo

致谢

特别感谢 Victor Zhu 和 Zach Brock 对本文的贡献,也感谢整个团队共同打造了这款新产品。

Harness engineering: leveraging Codex in an agent-first world

By Ryan Lopopolo, Member of the Technical Staff

Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.

The product has internal daily users and external alpha testers. It ships, deploys, breaks, and gets fixed. What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.

Humans steer. Agents execute.

We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude. We had weeks to ship what ended up being a million lines of code. To do that, we needed to understand what changes when a software engineering team’s primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work.

This post is about what we learned by building a brand new product with a team of agents—what broke, what compounded, and how to maximize our one truly scarce resource: human time and attention.

Harness 工程:在 Agent 优先的世界中发挥 Codex 的力量

作者:Ryan Lopopolo,技术人员(Member of the Technical Staff)

过去五个月里,我们的团队一直在做一个实验:在 0 行手写代码的前提下,构建并发布一款软件产品的内部 Beta。

这款产品有内部的日活用户,也有外部的 Alpha 测试者。它会发布、部署、出故障、再被修复。不同之处在于,每一行代码——应用逻辑、测试、CI 配置、文档、可观测性,以及内部工具——都由 Codex 写成。我们估算,完成这些工作的时间大约只有手写代码所需时间的 1/10。

人类掌舵,Agent 执行。

我们刻意选择这一约束,是为了迫使自己只去构建那些能让工程效率提升几个数量级所必需的东西。我们只有几周时间,却最终交付了接近百万行代码。要做到这一点,我们必须理解:当一个软件工程团队的首要工作不再是写代码,而是设计环境、明确意图,并建立让 Codex Agent 可靠工作的反馈回路时,一切会发生怎样的变化。

这篇文章总结了我们与一支 Agent 团队一起从零打造全新产品时学到的东西——哪些地方坏了、哪些问题会叠加,以及如何最大化我们唯一真正稀缺的资源:人类的时间与注意力。

We started with an empty git repository

The first commit to an empty repository landed in late August 2025.

The initial scaffold—repository structure, CI configuration, formatting rules, package manager setup, and application framework—was generated by Codex CLI using GPT‑5, guided by a small set of existing templates. Even the initial AGENTS.md file that directs agents how to work in the repository was itself written by Codex.

There was no pre-existing human-written code to anchor the system. From the beginning, the repository was shaped by the agent.

Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has increased as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.

Throughout the development process, humans never directly contributed any code. This became a core philosophy for the team: no manually-written code.

我们从一个空的 git 仓库开始

对一个空仓库的第一次提交发生在 2025 年 8 月下旬。

最初的脚手架——仓库结构、CI 配置、格式化规则、包管理器设置与应用框架——由 Codex CLI 使用 GPT‑5 生成,并由少量现成模板提供引导。甚至连最初用来指示 Agent 如何在仓库中工作的 AGENTS.md 文件,也是 Codex 自己写的。

系统里没有任何既有的人类手写代码可以作为“锚点”。从一开始,仓库就由 Agent 塑造。

五个月后,这个仓库在应用逻辑、基础设施、工具、文档与内部开发者工具等方面累积了近百万行代码。在这段时间里,一个只有三名工程师驱动 Codex 的小团队打开并合并了约 1,500 个 Pull Request。折算下来,人均每天约 3.5 个 PR;更令人惊讶的是,随着团队扩展到如今的七名工程师,这个吞吐量还在提升。重要的是,这并非为了产出而产出:该产品已被数百名内部用户使用,其中包括每天都在使用的内部重度用户。

在整个开发过程中,人类从未直接贡献任何代码。这逐渐成为团队的核心理念:不写手工代码。

Redefining the role of the engineer

The lack of hands-on human coding introduced a different kind of engineering work, focused on systems, scaffolding, and leverage.

Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified. The agent lacked the tools, abstractions, and internal structure required to make progress toward high-level goals. The primary job of our engineering team became enabling the agents to do useful work.

In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test, etc), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never “try harder.” Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: “what capability is missing, and how do we make it both legible and enforceable for the agent?”

Humans interact with the system almost entirely through prompts: an engineer describes a task, runs the agent, and allows it to open a pull request. To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a Ralph Wiggum Loop). Codex uses our standard development tools directly (gh, local scripts, and repository-embedded skills) to gather context without humans copying and pasting into the CLI.

Humans may review pull requests, but aren’t required to. Over time, we’ve pushed almost all review effort towards being handled agent-to-agent.

重新定义工程师的角色

不再由人类亲自写代码,引入了一种不同的工程工作形态,更关注系统、脚手架与杠杆。

早期进展比我们预期更慢,并不是因为 Codex 不行,而是因为环境的规格说明不够完整。Agent 缺少朝高层目标推进所必需的工具、抽象与内部结构。我们工程团队的首要工作变成:让 Agent 能够做出有用的工作。

在实践中,这意味着以“深度优先”的方式推进:把更大的目标拆成更小的积木块(设计、编码、评审、测试等),提示 Agent 构建这些积木,再用它们去解锁更复杂的任务。当某件事失败时,补救几乎从来不是“再努力一点”。因为推进的唯一方式就是让 Codex 去完成工作,所以人类工程师总会介入并追问:“缺失的能力是什么?我们怎样让它对 Agent 来说既可读、又可强制执行?”

人类几乎完全通过提示词与系统交互:工程师描述任务,运行 Agent,让它开一个 Pull Request。为了把一个 PR 推到完成,我们会指示 Codex 在本地自审改动、在本地与云端请求额外且具体的 Agent 评审、响应人类或 Agent 的反馈,并循环迭代直到所有 Agent 评审者满意(本质上这就是一个 Ralph Wiggum Loop)。Codex 会直接使用我们的标准开发工具(gh、本地脚本与仓库内嵌技能)来收集上下文,无需人类把内容复制粘贴进 CLI。

人类可以评审 Pull Request,但并非必须。随着时间推移,我们把几乎全部的评审工作都推向了 Agent 与 Agent 之间完成。

Increasing application legibility

As code throughput increased, our bottleneck became human QA capacity. Because the fixed constraint has been human time and attention, we’ve worked to add more capabilities to the agent by making things like the application UI, logs, and app metrics themselves directly legible to Codex.

For example, we made the app bootable per git worktree, so Codex could launch and drive one instance per change. We also wired the Chrome DevTools Protocol into the agent runtime and created skills for working with DOM snapshots, screenshots, and navigation. This enabled Codex to reproduce bugs, validate fixes, and reason about UI behavior directly.

We did the same for observability tooling. Logs, metrics, and traces are exposed to Codex via a local observability stack that’s ephemeral for any given worktree. Codex works on a fully isolated version of that app—including its logs and metrics, which get torn down once that task is complete. Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like “ensure service startup completes in under 800ms” or “no span in these four critical user journeys exceeds two seconds” become tractable.

We regularly see single Codex runs work on a single task for upwards of six hours (often while the humans are sleeping).

提升应用的可读性

随着代码吞吐量提升,我们的瓶颈变成了人类 QA 的容量。由于硬约束始终是人类的时间与注意力,我们努力通过让应用的 UI、日志与应用指标本身对 Codex 直接“可读”,来为 Agent 增加更多能力。

例如,我们让应用可以按 git worktree 启动,这样 Codex 能够针对每一次变更各自启动并驱动一个实例。我们还把 Chrome DevTools Protocol 接入到 Agent 运行时,并创建了用于处理 DOM 快照、截图与导航的技能。这让 Codex 能够复现 bug、验证修复,并直接推理 UI 行为。

我们也对可观测性工具做了同样的事。日志、指标与追踪通过一个本地可观测性栈暴露给 Codex;对于任意 worktree,这个栈都是临时的。Codex 在应用的一个完全隔离版本上工作——包括该应用的日志与指标;一旦任务完成,这些都会被拆除。Agent 可以用 LogQL 查询日志、用 PromQL 查询指标。有了这些上下文之后,“确保服务启动在 800ms 内完成”或“这四条关键用户路径中的任何一个 span 都不超过两秒”之类的提示就变得可行。

我们经常看到单次 Codex 运行在一个任务上连续工作超过六小时(往往发生在人类睡觉的时候)。

We made repository knowledge the system of record

Context management is one of the biggest challenges in making agents effective at large and complex tasks. One of the earliest lessons we learned was simple: give Codex a map, not a 1,000-page instruction manual.

We tried the “one big AGENTS.md” approach. It failed in predictable ways:

  • Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs—so the agent either misses key constraints or starts optimizing for the wrong ones.

  • Too much guidance becomes non-guidance. When everything is “important,” nothing is. Agents end up pattern-matching locally instead of navigating intentionally.

  • It rots instantly. A monolithic manual turns into a graveyard of stale rules. Agents can’t tell what’s still true, humans stop maintaining it, and the file quietly becomes an attractive nuisance.

  • It’s hard to verify. A single blob doesn’t lend itself to mechanical checks (coverage, freshness, ownership, cross-links), so drift is inevitable.

So instead of treating AGENTS.md as the encyclopedia, we treat it as the table of contents.

The repository’s knowledge base lives in a structured docs/ directory treated as the system of record. A short AGENTS.md (roughly 100 lines) is injected into context and serves primarily as a map, with pointers to deeper sources of truth elsewhere.

AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md

In-repository knowledge store layout.

Design documentation is catalogued and indexed, including verification status and a set of core beliefs that define agent-first operating principles. Architecture documentation provides a top-level map of domains and package layering. A quality document grades each product domain and architectural layer, tracking gaps over time.

Plans are treated as first-class artifacts. Ephemeral lightweight plans are used for small changes, while complex work is captured in execution plans with progress and decision logs that are checked into the repository. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.

This enables progressive disclosure: agents start with a small, stable entry point and are taught where to look next, rather than being overwhelmed up front.

We enforce this mechanically. Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring “doc-gardening” agent scans for stale or obsolete documentation that does not reflect the real code behavior and opens fix-up pull requests.

我们把仓库知识作为权威记录

在让 Agent 能够高效处理大规模复杂任务时,上下文管理是最大的挑战之一。我们最早学到的一条经验很简单:给 Codex 一张地图,而不是一本 1,000 页的说明书。

我们尝试过“一个巨大的 AGENTS.md”方案。它以可预期的方式失败了:

  • 上下文是一种稀缺资源。巨大的说明文件会挤占任务、代码与相关文档的空间——于是 Agent 要么漏掉关键约束,要么开始优化错误的目标。

  • 指导过多等于不指导。当一切都“重要”时,就没有什么重要。Agent 最终会在局部做模式匹配,而不是有意识地导航。

  • 它会立刻腐烂。单体手册会变成陈旧规则的坟场。Agent 无法判断哪些还有效,人类也不再维护,于是文件悄无声息地变成一个“看起来很有用、实际上很误导”的陷阱。

  • 难以验证。一个大块文本不利于做机械化检查(覆盖度、时效性、归属、交叉链接),因此漂移不可避免。

所以,我们不把 AGENTS.md 当作百科全书,而是把它当作目录。

仓库的知识库放在结构化的 docs/ 目录中,并被视为权威记录。一个简短的 AGENTS.md(约 100 行)会被注入到上下文里,主要作为地图使用,指向其他更深层的真实来源。

AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md

仓库内知识库的目录布局。

设计文档被编目并建立索引,其中包含验证状态,以及定义“Agent 优先”操作原则的一组核心信念。架构文档提供了域划分与包分层的顶层地图。一份质量文档会为每个产品域与架构层打分,并随时间跟踪缺口。

计划被视为一等产物。小改动使用临时、轻量的计划;复杂工作则记录在带有进度与决策日志、并提交进仓库的 执行计划 中。活跃计划、已完成计划与已知技术债会被统一版本化并放在一起,使 Agent 不必依赖外部上下文也能运转。

这实现了“渐进式披露”:Agent 从一个小而稳定的入口开始,被教会下一步该去哪里找,而不是一开始就被信息淹没。

我们用机械化方式强制执行这些约束。专用的 linter 与 CI 任务会验证知识库是否保持最新、是否有正确的交叉链接与结构。一个周期性运行的“文档园艺”Agent 会扫描那些陈旧或过时、与真实代码行为不一致的文档,并发起修复用的 Pull Request。

Agent legibility is the goal

As the codebase evolved, Codex’s framework for design decisions needed to evolve, too.

Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility. In the same way teams aim to improve navigability of their code for new engineering hires, our human engineers’ goal was making it possible for an agent to reason about the full business domain directly from the repository itself.

From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist. Knowledge that lives in Google Docs, chat threads, or people’s heads is not accessible to the system. Repository-local, versioned artifacts (e.g., code, markdown, schemas, executable plans) are all it can see.

We learned that we needed to push more and more context into the repo over time. That Slack discussion that aligned the team on an architectural pattern? If it isn’t discoverable to the agent, it’s illegible in the same way it would be unknown to a new hire joining three months later.

Giving Codex more context means organizing and exposing the right information so the agent can reason over it, rather than overwhelming it with ad-hoc instructions. In the same way you would onboard a new teammate on product principles, engineering norms, and team culture (emoji preferences included), giving the agent this information leads to better-aligned output.

This framing clarified many tradeoffs. We favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as “boring” tend to be easier for agents to model due to composability, api stability, and representation in the training set. In some cases, it was cheaper to have the agent reimplement subsets of functionality than to work around opaque upstream behavior from public libraries. For example, rather than pulling in a generic p-limit-style package, we implemented our own map-with-concurrency helper: it’s tightly integrated with our OpenTelemetry instrumentation, has 100% test coverage, and behaves exactly the way our runtime expects.

Pulling more of the system into a form the agent can inspect, validate, and modify directly increases leverage—not just for Codex, but for other agents (e.g. Aardvark) that are working on the codebase as well.

目标是让 Agent 可读

随着代码库演进,Codex 用于做设计决策的框架也必须随之演进。

因为整个仓库完全由 Agent 生成,所以它首先针对 Codex 的可读性进行优化。就像团队会为了新入职工程师提升代码的可导航性一样,我们人类工程师的目标,是让 Agent 能够仅凭仓库本身就直接推理完整的业务域。

从 Agent 的视角看,它在运行时无法在上下文中访问到的东西,实际上就等于不存在。存在于 Google Docs、聊天线程或人脑里的知识,都无法被系统访问。它能看到的只有仓库本地且已版本化的产物(例如代码、Markdown、schema、可执行计划)。

我们意识到,随着时间推移,必须把越来越多的上下文推入仓库。那场在 Slack 上把团队对齐到某种架构模式的讨论?如果 Agent 无法发现它,它就不可读——就像三个月后入职的新同事也同样不会知道一样。

为 Codex 提供更多上下文,意味着要组织并暴露“恰到好处”的信息,让 Agent 能够基于这些信息推理,而不是用零散的指令把它压垮。就像你会向新队友介绍产品原则、工程规范与团队文化(包括 emoji 偏好)一样,把这些信息给到 Agent,会让输出更一致、更贴合。

这种视角澄清了许多取舍。我们倾向选择那些可以在仓库内被完全内化与推理的依赖与抽象。那些常被称作“无聊”的技术,往往因为可组合性、API 稳定性,以及在训练集中的充分表征,而更容易被 Agent 建模。在一些情况下,与其绕过公共库里不透明的上游行为,不如让 Agent 重新实现一部分功能更便宜。比如,我们没有引入通用的 p-limit 之类包,而是实现了自己的 map-with-concurrency 辅助函数:它与我们的 OpenTelemetry 埋点紧密集成,测试覆盖率 100%,行为也完全符合我们的运行时预期。

把更多系统拉进一种 Agent 可以直接检查、验证与修改的形态,会提升杠杆——不仅对 Codex,对同样在这个代码库上工作的其他 Agent(例如 Aardvark)也一样。

Enforcing architecture and taste

Documentation alone doesn’t keep a fully agent-generated codebase coherent. By enforcing invariants, not micromanaging implementations, we let agents ship fast without undermining the foundation. For example, we require Codex to parse data shapes at the boundary, but are not prescriptive on how that happens (the model seems to like Zod, but we didn’t specify that specific library).

Agents are most effective in environments with strict boundaries and predictable structure, so we built the application around a rigid architectural model. Each business domain is divided into a fixed set of layers, with strictly validated dependency directions and a limited set of permissible edges. These constraints are enforced mechanically via custom linters (Codex-generated, of course!) and structural tests.

The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.

This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.

In practice, we enforce these rules with custom linters and structural tests, plus a small set of “taste invariants.” For example, we statically enforce structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements with custom lints. Because the lints are custom, we write the error messages to inject remediation instructions into agent context.

In a human-first workflow, these rules might feel pedantic or constraining. With agents, they become multipliers: once encoded, they apply everywhere at once.

At the same time, we’re explicit about where constraints matter and where they do not. This resembles leading a large engineering platform organization: enforce boundaries centrally, allow autonomy locally. You care deeply about boundaries, correctness, and reproducibility. Within those boundaries, you allow teams—or agents—significant freedom in how solutions are expressed.

The resulting code does not always match human stylistic preferences, and that’s okay. As long as the output is correct, maintainable, and legible to future agent runs, it meets the bar.

Human taste is fed back into the system continuously. Review comments, refactoring pull requests, and user-facing bugs are captured as documentation updates or encoded directly into tooling. When documentation falls short, we promote the rule into code.

强制架构与品味

仅靠文档并不足以让一个完全由 Agent 生成的代码库保持一致。通过强制不变量,而不是微观管理实现细节,我们让 Agent 能够快速交付,同时不破坏地基。例如,我们要求 Codex 在边界处解析数据形状,但并不规定具体怎么做(模型似乎很喜欢 Zod,但我们并没有指定必须用它)。

Agent 在边界严格、结构可预测的环境里最有效,所以我们围绕一种刚性的架构模型来构建应用。每个业务域被划分为一组固定层次,并对依赖方向进行严格验证,同时限制可允许的边。我们通过自定义 linter(当然也是 Codex 生成的!)和结构化测试以机械化方式强制这些约束。

下图展示了规则:在每个业务域内(例如 App Settings),代码只能沿着固定的层次集合“向前”依赖(Types → Config → Repo → Service → Runtime → UI)。跨域关注点(auth、connectors、telemetry、feature flags)只能通过一个明确的单一接口进入:Providers。其他任何方式都不允许,并通过机械化手段强制执行。

这种架构通常要等到你拥有数百名工程师时才会去做。使用编码 Agent 时,它是早期的先决条件:这些约束让速度得以在不产生腐化或架构漂移的前提下持续成立。

在实践中,我们通过自定义 linter 与结构测试来执行这些规则,并辅以一小组“品味不变量”。例如,我们用自定义 lint 静态强制结构化日志、schema 与类型的命名约定、文件大小限制,以及针对平台的可靠性要求。因为 lint 是定制的,我们会把错误信息写得能够向 Agent 上下文注入整改指引。

在人类优先的工作流里,这些规则可能显得迂腐或束缚。但对 Agent 来说,它们会成为倍增器:一旦编码进系统,就能一次性作用于所有地方。

同时,我们也明确哪些约束重要、哪些不重要。这类似于带领一个大型工程平台组织:中心化强制边界,本地允许自治。你会非常在意边界、正确性与可复现性。在这些边界之内,你允许团队——或 Agent——在表达解决方案的方式上拥有很大的自由。

最终产出的代码不一定符合人类的审美偏好,这没关系。只要输出正确、可维护,并且对未来的 Agent 运行仍然可读,就达标。

人类品味会被持续反馈回系统。评审意见、重构用 Pull Request,以及面向用户的 bug,会被沉淀为文档更新或直接编码到工具里。当文档不足以约束时,我们就把规则提升为代码。

Throughput changes the merge philosophy

As Codex’s throughput increased, many conventional engineering norms became counterproductive.

The repository operates with minimal blocking merge gates. Pull requests are short-lived. Test flakes are often addressed with follow-up runs rather than blocking progress indefinitely. In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive.

This would be irresponsible in a low-throughput environment. Here, it’s often the right tradeoff.

吞吐量改变合并哲学

随着 Codex 的吞吐量提升,许多传统的工程规范反而变得适得其反。

这个仓库以尽可能少的阻塞性合并门禁运行。Pull Request 生命周期很短。测试的偶发性失败往往通过后续运行来处理,而不是无限期地阻塞推进。在一个 Agent 吞吐量远高于人类注意力的系统里,修正很便宜,而等待很昂贵。

在低吞吐量环境里,这样做会很不负责任。但在这里,它往往是正确的权衡。

What “agent-generated” actually means

When we say the codebase is generated by Codex agents, we mean everything in the codebase.

Agents produce:

  • Product code and tests

  • CI configuration and release tooling

  • Internal developer tools

  • Documentation and design history

  • Evaluation harnesses

  • Review comments and responses

  • Scripts that manage the repository itself

  • Production dashboard definition files

Humans always remain in the loop, but work at a different layer of abstraction than we used to. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, we treat it as a signal: identify what is missing—tools, guardrails, documentation—and feed it back into the repository, always by having Codex itself write the fix.

Agents use our standard development tools directly. They pull review feedback, respond inline, push updates, and often squash and merge their own pull requests.

“Agent 生成”到底意味着什么

当我们说代码库由 Codex Agent 生成,我们指的是代码库里的所有东西。

Agent 会产出:

  • 产品代码与测试

  • CI 配置与发布工具

  • 内部开发者工具

  • 文档与设计历史

  • 评估 harness

  • 评审意见与回复

  • 管理仓库本身的脚本

  • 生产环境仪表盘的定义文件

人类始终在环,但工作在一个与过去不同的抽象层。我们负责确定优先级、把用户反馈翻译成验收标准,并验证结果。当 Agent 卡住时,我们把它当作信号:找出缺失的东西——工具、护栏、文档——并反馈回仓库,而且始终让 Codex 自己把修复写出来。

Agent 会直接使用我们的标准开发工具:拉取评审反馈、在行内回复、推送更新,并且常常自己 squash 并合并自己的 Pull Request。

Increasing levels of autonomy

As more of the development loop was encoded directly into the system—testing, validation, review, feedback handling, and recovery—the repository recently crossed a meaningful threshold where Codex can end-to-end drive a new feature.

Given a single prompt, the agent can now:

  • Validate the current state of the codebase

  • Reproduce a reported bug

  • Record a video demonstrating the failure

  • Implement a fix

  • Validate the fix by driving the application

  • Record a second video demonstrating the resolution

  • Open a pull request

  • Respond to agent and human feedback

  • Detect and remediate build failures

  • Escalate to a human only when judgment is required

  • Merge the change

This behavior depends heavily on the specific structure and tooling of this repository and should not be assumed to generalize without similar investment—at least, not yet.

不断提高的自主程度

随着越来越多的开发闭环被直接编码进系统——测试、验证、评审、反馈处理与恢复——这个仓库最近跨过了一个重要阈值:Codex 已经能够端到端驱动一个新功能的交付。

给它一个提示词,Agent 现在就能:

  • 验证代码库的当前状态

  • 复现一个已报告的 bug

  • 录制一段视频展示失败

  • 实现修复

  • 通过驱动应用来验证修复

  • 再录制第二段视频展示修复结果

  • 打开一个 Pull Request

  • 响应 Agent 与人类反馈

  • 检测并修复构建失败

  • 只有在需要判断力时才升级给人类

  • 合并变更

这种行为高度依赖这个仓库的具体结构与工具链;在没有类似投入的情况下,不应假设它能直接泛化——至少现在还不行。

Entropy and garbage collection

Full agent autonomy also introduces novel problems. Codex replicates patterns that already exist in the repository—even uneven or suboptimal ones. Over time, this inevitably leads to drift.

Initially, humans addressed this manually. Our team used to spend every Friday (20% of the week) cleaning up “AI slop.” Unsurprisingly, that didn’t scale.

Instead, we started encoding what we call “golden principles” directly into the repository and built a recurring cleanup process. These principles are opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs. For example: (1) we prefer shared utility packages over hand-rolled helpers to keep invariants centralized, and (2) we don’t probe data “YOLO-style”—we validate boundaries or rely on typed SDKs so the agent can’t accidentally build on guessed shapes. On a regular cadence, we have a set of background Codex tasks that scan for deviations, update quality grades, and open targeted refactoring pull requests. Most of these can be reviewed in under a minute and automerged.

This functions like garbage collection. Technical debt is like a high-interest loan: it’s almost always better to pay it down continuously in small increments than to let it compound and tackle it in painful bursts. Human taste is captured once, then enforced continuously on every line of code. This also lets us catch and resolve bad patterns on a daily basis, rather than letting them spread in the codebase for days or weeks.

熵与垃圾回收

完全的 Agent 自主也带来了新的问题。Codex 会复刻仓库里已经存在的模式——哪怕这些模式并不均衡或并不最优。随着时间推移,这不可避免地导致漂移。

起初,人类靠手动处理。我们的团队过去每周五都会花一天(占一周的 20%)去清理“AI 糊弄产物”。不出所料,这无法扩展。

于是,我们开始把所谓的“黄金原则”直接编码进仓库,并建立一个周期性的清理流程。这些原则是带有主观倾向、但可机械执行的规则,用于让代码库在未来的 Agent 运行中保持可读与一致。例如:(1) 我们更偏好共享的工具包,而不是手写小助手函数,以便把不变量集中化;(2) 我们不以 “YOLO” 的方式探测数据——要么验证边界,要么依赖带类型的 SDK,这样 Agent 就不会不小心在臆测的字段形状上继续搭建。按照固定节奏,我们会运行一组后台 Codex 任务来扫描偏离、更新质量评分,并打开有针对性的重构 Pull Request。大多数 PR 在一分钟内就能审完并自动合并。

这就像垃圾回收。技术债就像高利贷:几乎总是持续以小步快跑的方式偿还更划算,而不是任其复利累积、最后痛苦爆发式清偿。人类品味只需要捕捉一次,然后就能在每一行代码上持续强制执行。这也让我们能够每天发现并修复坏模式,而不是让它们在代码库里扩散数天甚至数周。

What we’re still learning

This strategy has so far worked well up through internal launch and adoption at OpenAI. Building a real product for real users helped anchor our investments in reality and guide us towards long-term maintainability.

What we don’t yet know is how architectural coherence evolves over years in a fully agent-generated system. We’re still learning where human judgment adds the most leverage and how to encode that judgment so it compounds. We also don’t know how this system will evolve as models continue to become more capable over time.

What’s become clear: building software still demands discipline, but the discipline shows up more in the scaffolding than in the code. The tooling, abstractions, and feedback loops that keep the codebase coherent are increasingly important.

Our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal: build and maintain complex, reliable software at scale.

As agents like Codex take on larger portions of the software lifecycle, these questions will matter even more. We hope that sharing some early lessons helps you reason about where to invest your effort so you can just build things.

我们仍在学习什么

到目前为止,这套策略在 OpenAI 的内部发布与采用阶段表现良好。为真实用户构建真实产品,帮助我们把投入锚定在现实中,并引导我们走向长期可维护性。

我们尚不清楚的是:在一个完全由 Agent 生成的系统里,架构一致性会在多年尺度上如何演化。我们仍在学习人类判断力在哪些地方最能产生杠杆,以及如何把这种判断编码进去,让它产生复利效应。我们也不知道,随着模型能力在未来持续提升,这套系统会如何演进。

越来越清晰的一点是:构建软件依然需要纪律,但这种纪律更多体现在脚手架上,而不是代码上。用来保持代码库一致性的工具、抽象与反馈回路,正变得越来越重要。

我们当下最难的挑战,集中在设计环境、反馈回路与控制系统上,以帮助 Agent 完成我们的目标:以规模化方式构建并维护复杂而可靠的软件。

随着像 Codex 这样的 Agent 接管软件生命周期中更大的一部分,这些问题会变得更重要。我们希望分享的这些早期经验,能帮助你判断该把精力投到哪里,从而你可以只管去构建。

Acknowledgements

Special thanks to Victor Zhu and Zach Brock who contributed to the post, as well as to the entire team that built this new product.

致谢

特别感谢 Victor Zhu 和 Zach Brock 对本文的贡献,也感谢整个团队共同打造了这款新产品。

Harness engineering: leveraging Codex in an agent-first world

By Ryan Lopopolo, Member of the Technical Staff

Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.

The product has internal daily users and external alpha testers. It ships, deploys, breaks, and gets fixed. What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.

Humans steer. Agents execute.

We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude. We had weeks to ship what ended up being a million lines of code. To do that, we needed to understand what changes when a software engineering team’s primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work.

This post is about what we learned by building a brand new product with a team of agents—what broke, what compounded, and how to maximize our one truly scarce resource: human time and attention.

We started with an empty git repository

The first commit to an empty repository landed in late August 2025.

The initial scaffold—repository structure, CI configuration, formatting rules, package manager setup, and application framework—was generated by Codex CLI using GPT‑5, guided by a small set of existing templates. Even the initial AGENTS.md file that directs agents how to work in the repository was itself written by Codex.

There was no pre-existing human-written code to anchor the system. From the beginning, the repository was shaped by the agent.

Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged by a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and, surprisingly, throughput has increased as the team has since grown to seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.

Throughout the development process, humans never directly contributed any code. This became a core philosophy for the team: no manually-written code.

Redefining the role of the engineer

The lack of hands-on human coding introduced a different kind of engineering work, focused on systems, scaffolding, and leverage.

Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified. The agent lacked the tools, abstractions, and internal structure required to make progress toward high-level goals. The primary job of our engineering team became enabling the agents to do useful work.

In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test, etc.), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never “try harder.” Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: “what capability is missing, and how do we make it both legible and enforceable for the agent?”

Humans interact with the system almost entirely through prompts: an engineer describes a task, runs the agent, and allows it to open a pull request. To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any feedback given by humans or agents, and iterate in a loop until all agent reviewers are satisfied (effectively this is a Ralph Wiggum Loop). Codex uses our standard development tools directly (gh, local scripts, and repository-embedded skills) to gather context without humans copying and pasting into the CLI.
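The iterate-until-reviewers-pass loop can be sketched as below. This is a hypothetical illustration, not the team's actual harness: `run_agent_turn` and `collect_review_findings` stand in for the real Codex and review tooling, and the simulated reviewers raise findings once before being satisfied.

```python
# Hypothetical sketch of the review loop described above. run_agent_turn
# and collect_review_findings stand in for the real Codex/review tooling;
# here the simulated reviewers raise findings once, then are satisfied.

def run_agent_turn(task: str, findings: list[str]) -> None:
    """Placeholder: ask the agent to implement, or address current findings."""

def collect_review_findings(turn: int) -> list[str]:
    """Placeholder: open review comments after a turn (simulated)."""
    return ["missing test", "inconsistent naming"] if turn == 0 else []

def drive_pr_to_completion(task: str, max_turns: int = 10) -> int:
    """Iterate implement -> review -> respond until reviewers pass."""
    findings: list[str] = []
    for turn in range(max_turns):
        run_agent_turn(task, findings)
        findings = collect_review_findings(turn)
        if not findings:      # all agent reviewers satisfied
            return turn + 1   # turns taken
    raise RuntimeError("escalate to a human: loop did not converge")

turns = drive_pr_to_completion("fix flaky startup test")  # converges in 2 turns
```

The key design point is the termination condition: the loop exits only on an empty findings list or by escalating to a human, never by silently giving up.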

Humans may review pull requests, but aren’t required to. Over time, we’ve pushed almost all review effort towards being handled agent-to-agent.

Increasing application legibility

As code throughput increased, our bottleneck became human QA capacity. Because the fixed constraint has been human time and attention, we’ve worked to add more capabilities to the agent by making things like the application UI, logs, and app metrics themselves directly legible to Codex.

For example, we made the app bootable per git worktree, so Codex could launch and drive one instance per change. We also wired the Chrome DevTools Protocol into the agent runtime and created skills for working with DOM snapshots, screenshots, and navigation. This enabled Codex to reproduce bugs, validate fixes, and reason about UI behavior directly.
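To give a flavor of what “wiring the Chrome DevTools Protocol into the agent runtime” involves at the lowest level, the sketch below frames CDP commands as the JSON messages the DevTools websocket expects. `Page.navigate` and `Page.captureScreenshot` are standard CDP methods; the websocket transport and response handling are omitted, and the URL is invented.

```python
import itertools
import json

# Each CDP command is a JSON object with a client-chosen "id" (used to
# match the eventual response), a "method", and method-specific "params".
_ids = itertools.count(1)

def cdp_command(method: str, **params) -> str:
    """Frame a CDP command as the JSON string the DevTools socket expects."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})

# A harness would send these over the browser's DevTools websocket:
nav = cdp_command("Page.navigate", url="http://localhost:3000")
shot = cdp_command("Page.captureScreenshot", format="png")
```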

We did the same for observability tooling. Logs, metrics, and traces are exposed to Codex via a local observability stack that’s ephemeral for any given worktree. Codex works on a fully isolated version of that app—including its logs and metrics, which get torn down once that task is complete. Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like “ensure service startup completes in under 800ms” or “no span in these four critical user journeys exceeds two seconds” become tractable.
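As a rough illustration of what such an observability skill might call, the sketch below builds request URLs for the standard Prometheus and Loki HTTP query APIs. The hosts, ports, metric names, and labels are invented for the example; only the endpoint paths are the standard ones.

```python
from urllib.parse import urlencode

# Prometheus instant queries go to /api/v1/query; Loki range queries go
# to /loki/api/v1/query_range. Everything else below is illustrative.

def prom_instant_query(base: str, promql: str) -> str:
    return f"{base}/api/v1/query?" + urlencode({"query": promql})

def loki_range_query(base: str, logql: str, limit: int = 100) -> str:
    return f"{base}/loki/api/v1/query_range?" + urlencode(
        {"query": logql, "limit": limit}
    )

# e.g. checking the "startup completes in under 800ms" prompt mechanically
# (hypothetical metric name):
url = prom_instant_query(
    "http://localhost:9090",
    "histogram_quantile(0.99, "
    "sum(rate(startup_duration_seconds_bucket[5m])) by (le)) < 0.8",
)
logs = loki_range_query("http://localhost:3100", '{app="web"} |= "error"')
```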

We regularly see a single Codex run work on one task for upwards of six hours (often while the humans are sleeping).

We made repository knowledge the system of record

Context management is one of the biggest challenges in making agents effective at large and complex tasks. One of the earliest lessons we learned was simple: give Codex a map, not a 1,000-page instruction manual.

We tried the “one big AGENTS.md” approach. It failed in predictable ways:

  • Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs—so the agent either misses key constraints or starts optimizing for the wrong ones.

  • Too much guidance becomes non-guidance. When everything is “important,” nothing is. Agents end up pattern-matching locally instead of navigating intentionally.

  • It rots instantly. A monolithic manual turns into a graveyard of stale rules. Agents can’t tell what’s still true, humans stop maintaining it, and the file quietly becomes an attractive nuisance.

  • It’s hard to verify. A single blob doesn’t lend itself to mechanical checks (coverage, freshness, ownership, cross-links), so drift is inevitable.

So instead of treating AGENTS.md as the encyclopedia, we treat it as the table of contents.

The repository’s knowledge base lives in a structured docs/ directory treated as the system of record. A short AGENTS.md (roughly 100 lines) is injected into context and serves primarily as a map, with pointers to deeper sources of truth elsewhere.

AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md

In-repository knowledge store layout.

Design documentation is catalogued and indexed, including verification status and a set of core beliefs that define agent-first operating principles. Architecture documentation provides a top-level map of domains and package layering. A quality document grades each product domain and architectural layer, tracking gaps over time.

Plans are treated as first-class artifacts. Ephemeral lightweight plans are used for small changes, while complex work is captured in execution plans with progress and decision logs that are checked into the repository. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.

This enables progressive disclosure: agents start with a small, stable entry point and are taught where to look next, rather than being overwhelmed up front.
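To make the “table of contents, not encyclopedia” idea concrete, here is a hypothetical sketch of what a map-style AGENTS.md could look like. The section names and pointers are illustrative, drawn from the directory layout above; the team’s actual file is not published.

```markdown
# AGENTS.md — start here, then follow the pointers

## Orientation
- Architecture map (domains, package layering): ARCHITECTURE.md
- Product principles: docs/PRODUCT_SENSE.md
- Design history and verification status: docs/design-docs/index.md

## Before you code
- Small change: write an ephemeral plan in the PR description.
- Large change: add an execution plan under docs/exec-plans/active/
  with a progress and decision log (format: docs/PLANS.md).
- Check known debt first: docs/exec-plans/tech-debt-tracker.md

## Hard rules (mechanically enforced by custom lints)
- Dependencies flow Types → Config → Repo → Service → Runtime → UI.
- Parse data shapes at the boundary; never build on guessed shapes.
- Use structured logging; see docs/RELIABILITY.md for requirements.
```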

We enforce this mechanically. Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring “doc-gardening” agent scans for stale or obsolete documentation that does not reflect the real code behavior and opens fix-up pull requests.
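A doc-gardening check in this spirit can be sketched as below. The rule shown (every markdown doc must be reachable from an index file or AGENTS.md) and the layout are simplified assumptions, not the team's real CI job, and the link matching is deliberately naive.

```python
import pathlib
import re
import tempfile

def unlinked_docs(root: pathlib.Path) -> list[str]:
    """Return docs not linked from any index.md or the root AGENTS.md."""
    indexes = list(root.rglob("index.md")) + [root / "AGENTS.md"]
    linked: set[str] = set()
    for idx in indexes:
        if idx.exists():
            # collect markdown link targets like [text](path/to/doc.md)
            linked |= set(re.findall(r"\]\(([^)]+\.md)\)", idx.read_text()))
    return sorted(
        p.name
        for p in root.rglob("*.md")
        if p.name not in {"index.md", "AGENTS.md"}
        and not any(link.endswith(p.name) for link in linked)
    )

# Demo on a throwaway layout:
root = pathlib.Path(tempfile.mkdtemp())
(root / "docs").mkdir()
(root / "AGENTS.md").write_text("See the [onboarding spec](docs/a.md).")
(root / "docs" / "a.md").write_text("linked")
(root / "docs" / "b.md").write_text("orphaned")
stale = unlinked_docs(root)  # ["b.md"]
```

A real version would also check freshness and ownership, and open a fix-up pull request instead of just reporting.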

Agent legibility is the goal

As the codebase evolved, Codex’s framework for design decisions needed to evolve, too.

Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility. In the same way teams aim to improve navigability of their code for new engineering hires, our human engineers’ goal was making it possible for an agent to reason about the full business domain directly from the repository itself.

From the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist. Knowledge that lives in Google Docs, chat threads, or people’s heads is not accessible to the system. Repository-local, versioned artifacts (e.g., code, markdown, schemas, executable plans) are all it can see.

We learned that we needed to push more and more context into the repo over time. That Slack discussion that aligned the team on an architectural pattern? If it isn’t discoverable to the agent, it’s illegible in the same way it would be unknown to a new hire joining three months later.

Giving Codex more context means organizing and exposing the right information so the agent can reason over it, rather than overwhelming it with ad-hoc instructions. In the same way you would onboard a new teammate on product principles, engineering norms, and team culture (emoji preferences included), giving the agent this information leads to better-aligned output.

This framing clarified many tradeoffs. We favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as “boring” tend to be easier for agents to model due to composability, API stability, and representation in the training set. In some cases, it was cheaper to have the agent reimplement subsets of functionality than to work around opaque upstream behavior from public libraries. For example, rather than pulling in a generic p-limit-style package, we implemented our own map-with-concurrency helper: it’s tightly integrated with our OpenTelemetry instrumentation, has 100% test coverage, and behaves exactly the way our runtime expects.
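As a sketch of the trade being described, here is a minimal map-with-concurrency helper. The team's real helper lives in their own stack and is wired into OpenTelemetry; none of that is shown here, only the core semantics: run at most `limit` tasks at once while preserving input order.

```python
import asyncio

async def map_with_concurrency(fn, items, limit: int):
    """Apply async fn to items with at most `limit` running concurrently."""
    sem = asyncio.Semaphore(limit)

    async def bounded(item):
        async with sem:  # at most `limit` tasks hold the semaphore
            return await fn(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(i) for i in items))

async def demo():
    async def double(x):
        await asyncio.sleep(0)
        return 2 * x
    return await map_with_concurrency(double, range(5), limit=2)

result = asyncio.run(demo())  # [0, 2, 4, 6, 8]
```

Owning such a small primitive makes its behavior fully legible in-repo, which is the point the paragraph above is making.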

Pulling more of the system into a form the agent can inspect, validate, and modify directly increases leverage—not just for Codex, but for other agents (e.g. Aardvark) that are working on the codebase as well.

Enforcing architecture and taste

Documentation alone doesn’t keep a fully agent-generated codebase coherent. By enforcing invariants, not micromanaging implementations, we let agents ship fast without undermining the foundation. For example, we require Codex to parse data shapes at the boundary, but are not prescriptive about how that happens (the model seems to like Zod, but we didn’t mandate a specific library).

Agents are most effective in environments with strict boundaries and predictable structure, so we built the application around a rigid architectural model. Each business domain is divided into a fixed set of layers, with strictly validated dependency directions and a limited set of permissible edges. These constraints are enforced mechanically via custom linters (Codex-generated, of course!) and structural tests.

The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.

This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.

In practice, we enforce these rules with custom linters and structural tests, plus a small set of “taste invariants.” For example, we statically enforce structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements with custom lints. Because the lints are custom, we write the error messages to inject remediation instructions into agent context.
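One plausible encoding of the layering rule as a mechanical check is sketched below, including the agent-facing remediation text in the error message. The edge representation is illustrative, and the Providers escape hatch and real AST analysis are omitted; the assumption made here is that a layer may depend only on itself and earlier layers in the fixed order.

```python
# Fixed layer order within a business domain; later layers may depend on
# earlier ones, never the reverse (Providers escape hatch omitted).
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]

def check_edges(edges: list[tuple[str, str]]) -> list[str]:
    """Return remediation messages for disallowed (importer, imported) edges."""
    errors = []
    for importer, imported in edges:
        if LAYERS.index(imported) > LAYERS.index(importer):
            errors.append(
                f"{importer} may not import {imported}: dependencies must "
                f"point toward earlier layers ({' -> '.join(LAYERS)})"
            )
    return errors

# "service -> types" is forward-compliant; "repo -> ui" points the wrong way.
violations = check_edges([("service", "types"), ("repo", "ui")])
```

Writing the remediation into the lint message matters: when the agent trips the rule, the fix instructions land directly in its context.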

In a human-first workflow, these rules might feel pedantic or constraining. With agents, they become multipliers: once encoded, they apply everywhere at once.

At the same time, we’re explicit about where constraints matter and where they do not. This resembles leading a large engineering platform organization: enforce boundaries centrally, allow autonomy locally. You care deeply about boundaries, correctness, and reproducibility. Within those boundaries, you allow teams—or agents—significant freedom in how solutions are expressed.

The resulting code does not always match human stylistic preferences, and that’s okay. As long as the output is correct, maintainable, and legible to future agent runs, it meets the bar.

Human taste is fed back into the system continuously. Review comments, refactoring pull requests, and user-facing bugs are captured as documentation updates or encoded directly into tooling. When documentation falls short, we promote the rule into code.

Throughput changes the merge philosophy

As Codex’s throughput increased, many conventional engineering norms became counterproductive.

The repository operates with minimal blocking merge gates. Pull requests are short-lived. Test flakes are often addressed with follow-up runs rather than blocking progress indefinitely. In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive.

This would be irresponsible in a low-throughput environment. Here, it’s often the right tradeoff.

What “agent-generated” actually means

When we say the codebase is generated by Codex agents, we mean everything in the codebase.

Agents produce:

  • Product code and tests

  • CI configuration and release tooling

  • Internal developer tools

  • Documentation and design history

  • Evaluation harnesses

  • Review comments and responses

  • Scripts that manage the repository itself

  • Production dashboard definition files

Humans always remain in the loop, but work at a different layer of abstraction than we used to. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, we treat it as a signal: identify what is missing—tools, guardrails, documentation—and feed it back into the repository, always by having Codex itself write the fix.

Agents use our standard development tools directly. They pull review feedback, respond inline, push updates, and often squash and merge their own pull requests.

Increasing levels of autonomy

As more of the development loop was encoded directly into the system—testing, validation, review, feedback handling, and recovery—the repository recently crossed a meaningful threshold where Codex can end-to-end drive a new feature.

Given a single prompt, the agent can now:

  • Validate the current state of the codebase

  • Reproduce a reported bug

  • Record a video demonstrating the failure

  • Implement a fix

  • Validate the fix by driving the application

  • Record a second video demonstrating the resolution

  • Open a pull request

  • Respond to agent and human feedback

  • Detect and remediate build failures

  • Escalate to a human only when judgment is required

  • Merge the change

This behavior depends heavily on the specific structure and tooling of this repository and should not be assumed to generalize without similar investment—at least, not yet.

Entropy and garbage collection

Full agent autonomy also introduces novel problems. Codex replicates patterns that already exist in the repository—even uneven or suboptimal ones. Over time, this inevitably leads to drift.

Initially, humans addressed this manually. Our team used to spend every Friday (20% of the week) cleaning up “AI slop.” Unsurprisingly, that didn’t scale.

Instead, we started encoding what we call “golden principles” directly into the repository and built a recurring cleanup process. These principles are opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs. For example: (1) we prefer shared utility packages over hand-rolled helpers to keep invariants centralized, and (2) we don’t probe data “YOLO-style”—we validate boundaries or rely on typed SDKs so the agent can’t accidentally build on guessed shapes. On a regular cadence, we have a set of background Codex tasks that scan for deviations, update quality grades, and open targeted refactoring pull requests. Most of these can be reviewed in under a minute and automerged.
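Golden principle (2), validating at the boundary instead of building on guessed shapes, might look like this in miniature. The field names are hypothetical; the real codebase leans on typed SDKs and schema parsers (the model reportedly favors Zod), not hand-rolled checks like these.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    id: str
    email: str

def parse_user(raw: object) -> User:
    """Validate the payload shape once, at the edge; fail loudly otherwise."""
    if not isinstance(raw, dict):
        raise TypeError("user payload must be an object")
    missing = {"id", "email"} - raw.keys()
    if missing:
        raise ValueError(f"user payload missing fields: {sorted(missing)}")
    return User(id=str(raw["id"]), email=str(raw["email"]))

# Past this point, code works with a typed User, never a raw dict:
user = parse_user({"id": "u_1", "email": "a@example.com", "extra": 1})
```

Everything downstream of the boundary can then rely on the typed value, so later agent runs cannot accidentally compound a guessed shape.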

This functions like garbage collection. Technical debt is like a high-interest loan: it’s almost always better to pay it down continuously in small increments than to let it compound and tackle it in painful bursts. Human taste is captured once, then enforced continuously on every line of code. This also lets us catch and resolve bad patterns on a daily basis, rather than letting them spread in the codebase for days or weeks.

What we’re still learning

This strategy has so far worked well up through internal launch and adoption at OpenAI. Building a real product for real users helped anchor our investments in reality and guide us towards long-term maintainability.

What we don’t yet know is how architectural coherence evolves over years in a fully agent-generated system. We’re still learning where human judgment adds the most leverage and how to encode that judgment so it compounds. We also don’t know how this system will evolve as models continue to become more capable over time.

What’s become clear: building software still demands discipline, but the discipline shows up more in the scaffolding than in the code. The tooling, abstractions, and feedback loops that keep the codebase coherent are increasingly important.

Our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal: build and maintain complex, reliable software at scale.

As agents like Codex take on larger portions of the software lifecycle, these questions will matter even more. We hope that sharing some early lessons helps you reason about where to invest your effort so you can just build things.

Author

Ryan Lopopolo

Acknowledgements

Special thanks to Victor Zhu and Zach Brock who contributed to the post, as well as to the entire team that built this new product.

📋 讨论归档

讨论进行中…