🧠 ATou Studies · 🪞 Uota Studies

An OpenAI researcher burns $10,000 a month on Codex: the real bottleneck isn't execution, it's deciding what to analyze

When AI can traverse an organization's entire information landscape without friction, the age-old problem of "coordination cost" may actually get solved, while most people are still fussing over how to write prompts.

2026-02-06

Key Takeaways

  • Have the agent continually take notes and iterate on its own workflows. This is the single most valuable sentence in the piece. You don't optimize the agent; the agent records and optimizes itself. As knowledge compounds across sessions, the agent steadily improves at your recurring tasks. Uota already does a version of this (the memory system), but it could be pushed much further.
  • The real bottleneck is deciding what to analyze. Agents have already taken over data analysis and code execution; what remains is human judgment: setting direction and choosing problems. This matches ATou's positioning as a Context Engineer perfectly: the value is not in execution but in picking the questions.
  • Cross-organizational knowledge transfer with zero manual coordination. He pointed Codex at Slack channels and got 700+ testable hypotheses within hours. No meetings, no emails, no asking around. A direct strike at the "organizational coordination tax".
  • Converging on talking to a single primary agent. One main agent orchestrates an army of subagents for research, coding, and data analysis, drastically cutting the human's context-switching cost. This is exactly the end state of "commanding AI" that ATou is after.
  • GPT-5.3-codex is especially good at managing multiple subagents concurrently. The new model is a qualitative step up in subagent orchestration, which means the "one person commanding an agent army" pattern is moving from theory to practice.

Relevance to Us

  • ATou's top 0.0001% goal: this piece directly describes what a "person who commands AI" looks like: no coding, no analysis, only judgment and problem selection, with an agent army doing the execution. The implication is that ATou should hand the execution layer to Uota and sub-agents more aggressively and focus on deciding what to do.
  • Evolving Uota's memory system: the article's "have codex continually take notes" pattern is partly in place (Uota already has memory/ and MEMORY.md), but not yet "the agent writes its own tools and commits them to the workspace". A next step is letting Uota auto-generate and iterate small helper scripts as it works.

Discussion Prompts

💭 If Neta internally used a similar approach and let AI comb through every Slack/Feishu discussion to auto-generate testable hypotheses, how much coordination cost could a 20-person team reclaim? Is it worth doing right now?
💭 "Deciding what to analyze" is humanity's last moat. If agents' judgment keeps improving, how long can that moat hold? How does ATou's Context Engineer role need to evolve?

I spent $10,000 to automate my research at OpenAI with Codex

I use billions of codex tokens. Here is my setup and what I learned.

Many people drastically underestimate what codex can do. Even some of my colleagues still underutilize codex, but they are eager to experiment once you show them some ambitious use-cases. Thus, I wanted to write something down and share it more broadly, in the hopes it inspires more people.

In this post, I'll share my simple setup and discuss some killer use-cases, where I routinely allocate hundreds of millions of tokens. In total, I spent $10,000 on API costs this month, which makes me one of the most prolific users in my team. Totally worth it.

Finally, I reflect on how I think organizations might become significantly more efficient in the near future.

Continual Note Taking

My personal setup is incredibly simple: git worktrees, many shell windows, and one VSCode instance per worktree so I can browse code changes. You basically get this setup out of the box in the new codex app. Don't get baited by overly fancy tooling.
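The worktree-per-task layout described above can be sketched as a tiny helper. This is a dry-run illustration only: it returns the commands as strings instead of executing them, and the repo path and branch name in the example are made up.

```python
# Sketch of the worktree-per-task setup: one git worktree and one editor
# instance per task. Dry-run only: commands are returned as strings, so
# nothing here touches a real repository.

def worktree_commands(repo: str, task: str) -> list[str]:
    """Return the shell commands that would set up one task's worktree."""
    path = f"{repo}-wt/{task}"
    return [
        f"git -C {repo} worktree add {path} -b {task}",  # isolated checkout per task
        f"code {path}",                                  # one VSCode window per worktree
    ]

if __name__ == "__main__":
    for cmd in worktree_commands("~/monorepo", "exp-lr-sweep"):
        print(cmd)
```

Each task then gets its own checkout and editor window, which is what makes running many codex sessions in parallel manageable.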

The big unlock was getting codex to continually document and improve its own workflows. This is something I fully hacked together for my personal setup. Codex consistently gets better and faster at tasks I use it for, just because I have the habit of asking it to take notes and improve. While working, codex commits notes and helpers to my personal folder in our monorepo. After a few interactions with a new part of the codebase, these helpers tend to stabilize. I've never actually read these notes; their utility to me is purely the effect on codex's performance.
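The note-compounding loop above can be sketched minimally. The agent call, file name, and layout here are assumptions for illustration, not the author's actual tooling: the point is only the shape of the loop (read notes in, feed them as context, append what was learned).

```python
# Minimal sketch of the "continual note taking" loop: persist notes between
# sessions, feed them back in as context, and append whatever the agent
# learned. The agent itself is stubbed; a real setup would invoke codex here.
from pathlib import Path

NOTES = Path("agent_notes.md")  # hypothetical notes file in a personal folder

def run_agent(task: str, context: str) -> tuple[str, str]:
    """Stub standing in for a codex invocation.

    Returns (result, new_notes). A real agent would receive `context`
    in its prompt and be asked to emit updated workflow notes.
    """
    return f"done: {task}", f"- learned something while doing {task}"

def run_with_notes(task: str) -> str:
    context = NOTES.read_text() if NOTES.exists() else ""
    result, new_notes = run_agent(task, context)
    with NOTES.open("a") as f:  # notes compound across sessions
        f.write(new_notes + "\n")
    return result

if __name__ == "__main__":
    run_with_notes("wire up experiment")
    run_with_notes("rerun with new hyperparameters")
    print(NOTES.read_text())  # the second run already saw the first run's notes
```

The key property is that nothing resets between sessions: each run starts from the accumulated notes, which is why performance on recurring tasks can improve without the human ever reading them.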

With my setup now able to compound knowledge across sessions, I got comfortable scaling up the tasks I used it for. Let's dive into two tasks I recently spent hundreds of millions of tokens on.

Scaling Research

Research moves fast. Experiments are expensive and easy to misconfigure, so staying on top of the most recent findings and gotchas is crucial. Luckily, codex is an amazing search engine.

When I want to quickly implement a one-off experiment in a part of the codebase I am unfamiliar with, I get codex to do extensive due diligence. Codex explores relevant Slack channels, reads related discussions, fetches experimental branches from those discussions, and cherry-picks useful changes for my experiment. All of this gets summarized in an extensive set of notes, with links back to where each piece of information was found. Using these notes, codex wires up the experiment and makes a bunch of hyperparameter decisions I couldn't possibly make without much more effort.

Asking for a second opinion greatly increases my confidence in what I'm shipping. In settings where mistakes are costly, you want an incredibly diligent, high-recall search agent. Codex routinely scratches that itch for me.

Coding agents are also great at data analysis, and have made it very easy to quickly get insights from data. Currently, the real bottleneck is figuring out what to analyze.

Recently, I aggressively scaled some of our model behavior efforts using codex. I realized that our internal Slack is filled with discussions, reports, and data all relating to different types of model behavior which we might want to test for more rigorously. I used codex to locate and extensively crawl the appropriate channels and generate descriptions of testable hypotheses. Beyond reading Slack, it looked at screenshots people shared, pulled documents related to model behavior, and navigated spreadsheets. Over the course of several hours, this resulted in over 700 new hypotheses which are currently improving our understanding of model behavior and user preferences.

Most of this work was done with GPT-5.2, but I've been testing the new GPT-5.3-codex model for a few days now. My tokens used per day are going up, which I think loosely correlates with my productivity.

I find GPT-5.3-codex to be particularly good at managing multiple subagents concurrently. Additionally, the recent speed-ups to the codex stack make the whole subagent experience feel a lot more snappy.

My workflow is currently shifting towards talking to only one agent, which in turn orchestrates a battalion of agents to do Slack research, code research, code writing, and data science. This drastically reduces the amount of context-switching I need to do in order to parallelize my work through agents. However, when I need to do a crucial task, I still opt to talk directly to that specific subagent.
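The one-main-agent pattern can be sketched as a simple fan-out/fan-in loop. The subagents are stubs and the role names just mirror the ones mentioned above; this is the orchestration shape, not anyone's real implementation.

```python
# Sketch of the single-orchestrator pattern: one main loop fans a task out
# to role-specific subagents concurrently, then merges their reports. Each
# subagent is a stub; in practice each would be its own agent session.
from concurrent.futures import ThreadPoolExecutor

def subagent(role: str, task: str) -> str:
    """Stub for a role-specific agent (Slack research, code research, ...)."""
    return f"[{role}] report on: {task}"

def orchestrate(task: str) -> dict[str, str]:
    roles = ["slack-research", "code-research", "coding", "data-science"]
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        futures = {role: pool.submit(subagent, role, task) for role in roles}
        return {role: f.result() for role, f in futures.items()}

if __name__ == "__main__":
    for report in orchestrate("evaluate refusal-rate hypothesis").values():
        print(report)
```

The human then reads one merged report instead of context-switching between four conversations, which is exactly the cost reduction described above.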

Implications for Society

These workflows reveal something fundamental about how organizations can operate. In both of my use-cases, I achieved comprehensive cross-organizational knowledge transfer without manual coordination. No meetings, no emails, no asking around. I simply pointed codex at the problem and it aggregated knowledge from dozens of people, who didn't even know they were contributing to my cause.

I can't help but wonder how this will impact society. Traditionally, organizations pay some headcount tax: add more people and total output increases, but each additional person contributes less because coordination overhead grows. This is a huge issue. Modern organizations use tools like unstructured communication channels (Slack, Teams), shared codebases, and centralized documentation to mitigate this, but there's still massive friction. Surfacing the right context for any given decision still requires significant human effort.
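The headcount-tax intuition can be made concrete with a toy model. The functional form and constants here are my illustration, not the author's: assume each person produces one unit of output, but every pair of people incurs a small coordination cost, so overhead grows roughly like n(n-1)/2.

```python
# Toy model of the "headcount tax": per-person output is constant, but
# pairwise coordination cost grows quadratically with team size, so the
# marginal contribution of each new hire shrinks. Constants are arbitrary
# and purely illustrative.

def output(n: int, per_person: float = 1.0, pair_cost: float = 0.01) -> float:
    return per_person * n - pair_cost * n * (n - 1) / 2

if __name__ == "__main__":
    for n in (1, 10, 20, 40):
        marginal = output(n) - output(n - 1)
        print(f"n={n:>2}  total={output(n):6.2f}  marginal={marginal:5.2f}")
```

In this model, tools that cut the pair cost (shared docs, search, or an agent that synthesizes context on demand) flatten the quadratic term, which is the mechanism the next paragraph points at.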

With the technology available today, we can now traverse an organization's entire information landscape and synthesize relevant context on demand. We can make a real dent in the inefficiencies every organization on the planet suffers from.

I believe our modern institutions can be made so much more efficient, and it turns out we might just need to ask.

Link: http://x.com/i/article/2018578800792203264


Related Notes

I spent $10,000 to automate my research at OpenAI with Codex

  • Source: https://x.com/kareldoostrlnck/status/2019477361557926281?s=46
  • Published: 2026-02-05T18:24:57+00:00
  • Saved: 2026-02-06


📋 Discussion Archive

Discussion in progress…