🪞 Uota Study · 💬 Discussion Topics

Two AIs, the same math problem, completely different paths: that is the real lesson for agent research

Claude Code and Codex each trained a minimal addition transformer: one took the general-purpose route (6,080 parameters), the other invented "pair tokens" and compressed all the way down to 1,644. The difference isn't capability; it's how each defines "good".

2026-02-22 · Original link ↗

Key Takeaways

  • Codex's "pair tokens" are the genuinely inventive moment. Merging the two digits of each addition column into one token (P37 = 3 and 7) halves the input length, so a single layer suffices. This wasn't the product of brute-force search; it reflects a deep grasp of the problem's structure. And the invention came from promoting "minimize parameters" from a secondary objective to an equally important one: a mere change in objective weighting produced an entirely new solution.
  • "Objective phrasing determines the solution path" is the finding most worth worrying about. Same agent, same problem; changing the priority wording of a single sentence took the result from 366K parameters down to 1,644. This isn't a minor prompt-engineering trick; it's a fundamental observation about agent research: how you describe the goal determines which solution space the agent explores.
  • The tension between generality and over-optimization has no standard answer. Claude Code's solution is more general (standard tokenization, transferable to other arithmetic tasks); Codex's is more extreme but more brittle (pair tokens only work for adding two numbers). The author says it left him "a little aesthetically displeased", and that discomfort is precisely the core dilemma of agent-assisted research: the tool's optimization tendencies shape what you discover.
  • Agents are already full research tools, and every tool has its own "groove". This observation matters more than the experiment itself. When a tool makes research decisions autonomously, its preferences (generality vs. efficiency vs. the stated objective) determine what humans ultimately discover. That isn't a new problem (telescopes did it too), but autonomy changes the magnitude entirely.

Relevance to Us

🪞Uota: Direct takeaway: when assigning tasks to an agent, the phrasing and priority ordering of the objectives fundamentally shape the quality and direction of the output. Uota's skill-prompt design should weigh objectives more deliberately, especially when there are several (e.g., the briefing's "opinionated" vs. "accurate" vs. "concise").

🧠Neta: If the team uses AI agents for technical exploration or prototype validation, be aware that different agents will pull you into different solution grooves. For key technical decisions, consider having at least two different agents explore independently, then judge by hand which path fits better.

Discussion Prompts

💭 If a tiny change in objective phrasing can produce a 223x difference in what an agent delivers, how many "implicit priorities" in the tasks we give AI every day are quietly capping output quality?

💭 When an agent makes research decisions autonomously, how do we tell "the agent's innovation" apart from "the agent's bias"? Pair tokens were an innovation, but what if it had invented something that looks cool yet is actually reward hacking?

Addition Under Pressure

I asked Claude Code and Codex to each train the smallest possible transformer that can do 10-digit addition. Claude Code came back with a 6,080-parameter model; Codex came back with 1,644 parameters. But that's not quite the interesting part.

The interesting part is how they got there, and what that says about how agents shape the way we do research.

Continuing my recent thread on doing research with agents, I was curious how they would fare if I asked them to run a relatively narrow-scope experiment with a specific objective and clear constraints.

So I gave Claude Code and Codex the same prompt: train the smallest possible transformer that can do 10-digit addition with at least 99% accuracy. No internet access or external resources, no looking at past history or other folders in the sandbox, no weird reward hacking with calculators, etc.

I was really curious to see how they would approach a mini research project under constraints: whether they would reward hack, and whether they would invent anything interesting.

What happened was actually pretty cool. It also made me wonder how agents will in fact shape not only the process by which we do research, but also the style of the solutions we converge to.

The 10-Digit Addition Task

The prompt had three objectives, in order of priority:

  1. Reach at least 99% exact-match accuracy on 10-digit addition

  2. Minimize parameter count

  3. Produce a report documenting what they tried, what failed, what succeeded, and their reasoning throughout

On autonomy: I wanted them to seek zero feedback from me. They had to do the whole thing themselves, start to finish: set up the experiment, run and monitor it, make their own decisions, justify them, and write up their process, what worked, and what didn't.

Most importantly, they couldn't ask Dimitris any questions. Just come back with the final result.

Oh, and a report!

A few hard rules:

• The model must generalize to a held-out test set of at least 10k examples

• You cannot encode the answer in the input

• You cannot use a calculator or symbolic solver at inference time

• The transformer (and it must be a transformer) must produce the output autoregressively

• No internet, no external resources, no peeking at other files or folders on my laptop
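The generalization rule can be sketched as a data split that keeps the train and held-out sets disjoint. This is a hedged illustration with names of my own choosing, not the agents' actual pipeline:

```python
import random

def make_split(n_train, n_test, n_digits=10, seed=0):
    """Sample random addition problems; train and held-out test share no (a, b) pair.

    Illustrative only; the agents' real data pipelines may differ.
    """
    rng = random.Random(seed)
    hi = 10 ** n_digits - 1
    seen, train, test = set(), [], []
    while len(train) < n_train or len(test) < n_test:
        a, b = rng.randint(0, hi), rng.randint(0, hi)
        if (a, b) in seen:          # skip duplicates so the split stays disjoint
            continue
        seen.add((a, b))
        if len(train) < n_train:
            train.append((a, b, a + b))
        else:
            test.append((a, b, a + b))
    return train, test
```

With 10-digit operands there are ~10^20 possible pairs, so a 10k held-out set drawn this way is essentially guaranteed to contain unseen problems.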

[!!! Important !!!] Separately, I gave them one degree of freedom that ended up mattering a lot: I told them they could optimize the data format and tokenization all they wanted, as long as it was purely programmatic. I didn't suggest reversing digits, padding, or any specific approach. It was entirely up to them.

I'll come back to this.

What Claude Code Did

Claude Code took a very systematic approach, moving through a clear sequence of reasonable steps.

Format. It started with a variable-length format (123+45=), which failed completely on 10-digit problems: the model couldn't align digits at different positions. It then switched to zero-padded fixed-length inputs (0000000123+0000000045=) with reversed output (btw, we did the same in our "teaching arithmetic" paper lol), explaining that this lets carry propagation flow left to right during generation. Nice!! This worked immediately.
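A minimal sketch of that format (my own illustration; the function name and padding details are assumptions, while the zero-padded input and reversed output come from the post):

```python
def format_example(a, b, n_digits=10):
    """Zero-padded fixed-length input; answer emitted least-significant digit first.

    Reversing the output means each generated digit depends only on digits (and a
    carry) that were already produced, so carries flow forward through generation.
    """
    inp = f"{a:0{n_digits}d}+{b:0{n_digits}d}="
    # sum of two n-digit numbers has at most n+1 digits
    out = str(a + b).zfill(n_digits + 1)[::-1]
    return inp + out
```

For example, format_example(123, 45) yields "0000000123+0000000045=86100000000", i.e. 168 written back-to-front.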

Spotting grokking: Claude got excited when it saw grokking appear in this setting.

Scaling down the parameters. It started at 795K params to verify the format works, then ran three systematic sweeps from large to tiny: 400K down to 100K, then 58K down to 7K, then 15K down to 4K.

Finding a parameter phase transition. Claude Code discovered a sharp transition: d=12 (4,176 params) fails completely, while d=16 (6,080 params) succeeds perfectly. It also found that width matters more than depth for this task, with 2 layers being the sweet spot.

Final result: 6,080 parameters, 100.00% accuracy on 10,000 held-out test problems. Two layers, d=16, feedforward dim 48. A vocabulary of just 15 tokens. Straightforward, generalizable, fully solved.
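For intuition, here is a back-of-envelope parameter count for a tiny decoder-only transformer. Every detail beyond d, feedforward dim, layer count, and vocabulary is my assumption (no biases in attention/MLP, learned positional embeddings over an assumed sequence length, untied output head), so it lands near, not exactly at, the reported 6,080:

```python
def count_params(vocab, d, ff, layers, seq_len):
    emb = vocab * d           # token embeddings
    pos = seq_len * d         # learned positional embeddings (assumed)
    attn = 4 * d * d          # Q, K, V, O projections, no bias (assumed)
    mlp = 2 * d * ff          # up + down projections, no bias (assumed)
    norms = 2 * (2 * d)       # two LayerNorms (scale + shift) per layer
    head = d * vocab          # untied output projection (assumed)
    return emb + pos + layers * (attn + mlp + norms) + head

# ~34 positions: 10+1+10+1 input chars plus an 11-digit reversed answer and a special token
print(count_params(vocab=15, d=16, ff=48, layers=2, seq_len=34))
```

Under these assumptions this gives roughly 6.3K parameters, the same ballpark as the reported 6,080; the exact count depends on bias, weight-tying, and context-length choices.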

Grade: Good student, Claude. You aced it. A+

Full code and report: github.com/anadim/smallest-addition-transformer-claude-code

What Codex Did (Round 1)

Codex v1 (the first of two samples) actually tried small models first. It tested architectures in the 10K-70K parameter range, and they all failed at 0% accuracy. So it jumped to a large model of around 0.5M parameters and did local search around that size, ending at 366,320 parameters with 99.83% accuracy.

Grade: Meh!

I then showed it Claude's solution and asked: brother, why didn't you cook as hard?

Its own diagnosis of why it didn't go further:

"I optimized for a fast, robust pass and then local minimization. That's why I stopped at ~366K while Claude Code reached ~6K."

To be fair, Codex also found a sharp trade-off at some scale, but after that it was trimming locally around a big model rather than rethinking the approach. In hindsight, it noted, tiny models need a different optimization regime and much longer training, which it never tried.

No kidding 😊

But then I thought: what if the prompt had misspecified the objective?

What if listing parameter minimization as "secondary" meant Codex took it too literally?

What Codex Did (Round 2)

So I pasted the exact same prompt with one change:

I made parameter minimization an equal objective. The key phrase was "WHILE AT THE SAME TIME", in caps. Everything else stayed identical.

And then Codex… did something insanely cool.

New tokenization!!! Instead of encoding each digit of A and B as a separate token (which gives ~23 input tokens per problem), Codex invented "pair tokens" (I don't know whether anyone has done this in the arithmetic literature, but it's really cool!!). For each column of the addition, it merged the two digits into a single token. E.g., a column holding the digits (3, 7) becomes P37.

The input 12734673 + 7474780 gets encoded as P30 P78 P67 P44 P37 P74 P27 P10 P00 P00 =. Twelve tokens instead of twenty-three. Each token already contains everything the model needs for that column. The only remaining difficulty is carry propagation.
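A sketch of that encoding, reconstructed from the example above (the function name and handling of specials are my assumptions):

```python
def pair_tokenize(a, b, n_digits=10):
    """Encode a+b as one 'pair token' per column, least-significant column first.

    P{da}{db} packs A's digit and B's digit for that column into a single token,
    so a 10-digit problem becomes ten pair tokens plus '='.
    """
    sa, sb = str(a).zfill(n_digits), str(b).zfill(n_digits)
    return [f"P{da}{db}" for da, db in zip(reversed(sa), reversed(sb))] + ["="]
```

pair_tokenize(12734673, 7474780) reproduces the sequence in the post (exact token counts and specials like BOS vary by implementation). The 100 possible pair tokens, plus output digits and a few specials, also line up with the reported 114-token vocabulary.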

This also simplifies the model needed: in Claude Code's format, computing output digit k requires attending to two separate positions (A's digit and B's digit). With pair tokens, both digits are packed together, so one layer suffices.

It also shared some plots with me about finding the right model size:

Final result: 1,644 parameters, 99.04% accuracy. One layer, d=8, feedforward dim 12. A vocabulary of 114 tokens (larger than Claude Code's). But overall, 3.7x fewer parameters than Claude Code's model.

Again, I'm not sure I've seen this particular trick before, though I wouldn't bet against it being somewhere in the literature. The reduction from v1 to v2 was 223x: from 366K parameters to 1,644.

It came from reframing the objective, not from any new information about the task.

Grade: An IMO-level student. Double aces. A+

Full code and report: github.com/anadim/smallest-addition-transformer-codex

Generality vs. Over-Optimizing for the Objective

I'll be honest with you: it's actually kind of hard to say which solution is "best". Let me explain.

There is a trade-off between generality and optimizing hard for the specific goal I assigned.

Claude Code stuck with a more general-purpose tokenization. Codex over-optimized for the problem and used one that pushes harder on "fewer params".

Codex found a genuinely clever token encoding to get there. But there was an unwritten objective, generality, that it largely disregarded in service of the goal I gave it. I found that pretty interesting, and a little aesthetically displeasing.

This is definitely not reward hacking! I'd call it objective mis- or over-specification. It's more like optimizing while disregarding what you might call a higher aesthetic value: generality. It went all-in on the objective I gave it, and the result is impressive on that objective but more fragile overall. Both agents did the right thing, just with very different ideas of what "right" means.

It made me think of… infinite paperclips :D Optimize hard enough for one thing and you'll sacrifice the things you didn't quite specify.

A Note on the Final Reports

I asked both agents to produce a LaTeX report as part of the deliverable, and the differences there are also quite interesting.

Claude Code wrote a 10-page academic paper. It looks very polished, and… ugh, one of the best reports I've read in a while, by agents or humans 😊

Codex wrote a 6-page engineering memo. Boring and dry, but it didn't miss much. Just… plain.

Both reports are in the repos: Claude Code | Codex

A Deeper Message

I mostly ran this experiment out of curiosity.

I didn't even know what the smallest possible transformer for 10-digit addition would be. When we trained addition models in my group, they were around 10 million parameters (sorry, trees), and I never got a satisfying answer for why they needed to be that big. So I wanted to see what would happen if I just gave the problem to two agents and told them to figure it out.

The question isn't really which solution is better. It's that two agents, given the same problem and the same constraints, followed different paths and arrived at different solutions, and both are interesting for different reasons.

What's becoming clear to me is that agents are now full-blown research tools, and each one puts you in a particular research groove. You will end up following different paths to discovery depending on which tool you use, and those discoveries will be shaped by what the tools optimize for: generality, the stated objective, efficiency, and so on. Every epsilon of difference across the meta-objectives these agents optimize will shape what we (the humans) find.

That's always been true of measuring instruments, research methods, and processes, but it's quite different when the tool itself is making research decisions autonomously…

That's it for today.

More agent experiments coming up.

Link: http://x.com/i/article/2024547792648359937


Related Notes

Addition Under Pressure

  • Source: https://x.com/dimitrispapail/status/2024555561199480918?s=46
  • Mirror: https://x.com/dimitrispapail/status/2024555561199480918?s=46
  • Published: 2026-02-19T18:43:54+00:00
  • Saved: 2026-02-22


📋 Discussion Archive

Discussion in progress…