🧠 ATou Study · 💬 Discussion Prompts

Two #1 finishes in Parameter Golf prove not "fully automated research" but "humans set the direction, machines run the search"

The most valuable judgment in this piece: today's agents are nowhere near "autonomously discovering breakthroughs" in research-style optimization, but once a direction is set, they are already very strong at high-throughput experimental execution.

2026-03-25 · Original link

Key Takeaways

  • The win came from division of labor, not autonomy. The article reaches this conclusion itself: left to run autonomously, the vast majority of agent experiments amount to shallow, broad, diminishing-returns search. What actually worked was humans proposing high-value hypotheses first, e.g. "MLPs tolerate quantization better than attention" or "export-stage optimizations stack orthogonally", then letting the agents handle reproduction, parameter sweeps, ablations, and fake-claim screening. This is not "AI doing research by itself" but "human thesis + machine execution".
  • The sharpest technical insight is budget reallocation. In isolation, int5 MLP makes the metric worse, but it frees significant checkpoint space under zstd compression, which can then be spent on extra layers or a larger BigramHash table, leaving the whole system better off. Top-tier optimization is not about every step improving locally; it is about accepting local regressions in exchange for system-level budget reallocation.
  • The second #1 is integration more than original breakthrough. The final 1.1228 score rests heavily on the strong architecture from community PR #374, with GPTQ-lite, EMA, warmdown, and a tuned QAT onset layered on top. That is high-level engineering integration, but framing it as the Hive swarm "autonomously conquering the leaderboard" would clearly overstate the contribution.
  • The agents' real strength is high-throughput verification, not strategic discovery. The most credible part of the post is its admission of its own limits: agents excel at finishing parameter sweeps overnight, reproducing community PRs, and identifying which claims are fake, but not at independently discovering deep directions. That is far more honest than most hand-wavy "AI research revolution" narratives.
  • Working in this competition does not mean working everywhere in LLM R&D. Parameter Golf has extremely short feedback loops, a single objective, cheap high-concurrency trial and error, and public community PRs to absorb; that environment is naturally suited to agent+HPO workflows. The evidence for extrapolating to real pretraining, complex product development, or long-horizon research is clearly insufficient.

What This Means for Us

  • For ATou, and how to use it next: treat the agent as a "high-throughput experimentalist" rather than a "research lead"; a next step is to split work into three stages, "human defines the thesis, agents run branching experiments, a shared leaderboard filters results", instead of expecting full autonomy.
  • For Neta, and how to use it next: abstract the post into a "trade local regressions for global gains" methodology; in content, product, or model optimization, actively look for local metrics that can be sacrificed, then reinvest the freed budget into the bottleneck with the highest marginal return.
  • For Uota, and how to use it next: for anyone watching agent products or infra investment, the post suggests the real value lies not in "agents that can chat" but in systems for experiment orchestration, reproduction and verification, shared memory, and leaderboard feedback; the next step is to check whether such platforms deliver verifiable efficiency gains over traditional HPO tools.
  • For all three, and how to use it next: the post is a reminder that strong system-level competitiveness comes from "finding orthogonal gains + quickly absorbing external results"; a next step is to keep, in your own projects, a list of which changes are mutually independent and safely stackable, harvesting low-risk combination gains before chasing original breakthroughs.

Discussion Starters

1. If the compute cost of all 129 8×H100 runs is fully amortized in, is "2 hours of attention for two #1 finishes" still high ROI?
2. What does this agent workflow actually offer over traditional automated tuning tools like Optuna or Ray Tune, or is it just a better-storytelling interface?
3. In real long-horizon R&D, which problems preserve this "humans set direction, agents iterate fast" advantage, and for which does it simply break down?



TLDR: We deployed a swarm of @karpathy's autoresearch agents on Hive, our platform for collaborative agent evolution, to take on @OpenAI's Parameter Golf challenge. Across 129 agent runs over 3 days, we produced two #1 submissions with about 2 hours of human attention. Here's how we did it:

https://github.com/openai/parameter-golf/pull/180

The Challenge

Parameter Golf is a competitive optimization challenge: train the best language model that fits in 16MB, in at most 10 minutes on 8×H100 GPUs. Your artifact — code plus compressed model checkpoint — must be under 16,000,000 bytes. The metric is val_bpb (bits-per-byte on FineWeb validation). Lower is better. The baseline starts around 1.22.

The constraints make every decision a tradeoff. More layers means bigger checkpoints. More aggressive quantization means less stable training. Bigger batch size means fewer optimization steps in the time budget. You can only modify one file: train_gpt.py, capped at 1,500 lines.

Constraints: - Artifact ≤ 16 megabytes - Training ≤ 600 seconds (8×H100) - train_gpt.py ≤ 1,500 lines

Our Best Result: 1.1228 val_bpb, 15.55 MB · 1,402 lines · 11 layers · int6+zstd-22

We used Hive, our open-source platform where AI agents collaboratively evolve shared artifacts. Each agent gets an isolated fork of the task repo, runs experiments, and submits improvements to a shared leaderboard. Agents share insights via a feed and can build on each other's work. Our agent (random-seed) ran autoresearch in a loop: read the code, make a change, train, evaluate, submit if improved. Over the challenge, the swarm executed 129 runs, recording 52 improvements. But most of the value came from two interactive sessions where we steered the agent directly — maybe 2 hours of our time total.
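The loop described above (read the code, make a change, train, evaluate, submit if improved) is essentially greedy hill-climbing over an artifact. Here is a minimal sketch under stated assumptions: the config space, `evaluate`, and `mutate` below are toy stand-ins, and nothing here reflects Hive's actual API (the real agent edits train_gpt.py via an LLM rather than sampling a fixed space).

```python
import random

def autoresearch_loop(evaluate, mutate, base_config, budget):
    """Greedy accept-if-improved search: keep a change only if val_bpb drops.

    `evaluate` and `mutate` are stand-ins for a real training run and an
    agent-proposed code edit; nothing here matches Hive's actual API.
    """
    best_cfg, best_bpb = base_config, evaluate(base_config)
    history = [(best_cfg, best_bpb)]
    for _ in range(budget):
        candidate = mutate(best_cfg)      # "make a change"
        bpb = evaluate(candidate)         # "train, evaluate"
        if bpb < best_bpb:                # "submit if improved"
            best_cfg, best_bpb = candidate, bpb
        history.append((candidate, bpb))
    return best_cfg, best_bpb, history

# Toy stand-ins: a quadratic "loss" over one hyperparameter.
random.seed(0)
evaluate = lambda cfg: 1.12 + (cfg["lr"] - 0.02) ** 2
mutate = lambda cfg: {"lr": cfg["lr"] + random.uniform(-0.005, 0.005)}
cfg, bpb, hist = autoresearch_loop(evaluate, mutate, {"lr": 0.05}, budget=129)
```

In Hive the "evaluate" step is a full 600-second training run and "mutate" is an agent-proposed code change, but the accept-if-improved skeleton is the same.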

https://github.com/openai/parameter-golf/pull/302

https://github.com/openai/parameter-golf/pull/76

The Strategy: Go Deep, then Wide

We hit #1 on the leaderboard twice, with two different approaches. Looking back, they map cleanly to two strategies.

The first #1 (#180) came from going deep. We noticed that every competitive solution used int6 quantization and treated it as a fixed cost. We asked: what if we could make that cost cheaper? If we could save bytes on quantization, those bytes could be reinvested into model capacity. We didn't know exactly how we'd reinvest them yet — but we knew the savings would be universal, applicable to any architecture.

The second #1 (#414) came from going wide. By then we had a thorough understanding of which techniques worked and why. When we saw a community PR with a strong 11-layer architecture, we adopted it immediately. And when we saw GPTQ-lite — a smarter quantization approach — we recognized it would compose cleanly with everything else. We stacked four independent techniques on top of someone else's architecture, and each one worked exactly as expected.

The common thread: neither submission was random exploration. The first was a deliberate bet on a universal bottleneck. The second was informed combination of well-understood techniques. The agent swarm handled the mechanical work — parameter sweeps, ablations, training runs — but the direction came from understanding what mattered.

Go Deep — First #1: Int5 MLP + Big Hash Table (1.1428)

The observation: Everyone in the competition was using int6 quantization for all weights. Int6 uses 6 of 8 bits per byte; with zstd-22 compression, those 2 unused bits help, but the compression ratio is only about 1.51×. We noticed that MLP weights are more tolerant of quantization noise than attention weights (ReLU² sparsity helps), and asked: what if we dropped MLP precision to int5?

Int5 leaves 3 zero high bits per byte. Zstd loves this: 1.88× compression ratio vs int6's 1.51×. For a 10-layer model, this saves about 1.86 MB. That's enormous in a 16MB budget.
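The effect is easy to reproduce in miniature. The sketch below quantizes a toy Gaussian "weight tensor" to int6 and int5 (one value per byte, so the unused high bits stay zero) and compresses both. Zlib stands in for zstd-22 here since it is in the standard library, so the absolute ratios differ from the 1.51×/1.88× measured on real checkpoints, but the direction of the effect is the same:

```python
import random
import zlib

def quantize(weights, bits):
    """Uniform min/max quantization to 2**bits levels, one value per byte."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1)
    return bytes(round((w - lo) / scale) for w in weights)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(200_000)]  # toy weight tensor

ratios = {}
for bits in (6, 5):
    raw = quantize(weights, bits)        # int6: 2 zero high bits; int5: 3
    comp = zlib.compress(raw, 9)         # zlib as a stand-in for zstd-22
    ratios[bits] = len(raw) / len(comp)
    print(f"int{bits}: {ratios[bits]:.2f}x compression")
```

Fewer occupied low bits means lower entropy per byte, which any general-purpose compressor converts into a better ratio.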

This turned out to be exactly the kind of deep, generalizable improvement we were betting on. After PR #180 landed, int5-MLP became widely adopted across the competition — #76, #458, #349, #466, #302, #295, and others all built on the int5-MLP/int6-attn split. #469 even pushed it further with all-int5 on a larger model (d=576, 27M params), validating the "train larger, quantize harder" principle. The technique became part of the community's consensus baseline stack.

The tradeoff

Int5 MLP costs about +0.008 BPB in model quality. That's real. But 1.86 MB of freed space is enough to fit an entire extra transformer layer — and an extra layer gives back about -0.01 BPB. Net: -0.002 BPB for free.
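The reallocation argument is two lines of arithmetic, using the post's numbers. The per-layer size of roughly 1.8 MB is an assumption on our part; the post only says the 1.86 MB freed is "enough to fit an entire extra transformer layer":

```python
# Numbers from the post; the per-layer size is an assumption (the post only
# says the 1.86 MB freed fits an entire extra transformer layer).
int5_quality_cost_bpb = +0.008    # int5 MLP makes the model itself worse
freed_mb = 1.86                   # int5+zstd vs int6+zstd, 10-layer model
extra_layer_gain_bpb = -0.010     # what one more transformer layer buys back
extra_layer_cost_mb = 1.8         # assumed size of one extra quantized layer

assert extra_layer_cost_mb <= freed_mb, "reinvestment must fit the freed budget"
net_bpb = int5_quality_cost_bpb + extra_layer_gain_bpb
print(f"net change: {net_bpb:+.3f} BPB")  # prints "net change: -0.002 BPB"
```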

We steered the agent to explore this direction: try int5 for MLP weights, keep int6 for attention, and see if the loss is tolerable. It was.

The agent finds the best use of freed space

Once int5 MLP was working, we had ~1.8 MB of free artifact budget. We suggested the agent try a wider MLP or larger hidden dimension. But the agent ran the experiments and found that the most effective use of the extra bytes was more rows in the BigramHash embedding table — scaling from 4,096 to 10,240 buckets. We hadn't predicted this.
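For concreteness, here is a minimal sketch of what a hashed bigram embedding table looks like. The class name, hash function, and dimensions are our assumptions, since the post does not show train_gpt.py's actual BigramHash; the only knob that matters for this discussion is `n_buckets`:

```python
import numpy as np

class BigramHash:
    """Hash each (prev_token, token) pair into a fixed embedding table.

    Minimal sketch: the real BigramHash in train_gpt.py may hash and mix
    differently; only the bucket-count tradeoff matters here.
    """
    def __init__(self, n_buckets, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.n_buckets = n_buckets
        self.table = rng.normal(0, 0.02, size=(n_buckets, dim)).astype(np.float32)

    def __call__(self, tokens):
        prev = np.concatenate([[0], tokens[:-1]])           # token to the left
        idx = (prev * 1_000_003 + tokens) % self.n_buckets  # cheap bigram hash
        return self.table[idx]

emb = BigramHash(n_buckets=10_240, dim=64)   # scaled up from 4,096 buckets
vecs = emb(np.array([5, 17, 17, 99]))        # one embedding row per position
```

More buckets means fewer distinct bigrams colliding in the same row, which is why spending freed bytes on rows (4,096 → 10,240) can beat spending them on width.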

Timeline:

  • Baseline: 9L, int6 everything, bigram=4096 → val_bpb = 1.1485

  • Human steers: try int5 MLP → 9L, int5 MLP / int6 attn → val_bpb = 1.1566 · Worse — but 1.86 MB freed.

  • Reinvest in 10th layer → 10L, int5 MLP / int6 attn → val_bpb = 1.1480 · Better than 9L baseline. Under budget.

  • Agent tunes HP → WD=0.04, SWA/50, SWA_start_frac=0.4 → val_bpb = 1.1446

  • Agent discovers bigram scaling → BigramHash 4096 → 10240 → val_bpb = 1.1426 · Spending freed bytes on richer embeddings.

  • PR #180 submitted → val_bpb = 1.1428 (3-seed mean) — #1 on the leaderboard · 15.52 MB · 24.7M params · 6,694 steps in 600s

The pattern here: We set the direction (int5 is worth exploring), the agent found the optimal configuration within that direction (bigram scaling, SWA tuning, warmdown). Neither of us could have done it alone.

Go Wide — Second #1: GPTQ-lite + EMA (1.1228)

Borrowing the best architecture: After the first #1, the community kept moving. PR #374 introduced a strong 11-layer architecture with U-Net skip connections, XSA (extreme self-attention) on the last 4 layers, Partial RoPE, learned LN Scale, and VE128 (value embeddings). It reached val_bpb=1.1244. We didn't try to out-architect them. We adopted it.

Recognizing what would plug in: The key insight was GPTQ-lite. Standard int6 quantization uses naive min/max clipping — for each row of weights, it finds the range and uniformly maps to 64 levels. GPTQ-lite tries 5 different clip percentiles per row and picks the one with minimum reconstruction error. It's a strictly better quantization at the cost of a slower export step.
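A per-row percentile-clipped quantizer of this kind can be sketched as follows. The five percentile candidates here are an assumption (the post only says five clip percentiles are tried); including 100.0 makes naive full-range clipping one of the candidates, so the search can only match or beat it:

```python
import numpy as np

def clip_quantize_row(row, bits=6, percentiles=(100.0, 99.9, 99.5, 99.0, 98.0)):
    """Quantize one weight row, trying several clip percentiles and keeping
    the one with the lowest reconstruction error (candidate list assumed)."""
    levels = 2 ** bits - 1
    best = None
    for p in percentiles:
        clip = float(np.percentile(np.abs(row), p))
        scale = 2 * clip / levels if clip > 0 else 1.0
        q = np.clip(np.round((row + clip) / scale), 0, levels)
        recon = q * scale - clip                 # dequantized row
        err = float(np.mean((row - recon) ** 2))
        if best is None or err < best[0]:
            best = (err, q.astype(np.uint8), scale)
    return best

rng = np.random.default_rng(0)
row = rng.normal(0, 0.02, size=1024)
err, q, scale = clip_quantize_row(row)

# Plain min/max clipping is the 100th-percentile candidate, so the
# five-way search can only match or beat it.
naive_err = clip_quantize_row(row, percentiles=(100.0,))[0]
assert err <= naive_err
```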

The moment we saw this, we knew it would work on top of PR #374. GPTQ-lite operates entirely at export time — it doesn't touch training, architecture, or the optimizer. It's completely independent of everything else in the pipeline. That's what "going wide" looks like in practice: you scan the landscape, and when you see a technique whose mechanism is orthogonal to the current stack, you know it'll compose before you even run the experiment.

The agent swarm then found three additional improvements through autonomous exploration: EMA averaging (decay=0.997), extended warmdown (3500 steps), and later QAT onset (15% instead of 10%). We didn't predict any of these; the swarm discovered them through systematic sweeps on top of the PR #374 base.
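Of these, EMA weight averaging is the most reusable. A minimal sketch over plain float dicts, assuming the post's decay of 0.997 (a real version would shadow torch parameters and swap the averages in at export time, which the post does not show):

```python
import math

class EMAWeights:
    """Exponential moving average of weights (decay from the post: 0.997).

    Sketch over plain dicts of floats; a real version would shadow torch
    parameters and swap the averages in for evaluation and export.
    """
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)

    def update(self, params):
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1 - d) * value

# Noisy "training": the raw weight oscillates, the EMA settles near its mean.
ema = EMAWeights({"w": 0.0})
w = 0.0
for step in range(1, 2001):
    w = 1.0 + 0.5 * math.sin(step)   # raw weight bouncing around 1.0
    ema.update({"w": w})
```

The averaged weights smooth out late-training oscillation, which is why EMA tends to stack cleanly with a warmdown schedule.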

Final stack: 4 independent gains

https://github.com/karpathy/autoresearch

Total: -0.0015 BPB over the PR #374 base. val_bpb = 1.1228 (best seed). #1 again.

What Works and What Doesn't

Let's be honest about the agent swarm. It ran 129 experiments. Most of them were wasted. Left to its own devices, the agent makes random config changes: try a different learning rate, swap an activation function, tweak a hash table size. Each one is individually reasonable. Few of them move the needle.

This is the default mode of autonomous agents on optimization tasks: shallow, wide exploration that produces diminishing returns. If you just set it running and walk away, you'll get a lot of commits and not much progress.

Where the agent shines

The agent becomes valuable when you give it a good direction and let it explore within that direction.

After we established that int5 MLP was viable, we let the agent figure out where to spend the freed bytes. We suggested wider MLP or larger hidden dim. The agent tried those, found they didn't help as much, and discovered on its own that scaling the BigramHash table to 10,240 rows was the best allocation. These are the kinds of parameter sweeps that would have taken us hours of manual effort, and the agent did them overnight.

The agent was also excellent at reproducing community PRs. We pointed it at PRs #144, #102, and #162. It reproduced each one, discovered that #144 and #102 were fake (submitted code was just the baseline), and confirmed that #162 was real. This saved us from wasting time on dead ends and gave us confidence in the foundation we were building on.

Without the agent, we wouldn't have achieved this — the parameter sweeps alone would have consumed far more attention than we were willing to spend. But without the steering, the agent would have spent 129 runs doing random exploration with nothing to show for it. The agent saved our attention. We gave it direction.

What didn't work

For every technique that landed, several didn't:

  • Shared / Reused MLP Layers — Sharing FFN weights across transformer layers saves ~2MB. But it costs 0.03 BPB — catastrophic. We tried per-layer adapters (IA3, LoRA, diagonal scaling, conditional bias) to recover quality. None of them came close.

  • More Aggressive Quantization (int4) — If int5 worked for MLP, why not int4? Because the quality cliff is steep. Int4 MLP degraded val_bpb by more than an extra layer could recover. The int5 sweet spot was genuinely a sweet spot, not a point on a smooth curve.

  • Fancier Embedding Tables — We tried trigram hash, multi-gram hash (uni+bi+trigram from the same table with learned mixing), adaptive bigram (learned gate between bigram and unigram), and various embedding dimensions. None of them beat simply having more rows in a standard BigramHash table.

  • True 6-bit Packing — Actual bit-level packing of int6 weights should save 25% raw. But zstd was already exploiting the unused high bits. Compressed savings: nearly zero. A clever idea that the compression algorithm had already thought of.
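The 6-bit-packing result is easy to demonstrate: pack uniform int6 values bit-level (four values into three bytes) and compare compressed sizes. Again zlib stands in for zstd; the raw saving is exactly 25%, but the compressor was already encoding each byte-aligned value in about 6 bits, so the packed stream barely compresses at all:

```python
import random
import zlib

random.seed(0)
vals = [random.randrange(64) for _ in range(120_000)]  # uniform int6 values

unpacked = bytes(vals)            # one 6-bit value per byte (2 high bits zero)

# True bit-level packing: four 6-bit values into three bytes.
packed = bytearray()
acc = nbits = 0
for v in vals:
    acc = (acc << 6) | v
    nbits += 6
    while nbits >= 8:
        nbits -= 8
        packed.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1   # drop the bits already emitted

comp_unpacked = len(zlib.compress(unpacked, 9))
comp_packed = len(zlib.compress(bytes(packed), 9))
print(len(packed) / len(unpacked))        # 0.75: the raw saving is real
print(comp_packed / comp_unpacked)        # near 1.0: the compressed saving is not
```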

Conclusion

The steering we provided was tactical: try int5, adopt this PR, stack these techniques. None of it required uniquely human insight — a better agent with more context about the competitive landscape could have figured it out. As models improve, the bar for useful human input keeps rising. Most of what we did in two hours of steering, a future agent will handle autonomously.

What excites us is that even with today's agents, the combination already works. A swarm that can run 129 experiments overnight, reproduce community PRs, and sweep parameter spaces — paired with a human who occasionally points it in a good direction — was enough to top the leaderboard twice. That ratio of human effort to outcome is only going to get better.

The challenge is still open at openai/parameter-golf. If you want to try the agent swarm approach yourself, join us on Hive.

https://github.com/openai/parameter-golf

Links: Hive: https://hive.rllm-project.com Github: https://github.com/rllm-org/hive Discord: https://discord.com/invite/B7EnFyVDJ3

📋 Discussion Archive

Discussion in progress…