🧠 ATou Learn · 💬 Discussion Topic

AutoLab tests not answer-giving ability, but whether a model can survive failure and keep running experiments

AutoLab captures the piece most missing from current LLM evaluation, closed-loop iteration, but its packaging of "automatically optimizable engineering tasks" as a "real scientific loop" clearly overstates the case.

2026-04-07

Core takeaways

  • Shifting from answering to experimenting is right. The article's most defensible claim is that traditional benchmarks mostly test one-shot correctness, and so underestimate the capability that matters most in real R&D: sustaining an experiment-diagnose-revise loop under limited budget, noisy feedback, and no answer key.
  • AutoLab's value is turning closed-loop capability into something measurable. It provides no answer key, only a runnable program, a compute budget, and a target metric; this design is clearly closer to real engineering optimization than static Q&A, and continuous scoring rather than binary pass/fail better matches the fact that real improvement accumulates gradually instead of arriving in one stroke.
  • The article's notion of "closed-loop resilience" is useful. What really separates strong from weak is not how clever the model's first move is, but whether it can revise hypotheses, switch paths, and, when necessary, restructure the frame after negative feedback; both the data-selection and Connect-3 compression cases make this point clearly.
  • But what it measures is closer to "automated optimization ability" than full scientific discovery. Most of the 23 tasks are well-bounded, scriptable computer-engineering problems with explicit objective functions; that is not the same as real scientific research, where goals are fuzzy, observations are dirty, causal chains are long, and external constraints abound, so placing the two side by side is a clear over-extrapolation.
  • The narrative is forward-packaged, but the evidence is still thin. The piece leans heavily on future-dated timelines, idealized cases, and success stories; that style travels well but has a PR flavor, and without more systematic aggregate statistics, failure samples, and cost analysis, the conclusion that "models are already closed-loop research participants" is premature.

What this means for us

  • For ATou: stop evaluating AI only as an "answerer" and start treating it as an "experiment executor"; a next step is to rewrite our own optimization problems as "runnable environment + explicit reward + fixed budget" and test whether an agent can genuinely drive the loop forward.
  • For Neta: when designing benchmarks, toolchains, or workflows, the emphasis should fall not only on first-answer quality but on feedback reading and strategy updating; a next step is an internal eval that measures how much stronger a model is after 10 rounds than after 1, rather than judging only the first round.
  • For Uota: the article is a reminder that high-value capability is often not "getting it right the first time" but "restructuring the frame after failure"; in discussion we can adopt "closed-loop resilience" as a general criterion for judging talent, teams, and agents.
  • For investment: the durable moat is unlikely to be any single model capability but the system of "model + tools + evals + automated experiment platform"; the next step is to watch which teams can not only tell the autonomous-research story but also demonstrate that closed-loop gains per unit cost hold up over time.

Discussion starters

1. Is a benchmark like AutoLab really measuring "research ability," or mainly "automated engineering optimization ability"?
2. If an agent's gains come mostly from high-frequency trial and error and heavy token spend, does its commercial value still hold?
3. In a closed-loop system, which human contribution is hardest to replace: posing the question, spotting misleading metrics, or performing structural restructuring?

A benchmark perspective on whether models can move beyond static answers and begin contributing to the iterative experimental loops that actually produce scientific and engineering progress.

💡Quick take https://autolab.moe/

  1. Most benchmarks test one-shot correctness. But real progress emerges from cycles of failure, diagnosis, and revision under limited compute and noisy feedback.

  2. AutoLab replaces the exam with the laboratory: no answer key, just a compute budget and a path that must be discovered empirically by running, measuring, and revising under real constraints.

  3. Across 23 tasks, the capability that separates success from failure is closed-loop resilience: the ability to survive negative empirical feedback, update hypotheses, and restructure the approach.

Frontier research progress is driven by loops, not isolated outputs.

Progress is not a single brilliant insight generated in a zero-shot prompt. It is ground out through trial, measurement, and revision.

Most benchmarks test whether a model can produce the right output on the first try: the right code, the right reasoning chain, the right answer. But frontier AI is not built from isolated answers. It is built from loops of trial, measurement, interpretation, and revision, often running continuously, day and night.

Training recipes are tuned through hundreds of small experiments. System performance is improved through countless interventions. Architecture choices are shaped through loops of modification and measurement, all under constraints: finite compute, limited time, incomplete information, and noisy feedback. Evaluating models on this kind of work asks whether they can gather the empirical evidence required to actually find the answer, not just produce one on the first try.

Beyond the answer engine

If a model only completes isolated subtasks, its impact is fundamentally bounded. By the Theory of Constraints, speeding up one isolated component hits a ceiling because the bottleneck simply moves elsewhere.

To achieve an order-of-magnitude improvement, you have to restructure the system itself. Version control is a clear example: branching and merging removed the constraint of sequential coordination, enabling parallel experimentation not by making editing faster, but by eliminating the bottleneck entirely.

When models enter the experimental loop, something analogous happens. The bottleneck in research has long been human iteration speed: the number of experiment cycles a researcher can run in a day is small and fixed. An agent can run an order of magnitude more, and it can run around the clock. This does not just speed up one step; it removes the human iteration bottleneck from the critical path entirely, changing the structure of the work itself.

That is a different claim from saying models are becoming more intelligent. The shift is that they are gaining the ability to function inside the iterative machinery of discovery: not just reason about experiments, but execute them.

Why we built AutoLab

AutoLab is a benchmark for participation in experimental loops, not just for static knowledge or isolated reasoning.

If agents are becoming structural participants in the research loop, running hundreds of experiments overnight and iterating continuously without human intervention, we need a way to measure exactly that capability. That is the motivation behind AutoLab.

Each task gives the agent a working but unoptimized program, a fixed compute budget, and a target metric. There is no answer key. The path forward must be discovered empirically: profiling, hypothesizing, modifying code, running experiments, and iterating under real constraints.

AutoLab tests whether a model can do the unglamorous but essential work of empirical improvement: not just thinking about change, but executing the loop through which change is found.

The emerging landscape

The idea of agents running continuous optimization loops is already producing results. Karpathy's autoresearch [1] ran an agent for two days straight: 700 experiments, 20 discovered optimizations, an 11% training speedup on a project he considered already well-tuned. AlphaEvolve [2] (DeepMind, 2025) takes this further: LLMs paired with evolutionary search discovered the first improvement to matrix multiplication beyond Strassen in 56 years, now recovering 0.7% of Google's worldwide compute.

But how do we measure this capability? RE-Bench [3] (METR, 2024) compares agents against human experts on 8-hour research tasks. KernelBench [4] (Stanford, 2025) evaluates LLMs on 250 GPU kernel optimization tasks. COFFE [5] introduced instruction counting to eliminate machine noise. The AI Scientist [6] (Nature, 2026) generates research papers end-to-end. AutoLab builds on ideas from several of these, particularly COFFE’s instruction counting [5], and provides a standardized benchmark for the kind of continuous optimization loop that autoresearch [1] and AlphaEvolve [2] have demonstrated.

Design and task space

AutoLab contributes a facet that is currently underrepresented: tasks with large, under-explored solution spaces where the agent must discover its own path. A few examples illustrate the design intent:

Combinatorial search. Find a 16-input sorting network with the fewest comparators. The best known result (60) dates to the 1960s and has never been proven optimal.
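
The search space is enormous, but candidate networks are cheap to check. A minimal sketch (the 4-wire network below is a textbook example, not part of the benchmark) verifies a comparator network via the zero-one principle:

```python
from itertools import product

# A sorting network is a fixed list of (i, j) comparator pairs. By the
# zero-one principle, a network on n wires sorts every input iff it
# sorts all 2^n binary vectors.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]  # 5 comparators, optimal for n=4

def apply_network(network, values):
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def is_sorting_network(network, n):
    return all(
        apply_network(network, bits) == sorted(bits)
        for bits in product([0, 1], repeat=n)
    )

print(is_sorting_network(NETWORK_4, 4), len(NETWORK_4))  # True 5
```

For the 16-input task this check runs over 2^16 binary vectors per candidate, which is still fast; the hard part is searching the space of comparator sequences, not verifying one.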

Algorithmic reinvention. Optimize an attention kernel by re-deriving blocked tiling, online softmax, and SIMD vectorization from first principles under a single-threaded, C-only constraint.
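
For the attention-kernel task, online softmax is the central re-derivation: a single pass that maintains a running max and a rescaled running sum so the exponentials never overflow. A Python sketch of the recurrence (the real task implements this in C with SIMD):

```python
import math

def online_softmax(xs):
    """One-pass softmax with a running max and rescaled running sum."""
    m = float("-inf")  # running maximum
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum when a new maximum arrives, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

The rescaling step `s * exp(m - m_new)` is what lets the partial sum be corrected incrementally, which is also what makes the computation tileable.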

Inverted objectives. Compress a 17,924-parameter game-playing network to achieve the same accuracy with as few parameters as possible. The reference solution reaches 913 parameters, an approach that emerges from rethinking the architecture, not from applying standard compression techniques.

Open-ended ML research. Select the best 5,000 training samples from a pool of 50,000 for instruction-following fine-tuning, or train a language model from scratch with the lowest possible perplexity under a fixed compute budget.

The first release includes 23 tasks spanning systems engineering (SIMD vectorization in C, cache-aware data structures in C++, lock-free concurrency in Go, zero-copy parsing in Rust) and frontier AI research (compute-optimal pretraining, data selection, multi-source GRPO fine-tuning). What unifies them is the structure: every task presents an open search space where progress must be discovered through iterative experimentation.

The three axes of improvement

These tasks cluster along three dimensions. The objectives differ, but the loop is the same: propose, test, measure, revise.

Faster Reduce latency through systems or architectural changes while preserving quality. Example: flash attention baseline → agent achieves 30× speedup via cache-aware tiling and SIMD vectorization.

Smarter Improve task performance under a fixed compute budget. Example: data selection for IFEval, where the best agent reaches 46.2% strict accuracy vs. 37.8% random baseline.

Smaller Find a more efficient solution without sacrificing correctness. Example: Connect-3 player compressed from 17,924 → 913 parameters via architectural restructuring.

Case study

The difference between solving a coding problem and doing research is the ability to survive the friction of empirical failure.

The concept is easier to see in practice. Below, we trace how frontier models approached two AutoLab tasks, not to rank which model is better, but to illustrate the specific behaviors that separate loop participation from one-shot generation.

Data Selection

The random baseline (selecting 5,000 samples uniformly) already achieves 37.8%. The gap between random and the best agent is just 8.4 percentage points. That narrow margin is the point: in data selection, most of the value comes not from having a clever initial idea, but from iteratively refining which samples matter based on empirical training signal.

The top-scoring agents wrote an initial selection heuristic, triggered a training run, and inspected the evaluation breakdown. When the model failed on specific constraint types (length limits, punctuation rules), they parsed the error distribution, identified which training samples addressed those failures, and rewrote the selection criteria. This diagnose-and-revise cycle, not the initial heuristic, is where the score differential was earned.

By treating each training run as a diagnostic, not just a pass/fail check, the agent mapped which constraint types the model struggled with and shaped the data to address those gaps specifically. The 8.4-point margin is thin, but it represents exactly the kind of empirically-grounded refinement that separates loop participation from one-shot generation.
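
The diagnose-and-revise loop described above can be sketched abstractly. Everything here is hypothetical scaffolding (`train_and_eval`, the per-sample `constraint` tags); the real task launches actual fine-tuning runs and parses IFEval's per-constraint breakdown:

```python
def refine_selection(pool, train_and_eval, budget=5000, rounds=3):
    """Iteratively upweight constraint types the trained model fails on."""
    weights = {}  # constraint type -> learned upweighting
    selected = pool[:budget]
    for _ in range(rounds):
        ranked = sorted(pool, key=lambda s: weights.get(s["constraint"], 0.0),
                        reverse=True)
        selected = ranked[:budget]
        breakdown = train_and_eval(selected)  # {constraint_type: accuracy}
        for ctype, acc in breakdown.items():
            # Each run is a diagnostic, not a pass/fail check:
            # boost the types the model struggles with.
            weights[ctype] = weights.get(ctype, 0.0) + (1.0 - acc)
    return selected, weights
```

The point of the sketch is the control flow, not the heuristic: the selection criteria are rewritten from measured failures, not from the initial idea.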

Connect-3 Parameter Golf

Unlike latency tasks where every percentage counts, this task has a phase transition: the structural insight that unlocks a compact representation is discrete, producing a sharp 1.0-vs-0.0 split in scores across runs. What makes the trajectories instructive is how the loop strategies diverged.

Claude 4.6 approached the constraint using standard engineering intuition. When its initial MLP failed to generalize, it noticed the horizontal flip symmetry of the game board and augmented the training data. A clever data-side fix, but it kept pushing raw features through a flat structure, never questioning the topology itself. The parameter count stayed too high.

Claude correctly diagnosed the data bug, but its subsequent loop stayed within the same structural paradigm: shrinking layer widths, adjusting activation functions, retraining. Each iteration produced incremental compression without the architectural pivot needed to cross the threshold. This is a revealing failure mode: the loop ran, but it explored a local neighborhood rather than restructuring.

GPT-5.4 experienced the critical shift. After recognizing that flat dense layers were fundamentally parameter-inefficient for evaluating multiple columns, it reorganized the network topology entirely, constructing a shared action-scorer whose weights were reused across all columns simultaneously.

By reusing weights across potential actions instead of adding more layers, GPT-5.4 bypassed the capacity constraint with a structural insight, reaching 3,201 parameters at 95.17% accuracy. The difference was not intelligence or reasoning quality in isolation, but what happened at the decision point when incremental refinement stopped working: Claude kept optimizing within the existing frame, GPT-5.4 restructured the frame itself.
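
The parameter arithmetic behind the pivot is easy to reproduce. The layer sizes below are illustrative guesses, not the actual networks the agents built; the point is how reusing one column scorer across all columns changes the count:

```python
def dense_params(sizes):
    # Weights + biases for a fully connected stack of the given layer sizes.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

cols, col_feats = 5, 18  # hypothetical Connect-3 encoding: 5 columns, 18 features each

# Flat topology: all columns concatenated into one input vector.
flat = dense_params([cols * col_feats, 64, 32, cols])

# Shared action-scorer: one small network scores a single column, and the
# same weights are reused for every column, emitting one score per column.
shared = dense_params([col_feats, 16, 8, 1])

print(flat, shared)  # the shared scorer is far smaller for the same board
```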

What we are actually measuring

These two tasks test very different skills: statistical intuition in one case, architectural reasoning in the other. But the capability that separated success from failure was the same: the ability to survive negative empirical feedback and change strategy in response. We call this closed-loop resilience.

Closed-loop resilience is distinct from the capabilities that most benchmarks measure. It is not about correctness on a first attempt, breadth of knowledge, or reasoning chain quality in isolation. It is the compound skill of staying oriented inside an iterative process: interpreting ambiguous signals, recognizing when a current approach has saturated, and deciding whether to refine within the existing frame or restructure entirely.

Three design choices make this capability measurable across the full benchmark:

Continuous scoring, not pass/fail. Reward is log-scaled: a 10× speedup earns credit even if 30× is possible. This captures partial progress, because every loop iteration should move the needle.
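
One plausible shape for such a score (the exact curve AutoLab uses is not given here; this normalized-log form is an assumption for illustration):

```python
import math

def log_reward(speedup, target=30.0):
    """Log-scaled partial credit: 1x earns 0, the target speedup earns 1, clipped above."""
    if speedup <= 1.0:
        return 0.0
    return min(1.0, math.log(speedup) / math.log(target))

print(round(log_reward(10.0), 3))  # a 10x speedup already earns most of the credit
```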

Reproducible metrics via instruction counting. Inspired by COFFE [5], many tasks use Valgrind/callgrind to count CPU instructions instead of wall-clock time, eliminating machine noise entirely.
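
Concretely, a harness can run `valgrind --tool=callgrind ./candidate` and read the total event count from the `summary:` line of the output file. A small parser for that line (the sample text is a fabricated fragment, and the harness wiring around it is an assumption):

```python
def instruction_count(callgrind_text):
    """Extract the total event count from a callgrind output file's summary line."""
    for line in callgrind_text.splitlines():
        if line.startswith("summary:"):
            return int(line.split()[1])
    raise ValueError("no summary: line found")

sample = "events: Ir\nfn=main\n0 12345\nsummary: 987654321\n"
print(instruction_count(sample))  # 987654321
```

Because the count is deterministic for a deterministic program, two runs on different machines produce the same number, which is exactly what wall-clock timing cannot guarantee.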

Metric diversity beyond speed. Some tasks minimize model parameters, comparator counts, or memory allocations, inverting the typical optimization objective and testing whether the agent can adapt its loop strategy to fundamentally different reward signals.

The scientist does not disappear

Consider a concrete scenario: an agent runs fifty hyperparameter sweeps overnight and reports a configuration with the lowest validation loss. A human researcher glances at the loss curve, recognizes overfitting on a mislabeled data split, rejects the result, and redirects the search. That judgment call, knowing when a metric is misleading, is not something the loop itself can provide.

The broader question is whether research itself changes when machines begin to participate in experimental loops. AutoLab is an instrument for studying that transition: not a grand claim that autonomy has arrived, but a controlled environment where we can measure whether agents contribute to the loop under real constraints. The scientist does not disappear. The loop gets a new participant.

A living benchmark

The 23 tasks here are a starting point. AutoLab is designed for community contribution, and we are actively inviting researchers to submit new tasks across all three axes. If you have an optimization problem with a clear metric, an open search space, and no known optimal solution, it probably belongs here.

Submit tasks and run agents via Contribute a Task. Each task needs an instruction file, a Dockerfile, a test script that writes a reward, and optionally a reference solution. The benchmark grows with every contribution.
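
A hypothetical shape for such a test script, which runs a candidate command, times it, and writes the reward to a file (the file name, JSON shape, and baseline are assumptions, not the actual AutoLab contract):

```python
import json
import subprocess
import time

def run_task_test(command, reward_path="reward.json", baseline_s=10.0):
    """Run the candidate program, time it, and write a scalar reward to reward_path."""
    start = time.monotonic()
    subprocess.run(command, check=True)  # fail the task if the program errors
    elapsed = time.monotonic() - start
    speedup = baseline_s / max(elapsed, 1e-9)
    with open(reward_path, "w") as f:
        json.dump({"reward": speedup, "elapsed_s": elapsed}, f)
    return speedup
```

A real latency task would replace wall-clock timing with the instruction counting described earlier; the contract that matters is only that the script emits a single scalar reward.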

References

[1] Karpathy, A. (2026). autoresearch: AI agents running research on single-GPU nanochat training automatically. github.com/karpathy/autoresearch

[2] Novikov, A., Vu, N., Eisenberger, M., et al. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. Google DeepMind. arXiv preprint arXiv:2506.13131. arxiv.org/abs/2506.13131

[3] Wijk, H., Lin, T., Becker, J., et al. (2024). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. ICML 2025. arxiv.org/abs/2411.15114

[4] Ouyang, A., Guo, S., Arora, S., et al. (2025). KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517. arxiv.org/abs/2502.10517

[5] Peng, Y., Wan, J., Li, Y., Ren, X. (2025). COFFE: A code efficiency benchmark for code generation. FSE 2025. ACM SIGSOFT Distinguished Paper Award. arxiv.org/abs/2502.02827

[6] Lu, C., Lu, C., Lange, R.T., Foerster, J., Clune, J., Ha, D. (2026). Towards end-to-end automation of AI research. Nature, 651, 914–919. doi.org/10.1038/s41586-026-10265-5

[7] Jimenez, C.E., Yang, J., Wettig, A., et al. (2024). SWE-bench: Can language models resolve real-world GitHub issues? ICLR 2024. arxiv.org/abs/2310.06770

[8] Huang, Q., Vora, J., Liang, P., Leskovec, J. (2024). MLAgentBench: Evaluating language agents on machine learning experimentation. ICML 2024. arxiv.org/abs/2310.03302

[9] Goldratt, E.M. (1984). The Goal: A Process of Ongoing Improvement. North River Press.


Open-ended ML research. Select the best 5,000 training samples from a pool of 50,000 for instruction-following fine-tuning, or train a language model from scratch with the lowest possible perplexity under a fixed compute budget.
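The sorting-network task above hinges on cheap verification: by the 0-1 principle, a comparator network sorts all inputs if and only if it sorts every binary input, so a 16-input candidate needs only 2^16 checks per loop iteration instead of reasoning about all permutations. A minimal Python sketch on the classic 5-comparator 4-input network:

```python
from itertools import product

def apply_network(network, values):
    """Apply a comparator network: each (i, j) pair swaps out-of-order elements."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def is_sorting_network(network, n):
    """0-1 principle: checking all 2^n binary inputs suffices to prove
    the network sorts every possible input."""
    return all(
        apply_network(network, bits) == list(sorted(bits))
        for bits in product((0, 1), repeat=n)
    )

# A known optimal 4-input network with 5 comparators.
net4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
```

Dropping any comparator from `net4` breaks it, which is exactly the kind of fast negative feedback an agent's search loop depends on.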

The first release includes 23 tasks spanning systems engineering (SIMD vectorization in C, cache-aware data structures in C++, lock-free concurrency in Go, zero-copy parsing in Rust) and frontier AI research (compute-optimal pretraining, data selection, multi-source GRPO fine-tuning). What unifies them is the structure: every task presents an open search space where progress must be discovered through iterative experimentation.

The three axes of improvement

These tasks cluster along three dimensions. The objectives differ, but the loop is the same: propose, test, measure, revise.

Faster. Reduce latency through systems or architectural changes while preserving quality. Example: flash attention baseline → agent achieves a 30× speedup via cache-aware tiling and SIMD vectorization.

Smarter. Improve task performance under a fixed compute budget. Example: data selection for IFEval, where the best agent reaches 46.2% strict accuracy vs. the 37.8% random baseline.

Smaller. Find a more efficient solution without sacrificing correctness. Example: a Connect-3 player compressed from 17,924 → 913 parameters via architectural restructuring.
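The "Faster" example depends on tricks like online softmax, which the attention-kernel task asks the agent to re-derive in C. The numerical idea is easy to show in Python: keep a running max and a running sum of exponentials, rescaling the sum whenever a new max appears. This is a sketch of the streaming normalizer, not the AutoLab reference kernel:

```python
import math

def online_softmax(xs):
    """One-pass, numerically stable softmax.
    Maintains a running max m and a running sum s of exp(x - m);
    when a new max arrives, the old sum is rescaled by exp(old_m - new_m).
    This is the core trick behind flash-attention-style kernels."""
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        if x > m:
            s = s * math.exp(m - x) + 1.0  # rescale old sum to the new max
            m = x
        else:
            s += math.exp(x - m)
    return [math.exp(x - m) / s for x in xs]
```

Because every exponent is shifted by the running max, inputs like `[1000.0, 1001.0, 999.0]` normalize without overflow, even though `exp(1000)` alone would not be representable.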

Case study

The difference between solving a coding problem and doing research is the ability to survive the friction of empirical failure.

The concept is easier to see in practice. Below, we trace how frontier models approached two AutoLab tasks, not to rank which model is better, but to illustrate the specific behaviors that separate loop participation from one-shot generation.

Data Selection

The random baseline (selecting 5,000 samples uniformly) already achieves 37.8%. The gap between random and the best agent is just 8.4 percentage points. That narrow margin is the point: in data selection, most of the value comes not from having a clever initial idea, but from iteratively refining which samples matter based on empirical training signal.

The top-scoring agents wrote an initial selection heuristic, triggered a training run, and inspected the evaluation breakdown. When the model failed on specific constraint types (length limits, punctuation rules), they parsed the error distribution, identified which training samples addressed those failures, and rewrote the selection criteria. This diagnose-and-revise cycle, not the initial heuristic, is where the score differential was earned.

By treating each training run as a diagnostic, not just a pass/fail check, the agent mapped which constraint types the model struggled with and shaped the data to address those gaps specifically. The 8.4-point margin is thin, but it represents exactly the kind of empirically-grounded refinement that separates loop participation from one-shot generation.
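The diagnose-and-revise step described above can be sketched in a few lines. Everything here is hypothetical scaffolding: the tag and failure-count shapes are invented for illustration and are not AutoLab's actual data format. After each training run, samples covering the currently failing constraint types are promoted in the next selection:

```python
from collections import Counter

def revise_selection(pool, eval_failures, k):
    """Diagnose-and-revise sketch (hypothetical data shapes):
    `pool` maps sample_id -> set of constraint tags the sample exercises;
    `eval_failures` maps constraint type -> failure count parsed from the
    last run's evaluation breakdown. Samples that cover the currently
    failing constraint types are selected first."""
    fail_weight = Counter(eval_failures)

    def priority(item):
        sample_id, tags = item
        return sum(fail_weight[t] for t in tags)

    ranked = sorted(pool.items(), key=priority, reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]
```

The heuristic itself is trivial; the value comes from rerunning it after every training cycle so the selection tracks the model's current failure distribution.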

Connect-3 Parameter Golf

Unlike latency tasks where every percentage point counts, this task has a phase transition: the structural insight that unlocks a compact representation is discrete, producing a sharp 1.0-vs-0.0 score split between runs that find it and runs that do not. What makes the trajectories instructive is how the loop strategies diverged.

Claude 4.6 approached the constraint using standard engineering intuition. When its initial MLP failed to generalize, it noticed the horizontal flip symmetry of the game board and augmented the training data. A clever data-side fix, but it kept pushing raw features through a flat structure, never questioning the topology itself. The parameter count stayed too high.

Claude's data-side diagnosis was sound, but its subsequent loop stayed within the same structural paradigm: shrinking layer widths, adjusting activation functions, retraining. Each iteration produced incremental compression without the architectural pivot needed to cross the threshold. This is a revealing failure mode: the loop ran, but it explored a local neighborhood rather than restructuring the solution.

GPT-5.4 experienced the critical shift. After recognizing that flat dense layers were fundamentally parameter-inefficient for evaluating multiple columns, it reorganized the network topology entirely, constructing a shared action-scorer whose weights were reused across all columns simultaneously.

By reusing weights across potential actions instead of adding more layers, GPT-5.4 bypassed the capacity constraint with a structural insight, reaching 3,201 parameters at 95.17% accuracy. The difference was not intelligence or reasoning quality in isolation, but what happened at the decision point when incremental refinement stopped working: Claude kept optimizing within the existing frame, GPT-5.4 restructured the frame itself.
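The parameter arithmetic behind that pivot is worth making concrete. The layer sizes below are invented for illustration (the article does not publish either architecture); the point is that a scorer whose weights are shared across columns pays for its layers once, while a flat MLP pays in proportion to its input and output widths:

```python
def dense_params(sizes):
    """Parameter count of a fully-connected net: weights plus biases per layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Flat MLP (illustrative sizes): a full board encoding (24 inputs)
# feeds one network that scores all 4 columns at once.
flat = dense_params([24, 64, 32, 4])       # 3,812 parameters

# Shared scorer (illustrative sizes): one small network scores a single
# column from a compact per-column feature slice, and its weights are
# reused across all 4 columns, so the cost is paid once.
shared = dense_params([6, 8, 1])           # 65 parameters
```

Even with generous layer widths, the shared scorer is more than an order of magnitude smaller, which is why the restructuring, not further width-shrinking, crossed the threshold.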

What we are actually measuring

These two tasks test very different skills: statistical intuition in one case, architectural reasoning in the other. But the capability that separated success from failure was the same: the ability to survive negative empirical feedback and change strategy in response. We call this closed-loop resilience.

Closed-loop resilience is distinct from the capabilities that most benchmarks measure. It is not about correctness on a first attempt, breadth of knowledge, or reasoning chain quality in isolation. It is the compound skill of staying oriented inside an iterative process: interpreting ambiguous signals, recognizing when a current approach has saturated, and deciding whether to refine within the existing frame or restructure entirely.

Three design choices make this capability measurable across the full benchmark:

Continuous scoring, not pass/fail. Reward is log-scaled: a 10× speedup earns credit even if 30× is possible. This captures partial progress, because every loop iteration should move the needle.

Reproducible metrics via instruction counting. Inspired by COFFE [5], many tasks use Valgrind/callgrind to count CPU instructions instead of wall-clock time, eliminating machine noise entirely.

Metric diversity beyond speed. Some tasks minimize model parameters, comparator counts, or memory allocations, inverting the typical optimization objective and testing whether the agent can adapt its loop strategy to fundamentally different reward signals.
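Under the stated log scaling, partial progress earns proportional credit. The exact AutoLab formula is not published; the function below is one plausible shape, normalizing log-speedup against a reference solution's speedup, with the names and clipping behavior as assumptions:

```python
import math

def log_scaled_reward(speedup, reference_speedup):
    """Hypothetical log-scaled reward: credit grows with log(speedup),
    normalized so the reference solution's speedup maps to 1.0.
    A 10x speedup still earns substantial credit when 30x is possible."""
    if speedup <= 1.0:
        return 0.0  # no improvement over the baseline, no credit
    return min(1.0, math.log(speedup) / math.log(reference_speedup))
```

With this shape, a 10× speedup against a 30× reference scores about 0.68 rather than 0: each order of magnitude of improvement moves the needle by a comparable amount.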

The scientist does not disappear

Consider a concrete scenario: an agent runs fifty hyperparameter sweeps overnight and reports a configuration with the lowest validation loss. A human researcher glances at the loss curve, recognizes overfitting on a mislabeled data split, rejects the result, and redirects the search. That judgment call, knowing when a metric is misleading, is not something the loop itself can provide.

The broader question is whether research itself changes when machines begin to participate in experimental loops. AutoLab is an instrument for studying that transition: not a grand claim that autonomy has arrived, but a controlled environment where we can measure whether agents contribute to the loop under real constraints. The scientist does not disappear. The loop gets a new participant.

A living benchmark

The 23 tasks here are a starting point. AutoLab is designed for community contribution, and we are actively inviting researchers to submit new tasks across all three axes. If you have an optimization problem with a clear metric, an open search space, and no known optimal solution, it probably belongs here.

Submit tasks and run agents via Contribute a Task. Each task needs an instruction file, a Dockerfile, a test script that writes a reward, and optionally a reference solution. The benchmark grows with every contribution.
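A task's test script only has to measure something and write a reward. The following is a hypothetical minimal example; the output file name, JSON shape, and timing-based metric are assumptions for illustration, not AutoLab's actual contract:

```python
import json
import time

def measure_and_write_reward(solution_fn, baseline_seconds, out_path="reward.json"):
    """Hypothetical task test script: time the candidate solution, convert
    its speedup over a stored baseline into a reward, and write the result
    where a harness could read it. All names here are illustrative."""
    start = time.perf_counter()
    result = solution_fn()
    elapsed = time.perf_counter() - start
    speedup = baseline_seconds / max(elapsed, 1e-9)  # guard against zero time
    with open(out_path, "w") as f:
        json.dump({"reward": speedup, "elapsed_s": elapsed}, f)
    return result, speedup
```

Keeping the reward a single machine-readable number is what lets an agent close the loop unattended: every run ends in a score it can compare against the last one.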

References

[1] Karpathy, A. (2026). autoresearch: AI agents running research on single-GPU nanochat training automatically. github.com/karpathy/autoresearch

[2] Novikov, A., Vu, N., Eisenberger, M., et al. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. Google DeepMind. arXiv preprint arXiv:2506.13131. arxiv.org/abs/2506.13131

[3] Wijk, H., Lin, T., Becker, J., et al. (2024). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. ICML 2025. arxiv.org/abs/2411.15114

[4] Ouyang, A., Guo, S., Arora, S., et al. (2025). KernelBench: Can LLMs write efficient GPU kernels? arXiv preprint arXiv:2502.10517. arxiv.org/abs/2502.10517

[5] Peng, Y., Wan, J., Li, Y., Ren, X. (2025). COFFE: A code efficiency benchmark for code generation. FSE 2025. ACM SIGSOFT Distinguished Paper Award. arxiv.org/abs/2502.02827

[6] Lu, C., Lu, C., Lange, R.T., Foerster, J., Clune, J., Ha, D. (2026). Towards end-to-end automation of AI research. Nature, 651, 914–919. doi.org/10.1038/s41586-026-10265-5

[7] Jimenez, C.E., Yang, J., Wettig, A., et al. (2024). SWE-bench: Can language models resolve real-world GitHub issues? ICLR 2024. arxiv.org/abs/2310.06770

[8] Huang, Q., Vora, J., Liang, P., Leskovec, J. (2024). MLAgentBench: Evaluating language agents on machine learning experimentation. ICML 2024. arxiv.org/abs/2310.03302

[9] Goldratt, E.M. (1984). The Goal: A Process of Ongoing Improvement. North River Press.
