🧠 阿头学 · 💬 Discussion Questions

AutoAgent: an open-source self-improving agent that automates the work of tuning agents

AutoAgent shows that "one agent optimizing another agent" is likely already stronger than manual harness tuning on structured tasks with evals and failure traces, but the authors overreach when they extrapolate this to "works in any domain."

2026-04-04 · Original link ↗

Key takeaways

  • The directional call holds up. Splitting "doing the task" and "optimizing how the task gets done" into a task agent and a meta agent is a strong architectural judgment: execution ability and improvement ability are simply not the same capability, and fusing them into one agent usually fails.
  • The real value is the closed loop, not the leaderboard. The most substantive part of AutoAgent is not the 96.5% on SpreadsheetBench or the 55.1% on TerminalBench, but that it turns harness optimization into an automated loop of edit, run, read failure traces, keep what works, revert what doesn't. That scales far better than tuning prompts by gut feel.
  • Trace feedback matters more than outcome scores. The authors note that optimization slows markedly without failure traces, and that is credible: an agent's problems usually lie not in the final score but in mid-run tool calls, broken planning, formatting errors, and missing verification, and only traces support targeted fixes.
  • "Model empathy" is suggestive, but the evidence is soft. The finding that same-model meta-agent and task-agent pairings do better deserves attention, because it hints that "cognitive compatibility" may matter more than "simply a stronger model"; the current evidence is too thin, though, and it is far from a general law without rigorous controlled experiments.
  • The marketing clearly outruns the evidence. The post elevates results on two highly structured, auto-gradable benchmarks into firm claims of "the first open-source self-improving agent library," "autonomous improvement in any domain," and "beating manual harness tuning." That conclusion is overstated and carries obvious PR packaging.

Relevance to us

  • What this means for ATou, and what to do next: when building agent products, don't pour all your effort into a single agent's prompts; prioritize an optimization framework of evals + failure traces + automatic rollback. Next step: pick one well-bounded, auto-gradable task and validate a minimal closed loop.
  • What this means for Neta, and what to do next: the "execution layer" and the "optimization layer" in content, growth, and operations should be designed separately. Next step: treat ad copy, conversion funnels, and customer-service SOPs as harnesses, replacing pure gut judgment with small-traffic experiments and failure-sample reviews.
  • What this means for Uota, and what to do next: understanding how another model or user fails is an independent capability; stronger does not automatically mean more understanding. Next step: study the gap between same-model and cross-model pairings to test whether "empathy" is a real mechanism or accidental compatibility.
  • What this means for all three, and what to do next: the most reusable lesson here is "don't look only at outcomes, capture the process trace." Next step: whether building products, doing research, or making investment calls, first ask whether there is a reusable eval, whether there are failure traces, and whether there is an anti-overfitting mechanism; otherwise automatic optimization is most likely just automatic leaderboard gaming.

Discussion starters

1. If a system needs "24+ hours plus thousands of parallel sandboxes" to win a leaderboard, is it commercially an efficiency revolution, or an expensive benchmark machine?
2. If "same model optimizing same model" really is stronger, will future agent platforms shift from "pick the best model" to "model-ecosystem lock-in"?
3. In real businesses without clean automatic evals, how much of AutoAgent's approach still holds, or is it only suited to benchmark-friendly settings?



today we're releasing AutoAgent, an open source library for autonomously improving an agent on any domain.

AutoAgent hit both the #1 on SpreadsheetBench (96.5%) and the #1 GPT-5 score on TerminalBench (55.1%) after optimizing for 24+ hours

every other entry on those leaderboards was hand-engineered. ours wasn't.

agents have been bottlenecked by harness engineering, yet we're still doing primitive grid search: tweak, eval, read error traces, repeat

this is the first concrete evidence that an agent can autonomously beat manual harness tuning on production benchmarks.

available here: https://github.com/kevinrgu/autoagent

here's what it does

point AutoAgent at a task domain with evals. a meta-agent experiments on a task agent's harness: tweaking prompts, adding tools, refining orchestration until performance climbs.

the setup is minimal by design:

  • the task agent starts with just a bash tool

  • program.md gives the meta-agent its research direction

  • agent.py is the task agent

  • a Harbor adapter connects to your benchmark
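
the pieces above could look roughly like this (a hypothetical sketch; only the file names come from the post, not any of this code):

```python
# agent.py -- hypothetical minimal task agent: starts with just a bash tool.
# AutoAgent's actual task agent may look different; this only shows the shape.
import subprocess

def bash(command: str, timeout: int = 60) -> str:
    """the task agent's single starting tool: run a shell command."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def run_task(instruction: str, llm_call) -> str:
    """one step: ask the model for a command, execute it, return the output."""
    command = llm_call(f"task: {instruction}\nreply with a single bash command.")
    return bash(command)
```

everything else (tools, verification, orchestration) is what the meta-agent is supposed to grow on top of this.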

the meta-agent then spins up 1000s of parallel sandboxes to improve the task agent. 24 hours later it has domain-specific tooling, verification loops, and orchestration logic. all discovered autonomously

the loop:

  1. edit the agent's harness
  2. run it on tasks
  3. measure performance
  4. read failure traces
  5. keep improvements, revert failures
  6. repeat
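
the same loop as a hedged sketch (every callable here is a placeholder injected from outside, not AutoAgent's real API):

```python
def optimize(harness, run_benchmark, edit_harness, read_traces, steps=50):
    """edit -> run -> measure -> read traces -> keep/revert, repeated.

    run_benchmark(harness) returns (score, failure_traces);
    edit_harness and read_traces stand in for the meta-agent."""
    best_score, _ = run_benchmark(harness)        # baseline before any edits
    for _ in range(steps):                        # 6. repeat (real budget: wall-clock hours)
        candidate = edit_harness(harness)         # 1. edit the agent's harness
        score, traces = run_benchmark(candidate)  # 2-3. run it on tasks, measure
        read_traces(traces)                       # 4. read failure traces
        if score > best_score:                    # 5. keep improvements...
            harness, best_score = candidate, score
        # ...and drop losing candidates: an implicit revert
    return harness, best_score
```

in the real system the run/measure step fans out across parallel sandboxes rather than running serially, which is what the thousands of sandboxes mentioned above are for.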

why this works: seeing like an agent

we discovered agents are better at understanding agents than we are

the Claude Code team wrote about "seeing like an agent", putting yourself in the mind of the model, designing tools shaped to its abilities

we project our own intuitions onto systems that reason differently. we're bad at empathizing with models

AutoAgent operationalizes this. the meta-agent reads the task agent's reasoning traces and already has an implicit understanding of itself: its own limitations, its own tendencies. so when it sees the task agent lose direction at step 14, it understands the failure mode as part of its own worldview and corrects it

we call this 'model empathy'

practical consequence: Claude meta-agent + Claude task agent outperformed Claude meta-agent + GPT task agent. same-model pairings win because the meta-agent writes harnesses the inner model actually understands. it shares the same weights and knows exactly how that model reasons

as agents surpass 99th percentile human performance, our intuitions about good harness design become the wrong prior. like AlphaZero, they should discover from first principles

emergent behaviors we didn't program

  • spot checking: ran isolated tasks for small edits instead of full suite. dramatically sped up iteration, saved compute

  • forced verification loops: built deterministic self-checks and formatting validators. budgeted extra turns for self-correction with main budget for the task & bonus turns for verifying and correcting output

  • writing tests: steered the task agent to build its own unit tests and checks for each task

  • progressive disclosure: dumped long contexts to files when results overflowed

  • orchestration logic: built task-specific subagents and handoffs when the domain required it
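
two of these behaviors are concrete enough to sketch (hypothetical code; none of these names come from the library):

```python
def spot_check(candidate, run_task, sampled_tasks):
    """spot checking: score a small harness edit on a few isolated tasks
    instead of the full suite, trading accuracy for iteration speed."""
    return sum(run_task(candidate, t) for t in sampled_tasks) / len(sampled_tasks)

def verified_answer(draft, validators, fix, bonus_turns=3):
    """forced verification loop: spend bonus turns on deterministic
    self-checks, repairing the draft until every validator passes."""
    for _ in range(bonus_turns):
        failures = [name for name, check in validators if not check(draft)]
        if not failures:
            break
        draft = fix(draft, failures)  # self-correct against the named failures
    return draft
```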

results

AutoAgent hit 96.5% on SpreadsheetBench and 55.1% on TerminalBench. both were the highest scores on their leaderboards. the agent iterated autonomously across 24+ hours, analyzing its own failure traces and improving

what we learned

  1. splitting helps. we tried one agent improving itself. didn't work. being good at a domain and being good at improving at that domain are different capabilities. the meta/task split lets each specialize

  2. traces are everything. when we only gave scores without trajectories, improvement rate dropped hard. understanding why something improved matters as much as knowing that it improved. traces give the meta-agent interpretability over the task agent's reasoning—that's what makes targeted edits possible

  3. agents overfit. the meta-agent gets lazy, inserting rubric-specific prompting so the task agent can game metrics. we constrain this by forcing self-reflection: "if this exact task disappeared, would this still be a worthwhile harness improvement?"

  4. meta-agent quality matters. harness edits are often inspired by the meta-agent's own tooling. a poorly designed meta-agent produces poor task agents. Codex doesn't work well as a meta-agent—it ignores instructions to never stop improving (observed in autoresearch too), and the resulting task agent gives up too early
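
point 2 is easy to make concrete: the difference between score-only feedback and trace feedback is the difference between these two messages (a hypothetical helper, not AutoAgent's real interface):

```python
def feedback_message(score, failure_traces, include_traces=True):
    """build the feedback the meta-agent sees after an eval run. with traces
    it can diagnose *why* runs failed; with a bare score it can only guess."""
    lines = [f"suite score: {score:.1%}"]
    if include_traces:
        for t in failure_traces:
            lines.append(f"task {t['task']} failed at step {t['step']}: {t['error']}")
    return "\n".join(lines)
```

withholding the trace lines is exactly the ablation the authors describe, and it is what made improvement slow down.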

why this matters

the hard part of building agents: every domain needs a different harness, and harness engineering requires someone who deeply understands both the domain and how models behave

AutoAgent collapses that. domain experts just define what success looks like. the meta-agent figures out the harness

companies don't have one workflow to automate, they have hundreds. each needs a different harness.

no team can hand-tune hundreds of harnesses. a meta-agent can

this is infrastructure for agent fleets: continuously spinning up, optimizing, and maintaining task-specific agents across entire organizations

what's next

we built AutoAgent internally but decided to open source it: https://github.com/kevinrgu/autoagent

describe a spec, point it at evals, let it climb. everyone should be able to do this

self-improving agents are still in their infancy. next frontier: harnesses that dynamically assemble the right tools and context just-in-time for any task

we're releasing a product around this soon. early access in comments

📋 Discussion archive

Discussion in progress…