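Drive Agent Design with Evals, Not by Tuning the Harness on Vibes

This article correctly identifies the key shift in agent engineering: evals should be the hub of design rather than an after-the-fact acceptance test. But it overstates the claim that “evals are like training data” and clearly underestimates overfitting, judge bias, and engineering cost.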
2026-04-18

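Key points

Putting evals first is right. The author's most defensible claim is that agent design can no longer run mainly on intuition and rules of thumb. Evals have to be the core feedback interface, because executable, attributable, reproducible signals do more to drive stable iteration than a pure Markdown spec.
“agent = fit(model, harness, evals)” is a strong abstraction. The formula has real methodological value: it splits agent capability into model, harness, and evals, which forces a team to attribute a failure clearly to the model, the harness, or a goal that was never defined. That decomposition sits a level above generic talk of prompt engineering.
The training-data analogy is suggestive but not rigorous. The article describes evals as something like samples and gradients. The metaphor travels well, but strictly speaking it swaps one concept for another: conventional training is continuous optimization, whereas harness changes are highly discrete, strongly path-dependent, and often need human intervention. “Fitting to evals” therefore does not naturally yield a reliable, generalizable update direction.
Goodhart risk is clearly downplayed. If a team treats “passing the evals” as the goal itself, the system will most likely learn to pass the test rather than get stronger. When eval coverage is narrow, long-tail samples are scarce, and LLM-judge noise is high, real-world performance is likely to be overestimated. The article does not address this head-on.
Evals as an asset are the future moat. The argument is incomplete, but the article's read on where the industry is heading is right: the competitiveness of specialized agents will increasingly depend on who can encode what “good performance” means in their business into a continuously updated eval suite, not on who wrote a flashier prompt at some point.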
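One common hedge against the Goodhart risk raised above is to hold out part of the eval set and never tune against it. The sketch below is a minimal illustration, assuming evals are plain callables that return True on pass. `split_evals`, `pass_rate`, and `generalization_gap` are hypothetical names, not something the article defines.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: an agent maps a prompt to an answer,
# and an eval returns True when the agent behaves as desired.
Agent = Callable[[str], str]
Eval = Callable[[Agent], bool]


def split_evals(
    evals: Sequence[Eval], holdout_frac: float = 0.3, seed: int = 0
) -> Tuple[List[Eval], List[Eval]]:
    """Shuffle evals deterministically and split them into (dev, holdout)."""
    shuffled = list(evals)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_frac))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def pass_rate(agent: Agent, evals: Sequence[Eval]) -> float:
    """Fraction of evals the agent passes (0.0 for an empty set)."""
    if not evals:
        return 0.0
    return sum(1 for e in evals if e(agent)) / len(evals)


def generalization_gap(
    agent: Agent, dev: Sequence[Eval], holdout: Sequence[Eval]
) -> float:
    """Dev pass rate minus holdout pass rate.

    A large positive gap hints that the harness was tuned to the dev evals
    rather than to the underlying behavior.
    """
    return pass_rate(agent, dev) - pass_rate(agent, holdout)
```

The harness would be iterated against `dev` only, with `holdout` checked occasionally. This does not address judge noise or narrow coverage, which need separate treatment.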

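How this relates to us

For ATou: what it means and what to do next. If ATou is still changing prompts, flows, and tool routing by feel, efficiency will keep getting worse. The next step is to turn high-frequency failure cases into 20-50 core evals, then require every change to state which eval it improved and whether it broke any existing one.
For Neta: what it means and what to do next. To judge whether an agent project has a real moat, Neta should look past the demo and the model name and ask whether the project keeps accumulating traces, distilling them into evals, and maintaining a regression suite. The next step is to add “eval asset density” and “eval update mechanism” to the project assessment framework.
For Uota: what it means and what to do next. If Uota cares about productization, the article offers a practical standard: an agent that can actually be operated is not one that “looks smart” but one that “can be explicitly verified”. The next step is to restructure requirements documents around which behaviors can be verified programmatically and which can only be scored subjectively.
For all three: what it means and what to do next. What is most worth adopting from this approach is not the slogan but the loop: mine traces for failures, turn them into evals, change the harness, run regressions. The next step is to pilot on one concrete vertical task rather than expect a grand framework to grow a strong agent automatically in one go.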
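The rule that every change must state which eval it improved and whether it broke an old one can be checked mechanically. Here is a minimal sketch, assuming each eval run is recorded as a mapping from eval name to pass/fail. `diff_runs` and `accept_change` are hypothetical helpers used for illustration.

```python
from typing import Dict, List, NamedTuple


class RunDiff(NamedTuple):
    fixed: List[str]      # failed before the change, pass after it
    regressed: List[str]  # passed before the change, fail after it


def diff_runs(before: Dict[str, bool], after: Dict[str, bool]) -> RunDiff:
    """Compare two eval runs over the evals that appear in both."""
    shared = sorted(set(before) & set(after))
    fixed = [name for name in shared if not before[name] and after[name]]
    regressed = [name for name in shared if before[name] and not after[name]]
    return RunDiff(fixed, regressed)


def accept_change(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Accept a harness change only if it fixes something and breaks nothing."""
    diff = diff_runs(before, after)
    return bool(diff.fixed) and not diff.regressed
```

A stricter or looser gate is easy to imagine, for example tolerating a regression on an eval that is known to be flaky. The point is only that the question “what improved, what broke” has a concrete answer per change.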

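Discussion prompts

1. If an agent team treats evals as “training data”, what mechanisms keep the system from merely drilling the test instead of actually generalizing better?
2. How far can LLM-as-judge be treated as a “verifiable signal”, and which tasks must stick to programmatic verification or human gold labels?
3. Is a team's core moat more likely to come from model choice, harness design, or accumulated eval assets?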

Data Driven Agent Design with Evals & Hill Climbing Algorithms

This is a mental model dump I’ve been thinking through and iterating on as we build self-improvement infra around agents:
- mining trace data to find errors and tweak the agent harness
- building and maintaining evals
- using evals to guide the agent update/generation process

What are the inputs for fitting an agent? The main idea: what does a sklearn-style fit(model, data) function look like for agents?

agent = fit(model, harness, evals)

Data-driven agent design: evals are training data for agents. Every eval encodes behavior we want to see in our agent, just as every training data point in standard ML produces a gradient that shifts the model weights.

Every eval we fit our agent to casts a vote for how to alter the harness so that the eval passes.

“Can I start from an empty harness, fit to evals, and produce a great agent? Should that be how we do things, and then inject some more human priors?”
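The question of starting from an empty harness and fitting to evals can be read as hill climbing, which the title alludes to. The following is a sketch under strong assumptions, not the author's implementation: the harness is reduced to a set of rule strings, `propose` is a hypothetical function that suggests candidate edits (in practice an LLM or a human reading traces), and each eval is a callable that votes pass or fail on a model-plus-harness pair.

```python
from typing import Callable, FrozenSet, Iterable, Sequence

Harness = FrozenSet[str]                  # assumed: a harness is just a set of rules
Model = Callable[[str, Harness], str]     # assumed: model(prompt, harness) -> output
Eval = Callable[[Model, Harness], bool]   # True when the eval passes


def score(model: Model, harness: Harness, evals: Sequence[Eval]) -> int:
    """Count passing evals; each eval casts one vote for this harness."""
    return sum(1 for e in evals if e(model, harness))


def fit(
    model: Model,
    evals: Sequence[Eval],
    propose: Callable[[Harness], Iterable[Harness]],
    harness: Harness = frozenset(),
    max_steps: int = 20,
) -> Harness:
    """Greedy hill climbing: keep an edit only if it strictly raises the score."""
    best = score(model, harness, evals)
    for _ in range(max_steps):
        improved = False
        for candidate in propose(harness):
            candidate_score = score(model, candidate, evals)
            if candidate_score > best:
                harness, best, improved = candidate, candidate_score, True
                break  # restart proposals from the improved harness
        if not improved:
            break  # local optimum: no proposed edit helps
    return harness
```

Because the search is greedy and the edits are discrete, it can stall at a local optimum, and a higher pass count says nothing by itself about behavior outside the eval set. Human priors would enter through the starting harness and through what `propose` is allowed to suggest.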

Verifiable signals: evals are akin to specs, but better, because they are verifiable and measurable.

We can use rubrics that give a dense feedback signal for work, programmatic verification, and LLM-as-a-judge to evaluate subjective behavior.
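The three kinds of signal named above can share one interface. Here is a minimal sketch, assuming every grader returns a score in [0, 1]. The judge is passed in as a plain callable because the article implies no particular LLM API, and all names here are hypothetical.

```python
from typing import Callable, Dict

Grader = Callable[[str], float]  # assumed: output text -> score in [0, 1]


def programmatic(expected: str) -> Grader:
    """Exact-match check: binary, cheap, and fully reproducible."""
    return lambda output: 1.0 if output.strip() == expected else 0.0


def rubric(criteria: Dict[str, Callable[[str], bool]]) -> Grader:
    """Dense signal: the fraction of rubric criteria the output satisfies."""
    if not criteria:
        raise ValueError("rubric needs at least one criterion")

    def grade(output: str) -> float:
        met = sum(1 for check in criteria.values() if check(output))
        return met / len(criteria)

    return grade


def llm_judge(ask_judge: Callable[[str], str], question: str) -> Grader:
    """Subjective check: ask_judge stands in for a call to a judge model."""

    def grade(output: str) -> float:
        prompt = f"{question}\n\nAnswer:\n{output}\n\nReply PASS or FAIL."
        verdict = ask_judge(prompt)
        return 1.0 if verdict.strip().upper().startswith("PASS") else 0.0

    return grade
```

Only the programmatic grader is deterministic by construction. A judge-based score depends on the judge model and prompt, so it is usually worth spot-checking against human labels before treating it as a verified signal.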

You can measure at a glance which evals pass and which don’t. This is very useful; you can’t do the same attribution as easily with a simple Markdown spec.

Iterating on evals over time: evals are model- and harness-dependent. “Spring cleaning” and updating evals is important as some of them stop being needed.
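One way to make spring cleaning concrete is to flag evals that have stopped carrying information. The sketch below assumes, purely as an illustration, that an eval becomes a retirement candidate once it has passed in each of its last `window` runs. The article does not define when an eval is no longer needed, and `stale_evals` is a hypothetical helper.

```python
from typing import Dict, List


def stale_evals(history: Dict[str, List[bool]], window: int = 5) -> List[str]:
    """Return names of evals that passed in each of their last `window` runs.

    history maps eval name -> pass/fail results, oldest first.
    Evals with fewer than `window` recorded runs are never flagged.
    """
    return sorted(
        name
        for name, results in history.items()
        if len(results) >= window and all(results[-window:])
    )
```

Flagged evals are candidates for review, not automatic deletion: an eval that always passes may still be a useful regression guard after a model or harness swap.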

Agent design is focused on vibes today, but could benefit from being open to more data-driven design. Just as frontier labs spend millions on data quality, teams can invest time in fantastic eval curation and design.

The future of specialized agents will depend on encoding that specialization as something measurable. Data matters, and good evals are a data signal to build a good agent 🚀
