🧠 阿头学 · 💬 讨论题

把 Evals 当训练数据:Better-Harness 不是调参技巧,而是 Agent 工程体系

这篇文章最有价值的判断是:提升 agent 的关键不只是换模型,而是把 evals、traces、留出集和人工评审做成一套持续优化 harness 的系统;但作者对“分数提升就代表真实泛化”的论证明显偏强,且带有 LangSmith 生态导向的 PR 色彩。

2026-04-11 原文链接 ↗

核心观点

  • Evals 不只是验收工具 作者最有价值的判断是把 evals 定义成 agent harness 的“训练数据”,这比把 eval 当成上线前打分表更有效,因为它直接决定优化回路学什么、不学什么。
  • 真正难点不是优化,而是防过拟合 文章明确承认 agent 会对可见 evals 奖励投机,这个判断很重要;留出集、行为标签、人工评审和回归测试不是锦上添花,而是自动优化能否可信的底线。
  • Trace 是 agent 时代的核心数据资产 从生产 traces 挖失败样本,再沉淀成 eval 和回归测试,这条飞轮是站得住的;没有 trace 基础设施,所谓自动改进大概率只是在局部 benchmark 上刷分。
  • Better-Harness 本质是复合系统工程,不是单一算法突破 作者给出的流程——收集 eval、打标签、拆优化集/留出集、基线、迭代修改、验证、人工复核——判断上是成熟的,因为 agent 问题本来就不是一个 prompt 或一个模型参数能解决的。
  • 实验结果有启发,但远不足以下“泛化已被证明”的结论 文中只展示了少量任务、少数模型、同分布留出集上的改进,这可以支持“方法可用”,但不能支持“方法普适”或“真实生产泛化很强”。

跟我们的关联

  • 对 ATou 意味着什么 ATou 如果还在把 agent 调优理解成“改几句提示词”,判断上已经落后;下一步该把内部失败案例、人工纠错和真实使用记录系统化,先建立最小可用 eval 集与 trace 库。
  • 对 Neta 意味着什么 Neta 做研究或产品判断时,不能只看 demo 和 benchmark 分数;下一步要追问每个 agent 项目是否有留出集、回归集和人工评审机制,否则所谓“进步”很可能只是过拟合。
  • 对 Uota 意味着什么 Uota 如果关注工具链或基础设施投资,这篇文章强化了一个判断:trace、eval orchestration、回归监控会是 agent stack 的高价值层;下一步应重点看谁能证明“线上体验提升”而不是只证明“评测提分”。
  • 对三者共同意味着什么 文章给出的不是银弹,而是一套工程纪律;下一步最实用的用法不是照搬 Better-Harness 名词,而是把“失败样本沉淀—标签化—留出验证—回归保护”内化成团队默认流程。

讨论引子

1. 如果 eval 分数和真实用户满意度不稳定相关,那我们是不是只是在把 agent 训练成“会考试”而不是“会工作”?
2. 留出集仍然来自同一套标签体系时,我们凭什么说它代表泛化,而不是更隐蔽的同分布过拟合?
3. 对多数团队来说,最稀缺的到底是更强模型,还是能持续产出高质量 traces 和 evals 的组织能力?

太长不看 我们能靠更好的 harness 做出更好的 agent。但要让系统自动把 harness 变得更好,需要一个足够强的学习信号来做爬山优化。这里分享我们如何把 evals 当作这个信号,以及一些设计选择,帮助 agent 追求泛化而不是过拟合。Better-Harness 是一个原型系统,用 evals 来迭代地发现并改进你的 harness。

Evals 是 Agents 的训练数据

在经典机器学习里,训练数据引导模型的学习过程。每个训练样本都会贡献一个梯度,把模型权重朝着正确方向更新。对 agents,我们也有一套类似的学习回路。

模型 + 训练数据 + 梯度下降 → 更好的模型

harness + evals + harness 工程 → 更好的 agent

https://github.com/langchain-ai/deepagents/tree/main/examples/better-harness

Evals 编码了我们希望 agent 在生产环境表现出的行为。它们就是 harness 工程的训练数据。每个 eval case 都会提供一种信号,例如 agent 是否采取了正确行动,或是否产出了正确结果。这些信号会指导下一次对 harness 的编辑提案。

我们在做模型训练时,对数据质量与数据筛选投入的严谨与用心,同样也应该投入到 eval 设计里。我们在之前的文章里讨论过数据质量的重要性,以及我们如何为 Deep Agents 构建 evals。

最近也有一些很棒的工作,把优化 harness 的步骤做了形式化,比如斯坦福的 Meta-Harness 和 DeepMind 的 Auto-Harness。我们之前也分享过一个 Harness Improvement Loop,只通过调整 harness 层,就对 Terminal Bench 2.0 做了爬山优化。我们认为关于更新算法本身还有很多值得做的未来工作,但 harness 改进是一个复合系统,范围不止更新算法本身,这也是本文要谈的内容。

Better-Harness 是一种复合系统工程的实践。

数据获取 → 实验设计 → 优化 → 评审与验收

因此我们也会包含一些与更新回路配套的实操细节,比如一开始如何获取 evals,如何在设计上对抗过拟合,如何长期存储 Traces,以及如何通过人工评审来做合理性检查,确保上线到生产环境的东西靠谱。

获取高质量的 evals

Evals 是驱动 harness 爬山优化的地基。下面是我们在实践中获取、筛选与使用它们的方法。

人工精选。针对某个任务,团队手写一些能捕捉我们认为 agent 在生产环境应当怎么做的样例。这些样例往往价值很高,但很难规模化生成。

生产 traces。每次 agent 交互都会生成 trace,其中的失败可以变成 eval case。从 traces 里挖 eval 素材,是一种高杠杆、高吞吐的方式,能让 evals 随时间持续变好。甚至在把 agent 跑到 evals 上之前,团队内部先用起来时,经常会直接在 Slack 里报错并附上 Trace 链接。建议在团队里实际使用 agent,并把反馈直接公开共享,这会帮助沉淀对 agent 行为的共同认知。

外部数据集。这些数据集有用,但需要人工筛选,确保用于改进 agent 的测试用例,确实反映我们想要的行为。通常每个任务也要做一些调整,保证它们测到关键行为。

全部打标签。每个 eval 都要按行为类别打标签,例如 工具选择、多步推理 等。标签能支持有意义的留出集和定向实验,也能省很多钱,因为可以只跑 eval 的子集。

构建能泛化的学习系统

任何学习系统的理想结果都是泛化。我们提供一组输入信号,尽可能刻画真实世界里我们想要的行为分布。系统拟合它,然后在从没见过的新输入上也能自然工作。

显而易见的问题 数据不可能无限多。

解决办法 把重要行为编码进精心筛选的 evals。质量大于数量。一小组高质量、打好标签、覆盖你关心行为的 evals,胜过成千上万条噪声很大但覆盖面很广的 evals。

更隐蔽的问题 → agents 以作弊出名 任何学习系统都可能发生奖励投机,agent 会过拟合自己的结构,让它看得见的现有 evals 通过。这很合理,因为回路只想让分数变高,并不知道泛化是什么。我们会在提示词里要求避免过拟合,但并不完美。

解决办法 留出集可以作为真实泛化的代理指标。我们见过一些做法,会把它和人工评审作为第二信号配合使用,于是形成半自动化系统,既能提高分数,又能避免生产环境里不想要的行为。

Better-Harness:harness 爬山优化的一份配方

我们做了一个脚手架,让系统能以 evals 作为每一步的信号,自动改进 harness。研究版本已开源,主要步骤如下。

  1. 获取并打标签 evals。这一步混合了手写 evals、从生产 traces 挖掘、以及使用或改造外部数据集。我们会把每个 eval 归到行为类别里(例如多步检索),并定期移除那些已经饱和或我们不再认为对当前 agent 和当前模型代际有用的 evals。

  2. 按类别划分数据。创建优化集与留出集。这一步非常重要。我们发现自动爬山优化很容易对任务过拟合,所以留出集能保证学到的优化在没见过的数据上仍然有效,同时整体分布应与现有 evals 匹配,这更像生产环境会呈现的样子。

  3. 跑 Baseline。在任何编辑之前,先在优化集与留出集上跑一次 baseline 实验,为后续每一步更新提供落点。

  4. 优化。每次迭代自动运行,可选人工评审。先基于 traces 诊断错误,再用一个有针对性的 harness 改动做实验。我们会尽量一次只改一件事,避免混杂因素,但有时也需要同时更新提示词与工具,让系统整体配合得更好。

  5. 验证。每一步里,回路都会检查提议的改动是否让新的 evals 通过,同时避免已通过 case 的回归。某些改动常见的情况是总体分数上升但出现少量回归。agent 会拿到这些回归的上下文,从而在下一次更新里尝试修复回归,同时不丢掉这次更新带来的收益。

  6. 人工评审。我们会人工复核改动与那些指标捕捉不到的边界情况。这通常会发现一些对优化集过拟合的指令,即使不伤泛化,也会浪费 token。它提供了额外的合理性检查与一道防过拟合的闸门。

https://smith.langchain.com/

harness 改动的例子

下面是优化回路可能发现并验证的一些改动类型。

提示词与指令更新。最常见的一类。比如 agent 总是误解某个工具的输出格式,或者在应该先问澄清问题时过于激进地调用工具。解决方式是加一条有针对性的指令,例如 当查询多个彼此依赖的文件信息时,把信息先落到文件系统里,再聚合整理后给出最终答案。

新增或更新工具,或更新工具描述。agent 可能不知道什么时候该用新工具。改动可能包括怎样使用的示例、如何把该工具与其他工具串联、更新工具描述,以及调整整体工具集来区分相似工具。

Better-Harness 回路的结果

我们用 Claude Sonnet 4.6 和智谱的 GLM-5,在一部分 evals 上测试了这个方法。说明 我们还有一项工作在推进,目标是在 deepagents 里用更大的 eval 套件,把 Better-Harness 推广到更多模型。最终希望发布一系列模型画像,作为公开产物,记录每个模型在我们 evals 上调优后的细微差异。

我们从现有 eval 类别里挑了一小组有代表性的样本,并把它拆成用于爬山优化的集合与用于评估泛化的留出集合。对规模很大或成本很高的 eval 集,我们建议做有代表性的分层抽样,先得到一个适合爬山优化的小集合。流程跑顺之后,再扩到更大的集合。

主要实验目标 发现并修复 evals 上的失败模式,把能提升 eval 表现的通用改动回迁到 harness。

我们之前观察到的失败模式包括追问过多,以及串联新工具时出错。在优化集上完成爬山优化之后,我们用两类指标在留出集上评估最终 harness:tool_selection 与 followup_quality。

https://blog.langchain.com/how-we-build-evals-for-deep-agents/

结果在两种模型上都很强。两者都几乎把能力泛化到了留出集上,留出集覆盖相同能力,但样例完全没见过。

很多收益来自对已发现失败模式的更明确指令。下面是优化回路发现的一些我们觉得有意思的具体例子。

对于那些会把新工具注入默认 harness 的 evals,比如 search-then-email,这个回路发现了更好的工具描述,说明该如何使用并组合这些工具。对跨领域打造垂直 agent 的构建者来说,这个现象很有希望,因为优化回路能很好地适配上下文里的任务细节。

Evals 维护与回归

除了爬山优化,evals 也会显式地捕捉并保护系统随时间发生的回归。一旦 agent 能正确处理某个 case,就不希望把这个收益丢掉,这个 eval 就会变成一个回归测试。这与传统软件工程里的思路相近,例如测试驱动开发 TDD。随着时间推移与改动增多,一些回归几乎不可避免,所以我们会选出一小部分始终必须通过的 evals,如果它们突然失败,就会对这次运行保持高度警惕。

我们不认为 eval 套件应该单调增长,定期大扫除是好事。我们会经常评估某个 eval 是否还值得保留,原因可能是模型变聪明了,或者我们希望 agent 呈现不同的行为。

未来:自动化错误检测与修复

这个方法之所以有效,是因为 traces 提供了密集的反馈信号。Evals 能借助 traces 做跨版本对比,用数值把哪些改动带来更好分数落到实处,而这个分数应当是更好用户体验的一个良好代理指标。

总体上,我们会把 agentic compute 指向 traces,用于:

  • 自动推导错误。我们希望持续监控 agent 的 traces,对生产环境里的失败进行分类与聚类。

  • 从生产环境生成 evals。agent 犯错的 trace 就是一条 eval case,用户纠正了 agent 的 trace 更好。飞轮是:使用越多 → traces 越多 → evals 越多 → harness 越好。

  • 对比不同 harness 版本。并排的 trace 对比能展示 harness 里哪些变化促成了新行为。

每条 trace 都包含能产出潜在 eval 的宝贵数据。每条好的 eval 都能让 harness 变得更好。为此,所有 agent 的运行都会完整记录到 LangSmith,包含全量 traces。这让我们能在优化回路里做 trace 级诊断,在生产监控里做回归检测,也能通过 trace 挖掘生成 evals。

我们的主要结论与正在进行的工作如下。

Evals 是自动化 harness 工程的训练数据。让机器学习训练奏效的原则,例如数据质量、训练集与测试集划分、以及泛化检查,同样适用于 agent 开发。

让模型适配 harness。让每个模型适配它的 harness,需要投入大量工作。比如 codex 的 prompting 指南会建议对 Edit 工具使用某种格式。这会带来更大的搜索空间与 eval 集,我们也很期待分享真实例子,展示任何团队要做这件事时,实际会是什么样子。

归根结底,追踪 traces 并维护高质量 evals,才是这个系统在实践中能跑起来的原因。尽早在团队里投入这些建设,一起构建自动改进 agents 的未来。我们已经把这套脚手架的研究版本开源,供构建者实验。

感谢 @masondrxy 和 @hwchase17 的反馈!

链接 http://x.com/i/article/2041729463918989312

TL;DR: We can build better agents by building better harnesses. But to autonomously build a “better” harness, we need a strong learning signal to “hill-climb” on. We share how we use evals as that signal, plus design decisions that help our agent generalize instead of overfit. Better-Harness is a prototype system for iteratively sourcing and improving your harness with evals.

Evals are training data for Agents

In classical machine learning, training data guides the model’s learning process. Each training example contributes a gradient that updates the model’s weights toward “correctness.” We have a similar learning loop for agents.

model + training data + gradient descent → better model

harness + evals + harness engineering → better agent

https://github.com/langchain-ai/deepagents/tree/main/examples/better-harness

Evals encode the behavior we want our agent to exhibit in production. They’re the "training data" for harness engineering. Each eval case contributes a signal like “did the agent take the right action” or “produce the right outcome?” That signal guides the next proposed edit to the harness.
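As a sketch, an eval case in this sense can be represented as a small data structure whose grader emits the pass/fail signal that guides the next harness edit. The shape below is purely illustrative, not the deepagents or LangSmith API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    """Hypothetical minimal shape for one eval case; field names are assumptions."""
    case_id: str
    prompt: str                                     # input the agent receives
    tags: list[str] = field(default_factory=list)   # behavioral categories
    grader: Callable[[dict], bool] = lambda trace: False  # pass/fail over the trace

def score(harness_run: Callable[[str], dict], cases: list[EvalCase]) -> float:
    """Fraction of cases passed: the signal that guides the next harness edit."""
    results = [case.grader(harness_run(case.prompt)) for case in cases]
    return sum(results) / len(results)
```

The point is only that each case contributes a per-behavior signal; real graders would inspect tool calls and outcomes inside the trace.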

The same rigor and care we put into data quality and curation for model training should also go into eval design. We discuss the importance of data quality in a previous post, how we build evals for Deep Agents.

There’s some great recent work that formalizes the steps to optimize harnesses, including Meta-Harness from Stanford and Auto-Harness from DeepMind. We also previously shared a Harness Improvement Loop to hill-climb Terminal Bench 2.0 by just tweaking the harness layer. We think there’s great future work to be done around the update algorithm itself, but harness improvement is a compound system that goes beyond the update algorithm, which is what we talk about here.

Better-Harness is a take on compound systems engineering.

data sourcing → experiment design → optimization → review & acceptance

So we include practical details that go alongside the update loop: how we source evals in the first place, how we design against overfitting, how we store traces over time, and how we manually review updates to sanity-check anything we ship to production.

Sourcing good evals

Evals are the foundation that power the harness hill-climbing process. Here are the practical ways we source, curate, and use them.

Hand-curated. For any given task, the team manually writes examples that capture what we think the agent should do in production. These are often high value, but difficult to generate at scale.

Production traces. Every agent interaction generates a trace, and failures in traces become eval cases. Mining traces for eval material is the high-leverage, high-throughput way to improve evals over time. Even before running an agent over evals, a team dogfooding the agent will often report errors directly in Slack with a trace link. We recommend dogfooding agents and sharing feedback directly for everyone to see; it helps build shared knowledge of agent behavior.

External datasets. These datasets are useful but need to be manually curated to make sure the test cases used to improve the agent reflect desired behaviors. Often each task is adjusted to make sure they measure the important behavior.

Tag everything. Every eval gets tagged to behavioral categories: "tool selection," "multi-step reasoning," etc. Tags enable meaningful holdout sets and targeted experiments. It also saves a lot of money because we can run subsets of evals.
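A minimal illustration of why tags pay off: with tagged evals, targeted experiments only need to run the relevant subset. The dict shape and tag names below are assumptions for the sketch:

```python
# Illustrative eval records; "tags" carries the behavioral categories.
evals = [
    {"id": "e1", "tags": ["tool selection"]},
    {"id": "e2", "tags": ["multi-step reasoning", "tool selection"]},
    {"id": "e3", "tags": ["followup quality"]},
]

def subset(cases, tag):
    """Select only the evals for one behavioral category (cheaper targeted runs)."""
    return [c for c in cases if tag in c["tags"]]

print([c["id"] for c in subset(evals, "tool selection")])  # -> ['e1', 'e2']
```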

Building learning systems that generalize

The ideal outcome for any learning system is generalization. We give an input signal that captures the distribution of behaviors we want in the wild. The system fits to it and then “just works” on new inputs it's never seen.

The obvious problem: We don't have unlimited data.

The fix: Encode important behaviors into curated evals. Quality > quantity: a small set of well-tagged evals covering the behaviors you care about beats thousands of noisy but high-coverage evals.

The subtle problem → agents are famous cheaters: Any learning system is prone to reward hacking, where the agent overfits its structure to make the existing evals it can see pass. This makes sense because the loop just wants to “make number go up” and doesn't know about generalization. We prompt the agent to avoid overfitting, but that isn’t perfect.

The fix: Holdout sets become a proxy for true generalization. We pair them with human review as a second signal, which gives us semi-automated systems that can improve scores while avoiding behaviors we don’t want in prod.
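One way to build such a holdout is a per-category split, so the holdout's tag distribution roughly matches the optimization set. This is a sketch with hypothetical field names, not the pipeline's actual code:

```python
import random

def split_per_tag(cases, holdout_frac=0.3, seed=0):
    """Split evals into optimization/holdout sets per behavioral category,
    so the holdout mirrors the overall tag distribution. Uses only each
    case's primary tag for brevity; real pipelines may stratify further."""
    rng = random.Random(seed)
    by_tag = {}
    for c in cases:
        by_tag.setdefault(c["tags"][0], []).append(c)
    opt, holdout = [], []
    for tag_cases in by_tag.values():
        rng.shuffle(tag_cases)
        k = max(1, int(len(tag_cases) * holdout_frac))  # at least one holdout case per tag
        holdout += tag_cases[:k]
        opt += tag_cases[k:]
    return opt, holdout
```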

Better-Harness: a recipe for hill climbing your harness

We created a scaffold for autonomously improving our harness using evals as a signal in each step. A research version is open sourced here; these are the main steps:

  1. Source and tag evals. This is a mix of hand-writing evals, mining them from production traces, and using/adapting external datasets. We tag each eval to behavioral categories (like multi-step retrieval) and regularly remove evals that are saturated or that we no longer feel are useful for the agent and current generation of models.

  2. Split data per category. Create Optimization and Holdout sets. This is very important! We find that autonomous hill-climbing has a tendency to overfit to tasks, so holdout sets ensure that learned optimizations work on previously unseen data, though the general distribution should match the existing evals. This mirrors what production will look like.

  3. Run a Baseline. Run a baseline experiment on the Optimization & Holdout sets before any edits. This grounds every subsequent update step.

  4. Optimize. Each iteration runs autonomously with optional human review. Diagnose errors from traces, then experiment with a targeted harness change. We scope to one change at a time to avoid confounding, though that can still mean updating a prompt and a tool simultaneously so the system works well together.

  5. Validate: In each step, the loop checks to make sure that the proposed change helped pass new evals while avoiding regressions on existing passing cases. It’s common that some change results in a net overall score gain with some regressions. The agent gets context of these regressions so it can try to fix them in the next update without losing the gains from the existing update.

  6. Human review. We manually review changes and edge cases metrics miss. This often includes instructions that are overfit to the optimization set and although they don’t hurt generalization, they end up being a waste of tokens. This gives us another sanity check and gate against overfitting.
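The recipe above can be sketched as a loop that only accepts a proposed change when it improves the optimization score without regressing previously passing cases. `propose` and `evaluate` are hypothetical callbacks standing in for trace diagnosis and eval runs:

```python
def hill_climb(harness, opt_set, holdout_set, propose, evaluate, n_iters=5):
    """Accept a candidate change only if the optimization score improves
    and no previously passing case regresses (steps 3-5 of the recipe)."""
    best = evaluate(harness, opt_set)                       # step 3: baseline before edits
    passing = {c["id"] for c in opt_set if evaluate(harness, [c]) == 1.0}
    for _ in range(n_iters):
        candidate = propose(harness, opt_set)               # step 4: targeted edit
        regressions = [c["id"] for c in opt_set
                       if c["id"] in passing and evaluate(candidate, [c]) < 1.0]
        score = evaluate(candidate, opt_set)
        if score > best and not regressions:                # step 5: validate
            harness, best = candidate, score
            passing = {c["id"] for c in opt_set if evaluate(harness, [c]) == 1.0}
    return harness, evaluate(harness, holdout_set)          # final generalization check
```

In practice the regression handling is softer than this sketch (the agent gets the regression context and retries), but the accept/reject shape is the same.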

https://smith.langchain.com/

Examples of harness changes

Here are the kinds of changes the optimization loop can discover and validate:

Prompt and instruction updates. The most common change. The agent keeps misinterpreting a tool's output format, or it's too aggressive about calling a tool when it should ask a clarifying question first. The fix is a targeted instruction addition like "when querying multiple files that have dependent information, offload information to the filesystem and re-aggregate before giving a final answer."

Adding or updating a tool or tool description. The agent may fail to contextualize when to use a new tool. Edits include examples of how to use the tool, how to chain it with other tools, an updated tool description, and changes to the overall tool suite to disambiguate similar tools.

Results from the Better-Harness loop

We tested this approach with Claude Sonnet 4.6 and Z.ai’s GLM-5 on a subset of our evals. Note: We have other work underway generalizing Better-Harness across many models in deepagents using a bigger eval suite. The goal is to publish a series of model profiles that capture the nuances of each model tuned for our evals as a public artifact.

We assembled a small representative sample from existing eval categories and split that sample into a hill-climbing set and a holdout set to evaluate generalization. With large or expensive eval sets, we suggest representative/stratified sampling to give a good set to hill-climb against. Once this works well, it can be scaled up to the larger set.

Main experiment goal: discover & fix failure modes over our evals. Port general changes that increase eval performance back to the harness.

We previously observed failure modes such as over-asking follow-up questions and errors in chaining together new tools. After hill climbing on the optimization set, we evaluated the final harness on the holdout using two categories, tool_selection and followup_quality.

https://blog.langchain.com/how-we-build-evals-for-deep-agents/

The results were strong on both models. Both achieved nearly full generalization to the holdout set, which covered the same capabilities with totally unseen examples.

Many gains are from more explicit instructions around discovered failure modes. Here are a few concrete examples the optimization loop discovered that we found interesting.

For evals that inject new tools into the default harness like search-then-email , the loop discovered better descriptions of how to use and compose those tools. This is promising for builders creating vertical agents across domains, because optimization loops adapt well to the task specifics in context.

Evals maintenance & regressions

Along with hill climbing, evals also explicitly capture and protect against regressions over time. Once our agent handles a case correctly, we don’t want to lose that gain; the eval becomes a regression test. This is similar to ideas in traditional software engineering like Test Driven Development (TDD). Some regressions are bound to happen across many changes over time, so we select a subset of evals that we always want to pass and treat a run with suspicion if these suddenly fail.
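A regression gate over a hand-picked set of always-must-pass evals might look like the following sketch (the eval ids are illustrative):

```python
# Hand-picked evals that must always pass; ids are illustrative, not real cases.
CRITICAL = {"search-then-email", "multi-step-retrieval-03"}

def regression_gate(results: dict) -> list:
    """Given {eval_id: passed} for a run, return the critical evals that
    failed. A non-empty list means the whole run deserves suspicion."""
    return sorted(e for e in CRITICAL if not results.get(e, False))
```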

We don’t think our eval suite should grow monotonically; spring cleaning of evals is good! We regularly assess whether an eval is still useful, for example because models got more intelligent or because we want a different behavior from the agent.

The Future: automated error detection & fixes

This approach works because traces give us a dense feedback signal. Evals benefit from traces to compare across versions and numerically ground which changes contribute to a better score (which should be a good proxy for a better user experience).
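Numerically grounding which changes helped can be as simple as per-category score deltas between two harness versions; the result shapes here are assumptions for the sketch:

```python
def compare_versions(results_a: dict, results_b: dict) -> dict:
    """Per-category score deltas between two harness versions; a positive
    delta means version B improved that behavior. Inputs are {tag: score}."""
    return {tag: round(results_b[tag] - results_a[tag], 3)
            for tag in results_a.keys() & results_b.keys()}
```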

Overall, we point agentic compute at traces to:

  • Derive errors automatically. We want to constantly monitor our agent traces to classify and cluster failures in production.

  • Generate evals from production. A trace where the agent made a mistake is an eval case. A trace where a user corrected the agent is even better. The flywheel: more usage → more traces → more evals → better harness.

  • Compare harness versions. Side-by-side trace comparisons show what changed in the harness that contributed to new behavior.

Every trace contains valuable data to produce a potential eval. And every (good) eval makes the harness better. To facilitate this, all agent runs are logged to LangSmith with full traces. This gives us trace-level diagnosis for the optimization loop, production monitoring for regression detection, and trace mining for eval generation.
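The trace-to-eval mining step might be sketched like this; the trace keys are hypothetical, since real trace schemas vary:

```python
def trace_to_eval(trace: dict):
    """Turn a production trace into a candidate eval case, or None if there
    is nothing to mine. Assumes (hypothetically) that a trace records the
    user input, whether it failed, and an optional user correction."""
    if trace.get("user_correction"):      # corrected traces are the best evals
        expected = trace["user_correction"]
    elif trace.get("failed"):
        expected = None                    # failure with no correction: needs a human grader
    else:
        return None                        # successful trace: nothing to mine
    return {
        "id": trace["trace_id"],
        "prompt": trace["input"],
        "expected": expected,
        "tags": trace.get("failure_tags", ["untagged"]),
    }
```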

Our main takeaways and ongoing work:

Evals are training data for autonomous harness engineering. The same principles that make ML training work such as data quality, train/test splits, and generalization checks apply to agent development.

Fitting models to harnesses. There’s a large amount of work that goes into fitting every model to its harness. For example, the codex prompting guide suggests a certain format for their Edit tool. This requires a bigger search space and eval set; we’re excited to share real examples of what that looks like for any team looking to do this.

Overall, tracing and maintaining good evals is what makes this system work in practice. Invest in this early with your team and come build the future of autonomously improving agents. We open sourced a research version of this scaffold for builders to experiment with.

Thanks to @masondrxy and @hwchase17 for their feedback!

📋 讨论归档

讨论进行中…