🧠 ATou Learning · 💬 Discussion Topic

Deep Agents Eval Methodology: Few but Precise Shapes Agents Better Than Many but Messy

The most valuable judgment in this piece: agent evals are not a neutral measuring stick but a behavior shaper. Genuinely effective evals must therefore be few and precise, stay close to real tasks, and constrain correctness and efficiency at the same time, rather than manufacturing an illusion of progress by piling on benchmarks.

2026-03-27

Core takeaways

  • Evals are a steering wheel, not a measuring stick. The author states plainly that every eval continuously pushes the agent toward some behavior. That judgment is right: once a team starts changing prompts, tool descriptions, or routing logic to satisfy an eval, the eval is training the system, not merely inspecting it.
  • A few precise, targeted evals beat a large, catch-all pile. The article argues against blindly stacking hundreds or thousands of tests, and the position holds up in engineering practice, since many benchmark score gains genuinely fail to translate into the production experience. The author offers no controlled data showing that "few and precise" always generalizes better, though, so this part remains an empirical claim.
  • Correctness is the entry bar; efficiency decides whether you can ship. The author splits model selection into "first check whether it can reliably get the task right, then look at latency, cost, step count, and tool-call efficiency". The framework is practical, because production agents usually fail not by answering wrong but by being too slow, too expensive, or too roundabout.
  • Grouping evals by behavior beats grouping them by source. The article advocates organizing evals by the capability they test, such as retrieval, tool_use, or file_operations, rather than by provenance, such as BFCL or Terminal Bench. This judgment is solid: source labels offer almost no guidance for tuning decisions.
  • An "ideal trajectory" exposes efficiency but can also dress preference up as a standard. Defining the ideal trajectory as the fewest steps, fewest tool calls, and maximal parallelism works well for simple tasks; in open-ended tasks the baseline is highly subjective, and it is easy to mistake "the execution style the team likes" for "the objective optimum".

Relevance to us

  • What it means for ATou, and how to use it next. If ATou is building agent products or workflow automation, it should stop leading with aggregate leaderboard scores and instead list 5-10 real, critical behaviors; the next step is a minimal loop of behavior checklist → targeted evals → trace review.
  • What it means for Neta, and how to use it next. For Neta's interest in methodological abstraction, the article supplies a "define the target behavior first, then define the metric" framework; the next step is porting it to non-agent settings such as growth funnels, content production, or support quality review.
  • What it means for Uota, and how to use it next. From Uota's user-experience angle, the article is a reminder that "correct but slow" is still a bad product; the next step is folding latency, detours, and redundant calls into experience evaluation rather than tracking task success rate alone.
  • What it means for investment/product judgment, and how to use it next. The article suggests the moat in agent infrastructure may lie less in the model itself than in the eval, trace, regression, and iteration loop; when evaluating teams, ask how they define critical behaviors, build evals, and handle regressions, rather than judging by the demo alone.

Discussion starters

1. When does a "few and precise" eval suite degrade into overfitting to known scenarios and lose coverage of unseen tasks?
2. Is the "ideal trajectory" measuring efficiency, or dictating how an agent should think and act?
3. If LLM-as-a-judge is itself unstable, how much credibility does this eval loop retain?

TLDR: The best agent evals directly measure an agent behavior we care about. Here’s how we source data, create metrics, and run well-scoped, targeted experiments over time to make agents more accurate and reliable.

Evals shape agent behavior

We’ve been curating evaluations to measure and improve Deep Agents. Deep Agents is an open source, model agnostic agent harness that powers products like Fleet and Open SWE. Evals define and shape agent behavior, which is why it’s so important to design them thoughtfully.

Every eval is a vector that shifts the behavior of your agentic system. For example, if an eval for efficient file reading fails, you’ll likely tweak the system prompt or the read_file tool description to nudge behavior until it passes. Every eval you keep applies pressure on the overall system over time.

It is crucial to be thoughtful when adding evals. It can be tempting to blindly add hundreds (or thousands) of tests. This leads to an illusion of “improving your agent” by scoring well on an eval suite that may not accurately reflect behaviors you care about in production.

More evals ≠ better agents. Instead, build targeted evals that reflect desired behaviors in production.

When building Deep Agents, we catalog the behaviors that matter in production, such as retrieving content across multiple files in the filesystem or accurately composing 5+ tool calls in sequence. Rather than using benchmark tasks in aggregate, we take the following approach to eval curation:

  1. Decide which behaviors we want our agent to follow. Then research and curate targeted evals that measure those behaviors in a verifiable way.

  2. For each eval, add a docstring that explains how it measures an agent capability. This ensures each eval is self-documenting. We also tag each eval with categories like tool_use to enable grouped runs.

  3. Review output traces to understand failure modes and update eval coverage.
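
The three curation steps above can be sketched as a tiny eval registry. Everything here (the `eval_case` decorator, `EVAL_REGISTRY`, the example evals) is an illustrative assumption, not the actual Deep Agents harness API:

```python
# Minimal sketch of self-documenting, taggable evals.
EVAL_REGISTRY = []

def eval_case(*tags):
    """Register an eval; its docstring doubles as its documentation."""
    def decorator(fn):
        EVAL_REGISTRY.append(
            {"name": fn.__name__, "tags": set(tags), "doc": fn.__doc__}
        )
        return fn
    return decorator

@eval_case("tool_use")
def parallel_lookup_eval(agent):
    """Measures whether the agent issues independent tool calls in parallel."""

@eval_case("file_operations", "retrieval")
def multi_file_retrieval_eval(agent):
    """Measures retrieval of content spread across several files."""

def select(tag):
    """Pick out one category of evals, mirroring grouped runs by tag."""
    return [e for e in EVAL_REGISTRY if tag in e["tags"]]
```

Because every eval carries its docstring and tags, a reviewer reading a failing run can see what capability it was meant to measure without hunting through external docs.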

Because we trace every eval run to a shared LangSmith project, anyone on the team can jump in to analyze issues, make fixes, and reassess the value of a given eval. This creates shared responsibility for adding and maintaining good evals. Running many models across many evals can also get expensive, so targeted evals save money while improving your agent.

In this blog we cover:

  • How we curate data

  • How we define metrics

  • How we run the evals

How we curate data

There are a few ways we source evals:

  1. Using feedback from dogfooding our agents

  2. Pulling selected evals from external benchmarks (like Terminal Bench 2.0 or BFCL) and often adapting them for a particular agent

  3. Writing our own (artisanal) evals and unit tests by hand for behaviors we think are important


We dogfood our agents every day. Every error becomes an opportunity to write an eval and update our agent definition & context engineering practices.

Note: We separate SDK unit and integration tests (system prompt passthrough, interrupt config, subagent routing) from model capability evals. Any model passes those tests, so including them in scoring adds no signal. You should absolutely write unit and integration tests, but this blog focuses solely on model capability evals.

Dogfooding agents & reading traces are great sources of evals

Dogfooding makes it possible to find mistakes in the first place. Traces give us data to understand agent behavior. Because traces are often large, we use a built-in agent like Polly or Insights to analyze them at scale. You can do the same with other agents (like Claude Code or the Deep Agents CLI) plus a way to pull down traces, like the LangSmith CLI. Our goal is to understand each failure mode, propose a fix, rerun the agent, and track progress and regressions over time.

For example, a large fraction of bug-fix PRs are now driven through Open SWE, our open source background coding agent. Teams using it touch many different codebases with different context, conventions, and goals. This naturally leads to mistakes. Every interaction of Open SWE is traced, so those can easily become evals to make sure the mistake doesn’t happen again.

Other evals are pulled and adjusted from existing benchmarks like BFCL for function calling. For coding tasks, we integrate with Harbor to run selected tasks from datasets like Terminal Bench 2.0 tasks in sandboxed environments. Many evals are written from scratch and act as focused tests to observe isolated behavior, like testing a read_file tool.

We group evals by what they test

It’s helpful to have a taxonomy of evals to get a middle view of how agents perform (not a single number, not individual runs).

Tip: Create that taxonomy by looking at what they test, not where they come from. For example, tasks from FRAMES and BFCL could be tagged "external benchmarks," but that would not show how they measure retrieval and tool use, respectively.

Here are some categories we define and what they test:


Today, all evals are end-to-end runs of an agent on a task. We intentionally encourage diversity in eval structure. Some tasks finish in a single step from an input prompt, while others take 10+ turns with another model simulating a user.

How we define metrics

When choosing a model for our agent, we start with correctness. If a model can’t reliably complete the tasks we care about, nothing else matters. We run multiple models on our evals and refine the harness over time to address the issues we uncover.

Measuring correctness depends on what's being tested. Most internal evals use custom assertions such as "did the agent parallelize tool calls?". External benchmarks like BFCL use exact matching against ground truth answers from the dataset. For evals where correctness is semantic, like whether the agent persisted the correct thing in memory, we use LLM-as-a-judge.
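
As an illustration, the parallelization assertion could run over a simplified trace shape (a list of assistant turns, each recording its tool calls). This format is an assumption for the sketch; real LangSmith traces carry far more structure:

```python
# Custom correctness assertion over a simplified, assumed trace format.
def parallelized_tool_calls(trace):
    """True if any single turn issued two or more tool calls at once."""
    return any(len(turn["tool_calls"]) >= 2 for turn in trace)

# One turn with two calls vs. two turns with one call each.
parallel_run = [{"tool_calls": ["get_time", "get_weather"]}]
serial_run = [{"tool_calls": ["get_time"]}, {"tool_calls": ["get_weather"]}]

assert parallelized_tool_calls(parallel_run)
assert not parallelized_tool_calls(serial_run)
```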

Once several models clear that bar, we move to efficiency. Two models that solve the same task can behave very differently in practice. One might take extra turns, make unnecessary tool calls, or move through the task more slowly because of model size. In production, those differences show up as higher latency, higher cost, and a worse overall user experience.

All together, the metrics we measure for each evaluator run are:


Solve rate measures how quickly an agent solves a task, normalized by the expected number of steps. Like latency ratio, it captures end-to-end time to solve the task, including model round trips, provider latency, wrong turns, and tool execution time. For simple tasks where we can define an ideal trajectory, solve rate can be easier to work with than latency ratio because it only requires measuring the given agent's task duration.
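
The post does not give exact formulas, so the following is one plausible reading of the two metrics, plugged with the step and timing numbers from the trajectory example later in the post:

```python
# Hedged sketch: assumed definitions, not the official metric formulas.
# Latency ratio compares wall-clock time to an ideal-trajectory baseline;
# solve rate normalizes by the expected number of steps.
def latency_ratio(actual_seconds, ideal_seconds):
    return actual_seconds / ideal_seconds

def solve_rate(expected_steps, steps_taken):
    return expected_steps / steps_taken

# Example numbers: ideal trajectory 4 steps / ~8s,
# inefficient run 6 steps / ~14s.
assert latency_ratio(14.0, 8.0) == 1.75  # 75% slower than the ideal run
assert solve_rate(4, 6) < 1.0            # took more steps than expected
```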

This gives us a simple way to choose models with a targeted eval set:

  1. Check correctness first: which models are accurate enough on the tasks you actually care about?

  2. Then, compare efficiency: among the models that are good enough, which one gives the best tradeoff between correctness, latency, and cost?

Example of useful metrics around evals

To make model comparisons actionable, we examine how models succeed and fail. That requires a concrete reference point for what “good” execution looks like beyond accuracy. One primitive we use is an ideal trajectory. This is a sequence of steps that produces a correct outcome with no “unnecessary” actions.

For simple, well-scoped tasks, the variables are defined tightly enough that the optimal path is usually obvious. For more open-ended tasks, we approximate a trajectory using the best-performing model we’ve seen so far, then revisit the baseline as models and harnesses improve. In this way, observing agent behavior helps us refine our priors about ideal trajectories.

Consider a simple request:

"What is the current time and weather where I live?"

An agent’s ideal trajectory might look like this:

  • It makes the fewest necessary tool calls (e.g., resolve user → resolve location → fetch time and weather)

  • It parallelizes independent tool calls where possible

  • It produces the final answer without unnecessary intermediate turns

Ideal trajectory: 4 steps, 4 tool calls, ~8 seconds


Now compare that with a trajectory that is still technically correct, but less efficient.

Inefficient trajectory: 6 steps, 5 tool calls, ~14 seconds.


Correct but inefficient trajectory: 6 agent steps, 5 tool calls, includes an unnecessary tool call, and doesn’t parallelize tool calls.

The above examples are illustrative: a REPL could solve this task even faster, but the tool-calling version makes the idea easier to explain.

Both runs are correct, but the second run increases latency and cost, and creates more opportunities for failure.
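
Reduced to data, the gap between the two runs could be tracked along the dimensions the article scores. The field names are assumptions for illustration:

```python
# The two example trajectories, reduced to the dimensions being compared.
ideal = {"steps": 4, "tool_calls": 4, "seconds": 8, "parallelized": True}
actual = {"steps": 6, "tool_calls": 5, "seconds": 14, "parallelized": False}

def efficiency_gaps(run, baseline):
    """Per-dimension overshoot of a run relative to the ideal trajectory."""
    return {k: run[k] - baseline[k] for k in ("steps", "tool_calls", "seconds")}

assert efficiency_gaps(actual, ideal) == {
    "steps": 2, "tool_calls": 1, "seconds": 6
}
```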

This framing lets us evaluate both correctness and efficiency across evals. We maintain and update metrics that distill each run into measurable numbers we can use to compare experiments. From the example above, the inefficient but correct run would score:

How we run evals

We use pytest with GitHub Actions to run evals in CI so changes run in a clean, reproducible environment. Each eval creates a Deep Agent instance with a given model, feeds it a task, and computes correctness and efficiency metrics.

We can also run a subset of evals using tags to save costs and measure targeted experiments. For example, if building an agent that requires a lot of local file processing and synthesis, we may focus on the file_operations and tool_use tagged subsets.
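
In pytest terms, tag-scoped runs might look like the sketch below. The marker names are assumptions, and custom markers would need registering (e.g. in pytest.ini) to avoid warnings:

```python
# Sketch: pytest markers as eval category tags (illustrative names only;
# the real Deep Agents suite may organize its tags differently).
import pytest

@pytest.mark.file_operations
def test_multi_file_retrieval():
    """Capability eval: retrieve content spread across several files."""

@pytest.mark.tool_use
def test_sequential_tool_composition():
    """Capability eval: compose 5+ tool calls in the correct order."""

# A focused experiment then runs only the relevant subset:
#   pytest -m "file_operations or tool_use"
```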


Our eval architecture and implementation are open sourced in the Deep Agents repository.

What’s next

We’re expanding our eval suite and doing more work around open source LLMs! Some things we’re excited to share soon:

  • How open models measure up against closed frontier models across eval categories

  • Evals as a mechanism to auto-improve agents for tasks in real time

  • How we maintain, reduce, and expand evals per agent over time, shared openly

Thanks to the great team who helped review and co-write this blog: @masondrxy @veryboldbagel @hwchase17. Also published on the LangChain blog.

Deep Agents is fully open source. Try it and let us know what you think! We’re excited to help teams build great agents & evals.

📋 Discussion archive

Discussion in progress…