🧠 阿头学 · 💬 Discussion topic

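Deep Agents officially shifts to per-model harness customization, a sounder bet than continued faith in one-size-fits-all agent design

The article's most important claim: in agent settings, "model capability" is often masked by "harness fit." Tailoring prompts, tools, and middleware per model is not a nice-to-have; it is a key engineering layer that can directly reshuffle benchmark rankings.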
2026-04-30

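Key takeaways

  • Generic abstractions are eating performance. The author's most valuable claim is that "a single harness can't be optimal for every model." This largely holds in agent settings, because model families do differ in how sensitive they are to tool naming, prompt format, planning steps, and reflection mechanisms. Forcing one default configuration on all of them means giving up performance that is there for the taking.
  • The payoff of harness engineering is underrated. The strongest signal in the article is not "a new feature shipped" but that the same model, with only prompt, tool-mapping, and middleware adjustments, can reportedly gain 10 to 20 points, and in an earlier case went from 52.8% to 66.5%. The ceiling of an agent depends not just on the model but on the quality of the "model × harness" combination.
  • From an engineering standpoint, profiles are the right product shape. Putting prompts, tool aliases, middleware, and subagent configuration into declarative profiles, rather than scattering if/else branches through business code, is clearly more maintainable. It turns "engineer's intuition" into a versioned asset, and this is the article's most defensible point.
  • The evidence falls well short. The article's biggest problem is that it loudly claims a 10 to 20 point gain on a tau2-bench subset, yet the body gives no clear table, sample size, selection rule, variance, or per-model comparison. That is not rigorous experimental reporting; it is typical launch-post style, with the key number in the pitch and the details left to links.
  • Applicability is still skewed toward code and tool-heavy tasks. Nearly every example comes from code, terminal, multi-turn tool-calling, and instruction-following tasks. So the conclusion that "profiles matter" is credible for agentic coding, but the article does not show that it holds equally for open-domain knowledge tasks, pure writing tasks, or light-tool tasks.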

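How this relates to us

  • What it means for ATou, and what to do next. Building agents should no longer be about comparing model leaderboards alone but about comparing "model × harness" combinations. Next step: split the existing agent configuration into four layers (base prompt, tool schema, middleware, planner), build a minimal profile system first, then run the task set once each for OpenAI and Anthropic.
  • What it means for Neta, and what to do next. An abstraction layer does not get better the more unified it is; over-unification eats into results. Next step: treat profiles as an "organizational knowledge packaging layer" and require every tuning pass to produce reusable configuration, not just chat logs and personal experience.
  • What it means for Uota, and what to do next. This material shows that "different models are like people with different working styles" and that a uniform SOP will break down. Next step: think of profiles as "handbooks for different roles," and first discuss which rules are a universal baseline and which must be personalized per model.
  • What it means for product and investment judgment, and what to do next. This is not just a small feature; it signals that agent infrastructure is moving from a "unified entry point" toward "differentiated orchestration." Next step: when evaluating a project, ask whether it has a profile/configuration layer, an evaluation loop, and a mechanism for avoiding prompt-patch hell.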

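Discussion starters

1. If the gains from profiles come mainly from "better tool names and stricter operating constraints," is this really "model-specific optimization," or was the default harness simply too weak?
2. Will profiles go the way of prompt engineering a few years ago: effective in the short term, then absorbed by the models' native capabilities?
3. For most teams, is the complexity of maintaining multiple profiles really a better deal than locking in a single model vendor?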

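TL;DR: Deep Agents was previously designed generically to work well across model families. Today we're adding model-specific profiles that adjust prompts, tools, and middleware, which lets us conform more closely to each model family's prompting guide. We ship profiles for OpenAI, Anthropic, and Google models out of the box, and we see a 10 to 20 point jump over the default harness on a subset of tau2-bench.

Until today, deepagents shipped with a single set of prompts, tools, and middleware intended to work well across all large language models. Builders could swap in different models, or extend the harness with additional tools or extensions to the system prompt. But the base prompts, tools, and middleware were fixed and not optimized per model.

As of today, we're excited to launch harness profiles as a way to control these parameters on a per-model basis. This matters because:

  • Prompting guides differ per model. OpenAI's Codex Prompting Guide prescribes specific tool implementations and names (apply_patch, shell_command) that move the needle on Codex models. Anthropic's Claude prompting guidance emphasizes a different set of conventions. Even within a family, the Opus 4.6 → 4.7 migration guide flags prompt-level changes worth making.

  • Eval leaderboards show that the same model can perform very differently in different harnesses. Terminal-Bench 2.0 is the cleanest public example: the Claude Code harness ranks last among Opus 4.6 submissions. We saw similar effects from careful harness engineering in previous work, Improving Deep Agents with harness engineering, where we took gpt-5.2-codex from 52.8% to 66.5% on Terminal-Bench 2.0 (Top 30 → Top 5 at the time of publishing) just by applying harness-layer changes such as prompts and middleware hooks.

A single harness can't be optimal for every model, so we've made it easy to vary the harness per model.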
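For a sense of scale, the Terminal-Bench 2.0 figures quoted above can be restated as an absolute and a relative gain. The snippet below is plain arithmetic on the two published scores; the variable names are illustrative and nothing here comes from the deepagents codebase.

```python
# Scores quoted in the text for gpt-5.2-codex on Terminal-Bench 2.0.
baseline_pct = 52.8  # before the harness-layer changes
tuned_pct = 66.5     # after prompt and middleware changes

# Absolute gain in percentage points, and gain relative to the baseline.
absolute_gain = round(tuned_pct - baseline_pct, 1)
relative_gain = round(100 * (tuned_pct - baseline_pct) / baseline_pct, 1)

print(f"absolute gain: {absolute_gain} points")
print(f"relative gain: {relative_gain}% over baseline")
```

Terminal-Bench 2.0 and tau2-bench measure different things, so this gain is not directly comparable to the 10 to 20 point range reported for the profiles.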

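How much does this matter?

Results on measuring the effect of profiles

To judge how much this matters, we measured performance on a subset of tau2-bench (multi-turn tool use + instruction following). We use a curated subset of more difficult tasks that frontier models haven't yet saturated, so we can better measure the impact of harness-level changes on agents.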

https://docs.langchain.com/oss/python/deepagents/profiles#ship-a-profile-as-a-plugin

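What changed per model

We used the Codex and Claude prompting guides as the source for the changes applied in each profile.

For Codex, the main changes included:

  • Tool changes: overriding the default file_edit implementation in deepagents with the recommended apply_patch tool, and aliasing the deepagents execute tool name as shell_command

  • Prompt changes: largely around tool calling and planning, using details from the prompting guide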

Before any tool call, decide ALL files and resources you will need. Batch reads, searches, and other independent operations into parallel tool calls instead of issuing them one at a time.
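The Codex tool changes listed above come down to two operations on a tool table: swapping one implementation for another, and exposing an existing tool under a different name. The sketch below illustrates that idea in isolation. The `Tool` class and `apply_tool_overrides` helper are hypothetical and are not the deepagents API; only the tool names file_edit, apply_patch, execute, and shell_command come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    """Hypothetical stand-in for a harness tool: a name plus a callable."""
    name: str
    run: Callable[[str], str]


def apply_tool_overrides(
    tools: Dict[str, Tool],
    aliases: Dict[str, str],
    replacements: Dict[str, Tool],
) -> Dict[str, Tool]:
    """Return a new tool table; the input mapping is left untouched."""
    result: Dict[str, Tool] = {}
    for name, tool in tools.items():
        tool = replacements.get(name, tool)     # swap the implementation if overridden
        exposed = aliases.get(name, tool.name)  # name the model will see
        result[exposed] = Tool(name=exposed, run=tool.run)
    return result


# Placeholder implementations, for illustration only.
default_tools = {
    "file_edit": Tool("file_edit", lambda arg: f"edited: {arg}"),
    "execute": Tool("execute", lambda arg: f"ran: {arg}"),
}

codex_tools = apply_tool_overrides(
    default_tools,
    aliases={"execute": "shell_command"},
    replacements={"file_edit": Tool("apply_patch", lambda arg: f"patched: {arg}")},
)

print(sorted(codex_tools))
```

Keeping the override as data (two small dictionaries) rather than branching on the model name in code is what makes this kind of change easy to version and diff.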

For Opus, the main changes were all prompt changes focused on tool usage and planning. For example, the two snippets below were added to the prompt.

<tool_result_reflection> After receiving tool results, carefully reflect on their quality and determine optimal next steps before proceeding. Use your thinking to plan and iterate based on this new information, and then take the best next action. </tool_result_reflection>

<tool_usage> When a task depends on the state of files, tests, or system output, use tools to observe that state directly rather than reasoning from memory about what it probably contains. Read files before describing them. Run tests before claiming they pass. Search the codebase before asserting a symbol does or does not exist. Active investigation with tools is the default mode of working, not a fallback. </tool_usage>
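One simple way to manage tagged prompt sections like the two above is to store them as named blocks and append them to a base prompt. The following is a minimal sketch under that assumption; `tagged_section` and `with_suffix_sections` are hypothetical helpers, the base prompt is a placeholder, and the section bodies are shortened from the snippets quoted above.

```python
from typing import Dict


def tagged_section(tag: str, body: str) -> str:
    """Wrap a block of guidance in an XML-style tag."""
    return f"<{tag}>\n{body.strip()}\n</{tag}>"


def with_suffix_sections(base_prompt: str, sections: Dict[str, str]) -> str:
    """Append each tagged section after the base prompt, in insertion order."""
    blocks = [tagged_section(tag, body) for tag, body in sections.items()]
    return "\n\n".join([base_prompt.strip(), *blocks])


opus_sections = {
    "tool_result_reflection": (
        "After receiving tool results, carefully reflect on their quality "
        "and determine optimal next steps before proceeding."
    ),
    "tool_usage": (
        "Read files before describing them. Run tests before claiming they pass."
    ),
}

prompt = with_suffix_sections("You are a coding agent.", opus_sections)
print(prompt)
```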

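Our takeaway is that exposing an interface for customizing the harness per model is a helpful primitive: builders can manage profiles per agent, version them, and easily test differences between configurations.

Try it today

To use this today, simply start using deepagents: uv add deepagents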

https://github.com/sierra-research/tau2-bench

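The profiles are applied automatically for supported models. To see what each default profile looks like today, inspect the code in the repo. To learn how to register your own profile, keep reading.

How profiles work under the hood

A harness profile is a declarative override layer for the parts of the harness that vary per model: system prompt prefix/suffix, tool inclusion and naming, middleware selection, subagent configuration, and skills. You register a profile for a model or provider (or load an existing one from YAML), and create_deep_agent adapts when you swap the model. Importantly, your call site doesn't change.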
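To make the "call site doesn't change" point concrete, here is a minimal sketch of a profile registry with a model-then-provider lookup. It is an illustration of the general pattern, not the deepagents implementation: `HarnessProfile`, `register_profile`, `resolve_profile`, and `build_agent_config` are hypothetical names, the "provider:model" key format is an assumption, and the real field set is described in the Profiles docs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class HarnessProfile:
    """Hypothetical profile: only a few of the fields the text mentions."""
    prompt_prefix: str = ""
    prompt_suffix: str = ""
    tool_aliases: Dict[str, str] = field(default_factory=dict)


_REGISTRY: Dict[str, HarnessProfile] = {}


def register_profile(key: str, profile: HarnessProfile) -> None:
    """Register under a provider ('openai') or a full id ('openai:some-model')."""
    _REGISTRY[key] = profile


def resolve_profile(model_id: str) -> HarnessProfile:
    """Prefer an exact model match, then the provider, then an empty profile."""
    provider = model_id.split(":", 1)[0]
    if model_id in _REGISTRY:
        return _REGISTRY[model_id]
    return _REGISTRY.get(provider, HarnessProfile())


def build_agent_config(model_id: str, base_prompt: str, tool_names: List[str]) -> dict:
    """Same call for every model; the resolved profile decides what differs."""
    profile = resolve_profile(model_id)
    parts = (profile.prompt_prefix, base_prompt, profile.prompt_suffix)
    return {
        "model": model_id,
        "system_prompt": "\n\n".join(p for p in parts if p),
        "tools": [profile.tool_aliases.get(name, name) for name in tool_names],
    }


register_profile(
    "openai",
    HarnessProfile(
        prompt_suffix="Batch independent tool calls in parallel.",
        tool_aliases={"execute": "shell_command"},
    ),
)

codex = build_agent_config("openai:gpt-5.2-codex", "You are a coding agent.", ["file_edit", "execute"])
other = build_agent_config("example:some-model", "You are a coding agent.", ["file_edit", "execute"])
print(codex["tools"], other["tools"])
```

The caller passes the same arguments in both cases; only the model id changes, and an unregistered model falls back to the unmodified defaults.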

We ship defaults for OpenAI, Anthropic, and Google models. You can override them, layer your own on top, or distribute profiles as plugins.
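Layering a custom profile on top of a shipped default needs a merge rule for each field. The sketch below shows one plausible rule set (a non-empty string wins, mappings are merged key by key, exclusion sets are unioned). These rules, the `Profile` class, and the tool names "bash" and "web_search" are assumptions made for illustration; the library's actual merge semantics are documented in the Profiles docs.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Profile:
    """Hypothetical profile with three differently merged field types."""
    prompt_suffix: str = ""
    tool_aliases: Dict[str, str] = field(default_factory=dict)
    excluded_tools: FrozenSet[str] = frozenset()


def layer(base: Profile, override: Profile) -> Profile:
    """Combine two profiles; the override wins wherever both set a value."""
    return Profile(
        # Strings: keep the base value unless the override sets one.
        prompt_suffix=override.prompt_suffix or base.prompt_suffix,
        # Mappings: merge key by key, override entries replace base entries.
        tool_aliases={**base.tool_aliases, **override.tool_aliases},
        # Sets: exclusions accumulate across layers.
        excluded_tools=base.excluded_tools | override.excluded_tools,
    )


shipped = Profile(
    prompt_suffix="Plan before calling tools.",
    tool_aliases={"execute": "shell_command"},
)
mine = Profile(
    tool_aliases={"execute": "bash"},
    excluded_tools=frozenset({"web_search"}),
)

merged = layer(shipped, mine)
print(merged)
```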

https://www.langchain.com/blog/tuning-deep-agents-different-models

https://docs.langchain.com/oss/python/deepagents/profiles

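For more customization details, read the Profiles docs, which cover the full field surface, merge semantics, and plugin packaging. Register a profile at startup for the models you use, or rely on the built-in profiles we ship.

If you're building on Deep Agents and want to share a profile, open a PR or distribute it as a plugin via entry points. We'll keep extending the profile surface across models. The goal is that whichever model you choose, Deep Agents gives you the tools and defaults to create the best harness for your task. We'll be sharing more information and walkthroughs showing how builders can customize their agent harness for their tasks.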

Thanks to @masondrxy, @hwchase17, and @chester_curme for reviews, co-writing, and help pushing on this release! A version of this post is also available on the LangChain Blog.
