🪞 Uota学 · 🧠 阿头学

Harness Engineering: From Top 30 to Top 5 by Tuning the System, Not the Model

The model is the raw material; the harness is the product. The LangChain team changed no model, only the "scaffolding," and gained 13.7 percentage points on a coding benchmark. The core methodology: context engineering + self-verification + trace-driven iteration.

2026-02-18 · Original article ↗

Key Takeaways

  • Model intelligence is "spiky"; the harness does the shaping. The model's raw capability fluctuates, and the harness engineer's job is to use three knobs (system prompt, tools, middleware) to channel that unstable intelligence into stable output on the target task. This is not prompt engineering; it is systems engineering.
  • The biggest failure mode: the agent "thinks it's done." The agent writes code, glances back at it, decides it looks fine, and stops. The fix is to force a verification stage into the loop (Planning → Build → Verify → Fix) and to intercept the agent with middleware before it exits to run a checklist. This insight is extremely practical: the most dangerous moment for an AI is not making a mistake, but making a mistake confidently.
  • Trace analysis is the new boosting. Pull traces from LangSmith, analyze failure modes with another set of AI agents in parallel, then synthesize the findings into improvement suggestions. This is essentially the boosting idea from machine learning: each round focuses only on the previous round's errors. It turns "tuning the AI" from folklore into an engineering loop.
  • More reasoning compute is not always better; the "sandwich" strategy wins. Running at xhigh reasoning throughout actually scored worse due to timeouts (53.9%), while an xhigh-high-xhigh "reasoning sandwich" was optimal (66.5%). Knowing when to save compute and when to burn it is a system-level judgment.
  • Harnesses are model-specific; there is no universal template. Claude Opus 4.6 scored 59.6% with an older harness version, not because the model is weak, but because no dedicated improvement loop was run for it. Harness and model are coupled: switch models, and you have to rerun the improvement loop.

Relevance to Us

ATou's title is literally Context Engineer, and this article reads like a textbook for that job. Neta is a 20-person team running a product with 100k+ DAU, which is exactly the game of using system leverage to multiply a small team's output. The methodology transfers directly: 1) our own AI workflows (including Uota) need the same self-verification mechanism: we cannot let the AI decide on its own that it is finished; 2) the trace-driven iteration loop can be applied to improving Neta's AI features, treating traces of user conversations as the feedback signal for refining agent behavior; 3) the "reasoning sandwich" idea is directly relevant to controlling our API cost and latency: not every step needs the strongest reasoning.

Discussion Prompts

1. Do our current AI workflows have the "agent thinks it's done" problem? If so, can we add a pre-exit verification step?

2. Are we systematically analyzing our AI's failure modes, or fixing each error case by case? If we were to build a Trace Analyzer-style feedback loop, which scenario should we start with?

3. The article argues for tailoring harnesses to models. When we use multiple models in Neta, should we also build model-specific prompts and flows instead of one prompt for everything?

Improving Deep Agents with Harness Engineering

TL;DR: Our coding agent went from Top 30 to Top 5 on Terminal Bench 2.0. We only changed the harness. Here's our approach to harness engineering (teaser: self-verification and tracing help a lot).

The Goal of Harness Engineering

The goal of a harness is to mold the inherently spiky intelligence of a model for the tasks we care about. Harness engineering is about systems: you build tooling around the model to optimize goals like task performance, token efficiency, and latency. Design decisions include the system prompt, tool choice, and execution flow.

But how should you change the harness to improve your agent?

At LangChain, we use traces to understand agent failure modes at scale. Models today are largely black boxes; their inner mechanisms are hard to interpret. But we can see their inputs and outputs in text space, which we then use in our improvement loops.

We used a simple recipe to iteratively improve deepagents-cli (our coding agent) by 13.7 points, from 52.8 to 66.5, on Terminal Bench 2.0. We only tweaked the harness and kept the model fixed: gpt-5.2-codex.

Experiment Setup & The Knobs on a Harness

We used Terminal Bench 2.0, now a standard benchmark for evaluating agentic coding. It has 89 tasks across domains like machine learning, debugging, and biology. We use Harbor to orchestrate the runs: it spins up sandboxes (Daytona), interacts with our agent loop, and runs verification and scoring.

Every agent action is stored in LangSmith, along with metrics like latency, token counts, and costs.

The Knobs We Can Turn

An agent harness has a lot of knobs: system prompts, tools, hooks/middleware, skills, sub-agent delegation, memory systems, and more. We deliberately compressed the optimization space and focused on three: the system prompt, tools, and middleware (our term for hooks around model and tool calls).

We started with a default prompt and standard tools and middleware. This scores 52.8% with GPT-5.2-Codex: a solid score, just outside the Top 30 of today's leaderboard, but with room to grow.

The Trace Analyzer Skill

We wanted trace analysis to be repeatable, so we made it an Agent Skill. This became our recipe for analyzing errors across runs and improving the harness. The flow is:

1. Fetch experiment traces from LangSmith.

2. Spawn parallel error-analysis agents; the main agent synthesizes findings and suggestions.

3. Aggregate the feedback and make targeted changes to the harness.

This works similarly to boosting, which focuses on the mistakes from previous runs. A human can be pretty helpful in step 3 (though not required) to verify and discuss proposed changes. Changes that overfit to a single task are bad for generalization and can cause regressions on other tasks.
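The three-step flow above can be sketched in plain Python. Everything here is illustrative: stub trace dicts and a keyword classifier stand in for LangSmith and the real error-analysis agents.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical failure taxonomy; the real skill derives categories from trace contents.
FAILURE_KEYWORDS = {
    "timeout": "timed out",
    "no_verification": "skipped tests",
    "wrong_path": "file not found",
}

def analyze_trace(trace: dict):
    """One 'error-analysis agent': classify a single failed trace, or None if clean."""
    for label, needle in FAILURE_KEYWORDS.items():
        if needle in trace.get("log", ""):
            return label
    return None

def synthesize(traces: list) -> list:
    """Main-agent step: fan out per-trace analysis in parallel, then aggregate."""
    with ThreadPoolExecutor() as pool:
        labels = list(pool.map(analyze_trace, traces))
    counts = Counter(label for label in labels if label)
    # Most common failure modes first: these drive the next harness change.
    return counts.most_common()

traces = [
    {"id": 1, "log": "agent timed out during build"},
    {"id": 2, "log": "agent skipped tests and exited"},
    {"id": 3, "log": "agent timed out during verify"},
]
print(synthesize(traces))  # [('timeout', 2), ('no_verification', 1)]
```

The aggregation step is what keeps changes from overfitting: a fix is only proposed for a failure mode that recurs across traces, not for a single bad run.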

Automated trace analysis saves hours of time and made it easy to try experiments quickly. We'll be publishing this skill soon; we're currently testing it for prompt optimization more generally.

What Actually Improved Agent Performance

Automated trace analysis let us debug where agents were going wrong. Issues included reasoning errors, not following task instructions, missing testing and verification, and running out of time. We go into these improvements in more detail in the sections below.

Build & Self-Verify

Today's models are exceptional self-improvement machines.

Self-verification allows agents to self-improve via feedback within a single run. However, they don't have a natural tendency to enter this build-verify loop.

The most common failure pattern was that the agent wrote a solution, re-read its own code, confirmed it looked OK, and stopped. Testing is a key part of autonomous agentic coding: it checks overall correctness and simultaneously gives the agent a signal to hill-climb against.

We added guidance to the system prompt on how to approach problem solving:

Planning & Discovery: Read the task, scan the codebase, and build an initial plan based on the task specification and on how the solution will be verified.

Build: Implement the plan with verification in mind. Build tests if they don't exist, and cover both happy paths and edge cases.

Verify: Run the tests, read the full output, and compare against what was asked (not against your own code).

Fix: Analyze any errors, revisit the original spec, and fix the issues.
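The Build → Verify → Fix stages amount to a small driver loop around the agent's test runner. A minimal sketch, with hypothetical `run_tests` and `apply_fix` callables standing in for the agent's real tools:

```python
def build_verify_fix(solution: str, run_tests, apply_fix, max_rounds: int = 5):
    """Drive the Build -> Verify -> Fix loop until tests pass or the budget runs out."""
    for _ in range(max_rounds):
        failures = run_tests(solution)    # Verify: run tests, read the full output
        if not failures:
            return solution, True         # passes against the spec, not just the code
        solution = apply_fix(solution, failures)  # Fix: revisit spec, patch issues
    return solution, False

# Toy stand-ins: the "spec" requires the solution to contain "x + 1".
run_tests = lambda s: [] if "x + 1" in s else ["expected increment"]
apply_fix = lambda s, failures: s.replace("x + 2", "x + 1")

final, ok = build_verify_fix("def f(x): return x + 2", run_tests, apply_fix)
print(ok)  # True
```

The cap on rounds matters: without it, an agent that cannot fix a failure loops forever, which is exactly the doom-loop problem addressed later.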

We really focus on testing because it powers the changes in every iteration. We found that, alongside prompting, deterministic context injection helps agents verify their work. We use a PreCompletionChecklistMiddleware that intercepts the agent before it exits and reminds it to run a verification pass against the task spec. This is similar to a Ralph Wiggum Loop, where a hook forces the agent to continue executing on exit; we use it here for verification.
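A pre-completion checklist can be as simple as a hook that blocks the first exit attempt and injects a reminder. This is a sketch of the idea, not the actual PreCompletionChecklistMiddleware API; `on_before_exit` and `inject` are hypothetical names:

```python
CHECKLIST = (
    "Before finishing: 1) run the tests, 2) read the full output, "
    "3) re-check the result against the task spec, not your own code."
)

class PreCompletionChecklist:
    """Intercept the agent's first exit attempt and push a verification checklist."""

    def __init__(self):
        self.reminded = False

    def on_before_exit(self, inject) -> bool:
        """Return True to allow exit; the first attempt is intercepted once."""
        if not self.reminded:
            self.reminded = True
            inject(CHECKLIST)   # push the checklist into the agent's context
            return False        # block this exit; the agent keeps running
        return True             # second exit attempt passes through

mw = PreCompletionChecklist()
messages = []
print(mw.on_before_exit(messages.append))  # False: exit blocked, checklist injected
print(mw.on_before_exit(messages.append))  # True: verification done, exit allowed
```

Blocking only once is a deliberate choice in this sketch: it forces one verification pass without risking an infinite intercept loop.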

Giving Agents Context About Their Environment

Part of harness engineering is building a good delivery mechanism for context engineering. Terminal Bench tasks come with directory structures, built-in tooling, and strict timeouts.

Directory Context & Tooling: A LocalContextMiddleware runs on agent start to map the cwd and other parent and child directories. We run bash commands to find tools like Python installations. Context discovery and search are error-prone, so injecting context reduces this error surface and helps onboard the agent into its environment.
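In the spirit of LocalContextMiddleware, a startup-context builder might walk the working directory and probe for common tools. The function below is an illustrative stand-in, not the deepagents implementation:

```python
import os
import shutil

def build_local_context(root: str = ".", max_entries: int = 50) -> str:
    """Map the directory tree and probe for tools; returned text is injected
    as the agent's first observation of its environment."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories to keep the map small and relevant.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            entries.append(os.path.relpath(os.path.join(dirpath, name), root))
            if len(entries) >= max_entries:
                break
        if len(entries) >= max_entries:
            break
    tools = {t: shutil.which(t) for t in ("python3", "git", "make")}
    found = ", ".join(t for t, path in tools.items() if path) or "none"
    return f"Files (first {len(entries)}): {entries}\nAvailable tools: {found}"

print(build_local_context("."))
```

Capping the file list is the key design point: the goal is enough context to skip error-prone discovery, not a full dump that crowds out the task itself.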

Teaching Agents to Write Testable Code: Agents don't know how their code needs to be testable. We add prompting saying that their work will be measured against programmatic tests, similar to when committing code. For example, file paths mentioned in the task spec should be followed exactly so the solution works in the automated scoring step. Prompting that stresses edge cases helps the agent avoid checking only "happy path" cases. Forcing models to conform to testing standards is a powerful strategy for avoiding "slop buildup" over time.

Time Budgeting: We inject time-budget warnings to nudge the agent to finish work and shift to verification. Agents are famously bad at time estimation, so this heuristic helps in this environment. Real-world coding usually doesn't have strict time limits, but without any knowledge of its constraints, an agent won't work within time bounds.
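Time-budget warnings can be driven by a small monitor that fires once per threshold. The thresholds and wording below are illustrative assumptions, not the actual deepagents behavior:

```python
import time

class TimeBudget:
    """Emit a one-shot warning when 50% and 80% of the budget is used."""

    def __init__(self, total: float, clock=time.monotonic):
        self.total = total
        self.clock = clock
        self.start = clock()
        self.warned = set()

    def warning(self):
        used = (self.clock() - self.start) / self.total
        # Check the most urgent threshold first.
        for threshold, advice in ((0.8, "stop building and switch to verification"),
                                  (0.5, "start wrapping up the build phase")):
            if used >= threshold and threshold not in self.warned:
                self.warned.add(threshold)
                return f"Time budget {int(threshold * 100)}% used: {advice}."
        return None

# Fake clock so the demo is deterministic.
now = [0.0]
budget = TimeBudget(100, clock=lambda: now[0])
now[0] = 60
print(budget.warning())  # the 50% warning fires first
now[0] = 85
print(budget.warning())  # then the 80% warning
```

A harness would call `warning()` after each agent step and inject any non-None result into the context, which is the "nudge" described above.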

The more agents know about their environment, constraints, and evaluation criteria, the better they can autonomously self-direct their work.

The purpose of the harness engineer: prepare and deliver context so agents can autonomously complete work.

Encouraging Agents to Step Back & Reconsider Plans

Agents can be myopic once they've decided on a plan, which results in "doom loops": small variations on the same broken approach (10+ times in some traces).

We use a LoopDetectionMiddleware that tracks per-file edit counts via tool-call hooks. After N edits to the same file, it adds context like "…consider rethinking your approach." This can help agents recover from doom loops, though the model can continue down the same path if it still thinks it's correct.
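A per-file edit counter is enough to detect the most common doom loop. A sketch, with a hypothetical `on_tool_call` hook and an arbitrary threshold:

```python
from collections import Counter

class LoopDetection:
    """Nudge the agent when the same file has been edited too many times."""

    def __init__(self, max_edits: int = 3):
        self.max_edits = max_edits
        self.edits = Counter()

    def on_tool_call(self, tool: str, path: str):
        """Called after every tool call; returns a nudge exactly once per file."""
        if tool != "edit_file":
            return None
        self.edits[path] += 1
        if self.edits[path] == self.max_edits:
            return (f"You have edited {path} {self.max_edits} times. "
                    "Step back and consider rethinking your approach.")
        return None

mw = LoopDetection(max_edits=3)
for _ in range(2):
    assert mw.on_tool_call("edit_file", "solver.py") is None
print(mw.on_tool_call("edit_file", "solver.py"))  # third edit triggers the nudge
```

Note the nudge fires exactly at the threshold rather than on every edit afterward, matching the caveat above: the agent is free to continue if it still believes its approach is right.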

Important note: this is a design heuristic that engineers around today's perceived model issues. As models improve, these guardrails will likely become unnecessary, but today they help agents execute correctly and autonomously.

Choosing How Much Compute to Spend on Reasoning

Reasoning models can run autonomously for hours, so we have to decide how much compute to spend on every subtask. You can use the maximum reasoning budget on every task, but most work benefits from optimizing reasoning-compute spend.

Terminal Bench's timeout limits create a tradeoff: more reasoning helps agents evaluate each step, but can burn over 2x more tokens and time. gpt-5.2-codex has four reasoning modes: low, medium, high, and xhigh.

We found that reasoning helps with planning, to fully understand the problem; some Terminal Bench tasks are very difficult. A good plan helps reach a working solution more quickly.

Late-stage verification also benefits from more reasoning, to catch mistakes and get a solution submitted. As a heuristic, we chose an xhigh-high-xhigh "reasoning sandwich" as a baseline.

Running only at xhigh scored poorly, 53.9%, due to agent timeouts, compared to 63.6% at high. There weren't large differences in trial runs across reasoning-budget splits, so we stuck with our approach, which pushed the score to 66.5%.
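Treated as a lookup from phase to effort level, the sandwich is tiny. Mapping it to a dict is our illustration of the heuristic, not the deepagents API:

```python
# Phase -> reasoning-effort mapping for the "reasoning sandwich" described above.
SANDWICH = {
    "planning": "xhigh",   # burn compute to understand the problem fully
    "build": "high",       # cheaper per-step reasoning during implementation
    "verify": "xhigh",     # burn compute again to catch mistakes before submitting
}

def reasoning_effort(phase: str) -> str:
    """Pick the effort level for the current phase; default to the middle setting."""
    return SANDWICH.get(phase, "high")

print([reasoning_effort(p) for p in ("planning", "build", "verify")])
# ['xhigh', 'high', 'xhigh']
```

The harness would pass this value as the per-request reasoning setting, so the expensive xhigh mode is paid for only at the ends of the run, where the article found it matters most.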

The natural approach for models is adaptive reasoning, seen with Claude and Gemini models, where the model itself decides how much compute to spend on reasoning.

In a multi-model harness, balancing reasoning budgets could play out as using a large model for planning and handing off to a smaller model for implementation.

Practical Takeaways for Building Agent Harnesses

The design space of agents is big. Here are some general principles from these experiments and from building deepagents overall.

Context engineering on behalf of agents. Context assembly is still difficult for agents today, especially in unseen environments. Onboarding models with context like directory structures, available tools, coding best practices, and problem-solving strategies reduces the error surface from poor search and avoidable planning mistakes.

Help agents self-verify their work. Models are biased toward their first plausible solution. Prompt them aggressively to verify their work by running tests and refining solutions. This is especially important in autonomous coding systems with no humans in the loop.

Tracing as a feedback signal. Traces allow agents to self-evaluate and debug themselves. It's important to debug tooling and reasoning together (e.g., a model goes down a wrong path because it lacks a tool, or lacks instructions for how to do something).

Detect and fix bad patterns in the short term. Today's models aren't perfect. The job of the harness designer is to design around today's shortcomings while planning for smarter models in the future. Blind retries and skipped verification are good examples. These guardrails will almost surely dissolve over time, but for building robust agent applications today, they're useful tools to experiment with.

Tailor harnesses to models. The Codex and Claude prompting guides show that different models require different prompting. A test run with Claude Opus 4.6 scored 59.6% with an earlier harness version: competitive, but worse than Codex, because we didn't run the same improvement loop for Claude. Many principles generalize, like good context preparation and a focus on verification, but running a few rounds of harness iteration on your task helps maximize agent performance.

There's more open research to do in harness design. Interesting avenues include multi-model systems (Codex, Gemini, and Claude together), memory primitives for continual learning so agents can autonomously improve on tasks, and measuring harness changes across models.

For the outer loop of improving agents, we're looking at methods like RLMs to mine traces more efficiently. We'll keep working to improve the harness and openly share our research.

We created a dataset of our traces to share with the community.

Deep Agents is open source, in Python and JavaScript.

To more hill climbing and open research.

Link: http://x.com/i/article/2022906014928904192


Related Notes

Improving Deep Agents with Harness Engineering

  • Source: https://x.com/vtrivedy10/status/2023805578561060992?s=46
  • Mirror: https://x.com/vtrivedy10/status/2023805578561060992?s=46
  • Published: 2026-02-17T17:03:45+00:00
  • Saved: 2026-02-18

Content

TLDR: Our coding agent went from Top 30 to Top 5 on Terminal Bench 2.0. We only changed the harness. Here’s our approach to harness engineering (teaser: self-verification & tracing help a lot).

The Goal of Harness Engineering

The goal of a harness is to mold the inherently spiky intelligence of a model for tasks we care about. Harness engineering is about systems: you're building tooling around the model to optimize goals like task performance, token efficiency, latency, etc. Design decisions include the system prompt, tool choice, and execution flow.

But how should you change the harness to improve your agent?

At LangChain, we use traces to understand agent failure modes at scale. Models today are largely black boxes; their inner mechanisms are hard to interpret. But we can see their inputs and outputs in text space, which we then use in our improvement loops.

We used a simple recipe to iteratively improve deepagents-cli (our coding agent) 13.7 points from 52.8 to 66.5 on Terminal Bench 2.0. We only tweaked the harness and kept the model fixed, gpt-5.2-codex.

Experiment Setup & The Knobs on a Harness

We used Terminal Bench 2.0, a now standard benchmark to evaluate agentic coding. It has 89 tasks across domains like machine learning, debugging, and biology. We use Harbor to orchestrate the runs. It spins up sandboxes (Daytona), interacts with our agent loop, and runs verification + scoring.

Every agent action is stored in LangSmith. It also includes metrics like latency, token counts, and costs.

The Knobs we can Turn

An agent harness has a lot of knobs: system prompts, tools, hooks/middleware, skills, sub-agent delegation, memory systems, and more. We deliberately compress the optimization space and focus on three: System Prompt, Tools, and Middleware (our term for hooks around model and tool calls).

We start with a default prompt and standard tools+middleware. This scores 52.8% with GPT-5.2-Codex. A solid score, just outside the Top 30 of the leaderboard today, but room to grow.

The Trace Analyzer Skill

We wanted trace analysis to be repeatable, so we made it into an Agent Skill. This became our recipe to analyze errors across runs and make improvements to the harness. The flow is:

Fetch experiment traces from LangSmith

Spawn parallel error analysis agents → main agent synthesizes findings + suggestions

Aggregate feedback and make targeted changes to the harness.

This works similarly to boosting which focuses on mistakes from previous runs. A human can be pretty helpful in Step 3 (though not required) to verify and discuss proposed changes. Changes that overfit to a task are bad for generalization and can lead to regressions in other Tasks.

Automated trace analysis saves hours of time and made it easy to try experiments quickly. We'll be publishing this skill soon; we're currently testing it for prompt optimization more generally.

What Actually Improved Agent Performance

Automated Trace analysis allowed us to debug where agents were going wrong. Issues included reasoning errors, not following task instructions, missing testing and verification, running out of time, etc. We go into these improvements in more details in the sections below.

Build & Self-Verify

Today’s models are exceptional self-improvement machines.

Self-verification allows agents to self-improve via feedback within a run. However, they don’t have a natural tendency to enter this build-verify loop.

The most common failure pattern was that the agent wrote a solution, re-read its own code, confirmed it looks ok, and stopped. Testing is a key part of autonomous agentic coding. It helps test for overall correctness and simultaneously gives agents signal to hill-climb against.

We added guidance to the system prompt on how to approach problem solving.

Planning & Discovery: Read the task, scan the codebase, and build an initial plan based on the task specification and how to verify the solution.

Build: Implement the plan with verification in mind. Build tests if they don't exist, and test both happy paths and edge cases.

Verify: Run tests, read the full output, compare against what was asked (not against your own code).

Fix: Analyze any errors, revisit the original spec, and fix issues.

We really focus on testing because it powers the changes in every iteration. We found that alongside prompting, deterministic context injection helps agents verify their work. We use a PreCompletionChecklistMiddleware that intercepts the agent before it exits and reminds it to run a verification pass against the task spec. This is similar to a Ralph Wiggum Loop, where a hook forces the agent to continue executing on exit; we use this for verification.

Giving Agents Context about their Environment

Part of harness engineering is building a good delivery mechanism for context engineering. Terminal Bench tasks come with directory structures, built-in tooling, and strict timeouts.

Directory Context & Tooling: A LocalContextMiddleware runs on agent start to map the cwd and other parent+children directories. We run bash commands to find tools like Python installations. Context discovery and search are error prone, so injecting context reduces this error surface and helps onboard the agent into its environment.

Teaching Agents to Write Testable Code: Agents don’t know how their code needs to be testable. We add prompting saying their work will be measured against programmatic tests, similar to when committing code. For example, Task specs that mention file paths should be followed exactly so the solution works in an automated scoring step. Prompting that stresses edge cases helps the agent avoid only checking “happy path” cases. Forcing models to conform to testing standards is a powerful strategy to avoid “slop buildup” over time.

Time Budgeting: We inject time budget warnings to nudge the agent to finish work and shift to verification. Agents are famously bad at time estimation so this heuristic helps in this environment. Real world coding usually doesn’t have strict time limits, but without adding any knowledge of constraints, agents won’t work within time bounds.

The more that agents know about their environment, constraints, and evaluation criteria, the better they can autonomously self-direct their work.

The purpose of the harness engineer: prepare and deliver context so agents can autonomously complete work.

Encouraging Agents to Step Back & Reconsider Plans

Agents can be myopic once they’ve decided on a plan, which results in “doom loops” that make small variations to the same broken approach (10+ times in some traces).

We use a LoopDetectionMiddleware that tracks per-file edit counts via tool call hooks. It adds context like “…consider reconsidering your approach” after N edits to the same file. This can help agents recover from doom loops, though the model can continue down the same path if it thinks it’s correct.

Important note: this is a design heuristic that engineers around today’s perceived model issues. As models improve, these guardrails will likely be unnecessary, but today they help agents execute correctly and autonomously.

Choosing How Much Compute to Spend on Reasoning

Reasoning models can run autonomously for hours so we have to decide how much compute to spend on every subtask. You can use the max reasoning budget on every task, but most work can benefit from optimizing reasoning compute spend.

Terminal Bench timeout limits create a tradeoff. More reasoning helps agents evaluate each step, but can burn over 2x more tokens/time. gpt-5.2-codex has 4 reasoning modes: low, medium, high, and xhigh.

We found that reasoning helps with planning to fully understand the problem, some Terminal Bench tasks are very difficult. A good plan helps get to a working solution more quickly.

Later stage verification also benefits from more reasoning to catch mistakes and get a solution submitted. As a heuristic, we choose a xhigh-high-xhigh "reasoning sandwich" as a baseline.

Running only at xhigh scored poorly at 53.9% due to agent timeouts compared to 63.6% at high. There weren’t large differences in trial runs across reasoning budget splits so we stuck with our approach which pushed the score to 66.5%.

The natural approach for models is Adaptive Reasoning, seen with Claude and Gemini models where the model decides how much compute to spend on reasoning.

In a multi-model harness, balancing reasoning budgets could play out as using a large model for planning and handing off to a smaller model for implementation.

Practical Takeaways for Building Agent Harnesses

The design space of agents is big. Here are some general principles from our experiments and building deepagents overall.

Context Engineering on Behalf of Agents. Context assembly is still difficult for agents today, especially in unseen environments. Onboarding models with context like directory structures, available tools, coding best practices, and problem solving strategies helps reduce the error surface for poor search and avoidable errors in planning.

Help agents self-verify their work. Models are biased towards their first plausible solution. Prompt them aggressively to verify their work by running tests and refining solutions. This is especially important in autonomous coding systems that don’t have humans in the loop.

Tracing as a feedback signal. Traces allow agents to self-evaluate and debug themselves. It’s important to debug tooling and reasoning together (ex: models go down wrong paths because they lack a tool or instructions how to do something).

Detect and fix bad patterns in the short term. Models today aren’t perfect. The job of the harness designer is to design around today’s shortcomings while planning for smarter models in the future. Blind retries and not verifying work are good examples. These guardrails will almost surely dissolve over time, but to build robust agent applications today, they’re useful tools to experiment with.

Tailor Harnesses to Models. The Codex and Claude prompting guides show that models require different prompting. A test run with Claude Opus 4.6 scored 59.6% with an earlier harness version, competitive but worse than Codex because we didn’t run the same Improvement Loop with Claude. Many principles generalize like good context preparation and a focus on verification, but running a few rounds of harness iterations for your task helps maximize agent performance across tasks.

There’s more open research to do in harness design. Interesting avenues include multi-model systems (Codex, Gemini, and Claude together), memory primitives for continual learning so agents can autonomously improve on tasks, and measuring harness changes across models.

For the outer loop of improving agents, we’re looking at methods like RLMs to more efficiently mine traces. We’ll be continuing work to improve the harness and openly share our research.

We created a dataset of our Traces to share with the community.

Deep Agents is open source. Python and Javascript.

To more hill climbing and open research.

Link: http://x.com/i/article/2022906014928904192

📋 Discussion Archive

Discussion in progress…