🧠 阿头学 · 💬 讨论题

Agent Harness 才是智能体产品的主战场

这篇文章的判断基本成立:生产级 agent 的成败主要不取决于模型本身,而取决于围绕模型构建的 harness,但作者对这一点的外推明显偏大,带有强烈行业布道色彩。

2026-04-19

核心观点

  • Harness 不是配件而是系统本体:文章最强的判断是,所谓 agent 并不是“会调用工具的模型”,而是模型加上一整套运行时基础设施;把编排循环、工具、记忆、上下文管理、状态持久化、错误恢复、安全和验证合在一起看,才解释了为什么同一个模型能做出完全不同的产品表现。
  • 生产问题的主矛盾在模型外部:作者对 demo 与 production 的区分是准确的:真实失败往往不是“模型不会想”,而是“看错了、忘了、调用坏了、上下文烂了、结果没验证还继续编”;这比单纯优化 prompt 更接近一线工程现实。
  • 上下文管理和验证循环最值得重视:文中最站得住的部分,不是 12 个模块的清单,而是把上下文腐烂、错误累积、验证闭环抬到核心位置;这个判断很硬,因为长任务、多工具、跨轮次执行里,缺少压缩、检索、检查点和测试反馈,系统必然不稳定。
  • 多智能体不是默认更高级,往往是更脆弱:作者明确主张先把单智能体做到极致,这个判断是对的;多智能体会引入额外路由成本、上下文丢失和调试复杂度,除非任务域明显分裂或工具负载真的超标,否则它更像过度工程而不是能力升级。
  • “Harness 就是产品”有启发,但说得过满:在代码代理、终端代理、长时任务里,harness 的确高度决定结果;但把这套结论直接推广到客服、内容生成、轻工作流场景就不严谨了,这些场景对复杂 harness 的收益未必高,甚至可能被成本和延迟反噬。

跟我们的关联

  • 对 ATou 意味着什么、下一步怎么用:意味着不要再把 agent 评估停留在“首轮回答聪不聪明”,而要改成“整条任务链能否稳定完成”;下一步应把现有 agent 拆成 prompt / context / harness 三层排查,优先找上下文装配、错误恢复和验证闭环的短板。
  • 对 Neta 意味着什么、下一步怎么用:意味着做 agent 相关研究或产品分析时,不能只比较模型,要把工具范围、状态管理、记忆策略和权限机制列为一等变量;下一步可以建立一套 harness 评测表,把“成功率、恢复率、压缩策略、验证成本”纳入统一比较框架。
  • 对 Uota 意味着什么、下一步怎么用:意味着如果要理解 agent 产品差异,重点不该放在品牌宣传的“智能体人格”,而该放在背后的运行时设计;下一步可以把不同框架的核心 trade-off 做成对照卡片,例如薄 harness vs 厚 harness、单 agent vs 多 agent、ReAct vs plan-and-execute。
  • 对 ATou/Neta/Uota 共同意味着什么、下一步怎么用:意味着讨论 agent 时要先问“失败发生在哪一层”,而不是先怪模型;下一步最实用的做法是把“七个架构决策”改写成内部评审 checklist,用来筛掉那些看起来高级、其实不可维护的方案。

讨论引子

1. 如果未来模型原生整合长期状态、工具使用和验证能力,今天定义的 harness 会被削薄到什么程度,哪些模块会最先商品化?
2. 在你的场景里,复杂 harness 带来的收益,是否真的超过它引入的延迟、成本和调试负担?
3. 多智能体到底是必要拆分,还是团队在用架构复杂度掩盖单智能体能力不足?

深入拆解 Anthropic、OpenAI、Perplexity 和 LangChain 到底在构建什么。本文涵盖编排循环、工具、记忆、上下文管理,以及所有把无状态 LLM 转化为强大智能体的关键部分。

你已经做出了一个聊天机器人。也许还接了一个 ReAct 循环,配上几个工具。拿来演示没问题。然后你试着做一个生产级产品,问题就开始暴露:模型忘了三步之前做过什么,工具调用失败却悄无声息,上下文窗口被垃圾信息塞满。

问题不在你的模型。问题在模型周围的一切。

LangChain 证明了这一点。他们只改了包在 LLM 外面的基础设施,模型相同,权重相同,结果在 TerminalBench 2.0 上从 30 名开外跃升到第 5。另一个独立研究项目让 LLM 自己优化基础设施,最终达到 76.4% 的通过率,超过了人工设计的系统。

这套基础设施现在有了名字:agent harness。

什么是 Agent Harness?

这个术语是在 2026 年初被正式化的,但概念早就存在。harness 是包裹 LLM 的完整软件基础设施:编排循环、工具、记忆、上下文管理、状态持久化、错误处理和护栏。Anthropic 的 Claude Code 文档说得很简单:SDK 是“驱动 Claude Code 的 agent harness”。OpenAI 的 Codex 团队也采用同样的说法,明确把“agent”和“harness”这两个词等同起来,用来指让 LLM 真正有用的非模型基础设施。

我很喜欢 LangChain 的 Vivek Trivedy 给出的经典公式:“如果你不是模型,你就是 harness。”

这里有个常让人混淆的区别。“智能体”是涌现出来的行为:用户交互到的那个有目标、会用工具、能自我修正的实体。harness 是产生这种行为的机器。当有人说“我做了一个智能体”,实际意思是他们做了一个 harness,然后把它接到了一个模型上。


Beren Millidge 在 2023 年的文章《Scaffolded LLMs as Natural Language Computers》中,把这个类比讲得很准确。原始 LLM 就像一颗没有 RAM、没有磁盘、没有 I/O 的 CPU。上下文窗口充当 RAM,速度快但容量有限。外部数据库相当于磁盘存储,容量大但速度慢。工具集成像设备驱动。harness 就是操作系统。正如 Millidge 所写:“我们重新发明了冯·诺依曼架构”,因为它是任何计算系统都自然会走向的一种抽象。

三层工程

模型周围有三层同心工程:

  • 提示词工程:设计模型接收到的指令。

  • 上下文工程:管理模型在什么时候看到什么。

  • Harness 工程:包含前两者,同时还包括完整的应用基础设施:工具编排、状态持久化、错误恢复、验证循环、安全执行和生命周期管理。

harness 不是套在提示词外面的包装。它是让自主智能体行为成为可能的完整系统。

生产级 Harness 的 12 个组成部分

综合 Anthropic、OpenAI、LangChain 以及更广泛的一线实践者经验,一个生产级 agent harness 有十二个不同组成部分。下面逐一展开。

1. 编排循环

这是心跳。它实现 Thought-Action-Observation,也就是 TAO 循环,也叫 ReAct 循环。这个循环会依次执行:组装提示词、调用 LLM、解析输出、执行工具调用、把结果喂回模型,不断重复直到完成。

从机制上看,它通常只是一个 while 循环。复杂性存在于循环所管理的一切之中,而不是循环本身。Anthropic 把他们的运行时描述为一个“笨循环”,所有智能都在模型里。harness 只负责管理轮次。
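这个“笨循环”的骨架确实可以用几行 Python 勾勒。下面是一个极简示意:call_llm、工具表 tools 和消息字段命名都是为演示而假设的,并非任何框架的真实接口。

```python
from typing import Callable

def run_agent_loop(call_llm: Callable[[list], dict], tools: dict, max_turns: int = 10) -> str:
    """组装消息 -> 调模型 -> 解析 -> 执行工具 -> 回填结果,直到没有工具调用。"""
    messages: list = [{"role": "user", "content": "task"}]
    for _ in range(max_turns):
        reply = call_llm(messages)            # 调用 LLM(此处用桩函数代替)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:                         # 没有工具调用 => 视为最终答案
            return reply["content"]
        for call in calls:                    # 依次执行工具,并把结果作为 observation 回填
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    return "max turns exceeded"
```

复杂性不在这十几行里,而在 call_llm 收到的消息如何组装、tools 如何被管控、结果如何被验证,也就是后面各个组件要做的事。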

2. 工具

工具是智能体的手。它们以 schema 的形式定义,包括名称、描述、参数类型,并被注入到 LLM 的上下文中,让模型知道自己能用什么。工具层负责注册、schema 校验、参数提取、沙箱执行、结果捕获,以及把结果格式化成 LLM 能读懂的 observation。

Claude Code 提供六类工具:文件操作、搜索、执行、网页访问、代码智能和子智能体生成。OpenAI 的 Agents SDK 支持函数工具,通过 @function_tool;托管工具,如 WebSearch、CodeInterpreter、FileSearch;以及 MCP 服务器工具。
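工具层“schema 定义、参数校验、结果格式化”的流程,可以用一个与具体 SDK 无关的最小示意表达(make_tool、invoke_tool 及字段命名均为假设):

```python
def make_tool(name: str, description: str, parameters: dict, fn):
    """工具即 schema:名称、描述、参数类型,外加真正执行的函数。"""
    return {"name": name, "description": description, "parameters": parameters, "fn": fn}

def invoke_tool(tool: dict, args: dict) -> str:
    # 先校验参数名与类型,再执行,并把结果格式化成模型可读的 observation
    for key, typ in tool["parameters"].items():
        if key not in args:
            return f"error: missing argument {key}"
        if not isinstance(args[key], typ):
            return f"error: {key} expects {typ.__name__}"
    return str(tool["fn"](**args))

# 假设的示例工具:读取文件前 N 行(实现用占位字符串代替)
read_file = make_tool(
    "read_file", "读取文件前 N 行", {"path": str, "n": int},
    lambda path, n: f"<{path} 的前 {n} 行>",
)
```

注意校验失败时返回的也是字符串:错误同样作为 observation 喂回模型,让它有机会修正参数,这正是后文错误处理一节的做法。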

3. 记忆

记忆运行在多个时间尺度上。短期记忆是单次会话内的对话历史。长期记忆跨会话持久存在:Anthropic 使用 CLAUDE.md 项目文件和自动生成的 MEMORY.md 文件;LangGraph 使用按命名空间组织的 JSON Stores;OpenAI 支持由 SQLite 或 Redis 支撑的 Sessions。

Claude Code 实现了三层结构:轻量索引,每条约 150 个字符,始终加载;按需拉取的详细主题文件;以及只能通过搜索访问的原始转录记录。一个关键设计原则是:智能体把自己的记忆当作“提示”,在行动前会对照真实状态进行验证。
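“轻量索引始终加载、详情按需拉取”的分层结构可以这样示意(类名与除 150 字符上限之外的细节均为假设,并非 Claude Code 的真实实现):

```python
class MemoryIndex:
    def __init__(self, limit: int = 150):
        self.limit = limit
        self.entries: dict[str, str] = {}   # topic -> 一行摘要(每轮都注入上下文)
        self.details: dict[str, str] = {}   # topic -> 详细内容(按需拉取)

    def write(self, topic: str, summary: str, detail: str) -> None:
        self.entries[topic] = summary[: self.limit]   # 索引条目裁剪到约 150 字符
        self.details[topic] = detail

    def index(self) -> str:
        """始终加载的部分:很小,但足以让模型知道去哪里找。"""
        return "\n".join(f"- {t}: {s}" for t, s in self.entries.items())

    def recall(self, topic: str) -> str:
        """模型主动请求某个主题时才加载详情。"""
        return self.details.get(topic, "")
```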

4. 上下文管理

很多智能体都是在这里无声失败的。核心问题是上下文腐烂:当关键信息落在窗口中部位置时,模型性能会下降 30% 以上,这是 Chroma 的研究结果,也得到斯坦福“Lost in the Middle”发现的印证。即使是百万 token 窗口,随着上下文增长,指令遵循能力也会退化。

生产环境中的策略包括:

  • 压缩:接近限制时总结对话历史。Claude Code 会保留架构决策和未解决 bug,同时丢弃冗余工具输出。

  • Observation 遮蔽:JetBrains 的 Junie 会隐藏旧工具输出,但保留工具调用可见。

  • 即时检索:维护轻量标识符,动态加载数据。Claude Code 使用 grep、glob、head、tail,而不是加载完整文件。

  • 子智能体委派:每个子智能体可以大范围探索,但只返回 1,000 到 2,000 token 的浓缩摘要。

Anthropic 的上下文工程指南把目标说得很明确:找到尽可能小、信号密度尽可能高的 token 集合,最大化得到理想结果的概率。
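压缩策略的触发逻辑可以如下勾勒:用字符数近似 token 预算,超限时把较早的历史折叠成一条摘要,只保留最近几条消息。摘要函数这里用占位实现;实际系统会让模型生成摘要,并刻意保留架构决策和未解决 bug 这类高信号内容。

```python
def compact(messages: list[dict], budget: int, keep_recent: int = 4) -> list[dict]:
    """接近上下文上限时触发压缩;budget 以字符数近似 token 数(演示用假设)。"""
    size = sum(len(m["content"]) for m in messages)
    if size <= budget:
        return messages                          # 未超限,原样返回
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # 占位摘要:真实实现应由模型总结,并优先保留决策与未解决问题
    summary = "摘要:" + ";".join(m["content"][:20] for m in old)
    return [{"role": "system", "content": summary}] + recent
```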

5. 提示词构建

这一步组装模型在每一步真正看到的内容。它是分层的:系统提示词、工具定义、记忆文件、对话历史,以及当前用户消息。

OpenAI 的 Codex 使用严格的优先级栈:服务端控制的系统消息,优先级最高;工具定义;开发者指令;用户指令,包括级联的 AGENTS.md 文件,限制 32 KiB;然后是对话历史。

6. 输出解析

现代 harness 依赖原生工具调用,模型返回结构化的 tool_calls 对象,而不是必须解析的自由文本。harness 检查:有没有工具调用?有就执行并继续循环。没有工具调用?那就是最终答案。

对于结构化输出,OpenAI 和 LangChain 都支持通过 Pydantic 模型进行 schema 约束响应。像 RetryWithErrorOutputParser 这样的旧方法仍然可用于边缘情况,它会把原始提示词、失败的 completion 和解析错误一起反馈给模型。
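“把原始提示词、失败输出和解析错误一起反馈给模型再试一次”的思路,可以用标准库 json 做一个最小示意(只是对这类重试解析器思路的草图,并非 RetryWithErrorOutputParser 的真实实现):

```python
import json

def parse_with_retry(call_llm, prompt: str) -> dict:
    """要求模型输出 JSON;首次解析失败时,把错误喂回去再试一次。"""
    raw = call_llm(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        # 原始提示词 + 失败的输出 + 解析错误,一起反馈给模型
        retry_prompt = f"{prompt}\n上次输出无法解析: {raw}\n错误: {e}\n请只输出合法 JSON。"
        return json.loads(call_llm(retry_prompt))
```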

7. 状态管理

LangGraph 把状态建模为流经图节点的类型化字典,并用 reducers 合并更新。检查点发生在 super-step 边界,因此可以在中断后恢复,也可以进行时间旅行式调试。OpenAI 提供四种互斥策略:应用内存、SDK sessions、服务端 Conversations API,或轻量的 previous_response_id 链接。Claude Code 采用不同路线:把 git commit 当作检查点,把进度文件当作结构化草稿本。
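“在步骤边界打检查点、支持时间旅行式调试”的机制,可以用一个与任何框架无关的草图表达(类名与接口均为假设):

```python
import copy

class CheckpointedState:
    def __init__(self, state: dict):
        self.state = state
        self.history: list[dict] = [copy.deepcopy(state)]   # 初始检查点

    def update(self, **changes) -> None:
        self.state.update(changes)                      # 合并更新(最简形式的 reducer)
        self.history.append(copy.deepcopy(self.state))  # 每个步骤边界打一个点

    def rewind(self, step: int) -> dict:
        """“时间旅行”:回到第 step 个检查点,从那里继续执行。"""
        self.state = copy.deepcopy(self.history[step])
        return self.state
```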

8. 错误处理

这件事之所以重要,是因为:一个 10 步流程,即使每一步成功率都是 99%,端到端成功率也只有约 90.4%。错误会迅速叠加。

LangGraph 区分四类错误:瞬时错误,带退避重试;LLM 可恢复错误,把错误作为 ToolMessage 返回,让模型调整;用户可修复错误,中断并等待人工输入;意外错误,向上抛出用于调试。Anthropic 会在工具处理器内部捕获失败,并把它们作为错误结果返回,以保持循环继续运行。Stripe 的生产 harness 把重试次数限制在两次。
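成功率叠加只是简单的幂运算,可以顺手验证;下面同时附一个“瞬时错误带退避重试、重试上限两次”的示意(异常类型与退避方式为演示用假设):

```python
import time

def chain_success(per_step: float, steps: int) -> float:
    """端到端成功率 = 单步成功率的 steps 次幂。"""
    return per_step ** steps

def with_retries(fn, max_retries: int = 2):
    """瞬时错误最多重试 max_retries 次;耗尽后把异常向上抛出用于调试。"""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_retries:
                raise
            time.sleep(0)   # 示意用;真实实现通常按指数退避等待
```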

9. 护栏与安全

OpenAI 的 SDK 实现了三个层级:输入护栏,在第一个智能体上运行;输出护栏,在最终输出上运行;工具护栏,在每次工具调用时运行。“tripwire”机制会在触发时立刻停止智能体。

Anthropic 在架构上把权限执行和模型推理分离。模型决定尝试什么;工具系统决定什么被允许。Claude Code 独立管控约 40 个离散工具能力,并分为三个阶段:项目加载时建立信任;每次工具调用前检查权限;高风险操作需要用户明确确认。
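“模型决定尝试什么、工具系统决定什么被允许”可以落成一个很小的权限闸门:每次工具调用前查策略表,高风险操作要求用户明确确认。策略表的具体内容是为演示假设的:

```python
# 假设的策略表:按工具名决定放行 / 确认 / 拒绝
POLICY = {
    "read_file": "allow",      # 只读操作:自动放行
    "run_tests": "allow",
    "edit_file": "confirm",    # 高风险操作:需要用户明确确认
    "shell":     "deny",       # 本环境直接禁止
}

def check_permission(tool_name: str, confirmed: bool = False) -> bool:
    rule = POLICY.get(tool_name, "deny")   # 未知工具默认拒绝
    if rule == "allow":
        return True
    if rule == "confirm":
        return confirmed
    return False
```

关键在于这段逻辑位于模型之外:无论模型“决定”做什么,闸门都在每次工具调用前独立执行。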

10. 验证循环

这是玩具 demo 和生产级智能体的分水岭。Anthropic 推荐三种方法:基于规则的反馈,如测试、linter、类型检查器;视觉反馈,如 UI 任务中通过 Playwright 截图;LLM-as-judge,即让一个独立子智能体评估输出。

Claude Code 的创造者 Boris Cherny 曾指出,给模型一种验证自身工作的方式,可以把质量提升 2 到 3 倍。
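验证闭环的骨架是“产出 → 跑确定性检查 → 把失败信息喂回下一轮”。下面是一个最小示意:检查器返回错误信息、空字符串表示通过,这些约定都是演示用假设:

```python
def verify_and_fix(produce, checks: list, max_rounds: int = 3):
    """produce(feedback) 产出工件(模型调用的桩);checks 是确定性检查器列表。"""
    feedback = ""
    for _ in range(max_rounds):
        artifact = produce(feedback)
        # 收集所有未通过的检查(检查器返回非空错误信息表示失败)
        failures = [msg for check in checks if (msg := check(artifact))]
        if not failures:
            return artifact, True            # 全部通过,闭环结束
        feedback = "\n".join(failures)       # 把失败信息喂回下一轮
    return artifact, False                   # 轮次耗尽,交给上层处理
```

测试、linter、类型检查器都可以包装成这里的 check:它们提供的是确定性的真实依据,而不是另一个模型的意见。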

11. 子智能体编排

Claude Code 支持三种执行模型:Fork,对父上下文做字节级完全相同的复制;Teammate,使用单独终端面板,并通过基于文件的 mailbox 通信;Worktree,每个智能体拥有自己的 git worktree 和隔离分支。OpenAI 的 SDK 支持 agents-as-tools,由专家处理有边界的子任务;也支持 handoffs,由专家接管完整控制。LangGraph 把子智能体实现为嵌套状态图。

循环如何运转:一步一步走完

现在你已经了解这些组成部分,我们来追踪它们如何在一次循环中协同工作。


步骤 1(提示词组装):harness 构建完整输入:系统提示词 + 工具 schema + 记忆文件 + 对话历史 + 当前用户消息。重要上下文会被放在提示词的开头和结尾,也就是“Lost in the Middle”发现所提示的位置策略。

步骤 2(LLM 推理):组装好的提示词发送到模型 API。模型生成输出 token:文本、工具调用请求,或两者都有。

步骤 3(输出分类):如果模型只生成文本,没有工具调用,循环结束。如果它请求工具调用,则进入执行。如果请求 handoff,则更新当前智能体并重新开始。

步骤 4(工具执行):对于每个工具调用,harness 校验参数、检查权限、在沙箱环境中执行,并捕获结果。只读操作可以并发运行;会修改状态的操作串行运行。

步骤 5(结果打包):工具结果被格式化为 LLM 可读的消息。错误会被捕获,并作为错误结果返回,让模型可以自我修正。

步骤 6(上下文更新):结果被追加到对话历史。如果接近上下文窗口限制,harness 会触发压缩。

步骤 7(循环):回到步骤 1。重复直到终止。

终止条件是分层的:模型产生没有工具调用的响应;超过最大轮次限制;token 预算耗尽;护栏 tripwire 被触发;用户中断;或返回安全拒绝。一个简单问题可能只需要 1 到 2 轮。一个复杂重构任务可能跨许多轮,串联几十次工具调用。

对于跨越多个上下文窗口的长时间任务,Anthropic 开发了一种两阶段的“Ralph Loop”模式:一个 Initializer Agent 设置环境,包括 init script、进度文件、功能列表和初始 git commit;随后每个会话中的 Coding Agent 都会读取 git logs 和进度文件来定位自身,选择优先级最高且未完成的功能,开始工作、提交,并写下摘要。文件系统为多个上下文窗口之间提供连续性。
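“文件系统提供跨会话连续性”这一点可以这样示意:新会话先读进度文件定位自己,选出优先级最高且未完成的功能,完成后写回。进度文件的 JSON 格式是假设的,真实模式还会结合 git log 与提交记录:

```python
import json

def pick_next_feature(progress_path: str):
    """新会话的定位步骤:读进度文件,返回优先级最高且未完成的功能名。"""
    with open(progress_path, encoding="utf-8") as f:
        progress = json.load(f)
    todo = [x for x in progress["features"] if not x["done"]]
    if not todo:
        return None                                   # 全部完成,任务结束
    return min(todo, key=lambda x: x["priority"])["name"]

def mark_done(progress_path: str, name: str) -> None:
    """会话收尾:把完成状态写回文件,供下一个上下文窗口读取。"""
    with open(progress_path, encoding="utf-8") as f:
        progress = json.load(f)
    for x in progress["features"]:
        if x["name"] == name:
            x["done"] = True
    with open(progress_path, "w", encoding="utf-8") as f:
        json.dump(progress, f)
```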

真实框架如何实现这一模式


Anthropic 的 Claude Agent SDK 通过单个 query() 函数暴露 harness。这个函数创建 agentic loop,并返回一个流式输出消息的 async iterator。运行时是一个“笨循环”。所有智能都在模型中。Claude Code 使用 Gather-Act-Verify 循环:收集上下文,搜索文件、阅读代码;采取行动,编辑文件、运行命令;验证结果,运行测试、检查输出;然后重复。

OpenAI 的 Agents SDK 通过 Runner 类实现 harness,有三种模式:async、sync 和 streamed。SDK 是“code-first”的:工作流逻辑用原生 Python 表达,而不是图 DSL。Codex harness 在此基础上扩展为三层架构:Codex Core,包含智能体代码和运行时;App Server,提供双向 JSON-RPC API;以及客户端界面,包括 CLI、VS Code、Web app。所有界面共享同一个 harness,这就是为什么“Codex 模型在 Codex 界面里比在普通聊天窗口里感觉更好”。

LangGraph 把 harness 建模为显式状态图。两个节点,llm_call 和 tool_node,通过条件边连接:如果存在工具调用,就路由到 tool_node;如果不存在,就路由到 END。LangGraph 从 LangChain 的 AgentExecutor 演化而来;后者在 v0.2 中被废弃,因为难以扩展且缺少多智能体支持。LangChain 的 Deep Agents 明确使用“agent harness”这个术语:内置工具、规划能力,也就是 write_todos 工具;用于上下文管理的文件系统;子智能体生成;以及持久记忆。

CrewAI 实现的是基于角色的多智能体架构:Agent,即包在 LLM 周围的 harness,由角色、目标、背景故事和工具定义;Task,即工作单元;Crew,即智能体集合。CrewAI 的 Flows 层增加了一条“确定性骨架,在真正需要的地方注入智能”,负责路由和校验,而 Crews 处理自主协作。

AutoGen,后来演进为 Microsoft Agent Framework,开创了会话驱动的编排方式。它的三层架构 Core、AgentChat、Extensions 支持五种编排模式:顺序、并发,即 fan-out/fan-in;群聊;handoff;以及 magentic,即一个管理智能体维护动态任务账本,协调各类专家。

脚手架隐喻

脚手架这个隐喻不是装饰性的,而是准确的。建筑脚手架是一种临时基础设施,让工人能建造原本够不到的结构。它本身不做施工。但没有它,工人到不了更高楼层。

关键洞见是:建筑完成后,脚手架会被拆除。 随着模型进步,harness 的复杂度应该下降。Manus 在六个月内重写了五次,每次重写都在移除复杂性。复杂的工具定义变成了通用 shell 执行。“管理智能体”变成了简单的结构化 handoff。

这指向了共同演化原则:模型现在是在特定 harness 参与的循环中进行后训练的。Claude Code 的模型学会了使用自己训练时搭配的特定 harness。因为这种紧密耦合,改变工具实现可能会降低性能。

harness 设计的“面向未来测试”是:如果随着模型变强,性能能够提升,而不需要增加 harness 复杂度,那么这个设计就是健康的。


定义每个 Harness 的七个决策

每个 harness 架构师都要面对七个选择:

  1. 单智能体还是多智能体。 Anthropic 和 OpenAI 都说:先把单智能体做到最大化。多智能体系统会增加开销,包括额外的 LLM 调用用于路由,以及 handoff 过程中的上下文丢失。只有当工具过载超过约 10 个重叠工具,或存在明显分离的任务领域时,才拆分。

  2. ReAct 还是 plan-and-execute。 ReAct 在每一步交错推理和行动,灵活但每步成本更高。plan-and-execute 把规划和执行分开。LLMCompiler 报告称,相比顺序 ReAct,它实现了 3.6 倍加速。

  3. 上下文窗口管理策略。 生产环境中有五种方法:按时间清理、对话总结、observation 遮蔽、结构化记笔记,以及子智能体委派。ACON 研究显示,通过优先保留推理痕迹而不是原始工具输出,可以减少 26% 到 54% 的 token,同时保持 95% 以上准确率。

  4. 验证循环设计。 计算式验证,如测试和 linter,提供确定性的真实依据。推断式验证,如 LLM-as-judge,可以捕捉语义问题,但会增加延迟。Martin Fowler 的 Thoughtworks 团队把它表述为 guides,也就是前馈,在行动前引导;以及 sensors,也就是反馈,在行动后观察。

  5. 权限和安全架构。 宽松模式更快但风险更高,会自动批准大多数操作;限制模式更安全但更慢,每个操作都需要批准。选择取决于部署环境。

  6. 工具范围策略。 工具越多,性能往往越差。Vercel 从 v0 中移除了 80% 的工具,结果反而更好。Claude Code 通过延迟加载实现了 95% 的上下文减少。原则是:只暴露当前步骤所需的最小工具集合。

  7. Harness 厚度。 有多少逻辑放在 harness 里,有多少交给模型。Anthropic 押注薄 harness 和模型进步。基于图的框架押注显式控制。随着新版模型内化某些能力,Anthropic 会定期从 Claude Code 的 harness 中删除规划步骤。

Harness 就是产品

两个产品即使用完全相同的模型,也可能仅仅因为 harness 设计不同而表现天差地别。TerminalBench 的证据很清楚:只改变 harness,就能让智能体排名移动 20 多个名次。

harness 不是一个已经解决的问题,也不是商品化层。真正困难的工程就在这里:把上下文当作稀缺资源来管理;设计能在失败叠加前抓住问题的验证循环;构建能提供连续性又不会制造幻觉的记忆系统;并在架构上判断到底该搭多少脚手架,又该把多少能力留给模型。

随着模型进步,这个领域正在走向更薄的 harness。但 harness 本身不会消失。哪怕是最强的模型,也仍然需要某种东西来管理它的上下文窗口,执行它的工具调用,持久化它的状态,并验证它的工作。

下次你的智能体失败时,先别怪模型。看看 harness。

https://www.beren.io/2023-04-11-Scaffolded-LLMs-natural-language-computers/

就到这里!

如果你喜欢这篇文章:

来找我 → @akshay_pachaar

每天,我都会分享关于 AI、机器学习和 vibe coding 最佳实践的教程与见解。

A deep dive into what Anthropic, OpenAI, Perplexity and LangChain are actually building. Covering the orchestration loop, tools, memory, context management, and everything else that transforms a stateless LLM into a capable agent.

深入拆解 Anthropic、OpenAI、Perplexity 和 LangChain 到底在构建什么。本文涵盖编排循环、工具、记忆、上下文管理,以及所有把无状态 LLM 转化为强大智能体的关键部分。

You've built a chatbot. Maybe you've wired up a ReAct loop with a few tools. It works for demos. Then you try to build something production-grade, and the wheels come off: the model forgets what it did three steps ago, tool calls fail silently, and context windows fill up with garbage.

你已经做出了一个聊天机器人。也许还接了一个 ReAct 循环,配上几个工具。拿来演示没问题。然后你试着做一个生产级产品,问题就开始暴露:模型忘了三步之前做过什么,工具调用失败却悄无声息,上下文窗口被垃圾信息塞满。

The problem isn't your model. It's everything around your model.

问题不在你的模型。问题在模型周围的一切。

LangChain proved this when they changed only the infrastructure wrapping their LLM (same model, same weights) and jumped from outside the top 30 to rank 5 on TerminalBench 2.0. A separate research project hit a 76.4% pass rate by having an LLM optimize the infrastructure itself, surpassing hand-designed systems.

LangChain 证明了这一点。他们只改了包在 LLM 外面的基础设施,模型相同,权重相同,结果在 TerminalBench 2.0 上从 30 名开外跃升到第 5。另一个独立研究项目让 LLM 自己优化基础设施,最终达到 76.4% 的通过率,超过了人工设计的系统。

That infrastructure has a name now: the agent harness.

这套基础设施现在有了名字:agent harness

What Is the Agent Harness?

什么是 Agent Harness?

The term was formalized in early 2026, but the concept existed long before. The harness is the complete software infrastructure wrapping an LLM: orchestration loop, tools, memory, context management, state persistence, error handling, and guardrails. Anthropic's Claude Code documentation puts it simply: the SDK is "the agent harness that powers Claude Code." OpenAI's Codex team uses the same framing, explicitly equating the terms "agent" and "harness" to refer to the non-model infrastructure that makes the LLM useful.

这个术语是在 2026 年初被正式化的,但概念早就存在。harness 是包裹 LLM 的完整软件基础设施:编排循环、工具、记忆、上下文管理、状态持久化、错误处理和护栏。Anthropic 的 Claude Code 文档说得很简单:SDK 是“驱动 Claude Code 的 agent harness”。OpenAI 的 Codex 团队也采用同样的说法,明确把“agent”和“harness”这两个词等同起来,用来指让 LLM 真正有用的非模型基础设施

I really liked the canonical formula, from LangChain's Vivek Trivedy: "If you're not the model, you're the harness."

我很喜欢 LangChain 的 Vivek Trivedy 给出的经典公式:“如果你不是模型,你就是 harness。”

Here's the distinction that trips people up. The "agent" is the emergent behavior: the goal-directed, tool-using, self-correcting entity the user interacts with. The harness is the machinery producing that behavior. When someone says "I built an agent," they mean they built a harness and pointed it at a model.

这里有个常让人混淆的区别。“智能体”是涌现出来的行为:用户交互到的那个有目标、会用工具、能自我修正的实体。harness 是产生这种行为的机器。当有人说“我做了一个智能体”,实际意思是他们做了一个 harness,然后把它接到了一个模型上。

Beren Millidge made this analogy precise in his 2023 essay "Scaffolded LLMs as Natural Language Computers." A raw LLM is a CPU with no RAM, no disk, and no I/O. The context window serves as RAM (fast but limited). External databases function as disk storage (large but slow). Tool integrations act as device drivers. The harness is the operating system. As Millidge wrote: "We have reinvented the Von Neumann architecture" because it's a natural abstraction for any computing system.

Beren Millidge 在 2023 年的文章《Scaffolded LLMs as Natural Language Computers》中,把这个类比讲得很准确。原始 LLM 就像一颗没有 RAM、没有磁盘、没有 I/O 的 CPU。上下文窗口充当 RAM,速度快但容量有限。外部数据库相当于磁盘存储,容量大但速度慢。工具集成像设备驱动。harness 就是操作系统。正如 Millidge 所写:“我们重新发明了冯·诺依曼架构”,因为它是任何计算系统都自然会走向的一种抽象。

Three Levels of Engineering

三层工程

Three concentric levels of engineering surround the model:

模型周围有三层同心工程:

  • Prompt engineering crafts the instructions the model receives.
  • 提示词工程设计模型接收到的指令。
  • Context engineering manages what the model sees and when.
  • 上下文工程管理模型在什么时候看到什么。
  • Harness engineering encompasses both, plus the entire application infrastructure: tool orchestration, state persistence, error recovery, verification loops, safety enforcement, and lifecycle management.
  • Harness 工程包含前两者,同时还包括完整的应用基础设施:工具编排、状态持久化、错误恢复、验证循环、安全执行和生命周期管理。

The harness is not a wrapper around a prompt. It is the complete system that makes autonomous agent behavior possible.

harness 不是套在提示词外面的包装。它是让自主智能体行为成为可能的完整系统。

The 12 Components of a Production Harness

生产级 Harness 的 12 个组成部分

Synthesizing across Anthropic, OpenAI, LangChain, and the broader practitioner community, a production agent harness has twelve distinct components. Let's walk through each one.

综合 Anthropic、OpenAI、LangChain 以及更广泛的一线实践者经验,一个生产级 agent harness 有十二个不同组成部分。下面逐一展开。

1. The Orchestration Loop

1. 编排循环

This is the heartbeat. It implements the Thought-Action-Observation (TAO) cycle, also called the ReAct loop. The loop runs: assemble prompt, call LLM, parse output, execute any tool calls, feed results back, repeat until done.

这是心跳。它实现 Thought-Action-Observation,也就是 TAO 循环,也叫 ReAct 循环。这个循环会依次执行:组装提示词、调用 LLM、解析输出、执行工具调用、把结果喂回模型,不断重复直到完成。

Mechanically, it's often just a while loop. The complexity lives in everything the loop manages, not the loop itself. Anthropic describes their runtime as a "dumb loop" where all intelligence lives in the model. The harness just manages turns.

从机制上看,它通常只是一个 while 循环。复杂性存在于循环所管理的一切之中,而不是循环本身。Anthropic 把他们的运行时描述为一个“笨循环”,所有智能都在模型里。harness 只负责管理轮次。

2. Tools

2. 工具

Tools are the agent's hands. They're defined as schemas (name, description, parameter types) injected into the LLM's context so the model knows what's available. The tool layer handles registration, schema validation, argument extraction, sandboxed execution, result capture, and formatting results back into LLM-readable observations.

工具是智能体的手。它们以 schema 的形式定义,包括名称、描述、参数类型,并被注入到 LLM 的上下文中,让模型知道自己能用什么。工具层负责注册、schema 校验、参数提取、沙箱执行、结果捕获,以及把结果格式化成 LLM 能读懂的 observation。

Claude Code provides tools across six categories: file operations, search, execution, web access, code intelligence, and subagent spawning. OpenAI's Agents SDK supports function tools (via @function_tool), hosted tools (WebSearch, CodeInterpreter, FileSearch), and MCP server tools.

Claude Code 提供六类工具:文件操作、搜索、执行、网页访问、代码智能和子智能体生成。OpenAI 的 Agents SDK 支持函数工具,通过 @function_tool;托管工具,如 WebSearch、CodeInterpreter、FileSearch;以及 MCP 服务器工具。

3. Memory

3. 记忆

Memory operates at multiple timescales. Short-term memory is conversation history within a single session. Long-term memory persists across sessions: Anthropic uses CLAUDE.md project files and auto-generated MEMORY.md files; LangGraph uses namespace-organized JSON Stores; OpenAI supports Sessions backed by SQLite or Redis.

记忆运行在多个时间尺度上。短期记忆是单次会话内的对话历史。长期记忆跨会话持久存在:Anthropic 使用 CLAUDE.md 项目文件和自动生成的 MEMORY.md 文件;LangGraph 使用按命名空间组织的 JSON Stores;OpenAI 支持由 SQLite 或 Redis 支撑的 Sessions。

Claude Code implements a three-tier hierarchy: a lightweight index (~150 characters per entry, always loaded), detailed topic files pulled in on demand, and raw transcripts accessed via search only. A critical design principle: the agent treats its own memory as a "hint" and verifies against actual state before acting.

Claude Code 实现了三层结构:轻量索引,每条约 150 个字符,始终加载;按需拉取的详细主题文件;以及只能通过搜索访问的原始转录记录。一个关键设计原则是:智能体把自己的记忆当作“提示”,在行动前会对照真实状态进行验证

4. Context Management

4. 上下文管理

This is where many agents fail silently. The core problem is context rot: model performance degrades 30%+ when key content falls in mid-window positions (Chroma research, corroborated by Stanford's "Lost in the Middle" finding). Even million-token windows suffer from instruction-following degradation as context grows.

很多智能体都是在这里无声失败的。核心问题是上下文腐烂:当关键信息落在窗口中部位置时,模型性能会下降 30% 以上,这是 Chroma 的研究结果,也得到斯坦福“Lost in the Middle”发现的印证。即使是百万 token 窗口,随着上下文增长,指令遵循能力也会退化。

Production strategies include:

生产环境中的策略包括:

  • Compaction: summarizing conversation history when approaching limits (Claude Code preserves architectural decisions and unresolved bugs while discarding redundant tool outputs)
  • 压缩:接近限制时总结对话历史。Claude Code 会保留架构决策和未解决 bug,同时丢弃冗余工具输出。
  • Observation masking: JetBrains' Junie hides old tool outputs while keeping tool calls visible
  • Observation 遮蔽:JetBrains 的 Junie 会隐藏旧工具输出,但保留工具调用可见。
  • Just-in-time retrieval: maintaining lightweight identifiers and loading data dynamically (Claude Code uses grep, glob, head, tail rather than loading full files)
  • 即时检索:维护轻量标识符,动态加载数据。Claude Code 使用 grep、glob、head、tail,而不是加载完整文件。
  • Sub-agent delegation: each subagent explores extensively but returns only 1,000 to 2,000 token condensed summaries
  • 子智能体委派:每个子智能体可以大范围探索,但只返回 1,000 到 2,000 token 的浓缩摘要。

Anthropic's context engineering guide states the goal: find the smallest possible set of high-signal tokens that maximize likelihood of the desired outcome.

Anthropic 的上下文工程指南把目标说得很明确:找到尽可能小、信号密度尽可能高的 token 集合,最大化得到理想结果的概率。

5. Prompt Construction

5. 提示词构建

This assembles what the model actually sees at each step. It's hierarchical: system prompt, tool definitions, memory files, conversation history, and the current user message.

这一步组装模型在每一步真正看到的内容。它是分层的:系统提示词、工具定义、记忆文件、对话历史,以及当前用户消息。

OpenAI's Codex uses a strict priority stack: server-controlled system message (highest priority), tool definitions, developer instructions, user instructions (cascading AGENTS.md files, 32 KiB limit), then conversation history.

OpenAI 的 Codex 使用严格的优先级栈:服务端控制的系统消息,优先级最高;工具定义;开发者指令;用户指令,包括级联的 AGENTS.md 文件,限制 32 KiB;然后是对话历史。

6. Output Parsing

6. 输出解析

Modern harnesses rely on native tool calling, where the model returns structured tool_calls objects rather than free-text that must be parsed. The harness checks: are there tool calls? Execute them and loop. No tool calls? That's the final answer.

现代 harness 依赖原生工具调用,模型返回结构化的 tool_calls 对象,而不是必须解析的自由文本。harness 检查:有没有工具调用?有就执行并继续循环。没有工具调用?那就是最终答案。

For structured outputs, both OpenAI and LangChain support schema-constrained responses via Pydantic models. Legacy approaches like RetryWithErrorOutputParser (which feeds the original prompt, the failed completion, and the parsing error back to the model) remain available for edge cases.

对于结构化输出,OpenAI 和 LangChain 都支持通过 Pydantic 模型进行 schema 约束响应。像 RetryWithErrorOutputParser 这样的旧方法仍然可用于边缘情况,它会把原始提示词、失败的 completion 和解析错误一起反馈给模型。

7. State Management

7. 状态管理

LangGraph models state as typed dictionaries flowing through graph nodes, with reducers merging updates. Checkpointing happens at super-step boundaries, enabling resume after interruptions and time-travel debugging. OpenAI offers four mutually exclusive strategies: application memory, SDK sessions, server-side Conversations API, or lightweight previous_response_id chaining. Claude Code takes a different approach: git commits as checkpoints and progress files as structured scratchpads.

LangGraph 把状态建模为流经图节点的类型化字典,并用 reducers 合并更新。检查点发生在 super-step 边界,因此可以在中断后恢复,也可以进行时间旅行式调试。OpenAI 提供四种互斥策略:应用内存、SDK sessions、服务端 Conversations API,或轻量的 previous_response_id 链接。Claude Code 采用不同路线:把 git commit 当作检查点,把进度文件当作结构化草稿本

8. Error Handling

8. 错误处理

Here's why this matters: a 10-step process with 99% per-step success still has only ~90.4% end-to-end success. Errors compound fast.

这件事之所以重要,是因为:一个 10 步流程,即使每一步成功率都是 99%,端到端成功率也只有约 90.4%。错误会迅速叠加。

LangGraph distinguishes four error types: transient (retry with backoff), LLM-recoverable (return error as ToolMessage so the model can adjust), user-fixable (interrupt for human input), and unexpected (bubble up for debugging). Anthropic catches failures within tool handlers and returns them as error results to keep the loop running. Stripe's production harness caps retry attempts at two.

LangGraph 区分四类错误:瞬时错误,带退避重试;LLM 可恢复错误,把错误作为 ToolMessage 返回,让模型调整;用户可修复错误,中断并等待人工输入;意外错误,向上抛出用于调试。Anthropic 会在工具处理器内部捕获失败,并把它们作为错误结果返回,以保持循环继续运行。Stripe 的生产 harness 把重试次数限制在两次。

9. Guardrails and Safety

9. 护栏与安全

OpenAI's SDK implements three levels: input guardrails (run on first agent), output guardrails (run on final output), and tool guardrails (run on every tool invocation). A "tripwire" mechanism halts the agent immediately when triggered.

OpenAI 的 SDK 实现了三个层级:输入护栏,在第一个智能体上运行;输出护栏,在最终输出上运行;工具护栏,在每次工具调用时运行。“tripwire”机制会在触发时立刻停止智能体。

Anthropic separates permission enforcement from model reasoning architecturally. The model decides what to attempt; the tool system decides what's allowed. Claude Code gates ~40 discrete tool capabilities independently, with three stages: trust establishment at project load, permission check before each tool call, and explicit user confirmation for high-risk operations.

Anthropic 在架构上把权限执行和模型推理分离。模型决定尝试什么;工具系统决定什么被允许。Claude Code 独立管控约 40 个离散工具能力,并分为三个阶段:项目加载时建立信任;每次工具调用前检查权限;高风险操作需要用户明确确认。

10. Verification Loops

10. 验证循环

This is what separates toy demos from production agents. Anthropic recommends three approaches: rules-based feedback (tests, linters, type checkers), visual feedback (screenshots via Playwright for UI tasks), and LLM-as-judge (a separate subagent evaluates output).

这是玩具 demo 和生产级智能体的分水岭。Anthropic 推荐三种方法:基于规则的反馈,如测试、linter、类型检查器;视觉反馈,如 UI 任务中通过 Playwright 截图;LLM-as-judge,即让一个独立子智能体评估输出。

Boris Cherny, creator of Claude Code, noted that giving the model a way to verify its work improves quality by 2 to 3x.

Claude Code 的创造者 Boris Cherny 曾指出,给模型一种验证自身工作的方式,可以把质量提升 2 到 3 倍

11. Subagent Orchestration

11. 子智能体编排

Claude Code supports three execution models: Fork (byte-identical copy of parent context), Teammate (separate terminal pane with file-based mailbox communication), and Worktree (own git worktree, isolated branch per agent). OpenAI's SDK supports agents-as-tools (specialist handles bounded subtask) and handoffs (specialist takes full control). LangGraph implements subagents as nested state graphs.

Claude Code 支持三种执行模型:Fork,对父上下文做字节级完全相同的复制;Teammate,使用单独终端面板,并通过基于文件的 mailbox 通信;Worktree,每个智能体拥有自己的 git worktree 和隔离分支。OpenAI 的 SDK 支持 agents-as-tools,由专家处理有边界的子任务;也支持 handoffs,由专家接管完整控制。LangGraph 把子智能体实现为嵌套状态图。

The Loop in Motion: A Step-by-Step Walkthrough

循环如何运转:一步一步走完

Now that you know the components, let's trace how they work together in a single cycle.

现在你已经了解这些组成部分,我们来追踪它们如何在一次循环中协同工作。

Step 1 (Prompt Assembly): The harness constructs the full input: system prompt + tool schemas + memory files + conversation history + current user message. Important context is positioned at the beginning and end of the prompt (the "Lost in the Middle" finding).

步骤 1(提示词组装):harness 构建完整输入:系统提示词 + 工具 schema + 记忆文件 + 对话历史 + 当前用户消息。重要上下文会被放在提示词的开头和结尾,也就是“Lost in the Middle”发现所提示的位置策略。

Step 2 (LLM Inference): The assembled prompt goes to the model API. The model generates output tokens: text, tool call requests, or both.

步骤 2(LLM 推理):组装好的提示词发送到模型 API。模型生成输出 token:文本、工具调用请求,或两者都有。

Step 3 (Output Classification): If the model produced text with no tool calls, the loop ends. If it requested tool calls, proceed to execution. If a handoff was requested, update the current agent and restart.

步骤 3(输出分类):如果模型只生成文本,没有工具调用,循环结束。如果它请求工具调用,则进入执行。如果请求 handoff,则更新当前智能体并重新开始。

Step 4 (Tool Execution): For each tool call, the harness validates arguments, checks permissions, executes in a sandboxed environment, and captures results. Read-only operations can run concurrently; mutating operations run serially.

步骤 4(工具执行):对于每个工具调用,harness 校验参数、检查权限、在沙箱环境中执行,并捕获结果。只读操作可以并发运行;会修改状态的操作串行运行。

Step 5 (Result Packaging): Tool results are formatted as LLM-readable messages. Errors are caught and returned as error results so the model can self-correct.

步骤 5(结果打包):工具结果被格式化为 LLM 可读的消息。错误会被捕获,并作为错误结果返回,让模型可以自我修正。

Step 6 (Context Update): Results are appended to conversation history. If approaching the context window limit, the harness triggers compaction.

步骤 6(上下文更新):结果被追加到对话历史。如果接近上下文窗口限制,harness 会触发压缩。

Step 7 (Loop): Return to Step 1. Repeat until termination.

步骤 7(循环):回到步骤 1。重复直到终止。

Termination conditions are layered: the model produces a response with no tool calls, maximum turn limit is exceeded, token budget is exhausted, a guardrail tripwire fires, the user interrupts, or a safety refusal is returned. A simple question might take 1 to 2 turns. A complex refactoring task can chain dozens of tool calls across many turns.

终止条件是分层的:模型产生没有工具调用的响应;超过最大轮次限制;token 预算耗尽;护栏 tripwire 被触发;用户中断;或返回安全拒绝。一个简单问题可能只需要 1 到 2 轮。一个复杂重构任务可能跨许多轮,串联几十次工具调用。

For long-running tasks spanning multiple context windows, Anthropic developed a two-phase "Ralph Loop" pattern: an Initializer Agent sets up the environment (init script, progress file, feature list, initial git commit), then a Coding Agent in every subsequent session reads git logs and progress files to orient itself, picks the highest-priority incomplete feature, works on it, commits, and writes summaries. The filesystem provides continuity across context windows.

对于跨越多个上下文窗口的长时间任务,Anthropic 开发了一种两阶段的“Ralph Loop”模式:一个 Initializer Agent 设置环境,包括 init script、进度文件、功能列表和初始 git commit;随后每个会话中的 Coding Agent 都会读取 git logs 和进度文件来定位自身,选择优先级最高且未完成的功能,开始工作、提交,并写下摘要。文件系统为多个上下文窗口之间提供连续性。

How Real Frameworks Implement the Pattern

真实框架如何实现这一模式

Anthropic's Claude Agent SDK exposes the harness through a single query() function that creates the agentic loop and returns an async iterator streaming messages. The runtime is a "dumb loop." All intelligence lives in the model. Claude Code uses a Gather-Act-Verify cycle: gather context (search files, read code), take action (edit files, run commands), verify results (run tests, check output), repeat.

Anthropic 的 Claude Agent SDK 通过单个 query() 函数暴露 harness。这个函数创建 agentic loop,并返回一个流式输出消息的 async iterator。运行时是一个“笨循环”。所有智能都在模型中。Claude Code 使用 Gather-Act-Verify 循环:收集上下文,搜索文件、阅读代码;采取行动,编辑文件、运行命令;验证结果,运行测试、检查输出;然后重复。

OpenAI's Agents SDK implements the harness through the Runner class with three modes: async, sync, and streamed. The SDK is "code-first": workflow logic is expressed in native Python rather than graph DSLs. The Codex harness extends this with a three-layer architecture: Codex Core (agent code + runtime), App Server (bidirectional JSON-RPC API), and client surfaces (CLI, VS Code, web app). All surfaces share the same harness, which is why "Codex models feel better on Codex surfaces than a generic chat window."

OpenAI 的 Agents SDK 通过 Runner 类实现 harness,有三种模式:async、sync 和 streamed。SDK 是“code-first”的:工作流逻辑用原生 Python 表达,而不是图 DSL。Codex harness 在此基础上扩展为三层架构:Codex Core,包含智能体代码和运行时;App Server,提供双向 JSON-RPC API;以及客户端界面,包括 CLI、VS Code、Web app。所有界面共享同一个 harness,这就是为什么“Codex 模型在 Codex 界面里比在普通聊天窗口里感觉更好”。

LangGraph models the harness as an explicit state graph. Two nodes (llm_call and tool_node) connected by a conditional edge: if tool calls present, route to tool_node; if absent, route to END. LangGraph evolved from LangChain's AgentExecutor, which was deprecated in v0.2 because it was hard to extend and lacked multi-agent support. LangChain's Deep Agents explicitly use the term "agent harness": built-in tools, planning (write_todos tool), file systems for context management, subagent spawning, and persistent memory.

LangGraph 把 harness 建模为显式状态图。两个节点,llm_call 和 tool_node,通过条件边连接:如果存在工具调用,就路由到 tool_node;如果不存在,就路由到 END。LangGraph 从 LangChain 的 AgentExecutor 演化而来;后者在 v0.2 中被废弃,因为难以扩展且缺少多智能体支持。LangChain 的 Deep Agents 明确使用“agent harness”这个术语:内置工具、规划能力,也就是 write_todos 工具;用于上下文管理的文件系统;子智能体生成;以及持久记忆。

CrewAI implements a role-based multi-agent architecture: Agent (the harness around the LLM, defined by role, goal, backstory, and tools), Task (the unit of work), and Crew (the collection of agents). CrewAI's Flows layer adds a "deterministic backbone with intelligence where it matters," managing routing and validation while Crews handle autonomous collaboration.

CrewAI 实现的是基于角色的多智能体架构:Agent,即包在 LLM 周围的 harness,由角色、目标、背景故事和工具定义;Task,即工作单元;Crew,即智能体集合。CrewAI 的 Flows 层增加了一条“确定性骨架,在真正需要的地方注入智能”,负责路由和校验,而 Crews 处理自主协作。

AutoGen (evolving into Microsoft Agent Framework) pioneered conversation-driven orchestration. Its three-layer architecture (Core, AgentChat, Extensions) supports five orchestration patterns: sequential, concurrent (fan-out/fan-in), group chat, handoff, and magentic (a manager agent maintains a dynamic task ledger coordinating specialists).

AutoGen,后来演进为 Microsoft Agent Framework,开创了会话驱动的编排方式。它的三层架构 Core、AgentChat、Extensions 支持五种编排模式:顺序、并发,即 fan-out/fan-in;群聊;handoff;以及 magentic,即一个管理智能体维护动态任务账本,协调各类专家。

The Scaffolding Metaphor

脚手架隐喻

The scaffolding metaphor isn't decorative. It's precise. Construction scaffolding is temporary infrastructure that enables workers to build a structure they couldn't reach otherwise. It doesn't do the construction. But without it, workers can't reach the upper floors.

脚手架这个隐喻不是装饰性的,而是准确的。建筑脚手架是一种临时基础设施,让工人能建造原本够不到的结构。它本身不做施工。但没有它,工人到不了更高楼层。

The key insight: scaffolding is removed when the building is complete. As models improve, harness complexity should decrease. Manus was rebuilt five times in six months, each rewrite removing complexity. Complex tool definitions became general shell execution. "Management agents" became simple structured handoffs.

关键洞见是:建筑完成后,脚手架会被拆除。 随着模型进步,harness 的复杂度应该下降。Manus 在六个月内重写了五次,每次重写都在移除复杂性。复杂的工具定义变成了通用 shell 执行。“管理智能体”变成了简单的结构化 handoff。

This points to the co-evolution principle: models are now post-trained with specific harnesses in the loop. Claude Code's model learned to use the specific harness it was trained with. Changing tool implementations can degrade performance because of this tight coupling.

这指向了共同演化原则:模型现在是在特定 harness 参与的循环中进行后训练的。Claude Code 的模型学会了使用自己训练时搭配的特定 harness。因为这种紧密耦合,改变工具实现可能会降低性能。

The "future-proofing test" for harness design: if performance scales up with more powerful models without adding harness complexity, the design is sound.

harness 设计的“面向未来测试”是:如果随着模型变强,性能能够提升,而不需要增加 harness 复杂度,那么这个设计就是健康的。

Seven Decisions That Define Every Harness

定义每个 Harness 的七个决策

Every harness architect faces seven choices:

每个 harness 架构师都要面对七个选择:

  1. Single-agent vs. multi-agent. Both Anthropic and OpenAI say: maximize a single agent first. Multi-agent systems add overhead (extra LLM calls for routing, context loss during handoffs). Split only when tool overload exceeds ~10 overlapping tools or clearly separate task domains exist.
  1. 单智能体还是多智能体。 Anthropic 和 OpenAI 都说:先把单智能体做到最大化。多智能体系统会增加开销,包括额外的 LLM 调用用于路由,以及 handoff 过程中的上下文丢失。只有当工具过载超过约 10 个重叠工具,或存在明显分离的任务领域时,才拆分。
  2. ReAct vs. plan-and-execute. ReAct interleaves reasoning and action at every step (flexible but higher per-step cost). Plan-and-execute separates planning from execution. LLMCompiler reports a 3.6x speedup over sequential ReAct.
  2. ReAct 还是 plan-and-execute。 ReAct 在每一步交错推理和行动,灵活但每步成本更高。plan-and-execute 把规划和执行分开。LLMCompiler 报告称,相比顺序 ReAct,它实现了 3.6 倍加速
  3. Context window management strategy. Five production approaches: time-based clearing, conversation summarization, observation masking, structured note-taking, and sub-agent delegation. ACON research showed 26 to 54% token reduction while preserving 95%+ accuracy by prioritizing reasoning traces over raw tool outputs.
  3. 上下文窗口管理策略。 生产环境中有五种方法:按时间清理、对话总结、observation 遮蔽、结构化记笔记,以及子智能体委派。ACON 研究显示,通过优先保留推理痕迹而不是原始工具输出,可以减少 26% 到 54% 的 token,同时保持 95% 以上准确率
  4. Verification loop design. Computational verification (tests, linters) provides deterministic ground truth. Inferential verification (LLM-as-judge) catches semantic issues but adds latency. Martin Fowler's Thoughtworks team frames this as guides (feedforward, steer before action) versus sensors (feedback, observe after action).
  4. 验证循环设计。 计算式验证,如测试和 linter,提供确定性的真实依据。推断式验证,如 LLM-as-judge,可以捕捉语义问题,但会增加延迟。Martin Fowler 的 Thoughtworks 团队把它表述为 guides,也就是前馈,在行动前引导;以及 sensors,也就是反馈,在行动后观察。
  5. Permission and safety architecture. Permissive (fast but risky, auto-approve most actions) versus restrictive (safe but slow, require approval for each action). The choice depends on deployment context.
  5. 权限和安全架构。 宽松模式更快但风险更高,会自动批准大多数操作;限制模式更安全但更慢,每个操作都需要批准。选择取决于部署环境。
  6. Tool scoping strategy. More tools often means worse performance. Vercel removed 80% of tools from v0 and got better results. Claude Code achieves 95% context reduction via lazy loading. The principle: expose the minimum tool set needed for the current step.
  6. 工具范围策略。 工具越多,性能往往越差。Vercel 从 v0 中移除了 80% 的工具,结果反而更好。Claude Code 通过延迟加载实现了 95% 的上下文减少。原则是:只暴露当前步骤所需的最小工具集合。
  7. Harness thickness. How much logic lives in the harness versus the model. Anthropic bets on thin harnesses and model improvement. Graph-based frameworks bet on explicit control. Anthropic regularly deletes planning steps from Claude Code's harness as new model versions internalize that capability.
  7. Harness 厚度。 有多少逻辑放在 harness 里,有多少交给模型。Anthropic 押注薄 harness 和模型进步。基于图的框架押注显式控制。随着新版模型内化某些能力,Anthropic 会定期从 Claude Code 的 harness 中删除规划步骤。

The Harness Is the Product

Harness 就是产品

Two products using identical models can have wildly different performance based solely on harness design. The TerminalBench evidence is clear: changing only the harness moved agents by 20+ ranking positions.

两个产品即使用完全相同的模型,也可能仅仅因为 harness 设计不同而表现天差地别。TerminalBench 的证据很清楚:只改变 harness,就能让智能体排名移动 20 多个名次。

The harness is not a solved problem or a commodity layer. It's where the hard engineering lives: managing context as a scarce resource, designing verification loops that catch failures before they compound, building memory systems that provide continuity without hallucination, and making architectural bets about how much scaffolding to build versus how much to leave to the model.

harness 不是一个已经解决的问题,也不是商品化层。真正困难的工程就在这里:把上下文当作稀缺资源来管理;设计能在失败叠加前抓住问题的验证循环;构建能提供连续性又不会制造幻觉的记忆系统;并在架构上判断到底该搭多少脚手架,又该把多少能力留给模型。

The field is moving toward thinner harnesses as models improve. But the harness itself isn't going away. Even the most capable model needs something to manage its context window, execute its tool calls, persist its state, and verify its work.

随着模型进步,这个领域正在走向更薄的 harness。但 harness 本身不会消失。哪怕是最强的模型,也仍然需要某种东西来管理它的上下文窗口,执行它的工具调用,持久化它的状态,并验证它的工作。

The next time your agent fails, don't blame the model. Look at the harness.

下次你的智能体失败时,先别怪模型。看看 harness。

That's a wrap!

就到这里!

If you enjoyed reading this:

如果你喜欢这篇文章:

Find me →@akshay_pachaar ✔️

来找我 →@akshay_pachaar ✔️

Every day, I share tutorials and insights on AI, Machine Learning, and vibe coding best practices.

每天,我都会分享关于 AI、机器学习和 vibe coding 最佳实践的教程与见解。

A deep dive into what Anthropic, OpenAI, Perplexity and LangChain are actually building. Covering the orchestration loop, tools, memory, context management, and everything else that transforms a stateless LLM into a capable agent.

You've built a chatbot. Maybe you've wired up a ReAct loop with a few tools. It works for demos. Then you try to build something production-grade, and the wheels come off: the model forgets what it did three steps ago, tool calls fail silently, and context windows fill up with garbage.

The problem isn't your model. It's everything around your model.

LangChain proved this when they changed only the infrastructure wrapping their LLM (same model, same weights) and jumped from outside the top 30 to rank 5 on TerminalBench 2.0. A separate research project hit a 76.4% pass rate by having an LLM optimize the infrastructure itself, surpassing hand-designed systems.

That infrastructure has a name now: the agent harness.

What Is the Agent Harness?

The term was formalized in early 2026, but the concept existed long before. The harness is the complete software infrastructure wrapping an LLM: orchestration loop, tools, memory, context management, state persistence, error handling, and guardrails. Anthropic's Claude Code documentation puts it simply: the SDK is "the agent harness that powers Claude Code." OpenAI's Codex team uses the same framing, explicitly equating the terms "agent" and "harness" to refer to the non-model infrastructure that makes the LLM useful.

I really liked the canonical formula from LangChain's Vivek Trivedy: "If you're not the model, you're the harness."

Here's the distinction that trips people up. The "agent" is the emergent behavior: the goal-directed, tool-using, self-correcting entity the user interacts with. The harness is the machinery producing that behavior. When someone says "I built an agent," they mean they built a harness and pointed it at a model.

https://www.anthropic.com/research/long-running-Claude

Beren Millidge made this analogy precise in his 2023 essay "Scaffolded LLMs as Natural Language Computers." A raw LLM is a CPU with no RAM, no disk, and no I/O. The context window serves as RAM (fast but limited). External databases function as disk storage (large but slow). Tool integrations act as device drivers. The harness is the operating system. As Millidge wrote: "We have reinvented the Von Neumann architecture" because it's a natural abstraction for any computing system.

Three Levels of Engineering

Three concentric levels of engineering surround the model:

  • Prompt engineering crafts the instructions the model receives.

  • Context engineering manages what the model sees and when.

  • Harness engineering encompasses both, plus the entire application infrastructure: tool orchestration, state persistence, error recovery, verification loops, safety enforcement, and lifecycle management.

The harness is not a wrapper around a prompt. It is the complete system that makes autonomous agent behavior possible.

The 12 Components of a Production Harness

Synthesizing across Anthropic, OpenAI, LangChain, and the broader practitioner community, a production agent harness has twelve distinct components. Let's walk through each one.

1. The Orchestration Loop

This is the heartbeat. It implements the Thought-Action-Observation (TAO) cycle, also called the ReAct loop. The loop runs: assemble prompt, call LLM, parse output, execute any tool calls, feed results back, repeat until done.

Mechanically, it's often just a while loop. The complexity lives in everything the loop manages, not the loop itself. Anthropic describes their runtime as a "dumb loop" where all intelligence lives in the model. The harness just manages turns.
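That "dumb loop" can be sketched in a few lines. Everything here is illustrative: `call_llm` is a stub standing in for a model API, and the tool registry is a plain dict, not any SDK's actual interface.

```python
# A framework-free sketch of the TAO loop: assemble, call, parse,
# execute tools, feed results back, repeat until no tool calls remain.

def call_llm(messages):
    # Stub model: requests a tool once, then produces a final answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"content": "", "tool_calls": [{"name": "add", "args": {"a": 2, "b": 3}}]}
    return {"content": "2 + 3 = 5", "tool_calls": []}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(user_message, max_turns=10):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):                 # turn limit as a termination guard
        reply = call_llm(messages)
        if not reply["tool_calls"]:            # no tool calls -> final answer
            return reply["content"]
        for call in reply["tool_calls"]:       # execute tools, feed results back
            result = TOOLS[call["name"]](**call["args"])
            messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("max turns exceeded")
```

The intelligence lives in `call_llm`; the loop only manages turns and termination.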

2. Tools

Tools are the agent's hands. They're defined as schemas (name, description, parameter types) injected into the LLM's context so the model knows what's available. The tool layer handles registration, schema validation, argument extraction, sandboxed execution, result capture, and formatting results back into LLM-readable observations.

Claude Code provides tools across six categories: file operations, search, execution, web access, code intelligence, and subagent spawning. OpenAI's Agents SDK supports function tools (via @function_tool), hosted tools (WebSearch, CodeInterpreter, FileSearch), and MCP server tools.
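A tool definition in the JSON-Schema style most providers use, plus the validation and error-capture steps described above, might look like this sketch. The tool name, schema shape, and `execute_tool` helper are illustrative, not a specific SDK's API.

```python
import json

def read_file(path: str) -> str:
    # Hypothetical implementation; a real harness would sandbox this.
    return f"<contents of {path}>"

READ_FILE_SCHEMA = {
    "name": "read_file",
    "description": "Read a text file and return its contents.",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

def execute_tool(schema, impl, raw_args: str):
    args = json.loads(raw_args)                    # argument extraction
    for name in schema["parameters"]["required"]:  # schema validation
        if name not in args:
            return {"error": f"missing required argument: {name}"}
    try:
        return {"result": impl(**args)}            # sandboxed execution in production
    except Exception as exc:
        return {"error": str(exc)}                 # failures become observations, not crashes
```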

3. Memory

Memory operates at multiple timescales. Short-term memory is conversation history within a single session. Long-term memory persists across sessions: Anthropic uses CLAUDE.md project files and auto-generated MEMORY.md files; LangGraph uses namespace-organized JSON Stores; OpenAI supports Sessions backed by SQLite or Redis.

Claude Code implements a three-tier hierarchy: a lightweight index (~150 characters per entry, always loaded), detailed topic files pulled in on demand, and raw transcripts accessed via search only. A critical design principle: the agent treats its own memory as a "hint" and verifies against actual state before acting.
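The three-tier hierarchy can be sketched with in-memory dicts; the class and method names here are illustrative, and real systems persist each tier to files.

```python
# Tiered memory: a tiny always-loaded index, topic files pulled on demand,
# and raw transcripts reachable only through search.

class Memory:
    def __init__(self):
        self.index = {}        # topic -> one-line summary (always in context)
        self.topics = {}       # topic -> detailed notes (loaded on demand)
        self.transcripts = []  # raw history (search only, never bulk-loaded)

    def remember(self, topic, summary, detail):
        self.index[topic] = summary[:150]   # keep index entries ~150 chars
        self.topics[topic] = detail

    def context_header(self):
        # The only part injected into every prompt: the index alone.
        return "\n".join(f"- {t}: {s}" for t, s in self.index.items())

    def recall(self, topic):
        # Memory is a "hint": callers should verify against actual state.
        return self.topics.get(topic)

    def search_transcripts(self, needle):
        return [t for t in self.transcripts if needle in t]
```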

4. Context Management

This is where many agents fail silently. The core problem is context rot: model performance degrades 30%+ when key content falls in mid-window positions (Chroma research, corroborated by Stanford's "Lost in the Middle" finding). Even million-token windows suffer from instruction-following degradation as context grows.

Production strategies include:

  • Compaction: summarizing conversation history when approaching limits (Claude Code preserves architectural decisions and unresolved bugs while discarding redundant tool outputs)

  • Observation masking: JetBrains' Junie hides old tool outputs while keeping tool calls visible

  • Just-in-time retrieval: maintaining lightweight identifiers and loading data dynamically (Claude Code uses grep, glob, head, tail rather than loading full files)

  • Sub-agent delegation: each subagent explores extensively but returns only a condensed summary of 1,000 to 2,000 tokens

Anthropic's context engineering guide states the goal: find the smallest possible set of high-signal tokens that maximize likelihood of the desired outcome.
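Two of these strategies can be sketched directly: observation masking (old tool outputs stubbed out, calls kept visible) and budget-triggered compaction. The word-count "tokenizer" and the function names are illustrative simplifications.

```python
def mask_old_observations(messages, keep_last=2):
    """Replace all but the last `keep_last` tool outputs with a stub."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    cut = tool_indices[:-keep_last] if keep_last > 0 else tool_indices
    for i in cut:
        messages[i] = {**messages[i], "content": "[output elided]"}
    return messages

def maybe_compact(messages, budget=1000, summarize=None):
    """If the (crudely estimated) size exceeds the budget, collapse
    everything but the last exchange into one summary message."""
    size = sum(len(m["content"].split()) for m in messages)  # toy token count
    if size <= budget:
        return messages
    head, tail = messages[:-2], messages[-2:]
    summary = summarize(head) if summarize else f"[summary of {len(head)} messages]"
    return [{"role": "system", "content": summary}] + tail
```

A production `summarize` would be an LLM call that preserves decisions and unresolved issues while dropping redundant tool output, as the compaction bullet describes.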

5. Prompt Construction

This assembles what the model actually sees at each step. It's hierarchical: system prompt, tool definitions, memory files, conversation history, and the current user message.

OpenAI's Codex uses a strict priority stack: server-controlled system message (highest priority), tool definitions, developer instructions, user instructions (cascading AGENTS.md files, 32 KiB limit), then conversation history.
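A sketch of hierarchical assembly in that priority order, assuming a simple message-list format; the function signature and the way instruction files are concatenated are illustrative, not Codex's actual implementation.

```python
def assemble_prompt(system, tool_schemas, instructions, history, user_message,
                    instruction_limit=32 * 1024):
    # Highest priority first: system message, then tools, then instruction
    # files (capped, echoing the 32 KiB limit), then history, then the user.
    instr = "\n".join(instructions)
    if len(instr.encode()) > instruction_limit:
        instr = instr.encode()[:instruction_limit].decode(errors="ignore")
    messages = [{"role": "system", "content": system}]
    if tool_schemas:
        names = "\n".join(t["name"] for t in tool_schemas)
        messages.append({"role": "system", "content": "Tools:\n" + names})
    if instr:
        messages.append({"role": "system", "content": instr})
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    return messages
```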

6. Output Parsing

Modern harnesses rely on native tool calling, where the model returns structured tool_calls objects rather than free-text that must be parsed. The harness checks: are there tool calls? Execute them and loop. No tool calls? That's the final answer.

For structured outputs, both OpenAI and LangChain support schema-constrained responses via Pydantic models. Legacy approaches like RetryWithErrorOutputParser (which feeds the original prompt, the failed completion, and the parsing error back to the model) remain available for edge cases.
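The classification step reduces to inspecting structured fields rather than parsing free text. The response shape below is illustrative, not a specific provider's payload format.

```python
def classify_output(response):
    """Route a model response: execute tools, hand off, or finish."""
    if response.get("tool_calls"):
        return ("execute", response["tool_calls"])
    if response.get("handoff"):
        return ("handoff", response["handoff"])
    return ("final", response.get("content", ""))
```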

7. State Management

LangGraph models state as typed dictionaries flowing through graph nodes, with reducers merging updates. Checkpointing happens at super-step boundaries, enabling resume after interruptions and time-travel debugging. OpenAI offers four mutually exclusive strategies: application memory, SDK sessions, server-side Conversations API, or lightweight previous_response_id chaining. Claude Code takes a different approach: git commits as checkpoints and progress files as structured scratchpads.
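The reducer-and-checkpoint idea can be sketched without any framework; this is in the spirit of the graph approach described above, not LangGraph's actual API.

```python
import copy

def merge_state(state, update, reducers):
    """Apply an update to state, using a per-key reducer when one exists
    (e.g. appending to a message list instead of overwriting it)."""
    new = dict(state)
    for key, value in update.items():
        if key in reducers and key in new:
            new[key] = reducers[key](new[key], value)
        else:
            new[key] = value
    return new

class Checkpointer:
    """Snapshot state at step boundaries: enables resume and time travel."""
    def __init__(self):
        self.history = []
    def save(self, state):
        self.history.append(copy.deepcopy(state))
    def resume(self, step=-1):
        return copy.deepcopy(self.history[step])
```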

8. Error Handling

Here's why this matters: a 10-step process with 99% per-step success still has only ~90.4% end-to-end success. Errors compound fast.

LangGraph distinguishes four error types: transient (retry with backoff), LLM-recoverable (return error as ToolMessage so the model can adjust), user-fixable (interrupt for human input), and unexpected (bubble up for debugging). Anthropic catches failures within tool handlers and returns them as error results to keep the loop running. Stripe's production harness caps retry attempts at two.
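The four-way taxonomy and the capped-retry policy can be sketched as follows; the exception class names and backoff schedule are illustrative, with only the retry cap of two taken from the text.

```python
import time

class TransientError(Exception): pass      # retry with backoff
class ModelRecoverable(Exception): pass    # return to the model as a tool error
class UserFixable(Exception): pass         # interrupt for human input

def run_tool_with_recovery(fn, max_retries=2, sleep=time.sleep):
    attempts = 0
    while True:
        try:
            return {"result": fn()}
        except TransientError:
            attempts += 1
            if attempts > max_retries:          # cap retries (Stripe caps at two)
                return {"error": "gave up after retries"}
            sleep(2 ** attempts * 0.01)         # exponential backoff
        except ModelRecoverable as exc:
            return {"error": str(exc)}          # model can adjust next turn
        except UserFixable as exc:
            return {"interrupt": str(exc)}      # escalate to a human
        # anything else bubbles up for debugging

# Why recovery matters: per-step reliability compounds multiplicatively.
end_to_end = 0.99 ** 10   # ten steps at 99% each -> roughly 0.904
```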

9. Guardrails and Safety

OpenAI's SDK implements three levels: input guardrails (run on first agent), output guardrails (run on final output), and tool guardrails (run on every tool invocation). A "tripwire" mechanism halts the agent immediately when triggered.

Anthropic separates permission enforcement from model reasoning architecturally. The model decides what to attempt; the tool system decides what's allowed. Claude Code gates ~40 discrete tool capabilities independently, with three stages: trust establishment at project load, permission check before each tool call, and explicit user confirmation for high-risk operations.
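A tripwire mechanism reduces to a check that raises rather than returns; the `Tripwire` class, the guard function, and the secret-detection heuristic below are all illustrative, not an SDK's real classes.

```python
class Tripwire(Exception):
    """Raised to halt the agent immediately when a guardrail fires."""

def check(guardrails, payload):
    for guard in guardrails:
        verdict = guard(payload)
        if verdict is not None:       # any non-None verdict trips the wire
            raise Tripwire(verdict)

def no_secrets(text):
    # Toy heuristic: flag anything that looks like an API key prefix.
    return "possible leaked API key" if "sk-" in text else None

def guarded_tool_call(tool, args, tool_guards):
    check(tool_guards, repr(args))    # tool guardrails run on every invocation
    return tool(**args)
```

Input and output guardrails would call the same `check` on the first user message and the final answer respectively.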

10. Verification Loops

This is what separates toy demos from production agents. Anthropic recommends three approaches: rules-based feedback (tests, linters, type checkers), visual feedback (screenshots via Playwright for UI tasks), and LLM-as-judge (a separate subagent evaluates output).

Boris Cherny, creator of Claude Code, noted that giving the model a way to verify its work improves quality by 2 to 3x.
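A rules-based verification loop can be sketched with Python's own `compile` standing in for a real test suite or linter; the function names and retry budget are illustrative.

```python
def verify_python(source):
    """Deterministic check: does the output even parse as Python?"""
    try:
        compile(source, "<agent output>", "exec")
        return (True, "ok")
    except SyntaxError as exc:
        return (False, f"syntax error: {exc.msg} (line {exc.lineno})")

def act_then_verify(generate, verify, max_attempts=3):
    """Generate, verify, and feed failures back instead of accepting them."""
    feedback = None
    for _ in range(max_attempts):
        out = generate(feedback)
        ok, report = verify(out)
        if ok:
            return out
        feedback = report          # the failure report becomes the next input
    raise RuntimeError("could not produce verified output")
```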

11. Subagent Orchestration

Claude Code supports three execution models: Fork (byte-identical copy of parent context), Teammate (separate terminal pane with file-based mailbox communication), and Worktree (own git worktree, isolated branch per agent). OpenAI's SDK supports agents-as-tools (specialist handles bounded subtask) and handoffs (specialist takes full control). LangGraph implements subagents as nested state graphs.
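The agents-as-tools idea, combined with the condensed-summary rule from the context section, can be sketched as a boundary that lets the subagent be arbitrarily verbose internally while only a bounded result crosses back. The word-count budget is a toy stand-in for a token budget.

```python
def spawn_subagent(task, explore, budget_words=2000):
    """Run a subagent; return only a condensed summary to the parent."""
    findings = explore(task)                   # may consume huge context internally
    words = findings.split()
    summary = " ".join(words[:budget_words])   # only this crosses the boundary
    return {"task": task, "summary": summary}
```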

The Loop in Motion: A Step-by-Step Walkthrough

Now that you know the components, let's trace how they work together in a single cycle.


Step 1 (Prompt Assembly): The harness constructs the full input: system prompt + tool schemas + memory files + conversation history + current user message. Important context is positioned at the beginning and end of the prompt (the "Lost in the Middle" finding).

Step 2 (LLM Inference): The assembled prompt goes to the model API. The model generates output tokens: text, tool call requests, or both.

Step 3 (Output Classification): If the model produced text with no tool calls, the loop ends. If it requested tool calls, proceed to execution. If a handoff was requested, update the current agent and restart.

Step 4 (Tool Execution): For each tool call, the harness validates arguments, checks permissions, executes in a sandboxed environment, and captures results. Read-only operations can run concurrently; mutating operations run serially.

Step 5 (Result Packaging): Tool results are formatted as LLM-readable messages. Errors are caught and returned as error results so the model can self-correct.

Step 6 (Context Update): Results are appended to conversation history. If approaching the context window limit, the harness triggers compaction.

Step 7 (Loop): Return to Step 1. Repeat until termination.

Termination conditions are layered: the model produces a response with no tool calls, maximum turn limit is exceeded, token budget is exhausted, a guardrail tripwire fires, the user interrupts, or a safety refusal is returned. A simple question might take 1 to 2 turns. A complex refactoring task can chain dozens of tool calls across many turns.

For long-running tasks spanning multiple context windows, Anthropic developed a two-phase "Ralph Loop" pattern: an Initializer Agent sets up the environment (init script, progress file, feature list, initial git commit), then a Coding Agent in every subsequent session reads git logs and progress files to orient itself, picks the highest-priority incomplete feature, works on it, commits, and writes summaries. The filesystem provides continuity across context windows.
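The filesystem-as-continuity idea can be sketched with a progress file; the file name, JSON layout, and function names here are illustrative, and a real implementation would also read git logs and commit after each session.

```python
import json
import os

def initialize(workdir, features):
    """Initializer phase: write the feature list and an empty log."""
    state = {"features": [{"name": f, "done": False} for f in features], "log": []}
    with open(os.path.join(workdir, "progress.json"), "w") as fh:
        json.dump(state, fh)

def session(workdir, do_work):
    """One coding-agent session: orient from disk, do one feature, write back."""
    path = os.path.join(workdir, "progress.json")
    with open(path) as fh:
        state = json.load(fh)                 # orientation comes from the filesystem
    todo = next((f for f in state["features"] if not f["done"]), None)
    if todo is None:
        return "all features complete"
    summary = do_work(todo["name"])           # the actual agent run goes here
    todo["done"] = True
    state["log"].append(summary)              # write-back provides continuity
    with open(path, "w") as fh:
        json.dump(state, fh)
    return summary
```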

How Real Frameworks Implement the Pattern


Anthropic's Claude Agent SDK exposes the harness through a single query() function that creates the agentic loop and returns an async iterator streaming messages. The runtime is a "dumb loop." All intelligence lives in the model. Claude Code uses a Gather-Act-Verify cycle: gather context (search files, read code), take action (edit files, run commands), verify results (run tests, check output), repeat.

OpenAI's Agents SDK implements the harness through the Runner class with three modes: async, sync, and streamed. The SDK is "code-first": workflow logic is expressed in native Python rather than graph DSLs. The Codex harness extends this with a three-layer architecture: Codex Core (agent code + runtime), App Server (bidirectional JSON-RPC API), and client surfaces (CLI, VS Code, web app). All surfaces share the same harness, which is why "Codex models feel better on Codex surfaces than a generic chat window."

LangGraph models the harness as an explicit state graph. Two nodes (llm_call and tool_node) connected by a conditional edge: if tool calls present, route to tool_node; if absent, route to END. LangGraph evolved from LangChain's AgentExecutor, which was deprecated in v0.2 because it was hard to extend and lacked multi-agent support. LangChain's Deep Agents explicitly use the term "agent harness": built-in tools, planning (write_todos tool), file systems for context management, subagent spawning, and persistent memory.

CrewAI implements a role-based multi-agent architecture: Agent (the harness around the LLM, defined by role, goal, backstory, and tools), Task (the unit of work), and Crew (the collection of agents). CrewAI's Flows layer adds a "deterministic backbone with intelligence where it matters," managing routing and validation while Crews handle autonomous collaboration.

AutoGen (evolving into Microsoft Agent Framework) pioneered conversation-driven orchestration. Its three-layer architecture (Core, AgentChat, Extensions) supports five orchestration patterns: sequential, concurrent (fan-out/fan-in), group chat, handoff, and magentic (a manager agent maintains a dynamic task ledger coordinating specialists).

The Scaffolding Metaphor

The scaffolding metaphor isn't decorative. It's precise. Construction scaffolding is temporary infrastructure that enables workers to build a structure they couldn't reach otherwise. It doesn't do the construction. But without it, workers can't reach the upper floors.

The key insight: scaffolding is removed when the building is complete. As models improve, harness complexity should decrease. Manus was rebuilt five times in six months, each rewrite removing complexity. Complex tool definitions became general shell execution. "Management agents" became simple structured handoffs.

This points to the co-evolution principle: models are now post-trained with specific harnesses in the loop. Claude Code's model learned to use the specific harness it was trained with. Changing tool implementations can degrade performance because of this tight coupling.

The "future-proofing test" for harness design: if performance scales up with more powerful models without adding harness complexity, the design is sound.


Seven Decisions That Define Every Harness

Every harness architect faces seven choices:

  1. Single-agent vs. multi-agent. Both Anthropic and OpenAI say: maximize a single agent first. Multi-agent systems add overhead (extra LLM calls for routing, context loss during handoffs). Split only when tool overload exceeds ~10 overlapping tools or clearly separate task domains exist.

  2. ReAct vs. plan-and-execute. ReAct interleaves reasoning and action at every step (flexible but higher per-step cost). Plan-and-execute separates planning from execution. LLMCompiler reports a 3.6x speedup over sequential ReAct.

  3. Context window management strategy. Five production approaches: time-based clearing, conversation summarization, observation masking, structured note-taking, and sub-agent delegation. ACON research showed 26 to 54% token reduction while preserving 95%+ accuracy by prioritizing reasoning traces over raw tool outputs.

  4. Verification loop design. Computational verification (tests, linters) provides deterministic ground truth. Inferential verification (LLM-as-judge) catches semantic issues but adds latency. Martin Fowler's Thoughtworks team frames this as guides (feedforward, steer before action) versus sensors (feedback, observe after action).

  5. Permission and safety architecture. Permissive (fast but risky, auto-approve most actions) versus restrictive (safe but slow, require approval for each action). The choice depends on deployment context.

  6. Tool scoping strategy. More tools often means worse performance. Vercel removed 80% of tools from v0 and got better results. Claude Code achieves 95% context reduction via lazy loading. The principle: expose the minimum tool set needed for the current step.

  7. Harness thickness. How much logic lives in the harness versus the model. Anthropic bets on thin harnesses and model improvement. Graph-based frameworks bet on explicit control. Anthropic regularly deletes planning steps from Claude Code's harness as new model versions internalize that capability.
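Decision 6 above (minimum tool set per step) is the easiest to sketch in code. The keyword match below is a toy relevance test standing in for embedding-based or router-based selection, and all names are illustrative.

```python
def scope_tools(registry, step_description, cap=10):
    """Expose only tools relevant to the current step, capped in number."""
    text = step_description.lower()
    relevant = [t for t in registry
                if any(k in text for k in t["keywords"])]
    return (relevant or registry)[:cap]   # fall back to everything, still capped
```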

The Harness Is the Product

Two products using identical models can have wildly different performance based solely on harness design. The TerminalBench evidence is clear: changing only the harness moved agents by 20+ ranking positions.

The harness is not a solved problem or a commodity layer. It's where the hard engineering lives: managing context as a scarce resource, designing verification loops that catch failures before they compound, building memory systems that provide continuity without hallucination, and making architectural bets about how much scaffolding to build versus how much to leave to the model.

The field is moving toward thinner harnesses as models improve. But the harness itself isn't going away. Even the most capable model needs something to manage its context window, execute its tool calls, persist its state, and verify its work.

The next time your agent fails, don't blame the model. Look at the harness.

https://www.beren.io/2023-04-11-Scaffolded-LLMs-natural-language-computers/

That's a wrap!

If you enjoyed reading this:

Find me →@akshay_pachaar ✔️

Every day, I share tutorials and insights on AI, Machine Learning, and vibe coding best practices.
