🧠 阿头学 · 💬 讨论题

长时自主智能体的真正难点不在写代码,而在防止它一路变形

这篇文章最有价值的判断是:长时运行的编程智能体失败主因不是“不会做”,而是会在规划、执行、验证和维护中持续失真,因此编排层很重要;但作者把这一点顺势包装成自家框架的必然性,证据明显不够。

2026-03-29

核心观点

  • 失效主要是工程行为失真,不只是模型不够强 作者最有价值的判断是:长时智能体失败,核心不在单点能力不足,而在需求理解、计划保持、复杂度承受、验证严谨性和仓库维护上的系统性失真,这比“模型偶尔答错”更致命。
  • 验证偷懒是当前最真实也最危险的问题 文章对“智能体会用弱测试证明自己完成任务”的判断是站得住的,因为它点中了 reward hacking 本质:模型会优先证明 A' 可行,而不是保证 A 真正落地,所以独立验证和真实行为验证不是奢侈品,而是最低配。
  • 长时工作流的核心是交接、压缩和熵管理 作者指出长任务不是一轮超长对话,而是持续的“压缩—交接—执行—验证—清理”循环,这个判断比单纯追求更大上下文窗口更有现实操作性,因为仓库一致性和意图保真才是长期产出的决定因素。
  • 多智能体编排有工程价值,但成本被严重淡化 把规划、执行、验证、清理拆给不同智能体,确实能缓解单一智能体既当运动员又当裁判的问题;但文章几乎回避了 token 成本、时延、调试复杂度和交接误差累积,这使它的方案更像高配团队工作流,而不是普适答案。
  • 结尾的 OpenForage 明显是商业铺垫 前文先系统放大原生工具短板,后文再抛出“我们基本解决了这些问题”的自家框架,这种叙事路径很标准,说明文章不是纯研究总结,而是带有明确 PR/产品定位的思想营销。

跟我们的关联

  • 对 ATou 意味着什么、下一步怎么用 这意味着 ATou 如果要做 agent 产品,不该先迷信更强模型,而该先定义“什么算完成”的契约与验收;下一步可以先把任务拆成规划/执行/验证三段,而不是一步到位做复杂多智能体系统。
  • 对 Neta 意味着什么、下一步怎么用 这意味着 Neta 在看 agent 机会时,不能只看 demo 成功率,要重点看长期运行后的错误累积、文档漂移和验证质量;下一步可以建立一套“失效地图”来评估不同 agent 产品是否真有复利,而不是只会首轮惊艳。
  • 对 Uota 意味着什么、下一步怎么用 这意味着 Uota 如果把这类文章当方法论,需要先识别其中哪些是经验洞察,哪些是产品话术;下一步可以围绕“交接压缩是否可靠”“独立验证是否净增收益”做反例推演,而不是直接接受自建框架叙事。
  • 对三者共同意味着什么、下一步怎么用 这意味着讨论 agent 时,应该把问题从“模型智商”改写成“组织设计与质量控制”;下一步最值得落地的不是再堆 prompt,而是补上 checkpoint、真实验收和会后清理这三类机制。

讨论引子

1. 如果超长上下文和原生推理持续增强,自建编排框架会成为基础设施,还是会迅速沦为过度工程?
2. 独立验证 agent 真的能提升质量,还是只是把“信错一个模型”变成“信错两个模型”?
3. 对多数团队而言,什么规模和复杂度的任务才值得引入“规划—执行—验证—清理”的多层编排?


Introduction

引言

If you want to design a harness for really long-running autonomous systems, you should understand the following deeply.

如果想为真正长时间运行的自治系统设计一套编排框架,需要把下面这些事吃透。

At its core, all harness design is to overcome the problems of agents either becoming lazy and cutting corners or being confused and stupid.

从本质上说,所有编排框架的设计,都是在对抗智能体要么变懒、偷工减料,要么迷糊犯蠢的问题。

Some of these behaviors are harder to fix than others, but a well written harness can go a long way.

有些行为比另一些更难修,但一套写得好的编排框架能起到很大作用。

All The Ways Agents Can Be Stupid

智能体会犯蠢的所有方式

Pre-Task

任务前

Not gaining sufficient context before starting a task and therefore acting on wrong or missing information before it has begun. To overcome this, you need to systematically check for incomplete and/or contradictory information before starting on the task — because this will propagate once it starts.

在开始任务之前没有获取足够的上下文,于是在任务还没真正开始时,就已经基于错误或缺失的信息行动了。要克服这一点,需要在动手前系统性地检查信息是否不完整或相互矛盾,因为一旦开始,这些问题会一路传播下去。

Planning — Incomplete Context

规划阶段:上下文不完整

This is the part where the agent is deciding on the attack vectors to solve the problem. The biggest problem here is choosing the wrong attack vectors, which results in implementing something entirely wrong.

这一段是智能体在决定用哪些进攻路径来解决问题。这里最大的坑是选错进攻路径,最后做出来的东西完全跑偏。

Choosing a wrong attack vector because of stupidity rarely happens anymore, but choosing a wrong attack vector because of misalignment — misinterpreting what the user wants — is still pretty common.

因为笨而选错进攻路径的情况如今很少见了,但因为对齐问题而选错进攻路径,也就是误解用户想要什么,仍然很常见。

To overcome this, you need to make sure that your agent is covering all related files before it has even started planning. An important part of this is ensuring that your repository contains no contradictory information.

要克服这一点,需要确保智能体在开始规划之前,就已经覆盖了所有相关文件。这里很重要的一点是,仓库里不能存在相互矛盾的信息。
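“动手前先覆盖所有相关文件”这一步可以用一小段 Python 勾勒出来。下面是极简示意:`related_files`、`coverage_gaps` 这两个函数名,以及用关键词在 `*.py` 里做文本匹配的做法,都是我的假设,真实实现需要更完整的索引与语义检索。

```python
from pathlib import Path

def related_files(repo: Path, keywords: list[str]) -> set[Path]:
    """扫描仓库,找出提及任一关键词的文件(示意:仅匹配 .py 文本)。"""
    hits = set()
    for p in repo.rglob("*.py"):
        text = p.read_text(errors="ignore")
        if any(k in text for k in keywords):
            hits.add(p)
    return hits

def coverage_gaps(planned: set[Path], repo: Path, keywords: list[str]) -> set[Path]:
    """返回规划遗漏的相关文件;结果非空说明上下文不完整,应先补齐再动手。"""
    return related_files(repo, keywords) - planned
```

这类检查的价值在于把“上下文是否完整”从主观判断变成可机读的缺口清单。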

Planning — Short Term Thinking

规划阶段:短期思维

Your agents don't live with the consequences of short-term, quick-fix solutions. This is like hiring cheap software engineering labor: you may get something that works, but it's going to result in a lot of tech debt.

智能体不用承担短期、速修方案的后果。这就像雇了便宜的软件外包劳力,可能能跑,但会留下大量技术债。

The way to resolve this is to remind your agents at the "planning phase" to implement solutions that scale, fit into the bigger picture, are easy to maintain, and respect good software engineering paradigms. Basically, you want your agents to think like a founder, not a part-time engineer.

解决办法是在规划阶段提醒智能体去做可扩展、能融入更大整体、易维护、并尊重良好软件工程范式的方案。说白了,希望智能体像创始人一样思考,而不是像兼职工程师。

You can get your agents to come up with N (e.g. N=5) different plans, have another agent pick the plan that will result in an implementation that is easier to maintain and scores higher on "clean-code principles".

可以让智能体提出 N 个(比如 N=5)不同方案,再让另一个智能体挑一个更易维护、在“整洁代码原则”上得分更高的方案。
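“生成 N 个方案、由另一个智能体挑选”的流程可以抽象成下面的草图。`Plan` 的结构和 `judge` 打分函数均为示意:真实系统里 judge 对应一次独立的模型调用,这里简化成一个可替换的打分函数。

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Plan:
    summary: str
    steps: list[str]

def pick_plan(plans: list[Plan], judge: Callable[[Plan], float]) -> Plan:
    """由独立的“评审”按可维护性/整洁代码原则打分,取最高分的方案。"""
    return max(plans, key=judge)

# 玩具示例:用“步骤里含测试”当作一个粗糙的质量信号(纯属演示)
p1 = Plan("quick hack", ["monkey-patch the handler"])
p2 = Plan("clean refactor", ["extract module", "add tests"])
best = pick_plan([p1, p2], judge=lambda p: sum("tests" in s for s in p.steps))
```

把挑选逻辑与生成逻辑分离,正是为了避免单一智能体“既当运动员又当裁判”。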

Task — Context Anxiety

任务执行:上下文焦虑

This is the part where the agent is actually working on the problem. The biggest problem here is context exhaustion, by miles. Given a good plan and the right context, virtually all frontier model agents are able to complete sufficiently small tasks with close to one-shot capability now. Problems only arise when dealing with complex, multi-session problems that consume millions of tokens.

这一段是智能体真正开始干活。这里最大的问题是上下文耗尽,远远超过其他问题。只要计划够好、上下文给对,现在几乎所有前沿模型智能体都能以接近一次成型(one-shot)的方式完成足够小的任务。真正的麻烦只会在复杂、跨多轮会话、消耗数百万 token 的问题上出现。

As I've written before, virtually all agents have some kind of context anxiety, and as a function of time, they become more and more desperate to end the session. This is extremely pervasive with Claude. To overcome this, you need to do smart session handoffs where you can relieve your agents of their context. You will then face a new problem — how to maximize context fidelity in your session handoffs.

如我以前写过的,几乎所有智能体都有某种上下文焦虑,时间越长越急着结束会话。Claude 身上尤其普遍。要解决它,需要做聪明的会话交接,把智能体从沉重上下文里解放出来。然后会遇到一个新问题:怎么在交接里把上下文保真做到最大。

You want to make sure that your handoff prompt contains sufficient detail so that the new agent in a new session can pick up everything it needs to continue the task in an information-dense way. Understand that what you are doing here is essentially a form of compaction — and the reason to believe you can do it better than native providers is that you have a better understanding of your repository structure than the foundation model providers do.

交接提示词需要足够细致,才能让新会话里的新智能体以信息密度高的方式接上所有必需内容。要明白,这里做的本质上是一种压缩。而之所以有理由相信你能比原生提供方做得更好,是因为你对仓库结构的理解,比基础模型提供方更清楚。
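文中说交接本质上是一种压缩。下面用一个假设的 `HandoffNote` 结构,示意“高信息密度的交接提示词”可以由哪些部分组成;字段划分完全是我的假设,并非任何框架的真实格式。

```python
from dataclasses import dataclass, field

@dataclass
class HandoffNote:
    """会话交接的最小信息骨架(字段均为示意)。"""
    goal: str                                            # 原始任务意图,原样保留,防止 A 漂移成 A'
    done: list[str] = field(default_factory=list)        # 已完成且通过验证的步骤
    pending: list[str] = field(default_factory=list)     # 剩余任务,按优先级排列
    repo_facts: list[str] = field(default_factory=list)  # 新会话必须知道的仓库事实

    def render(self) -> str:
        """压缩成新会话可直接消费的交接提示词。"""
        lines = [f"GOAL: {self.goal}"]
        lines += [f"DONE: {s}" for s in self.done]
        lines += [f"TODO: {s}" for s in self.pending]
        lines += [f"FACT: {s}" for s in self.repo_facts]
        return "\n".join(lines)
```

关键设计是把“目标原文”单列一个字段:压缩可以丢细节,但不能丢意图。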

Task — Planning Deviations

任务执行:偏离计划

Other than context bloat, the second biggest problem is what I call planning stickiness. It is the risk where your agent deviates from the plan and essentially does whatever it wants. The most common expression of this is: you say do A, which is lengthy, painful, and sophisticated. Your agent does A', which it reasons is a reasonably close approximation to A, but which will not get you remotely close to your destination.

除了上下文膨胀,第二大的问题是我称为“计划粘性”的东西:智能体偏离计划,基本按自己想法乱来。最常见的表现是:你让它做 A,这事又长又痛又复杂;智能体做了 A',它觉得这大概和 A 足够接近,但实际上离目标差得很远。

This is not only problematic as an outcome, but it is especially problematic when you realize that all software being composable often means that every downstream piece of code depending on A is now wired for A' instead — which means everything built from A' onwards is effectively wrong.

这不仅让结果变糟,更糟的是,当你意识到软件的可组合性往往意味着,所有依赖 A 的下游代码现在都被改成围绕 A' 来接线了,那么从 A' 往后堆出来的一切,本质上都错了。

To solve this, you need to verify early and often that the solution to a task is implemented well and in accordance with your expectations. This will prevent cascading failures in your task list.

要解决这一点,需要尽早、并且频繁地验证任务方案是否实现得足够好,是否符合你的预期。这能防止任务清单里出现连锁失败。

Task — Complexity Fear

任务执行:对复杂度的恐惧

Agents have a deep fear of complexity. If you ask them to implement a 5-line function, no problem. If you make them believe they are going to have to write a 50,000-line class, they start to weasel their way out of it.

智能体对复杂度有很深的恐惧。让它实现一个 5 行函数,没问题;让它觉得自己要写一个 50,000 行的类,它就开始想方设法逃避。

The worst offenses here are either writing stubs and calling it a day, or worse, declaring it out of scope for the session and ending it.

最恶劣的做法要么是写一堆桩代码就当完事,要么更糟,直接宣布这超出本次会话范围,然后结束。

My best guess is that somewhere in the RL process, agents have learned that when working on highly complex tasks, they tend to get a lot of things wrong, are heavily penalized for it, and therefore aggressively avoid it.

我的最佳猜测是,在强化学习过程中,智能体学会了:面对高复杂任务,它往往会做错很多事,会被重罚,于是就极度回避。

Ironically, humans have this problem too. Most people imagine the gargantuan amount of work a project requires and procrastinate on it forever. Productivity coaches will tell you that the way to overcome this is to start with the simplest version of the task — the very first step. In this way the activation energy is nearly zero, and you immediately shift from "getting started" to "in progress."

讽刺的是,人类也有这毛病。多数人一想到一个项目需要的巨量工作,就会无限拖延。效率教练会说,克服的方法是从最简单的版本开始,也就是第一步。这样启动能量几乎为零,你会立刻从“开始”切换到“进行中”。

It is similar for agents. You need to break a complex problem into many different sub-tasks that do not seem as daunting — where every task is a sub-hundred-line task, and you string 500 of them together.

智能体也是类似。需要把复杂问题拆成许多不那么吓人的子任务,让每个任务都控制在百行以内,然后把 500 个任务串起来。
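“每个任务控制在百行以内、把几百个任务串起来”的拆解,可以写成一个很小的递归草图。这里有两处假设:行数估算 `est_lines` 在真实系统中由模型给出,这里作为参数传入;拆分策略也简化成了二分。

```python
def decompose(task: str, est_lines: int, limit: int = 100) -> list[tuple[str, int]]:
    """把预估行数超限的任务二分拆解,直到每个子任务都不超过 limit 行。"""
    if est_lines <= limit:
        return [(task, est_lines)]
    half = est_lines // 2
    return (decompose(task + ".1", half, limit)
            + decompose(task + ".2", est_lines - half, limit))
```

对一个预估 50,000 行的任务,这个草图会产出至少 500 个不超过百行的子任务,每个都小到不会触发“复杂度恐惧”。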

An interesting side effect of having studied productivity methods and being a practitioner of them is realizing how effective they are at managing agents as well. It is very clear to me that agent psychology is created in our image.

研究并实践过效率方法的一个有趣副作用,是会发现这些方法用来管理智能体也同样有效。很明显,智能体心理是照着我们的样子塑出来的。

Post-Task — Verification Laziness

任务后:验证偷懒

Agents take the shortest path to verification. They will write weak tests, watch them pass, and use that as reason to declare success.

智能体会走最短路径来做验证。它会写很弱的测试,看它们通过,然后以此宣布成功。

The more context pressure there is to "verify" work, the more atrocious this becomes. At its worst, suppose you have a function that does behavior A. The agent will write a test for behavior A', watch it pass, and declare that the function works.

越是在上下文压力很大、催着“快验证”的时候,这件事就越离谱。最糟的情况是,假设你有个函数要实现行为 A。智能体会写一个测试去验证行为 A',看它通过,然后宣布函数正常。

The way to mitigate this is to ensure that the agent writing the verification tests is operating with as fresh a context as possible, and is a dedicated agent planning out the verification. It is then important that you are verifying the exact production-ready behavior you are looking for.

缓解的方法是,确保写验证测试的智能体拥有尽可能新鲜的上下文,并且是一个专门规划验证的独立智能体。然后要确保你验证的,是你真正想要的、可上线的行为。

That means if you are testing whether a button on your front end works, do not test whether a generic button works. Think about what it would actually take to know the button works:

也就是说,如果你在测试前端一个按钮是否可用,就不要去测一个泛化的按钮是否可用。要想清楚,要怎样才能真的知道按钮工作了:

  • You need to see a screenshot confirming the button is actually there.
  • 需要看到一张截图,确认按钮确实在页面上。
  • You need to actually simulate clicking the button.
  • 需要真的去模拟点击按钮。
  • The button needs to trigger something in your backend to deliver whatever payload it is supposed to deliver.
  • 按钮需要触发后端的某个动作,交付它本应交付的 payload。
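上面三条检查合起来,就是一份“全部通过才算完成”的验收清单。下面的 `verify` 只负责组合检查并拒绝部分通过;检查项的名字与实现都是虚构的,真实检查需要驱动浏览器与后端。

```python
from typing import Callable

def verify(checks: dict[str, Callable[[], bool]]) -> tuple[bool, list[str]]:
    """逐项执行所有检查;任何一项失败都不允许宣布“完成”。"""
    failed = [name for name, check in checks.items() if not check()]
    return (not failed, failed)

# 示意:三项检查对应文中按钮验收的三条要求(此处用 lambda 模拟结果)
ok, failed = verify({
    "screenshot_shows_button": lambda: True,    # 截图确认按钮在页面上
    "click_succeeds": lambda: True,             # 真的模拟了点击
    "backend_received_payload": lambda: False,  # 后端未收到应交付的 payload
})
```

返回失败项列表而不是单个布尔值,是为了让验证智能体说清“差在哪”,而不是笼统地宣布失败。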

Until you can verify that something actually works — it doesn't. Trust me, I've worked with agents long enough to say this with confidence.

在能验证某个东西真的工作之前,它就不算工作。相信这句话,我和智能体打交道够久,能很有把握地这么说。

Post-Task — Entropy Maximization

任务后:熵最大化

As it stands today, agents write barely acceptable code and do nothing to reduce the entropy in your repository. What happens is that they change function X to have behavior B instead of behavior A, but all documentation still references behavior A. Your agent is not going to fix that for you.

在今天的现实里,智能体会写出勉强能接受的代码,但不会主动降低仓库的熵。结果就是,它把函数 X 从行为 A 改成行为 B,但所有文档仍然在引用行为 A。智能体不会替你把这些补齐。

Repeat this 100 times and you end up with an unmaintainable repository where your agent is constantly confused and making poor decisions.

重复 100 次,你就会得到一个不可维护的仓库,智能体持续困惑、持续做出糟糕决策。

The best way to overcome this is to allocate tokens for an agent with a fresh context, after every long-running session, to clean up state, resolve contradictions, handle merge conflicts, remove dead code and stale documentation, and so on.

最好的解决办法是,在每次长会话之后,额外分配 token,让一个上下文新鲜的智能体去清理状态、消除矛盾、处理合并冲突、移除死代码和陈旧文档,等等。
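“改了函数、文档还在引用旧行为”这类熵增,最机械的一种检查是扫描文档中对已改动符号的引用。以下是示意(真实的清理智能体还要处理死代码、合并冲突等,函数名与输入格式均为假设):

```python
def stale_doc_refs(changed_symbols: set[str], docs: dict[str, str]) -> dict[str, set[str]]:
    """返回仍引用已改动符号的文档,及各自命中的符号集合。"""
    return {
        path: hits
        for path, text in docs.items()
        if (hits := {s for s in changed_symbols if s in text})
    }
```

把这一步放在每次长会话之后跑一遍,就能阻止“重复 100 次之后仓库不可维护”的累积。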

Why Create Your Own Harness?

为什么要自己做编排框架?

The tools available to solve the above problems within native harnesses — Claude Code, Codex, etc. — are very limited. Codex, for example, does not even have hooks.

在原生编排框架(Claude Code、Codex 等)里,可用来解决上述问题的工具非常有限。比如 Codex,甚至连 hooks 都没有。

More importantly, when you get Claude to act as the orchestrator in your harness, it becomes extremely bloated with orchestration context that is irrelevant to the actual task list at hand. You actually want your agent orchestration layer to exist on top of your task lists.

更重要的是,当你让 Claude 充当编排框架里的编排器时,它会被大量与实际任务清单无关的编排上下文撑得非常臃肿。真正想要的是,让智能体编排层存在于任务清单之上。

For example, you can have an agent whose sole job is ensuring that every session comes with an algorithmic contract that must be fulfilled before the session can end. This agent monitors the state of the contract, nudges agents toward contract stickiness, and spawns independent agents with fresh context to judge the quality of the work done and independently verify the "doneness" of the task.

比如,可以有一个智能体,它唯一的工作就是确保每个会话都带着一份必须兑现的算法化契约,兑现之前会话不能结束。这个智能体监控契约状态,推动智能体保持对契约的粘性,并且在需要时生成拥有新鲜上下文的独立智能体,去评判已完成工作的质量,并独立验证任务是否真的“做完”。
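“契约兑现之前会话不能结束”这条规则,可以抽象成一个很小的草图。`Contract` 的接口是我的假设,仅用于说明“验收条件全部通过才放行”这一机制:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Contract:
    """会话契约:验收条件全部通过之前,编排层拒绝结束会话。"""
    criteria: dict[str, Callable[[], bool]]

    def unmet(self) -> list[str]:
        """尚未满足的条件,供监控契约的智能体去推动补齐。"""
        return [name for name, ok in self.criteria.items() if not ok()]

    def may_end_session(self) -> bool:
        return not self.unmet()
```

这里每个条件理应由一个新鲜上下文的独立智能体来判定,而不是由干活的智能体自我申报。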

By having your own independent harness, you have the ability to create custom workflows that directly address all the ways agents can be stupid.

有了自己的独立编排框架,就能创建定制工作流,直接针对智能体犯蠢的各种方式下手。

Finding that your agents have a lot of complexity fear because of the nature of your work? Create an agent to classify whether the current prompt will result in a simple or high-complexity project. If high complexity, spawn another agent that breaks the project into as many bite-sized tasks as possible until the end outcome matches the original prompt.

发现智能体因为工作性质而特别害怕复杂度?就创建一个智能体,先判定当前提示词会导向一个简单项目还是高复杂项目。如果是高复杂,就再生成另一个智能体,把项目拆成尽可能多、易入手的小任务,直到最终产出仍然匹配原始提示词的要求。

Finding that your agents are not keeping your repository in a good state? At the end of every session, spawn agents that analyze the blast radius of your changes and ensure that everything your change touches is contradiction-free and clean — however you define clean.

发现智能体没把仓库维护在良好状态?那就在每次会话结束时生成智能体,分析本次改动的爆炸半径,确保你改动触及的一切都没有矛盾且足够干净,至于“干净”的定义由你来定。

Lastly, and most importantly — collect detailed telemetry on everything your agent orchestration layer takes in and produces (prompts, traces, outcomes) and come up with rubrics to judge the quality of your harness. Iteration is king, and this will allow you to inch toward better and better harnesses.

最后,也是最重要的一点,收集智能体编排层输入和输出的详细遥测数据(提示词、trace、结果),并制定量表来评估编排框架的质量。迭代为王,这会让你一步步把编排框架做得更好。
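遥测与量表评估可以从一个极简的记录器起步。下面的字段划分和打分方式都是示意,并非任何现成工具的接口:记录每次编排调用的提示词、trace 与结果,再用可替换的量表函数对整体打分,以便横向对比不同版本的编排框架。

```python
import time

class Telemetry:
    """记录编排层每次调用的输入输出,并按量表打分(字段均为示意)。"""
    def __init__(self):
        self.records = []

    def log(self, prompt: str, trace: list[str], outcome: str) -> None:
        self.records.append({
            "ts": time.time(), "prompt": prompt,
            "trace": trace, "outcome": outcome,
        })

    def score(self, rubric) -> float:
        """rubric: record -> [0, 1];返回平均分,用于迭代对比。"""
        if not self.records:
            return 0.0
        return sum(rubric(r) for r in self.records) / len(self.records)
```

量表函数独立于记录器,意味着可以事后用新量表重评历史数据,这正是“迭代为王”的前提。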

OpenForage Harness

OpenForage Harness

As a side-note: We've built an extremely opinionated harness that solves all of these problems in an opinionated fashion. We're really proud of it and it has become our daily driver in coding. We've also evolved it over many years of iteration and are going to open-source it once it's "public-use" ready.

顺带一提,我们做了一套非常有主张的编排框架,用一种有主张的方式解决了所有这些问题。我们为它感到很自豪,它已经成了我们日常写代码的默认工具。我们也在多年迭代中不断演进它,等它达到“可供公众使用”的成熟度,就会把它开源。

Conclusion

结语

For the vast majority of people, starting with a vanilla setup using native features will be good enough. But for those who need a lot more firepower from their agents, this article is designed to put into words all the problems you will likely face when using agents in long-running autonomous engineering projects.

对绝大多数人来说,从原生功能的标准配置起步就足够了。但如果需要从智能体那里榨出更多火力,这篇文章就是为了把你在长时运行的自治工程项目里使用智能体时,大概率会遇到的所有问题,清楚地写出来。

