🧠 阿头学 · 💬 讨论题

多智能体真正有效的,不是并行蜂群,而是单写者协作

这篇文章最有价值的判断是:当下真正能落地的多智能体,不是多个 agent 一起写,而是一个 agent 负责写,其他 agent 负责贡献审查、规划与路由方面的智能;否则系统大概率会因为隐含决策冲突而变脆。

2026-04-23

核心观点

  • 单线程写入才是当前可用架构:作者明确修正了自己早先对多智能体的全面悲观看法,但修正范围很窄,并行写入型多智能体依然不靠谱,真正有效的是“多个智能体贡献判断,写入始终单线程”。这个判断是可信的,因为软件开发里的大量关键决策本来就隐含在风格、边界条件和模块模式里,多 agent 同时落盘极易互相打架。
  • 干净上下文的 reviewer 比共享上下文更有用:文中最反直觉、也最站得住的一点是,审查 agent 不该默认继承编码 agent 的完整上下文。作者判断,短上下文让模型更聪明,独立视角让 reviewer 更敢质疑原始实现,这比“信息完全共享”更能抓 bug。这个机制解释大体成立,但文中没有把“短上下文优势”和“任务角色分离优势”严格拆开验证。
  • 多智能体的本质问题不是 prompt,而是通信协议:作者把问题重心放在上下文工程、信息转移、何时求助、如何回传,这个方向比市面上大量“agent 自治”叙事扎实得多。判断上看,这篇文章最强的部分不是证明 multi-agent 有多神,而是指出卡点其实是上下文分发和沟通桥梁。
  • 大小模型协作目前只在强模型组合里更靠谱:所谓 smart friend 模式,本质是让主模型在需要时调用更强更贵的模型。作者自己承认,弱主模型往往不知道自己什么时候该升级求助,因此质量天花板仍被主模型限制。这个判断很重要,因为它直接否定了“便宜小模型通过调度大模型就能稳定逼近前沿”的乐观叙事。
  • “无结构群体智能”大概率是演示友好、生产不友好:作者认为真实可用的形态是 manager-subagent 的 map-reduce-and-manage,而不是 agent swarm 自由协商。这一判断在工程上很合理,但证据强度不足,因为文章主要基于自家软件工程场景,外推到一般规律仍然偏快。

跟我们的关联

  • 对 ATou 意味着什么、下一步怎么用:ATou 如果在设计 agent 工作流,不该先追求“多开几个 agent”,而该先定义唯一写入者和明确的审查者。下一步最该测试的不是 swarm,而是“主 agent 产出 + 干净上下文 reviewer 复审”的闭环。
  • 对 Neta 意味着什么、下一步怎么用:Neta 如果在做内容、研究或策略流程,不该默认让所有 agent 共享全量上下文,因为这会抹平独立判断。下一步可以试一个“生成者有全量背景、审稿者只看结果物”的流程,看质检是否明显提升。
  • 对 Uota 意味着什么、下一步怎么用:Uota 如果关注产品机会,这篇文章说明 agent 基础设施的价值不在花哨 orchestration UI,而在权限控制、上下文裁剪、消息回传和任务路由。下一步应优先找“单写入、多供智”的产品切口,而不是押注无结构 swarm。
  • 对 ATou/Neta/Uota 都意味着什么、下一步怎么用:这篇文章共同提醒三者,真正的竞争壁垒不是 prompt,而是通信协议。下一步讨论 agent 产品或内部工作流时,应该强制回答三个问题:谁有写权限、谁保留干净视角、谁负责过滤建议。

讨论引子

1. 如果“干净上下文”真的更有效,那很多团队默认追求的信息全透明,是否反而在降低 reviewer 的判断质量?
2. “单写者 + 多供智者”只是当前模型能力不够下的工程妥协,还是未来很长时间都会成立的组织原则?
3. 如果弱模型始终学不会何时该求助更强模型,那“小模型调大模型”的成本优化故事会不会从根上就站不住?

10 个月前,我写过一篇《不要构建多智能体》,其中主张大多数人都不该尝试构建多智能体系统 [1]。并行运行的智能体会对风格、边界情况和代码模式做出隐含选择。在当时,这些决定往往彼此冲突,最后做出来的产品很脆弱。此后发生了很多变化。

在 Cognition,我们已经开始部署一些在实践中真正有效的多智能体系统。我们最初的观察,放到并行写入型的智能体集群上,今天依然成立。那个方向里,大多数看起来很炫的想法,依旧没有得到真正有意义的采用。但我们发现,有一类更窄的模式是有效的:多个智能体为同一项任务贡献智能,而写入始终保持单线程。在这篇文章里,我会总结我们在构建这类系统时学到的东西。

关于上下文工程的一个回顾

在上一篇文章里,我们建议读者把构建智能体这件事,从提示词工程重新理解为上下文工程。提示词工程容易鼓励一些噱头式技巧,比如“你是一位资深软件工程师”或“多想一会儿”。上下文工程则更耐用,它关注的是如何把正确的上下文交给模型,同时默认模型会随着时间推移变得更强。出于很多原因,在多智能体环境里,上下文工程会变得非常有挑战。过去我们建议遵循以下原则:

  1. 尽可能在智能体之间共享上下文。 要确保它们看到的是同一套信息源,始终保持一致理解,比如同一份待办列表、计划文件,也对整体任务目标有相同的先验认知。必要时要帮助它们彼此沟通

  2. 行动本身会携带隐含决策。 当某个智能体做出修改或编辑时,它可能同时做出一些隐含选择,比如风格、代码模式、某些边界情况该怎么处理,而这些选择可能与其他并行智能体的隐含选择发生冲突。因此,在一个有多个智能体同时进行写操作的多智能体世界里,决策很容易变得碎片化。

过去几个月里很多事情都变了,但对精心设计上下文工程的需求并没有变。由于原则 2 的存在,现实里的大多数多智能体系统都局限在只读型子智能体上,比如网页搜索子智能体、代码搜索子智能体。举个例子,Devin 可以调用一个 Deepwiki 子智能体来获取代码库上下文。但这类子智能体本质上更像工具调用,而不是真正的多智能体协作。我们想探索的是,当智能体以更有互动性的方式协作时,究竟能解锁什么能力。
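这里说的“更像工具调用”,可以用一段极简 Python 草图来说明(纯属示意,`ReadOnlySubagent`、`PrimaryAgent` 等名字都是虚构的,并非 Devin 的真实实现):只读子智能体对外只暴露“问 → 答”接口、不落盘,所有写操作集中在唯一的主智能体手里。

```python
from dataclasses import dataclass, field

# 假设性草图:只读子智能体本质上是一次“查询即返回”的工具调用,
# 不持有写权限;所有写操作集中在唯一的主智能体上。
@dataclass
class ReadOnlySubagent:
    name: str

    def query(self, question: str) -> str:
        # 真实系统里这里会跑一个完整的 agent 循环(搜索、读码等),
        # 但对外只暴露一个“问 → 答”接口,等价于一次工具调用
        return f"[{self.name}] findings for: {question}"

@dataclass
class PrimaryAgent:
    subagents: dict
    written_files: list = field(default_factory=list)

    def gather_context(self, question: str) -> list:
        # 只读协作:子智能体贡献信息,不产生写操作
        return [sub.query(question) for sub in self.subagents.values()]

    def write(self, path: str) -> None:
        # 写入保持单线程:只有主智能体会走到这里
        self.written_files.append(path)

primary = PrimaryAgent(subagents={
    "code_search": ReadOnlySubagent("code_search"),
    "web_search": ReadOnlySubagent("web_search"),
})
notes = primary.gather_context("how is auth implemented?")
primary.write("auth.py")
```

这种结构下,子智能体再多也不会引入隐含写决策的冲突,这正是原则 2 想避免的问题。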

过去 10 个月里变了什么

首先,模型本身变得自然得多,也更有智能体感。它们会直觉地理解工具该怎么用,知道自己的上下文限制,也知道该怎么把自己的上下文提炼给协作者,无论对方是人还是别的智能体。结果就是,智能体的使用量增长了很多,多得离谱。 即便只看 Devin 在我们最大企业客户这一细分市场中的使用情况,这本来是最谨慎、最不愿意率先采用新技术的一群人,过去 6 个月里我们也看到了爆发式增长,大约增长了 8 倍。


这种使用量的爆发,同时从推和拉两个方向,把大家带向了多智能体。

从推的角度看,能力增强以后,用户自然会开始尝试更多种多智能体配置。当你开始大量使用智能体时,瓶颈就会自然转移到智能体之外的部分,比如管理、规划和审查。比如,有些人已经写出了让 Devin 管理其他 Devin 的脚本。也有很多人开始让自己的编码智能体和审查智能体来回迭代。

从拉的角度看,智能体使用量暴涨,也带来了成本暴涨。随着一类新的 Mythos 级别模型即将出现,它们会更大、更强,自然就会冒出一个问题:怎么才能用更低成本拿到接近前沿的能力?而多智能体系统也许正是一个自然的答案。

与此同时,还出现了一波很吸睛的演示,展示如何把大量智能体扔进大型工程项目里。比较知名的例子包括构建一个网页浏览器,20 万行代码,构建一个 C 编译器,10 万行代码,以及优化一个 LLM 训练脚本,迭代 1 万多轮。这些演示确实很让人兴奋,但它们有一个共同点,而大多数真实软件并不具备这一点,那就是成功标准简单而且可验证。真实软件需要的是一种能放大人类品味和决策能力的系统,而这正是我们探索多智能体系统时所处的语境。

一些实用的多智能体实验

1)代码审查循环,蠢到按理说不该有效

你可能会觉得,让模型去审查它自己写的代码,应该不会查出什么有用的问题。但即使面对 Devin 自己写的 PR,Devin Review 平均每个 PR 也能抓出 2 个 bug,其中大约 58% 是严重问题,比如逻辑错误、遗漏的边界情况、安全漏洞。这个系统经常还会经过多轮代码审查循环,每一轮都能找到新的 bug,当然这也不总是好事,因为会花不少时间。现在我们让 Devin 和 Devin Review 原生地彼此迭代,所以人类打开 PR 时,大部分 bug 通常已经被解决了。

最反直觉的部分。 有意思的是,我们发现这种技术在编码智能体和审查智能体事先完全不共享上下文时,效果反而最好。为什么?

这里面既有偏理念层面的理由,也有技术层面的理由。首先需要记住,把同一个模型放进两个智能体里,即使智能体外壳完全一样,也不会像你想象中一个人同时做两件事那样,天然带来那种自我偏见或高度相关性。归根到底,这些智能体都是根据各自上下文来工作的系统。它们没有自我意识,而任何可能存在的共享偏差,最终都来自训练过程。以现在的情况看,我们可以默认这个训练过程已经相当高质量了。

让审查智能体拥有一份完全干净的上下文,也会帮助它深入到原始编码智能体没深入的地方。一个原因是,它被迫在没有规格说明的情况下,从实现结果倒推思考,因此可以公开质疑一些事情,而原始智能体可能因为用户指令里的错误而忽略了这些问题。比如用户让智能体实现一种不安全的模式。但也许更重要的是,干净的上下文本身就会让智能体更聪明,这和注意力机制的数学特性有关。上下文腐烂是一个已有充分记录的现象,本质上是模型在上下文越来越长时,决策质量会下降。模型通常只有有限数量的注意力头,当它需要处理越来越长的一串指令、提示、代码等内容时,重要细节就可能无法充分进入决策过程。当编码智能体在一个任务上工作了几个小时,看了整个仓库,跑了很多命令,思考过不同方案,又修过错误,它很快就会积累起一大段上下文。专门的审查智能体则可以跳过这些额外上下文,只看 diff,然后在从头阅读代码的过程中重新发现自己需要的上下文。上下文更短,智能体本身就会更聪明,于是也更容易发现那些微妙的问题。
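把这一节的机制串起来,大致是下面这个生成者-验证者循环的形状(假设性草图,`coder_step`、`reviewer_step`、`filter_findings` 均为示意实现,收敛条件是虚构的):审查者只看 diff,不继承编码者的历史上下文;编码者再用自己更完整的上下文过滤审查结果,循环直到没有新发现或达到轮数上限。

```python
# 假设性草图:单写者 + 干净上下文审查者的生成-验证循环。
# coder / reviewer / 过滤逻辑均为示意实现,并非真实产品代码。

def coder_step(task: str, feedback: list) -> tuple:
    """编码智能体:唯一的写入者。返回 (实现, diff)。"""
    patch = f"fix for {task}" + (f" addressing {len(feedback)} findings" if feedback else "")
    return patch, f"diff of: {patch}"

def reviewer_step(diff: str) -> list:
    """审查智能体:上下文里只有 diff,没有编码者的历史。"""
    # 示意:diff 尚未经过任何修复轮次时报告问题,之后收敛
    return ["logic error", "missing edge case"] if "addressing" not in diff else []

def filter_findings(findings: list, user_context: str) -> list:
    """编码者用更完整的上下文(用户指令、既有决策)过滤审查意见。"""
    return [f for f in findings if f not in user_context]

def review_loop(task: str, user_context: str, max_rounds: int = 3) -> tuple:
    feedback = []
    for round_no in range(1, max_rounds + 1):
        patch, diff = coder_step(task, feedback)
        feedback = filter_findings(reviewer_step(diff), user_context)
        if not feedback:  # 审查者无新发现,循环收敛
            return patch, round_no
    return patch, max_rounds

patch, rounds = review_loop("auth bug", user_context="intentionally skip: style nit")
```

注意过滤步骤放在编码者一侧:审查者保持干净视角,而“要不要采纳”的判断留给掌握完整上下文的写入者。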


要让这个系统真正运转得好,最后一个关键点是编码智能体和审查智能体之间的沟通桥梁。说白了,就是 Devin 能不能利用自己掌握的更完整上下文,比如用户指令、已有决策等,对 Devin Review 返回的 bug 做出过滤。这一点对避免循环、避免违背用户意图、避免做出范围之外的工作等等都很关键。我们发现,只要在提示设计上做一些专门处理,今天的模型已经能在这里做出相当合理的判断,于是你会看到这两个智能体和人类之间出现一些非常有意思的互动。


结论是,在生成者加验证者的循环里,干净上下文会明显提升能力。但如果想获得连贯的整体体验,清晰的沟通机制,以及和整体上下文之间的综合判断,同样重要。

2)大型高价模型又回来了,介绍一下 smart friend

如果你去看过去几个月里最受欢迎的模型,会看到一个明显转向。大家正从 Anthropic 的 Sonnet 这一类中等规模模型,转向 Anthropic 的 Opus 这类大型模型,因为性能更强。随着 Mythos 即将到来,基本可以说,“规模扩张又回来了”。

这里有个不太显眼但很重要的含义。前沿级智能,很快就会因为太贵,也可能太慢,而不适合大多数日常任务。与此同时,使用小模型时你也会遇到一个难题,一个任务原本看起来不难,但做着做着可能发现比预期复杂得多。

怎样才能两边都占到好处?我们在 Windsurf 做过一个实验,就是冲着这个目标去的。当我们在 10 月推出 SWE-1.5 时,它是一个每秒 950 token 的次前沿模型。我们发现,如果把它和 Sonnet 4.5 搭配,让后者负责规划,那么在保持低成本和高速度的同时,也能弥补一小部分性能差距。


我们当时采用的实际架构,是把更聪明、能力更强的模型做成一个 smart friend 工具,让主要模型,也就是更小的那个模型,在需要时主动调用它。也就是让主要模型自己判断,什么时候情况已经复杂到值得请教这个更聪明也更贵的模型。但很快我们就发现,上下文转移和沟通方式并不好设计。

1. 主要模型得知道怎么和 smart friend 说话。

这种配置里最核心的难点,在于一个问题:比较笨的模型怎么知道自己已经到极限了?和更常见的反向结构不同,那种结构里是聪明的主模型把任务委派给更小的子智能体,而这里决定何时委派的并不是更聪明的模型。这里有几种可能的解决办法。比如,你可以鼓励主要智能体至少总要调用一次聪明智能体,让它评估一下,是不是有什么复杂点被漏掉了。你也可以通过提示微调或训练,让主要模型在这个决策上更有校准能力。具体还要看主要模型本身的智能程度,有时还需要一些领域内更明确的规定性指导,比如遇到合并冲突时一律调用 smart friend。
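文中提到的几种解法,可以合成一个三层的升级策略草图(假设性示例,触发规则、阈值和函数名都是虚构的):先用规定性规则兜底,再保证至少咨询一次,最后才依赖主模型经过校准的自评。

```python
# 假设性草图:较弱主模型决定何时调用 smart friend 的三层策略。
# 触发规则与阈值均为虚构示例,仅用于说明结构。
PRESCRIPTIVE_TRIGGERS = {"merge_conflict", "schema_migration"}

def should_consult_smart_friend(
    situation: str,
    consult_count: int,
    self_assessed_difficulty: float,  # 主模型对当前难度的自评,0~1
    threshold: float = 0.7,
) -> bool:
    # 1) 规定性规则:某些情形一律升级,不依赖弱模型的自我判断
    if situation in PRESCRIPTIVE_TRIGGERS:
        return True
    # 2) 至少咨询一次:从未升级过时,强制检查“有没有漏掉复杂点”
    if consult_count == 0:
        return True
    # 3) 校准后的自评:难度评分过线才再次求助,以控制成本
    return self_assessed_difficulty >= threshold
```

三层的顺序有讲究:越靠前的规则越不信任主模型的判断,这正对应“比较笨的模型不知道自己到极限了”这个核心难点。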

这种沟通方式的另一个难点在于:主要模型该把哪些上下文分享给 smart friend?再进一步说,又该怎么向 smart friend 提问?如果主要模型只分享了自己全部上下文中的一部分,那么聪明模型就可能无法做出充分知情的判断。我们发现,对当下的模型来说,一个比较合理的 80/20 方案,就是直接把主要模型的完整上下文 fork 一份给聪明模型。同样地,我们也发现,鼓励主要模型提出更宽泛的问题,比如“我该怎么做?”,然后让聪明模型自己决定哪些内容值得讨论,效果更好。
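这个 80/20 方案落到代码上,大致是这样的请求构造草图(`ConsultRequest`、`build_consult_request` 均为虚构名字):直接 fork 完整上下文,并把问题问得尽量宽。

```python
from dataclasses import dataclass

# 假设性草图:主模型向 smart friend 发起咨询时的请求构造。
@dataclass(frozen=True)
class ConsultRequest:
    context: tuple   # smart friend 看到的上下文(fork 自主模型)
    question: str

def build_consult_request(primary_context: list) -> ConsultRequest:
    # 80/20 方案:不去挑选“相关片段”,直接 fork 主模型的完整上下文,
    # 避免聪明模型因信息缺失做出不充分知情的判断
    forked = tuple(primary_context)
    # 问宽问题,把“哪些内容值得讨论”的裁量权交给聪明模型
    return ConsultRequest(context=forked, question="what should I do?")

req = build_consult_request(["user: add OAuth login", "ran: pytest (3 failures)"])
```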

2. smart friend 也得知道怎么把话说回给主要模型

不管你怎么优化前一种沟通方式,最后多半还是会因为上下文损失而存在质量缺口。把反方向的沟通调好,可以弥补这些缺口。比如,假设主要模型从来没看过 important_file.py,却跑去问聪明模型一个必须知道这个文件内容的问题。那这种情况下,聪明模型的正确回答,不该是编一些理论出来,虽然这往往是默认行为,而是应该明确告诉主要模型去调查这个文件,之后再回来问。类似地,让 smart friend 不只回答主要模型正在问的问题,而是顺着整个智能体的执行轨迹,多给出一些重要建议,即使主要模型没问到这些点,通常也会有不错效果。我们发现,这种范围适度放大的 smart friend,通常会带来更有意思的互动。
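smart friend 的回传,可以理解为三类结构化回复(假设性草图,类型与字段名都是虚构的):直接回答、“先去调查 X 再回来问”、以及超出所问范围的主动建议。

```python
from dataclasses import dataclass, field
from typing import Optional

# 假设性草图:smart friend 回传给主模型的结构化回复。
@dataclass
class FriendReply:
    answer: Optional[str] = None             # 能直接回答时给出答案
    investigate_first: Optional[str] = None  # 缺关键信息时,指示主模型先去调查再来问
    extra_guidance: list = field(default_factory=list)  # 未被问到但重要的建议

def smart_friend_reply(question: str, shared_context: list,
                       required_file: str) -> FriendReply:
    seen = any(required_file in entry for entry in shared_context)
    if not seen:
        # 正确行为不是编理论,而是明确要求主模型补上上下文之后再来问
        return FriendReply(investigate_first=required_file)
    return FriendReply(
        answer=f"based on {required_file}: refactor the handler",
        # 范围适度放大:顺着轨迹主动给出未被问到的建议
        extra_guidance=["add a regression test before merging"],
    )

blind = smart_friend_reply("how to fix?", ["read: app.py"], "important_file.py")
informed = smart_friend_reply("how to fix?", ["read: important_file.py"], "important_file.py")
```

把“去查这个文件再来问”做成显式的回复类型,而不是自由文本,可以让主模型更容易在循环里正确处理它。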


smart friend 实际上发生了什么

有件事得先讲清楚。SWE 1.5 作为这个结构里的主要模型,其实还不够好,所以这个方案并没有真正达到理想效果。它和 Sonnet 4.5 之间的差距,恰好出现在这个配置最关键的地方:什么时候该升级求助、应该问什么。成本和速度上的优势是真实存在的,但质量上限是由主要模型决定的,而这个主要模型还不够强。SWE 1.6 是最近的后续版本,在 SWE-bench 上已经达到了 Opus-4.5 级别表现,它明显更好,也足够缩小了这道差距,让这种模式开始变得有回报,但依然还没到我们想要的程度。我们相当有信心,这本质上是个训练问题,未来的 SWE 模型会在训练时把这种来回交互考虑进去 [2]。

真正有效,而且效果不错的场景,是发生在前沿模型彼此之间。我们已经在生产环境里让 Claude 和 GPT 按这种方式一起工作过相当长一段时间,而且在最棘手的场景下确实带来了实际收益。有意思的是,这里的提示调优问题,和小模型加大模型的情况并不一样。前沿模型之间的沟通,不太是让较弱模型知道何时去问更强模型,而更像是把任务分流给最擅长那个具体子任务的模型。有些模型更会调 bug,有些更擅长视觉推理,有些更会写测试。这时,委派逻辑就不再是一个难度升级器,而变成了一个能力路由器。
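“能力路由器”和“难度升级器”的差别,可以用一个极简路由表说明(模型名与任务分类纯属示意):委派逻辑不再问“这题够不够难”,而是问“谁最擅长这一类子任务”。

```python
# 假设性草图:前沿模型之间按子任务类型路由,而不是按难度升级。
# 任务类型 -> 最擅长的模型,映射关系为虚构示例。
CAPABILITY_ROUTES = {
    "debugging": "model-a",          # 有些模型更会调 bug
    "visual_reasoning": "model-b",   # 有些更擅长视觉推理
    "test_writing": "model-c",       # 有些更会写测试
}

def route_subtask(task_type: str, default_model: str = "model-a") -> str:
    # 能力路由器:按“谁擅长”分发;难度升级器则只有“够难才升级”一条轴
    return CAPABILITY_ROUTES.get(task_type, default_model)
```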

结论是,如果两个模型都足够强,smart-friend 这种模式今天就能成立。而要让它在主模型明显更弱的情况下也成立,也就是那个最有可能带来最大解锁的版本,仍然是一个开放问题。我们认为这本质上是训练问题。如果你也在做类似探索,欢迎来交流。

展望未来,高层级委派

上面这两种模式有一个共同结构。只有一个写入者,其他智能体则围绕它贡献智能。接下来最自然的问题就是,这种模式能不能扩展到更大范围的所有权。比如一个横跨十个 PR 的产品功能,一个涉及十几个服务的迁移任务,或者一整周的工作,而不是一个下午的活。

这件事今天已经在 Devin 里上线。一个管理者 Devin 可以把更大的任务拆成多个部分,生成子 Devin 去处理,并通过一个内部 MCP 协调它们的进展。要把这件事做得足够连贯,所需的上下文工程比我们预期得更多。那些原本训练于小范围委派任务的管理者,默认会变得过于规定式,而当管理者自己并不具备深入的代码库上下文时,这种做法反而会适得其反。智能体还会默认以为自己和子智能体共享状态,实际上并没有。跨智能体沟通,也就是子智能体把信息写回管理者,再由管理者传给团队里的其他智能体,这件事默认并不会发生,因为模型并没有在那种必须这样做的环境里训练过。这些问题每一个都需要专门的修正工作,而且我们现在还在持续改进。

那无结构的群体智能体呢? 我们认为,无结构集群这种做法,也就是任意一张智能体网络彼此协商,大多只是干扰项。真正实用的形态是 map-reduce-and-manage,也就是管理者拆分工作,子智能体执行,管理者再综合结果并汇报。如何让这种系统在体验上像一个智能体处理一个任务那样连贯,会是我们 2026 年接下来一些工作的核心。
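map-reduce-and-manage 的骨架大致如下(假设性草图,拆分、执行与消息回传逻辑都是示意实现):管理者拆分工作,子智能体执行并把消息显式写回,管理者中转消息、综合结果并汇报。

```python
# 假设性草图:管理者拆分 -> 子智能体执行 -> 管理者综合并中转跨智能体消息。

def manager_split(task: str, n: int) -> list:
    """管理者把大任务拆成 n 个子任务(示意实现)。"""
    return [f"{task} / part {i}" for i in range(1, n + 1)]

def child_execute(subtask: str) -> tuple:
    """子智能体执行,并显式写回给管理者的消息。
    文中指出这种回传默认不会自发发生,需要协议保证。"""
    messages = ["found shared util worth reusing"] if "part 1" in subtask else []
    return f"done: {subtask}", messages

def run_map_reduce_and_manage(task: str, n_children: int = 3) -> tuple:
    results, bulletin = [], []  # bulletin:由管理者转发给其他子智能体的消息
    for subtask in manager_split(task, n_children):
        result, messages = child_execute(subtask)
        results.append(result)
        bulletin.extend(messages)  # 跨智能体沟通必须经由管理者中转
    summary = f"{task}: {len(results)} parts done"  # 管理者综合并汇报
    return summary, bulletin

summary, bulletin = run_map_reduce_and_manage("migrate 12 services")
```

和无结构集群相比,这个形状里每条消息都有明确的收发路径,也只有管理者拥有“综合并汇报”的职责。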

我们今天知道的事

这些实验背后有一条共同主线。以今天的情况看,多智能体系统在写操作保持单线程、额外智能体只贡献智能而不直接执行动作时,效果最好。一个拥有干净上下文的审查者,能抓住编码者看不到的 bug。一个前沿级的 smart friend,能发现较弱主模型漏掉的细微问题。一个管理者能在多个子智能体之间协调范围,而不把决策搞碎。

剩下的开放问题,本质上全都是沟通问题。较弱模型怎么学会什么时候该升级求助?子智能体怎么把一个足以改变其他同伴工作方向的新发现及时上报?怎样在智能体之间转移上下文,又不把接收方淹没?仅靠提示设计,你已经可以走得相当远,但我们也相信,下一代模型,包括我们自己训练的那些模型,会开始逐步填上这些空白。

我们正在朝一个世界推进。在那个世界里,智能会被注入软件开发生命周期的每一个阶段,规划、编码、审查、测试、监控。它不是一群自治行动者组成的蜂群,而是一个能够放大人类品味的协同系统。

欢迎你到 devin.ai 或 windsurf.com 试试我们的工作。如果你也愿意和我们一起探索这些构建智能体的原则,欢迎联系 walden@cognition.ai

[1] 巧的是,Anthropic 第二天也发了一篇相关博客,讨论如何构建一个多智能体研究系统。两篇文章都提到了上下文工程中的相似挑战,也都得出了类似结论,也就是最先适用的领域会是只读型智能体

[2] 最近,Anthropic 也推出了一个类似的 beta 实验,让他们的小模型以相同方式去调用更大的模型。至少这说明,位于 smart friend 一端的模型,也会越来越擅长把信息有效地传回给主要模型。

10 months ago, I wrote Don't Build Multi-Agents, arguing that most people shouldn't try to build multi-agent systems [1]. Parallel agents make implicit choices about style, edge cases, and code patterns. At the time, these decisions often conflicted with each other, leading to fragile products. A lot has changed since then.

10 个月前,我写过一篇《不要构建多智能体》,其中主张大多数人都不该尝试构建多智能体系统 [1]。并行运行的智能体会对风格、边界情况和代码模式做出隐含选择。在当时,这些决定往往彼此冲突,最后做出来的产品很脆弱。此后发生了很多变化。

At Cognition, we've begun to deploy multi-agent systems that actually work in practice. Our original observations still hold today for parallel-writer swarms: most of the sexy ideas in that space still don’t see meaningful adoption. But we've found a narrower class of patterns that do: setups where multiple agents contribute intelligence to a task while writes stay single-threaded. In this post, I'll summarize what we've learned building them.

在 Cognition,我们已经开始部署一些在实践中真正有效的多智能体系统。我们最初的观察,放到并行写入型的智能体集群上,今天依然成立。那个方向里,大多数看起来很炫的想法,依旧没有得到真正有意义的采用。但我们发现,有一类更窄的模式是有效的:多个智能体为同一项任务贡献智能,而写入始终保持单线程。在这篇文章里,我会总结我们在构建这类系统时学到的东西。

A Refresher on Context Engineering

关于上下文工程的一个回顾

In the last post, we encouraged readers to reframe agent-building from “prompt engineering” to “context engineering”. Prompt engineering encourages gimmicky techniques like “you’re a senior software engineer” or “think for longer.” Context engineering is more durable and focuses on giving the right context to models while assuming the models become more capable over time. For many reasons, context engineering can get very challenging in a multi-agent setup. In the past, we recommended the following principles:

在上一篇文章里,我们建议读者把构建智能体这件事,从提示词工程重新理解为上下文工程。提示词工程容易鼓励一些噱头式技巧,比如“你是一位资深软件工程师”或“多想一会儿”。上下文工程则更耐用,它关注的是如何把正确的上下文交给模型,同时默认模型会随着时间推移变得更强。出于很多原因,在多智能体环境里,上下文工程会变得非常有挑战。过去我们建议遵循以下原则:

  1. Share as much context as possible between the agents. Make sure they see the same sources of information, stay on the same page (todo list, plan files), and share the same priors about the overall task they are meant to accomplish. Help them communicate if needed
  1. 尽可能在智能体之间共享上下文。 要确保它们看到的是同一套信息源,始终保持一致理解,比如同一份待办列表、计划文件,也对整体任务目标有相同的先验认知。必要时要帮助它们彼此沟通
  2. Actions carry implicit decisions. When one agent makes certain changes or edits, it might make implicit choices (style, code patterns, how certain edge cases should be handled) that might conflict with the implicit choices of other parallel agents. As a result, decision-making can get quite fragmented in a multi-agent world where multiple agents are taking write actions.
  2. 行动本身会携带隐含决策。 当某个智能体做出修改或编辑时,它可能同时做出一些隐含选择,比如风格、代码模式、某些边界情况该怎么处理,而这些选择可能与其他并行智能体的隐含选择发生冲突。因此,在一个有多个智能体同时进行写操作的多智能体世界里,决策很容易变得碎片化。

Though many things have changed in the last few months, the need for thoughtful context engineering has not. As a consequence of principle 2, most multi-agent setups in the world are limited to “readonly” subagents, like web search subagents and code search subagents. For example, Devin can call out to a Deepwiki subagent to acquire codebase context. But these types of subagents mostly resemble tool calls rather than true multi-agent collaboration. We wanted to explore what capabilities we can unlock when agents collaborate in a more interactive way.

过去几个月里很多事情都变了,但对精心设计上下文工程的需求并没有变。由于原则 2 的存在,现实里的大多数多智能体系统都局限在只读型子智能体上,比如网页搜索子智能体、代码搜索子智能体。举个例子,Devin 可以调用一个 Deepwiki 子智能体来获取代码库上下文。但这类子智能体本质上更像工具调用,而不是真正的多智能体协作。我们想探索的是,当智能体以更有互动性的方式协作时,究竟能解锁什么能力。

What Changed in the Last 10 Months

过去 10 个月里变了什么

To start, models have become way more naturally “agentic.” They intuitively understand tool use, their own context limits, and how to distill their context for collaborators (human or otherwise). As a result, usage of agents has grown … a shit ton. Even when we look at Devin usage in our largest enterprises segment, the segment that has traditionally been cautious toward adopting new technologies, we see an explosion over the last 6 months (~8x).

首先,模型本身变得自然得多,也更有智能体感。它们会直觉地理解工具该怎么用,知道自己的上下文限制,也知道该怎么把自己的上下文提炼给协作者,无论对方是人还是别的智能体。结果就是,智能体的使用量增长了很多,多得离谱。 即便只看 Devin 在我们最大企业客户这一细分市场中的使用情况,这本来是最谨慎、最不愿意率先采用新技术的一群人,过去 6 个月里我们也看到了爆发式增长,大约增长了 8 倍。

This explosion of usage has led to both a push and a pull to multi-agents.

这种使用量的爆发,同时从推和拉两个方向,把大家带向了多智能体。

On the push side of things, the increased capabilities have led users to naturally experiment with many more multi-agent setups. When you are using so many agents, you naturally start to become bottlenecked on everything around those agents: the management, planning, and reviewing. For instance, some have created scripts for Devins to manage other Devins. Many have also leaned into having their coding agents iterate back and forth with their review agents.

从推的角度看,能力增强以后,用户自然会开始尝试更多种多智能体配置。当你开始大量使用智能体时,瓶颈就会自然转移到智能体之外的部分,比如管理、规划和审查。比如,有些人已经写出了让 Devin 管理其他 Devin 的脚本。也有很多人开始让自己的编码智能体和审查智能体来回迭代。

On the pull side of things, the explosion of agent usage has resulted in an explosion of costs. With a new Mythos class of even larger & more capable models on the horizon, the natural question of how one can achieve frontier capabilities at a lower cost arises. And multi-agent systems may be a natural answer.

从拉的角度看,智能体使用量暴涨,也带来了成本暴涨。随着一类新的 Mythos 级别模型即将出现,它们会更大、更强,自然就会冒出一个问题:怎么才能用更低成本拿到接近前沿的能力?而多智能体系统也许正是一个自然的答案。

There's also been a wave of sensational demos of throwing tons of agents at large engineering projects. Notable examples include building a web browser (200k LOC), building a C compiler (100k LOC), and optimizing an LLM training script (10k+ iterations). These are exciting but they all share a property most real software doesn't: a simple, verifiable success criterion. Real software requires a system that scales human taste and decision-making, which is the context in which we set out to explore multi-agent systems.

与此同时,还出现了一波很吸睛的演示,展示如何把大量智能体扔进大型工程项目里。比较知名的例子包括构建一个网页浏览器,20 万行代码,构建一个 C 编译器,10 万行代码,以及优化一个 LLM 训练脚本,迭代 1 万多轮。这些演示确实很让人兴奋,但它们有一个共同点,而大多数真实软件并不具备这一点,那就是成功标准简单而且可验证。真实软件需要的是一种能放大人类品味和决策能力的系统,而这正是我们探索多智能体系统时所处的语境。

Some Practical Multi-agent Experiments

一些实用的多智能体实验

1) The Code-Review-Loop that’s so stupid it shouldn’t work

1)代码审查循环,蠢到按理说不该有效

You would think that making a model review its own code would not result in any useful findings. But even on PRs written by Devin, Devin Review catches an average of 2 bugs per PR, of which roughly 58% are severe (logic errors, missing edge cases, security vulnerabilities). Often the system will loop through multiple code-review cycles, finding new bugs each time (which isn't always great since it can take a while). Today, we make Devin and Devin Review natively iterate against one another, so that most bugs are already resolved by the time a human opens the PR.

你可能会觉得,让模型去审查它自己写的代码,应该不会查出什么有用的问题。但即使面对 Devin 自己写的 PR,Devin Review 平均每个 PR 也能抓出 2 个 bug,其中大约 58% 是严重问题,比如逻辑错误、遗漏的边界情况、安全漏洞。这个系统经常还会经过多轮代码审查循环,每一轮都能找到新的 bug,当然这也不总是好事,因为会花不少时间。现在我们让 Devin 和 Devin Review 原生地彼此迭代,所以人类打开 PR 时,大部分 bug 通常已经被解决了。

The counterintuitive part. Interestingly, we found this technique to work best when the coding and review agents do not share any context beforehand. Why?

最反直觉的部分。 有意思的是,我们发现这种技术在编码智能体和审查智能体事先完全不共享上下文时,效果反而最好。为什么?

There’s a mix of philosophical and technical justifications for this. To start, we must remember that putting the same model in two agents, even if the agent harness is exactly the same, does not quite make them self-biased/correlated in the same way you might imagine one human doing both tasks would be. These agents are ultimately systems that perform based on their context. They don’t have egos, and any shared bias that might exist ultimately comes from their training process, which nowadays we can assume is quite high-quality.

这里面既有偏理念层面的理由,也有技术层面的理由。首先需要记住,把同一个模型放进两个智能体里,即使智能体外壳完全一样,也不会像你想象中一个人同时做两件事那样,天然带来那种自我偏见或高度相关性。归根到底,这些智能体都是根据各自上下文来工作的系统。它们没有自我意识,而任何可能存在的共享偏差,最终都来自训练过程。以现在的情况看,我们可以默认这个训练过程已经相当高质量了。

The review agent having a completely clean context also helps it go deeper into areas the original coding agent may not. For one, this is because it is forced to reason backward from the implementation without the spec, and can openly question things which the original agent might have overlooked due to errors in user instruction (ex. a user telling the agent to implement an insecure pattern). Perhaps more importantly though, having a clean context makes the agent smarter because of the math of attention. Context Rot is a well-documented phenomenon that is a result of models making less intelligent decisions at longer and longer context lengths. Models usually have a limited number of attention heads, and when they need to work on a growing context of instructions, prompts, code, etc, important details may not be fully incorporated into its decision-making. When the coding agent has been working for hours on a task, reading through the repo, running commands, thinking about different approaches, fixing errors, it quickly builds up a large context. The dedicated review agent gets to skip this extraneous context, only look at the diff, and re-discover any context it needs as it reads the code from scratch. With a shorter context, the improved intelligence naturally leads to increased detection of nuanced issues.

让审查智能体拥有一份完全干净的上下文,也会帮助它深入到原始编码智能体没深入的地方。一个原因是,它被迫在没有规格说明的情况下,从实现结果倒推思考,因此可以公开质疑一些事情,而原始智能体可能因为用户指令里的错误而忽略了这些问题。比如用户让智能体实现一种不安全的模式。但也许更重要的是,干净的上下文本身就会让智能体更聪明,这和注意力机制的数学特性有关。上下文腐烂是一个已有充分记录的现象,本质上是模型在上下文越来越长时,决策质量会下降。模型通常只有有限数量的注意力头,当它需要处理越来越长的一串指令、提示、代码等内容时,重要细节就可能无法充分进入决策过程。当编码智能体在一个任务上工作了几个小时,看了整个仓库,跑了很多命令,思考过不同方案,又修过错误,它很快就会积累起一大段上下文。专门的审查智能体则可以跳过这些额外上下文,只看 diff,然后在从头阅读代码的过程中重新发现自己需要的上下文。上下文更短,智能体本身就会更聪明,于是也更容易发现那些微妙的问题。

The final key part to making this system work really well is the communication bridge between the coding agent and review agent. Basically, does Devin properly use its broader context of user instructions, decisions, etc. to filter the bugs that come back from Devin Review? This is key to preventing looping, disobeying the user, doing work that is out of scope, and so on. We found that with some dedicated prompting, models today can make reasonable judgment calls here, and you end up getting some very interesting interactions between the two agents and humans.

要让这个系统真正运转得好,最后一个关键点是编码智能体和审查智能体之间的沟通桥梁。说白了,就是 Devin 能不能利用自己掌握的更完整上下文,比如用户指令、已有决策等,对 Devin Review 返回的 bug 做出过滤。这一点对避免循环、避免违背用户意图、避免做出范围之外的工作等等都很关键。我们发现,只要在提示设计上做一些专门处理,今天的模型已经能在这里做出相当合理的判断,于是你会看到这两个智能体和人类之间出现一些非常有意思的互动。

Takeaways: clean context leads to a notable improvement in capabilities when using a generator-verifier loop. But clear communication and synthesis with the overall context is important for a cohesive experience.

结论是,在生成者加验证者的循环里,干净上下文会明显提升能力。但如果想获得连贯的整体体验,清晰的沟通机制,以及和整体上下文之间的综合判断,同样重要。

2) Large, expensive models are back - introducing “Smart Friend”

2)大型高价模型又回来了,介绍一下 smart friend

If you look at the most popular models over the last few months, you see a distinct shift from mid-sized models like Anthropic’s Sonnet-class models to large models like Anthropic’s Opus-class models for the sake of performance. And with Mythos coming, we can basically say “scaling is back”

如果你去看过去几个月里最受欢迎的模型,会看到一个明显转向。大家正从 Anthropic 的 Sonnet 这一类中等规模模型,转向 Anthropic 的 Opus 这类大型模型,因为性能更强。随着 Mythos 即将到来,基本可以说,“规模扩张又回来了”。

The quiet implication of this is that frontier intelligence will soon be too expensive (and perhaps slow) for most day-to-day tasks. At the same time, you face a dilemma with smaller models that a task might prove more difficult than originally expected.

这里有个不太显眼但很重要的含义。前沿级智能,很快就会因为太贵,也可能太慢,而不适合大多数日常任务。与此同时,使用小模型时你也会遇到一个难题,一个任务原本看起来不难,但做着做着可能发现比预期复杂得多。

How can we get the best of both worlds? In Windsurf, we tried an experiment with this goal when we launched SWE-1.5 in October, a 950 tok/sec sub-frontier model. We found that when paired with Sonnet 4.5 for “planning”, we were able to make up for a small bit of the performance gap while keeping the low cost and fast speeds.

怎样才能两边都占到好处?我们在 Windsurf 做过一个实验,就是冲着这个目标去的。当我们在 10 月推出 SWE-1.5 时,它是一个每秒 950 token 的次前沿模型。我们发现,如果把它和 Sonnet 4.5 搭配,让后者负责规划,那么在保持低成本和高速度的同时,也能弥补一小部分性能差距。

The actual architecture we used to achieve this was by offering the smarter/expansive model as a “smart friend” tool that the primary/smaller model could make a call out to. Basically, let the primary/smaller model decide when a situation was tricky enough to be worth consulting the smarter/expensive model. But we soon found that engineering the context transfer and communication was tricky:

我们当时采用的实际架构,是把更聪明、能力更强的模型做成一个 smart friend 工具,让主要模型,也就是更小的那个模型,在需要时主动调用它。也就是让主要模型自己判断,什么时候情况已经复杂到值得请教这个更聪明也更贵的模型。但很快我们就发现,上下文转移和沟通方式并不好设计。

1. The primary model needs to know how to talk to smart friend.

1. 主要模型得知道怎么和 smart friend 说话。

The core trickiness of this setup comes from the problem of “how does a dumber model know it’s at its limits?” Unlike the more popular inverted setup with a smart primary model delegating tasks to smaller subagents, the model deciding when to delegate isn’t the smarter one. There’s a few potential solutions here. For one, you might encourage the primary agent to always make at least one call to the smart agent to evaluate whether it thinks there is some trickiness that was missed. You might also prompt-tune or train the primary model to be more calibrated on this decision. Depending on the intelligence of the primary model, certain kinds of domain-specific prescriptive guidance may be necessary, such as always invoking the smart friend for merge conflicts.

这种配置里最核心的难点,在于一个问题:比较笨的模型怎么知道自己已经到极限了?和更常见的反向结构不同,那种结构里是聪明的主模型把任务委派给更小的子智能体,而这里决定何时委派的并不是更聪明的模型。这里有几种可能的解决办法。比如,你可以鼓励主要智能体至少总要调用一次聪明智能体,让它评估一下,是不是有什么复杂点被漏掉了。你也可以通过提示微调或训练,让主要模型在这个决策上更有校准能力。具体还要看主要模型本身的智能程度,有时还需要一些领域内更明确的规定性指导,比如遇到合并冲突时一律调用 smart friend。

The other tricky question with this communication method is what context should the primary model share with the smart friend? Moreover, what should the primary model ask the smart friend? If the primary model only shares a subset of its total context, then the smart model might not make a fully-informed decision. We found that for today’s models, a reasonable 80/20 solution is to just share a fork of the full context of the primary model with the smart model. Similarly, we found that encouraging the primary model to ask broad questions (”what should I do?”) and letting the smart model decide what is interesting to discuss is better.

这种沟通方式的另一个难点在于:主要模型该把哪些上下文分享给 smart friend?再进一步说,又该怎么向 smart friend 提问?如果主要模型只分享了自己全部上下文中的一部分,那么聪明模型就可能无法做出充分知情的判断。我们发现,对当下的模型来说,一个比较合理的 80/20 方案,就是直接把主要模型的完整上下文 fork 一份给聪明模型。同样地,我们也发现,鼓励主要模型提出更宽泛的问题,比如“我该怎么做?”,然后让聪明模型自己决定哪些内容值得讨论,效果更好。

2. The smart friend needs to know how to talk back to the primary model

2. smart friend 也得知道怎么把话说回给主要模型

No matter how well you tune (1) you will likely find there are still gaps in quality due to context loss. Tuning the communication in the other direction can make up for these gaps. For instance, suppose the primary model never looked at important_file.py and asked the smart model about something that requires knowledge of the contents of this file. In this case, the right answer from the smart model is not to make up some theories (which is often the default behavior), but to specifically instruct the primary model to investigate this file and ask again later. Similarly, it’s often also fruitful to ask the smart friend to look beyond the question the primary model is asking, and suggest any important guidance based on the agent trajectory, even if the primary model didn’t ask for it. We’ve found this “over-scoped” smart friend to generally lead to more interesting interactions.

不管你怎么优化前一种沟通方式,最后多半还是会因为上下文损失而存在质量缺口。把反方向的沟通调好,可以弥补这些缺口。比如,假设主要模型从来没看过 important_file.py,却跑去问聪明模型一个必须知道这个文件内容的问题。那这种情况下,聪明模型的正确回答,不该是编一些理论出来,虽然这往往是默认行为,而是应该明确告诉主要模型去调查这个文件,之后再回来问。类似地,让 smart friend 不只回答主要模型正在问的问题,而是顺着整个智能体的执行轨迹,多给出一些重要建议,即使主要模型没问到这些点,通常也会有不错效果。我们发现,这种范围适度放大的 smart friend,通常会带来更有意思的互动。

What Actually Happened with Smart Friend

smart friend 实际上发生了什么

We should be upfront: SWE 1.5 was not good enough at being the primary model for this setup to really work. The gap between it and Sonnet 4.5 was too wide in exactly the places that mattered for this setup: knowing when to escalate, knowing what to ask. The cost and speed wins were real, but the quality ceiling was set by the primary, and the primary wasn't strong enough. SWE 1.6 (a recent followup achieving Opus-4.5 level performance on SWE-bench) is meaningfully better and closes enough of that gap that the pattern starts to pay off, but it's still not where we want it. We're reasonably confident this is a training problem, and future SWE models will be trained with this back-and-forth in mind [2].

有件事得先讲清楚。SWE 1.5 作为这个结构里的主要模型,其实还不够好,所以这个方案并没有真正达到理想效果。它和 Sonnet 4.5 之间的差距,恰好出现在这个配置最关键的地方:什么时候该升级求助、应该问什么。成本和速度上的优势是真实存在的,但质量上限是由主要模型决定的,而这个主要模型还不够强。SWE 1.6 是最近的后续版本,在 SWE-bench 上已经达到了 Opus-4.5 级别表现,它明显更好,也足够缩小了这道差距,让这种模式开始变得有回报,但依然还没到我们想要的程度。我们相当有信心,这本质上是个训练问题,未来的 SWE 模型会在训练时把这种来回交互考虑进去 [2]。

Where the pattern did work, and worked well, was across frontier models. We’ve run Claude and GPT together in this setup in production for a meaningful stretch, and it produced real gains in the trickiest scenarios. The interesting finding is that the prompt-tuning problems are different from the small-model-to-large-model case. Cross-frontier communication is less about a weaker model knowing when to ask a stronger one, and more about routing to whichever model is best at the specific sub-task. Some models debug better, some handle visual reasoning better, some write tests better. The delegation logic becomes a capability router rather than a difficulty escalator.

真正有效,而且效果不错的场景,是发生在前沿模型彼此之间。我们已经在生产环境里让 Claude 和 GPT 按这种方式一起工作过相当长一段时间,而且在最棘手的场景下确实带来了实际收益。有意思的是,这里的提示调优问题,和小模型加大模型的情况并不一样。前沿模型之间的沟通,不太是让较弱模型知道何时去问更强模型,而更像是把任务分流给最擅长那个具体子任务的模型。有些模型更会调 bug,有些更擅长视觉推理,有些更会写测试。这时,委派逻辑就不再是一个难度升级器,而变成了一个能力路由器。

Takeaways: smart-friend works today when both models are strong. Getting it to work with an asymmetrically weaker primary, which is the version that leads to the biggest unlocks, is still an open problem, and we think it's a training one. Reach out if you want to compare notes.

结论是,如果两个模型都足够强,smart-friend 这种模式今天就能成立。而要让它在主模型明显更弱的情况下也成立,也就是那个最有可能带来最大解锁的版本,仍然是一个开放问题。我们认为这本质上是训练问题。如果你也在做类似探索,欢迎来交流。

Looking Ahead: Higher-Level Delegation

展望未来,高层级委派

The two patterns above share a structure: one writer, augmented by other agents contributing intelligence around it. The obvious next question is whether this extends to agents owning larger scope, for example, a product feature that spans ten PRs, a migration that touches a dozen services, a week of work rather than an afternoon's.

上面这两种模式有一个共同结构。只有一个写入者,其他智能体则围绕它贡献智能。接下来最自然的问题就是,这种模式能不能扩展到更大范围的所有权。比如一个横跨十个 PR 的产品功能,一个涉及十几个服务的迁移任务,或者一整周的工作,而不是一个下午的活。

This is live in Devin today. A manager Devin can break a larger task into pieces, spawn child Devins to work on them, and coordinate their progress through an internal MCP. Getting it to feel coherent took more context engineering than we expected. Managers trained on small-scoped delegation default to being overly prescriptive, which backfires when the manager lacks deep codebase context. Agents assume they share state with their children when they don't. Cross-agent communication, a sub-agent writing messages back to its manager to be passed to other agents in the agent team, doesn't happen by default, because models haven't been trained in environments where it needed to. Each of these took dedicated work to fix, and we're still improving on all of them.

这件事今天已经在 Devin 里上线。一个管理者 Devin 可以把更大的任务拆成多个部分,生成子 Devin 去处理,并通过一个内部 MCP 协调它们的进展。要把这件事做得足够连贯,所需的上下文工程比我们预期得更多。那些原本训练于小范围委派任务的管理者,默认会变得过于规定式,而当管理者自己并不具备深入的代码库上下文时,这种做法反而会适得其反。智能体还会默认以为自己和子智能体共享状态,实际上并没有。跨智能体沟通,也就是子智能体把信息写回管理者,再由管理者传给团队里的其他智能体,这件事默认并不会发生,因为模型并没有在那种必须这样做的环境里训练过。这些问题每一个都需要专门的修正工作,而且我们现在还在持续改进。

What about unstructured swarms? We think the unstructured-swarm approach, arbitrary networks of agents negotiating with each other, is mostly a distraction. The practical shape is map-reduce-and-manage: a manager splits work, children execute, the manager synthesizes and reports back. Making this type of system feel as coherent as a single agent working on a single task is at the center of some of our upcoming work in 2026.

那无结构的群体智能体呢? 我们认为,无结构集群这种做法,也就是任意一张智能体网络彼此协商,大多只是干扰项。真正实用的形态是 map-reduce-and-manage,也就是管理者拆分工作,子智能体执行,管理者再综合结果并汇报。如何让这种系统在体验上像一个智能体处理一个任务那样连贯,会是我们 2026 年接下来一些工作的核心。

What We Know Today

我们今天知道的事

There’s a shared through-line with all of these experiments: multi-agent systems work best today when writes stay single-threaded and the additional agents contribute intelligence rather than actions. A clean-context reviewer catches bugs the coder can't see. A frontier-level smart friend catches subtleties a weaker primary misses. A manager coordinates scope across child agents without fragmenting decisions.

这些实验背后有一条共同主线。以今天的情况看,多智能体系统在写操作保持单线程、额外智能体只贡献智能而不直接执行动作时,效果最好。一个拥有干净上下文的审查者,能抓住编码者看不到的 bug。一个前沿级的 smart friend,能发现较弱主模型漏掉的细微问题。一个管理者能在多个子智能体之间协调范围,而不把决策搞碎。

The open problems are all communication problems. How does a weaker model learn when to escalate? How does a child agent surface a discovery that should change its siblings' work? How do you transfer context between agents without drowning the receiver? You can get decently far with prompting, but we also expect the next generation of models, including the ones we train ourselves, to start closing these gaps.

剩下的开放问题,本质上全都是沟通问题。较弱模型怎么学会什么时候该升级求助?子智能体怎么把一个足以改变其他同伴工作方向的新发现及时上报?怎样在智能体之间转移上下文,又不把接收方淹没?仅靠提示设计,你已经可以走得相当远,但我们也相信,下一代模型,包括我们自己训练的那些模型,会开始逐步填上这些空白。

We're building toward a world where intelligence is injected at every stage of the software development lifecycle — planning, coding, review, testing, and monitoring — not as a swarm of autonomous actors, but as a coordinated system that scales human taste.

我们正在朝一个世界推进。在那个世界里,智能会被注入软件开发生命周期的每一个阶段,规划、编码、审查、测试、监控。它不是一群自治行动者组成的蜂群,而是一个能够放大人类品味的协同系统。

We welcome you to try our work at devin.ai or windsurf.com. And if you would enjoy discovering some of these agent-building principles with us, reach out to walden@cognition.ai

欢迎你到 devin.ai 或 windsurf.com 试试我们的工作。如果你也愿意和我们一起探索这些构建智能体的原则,欢迎联系 walden@cognition.ai

[1] Coincidentally, Anthropic came out the next day with a related blogpost about building a multi-agent research system. Both blogposts touched on similar challenges with context engineering and came to similar conclusions about the first area of applicability being in readonly agents

[1] 巧的是,Anthropic 第二天也发了一篇相关博客,讨论如何构建一个多智能体研究系统。两篇文章都提到了上下文工程中的相似挑战,也都得出了类似结论,也就是最先适用的领域会是只读型智能体

[2] Recently, Anthropic launched a similar beta experiment to let their smaller models make calls out to their larger models in the same fashion. At a minimum, this suggests the models on the “smart friend” end will also get better at communicating back with the primary model.

[2] 最近,Anthropic 也推出了一个类似的 beta 实验,让他们的小模型以相同方式去调用更大的模型。至少这说明,位于 smart friend 一端的模型,也会越来越擅长把信息有效地传回给主要模型。

10 months ago, I wrote Don't Build Multi-Agents, arguing that most people shouldn't try to build multi-agent systems [1]. Parallel agents make implicit choices about style, edge cases, and code patterns. At the time, these decisions often conflicted with each other, leading to fragile products. A lot has changed since then.

At Cognition, we've begun to deploy multi-agent systems that actually work in practice. Our original observations still hold today for parallel-writer swarms: most of the sexy ideas in that space still don’t see meaningful adoption. But we've found a narrower class of patterns that do: setups where multiple agents contribute intelligence to a task while writes stay single-threaded. In this post, I'll summarize what we've learned building them.

A Refresher on Context Engineering

In the last post, we encouraged readers to reframe agent-building from “prompt engineering” to “context engineering”. Prompt engineering encourages gimmicky techniques like “you’re a senior software engineer” or “think for longer.” Context engineering is more durable and focuses on giving the right context to models while assuming the models become more capable over time. For many reasons, context engineering can get very challenging in a multi-agent setup. In the past, we recommended the following principles:

  1. Share as much context as possible between the agents. Make sure they see the same sources of information, stay on the same page (todo list, plan files), and share the same priors about the overall task they are meant to accomplish. Help them communicate if needed

  2. Actions carry implicit decisions. When one agent makes certain changes or edits, it might make implicit choices (style, code patterns, how certain edge cases should be handled) that might conflict with the implicit choices of other parallel agents. As a result, decision-making can get quite fragmented in a multi-agent world where multiple agents are taking write actions.

Though many things have changed in the last few months, the need for thoughtful context engineering has not. As a consequence of principle 2, most multi-agent setups in the world are limited to “readonly” subagents, like web search subagents and code search subagents. For example, Devin can call out to a Deepwiki subagent to acquire codebase context. But these types of subagents mostly resemble tool calls rather than true multi-agent collaboration. We wanted to explore what capabilities we can unlock when agents collaborate in a more interactive way.
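The "readonly subagent" shape described above can be made concrete with a short sketch. This is an illustrative toy, not Devin's or Deepwiki's actual API: the class names, the capability list, and the stubbed search result are all assumptions. The point it demonstrates is the permission boundary: subagents may contribute information, but any write action is structurally impossible for them.

```python
from dataclasses import dataclass, field

@dataclass
class ReadonlySubagent:
    """A subagent that can only perform side-effect-free actions."""
    name: str
    capabilities: tuple = ("search", "summarize")  # no write actions allowed

    def call(self, action: str, query: str) -> str:
        if action not in self.capabilities:
            raise PermissionError(f"{self.name} is readonly; '{action}' not allowed")
        # Stub: a real subagent would run a model over repo or web context.
        return f"[{self.name}] {action} result for: {query}"

@dataclass
class PrimaryAgent:
    """The single writer. Subagent results flow back into its context."""
    subagents: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def consult(self, sub: str, action: str, query: str) -> str:
        result = self.subagents[sub].call(action, query)
        self.log.append(result)  # intelligence flows in; writes stay here
        return result

primary = PrimaryAgent(subagents={"code_search": ReadonlySubagent("code_search")})
out = primary.consult("code_search", "search", "where is auth handled?")
```

The readonly boundary is what makes this pattern safe: the primary agent stays the only place where implicit decisions land in the codebase.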

What Changed in the Last 10 Months

To start, models have become way more naturally “agentic.” They intuitively understand tool use, their own context limits, and how to distill their context for collaborators (human or otherwise). As a result, usage of agents has grown … a shit ton. Even when we look at Devin usage in our largest enterprises segment, the segment that has traditionally been cautious toward adopting new technologies, we see an explosion over the last 6 months (~8x).


This explosion of usage has led to both a push and a pull to multi-agents.

On the push side of things, the increased capabilities have led users to naturally experiment with many more multi-agent setups. When you are using so many agents, you naturally start to become bottlenecked on everything around those agents: the management, planning, and reviewing. For instance, some have created scripts for Devins to manage other Devins. Many have also leaned into having their coding agents iterate back and forth with their review agents.

On the pull side of things, the explosion of agent usage has resulted in an explosion of costs. With a new Mythos class of even larger and more capable models on the horizon, the natural question arises: how can one achieve frontier capabilities at a lower cost? Multi-agent systems may be a natural answer.

There's also been a wave of sensational demos of throwing tons of agents at large engineering projects. Notable examples include building a web browser (200k LOC), building a C compiler (100k LOC), and optimizing an LLM training script (10k+ iterations). These are exciting but they all share a property most real software doesn't: a simple, verifiable success criterion. Real software requires a system that scales human taste and decision-making, which is the context in which we set out to explore multi-agent systems.

Some Practical Multi-agent Experiments

1) The Code-Review-Loop that’s so stupid it shouldn’t work

You would think that making a model review its own code would not result in any useful findings. But even on PRs written by Devin, Devin Review catches an average of 2 bugs per PR, of which roughly 58% are severe (logic errors, missing edge cases, security vulnerabilities). Often the system will loop through multiple code-review cycles, finding new bugs each time (which isn't always great since it can take a while). Today, we make Devin and Devin Review natively iterate against one another, so that most bugs are already resolved by the time a human opens the PR.

The counterintuitive part. Interestingly, we found this technique to work best when the coding and review agents do not share any context beforehand. Why?

There’s a mix of philosophical and technical justifications for this. To start, we must remember that putting the same model in two agents, even if the agent harness is exactly the same, does not quite make them self-biased/correlated in the same way you might imagine one human doing both tasks would be. These agents are ultimately systems that perform based on their context. They don’t have egos, and any shared bias that might exist ultimately comes from their training process, which nowadays we can assume is quite high-quality.

The review agent having a completely clean context also helps it go deeper into areas the original coding agent may not. For one, it is forced to reason backward from the implementation without the spec, and can openly question things the original agent might have overlooked due to errors in user instruction (e.g. a user telling the agent to implement an insecure pattern).

Perhaps more importantly, a clean context makes the agent smarter because of the math of attention. Context rot is a well-documented phenomenon in which models make less intelligent decisions at longer and longer context lengths. Models have a limited number of attention heads, and when they work over a growing context of instructions, prompts, and code, important details may not be fully incorporated into their decision-making. When the coding agent has been working for hours on a task (reading through the repo, running commands, thinking about different approaches, fixing errors), it quickly builds up a large context. The dedicated review agent gets to skip this extraneous context, look only at the diff, and re-discover any context it needs as it reads the code from scratch. With a shorter context, the improved intelligence naturally leads to increased detection of nuanced issues.
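The clean-context property can be sketched in a few lines. Everything here is a stand-in: `run_model` fakes an LLM by flagging `TODO` lines, and the coder's context entries are placeholders for hours of repo reads and command output. What the sketch shows is the structural point: the reviewer is constructed per review and receives only the diff, never the coder's accumulated trajectory.

```python
def run_model(context: list) -> list:
    # Stub for an LLM call: flag any diff line containing "TODO" as a bug.
    return [line for line in context if "TODO" in line]

class CodingAgent:
    def __init__(self):
        # Grows with every repo read, command run, and retry.
        self.context = []

    def write_patch(self) -> list:
        self.context += ["read repo", "ran tests", "fixed lint"]
        return ["+ handle_login(user)", "+ TODO: validate token"]

def review(diff: list) -> list:
    # Clean context: the reviewer sees only the diff, so its attention
    # isn't spent on the coder's extraneous history.
    return run_model(diff)

coder = CodingAgent()
diff = coder.write_patch()
bugs = review(diff)  # note: coder.context is never passed in
```

The separation is enforced by the function signature itself: `review` has no way to see `coder.context`, which is exactly the isolation the post argues for.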


The final key part to making this system work really well is the communication bridge between the coding agent and review agent. Basically, does Devin properly use its broader context of user instructions, decisions, etc. to filter the bugs that come back from Devin Review? This is key to preventing looping, disobeying the user, doing work that is out of scope, and so on. We found that with some dedicated prompting, models today can make reasonable judgment calls here, and you end up getting some very interesting interactions between the two agents and humans.


Takeaways: clean context leads to a notable improvement in capabilities when using a generator-verifier loop. But clear communication and synthesis with the overall context is important for a cohesive experience.

2) Large, expensive models are back - introducing “Smart Friend”

If you look at the most popular models over the last few months, you see a distinct shift from mid-sized models like Anthropic’s Sonnet-class models to large models like Anthropic’s Opus-class models, for the sake of performance. And with Mythos coming, we can basically say “scaling is back.”

The quiet implication of this is that frontier intelligence will soon be too expensive (and perhaps too slow) for most day-to-day tasks. At the same time, smaller models leave you with a dilemma: a task might prove more difficult than originally expected.

How can we get the best of both worlds? In Windsurf, we tried an experiment with this goal when we launched SWE-1.5 in October, a 950 tok/sec sub-frontier model. We found that when paired with Sonnet 4.5 for “planning”, we were able to make up for a small bit of the performance gap while keeping the low cost and fast speeds.


The architecture we used to achieve this was to offer the smarter/expensive model as a “smart friend” tool that the primary/smaller model could make a call out to. Basically, let the primary/smaller model decide when a situation was tricky enough to be worth consulting the smarter/expensive model. But we soon found that engineering the context transfer and communication was tricky:

1. The primary model needs to know how to talk to smart friend.

The core trickiness of this setup comes from the problem of “how does a dumber model know it’s at its limits?” Unlike the more popular inverted setup with a smart primary model delegating tasks to smaller subagents, the model deciding when to delegate isn’t the smarter one. There are a few potential solutions here. For one, you might encourage the primary agent to always make at least one call to the smart agent to evaluate whether it thinks there is some trickiness that was missed. You might also prompt-tune or train the primary model to be more calibrated on this decision. Depending on the intelligence of the primary model, certain kinds of domain-specific prescriptive guidance may be necessary, such as always invoking the smart friend for merge conflicts.

The other tricky question with this communication method is: what context should the primary model share with the smart friend? Moreover, what should the primary model ask the smart friend? If the primary model only shares a subset of its total context, then the smart model might not make a fully-informed decision. We found that for today’s models, a reasonable 80/20 solution is to just share a fork of the full context of the primary model with the smart model. Similarly, we found that encouraging the primary model to ask broad questions (“what should I do?”) and letting the smart model decide what is interesting to discuss works better than prescribing narrow, targeted questions.
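The two recommendations above (prescriptive escalation triggers, plus forking the full context and asking a broad question) can be sketched together. The `smart_model` stub, the trigger list, and the class names are all assumptions standing in for real model calls and tuned prompts.

```python
def smart_model(context: list, question: str) -> str:
    # Stub for the expensive frontier model's reply.
    return f"advice based on {len(context)} context items: {question}"

class PrimaryModel:
    def __init__(self, escalation_triggers=("merge conflict", "flaky test")):
        self.context = []
        self.triggers = escalation_triggers  # prescriptive, domain-specific

    def observe(self, event: str):
        self.context.append(event)

    def maybe_escalate(self):
        # A weaker model can't reliably tell it's at its limits, so
        # escalation is rule-driven rather than left to self-assessment.
        if any(t in e for e in self.context for t in self.triggers):
            # 80/20 context transfer: share a fork of the *full* context,
            # and ask a broad question rather than a narrow one.
            return smart_model(list(self.context), "what should I do?")
        return None

p = PrimaryModel()
p.observe("edited parser.py")
assert p.maybe_escalate() is None  # nothing tricky yet
p.observe("hit a merge conflict in parser.py")
advice = p.maybe_escalate()
```

Passing `list(self.context)` rather than a summary mirrors the "fork the full context" 80/20 solution: the smart friend decides what matters, not the primary.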

2. The smart friend needs to know how to talk back to the primary model

No matter how well you tune (1), you will likely find there are still gaps in quality due to context loss. Tuning the communication in the other direction can make up for these gaps. For instance, suppose the primary model never looked at important_file.py and asked the smart model about something that requires knowledge of the contents of this file. In this case, the right answer from the smart model is not to make up theories (which is often the default behavior), but to specifically instruct the primary model to investigate this file and ask again later. Similarly, it’s often fruitful to ask the smart friend to look beyond the question the primary model is asking and suggest any important guidance based on the agent trajectory, even if the primary model didn’t ask for it. We’ve found this “over-scoped” smart friend generally leads to more interesting interactions.
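The "investigate, don't theorize" behavior above amounts to a small response protocol. The sketch below is a hypothetical illustration: the file names, the `needed_files` dependency set, and the reply shape are assumptions, and in reality the gap detection would be the smart model's own judgment rather than set arithmetic.

```python
def smart_friend_reply(question: str, shared_files: set, needed_files: set) -> dict:
    """Reply protocol for the smart friend: if the forked context is
    missing files the question depends on, direct investigation
    instead of inventing theories."""
    missing = needed_files - shared_files
    if missing:
        return {"type": "investigate",
                "instruction": f"read {sorted(missing)} and ask again"}
    return {"type": "answer", "instruction": f"guidance for: {question}"}

reply = smart_friend_reply(
    "why does token refresh fail?",
    shared_files={"auth.py"},                      # what the primary saw
    needed_files={"auth.py", "important_file.py"}, # what the answer needs
)
```

The key design choice is that the reply is structured: the primary model can branch on `type` rather than having to parse freeform speculation.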


What Actually Happened with Smart Friend

We should be upfront: SWE-1.5 was not good enough at being the primary model for this setup to really work. The gap between it and Sonnet 4.5 was too wide in exactly the places that mattered for this setup: knowing when to escalate, and knowing what to ask. The cost and speed wins were real, but the quality ceiling was set by the primary, and the primary wasn't strong enough. SWE-1.6 (a recent followup achieving Opus-4.5-level performance on SWE-bench) is meaningfully better and closes enough of that gap that the pattern starts to pay off, but it's still not where we want it. We're reasonably confident this is a training problem, and future SWE models will be trained with this back-and-forth in mind [2].

Where the pattern did work, and worked well, was across frontier models. We’ve run Claude and GPT together in this setup in production for a meaningful stretch, and it produced real gains in the trickiest scenarios. The interesting finding is that the prompt-tuning problems are different from the small-model-to-large-model case. Cross-frontier communication is less about a weaker model knowing when to ask a stronger one, and more about routing to whichever model is best at the specific sub-task. Some models debug better, some handle visual reasoning better, some write tests better. The delegation logic becomes a capability router rather than a difficulty escalator.
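The shift from difficulty escalator to capability router is a one-table change, sketched below. The task categories and the generic model names are illustrative assumptions, not measured rankings of any real model.

```python
# Capability router: delegation asks "who is best at this kind of
# work?" rather than "is this too hard for me?"
CAPABILITIES = {
    "debugging": "model_a",
    "visual_reasoning": "model_b",
    "test_writing": "model_c",
}

def route(task_kind: str, default: str = "model_a") -> str:
    """Pick the model best suited to a sub-task, falling back to a
    default for uncategorized work."""
    return CAPABILITIES.get(task_kind, default)

chosen = route("visual_reasoning")
```

A difficulty escalator only ever routes upward to one stronger model; a capability router can send different sub-tasks to different peers, which is why it works between frontier models of comparable overall strength.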

Takeaways: smart-friend works today when both models are strong. Getting it to work with an asymmetrically weaker primary, which is the version that leads to the biggest unlocks, is still an open problem, and we think it's a training one. Reach out if you want to compare notes.

Looking Ahead: Higher-Level Delegation

The two patterns above share a structure: one writer, augmented by other agents contributing intelligence around it. The obvious next question is whether this extends to agents owning larger scope, for example, a product feature that spans ten PRs, a migration that touches a dozen services, a week of work rather than an afternoon's.

This is live in Devin today. A manager Devin can break a larger task into pieces, spawn child Devins to work on them, and coordinate their progress through an internal MCP. Getting it to feel coherent took more context engineering than we expected. Managers trained on small-scoped delegation default to being overly prescriptive, which backfires when the manager lacks deep codebase context. Agents assume they share state with their children when they don't. Cross-agent communication, a sub-agent writing messages back to its manager to be passed to other agents in the agent team, doesn't happen by default, because models haven't been trained in environments where it needed to. Each of these took dedicated work to fix, and we're still improving on all of them.

What about unstructured swarms? We think the unstructured-swarm approach, arbitrary networks of agents negotiating with each other, is mostly a distraction. The practical shape is map-reduce-and-manage: a manager splits work, children execute, the manager synthesizes and reports back. Making this type of system feel as coherent as a single agent working on a single task is at the center of some of our upcoming work in 2026.
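The map-reduce-and-manage shape can be sketched minimally. The task-splitting scheme, the `child_agent` stub, and the message-relay field are all assumptions standing in for child Devins coordinated over an internal MCP; the sketch only shows the structure: children get an explicit brief (no assumed shared state), and cross-agent messages are relayed through the manager rather than exchanged peer-to-peer.

```python
def child_agent(subtask: str, shared_brief: str) -> dict:
    # Children receive an explicit brief; they do NOT inherit the
    # manager's state, matching the "agents assume they share state
    # when they don't" failure mode called out above.
    return {"subtask": subtask,
            "result": f"done: {subtask} ({shared_brief})",
            "message_to_team": f"note from {subtask}"}

def manager(task: str, n_children: int = 3) -> dict:
    brief = f"overall goal: {task}"
    subtasks = [f"{task} part {i}" for i in range(n_children)]  # map
    reports = [child_agent(s, brief) for s in subtasks]         # execute
    # reduce-and-manage: synthesize results and relay cross-agent
    # messages centrally instead of letting children negotiate freely.
    return {"summary": [r["result"] for r in reports],
            "relayed": [r["message_to_team"] for r in reports]}

out = manager("migrate billing service")
```

Every arrow in this topology passes through the manager, which is exactly what distinguishes it from an unstructured swarm of peer-to-peer negotiations.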

What We Know Today

There’s a shared through-line with all of these experiments: multi-agent systems work best today when writes stay single-threaded and the additional agents contribute intelligence rather than actions. A clean-context reviewer catches bugs the coder can't see. A frontier-level smart friend catches subtleties a weaker primary misses. A manager coordinates scope across child agents without fragmenting decisions.

The open problems are all communication problems. How does a weaker model learn when to escalate? How does a child agent surface a discovery that should change its siblings' work? How do you transfer context between agents without drowning the receiver? You can get decently far with prompting, but we also expect the next generation of models, including the ones we train ourselves, to start closing these gaps.

We're building toward a world where intelligence is injected at every stage of the software development lifecycle — planning, coding, review, testing, and monitoring — not as a swarm of autonomous actors, but as a coordinated system that scales human taste.

We welcome you to try our work at devin.ai or windsurf.com. And if you would enjoy discovering some of these agent-building principles with us, reach out to walden@cognition.ai

[1] Coincidentally, Anthropic came out the next day with a related blogpost about building a multi-agent research system. Both blogposts touched on similar challenges with context engineering and came to similar conclusions about the first area of applicability being in readonly agents.

[2] Recently, Anthropic launched a similar beta experiment to let their smaller models make calls out to their larger models in the same fashion. At a minimum, this suggests the models on the “smart friend” end will also get better at communicating back with the primary model.

📋 Discussion Archive

Discussion in progress…