🪞 Uota学 · 🧠 阿头学

当 Agent 开始管理 Agent——编排层才是真正的杠杆点

AI 编码的瓶颈不是模型能力,是你的注意力带宽——而一个"会思考的编排器"比任何单个 agent 的能力提升都更值钱。

2026-02-25

核心观点

  • 人类是瓶颈,不是 agent。作者并行跑 30 个编码 agent,发现自己从"写代码"退化成"刷 GitHub 标签页的项目经理"。真正的效率墙不在 agent 写代码的质量,而在人类审 PR、转发 CI 失败、路由评审评论的带宽。这个观察精准到痛——ATou 你现在用 Uota 的方式,本质上也在撞这面墙。
  • 编排器本身必须是 agent,不是脚本。这是全文最关键的判断。作者从 2500 行 bash 脚本起步,最终让 agent 把脚本重写成了一个有智能的编排器——它能理解代码库、拆任务、分配 agent、自动处理 CI 失败和评审评论。区别在于:脚本是 if-else,agent 编排器有上下文理解和决策能力。这个架构跃迁值得认真看。
  • 递归自改进是真实发生的,但被过度浪漫化了。8 天 4 万行、agent 构建自己的编排器——听起来很科幻,但实际上作者承认"真正集中工作约 3 天",而且人类仍然做了所有架构决策、冲突解决和方向判断。递归改进是真的,但人类仍然是系统的"大脑皮层",agent 是"小脑+脊髓"。别被叙事冲昏头。
  • 插件化 8 槽位架构是个聪明的工程选择。Runtime、Agent、SCM、Tracker、Notifier 等 8 个槽位全部可替换——不绑定 Claude Code、不绑定 GitHub、不绑定 tmux。这意味着编排层和执行层彻底解耦。对比 OpenClaw 的 skill 系统,思路是一致的:让编排层成为稳定的中枢,执行层随时可换。

跟我们的关联

🪞Uota:这篇文章描述的"编排器 agent"和 Uota 的 subagent spawn 模式高度同构。当前 Uota 的编排还是半手动的(主 session 分批 spawn、手动收口),下一步可以考虑:让编排层自动感知 subagent 状态、自动重试失败、自动路由结果。Agent Orchestrator 的 reactions 系统(CI 失败自动回传 agent)是个可以直接借鉴的模式。

👤ATou:作者的核心身份转变——从"写代码的人"到"做架构决策+审 PR 的人"——就是 Context Engineer 的终极形态。ATou 现在跑 6 条线,如果每条线都有一个"编排器 agent"在管理执行层 agent,ATou 只需要做跨线的战略决策和冲突仲裁。这不是幻想,这篇文章证明了它在纯编码场景已经可行。

🧠Neta:20 人特种部队 + AI agent 编排 = 理论上可以输出 200 人团队的工程量。但前提是编排层足够智能。这篇文章的 8 天 4 万行是个数据点,值得在 Neta 的 AI-native 效能系统里参考。

讨论引子

💭 作者说"编排比任何单个 agent 的能力提升都更重要"——如果这是对的,那 Neta 应该把多少工程资源投在"agent 编排基础设施"上,而不是直接投在产品功能上?

💭 文章里 agent 的 CI 通过率是 84.6%,剩下约 15% 需要多轮自修复甚至人工介入。在 Neta 的场景里,什么样的任务适合放给 agent 自治,什么样的任务人类必须在 loop 里?边界在哪?

💭 递归自改进听起来很美,但作者承认人类仍然做了所有架构决策。如果 agent 开始做架构决策,你怎么验证它的决策质量?谁来 review reviewer?

自我进化的 AI 系统:它自己造出了自己

我当时只想更快交付

我手里有一套代码库、一堆待办要做,而一天的时间永远不够用。于是我开始并行跑 AI 编码代理——每个代理分配一个任务,让它们写代码,我来审 PR、合并,然后重复。一开始两三个,后来五个,再后来十个。

代理很快,问题在我:我跟不上它们。我得检查 CI 是否通过,读评审评论,把错误复制粘贴回去。我从写代码变成了看护那些写代码的东西。这当然无法扩展。

于是我写了一些 bash 脚本来自动协调——大约 2,500 行,负责管理 tmux 会话、git worktrees,以及切换标签页。每个代理都有自己隔离的 tmux 会话和 worktree。这个编排器能拉起它们,偷窥它们在干什么,把 CI 失败回传给对应的代理;我只要说一句“带我去 PR #1121 的标签页”,就能在各会话间跳转。它勉强能用。

接着我把代理对准这些 bash 脚本本身。它们做出了真正的编排器 v1。v1 管着那些打造 v2 的代理。而从那以后,v2 一直在自我改进。

结果是:40,000 行 TypeScript、17 个插件、3,288 个测试——在 8 天内完成,且大部分由这个系统所编排的代理完成。每次提交都会带一个 git trailer,用来标识是哪一个 AI 模型写的。人类做了什么、代理做了什么,没有任何含糊空间。

我们已经开源:Agent Orchestrator(github.com/ComposioHQ/agent-orchestrator)。

关键要理解的是:编排器本身就是一个 AI 代理。不是一个仪表盘。不是一个 cron job。也不是一个轮询 GitHub 的脚本。它是一个代理——它会读你的代码库,理解你的待办,决定如何把一个功能拆成可并行的任务,把每个任务分配给一个编码代理,并监控它们的进展。CI 挂了,它会把失败信息注入回对应的代理会话——代理读日志并修复。评审评论来了,它会带着上下文把评论路由到正确的代理会话。无需人工打通管道。
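文中"CI 失败注入回会话、评审评论带上下文路由"的闭环,可以用一小段 TypeScript 勾勒出结构。以下纯属示意草图:事件类型、字段和函数名都是假设,并非 Agent Orchestrator 的真实 API。

```typescript
// 示意:编排器对会话事件的"路由"判断(类型与字段名均为假设)。
// 失败与评审评论直接回注给对应代理,只有需要人拍板的事才通知人类。

type SessionEvent =
  | { kind: "ci_failed"; sessionId: string; log: string }
  | { kind: "review_comment"; sessionId: string; comment: string }
  | { kind: "pr_approved"; sessionId: string };

type Action =
  | { do: "inject"; sessionId: string; prompt: string } // 注入回代理会话
  | { do: "notify_human"; reason: string };             // 升级给人类

function route(ev: SessionEvent): Action {
  switch (ev.kind) {
    case "ci_failed":
      return {
        do: "inject",
        sessionId: ev.sessionId,
        prompt: `CI failed:\n${ev.log}\n请阅读日志并修复。`,
      };
    case "review_comment":
      return {
        do: "inject",
        sessionId: ev.sessionId,
        prompt: `Review comment:\n${ev.comment}`,
      };
    case "pr_approved":
      return { do: "notify_human", reason: `${ev.sessionId} 的 PR 已通过,等待合并决定` };
  }
}
```

真实系统里"inject"对应把文本写进代理的 tmux 会话或进程输入,这里只保留决策骨架。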

这就是它与所有“并行跑代理”的方案不同之处:负责管理代理的那个东西,本身就有智能。

AI 辅助编码里的真正瓶颈

大多数人把 AI 编码代理的问题看错了。代理会写代码。这不是瓶颈。瓶颈是你。

你起五个任务,去喝杯咖啡,20 分钟后回来,发现自己只是在刷新 GitHub 标签页——等 PR、查 CI、读评审评论。恭喜,你把工程自动化了——然后用项目管理取而代之。还是糟糕的项目管理。

编排器代理在这个闭环里替代了你。不是靠脚本——而是一个真正的 AI 代理,它对每个活跃会话、每个打开的 PR、每次 CI 运行都有上下文。它跟踪一切,盯住失败,把评审评论回传给编码代理,只有在确实需要人类决策时才来找你。一旦那个瓶颈——你的注意力——被移除,增益就会迅速叠加。

你打开仪表盘想看状态,可编排器代理早就已经在干活——它看过你所有的工作流,然后告诉你:“这个 PR 正在阻塞另外三个任务,这个 CI 失败是个 flaky 测试,而这条评审评论才是真正关键的那一条。”它不是在给你看数据,而是在给你决策。

另一个同样重要的点:万物皆可插拔。不同的代理运行时?不同的问题跟踪器?不同的通知渠道?换掉就是。编排器不在乎你用的是 Claude Code 还是 Aider,是 tmux 还是 Docker,是 GitHub 还是 Linear。八个插件槽位,全部可替换。

时间线

人们看到"8 天 4 万行"就以为我躲进了山洞闭关。我还有一份全职工作。整个过程大概只有 3 天是真正集中投入的工作,分散在 8 天里,其余空档由代理填上。

模式很简单:睡前把会话都搭好,代理通宵干活;早上上班前审一轮、合并;再开新会话;重复。

最夸张的一天:2 月 14 日(周六)。一天合并了 27 个 PR。整个平台都交付了——核心服务、CLI、Web 仪表盘、全部 17 个插件、npm 发布。我审 PR、合 PR 的速度快到来不及逐字读完,但每个 PR 在此之前都已经通过了 CI 和自动化代码评审。

哪些模型做了什么

每次提交都会通过 git trailers 记录模型:

提交总数会超过 722,因为有些提交是一个模型写的、另一个模型审查/修复的。Opus 4.6 负责硬骨头——复杂架构、跨包集成。Sonnet 负责跑量——插件实现、测试、文档。
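git trailer 是 git 原生支持的提交信息尾部 "Key: Value" 行(可用 `git interpret-trailers` 读写)。下面用 TypeScript 演示"按 trailer 统计各模型提交数"的思路;trailer 键名 `Model` 是假设的,真实键名以该仓库的提交为准。

```typescript
// 统计提交信息中 "Model: <name>" trailer 的出现次数。
// 键名 "Model" 为示意,真实仓库用的 trailer 键可能不同。
function countByModel(commitMessages: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const msg of commitMessages) {
    for (const line of msg.split("\n")) {
      const m = line.match(/^Model:\s*(.+)$/); // trailer 形如 "Model: sonnet"
      if (m) counts.set(m[1], (counts.get(m[1]) ?? 0) + 1);
    }
  }
  return counts;
}
```

"一个模型写、另一个模型修"的提交会带多个 trailer,逐行匹配自然就把两个模型都计入了,这也解释了为什么按模型求和会超过提交总数。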

全自动代码评审:700 条评论,1% 由人类完成

代理不只是写完代码就扔过去。这里有一整套自动化评审循环:

代理创建 PR 并推送代码

Cursor Bugbot 自动评审并发布行内评论

代理阅读评论,修复代码,再次推送

Bugbot 重新评审

700 条自动化代码评审评论。Bugbot 抓到的都是真问题——通过 exec() 的 shell 注入、路径遍历、未关闭的 interval、缺失的 null 检查。代理对其中约 68% 立即修复,约 7% 解释为有意为之,约 4% 延后到未来的 PR。

ao-58 的故事

最戏剧性的例子:PR #125,一次仪表盘重设计。它经历了 12 轮 “CI 失败→修复” 循环。每一次,代理拿到失败输出,诊断问题(类型错误、lint 失败、测试回归),然后推送修复。没有任何人碰过它。

12 轮。零人工干预。干净上线。

9 个分支上的全部 41 次 CI 失败,最终都由代理自行纠正。总体 CI 通过率:84.6%。

架构

编排器使用一个带 8 个可替换槽位的插件系统:

会话生命周期:

Tracker 拉取一个 issue(GitHub 或 Linear)

Workspace 创建一个隔离的 worktree 或 clone

Runtime 启动一个 tmux 会话或进程

Agent(Claude Code、Aider 等)自主工作

Terminal 让你通过 iTerm2 或 Web 仪表盘实时观察

SCM 创建 PR,并为其补充上下文

Reactions 在 CI 失败或评审评论出现时自动重启代理

Notifier 只在需要人类判断时才通知你

不用 tmux?用进程运行时。不用 GitHub?用 Linear。不用 Claude Code?插上 Aider 或 Codex。任何一块都能替换。
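上面的槽位与生命周期,可以压缩成一组极简的 TypeScript 接口来看结构。接口形状纯属示意,真实定义以 agent-orchestrator 仓库为准:

```typescript
// 8 个插件槽位的最小类型草图(均为假设的接口形状,非真实 API)。
interface Tracker   { nextIssue(): string }                // GitHub 或 Linear
interface Workspace { create(issue: string): string }      // worktree 或 clone
interface Runtime   { start(dir: string): string }         // tmux 或进程
interface Agent     { run(sessionId: string): void }       // Claude Code、Aider 等
interface Terminal  { attach(sessionId: string): void }    // iTerm2 或 Web 仪表盘
interface Scm       { openPr(sessionId: string): number }  // 创建并补充 PR
interface Reactions { onCiFailure(pr: number): void }      // 失败自动回注
interface Notifier  { ping(msg: string): void }            // 只在需要时打扰你

// 会话生命周期按文中顺序串联:取 issue → 建工作区 → 起会话 → 跑代理 → 开 PR。
function lifecycle(t: Tracker, w: Workspace, r: Runtime, a: Agent, s: Scm): number {
  const issue = t.nextIssue();
  const dir = w.create(issue);
  const session = r.start(dir);
  a.run(session);
  return s.openPr(session);
}
```

编排层只依赖这些接口,所以换运行时或换 tracker 都不影响其余部分——这就是"任何一块都能替换"的含义。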

自愈 CI:会修自己的失败的代理

这是最有用的功能。对 GitHub 事件的自动响应:

CI 挂了?代理接手。评审要求修改?代理读评论并修代码。PR 通过了?你会收到 Slack 通知。那 41 次 CI 失败之所以能自我纠正,就是因为 reactions 系统把失败自动转发回代理了。

起源:AI 代理在建造自己的编排器

我同时跑了 30 个代理在做 Agent Orchestrator。它们在构建 TypeScript 替代版,而我正用 bash 脚本版来管理它们。被建造的东西,正在管理它自己的建造过程。

我实际做的事情:

架构决策(插件槽位、配置 schema、会话生命周期)

拉起会话并分配 issue

审 PR(大多看架构,不逐行抠)

解决跨代理冲突(两个代理改了同一个文件)

做判断(否掉这个方案,改试那个)

代理做的事情:

全部实现(40K 行 TypeScript)

全部测试(3,288 个测试用例)

全部 PR 创建(102 个 PR 里有 86 个)

全部评审评论修复

全部 CI 失败修复

我从未直接向任何 feature 分支提交过。每一行代码都走了 PR。

活动检测

更棘手的问题之一:不问代理,怎么判断它实际上在干什么。

Claude Code 会在每个会话期间写结构化的 JSONL 事件文件。与其依赖代理自述(它们会撒谎,或者至少会搞混),编排器直接读取这些文件:

代理是在主动生成 tokens 吗?

它在等待工具执行吗?

它空闲了吗?

它完成了吗?

agent-claude-code 插件知道如何解析 Claude 的会话文件。未来的 agent-aider 插件会读取 Aider 的对应文件。
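"读事件文件而不是问代理"的状态判断,大致是对 JSONL 里最后一条事件做分类。下面是一个示意实现;事件的 `type`、`ts` 字段名和取值都是假设的,真实格式以 Claude Code 的会话文件为准。

```typescript
// 示意:根据会话 JSONL 的最后一条事件推断代理状态(字段名为假设)。
type AgentState = "generating" | "waiting_tool" | "finished" | "idle";

function detectState(jsonl: string, now: number, idleMs = 60_000): AgentState {
  const lines = jsonl.trim().split("\n").filter(Boolean);
  if (lines.length === 0) return "idle";
  const last = JSON.parse(lines[lines.length - 1]) as { type: string; ts: number };
  if (last.type === "result") return "finished";       // 会话已收尾
  if (now - last.ts > idleMs) return "idle";           // 太久没有新事件
  if (last.type === "tool_use") return "waiting_tool"; // 在等工具执行
  return "generating";                                 // 正在产出 tokens
}
```

关键设计是只依赖事件文件的时间戳和事件类型,不需要代理配合,也就不受代理"自述不准"的影响。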

Web 仪表盘

Next.js 15,用 Server-Sent Events 做实时更新。不靠轮询。

Attention zones——按需要你关注的类型分组会话(CI 失败、等待评审、运行良好)

Live terminal——浏览器里的 xterm.js,实时展示代理的真实终端输出

Session detail——正在编辑的文件、最近的提交、PR 状态、CI 状态

Config discovery——自动找到你的 ao.config.yaml 并显示可用会话
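Attention zones 的分组本质上是个纯函数:按"是否需要你关注"把会话分桶。示意如下,字段名是假设的:

```typescript
// 示意:把会话分进三个关注区(字段名为假设)。
interface SessionInfo { id: string; ciFailing: boolean; awaitingReview: boolean }

function attentionZones(sessions: SessionInfo[]) {
  return {
    failingCi:      sessions.filter(s => s.ciFailing),                        // 最需要关注
    awaitingReview: sessions.filter(s => !s.ciFailing && s.awaitingReview),   // 等你审
    runningFine:    sessions.filter(s => !s.ciFailing && !s.awaitingReview),  // 不用管
  };
}
```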

自我改进的 AI 回路

每个代理会话都会产生信号。哪些提示词让 PR 一次就干净?哪些会一路滚成 12 轮 CI 失败循环?哪些模式容易引发合并冲突?

大多数代理方案会把这些信号丢掉。会话一结束,你就转向下一件事,下一次会话又从零开始。

Agent Orchestrator 有一套自我改进系统(ao-52——它本身也是由代理构建的),用来记录性能、追踪会话结果并做复盘。它会学习哪些任务能一次成功、哪些需要更严格的护栏。

代理构建功能 → 编排器观察什么有效 → 调整它未来如何管理会话 → 代理构建出更好的功能。回路会叠加增益。
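复盘系统的核心可以抽象成"记录结果、算一次通过率、决定要不要加护栏"。以下是最小示意,数据结构是假设的,并非 ao-52 的真实实现:

```typescript
// 示意:按任务类型统计"一次通过率",低于阈值就建议加护栏。
interface Outcome { taskKind: string; ciRounds: number } // ciRounds=0 表示一次通过

function firstTryRate(outcomes: Outcome[], kind: string): number {
  const ofKind = outcomes.filter(o => o.taskKind === kind);
  if (ofKind.length === 0) return 0;
  return ofKind.filter(o => o.ciRounds === 0).length / ofKind.length;
}

// 通过率低的任务类型,未来会话配更严的护栏(更细的拆分、更强的提示约束)
function needsGuardrails(outcomes: Outcome[], kind: string, threshold = 0.5): boolean {
  return firstTryRate(outcomes, kind) < threshold;
}
```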

而且由于编排器是由代理构建的,编排器又让代理更有效,而这些代理又持续改进编排器——这就是递归。这个工具通过它所管理的代理,在改进它自己。

我认为这也是为什么“编排”比任何单个代理的能力提升都更重要。上限不是“Claude Code 写 TypeScript 有多强”,而是“一个系统能把并行工作的几十个代理部署得多好、观测得多好、改进得多好”。这个上限高得多。而且每跑一轮回路,它就会再抬高一点。

接下来:走向完全自治的软件工程

随时随地和你的代理对话。现在你必须坐在桌前。你应该能从 Telegram 或 Slack 给编排器发消息——查状态、批准合并、重定向某个代理——哪怕你正在散步。

更紧的会话中反馈。代理会漂移。它们会开始解决错的问题,把一个简单修复过度工程化,或者钻进兔子洞。编排器需要把代理的工作与最初意图对齐,在它们走错方向浪费 20 分钟之前就注入纠偏。

自动升级。代理解不出来?升级给编排器。编排器需要判断?升级给你。你只会看到真正需要人类决策的事情。其他一切都会自我解决。
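升级链"代理 → 编排器 → 人类"可以写成一个很小的判断函数(纯属示意,条件字段为假设):

```typescript
// 示意:逐级升级,谁能解决就停在谁那里。
type Level = "agent" | "orchestrator" | "human";

interface Problem { agentStuck: boolean; needsJudgment: boolean }

function escalate(p: Problem): Level {
  if (!p.agentStuck) return "agent";           // 代理自己能解决
  if (!p.needsJudgment) return "orchestrator"; // 编排器兜底
  return "human";                              // 真正需要人类决策
}
```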

再往后:用于并行代理间自动冲突解决的 reconciler、针对长期分支的自动 rebase、面向云部署的 Docker/K8s 运行时,以及用于社区贡献的插件市场。

试试看

启动编排器,打开仪表盘,然后和它对话。告诉它要构建什么。剩下的它全包——拉起代理、创建 PR、盯 CI、转发评审评论。你只负责做决定。

我们正在寻找贡献者:新插件(代理运行时、tracker、notifier)、Docker/K8s 运行时、用于自动冲突检测的 reconciler,以及更好的升级规则。

仓库已上线:github.com/ComposioHQ/agent-orchestrator

完整指标报告:github.com/ComposioHQ/agent-orchestrator/releases/tag/metrics-v1

构建数据的交互式可视化:pkarnal.com/ao-labs/

我在 Composio 构建 Agent Orchestrator 以及开发者工具层。如果“自我进化的 AI 系统”听起来正是你想做的事——我们正在 SF 和 Bangalore 招人:jobs.ashbyhq.com/composio

链接:http://x.com/i/article/2025981530498375680

相关笔记


The Self-Improving AI System That Built Itself

  • Source: https://x.com/agent_wrapper/status/2025986105485733945?s=46
  • Mirror: https://x.com/agent_wrapper/status/2025986105485733945?s=46
  • Published: 2026-02-23T17:28:23+00:00
  • Saved: 2026-02-25

Content

I was trying to ship faster

I had a codebase, a backlog of things to build, and not enough hours in the day. So I started running AI coding agents in parallel — give each one a task, let them write code, review the PRs, merge, repeat. I started with two or three. Then five. Then ten.

The agents were fast. The problem was me. I couldn't keep up with them. I was the one checking if CI passed, reading review comments, copy-pasting errors back. I'd gone from writing code to babysitting the things that write code. That doesn't scale.

So I wrote some bash scripts to automate the coordination — about 2,500 lines that managed tmux sessions, git worktrees, and tab switching. Each agent got its own isolated tmux session and worktree. The orchestrator could spawn them, peek at what they were doing, forward CI failures back, and let me jump between sessions just by asking "take me to the tab for PR #1121." It worked, barely.

Then I pointed the agents at the bash scripts themselves. They built v1 of a proper orchestrator. v1 managed the agents that built v2. And v2 has been improving itself since.

The result: 40,000 lines of TypeScript, 17 plugins, 3,288 tests — built in 8 days, mostly by the agents the system orchestrates. Every commit has a git trailer identifying which AI model wrote it. There's no ambiguity about what humans did vs what agents did.

We've open-sourced it: Agent Orchestrator (github.com/ComposioHQ/agent-orchestrator).

The key thing to understand: the orchestrator itself is an AI agent. Not a dashboard. Not a cron job. Not a script that polls GitHub. It's an agent — it reads your codebase, understands your backlog, decides how to decompose a feature into parallelizable tasks, assigns each task to a coding agent, and monitors their progress. When CI fails, it injects the failure back into the agent session — the agent reads the logs and fixes it. When a review comment comes in, it routes it to the right agent session with context. No human plumbing.

That's what makes this different from every "run agents in parallel" setup. The thing managing the agents is itself intelligent.

The real bottleneck in AI-assisted coding

Most people get the AI coding agent problem wrong. The agents can code. That's not the bottleneck. You are.

You spawn five tasks, go grab coffee, come back 20 minutes later and now you're just refreshing GitHub tabs — waiting for PRs, checking CI, reading review comments. Congratulations, you've automated engineering and replaced it with project management. Bad project management.

The orchestrator agent replaces you in that loop. Not with a script — with an actual AI agent that has context on every active session, every open PR, every CI run. It tracks everything, watches for failures, forwards review comments back to coding agents, and only pings you when something actually needs a human decision. Once that bottleneck — your attention — goes away, things start compounding fast.

You open the dashboard to see status. But the orchestrator agent is already working — it's looked at all your workstreams and it tells you: "This PR is blocking three other tasks, this CI failure is a flaky test, and this review comment is the one that actually matters." It's not showing you data. It's giving you decisions.

The other thing that matters: plug anything in. Different agent runtime? Different issue tracker? Different notification channel? Swap it. The orchestrator doesn't care if you use Claude Code or Aider, tmux or Docker, GitHub or Linear. Eight plugin slots, all replaceable.

The timeline

People see "40K lines in 8 days" and assume I went into a cave. I have a day job. This was maybe ~3 days of actual focused work spread across 8 days, with agents filling the gaps.

The pattern was simple: set up sessions before bed, agents work overnight, review and merge in the morning before work, set up new sessions, repeat.

The standout day: Saturday Feb 14. 27 PRs merged in a single day. The entire platform shipped — core services, CLI, web dashboard, all 17 plugins, npm publishing. I was reviewing and merging PRs faster than I could read them, but every PR had passed CI and automated code review first.

Which models did what

Every commit tracks the model via git trailers:

Totals exceed 722 commits because some commits were written by one model and reviewed/fixed by another. Opus 4.6 handled the hard stuff — complex architecture, cross-package integrations. Sonnet handled volume — plugin implementations, tests, docs.

Fully autonomous code review: 700 comments, 1% human

Agents don't just write code and throw it over the wall. There's a full automated review cycle:

Agent creates a PR and pushes code

Cursor Bugbot automatically reviews and posts inline comments

Agent reads comments, fixes the code, pushes again

Bugbot re-reviews

700 automated code review comments. Bugbot caught real stuff — shell injection via exec(), path traversal, unclosed intervals, missing null checks. The agents fixed ~68% immediately, explained away ~7% as intentional, and deferred ~4% to future PRs.

The ao-58 story

The most dramatic example: PR #125, a dashboard redesign. It went through 12 CI failure→fix cycles. Each time, the agent got the failure output, diagnosed the issue (type errors, lint failures, test regressions), and pushed a fix. No human touched it.

12 rounds. Zero human intervention. Shipped clean.

All 41 CI failures across 9 branches were eventually self-corrected by agents. Overall CI success rate: 84.6%.

Architecture

The orchestrator uses a plugin system with 8 swappable slots:

Session lifecycle:

Tracker pulls an issue (GitHub or Linear)

Workspace creates an isolated worktree or clone

Runtime starts a tmux session or process

Agent (Claude Code, Aider, etc.) works autonomously

Terminal lets you observe live via iTerm2 or web dashboard

SCM creates PRs and enriches them with context

Reactions auto re-spawn agents on CI failures or review comments

Notifier pings you only when human judgment is needed

Don't use tmux? Use the process runtime. Don't use GitHub? Use Linear. Don't use Claude Code? Plug in Aider or Codex. Swap any piece.

Self-healing CI: agents that fix their own failures

The most useful feature. Automated responses to GitHub events:

CI fails? Agent picks it up. Reviewer requests changes? Agent reads the comments and fixes the code. PR approved? You get a Slack notification. This is how those 41 CI failures got self-corrected — the reactions system just forwarded failures back to agents automatically.

The inception: AI agents building their own orchestrator

I had 30 concurrent agents working on Agent Orchestrator. They were building the TypeScript replacement while I was using the bash-script version to manage them. The thing being built was the thing managing its own construction.

What I actually did:

Architecture decisions (plugin slots, config schema, session lifecycle)

Spawning sessions and assigning issues

Reviewing PRs (mostly architecture, not line-by-line)

Resolving cross-agent conflicts (two agents editing the same file)

Judgment calls (reject this approach, try that one)

What agents did:

All implementation (40K lines of TypeScript)

All tests (3,288 test cases)

All PR creation (86 of 102 PRs)

All review comment fixes

All CI failure resolution

I never committed directly to a feature branch. Every line of code went through a PR.

Activity detection

One of the trickier problems: figuring out what an agent is actually doing without asking it.

Claude Code writes structured JSONL event files during every session. Instead of relying on agents to self-report (they lie, or at least get confused), the orchestrator reads these files directly:

Is the agent actively generating tokens?

Is it waiting for tool execution?

Is it idle?

Has it finished?

The agent-claude-code plugin knows how to parse Claude's session files. A future agent-aider plugin would read Aider's equivalent.

Web dashboard

Next.js 15, Server-Sent Events for real-time updates. No polling.

Attention zones — sessions grouped by what needs your attention (failing CI, awaiting review, running fine)

Live terminal — xterm.js in the browser, showing the agent's actual terminal output in real time

Session detail — current file being edited, recent commits, PR status, CI status

Config discovery — automatically finds your ao.config.yaml and shows available sessions

The self-improving AI loop

Every agent session generates signal. Which prompts led to clean PRs? Which ones spiraled into 12 CI failure cycles? Which patterns caused merge conflicts?

Most agent setups throw this signal away. Session finishes, you move on, next session starts from zero.

Agent Orchestrator has a self-improvement system (ao-52 — itself built by an agent) that logs performance, tracks session outcomes, and runs retrospectives. It learns which tasks succeed on the first try and which need tighter guardrails.

Agents build features → orchestrator observes what worked → adjusts how it manages future sessions → agents build better features. The loop compounds.

And since the agents built the orchestrator, and the orchestrator makes the agents more effective, and those agents keep improving the orchestrator — it's recursive. The tool is improving itself through the agents it manages.

I think this is why orchestration matters more than any individual agent improvement. The ceiling isn't "how good is Claude Code at TypeScript." It's "how good can a system get at deploying, observing, and improving dozens of agents working in parallel." That ceiling is much higher. And it rises every time the loop runs.

What's next: towards fully autonomous software engineering

Talk to your agents from anywhere. Right now you need to be at your desk. You should be able to message the orchestrator from Telegram or Slack — check status, approve a merge, redirect an agent — while you're on a walk.

Tighter mid-session feedback. Agents drift. They start solving the wrong problem, over-engineer a simple fix, go down rabbit holes. The orchestrator needs to check agent work against the original intent and inject course corrections before they've burned 20 minutes going the wrong direction.

Automatic escalation. Agent can't solve something? Escalate to orchestrator. Orchestrator needs judgment? Escalate to you. You only see things that genuinely need a human decision. Everything else resolves itself.

Beyond that: a reconciler for automatic conflict resolution between parallel agents, auto-rebase for long-running branches, Docker/K8s runtimes for cloud deployments, and a plugin marketplace for community contributions.

Try it

Start the orchestrator, open the dashboard, and talk to it. Tell it what to build. It handles the rest — spawning agents, creating PRs, watching CI, forwarding review comments. You just make decisions.

We're looking for contributors: new plugins (agent runtimes, trackers, notifiers), Docker/K8s runtime, a reconciler for automatic conflict detection, and better escalation rules.

The repo is live: github.com/ComposioHQ/agent-orchestrator

Full metrics report: github.com/ComposioHQ/agent-orchestrator/releases/tag/metrics-v1

Interactive visualizations of the build data: pkarnal.com/ao-labs/

I'm building Agent Orchestrator and the developer tooling layer at Composio. If working on self-improving AI systems sounds like your kind of problem — we're hiring across SF and Bangalore: jobs.ashbyhq.com/composio

Link: http://x.com/i/article/2025981530498375680
