🤖 Agent · 🔧 基建 · 🧠 记忆

OpenClaw + Codex/Claude Code 代理蜂群(一人开发团队编排指南)

把“业务上下文”交给编排器(OpenClaw/Obsidian),把“代码上下文”交给执行代理(Codex/Claude Code),再用确定性的注册表+监控+多模型审阅定义 Done,从而把一人开发变成可并行、可重生、可收口的流水线。

2026-03-02

核心观点

  • 上下文窗口是零和:同一个模型很难同时装下“业务全景”和“代码细节”,最佳实践是拆成两层——编排器负责高层目标/约束/历史,编码代理只拿任务所需的最小代码上下文。
  • 任务要“可运行态管理”:每个任务独立 worktree + tmux session,并写入 JSON 注册表;cron 只做低成本检查(tmux 是否活、PR 是否存在、CI 状态等),把“盯终端”改成“只有需要人时才通知”。
  • Done 需要硬指标:不是“PR 开了”就算完,而是 PR 创建 + 与 main 同步 + CI 通过 + 多模型审阅通过 +(UI 变更必须有截图)这类可验证条件。
  • 多模型审阅分工:Codex 擅长边界/逻辑漏洞;Gemini 擅长安全、扩展性与遗漏点;Claude Code 更像谨慎的复核者——用来二次确认而不是主审。
  • 失败重试不是重复提示词:编排器应基于更完整的业务上下文诊断失败类型(上下文不足/跑偏/需要澄清/权限不够),然后“改提示词再重生”,并把成功模式沉淀为可复用的提示结构。

跟我们的关联

  • Stripe 的“Minions”式后台代理系统(集中式编排层 + 并行执行)。
  • Ralph Loop / 记忆-生成-评估-学习的闭环,但升级点在于:提示词也会随失败原因动态演化。
  • Obsidian 作为“业务记忆层”、GitHub PR/CI 作为“客观验收信号”、tmux/worktree 作为“可恢复的运行容器”。

讨论引子

  • 在你自己的项目里,“业务上下文”和“代码上下文”分别有哪些?你会把哪些信息放到编排层,哪些坚决不让执行代理接触?

我已经不再直接使用 Codex 或 Claude Code 了。

我用 OpenClaw 作为我的编排层(orchestration layer)。我的编排器 Zoe 会生成代理、编写它们的提示词、为每个任务挑选合适的模型、监控进度,并在 PR 准备好合并时通过 Telegram 提醒我。

过去 4 周的一些证明点:

  • 一天 94 次提交。这是我产出最高的一天——我接了 3 个客户电话,一次编辑器都没打开。平均每天大约 50 次提交。

  • 30 分钟 7 个 PR。从想法到上线快得离谱,因为编码和验证基本都自动化了。

  • 提交 → MRR:我把它用在我正在做的一个真实 B2B SaaS 上——把它与创始人主导销售打包,做到大多数功能需求同日交付。速度会把线索转化为付费客户。

.clawdbot/check-agents.sh

我的 git 历史看起来像是我刚雇了一支开发团队。实际上只有我:从管理 claude code,变成管理一个 openclaw 代理,而它再去管理一整支由其他 claude code 和 codex 代理组成的舰队。

成功率:系统几乎能对所有小到中等的任务一次完成,无需任何干预。

成本:Claude 每月约 $100,Codex 每月 $90,但你可以从 $20 起步。

这就是为什么它比直接使用 Codex 或 Claude Code 更好用:

Codex 和 Claude Code 对你的业务几乎没有上下文。

它们看到的是代码。看不到你业务的全貌。

OpenClaw 改变了这道题。它充当你与所有代理之间的编排层——它把我所有业务上下文(客户数据、会议纪要、过去的决策、哪些有效、哪些无效)都保存在我的 Obsidian vault 里,并把历史上下文翻译成对每个编码代理都精准的提示词。代理专注于代码。编排器专注于高层策略。

从高层看,这个系统大概是这样运作的:

# Codex
codex --model gpt-5.3-codex \
  -c "model_reasoning_effort=high" \
  --dangerously-bypass-approvals-and-sandbox \
  "Your prompt here"

# Claude Code  
claude --model claude-opus-4.5 \
  --dangerously-skip-permissions \
  -p "Your prompt here"

上周 Stripe 写了他们的后台代理系统,叫 “Minions”——由集中式编排层支撑的并行编码代理。我一不小心也做出了同样的东西,只不过它在我的 Mac mini 上本地运行。

在我告诉你怎么把它搭起来之前,你得先明白:你为什么需要一个代理编排器。

为什么一个 AI 不能两头兼顾

上下文窗口是 零和 的。你必须选择往里放什么。

装满代码 → 就没有空间放业务上下文。装满客户历史 → 就没有空间放代码库。这就是为什么两层系统有效:每个 AI 只装载它完成任务所需要的内容。

OpenClaw 和 Codex 的上下文诉求截然不同:

# Wrong approach:
tmux send-keys -t codex-templates "Stop. Focus on the API layer first, not the UI." Enter

# Needs more context:
tmux send-keys -t codex-templates "The schema is in src/types/template.ts. Use that." Enter

靠上下文实现分工,而不是靠不同模型。

完整的 8 步工作流

我用上周的一个真实例子带你走一遍。

步骤 1:客户需求 → 与 Zoe 确定范围

我和一家代理机构客户开了个会。他们想在团队内复用已经配置好的设置。

通话结束后,我把需求和 Zoe 过了一遍。因为我的所有会议纪要都会自动同步到 obsidian vault,我这边几乎不用解释。我们一起把功能范围划清楚——最终落到一个模板系统上,让他们能保存并编辑现有配置。

然后 Zoe 会做三件事:

  1. 充值额度,立刻解除客户阻塞——她有管理端 API 权限

  2. 从生产数据库拉取客户配置——她拥有只读的生产 DB 权限(我的 codex 代理永远不会有这个权限),把他们现有配置取出来并写进提示词

  3. 生成一个 Codex 代理——附带包含所有上下文的详细提示词

步骤 2:生成代理

每个代理都有自己的 worktree(隔离分支)和 tmux 会话:

{
  "id": "feat-custom-templates",
  "tmuxSession": "codex-templates",
  "agent": "codex",
  "description": "Custom email templates for agency customer",
  "repo": "medialyst",
  "worktree": "feat-custom-templates",
  "branch": "feat/custom-templates",
  "startedAt": 1740268800000,
  "status": "running",
  "notifyOnComplete": true
}

代理运行在 tmux 会话中,并通过脚本把终端日志完整记录下来。

我们就是这样启动代理的:

我以前用 codex exec 或 claude -p,但最近改用 tmux:

tmux 好太多了,因为 任务中途重定向 这个能力非常强。代理跑偏了?别杀掉它:
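中途重定向本质上就是向代理所在的 tmux 会话注入一条新指令。下面是一个最小示意,加了个 DRY_RUN 开关方便离线演示(函数封装与开关为本文假设,实际用法就是前文展示的 tmux send-keys):

```shell
# 向跑偏代理的 tmux 会话注入纠偏指令;设置 DRY_RUN 时只打印将要执行的命令
redirect_agent() {
  # $1 = tmux 会话名, $2 = 纠偏指令
  if [ -n "$DRY_RUN" ]; then
    echo "tmux send-keys -t $1 \"$2\" Enter"
  else
    tmux send-keys -t "$1" "$2" Enter
  fi
}

DRY_RUN=1
redirect_agent codex-templates "Stop. The customer wanted X, not Y."
```

代理会在当前会话里直接收到这条消息并调整方向,已有的工作进度不会丢。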

任务会记录在 .clawdbot/active-tasks.json 里:

完成后,它会更新 PR 编号和各项检查。(第 5 步会细说)

{
  "status": "done",
  "pr": 341,
  "completedAt": 1740275400000,
  "checks": {
    "prCreated": true,
    "ciPassed": true,
    "claudeReviewPassed": true,
    "geminiReviewPassed": true
  },
  "note": "All checks passed. Ready to merge."
}

步骤 3:循环监控

一个 cron 任务每 10 分钟跑一次,负责“看护”所有代理。这基本上相当于一个改进版的 Ralph Loop,后面会说。

但它不会直接轮询代理——那太贵了。它会跑一个脚本,读取 JSON 注册表并检查:

# Create worktree + spawn agent
git worktree add ../feat-custom-templates -b feat/custom-templates origin/main
cd ../feat-custom-templates && pnpm install

tmux new-session -d -s "codex-templates" \
  -c "/Users/elvis/Documents/GitHub/medialyst-worktrees/feat-custom-templates" \
  "$HOME/.codex-agent/run-agent.sh templates gpt-5.3-codex high"

这个脚本 100% 确定性,而且非常省 token:

  • 检查 tmux 会话是否还活着
  • 检查被追踪分支上是否有打开的 PR
  • 通过 gh cli 检查 CI 状态
  • 如果 CI 失败或审阅反馈出现关键问题,自动重生失败代理(最多 3 次)
  • 只有在需要人类介入时才提醒

我不盯终端。系统会告诉我什么时候该去看。
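"CI 失败就重生、最多 3 次、超过就提醒人类"这条闸门逻辑,可以用几行 sh 示意(函数名与参数为本文假设,阈值沿用正文的 3 次):

```shell
# 监控脚本里"是否重生代理"的决策闸门示意
MAX_ATTEMPTS=3

should_respawn() {
  # $1 = CI 状态(success/failure), $2 = 已重试次数
  [ "$1" = "failure" ] && [ "$2" -lt "$MAX_ATTEMPTS" ]
}

should_respawn failure 1 && echo "respawn"       # 还有重试额度:重生代理
should_respawn failure 3 || echo "alert-human"   # 额度用尽:提醒人类介入
```

这一步完全不调用模型,所以 cron 每 10 分钟跑一次也几乎没有 token 成本。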

步骤 4:代理创建 PR

代理会提交、推送,并通过 gh pr create --fill 打开 PR。此时我不会收到通知——只有 PR 并不代表完成。

完成定义(极其重要:你的代理必须知道这个):

  • PR 已创建
  • 分支已与 main 同步(无合并冲突)
  • CI 通过(lint、types、unit tests、E2E)
  • Codex 审阅通过
  • Claude Code 审阅通过
  • Gemini 审阅通过
  • 如有 UI 变更,必须包含截图

步骤 5:自动化代码审阅

每个 PR 都会被三个 AI 模型审阅。它们擅长的点不一样:

  • Codex Reviewer —— 特别擅长边界情况。审得最彻底。能抓出逻辑错误、缺失的错误处理、竞态条件。误报率非常低。

  • Gemini Code Assist Reviewer —— 免费且极其好用。能抓出安全问题、可扩展性问题,以及其他代理容易漏掉的东西,并给出具体修复建议。装上就对了。

  • Claude Code Reviewer —— 基本没啥用——往往过于谨慎。充斥着“考虑添加……”这类通常属于过度工程的建议。除非标成 critical,否则我都跳过。它很少能独立发现关键问题,但能对其他审阅者标出的风险做一遍验证。

三者都会直接在 PR 上发表评论。

步骤 6:自动化测试

我们的 CI 流水线会跑大量自动化测试:

  • Lint 与 TypeScript 检查
  • 单元测试
  • E2E 测试
  • Playwright 针对预览环境的测试(与生产环境一致)

上周我加了一条新规则:只要 PR 改了任何 UI,就必须在 PR 描述里附上截图,否则 CI 直接失败。这会显著缩短审阅时间——我不用点进预览就能一眼看清改动。
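这条规则的判定逻辑本身很简单,大致可以这样示意(UI 目录前缀 src/components/ 与 Markdown 图片标记均为本文假设,并非原文的真实 CI 配置):

```shell
# "改了 UI 就必须在 PR 描述里带截图"的 CI 闸门示意
needs_screenshot() { echo "$1" | grep -q '^src/components/'; }   # 改动是否触及 UI 目录
has_screenshot()   { echo "$1" | grep -q '!\[.*\]('; }           # 描述里是否有图片标记

check_pr() {
  # $1 = 改动文件列表, $2 = PR 描述
  if needs_screenshot "$1" && ! has_screenshot "$2"; then
    echo "FAIL: UI change without screenshot"
    return 1
  fi
  echo "PASS"
}

check_pr "src/components/TemplateEditor.tsx" "Adds editor. ![shot](https://x.test/1.png)"
```

把它接进 CI,代理很快就会"学会"主动附截图,因为不附就过不了完成定义。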

步骤 7:人工审阅

现在我会收到 Telegram 通知:“PR #341 ready for review.”

到这一步:

  • CI 已通过
  • 三个 AI 审阅者都已批准代码
  • 截图展示了 UI 变更
  • 所有边界情况都在审阅评论里写清了

我的审阅只要 5–10 分钟。很多 PR 我甚至不看代码就直接合并——截图已经告诉我我需要知道的一切。

步骤 8:合并

PR 合并。每天一个 cron 任务会清理孤儿 worktree 和任务注册表 json。
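两个定时任务(10 分钟一次的监控、每日清理)如果用 crontab 表达,大致如下(脚本路径为示意,并非原文给出的真实配置):

```shell
# 假设性的 crontab 条目:监控循环 + 每日清理孤儿 worktree
*/10 * * * * $HOME/.clawdbot/check-agents.sh >> $HOME/.clawdbot/monitor.log 2>&1
0 4  * * *   $HOME/.clawdbot/cleanup-worktrees.sh
```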

Ralph Loop V2

这本质上就是 Ralph Loop,但更强。

Ralph Loop 会从记忆里拉取上下文、生成输出、评估结果、保存学习。但大多数实现每一轮都用同一个提示词。提炼出的学习会改善未来的检索,但提示词本身是静态不变的。

我们的系统不一样。代理失败时,Zoe 不会只是用同一个提示词把它重生。她会在完整业务上下文下分析失败原因,并找到解除阻塞的办法:

  • 代理上下文不够?“只关注这三个文件。”

  • 代理跑偏了?“停下。客户想要的是 X,不是 Y。会议里他们是这么说的。”

  • 代理需要澄清?“这是客户的邮件,以及他们公司是做什么的。”

Zoe 会把代理一路“带娃”到完成。她掌握代理没有的上下文——客户历史、会议纪要、我们之前试过什么、为什么失败。她用这些上下文在每次重试时写出更好的提示词。
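这种"按失败类型改提示词"的分发,骨架上就是一张映射表。下面用 sh 粗略示意(类型标签为本文归纳;真实系统里由 Zoe 结合业务上下文生成完整的新提示词,而不是固定模板):

```shell
# 把诊断出的失败类型映射为下一次重生时的提示词补丁
amend_prompt() {
  case "$1" in
    low-context)   echo "Focus only on these three files." ;;
    drift)         echo "Stop. The customer wanted X, not Y." ;;
    needs-clarity) echo "Here's the customer's email and what their company does." ;;
    *)             echo "escalate-to-human" ;;   # 无法归类的失败交还给人
  esac
}

amend_prompt drift
```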

但她也不会等我分配任务。她会主动找活:

  • 早上: 扫描 Sentry → 发现 4 个新错误 → 生成 4 个代理去排查并修复

  • 会后: 扫描会议纪要 → 标出客户提到的 3 个功能请求 → 生成 3 个 Codex 代理

  • 晚上: 扫描 git log → 生成 Claude Code 去更新 changelog 和客户文档

我在一次客户电话后出去散个步。回来打开 Telegram:“7 PRs ready for review. 3 features, 4 bug fixes.”

当代理成功时,模式会被记录下来。“这种提示词结构适用于计费功能。”“Codex 需要先拿到类型定义。”“一定要包含测试文件路径。”

奖励信号是:CI 通过、三次代码审阅全部通过、人工合并。任何失败都会触发循环。随着时间推移,Zoe 因为记得哪些东西真正上线了,就会写出越来越好的提示词。

选择合适的代理

并不是所有编码代理都一样。快速参考:

Codex 是我的主力。后端逻辑、复杂 bug、多文件重构、任何需要跨代码库推理的任务。它更慢,但更彻底。我把 90% 的任务交给它。

Claude Code 更快,也更擅长前端工作。它的权限问题也更少,所以特别适合做 git 操作。(我以前更常用它来驱动日常,但现在 Codex 5.3 既更好也更快。)

Gemini 有另一种超能力——设计审美。要做漂亮的 UI 时,我会先让 Gemini 生成一份 HTML/CSS spec,再交给 Claude Code 在我们的组件系统里实现。Gemini 设计,Claude 落地。

Zoe 会为每个任务选择合适的代理,并在它们之间路由输出。计费系统 bug 交给 Codex。按钮样式修复交给 Claude Code。新仪表盘设计从 Gemini 开始。
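这个路由决策同样可以示意成一张简单的映射表(任务标签与映射规则为本文根据上文归纳,并非原文代码):

```shell
# 按任务类型把工作路由到不同代理
route_task() {
  case "$1" in
    backend|bugfix|refactor) echo "codex" ;;        # 需要跨代码库推理的交给 Codex
    frontend|git)            echo "claude-code" ;;  # 前端与 git 操作交给 Claude Code
    design)                  echo "gemini" ;;       # 视觉设计从 Gemini 开始
  esac
}

route_task bugfix
```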

如何搭建这套系统

把这整篇文章复制进 OpenClaw,然后告诉它:“为我的代码库实现这套代理蜂群配置。”

它会读懂架构、创建脚本、搭好目录结构,并配置 cron 监控。10 分钟搞定。

我没有课要卖给你。

没人预料到的瓶颈

我现在碰到的上限是:内存(RAM)。

每个代理都需要自己的 worktree。每个 worktree 都需要自己的 node_modules。每个代理都会跑构建、类型检查、测试。5 个代理同时跑,就意味着 5 个并行的 TypeScript 编译器、5 个测试 runner、以及 5 份依赖同时加载进内存。

我的 16GB Mac Mini 最多撑到 4–5 个代理就开始换页(swapping)——而且还得祈祷它们别同时触发 build。

所以我买了一台 128GB RAM 的 Mac Studio M4 max($3,500)来跑这套系统。它 3 月底到货,到时候我会分享值不值。

下一步:一人百万美元公司

从 2026 年开始,我们会看到大量的一人百万美元公司。对那些懂得如何构建递归自我改进代理的人来说,杠杆巨大。

它会长这样:一个 AI 编排器作为你的延伸(就像 Zoe 对我那样),把工作委派给擅长不同业务职能的专门代理。工程。客服。运营。市场。每个代理都只做自己最擅长的事。你保持极致专注与完全控制。

下一代创业者不会再雇 10 人团队去做“一个人 + 正确系统”就能完成的事。他们会这样建造——小团队、快节奏、每天交付上线。

现在的 AI 生成垃圾太多了。围绕代理和“任务指挥中心”的炒作也太多了,却没做出任何真正有用的东西。花哨的 demo,没有现实收益。

我在努力做相反的事:少吹牛,多记录如何真正把业务做出来。真实客户、真实收入、真实能上线到生产环境的提交,也有真实的损失。

我在做什么?Agentic PR——一家一人公司,去挑战企业 PR 的老牌玩家。用代理帮助初创公司获得媒体报道,而不需要每月 $10k 的代理留用费。

如果你想看看我能把它做到什么程度,就跟着我一起走。

I don't use Codex or Claude Code directly anymore.

I use OpenClaw as my orchestration layer. My orchestrator, Zoe, spawns the agents, writes their prompts, picks the right model for each task, monitors progress, and pings me on Telegram when PRs are ready to merge.

Proof points from the last 4 weeks:

  • 94 commits in one day. My most productive day - I had 3 client calls and didn't open my editor once. The average is around 50 commits a day.

  • 7 PRs in 30 minutes. Idea to production is blazing fast because coding and validation are mostly automated.

  • Commits → MRR: I use this for a real B2B SaaS I'm building — bundling it with founder-led sales to deliver most feature requests same-day. Speed converts leads into paying customers.

.clawdbot/check-agents.sh

My git history looks like I just hired a dev team. In reality it's just me going from managing claude code, to managing an openclaw agent that manages a fleet of other claude code and codex agents.

Success rate: The system one-shots almost all small to medium tasks without any intervention.

Cost: ~$100/month for Claude and $90/month for Codex, but you can start with $20.

Here's why this works better than using Codex or Claude Code directly:

Codex and Claude Code have very little context about your business.

They see code. They don't see the full picture of your business.

OpenClaw changes the equation. It acts as the orchestration layer between you and all agents — it holds all my business context (customer data, meeting notes, past decisions, what worked, what failed) inside my Obsidian vault, and translates historical context into precise prompts for each coding agent. The agents stay focused on code. The orchestrator stays at the high strategy level.

Here's how the system works at a high level:

# Codex
codex --model gpt-5.3-codex \
  -c "model_reasoning_effort=high" \
  --dangerously-bypass-approvals-and-sandbox \
  "Your prompt here"

# Claude Code  
claude --model claude-opus-4.5 \
  --dangerously-skip-permissions \
  -p "Your prompt here"

Last week Stripe wrote about their background agent system called "Minions" — parallel coding agents backed by a centralized orchestration layer. I accidentally built the same thing but it runs locally on my Mac mini.

Before I tell you how to set this up, you should know WHY you need an agent orchestrator.


Why One AI Can't Do Both

Context windows are zero-sum. You have to choose what goes in.

Fill it with code → no room for business context. Fill it with customer history → no room for the codebase. This is why the two-tier system works: each AI is loaded with exactly what it needs.

OpenClaw and Codex have drastically different context:

# Wrong approach:
tmux send-keys -t codex-templates "Stop. Focus on the API layer first, not the UI." Enter

# Needs more context:
tmux send-keys -t codex-templates "The schema is in src/types/template.ts. Use that." Enter

Specialization through context, not through different models.


The Full 8-step Workflow

Let me walk through a real example from last week.

Step 1: Customer Request → Scoping with Zoe

I had a call with an agency customer. They wanted to reuse configurations they've already set up across the team.

After the call, I talked through the request with Zoe. Because all my meeting notes sync automatically to my obsidian vault, zero explanation was needed on my end. We scoped out the feature together — and landed on a template system that lets them save and edit their existing configurations.

Then Zoe does three things:

  1. Tops up credits to unblock customer immediately — she has admin API access

  2. Pulls customer config from prod database — she has read-only prod DB access (my codex agents will never have this) to retrieve their existing setup, which gets included in the prompt

  3. Spawns a Codex agent — with a detailed prompt containing all the context

Step 2: Spawn the Agent

Each agent gets its own worktree (isolated branch) and tmux session:

{
  "id": "feat-custom-templates",
  "tmuxSession": "codex-templates",
  "agent": "codex",
  "description": "Custom email templates for agency customer",
  "repo": "medialyst",
  "worktree": "feat-custom-templates",
  "branch": "feat/custom-templates",
  "startedAt": 1740268800000,
  "status": "running",
  "notifyOnComplete": true
}

The agent runs in a tmux session with full terminal logging via a script.

Here's how we launch agents:

I used to use codex exec or claude -p, but switched to tmux recently:

tmux is far better because mid-task redirection is powerful. Agent going the wrong direction? Don't kill it:

The task gets tracked in .clawdbot/active-tasks.json:

When complete, it updates with PR number and checks. (More on this in step 5)

{
  "status": "done",
  "pr": 341,
  "completedAt": 1740275400000,
  "checks": {
    "prCreated": true,
    "ciPassed": true,
    "claudeReviewPassed": true,
    "geminiReviewPassed": true
  },
  "note": "All checks passed. Ready to merge."
}

Step 3: Monitoring in a loop

A cron job runs every 10 minutes to babysit all agents. This pretty much functions as an improved Ralph Loop; more on that later.

But it doesn't poll the agents directly — that would be expensive. Instead, it runs a script that reads the JSON registry and checks:

# Create worktree + spawn agent
git worktree add ../feat-custom-templates -b feat/custom-templates origin/main
cd ../feat-custom-templates && pnpm install

tmux new-session -d -s "codex-templates" \
  -c "/Users/elvis/Documents/GitHub/medialyst-worktrees/feat-custom-templates" \
  "$HOME/.codex-agent/run-agent.sh templates gpt-5.3-codex high"

The script is 100% deterministic and extremely token-efficient:

  • Checks if tmux sessions are alive
  • Checks for open PRs on tracked branches
  • Checks CI status via gh cli
  • Auto-respawns failed agents (max 3 attempts) if CI fails or critical review feedback
  • Only alerts if something needs human attention

I'm not watching terminals. The system tells me when to look.

Step 4: Agent Creates PR

The agent commits, pushes, and opens a PR via gh pr create --fill. At this point I do NOT get notified — a PR alone isn't done.

Definition of done (it's very important that your agent knows this):

  • PR created
  • Branch synced to main (no merge conflicts)
  • CI passing (lint, types, unit tests, E2E)
  • Codex review passed
  • Claude Code review passed
  • Gemini review passed
  • Screenshots included (if UI changes)

Step 5: Automated Code Review

Every PR gets reviewed by three AI models. They catch different things:

  • Codex Reviewer — Exceptional at edge cases. Does the most thorough review. Catches logic errors, missing error handling, race conditions. False positive rate is very low.

  • Gemini Code Assist Reviewer — Free and incredibly useful. Catches security issues, scalability problems other agents miss. And suggests specific fixes. No brainer to install.

  • Claude Code Reviewer — Mostly useless - tends to be overly cautious. Lots of "consider adding..." suggestions that are usually overengineering. I skip everything unless it's marked critical. It rarely finds critical issues on its own but validates what the other reviewers flag.

All three post comments directly on the PR.

Step 6: Automated Testing

Our CI pipeline runs a heavy amount of automated tests:

  • Lint and TypeScript checks
  • Unit tests
  • E2E tests
  • Playwright tests against a preview environment (identical to prod)

I added a new rule last week: if the PR changes any UI, it must include a screenshot in the PR description. Otherwise CI fails. This dramatically shortens review time — I can see exactly what changed without clicking through the preview.

Step 7: Human Review

Now I get the Telegram notification: "PR #341 ready for review."

By this point:

  • CI passed
  • Three AI reviewers approved the code
  • Screenshots show the UI changes
  • All edge cases are documented in review comments

My review takes 5-10 minutes. Many PRs I merge without reading the code — the screenshot shows me everything I need.

Step 8: Merge

PR merges. A daily cron job cleans up orphaned worktrees and task registry json.


The Ralph Loop V2

This is essentially the Ralph Loop, but better.

The Ralph Loop pulls context from memory, generates output, evaluates results, and saves learnings. But most implementations run the same prompt each cycle. The distilled learnings improve future retrievals, but the prompt itself stays static.

Our system is different. When an agent fails, Zoe doesn't just respawn it with the same prompt. She looks at the failure with full business context and figures out how to unblock it:

  • Agent ran out of context? "Focus only on these three files."

  • Agent went the wrong direction? "Stop. The customer wanted X, not Y. Here's what they said in the meeting."

  • Agent needs clarification? "Here's the customer's email and what their company does."

Zoe babysits agents through to completion. She has context the agents don't — customer history, meeting notes, what we tried before, why it failed. She uses that context to write better prompts on each retry.

But she also doesn't wait for me to assign tasks. She finds work proactively:

  • Morning: Scans Sentry → finds 4 new errors → spawns 4 agents to investigate and fix

  • After meetings: Scans meeting notes → flags 3 feature requests customers mentioned → spawns 3 Codex agents

  • Evening: Scans git log → spawns Claude Code to update changelog and customer docs

I take a walk after a customer call. Come back to Telegram: "7 PRs ready for review. 3 features, 4 bug fixes."

When agents succeed, the pattern gets logged. "This prompt structure works for billing features." "Codex needs the type definitions upfront." "Always include the test file paths."

The reward signals are: CI passing, all three code reviews passing, human merge. Any failure triggers the loop. Over time, Zoe writes better prompts because she remembers what shipped.


Choosing the Right Agent

Not all coding agents are equal. Quick reference:

Codex is my workhorse. Backend logic, complex bugs, multi-file refactors, anything that requires reasoning across the codebase. It's slower but thorough. I use it for 90% of tasks.

Claude Code is faster and better at frontend work. It also has fewer permission issues, so it's great for git operations. (I used to use this more to drive day to day, but Codex 5.3 is simply better and faster now)

Gemini has a different superpower — design sensibility. For beautiful UIs, I'll have Gemini generate an HTML/CSS spec first, then hand that to Claude Code to implement in our component system. Gemini designs, Claude builds.

Zoe picks the right agent for each task and routes outputs between them. A billing system bug goes to Codex. A button style fix goes to Claude Code. A new dashboard design starts with Gemini.


How to Set This Up

Copy this entire article into OpenClaw and tell it: "Implement this agent swarm setup for my codebase."

It'll read the architecture, create the scripts, set up the directory structure, and configure cron monitoring. Done in 10 minutes.

No course to sell you.


The Bottleneck Nobody Expects

Here's the ceiling I'm hitting right now: RAM.

Each agent needs its own worktree. Each worktree needs its own node_modules. Each agent runs builds, type checks, tests. Five agents running simultaneously means five parallel TypeScript compilers, five test runners, five sets of dependencies loaded into memory.

My Mac Mini with 16GB tops out at 4-5 agents before it starts swapping — and I need to be lucky they don't try to build at the same time.

So I bought a Mac Studio M4 max with 128GB RAM ($3,500) to power this system. It arrives end of March and I'll share if it's worth it.


Up Next: The One-Person Million-Dollar Company

We're going to see a ton of one-person million-dollar companies starting in 2026. The leverage is massive for those who understand how to build recursively self-improving agents.

This is what it looks like: an AI orchestrator as an extension of yourself (like what Zoe is to me), delegating work to specialized agents that handle different business functions. Engineering. Customer support. Ops. Marketing. Each agent focused on what it's good at. You maintain laser focus and full control.

The next generation of entrepreneurs won't hire a team of 10 to do what one person with the right system can do. They'll build like this — staying small, moving fast, shipping daily.

There's so much AI-generated slop right now. So much hype around agents and "mission controls" without building anything actually useful. Fancy demos with no real-world benefits.

I'm trying to do the opposite: less hype, more documentation of building an actual business. Real customers, real revenue, real commits that ship to production, and real loss too.

What am I building? Agentic PR — a one-person company taking on the enterprise PR incumbents. Agents that help startups get press coverage without a $10k/month retainer.

If you want to see how far I take this, follow along.

下一步:一人百万美元公司

从 2026 年开始,我们会看到大量的一人百万美元公司。对那些懂得如何构建递归自我改进代理的人来说,杠杆巨大。

它会长这样:一个 AI 编排器作为你的延伸(就像 Zoe 对我那样),把工作委派给擅长不同业务职能的专门代理。工程。客服。运营。市场。每个代理都只做自己最擅长的事。你保持极致专注与完全控制。

下一代创业者不会再雇 10 人团队去做“一个人 + 正确系统”就能完成的事。他们会这样建造——小团队、快节奏、每天交付上线。

现在的 AI 生成垃圾太多了。围绕代理和“任务指挥中心”的炒作也太多了,却没做出任何真正有用的东西。花哨的 demo,没有现实收益。

我在努力做相反的事:少吹牛,多记录如何真正把业务做出来。真实客户、真实收入、真实能上线到生产环境的提交,也有真实的损失。

我在做什么?Agentic PR——一家一人公司,去挑战企业 PR 的老牌玩家。用代理帮助初创公司获得媒体报道,而不需要每月 $10k 的代理留用费。

如果你想看看我能把它做到什么程度,就跟着我一起走。

I don't use Codex or Claude Code directly anymore.

I use OpenClaw as my orchestration layer. My orchestrator, Zoe, spawns the agents, writes their prompts, picks the right model for each task, monitors progress, and pings me on Telegram when PRs are ready to merge.

Proof points from the last 4 weeks:

  • 94 commits in one day. My most productive day - I had 3 client calls and didn't open my editor once. The average is around 50 commits a day.

  • 7 PRs in 30 minutes. Idea to production are blazing fast because coding and validations are mostly automated.

  • Commits → MRR: I use this for a real B2B SaaS I'm building — bundling it with founder-led sales to deliver most feature requests same-day. Speed converts leads into paying customers.

.clawdbot/check-agents.sh

My git history looks like I just hired a dev team. In reality it's just me going from managing claude code, to managing an openclaw agent that manages a fleet of other claude code and codex agents.

Success rate: The system one-shots almost all small to medium tasks without any intervention.

Cost: ~$100/month for Claude and $90/month for Codex, but you can start with $20.

Here's why this works better than using Codex or Claude Code directly:

>Codex and Claude Code have very little context about your business.

They see code. They don't see the full picture of your business.

OpenClaw changes the equation. It acts as the orchestration layer between you and all agents — it holds all my business context (customer data, meeting notes, past decisions, what worked, what failed) inside my Obsidian vault, and translates historical context into precise prompts for each coding agent. The agents stay focused on code. The orchestrator stays at the high strategy level.

Here's how the system works at a high level:

# Codex
codex --model gpt-5.3-codex \
  -c "model_reasoning_effort=high" \
  --dangerously-bypass-approvals-and-sandbox \
  "Your prompt here"

# Claude Code  
claude --model claude-opus-4.5 \
  --dangerously-skip-permissions \
  -p "Your prompt here"

Last week Stripe wrote about their background agent system called "Minions" — parallel coding agents backed by a centralized orchestration layer. I accidentally built the same thing but it runs locally on my Mac mini.

Before I tell you how to set this up, you should know WHY you need an agent orchestrator.

Why One AI Can't Do Both

Context windows are zero-sum. You have to choose what goes in.

Fill it with code → no room for business context. Fill it with customer history → no room for the codebase. This is why the two-tier system works: each AI is loaded with exactly what it needs.

OpenClaw and Codex have drastically different context:

# Wrong approach:
tmux send-keys -t codex-templates "Stop. Focus on the API layer first, not the UI." Enter

# Needs more context:
tmux send-keys -t codex-templates "The schema is in src/types/template.ts. Use that." Enter

Specialization through context, not through different models.

The Full 8-step Workflow

Let me walk through a real example from last week.

Step 1: Customer Request → Scoping with Zoe

I had a call with an agency customer. They wanted to reuse configurations they've already set up across the team.

After the call, I talked through the request with Zoe. Because all my meeting notes sync automatically to my obsidian vault, zero explanation was needed on my end. We scoped out the feature together — and landed on a template system that lets them save and edit their existing configurations.

Then Zoe does three things:

  1. Tops up credits to unblock customer immediately — she has admin API access

  2. Pulls customer config from prod database — she has read-only prod DB access (my codex agents will never have this) to retrieve their existing setup, which gets included in the prompt

  3. Spawns a Codex agent — with a detailed prompt containing all the context

Step 2: Spawn the Agent

Each agent gets its own worktree (isolated branch) and tmux session:

{
  "id": "feat-custom-templates",
  "tmuxSession": "codex-templates",
  "agent": "codex",
  "description": "Custom email templates for agency customer",
  "repo": "medialyst",
  "worktree": "feat-custom-templates",
  "branch": "feat/custom-templates",
  "startedAt": 1740268800000,
  "status": "running",
  "notifyOnComplete": true
}

The agent runs in a tmux session with full terminal logging via a script.

Here's how we launch agents:

I used to use codex exec or claude -p, but switch to tmux recently:

tmux is far better because mid-task redirection is powerful. Agent going the wrong direction? Don't kill it:

The task gets tracked in .clawdbot/active-tasks.json:

When complete, it updates with PR number and checks. (More on this in step 5)

{
  "status": "done",
  "pr": 341,
  "completedAt": 1740275400000,
  "checks": {
    "prCreated": true,
    "ciPassed": true,
    "claudeReviewPassed": true,
    "geminiReviewPassed": true
  },
  "note": "All checks passed. Ready to merge."
}

Step 3: Monitoring in a loop

A cron job runs every 10 minutes to babysit all agents. This pretty much functions as an improved Ralph Loop (more on that later).

But it doesn't poll the agents directly — that would be expensive. Instead, it runs a script that reads the JSON registry. The script is 100% deterministic and extremely token-efficient:

  • Checks if tmux sessions are alive
  • Checks for open PRs on tracked branches
  • Checks CI status via the gh CLI
  • Auto-respawns failed agents (max 3 attempts) if CI fails or a review flags something critical
  • Only alerts if something needs human attention

For reference, this is the launch sequence from Step 2 that the monitor later checks on:

# Create worktree + spawn agent
git worktree add ../feat-custom-templates -b feat/custom-templates origin/main
cd ../feat-custom-templates && pnpm install

tmux new-session -d -s "codex-templates" \
  -c "/Users/elvis/Documents/GitHub/medialyst-worktrees/feat-custom-templates" \
  "$HOME/.codex-agent/run-agent.sh templates gpt-5.3-codex high"

I'm not watching terminals. The system tells me when to look.
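
The check-and-act logic can be sketched as a pure decision function. The state names, the probes in the comments, and the function itself are my own framing of the rules described above, not the actual script:

```shell
# Map probe results to an action. Pure shell, so the cron job spends
# zero tokens deciding what to do; the 3-attempt cap mirrors the
# auto-respawn rule above.
decide() {
  alive="$1"; pr="$2"; ci="$3"; attempts="$4"
  if [ "$ci" = fail ] && [ "$attempts" -lt 3 ]; then
    echo respawn          # CI failed: retry with a revised prompt
  elif [ "$alive" = no ] && [ "$pr" = none ]; then
    echo respawn          # agent died before opening a PR
  elif [ "$pr" = open ] && [ "$ci" = pass ]; then
    echo notify           # everything green: ping the human
  else
    echo wait             # still working; check again next cycle
  fi
}

# Probes the real script would run (names assumed):
#   alive: tmux has-session -t "$session" 2>/dev/null && echo yes || echo no
#   pr:    gh pr list --head "$branch" --json state -q '.[0].state'
decide yes open pass 0    # prints: notify
```

Everything expensive (LLM calls, respawning with a better prompt) happens only on the respawn branch; the happy path costs a few subprocess calls per cycle.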

Step 4: Agent Creates PR

The agent commits, pushes, and opens a PR via gh pr create --fill. At this point I do NOT get notified — a PR alone isn't done.

Definition of done (it's very important that your agent knows this):

  • PR created
  • Branch synced to main (no merge conflicts)
  • CI passing (lint, types, unit tests, E2E)
  • Codex review passed
  • Claude Code review passed
  • Gemini review passed
  • Screenshots included (if UI changes)
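
Expressed mechanically, the checklist above reduces to a single predicate. This helper is illustrative (my own sketch); the real inputs would come from gh pr view and the review bots:

```shell
# "Definition of done" as one function. Argument names map to the
# bullets above; values are what a wrapper script would derive from
# the PR's state, CI status, and review comments.
is_done() {
  pr_created="$1"; synced="$2"; ci="$3"; reviews="$4"; ui="$5"; shot="$6"
  [ "$pr_created" = yes ] || return 1   # PR exists
  [ "$synced" = yes ]     || return 1   # no merge conflicts with main
  [ "$ci" = pass ]        || return 1   # lint/types/unit/E2E all green
  [ "$reviews" = "3/3" ]  || return 1   # Codex + Claude + Gemini approved
  if [ "$ui" = yes ]; then              # UI change => screenshot required
    [ "$shot" = yes ] || return 1
  fi
  echo done
}

is_done yes yes pass 3/3 yes yes    # prints: done
```

The point of encoding it this way is that "done" never depends on an agent's self-report; every input is an externally verifiable signal.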

Step 5: Automated Code Review

Every PR gets reviewed by three AI models. They catch different things:

  • Codex Reviewer — Exceptional at edge cases. Does the most thorough review. Catches logic errors, missing error handling, race conditions. False positive rate is very low.

  • Gemini Code Assist Reviewer — Free and incredibly useful. Catches security issues and scalability problems the other agents miss, and suggests specific fixes. A no-brainer to install.

  • Claude Code Reviewer — The least useful of the three on its own: overly cautious, with lots of "consider adding..." suggestions that are usually overengineering. I skip everything unless it's marked critical. It rarely finds critical issues by itself, but it's good at validating what the other reviewers flag.

All three post comments directly on the PR.

Step 6: Automated Testing

Our CI pipeline runs an extensive battery of automated tests:

  • Lint and TypeScript checks
  • Unit tests
  • E2E tests
  • Playwright tests against a preview environment (identical to prod)

I added a new rule last week: if the PR changes any UI, it must include a screenshot in the PR description. Otherwise CI fails. This dramatically shortens review time — I can see exactly what changed without clicking through the preview.
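
A hedged sketch of that CI gate. The path patterns and screenshot markers are assumptions; a real job would get the changed files from the PR diff and the body from gh pr view:

```shell
# Fail the build when UI files changed but the PR body has no image.
require_screenshot() {
  changed_files="$1"; pr_body="$2"
  if echo "$changed_files" | grep -Eq '\.(tsx|css)$|components/'; then
    # Markdown image syntax or an <img> tag counts as a screenshot
    echo "$pr_body" | grep -Eqi '!\[|<img' || {
      echo "FAIL: UI change without a screenshot in the PR description"
      return 1
    }
  fi
  echo OK
}

require_screenshot "src/components/Button.tsx" \
  "Fixes padding. ![before/after](shot.png)"    # prints: OK
```

Because the gate runs in CI, agents learn it the hard way: a missing screenshot is just another red check that triggers a respawn with "attach a screenshot" added to the prompt.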

Step 7: Human Review

Now I get the Telegram notification: "PR #341 ready for review."

By this point:

  • CI passed
  • Three AI reviewers approved the code
  • Screenshots show the UI changes
  • All edge cases are documented in review comments

My review takes 5-10 minutes. Many PRs I merge without reading the code — the screenshot shows me everything I need.

Step 8: Merge

PR merges. A daily cron job cleans up orphaned worktrees and stale entries in the task-registry JSON.

The Ralph Loop V2

This is essentially the Ralph Loop, but better.

The Ralph Loop pulls context from memory, generates output, evaluates results, and saves learnings. But most implementations run the same prompt each cycle. The distilled learnings improve future retrievals, but the prompt itself stays static.

Our system is different. When an agent fails, Zoe doesn't just respawn it with the same prompt. She looks at the failure with full business context and figures out how to unblock it:

  • Agent ran out of context? "Focus only on these three files."

  • Agent went the wrong direction? "Stop. The customer wanted X, not Y. Here's what they said in the meeting."

  • Agent needs clarification? "Here's the customer's email and what their company does."

Zoe babysits agents through to completion. She has context the agents don't — customer history, meeting notes, what we tried before, why it failed. She uses that context to write better prompts on each retry.
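
A sketch of that diagnose-then-rewrite step. The failure labels, the ALL_CAPS placeholders, and the function are my own illustration of the examples above; in practice Zoe does this with an LLM and real business context, not a case statement:

```shell
# Map a diagnosed failure type to a prompt addendum for the respawn.
retry_prompt() {
  base="$1"; failure="$2"
  case "$failure" in
    context_overflow) echo "$base Focus only on the files listed in TASK_FILES." ;;
    wrong_direction)  echo "$base Stop. The customer wanted X, not Y; see MEETING_NOTES." ;;
    needs_context)    echo "$base Background on the customer: CUSTOMER_BRIEF." ;;
    *)                echo "$base" ;;   # unknown failure: retry unchanged
  esac
}

retry_prompt "Implement the template system." context_overflow
```

The key property is that the prompt is a function of the failure, not a constant: each retry is strictly better informed than the last.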

But she also doesn't wait for me to assign tasks. She finds work proactively:

  • Morning: Scans Sentry → finds 4 new errors → spawns 4 agents to investigate and fix

  • After meetings: Scans meeting notes → flags 3 feature requests customers mentioned → spawns 3 Codex agents

  • Evening: Scans git log → spawns Claude Code to update changelog and customer docs

I take a walk after a customer call. Come back to Telegram: "7 PRs ready for review. 3 features, 4 bug fixes."

When agents succeed, the pattern gets logged. "This prompt structure works for billing features." "Codex needs the type definitions upfront." "Always include the test file paths."

The reward signals are: CI passing, all three code reviews passing, human merge. Any failure triggers the loop. Over time, Zoe writes better prompts because she remembers what shipped.

Choosing the Right Agent

Not all coding agents are equal. Quick reference:

Codex is my workhorse. Backend logic, complex bugs, multi-file refactors, anything that requires reasoning across the codebase. It's slower but thorough. I use it for 90% of tasks.

Claude Code is faster and better at frontend work. It also has fewer permission issues, so it's great for git operations. (I used to use it more for day-to-day driving, but Codex 5.3 is simply better and faster now.)

Gemini has a different superpower — design sensibility. For beautiful UIs, I'll have Gemini generate an HTML/CSS spec first, then hand that to Claude Code to implement in our component system. Gemini designs, Claude builds.

Zoe picks the right agent for each task and routes outputs between them. A billing system bug goes to Codex. A button style fix goes to Claude Code. A new dashboard design starts with Gemini.

How to Set This Up

Copy this entire article into OpenClaw and tell it: "Implement this agent swarm setup for my codebase."

It'll read the architecture, create the scripts, set up the directory structure, and configure cron monitoring. Done in 10 minutes.

No course to sell you.

The Bottleneck Nobody Expects

Here's the ceiling I'm hitting right now: RAM.

Each agent needs its own worktree. Each worktree needs its own node_modules. Each agent runs builds, type checks, tests. Five agents running simultaneously means five parallel TypeScript compilers, five test runners, five sets of dependencies loaded into memory.

My Mac Mini with 16GB tops out at 4-5 agents before it starts swapping — and I have to hope they don't all decide to build at the same time.

So I bought a Mac Studio M4 Max with 128GB of RAM ($3,500) to power this system. It arrives at the end of March, and I'll share whether it's worth it.
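
In the meantime, one portable mitigation (my own sketch, not part of the author's setup) is to serialize just the heavy build/test steps across worktrees with a mkdir-based lock, so five agents can think in parallel but only one compiles at a time:

```shell
# mkdir is atomic, so it doubles as a cross-process mutex on macOS
# and Linux (flock isn't available on stock macOS).
LOCK=/tmp/agent-build.lock

with_build_lock() {
  until mkdir "$LOCK" 2>/dev/null; do sleep 5; done   # wait for the lock
  "$@"                                                # run the heavy step
  status=$?
  rmdir "$LOCK"                                       # release
  return $status
}

with_build_lock echo "pnpm build would run here"
```

This trades some wall-clock time for a hard cap on peak memory: agents queue at the lock instead of swapping the machine to death.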

Up Next: The One-Person Million-Dollar Company

We're going to see a ton of one-person million-dollar companies starting in 2026. The leverage is massive for those who understand how to build recursively self-improving agents.

This is what it looks like: an AI orchestrator as an extension of yourself (like what Zoe is to me), delegating work to specialized agents that handle different business functions. Engineering. Customer support. Ops. Marketing. Each agent focused on what it's good at. You maintain laser focus and full control.

The next generation of entrepreneurs won't hire a team of 10 to do what one person with the right system can do. They'll build like this — staying small, moving fast, shipping daily.

There's so much AI-generated slop right now. So much hype around agents and "mission controls" without building anything actually useful. Fancy demos with no real-world benefits.

I'm trying to do the opposite: less hype, more documentation of building an actual business. Real customers, real revenue, real commits that ship to production, and real loss too.

What am I building? Agentic PR — a one-person company taking on the enterprise PR incumbents. Agents that help startups get press coverage without a $10k/month retainer.

If you want to see how far I take this, follow along.
