
A self-evolving AI team is worth more than a stronger model

However smart a single AI agent is, on its own it's just a prompt with no context. The real leverage is in building a multi-agent system with shared memory, an experiment loop, and clear role division -- that compounds more reliably than waiting for the next model upgrade.

2026-03-16

Key takeaways

  • The bottleneck is architecture and closed loops, not model strength. Drawing on 9 years of growth experience, the author argues that 95% of people see no sustained results from AI -- not because models are too weak, but because they lack team coordination, an experiment loop, and shared memory. However smart a single agent is, it's just a prompt with tools; what actually changes output is role division, unified metrics, and knowledge reuse.
  • The Karpathy pattern: version-controlled strategy, not static prompts. Three files drive the loop: program.md (the permanent goal), strategy.md (the evolving strategy), and results.tsv (an append-only experiment log). A successful experiment locks in the strategy change; a failure is reverted and logged. This turns "strategy" from a vague prompt into a version-controlled code branch, so progress ratchets one way -- day 1 is the worst the system will ever be.
  • Multi-model routing beats a single big model. Cheap small models (Mistral/Qwen) handle structuring and creative divergence; Claude is called only for decisions that cascade. One full cycle across 11 agents costs $0.009, and small models can produce 2-3x the output of larger models on the right tasks.
  • The human is deliberately the system's bottleneck. A two-phase cycle: Phase 1 researches and plans automatically, then stops for human approval over Telegram; only then does Phase 2 execute in parallel. This isn't a flaw but the current optimum -- AI takes on the energy-intensive divergence and execution, humans do the low-effort, high-leverage convergence and decisions.
  • Realistic expectations: 50-75% automation is the current ceiling. The author is upfront that agents can't complete complex work 100% end to end, but a well-architected swarm carrying 50-75% of the heavy lifting is already very good. That restraint is more actionable than the "full autonomy" fantasy.

Why it matters to us

  • For growth leads: the growth team can be productized into a repeatable engine -- feed it a north star metric plus brand rules, and every day it runs research → plan → approve → execute → log to the knowledge base → optimize the next cycle. It's a path from pure hands-on consulting to standardized, high-margin work, and a way to move growth from experience-driven to data-driven.
  • For product managers: derive the tech stack backwards -- first pin down the long-horizon capabilities you need (experiment loops, scalable sandboxed execution, multi-model support), then pick a framework, not the other way around. And deliberately place human-approval gates at key decision points; that's more pragmatic than chasing full automation.
  • For investors: this architecture turns AI from a one-off tool into a compounding system. If "the experiment loop actually keeps lifting the metric" can be validated, that's the real leverage of applied AI -- not stronger models, but better organization. It replicates to any business that needs continuous optimization (growth, ad buying, community ops, sales scripts).

Discussion starters

  • The author claims "architecture > model" is the leverage, but offers no controlled comparison: on the same task, which actually performs better -- a stronger model with a simple pipeline, or an ordinary model with a complex architecture? So far every conclusion amounts to "the pipeline runs," with no business-metric comparison.
  • The post says the swarm will "ratchet" via the shared knowledge base, but admits the experiment loop still needs 30+ cycles for real data. When can that assumption be tested? And if it fails (e.g. agents amplify each other's bad strategies), how would the system self-correct?
  • Human approval is a required gate in the two-phase cycle. But when many agents generate proposals concurrently, does human approval become the new, worse bottleneck? How do you decide when it's safe to flip `auto_approve=True`?


where this started 36h ago.

i've been running growth experiments for 9 years. growth hacking, distribution, conversion optimization. the loop is always the same: hypothesis, test, measure, keep or kill, repeat.

the problem was never the ideas. it was the velocity. a human team runs maybe 2-5 experiments per week. most of your time goes to coordination, not execution. research doesn't talk to analytics. the writer doesn't know what worked last week. context lives in people's heads and dies in slack threads.

i tried fixing this with AI agents. tried OpenClaw. tried standalone agents with tool use. tried the whole "AI employee" playbook.

it kept failing for the same reason.

an agent without a team is just a prompt with no context.

nobody hires a writer with no research team, no analytics, no strategy, no feedback loop. you don't hire individuals. you build teams.

so that's what i did.

teams/growth/         ← live now
teams/trading/        ← next
teams/influencer/     ← planned
teams/your-team/      ← you tell me

what i built

a swarm framework where multiple agents operate as a team. each agent has its own role, tools, MCP access, specialized LLM model, and context window. they share knowledge, hand off work, and learn from each other's results.

hermes sits on top as the operator. it can control the swarm, override actions, delegate tasks, and it learns from the agents underneath it. hermes + the swarm get smarter together.

                              ┌─────────────────┐
                              │     HERMES      │
                              │   (operator)    │
                              └────────┬────────┘
                                       │
                              ┌────────▼────────┐
                              │  ORCHESTRATOR   │
                              └──┬───────────┬──┘
                                 │           │
              ┌──────────────────▼──┐   ┌────▼────────────────────────┐
              │  PHASE 1: PLAN      │   │  PHASE 2: EXECUTE           │
              │  (sequential)       │   │  (parallel, after approval) │
              │                     │   │                             │
              │  research → plan    │   │  writers · designers ·      │
              │  → approve/reject   │   │  video · newsletter ·       │
              └─────────────────────┘   │  repurpose                  │
                                        └─────────────────────────────┘

              ┌─────────────────────────────────────────────┐
              │  SHARED LAYER                               │
              │  knowledge store · model router ·           │
              │  experiment engine · task management        │
              └─────────────────────────────────────────────┘

the core idea: one folder = one team.

```python
# experiment.py

VERDICT_THRESHOLD = 0.20  # 20% improvement = meaningful

def evaluate_experiment(experiment, results):
    if len(results) < experiment.sample_size_needed:
        return "running"
    avg_metric = sum(r.metric_value for r in results) / len(results)
    improvement = (avg_metric - experiment.baseline) / experiment.baseline
    if improvement > VERDICT_THRESHOLD:
        return "keep"
    elif improvement < -VERDICT_THRESHOLD:
        return "discard"
    else:
        return "inconclusive"
```
python
# experiment.py

VERDICT_THRESHOLD = 0.20  # 20% improvement = meaningful

def evaluate_experiment(self, experiment, results):
    if len(results) < experiment.sample_size_needed:
        return "running"
    avg_metric = sum(r.metric_value for r in results) / len(results)
    improvement = (avg_metric - experiment.baseline) / experiment.baseline
    if improvement > self.VERDICT_THRESHOLD:
        return "keep"
    elif improvement < -self.VERDICT_THRESHOLD:
        return "discard"
    else:
        return "inconclusive"

create a folder, write the configs, start the engine. that's it.
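the configs are lightweight. the repo tree at the end of the post lists the `config.json` fields per agent (model, tools, metric, schedule, lenses); a hypothetical config in that shape -- every value here is illustrative, not from the actual repo -- could look like:

```python
import json

# hypothetical config.json for a research-analyst agent; field names follow
# the repo tree's comment, all values are made up for illustration
config = {
    "role": "research-analyst",
    "model": "mistralai/mistral-nemo",    # assumed openrouter-style model id
    "tools": ["web_search", "defi_data"],
    "metric": "qualified_angles_per_day", # this agent's north star
    "schedule": "0 7 * * *",              # cron: every morning, before phase 1
    "lenses": ["DeFi", "growth"],
}

print(json.dumps(config, indent=2))
```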

why Hermes over OpenClaw?

let me be honest. on paper, these two look similar. both are persistent agents. both have SOUL.md. both have skills, cron, memory, multi-platform messaging, MCP. both self-hosted.

we've probably all tried openclaw, right? some love it, some hate it.

but here's why i switched.

python vs node.

openclaw is node/JS. hermes is python. when you're building ML/AI infrastructure -- multi-model routing, experiment loops, knowledge stores, async orchestration -- python is the native stack. every library i need (httpx, asyncio, apscheduler, numpy if needed) is a pip install away. building my engine on top of a node runtime would have meant fighting the ecosystem at every step. this alone was 60% of the decision.

execution sandboxing.

openclaw runs on a node process bound to localhost. hermes gives you five execution backends: local, docker, SSH, singularity, modal. container hardening with read-only root filesystems, dropped capabilities, namespace isolation. when you're running 30+ agents with tool access that can write files, call APIs, and execute code on a VPS, this gap matters. hermes treats sandboxing as core infrastructure, not an afterthought.

sub-agent isolation.

hermes spawns isolated subagents with their own conversations, terminals, and python RPC scripts. zero context cost -- the parent doesn't lose context when a child runs. openclaw has multi-agent routing, but it's session-based isolation, not execution isolation. when my growth lead needs to delegate research to three writers in parallel, hermes handles that natively without polluting any context window.

memory architecture.

openclaw's memory is file-based markdown (conversation logs + curated long-term). works fine for single-agent use. but it flushes working memory on restart by default -- there's a known issue where people lose days of agent context to silent compaction. hermes has persistent memory + auto-generated skills that survive restarts. combined with my QMD knowledge store (BM25 + vector + LLM reranking), the total memory architecture is: hermes remembers how to operate + QMD remembers what the team has learned. two layers, both persistent.

SOUL.md: same concept, different execution.

both have SOUL.md. but hermes reloads it every single message. i update the swarm roster at 2am, hermes picks it up on the next interaction. no restart, no recompilation, no cache invalidation. openclaw needs a process restart for some config changes. when you're iterating on a swarm of 11 agents, hot-reload isn't a nice-to-have. it's how you stay sane.
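the per-message reload is cheap to approximate yourself. a sketch (not hermes' actual code) that re-reads SOUL.md only when its mtime changes:

```python
import os

class SoulLoader:
    """Re-read SOUL.md before each message if it changed on disk."""

    def __init__(self, path="SOUL.md"):
        self.path = path
        self._mtime = None
        self._text = ""

    def current(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:      # file edited since the last message
            with open(self.path) as f:
                self._text = f.read()
            self._mtime = mtime
        return self._text
```

call `current()` at the top of every message handler; an edit at 2am shows up on the next interaction, no restart needed.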

research and RL pipeline.

hermes has batch trajectory generation with parallel workers, checkpointing, and Atropos RL training integration. ShareGPT export for fine-tuning. this is NousResearch DNA. they build training infrastructure. if you want your operator to eventually get fine-tuned on your own task data, hermes has the pipeline. openclaw doesn't.

the honest tradeoff.

openclaw has a better consumer UX and a bigger ecosystem, and for now it's more polished for personal assistant use cases.

but i'm not building just a personal assistant. i'm building an engine that orchestrates an infinite number of specialized agents across 5 models with experiment loops and a shared knowledge store. for that, hermes' python stack, execution isolation, and research pipeline are what matter.

hermes is the operator. the swarm is the team. you talk to hermes. hermes coordinates everything underneath.

the engine: how it actually works

two-phase daily cycle (for now).

phase 1 runs automatically every morning. research analyst scans, growth lead picks the angle and assigns work. then it stops and waits for you. you approve over coffee via telegram.

phase 2 fires -- writers execute in parallel, visuals get generated, everything saves to the knowledge store.

you're the bottleneck by design. until you trust it enough to flip auto_approve=True.
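the gate itself is a few lines. a sketch of how the two phases hinge on approval -- the function names are mine, not the orchestrator's real API:

```python
def run_daily_cycle(plan_phase, execute_phase, ask_human, auto_approve=False):
    """Phase 1 always runs; phase 2 only runs past the human gate."""
    plan = plan_phase()                    # phase 1: research -> plan, sequential
    if auto_approve or ask_human(plan):    # the human bottleneck, by design
        return execute_phase(plan)         # phase 2: writers etc., in parallel
    return None                            # rejected: nothing ships

# toy run: approve everything
out = run_daily_cycle(
    plan_phase=lambda: "3 linkedin posts on DeFi yields",
    execute_phase=lambda plan: f"executed: {plan}",
    ask_human=lambda plan: True,
)
```

flipping `auto_approve=True` short-circuits the gate without touching the rest of the cycle.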

the karpathy pattern (QMD + program.md)

this is the part that matters most.

i took karpathy's autoresearch loop -- the pattern he uses for automated ML research -- and applied it to growth.

every agent has three files:

  • program.md -- immutable goal + single north star metric
  • strategy.md -- the "editable thing" that evolves based on results
  • results.tsv -- append-only experiment log

the loop: agent reads its current strategy. proposes an experiment. executes. results get measured. metric improved? keep the strategy change. didn't improve? revert. log the failure. try something else.
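in code, the ratchet is tiny. a sketch with stand-in helpers (the real loop edits strategy.md and appends to results.tsv):

```python
def ratchet_step(strategy, propose, run, baseline, log):
    """One cycle: try a strategy change, keep it only if the metric improved."""
    candidate = propose(strategy)   # agent edits its strategy
    score = run(candidate)          # execute the experiment, measure the metric
    log(candidate, score)           # results.tsv is append-only: log either way
    return candidate if score > baseline else strategy  # keep or revert

history = []
log = lambda cand, score: history.append((cand, score))

# a winning change is kept...
s = ratchet_step("hook-A", propose=lambda s: "hook-B",
                 run=lambda c: 0.031, baseline=0.025, log=log)
# ...a losing one is reverted, but still logged
s = ratchet_step(s, propose=lambda s: "hook-C",
                 run=lambda c: 0.019, baseline=0.031, log=log)
```

after both cycles `s` is still "hook-B": progress only moves one way, which is why day 1 is the worst day.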

this is where QMD shines. every artifact -- research scans, content drafts, performance data, experiment verdicts, strategy decisions -- gets saved to a shared knowledge store. QMD indexes it with hybrid search (BM25 + vector + LLM reranking). runs locally.
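a toy of the hybrid half (QMD's real pipeline also layers LLM reranking on top): fuse the BM25 ranking and the vector ranking with reciprocal rank fusion, so documents both retrievers like float to the top:

```python
def rrf_fuse(bm25_ranked, vector_ranked, k=60):
    """Reciprocal rank fusion of two ranked lists of doc ids."""
    scores = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "exp-014" leads both rankings, so it wins the fused list
fused = rrf_fuse(
    bm25_ranked=["exp-014", "draft-3", "scan-9"],
    vector_ranked=["exp-014", "scan-9", "verdict-2"],
)
```

the doc ids here are invented; the point is the fusion shape, not QMD's exact scoring.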

when any agent runs, it doesn't just see its own history. it sees what every other agent in the team has produced:

the swarm doesn't get smarter because the models improve. it gets smarter because the strategies ratchet. day 1 is the worst it will ever be.

why this matters for growth specifically: after 9 years of running experiments manually, i know the bottleneck is never "we don't have ideas." it's "we don't learn fast enough from what we already tried." this architecture makes every experiment's outcome available to every agent, permanently.

multi-model routing

not every task needs claude. routing to the cheapest model that handles each task type is where costs drop from hundreds to single digits.

smaller models do 2-3x the output of larger models when you route them to the right task. mistral nemo at $0.02/M handles routing and structuring. qwen handles creative writing at $0.26/M. claude only gets called for decisions that cascade.

one full cycle across 11 agents: $0.009. in reality, i expect the number to jump a bit once it's fine-tuned for quality. but still cheap.
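the router is mostly a lookup table. a sketch using the prices from the post -- the claude price and the task-type names are my placeholders:

```python
# (model, $ per 1M tokens) per task type; cheapest capable model wins
ROUTES = {
    "structuring": ("mistral-nemo", 0.02),
    "routing":     ("mistral-nemo", 0.02),
    "writing":     ("qwen",         0.26),
    "decision":    ("claude",       3.00),  # placeholder price; cascading calls only
}

def route(task_type):
    # unknown task types fall back to the strongest model
    return ROUTES.get(task_type, ROUTES["decision"])[0]

def cycle_cost(token_counts):
    """token_counts: {task_type: tokens used} -> dollars for one cycle."""
    return sum(ROUTES.get(t, ROUTES["decision"])[1] * n / 1_000_000
               for t, n in token_counts.items())

# most tokens flow through the cheap models; claude only sees the decision
cost = cycle_cost({"structuring": 150_000, "writing": 20_000, "decision": 1_000})
```

with that split the cycle lands around a penny, which is the shape of the $0.009 number: volume on cheap models, judgment on the expensive one.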

how it fits into a real workflow

this isn't a toy. it plugs into how teams actually operate.

the swarm integrates with ClickUp (or Notion, or whatever PM tool you use). deliverables land there. if no tasks are assigned, agents are off doing their own work -- researching, experimenting, optimizing toward the north star metric you've set for them. if something urgent comes in, you assign it through hermes (telegram, slack, CLI) and it delegates to the right agent in the swarm.

over time you build a social graph of your team. hermes + the swarm understand the agents better. understand what works. get smarter together.

what teams can you build?

the engine doesn't care what the team does. it cares about: agents, tools, metrics, experiments.

  • growth team (what's running now)
  • content swarm
  • strategy swarm
  • AI influencer team
  • discord / community management swarm
  • engineering team
  • quant trading swarm
  • whatever you envision

it all comes down to finetuning the swarms and training them through the experiment loop. write the configs, define the north star metric, let the autoresearch pattern do its job.

being honest about agents

let me say what nobody in the AI agent space wants to say.

if you've been building with AI agents... you know how they really are. not how content farmers describe them. not the "fully autonomous" fantasy. the reality.

agents are not doing 100% of complex work end to end. not today. i'd say if you expect 50-75% of the heavy lifting from a well-architected swarm, that's realistic. over time, as models and tooling improve, maybe we push toward 100%. but right now, pretending otherwise is dishonest.

95% of people use AI. maybe 5% see tangible, compounding results from it.

the question i'm trying to answer: how do we move that number? how do we go from 5% seeing real results to 10%, 15%, 20%?

my bet: it's not about better models. it's about better architecture. coordination. shared context. experiment loops. teams that learn and don't give up.

everyone is solving for "make the individual agent smarter." i think the leverage is in making agents work together and learn from each other.

what's live and what's not

this was a two-day sprint for the hackathon. i'm being upfront about what works and what doesn't.

working end-to-end:

  • orchestrator with two-phase cycle (research, plan, approve, execute)
  • multi-model router (5 models via openrouter)
  • research analyst (perplexity + DeFi data enrichment + structuring)
  • growth lead (strategic planning via claude sonnet)
  • linkedin writer + twitter writer (qwen 3.5 plus)
  • knowledge store with 7 QMD collections
  • 19 MCP tools wired to hermes (stdio + HTTP)
  • approval flow via telegram / slack
  • ClickUp integration for task management

needs more work:

  • experiment ratcheting (infrastructure wired, needs 30+ cycles for real data)
  • upgrading individual agents' skills, tools, and models

the architecture is real. the coordination works. the engine runs. i'll spend more time polishing it and making it production-ready before sharing publicly.

it's a big build.

the bigger picture

this journey started from wanting to build something for my team at work. then i realized the pattern is universal.

if your north star metric is X, the swarm keeps optimizing for it. learns from mistakes. cross-shares context and learnings. every cycle, every experiment, every failure makes the next run better.

the agent era isn't about building better individual agents. it's about building teams that coordinate, learn, and compound.

teams/rabin/
├── program.md           # mission + constraints + voice rules
├── brand-kit.md         # brand identity for writers
├── agents/
│   ├── research-analyst/
│   │   └── config.json  # model, tools, metric, schedule, lenses
│   ├── growth-lead/
│   ├── linkedin-writer/
│   ├── twitter-writer/
│   ├── visual-designer/
│   ├── analytics-agent/
│   └── ...11 agents
└── results/
    └── [agent-id]/
        ├── strategy.md  # evolves over time
        └── results.tsv  # experiment log

same engine. different configs. a swarm that grows with you.

self-hosted. built on @NousResearch @Teknium hermes + QMD by @tobi + @karpathy autoresearch pattern.

the engine is what matters. the MCPs are what matters. the experiment loop is what matters.

go build your own. if not, just wait until i make the repo public and open source.

gg


where this started 36h ago.

i've been running growth experiments for 9 years. growth hacking, distribution, conversion optimization. the loop is always the same: hypothesis, test, measure, keep or kill, repeat.

the problem was never the ideas. it was the velocity. a human team runs maybe 2-5 experiments per week. most of your time goes to coordination, not execution. research doesn't talk to analytics. the writer doesn't know what worked last week. context lives in people's heads and dies in slack threads.

i tried fixing this with AI agents. tried OpenClaw. tried standalone agents with tool use. tried the whole "AI employee" playbook.

it kept failing for the same reason.

an agent without a team is just a prompt with no context.

nobody hires a writer with no research team, no analytics, no strategy, no feedback loop. you don't hire individuals. you build teams.

so that's what i did.

teams/growth/         ← live now
teams/trading/        ← next
teams/influencer/     ← planned
teams/your-team/      ← you tell me

what i built

a swarm framework where multiple agents operate as a team. each agent has its own role, tools, MCP access, specialized LLM model, and context window. they share knowledge, hand off work, and learn from each other's results.

hermes sits on top as the operator. it can control the swarm, override actions, delegate tasks, and it learns from the agents underneath it. hermes + the swarm get smarter together.

                              ┌────────────────┐
                              │     HERMES     │
                              │   (operator)   │
                              └───────┬────────┘
                                      │
                              ┌───────▼────────┐
                              │  ORCHESTRATOR  │
                              └──┬─────────┬───┘
                                 │         │
              ┌──────────────────▼──┐  ┌───▼─────────────────────────┐
              │  PHASE 1: PLAN      │  │  PHASE 2: EXECUTE           │
              │  (sequential)       │  │  (parallel, after approval) │
              │                     │  │                             │
              │  research → plan    │  │  writers · designers ·      │
              │  → approve/reject   │  │  video · newsletter ·       │
              └─────────────────────┘  │  repurpose                  │
                                       └─────────────────────────────┘

              ┌─────────────────────────────────────────────┐
              │  SHARED LAYER                               │
              │  knowledge store · model router ·           │
              │  experiment engine · task management        │
              └─────────────────────────────────────────────┘

the core idea: one folder = one team.

python
# experiment.py

class ExperimentEngine:
    VERDICT_THRESHOLD = 0.20  # 20% improvement = meaningful

    def evaluate_experiment(self, experiment, results):
        # not enough samples yet: keep the experiment running
        if len(results) < experiment.sample_size_needed:
            return "running"
        avg_metric = sum(r.metric_value for r in results) / len(results)
        improvement = (avg_metric - experiment.baseline) / experiment.baseline
        if improvement > self.VERDICT_THRESHOLD:
            return "keep"
        elif improvement < -self.VERDICT_THRESHOLD:
            return "discard"
        else:
            return "inconclusive"

create a folder, write the configs, start the engine. that's it.

why Hermes over OpenClaw?

let me be honest. on paper, these two look similar. both are persistent agents. both have SOUL.md. both have skills, cron, memory, multi-platform messaging, MCP. both self-hosted.

most of us have probably tried openclaw, right? some love it, some hate it.

but here's why i switched.

python vs node.

openclaw is node/JS. hermes is python. when you're building ML/AI infrastructure -- multi-model routing, experiment loops, knowledge stores, async orchestration -- python is the native stack. every library i need (httpx, apscheduler, numpy if needed) is a pip install away, and asyncio ships with the standard library. building my engine on top of a node runtime would have meant fighting the ecosystem at every step. this alone was 60% of the decision.

execution sandboxing.

openclaw runs on a node process bound to localhost. hermes gives you five execution backends: local, docker, SSH, singularity, modal. container hardening with read-only root filesystems, dropped capabilities, namespace isolation. when you're running 30+ agents with tool access that can write files, call APIs, and execute code on a VPS, this gap matters. hermes treats sandboxing as core infrastructure, not an afterthought.

sub-agent isolation.

hermes spawns isolated subagents with their own conversations, terminals, and python RPC scripts. zero context cost -- the parent doesn't lose context when a child runs. openclaw has multi-agent routing, but it's session-based isolation, not execution isolation. when my growth lead needs to delegate research to three writers in parallel, hermes handles that natively without polluting any context window.

memory architecture.

openclaw's memory is file-based markdown (conversation logs + curated long-term). works fine for single-agent use. but it flushes working memory on restart by default -- there's a known issue where people lose days of agent context to silent compaction. hermes has persistent memory + auto-generated skills that survive restarts. combined with my QMD knowledge store (BM25 + vector + LLM reranking), the total memory architecture is: hermes remembers how to operate + QMD remembers what the team has learned. two layers, both persistent.

SOUL.md: same concept, different execution.

both have SOUL.md. but hermes reloads it every single message. i update the swarm roster at 2am, hermes picks it up on the next interaction. no restart, no recompilation, no cache invalidation. openclaw needs a process restart for some config changes. when you're iterating on a swarm of 11 agents, hot-reload isn't a nice-to-have. it's how you stay sane.

research and RL pipeline.

hermes has batch trajectory generation with parallel workers, checkpointing, and Atropos RL training integration. ShareGPT export for fine-tuning. this is NousResearch DNA. they build training infrastructure. if you want your operator to eventually get fine-tuned on your own task data, hermes has the pipeline. openclaw doesn't.

the honest tradeoff.

openclaw has a better consumer UX and a bigger ecosystem; it's polished for personal assistant use cases, for now.

but i'm not building just a personal assistant. i'm building an engine that orchestrates an arbitrary number of specialized agents across 5 models with experiment loops and a shared knowledge store. for that, hermes' python stack, execution isolation, and research pipeline are what matter.

hermes is the operator. the swarm is the team. you talk to hermes. hermes coordinates everything underneath.

the engine: how it actually works

two-phase daily cycle, for now.

phase 1 runs automatically every morning. research analyst scans, growth lead picks the angle and assigns work. then it stops and waits for you. you approve over coffee via telegram.

phase 2 fires -- writers execute in parallel, visuals get generated, everything saves to the knowledge store.

you're the bottleneck by design, until you trust it enough to flip auto_approve=True.
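the two-phase cycle with its human gate can be sketched roughly like this -- agent names, the `request_approval` hook, and the `auto_approve` flag are illustrative, not the engine's actual API:

```python
import asyncio

async def daily_cycle(swarm, auto_approve=False):
    # phase 1: sequential research, then planning
    research = await swarm["research-analyst"].run()
    plan = await swarm["growth-lead"].plan(research)  # {agent_name: task}

    # the deliberate human bottleneck: block until someone approves
    if not auto_approve:
        if not await swarm.request_approval(plan):   # e.g. a telegram prompt
            return None                              # rejected plans never execute

    # phase 2: fan out to all assigned agents in parallel
    return await asyncio.gather(
        *(swarm[name].execute(task) for name, task in plan.items())
    )
```

the key design choice is that the approval await sits between the phases, so nothing downstream can fire without a human (or an explicit flag flip) in the loop.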

the karpathy pattern (QMD + program.md)

this is the part that matters most.

i took karpathy's autoresearch loop -- the pattern he uses for automated ML research -- and applied it to growth.

every agent has three files:

  • program.md -- immutable goal + single north star metric

  • strategy.md -- the "editable thing" that evolves based on results

  • results.tsv -- append-only experiment log

the loop: agent reads its current strategy. proposes an experiment. executes. results get measured. metric improved? keep the strategy change. didn't improve? revert. log the failure. try something else.
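a minimal sketch of that ratchet, assuming a per-agent folder with the three files above; the `propose` and `run_experiment` callables and the helper names are illustrative:

```python
from pathlib import Path

def ratchet_cycle(agent_dir, propose, run_experiment, baseline):
    """One autoresearch cycle: mutate strategy, measure, keep or revert."""
    strategy_file = Path(agent_dir) / "strategy.md"
    log_file = Path(agent_dir) / "results.tsv"

    old_strategy = strategy_file.read_text()
    new_strategy = propose(old_strategy)        # agent proposes a change
    strategy_file.write_text(new_strategy)

    metric = run_experiment(new_strategy)       # execute + measure

    if metric > baseline:
        verdict = "keep"                        # lock in the improvement
    else:
        verdict = "revert"
        strategy_file.write_text(old_strategy)  # roll back the strategy change

    # append-only log: failures are knowledge too
    with log_file.open("a") as f:
        f.write(f"{metric}\t{verdict}\n")
    return verdict, max(metric, baseline)       # the baseline only ratchets up
```

because the baseline is `max(metric, baseline)` and failed strategies get rolled back, the system can't regress: day 1 really is the worst it will ever be.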

this is where QMD shines. every artifact -- research scans, content drafts, performance data, experiment verdicts, strategy decisions -- gets saved to a shared knowledge store. QMD indexes it with hybrid search (BM25 + vector + LLM reranking). runs locally.
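QMD's internals aren't shown here, but hybrid retrieval generally fuses a lexical ranking and a vector ranking before handing the short list to an LLM reranker. a toy fusion step using reciprocal rank fusion (RRF) -- my stand-in for whatever QMD actually does:

```python
def hybrid_rank(bm25_hits, vector_hits, k=60, top_n=5):
    """Fuse two ranked lists of doc ids with reciprocal rank fusion;
    the fused top-n would then go to an LLM reranker."""
    scores = {}
    for hits in (bm25_hits, vector_hits):
        for rank, doc_id in enumerate(hits):
            # docs that rank high in either list accumulate more score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```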

when any agent runs, it doesn't just see its own history. it sees what every other agent in the team has produced.

the swarm doesn't get smarter because the models improve. it gets smarter because the strategies ratchet. day 1 is the worst it will ever be.

why this matters for growth specifically: after 9 years of running experiments manually, i know the bottleneck is never "we don't have ideas." it's "we don't learn fast enough from what we already tried." this architecture makes every experiment's outcome available to every agent, permanently.

multi-model routing

not every task needs claude. routing to the cheapest model that handles each task type is where costs drop from hundreds to single digits.

smaller models can produce 2-3x the output of larger models when you route them to the right task. mistral nemo at $0.02/M tokens handles routing and structuring. qwen handles creative writing at $0.26/M tokens. claude only gets called for decisions that cascade.

one full cycle across 11 agents: $0.009. in reality i expect that number to climb once the swarm is tuned for quality, but it will still be cheap.
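routing by task type is mostly a lookup plus a cost guard. a minimal sketch -- the prices come from the post, but the routing table and function names are illustrative:

```python
# $/M input tokens, approximate figures from the post
MODEL_COSTS = {"mistral-nemo": 0.02, "qwen": 0.26, "claude-sonnet": 3.00}

# which model each task type needs -- an illustrative mapping
ROUTES = {
    "structuring": "mistral-nemo",
    "routing": "mistral-nemo",
    "creative_writing": "qwen",
    "cascading_decision": "claude-sonnet",  # only decisions that cascade
}

def route(task_type, est_tokens):
    """Pick the cheapest model that handles this task type, with its cost."""
    model = ROUTES.get(task_type, "mistral-nemo")  # default to the cheapest
    cost = MODEL_COSTS[model] * est_tokens / 1_000_000
    return model, cost
```

at these prices, a cycle where claude only sees a few thousand tokens stays in fractions of a cent, which is how an 11-agent run can land near $0.009.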

how it fits into a real workflow

this isn't a toy. it plugs into how teams actually operate.

the swarm integrates with ClickUp (or Notion, or whatever PM tool you use). deliverables land there. if no tasks are assigned, agents are off doing their own work -- researching, experimenting, optimizing toward the north star metric you've set for them. if something urgent comes in, you assign it through hermes (telegram, slack, CLI) and it delegates to the right agent in the swarm.

over time you build a social graph of your team. hermes + the swarm understand the agents better. understand what works. get smarter together.

what teams can you build?

the engine doesn't care what the team does. it cares about: agents, tools, metrics, experiments.

  • growth team (what's running now)

  • content swarm

  • strategy swarm

  • AI influencer team

  • discord / community management swarm

  • engineering team

  • quant trading swarm

  • whatever you envision

it all comes down to fine-tuning the swarms and training them through the experiment loop. write the configs, define the north star metric, let the autoresearch pattern do its job.
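a config.json for one agent might look like this -- the field names mirror the comment in the folder tree (model, tools, metric, schedule, lenses), but the exact schema is my guess, not the engine's real format:

```json
{
  "role": "twitter-writer",
  "model": "qwen-3.5-plus",
  "tools": ["qmd_search", "telegram_notify"],
  "metric": "engagement_rate",
  "schedule": "daily",
  "lenses": ["brand-kit.md", "program.md"]
}
```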

being honest about agents

let me say what nobody in the AI agent space wants to say.

if you've been building with AI agents... you know how they really are. not how content farmers describe them. not the "fully autonomous" fantasy. the reality.

agents are not doing 100% of complex work end to end. not today. i'd say if you expect 50-75% of the heavy lifting from a well-architected swarm, that's realistic. over time, as models and tooling improve, maybe we push toward 100%. but right now, pretending otherwise is dishonest.

95% of people use AI. maybe 5% see tangible, compounding results from it.

the question i'm trying to answer: how do we move that number? how do we go from 5% seeing real results to 10%, 15%, 20%?

my bet: it's not about better models. it's about better architecture. coordination. shared context. experiment loops. teams that learn and don't give up.

everyone is solving for "make the individual agent smarter." i think the leverage is in making agents work together and learn from each other.

what's live and what's not

this was a two-day sprint for the hackathon. i'm being upfront about what works and what doesn't.

working end-to-end:

  • orchestrator with two-phase cycle (research, plan, approve, execute)

  • multi-model router (5 models via openrouter)

  • research analyst (perplexity + DeFi data enrichment + structuring)

  • growth lead (strategic planning via claude sonnet)

  • linkedin writer + twitter writer (qwen 3.5 plus)

  • knowledge store with 7 QMD collections

  • 19 MCP tools wired to hermes (stdio + HTTP)

  • approval flow via telegram / slack

  • ClickUp integration for task management

needs more work:

  • experiment ratcheting (infrastructure wired, needs 30+ cycles for real data)

  • upgrading individual agent skills, tools, and models

the architecture is real. the coordination works. the engine runs. i'll spend more time polishing it and making it production-ready before sharing publicly.

it's a big build.

the bigger picture

this journey started from wanting to build something for my team at work. then i realized the pattern is universal.

if your north star metric is X, the swarm keeps optimizing for it. learns from mistakes. cross-shares context and learnings. every cycle, every experiment, every failure makes the next run better.

the agent era isn't about building better individual agents. it's about building teams that coordinate, learn, and compound.

teams/rabin/
├── program.md           # mission + constraints + voice rules
├── brand-kit.md         # brand identity for writers
├── agents/
│   ├── research-analyst/
│   │   └── config.json  # model, tools, metric, schedule, lenses
│   ├── growth-lead/
│   ├── linkedin-writer/
│   ├── twitter-writer/
│   ├── visual-designer/
│   ├── analytics-agent/
│   └── ...11 agents
└── results/
    └── [agent-id]/
        ├── strategy.md  # evolves over time
        └── results.tsv  # experiment log

same engine. different configs. a swarm that grows with you.

self-hosted. built on @NousResearch @Teknium hermes + QMD by @tobi + @karpathy autoresearch pattern.

the engine is what matters. the MCPs are what matters. the experiment loop is what matters.

go build your own, if not then just wait until i make the repo public & open source.

gg

📋 Discussion Archive

Discussion in progress…