🪞 Uota studies · 🧠 Atou studies

Cracking Open Clawdbot's Shell: An AI Engineer's Architecture Notes

There is no black magic in Clawdbot's architecture: a TypeScript CLI + lane serialization + JSONL memory + Markdown files. It is almost unsettlingly simple, and it is precisely this "explainable simplicity" that makes it work.

2026-01-31 · Original link ↗

Key Takeaways

  • Lane serialization is an underrated architectural decision. Serial by default, parallel only when explicit: this is not laziness but a deep understanding of agent reliability. Multiple agents writing shared state in parallel is debugging hell; Cognition's "Don't Build Multi-Agents" post makes the same point. Uota's subagent model naturally follows this path, which validates the direction.
  • The memory system is surprisingly simple. No memory merging, no time-based decay, no forgetting curve: just JSONL session logs + Markdown files + hybrid vector/keyword retrieval. The author says they "prefer explainable simplicity," which is nearly isomorphic to Uota's current memory/ directory + MEMORY.md approach. The catch: without a forgetting curve, old memories are never down-weighted, so noise accumulates over long-term operation.
  • Semantic snapshots suit agents better than screenshots. The browser tool uses a text representation of the ARIA accessibility tree rather than screenshots. This is better for both token efficiency and reliability: screenshots need a vision model to parse, while text snapshots can be read directly by a language model. This is the right direction for browser automation.
  • The security model is "trust the user's judgment," not "system-enforced." A command allowlist plus approval prompts, with dangerous constructs blocked by default. It essentially hands security decisions to the user. That is fine for a personal assistant, but for team-level deployment this model falls far short.

Relevance to Our Work

  • 🪞 Uota: The lane-serialization idea can be borrowed directly for Uota's task scheduling. Subagent spawns are already isolated, but multi-task queuing inside the main session has no explicit lane concept yet. If Uota needs to handle more concurrent requests in the future, this is an architectural point worth designing ahead of time.
  • 🧠 Neta: If Neta builds user-facing AI agents (for example, giving users' OC characters memory and proactive behavior), Clawdbot's memory scheme is a minimal reference: JSONL + Markdown + hybrid retrieval runs fine without a complex memory graph. But mind the long-term memory noise problem.

Discussion Starters

💭 Clawdbot's memory has no forgetting curve: all old memories carry equal weight. For a personal assistant, is that a feature or a bug? Between Uota's manually curated MEMORY.md mode and automatic retrieval, which is better suited to long-term operation?

💭 "Serial by default, parallel explicitly" is close to consensus in agent architecture. But if a user sends five unrelated requests at once, is the latency of serial processing acceptable? When should the default be broken?

Everyone Talks About Clawdbot, but Here's How It Works

I took a look inside the architecture of Clawdbot (aka Moltbot): how it handles agent execution, tool use, the browser, and more. There are many lessons here for AI engineers.

Learning how Clawd works under the hood gives you a better understanding of the system and its capabilities, and most importantly, of what it is good at and bad at.

This started as personal curiosity about how Clawd handles its memory and how reliable it is.

In this article I'll go through a surface-level view of how Clawd works.

What Clawd Is, Technically

Everyone knows Clawd is a personal assistant you can run locally or through model APIs, and access as easily as from your phone. But what is it, really?

At its core, Clawdbot is a TypeScript CLI application.

It's not Python, not Next.js, and not a web app.

It's a process that:

runs on your machine and exposes a gateway server that handles all channel connections (Telegram, WhatsApp, Slack, etc.),

makes calls to LLM APIs (Anthropic, OpenAI, local models, etc.),

executes tools locally,

and does whatever you want on your computer.

The Architecture

To explain the architecture concretely, here's an example of what happens when you message Clawd, from your prompt all the way to the output.

When you prompt Clawd from a messenger, the following flow occurs:

  1. Channel Adapter

A channel adapter receives your message and processes it (normalization, attachment extraction). Each messenger and input stream has its own dedicated adapter.

  2. Gateway Server

The gateway server, acting as the task/session coordinator, takes your message and routes it to the right session. This is the heart of Clawd: it handles multiple overlapping requests.

To serialize operations, Clawd uses a lane-based command queue. Each session has its own dedicated lane, and low-risk, parallelizable tasks (such as cron jobs) can run in parallel lanes.

This stands in contrast to async/await spaghetti. Over-parallelization hurts reliability and breeds swarms of debugging nightmares.

Default to serial, go parallel explicitly

If you've worked with agents, you've already felt this to some extent. It's also the insight from Cognition's "Don't Build Multi-Agents" blog post.

A naive async setup per agent leaves you with a dump of interleaved garbage output. Logs become unreadable, and once agents share state, race conditions become a constant fear you have to account for throughout development.

A lane is an abstraction over queues that makes serialization the default architecture rather than an afterthought. As a developer, you write code as usual, and the queue handles race conditions for you.

The mental model shifts from "what do I need to lock?" to "what's safe to parallelize?"
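The lane idea can be sketched in a few lines of TypeScript. This is a minimal illustration under my own assumptions, not Clawdbot's actual code; the `LaneQueue` class and its method names are invented:

```typescript
// Minimal sketch of a lane-based command queue (illustrative only).
// Within a lane, tasks run strictly in FIFO order; lanes are independent,
// so "parallel" work is simply work placed on a different lane.
type Task = () => void;

class LaneQueue {
  private lanes = new Map<string, Task[]>();

  enqueue(lane: string, task: Task): void {
    const queue = this.lanes.get(lane) ?? [];
    queue.push(task);
    this.lanes.set(lane, queue);
  }

  // Drain a single lane serially: no interleaving within the lane.
  drain(lane: string): void {
    for (const task of this.lanes.get(lane) ?? []) task();
    this.lanes.set(lane, []);
  }
}

const order: string[] = [];
const q = new LaneQueue();
// A session gets its own lane; cron-style work goes on another.
q.enqueue("session:alice", () => order.push("alice-1"));
q.enqueue("session:alice", () => order.push("alice-2"));
q.enqueue("cron", () => order.push("cron-1"));
q.drain("session:alice"); // alice-1 always precedes alice-2
q.drain("cron");
```

The point of the abstraction is visible in the demo: ordering within a session is guaranteed by construction, so nothing inside a task ever needs a lock.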

  3. Agent Runner

This is where the actual AI comes in. It decides which model to use and picks the corresponding API key (if none work, it marks that profile as in cooldown and tries the next one), and it falls back to a different model if the primary one fails.

The agent runner assembles the system prompt dynamically, including the available tools, skills, and memory, and then appends the session history (from a .jsonl file).

This is then passed to the context window guard, which checks whether there is enough context space left. If the context is nearly full, it either compacts the session (summarizing the context) or fails gracefully.
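A context window guard can be sketched as follows. The thresholds, names, and the crude characters-per-token estimate are all my assumptions, not Clawdbot's; a real guard would use the model's tokenizer:

```typescript
// Sketch of a context-window guard (hypothetical names and thresholds).
// A real implementation would count tokens with the model's tokenizer;
// here we approximate with character length / 4.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

type GuardResult = { action: "proceed" | "compact" | "fail"; tokens: number };

function guardContext(history: string[], limit: number): GuardResult {
  const tokens = history.reduce((sum, msg) => sum + estimateTokens(msg), 0);
  if (tokens < limit * 0.8) return { action: "proceed", tokens };
  // Nearly full: summarize the oldest turns instead of sending as-is.
  if (tokens < limit) return { action: "compact", tokens };
  // Over budget: fail gracefully rather than truncating mid-conversation.
  return { action: "fail", tokens };
}

const small = guardContext(["hi", "hello!"], 1000);
const large = guardContext(Array(100).fill("x".repeat(40)), 1000);
```

The design choice worth noting is that "compact" and "fail" are explicit outcomes, so the caller decides how to recover instead of silently losing context.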

  4. LLM API Call

The LLM call itself streams responses and abstracts over the different providers. It can also request extended thinking if the model supports it.

  5. Agentic Loop

If the LLM returns a tool call, Clawd executes it locally and adds the results to the conversation. This loop repeats until the LLM responds with final text or hits the maximum number of turns (default ~20).
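The loop itself is simple enough to sketch with a stubbed model. The types and function names here are invented for illustration; only the shape of the loop (tool results fed back until final text or a turn budget) follows the description above:

```typescript
// Sketch of the agentic loop with a stubbed LLM (names are invented).
// The loop feeds tool results back to the model until it emits final text
// or the turn budget (~20 by default, per the article) is spent.
type LlmReply =
  | { kind: "tool_call"; tool: string; args: string }
  | { kind: "text"; content: string };

function runAgentLoop(
  callLlm: (transcript: string[]) => LlmReply,
  runTool: (tool: string, args: string) => string,
  maxTurns = 20,
): string {
  const transcript: string[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = callLlm(transcript);
    if (reply.kind === "text") return reply.content; // final answer
    const result = runTool(reply.tool, reply.args);  // execute locally
    transcript.push(`tool:${reply.tool} -> ${result}`);
  }
  return "(max turns reached)";
}

// Fake model: asks for one tool call, then answers.
const answer = runAgentLoop(
  (t) => (t.length === 0
    ? { kind: "tool_call", tool: "exec", args: "date" }
    : { kind: "text", content: "done after " + t.length + " tool call(s)" }),
  (_tool, _args) => "2026-01-30",
);
```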

This is also where the magic happens:

Computer Use

which I'll get to later.

  6. Response Path

Pretty standard: the response travels back to you through the channel. The session is also persisted as basic JSONL, with each line a JSON object recording the user message, tool calls, results, responses, and so on. This is how Clawd remembers (session-based memory).
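The JSONL format is worth a quick sketch, since its append-friendliness is the whole appeal. The field names below are my guesses, not Clawdbot's actual schema:

```typescript
// Sketch of JSONL session persistence (field names are assumptions, not
// Clawdbot's real schema): one JSON object per line, appended as events occur.
type SessionEvent = { role: string; content: string; ts: number };

const toJsonl = (events: SessionEvent[]): string =>
  events.map((e) => JSON.stringify(e)).join("\n") + "\n";

const fromJsonl = (text: string): SessionEvent[] =>
  text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as SessionEvent);

const log = toJsonl([
  { role: "user", content: "what's on my calendar?", ts: 1 },
  { role: "tool", content: "calendar.read -> 2 events", ts: 2 },
  { role: "assistant", content: "You have 2 events today.", ts: 3 },
]);
// Append-friendly: even a partially written file still parses line by line.
const restored = fromJsonl(log);
```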

That covers the basic architecture.

Now let's look at some of the more critical components.

How Clawd Remembers

Without a proper memory system, an AI assistant is about as good as a goldfish. Clawd handles this through two systems:

Session transcripts saved as JSONL, as mentioned above.

Memory files saved as Markdown, in MEMORY.md or the memory/ folder.

For search, it uses a hybrid of vector retrieval and keyword matching, capturing the best of both worlds.

So a search for "authentication bug" finds both documents mentioning "auth issues" (semantic match) and content containing the exact phrase (keyword match).

Vector search uses SQLite; keyword search uses FTS5 (also a SQLite extension). The embedding provider is configurable.
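The merging logic of hybrid retrieval can be shown without SQLite at all. In this sketch both sides are simulated in memory: the keyword score is an exact-substring check standing in for FTS5, and a crude prefix-overlap ratio stands in for embedding cosine similarity (a real embedding would capture far more than prefixes). All names are invented:

```typescript
// Sketch of hybrid retrieval: a keyword score plus a "semantic" score,
// merged into one ranking. Clawdbot uses SQLite vectors + FTS5; here both
// sides are simulated so the merging logic is the focus.
type Doc = { id: string; text: string };

// Stand-in for an FTS5 exact-phrase match.
function keywordScore(query: string, doc: Doc): number {
  return doc.text.toLowerCase().includes(query.toLowerCase()) ? 1 : 0;
}

// Stand-in for embedding cosine similarity: prefix-overlap ratio,
// so "auth" still relates to "authentication".
function semanticScore(query: string, doc: Doc): number {
  const queryTokens = query.toLowerCase().split(/\W+/);
  const docTokens = doc.text.toLowerCase().split(/\W+/);
  const related = (a: string, b: string) =>
    a.length >= 4 && b.length >= 4 && (a.startsWith(b) || b.startsWith(a));
  const hits = docTokens.filter((w) => queryTokens.some((t) => related(t, w))).length;
  return docTokens.length === 0 ? 0 : hits / docTokens.length;
}

function hybridSearch(query: string, docs: Doc[]): Doc[] {
  return [...docs]
    .map((doc) => ({ doc, score: keywordScore(query, doc) + semanticScore(query, doc) }))
    .sort((a, b) => b.score - a.score)
    .map((r) => r.doc);
}

const docs: Doc[] = [
  { id: "a", text: "notes about auth issues in the login flow" },
  { id: "b", text: "fixed the authentication bug yesterday" },
  { id: "c", text: "grocery list" },
];
const ranked = hybridSearch("authentication bug", docs);
```

Running this, the exact-phrase document ranks first and the "auth issues" document second, mirroring the "best of both worlds" claim above.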

It also benefits from smart syncing: a file watcher triggers a sync whenever the files change.

These Markdown files are generated by the agent itself using the standard "write" file tool; there is no special memory-write API. The agent simply writes to memory/*.md.

When a new conversation starts, a hook grabs the previous conversation and writes a summary in Markdown.

Clawd's memory system is surprisingly simple, and very similar to the workflow memories we implemented at @CamelAIOrg: no memory merging, no monthly or weekly memory compression.

Depending on your perspective, this simplicity can be either an advantage or a pitfall, but I'm always in favor of explainable simplicity over complex spaghetti.

Memories persist forever, and old memories carry essentially equal weight, so you could say there is no forgetting curve.

Clawd's Claws: How It Uses Your Computer

This is one of Clawd's moats: you hand it a computer and let it use it. So how does it actually use the computer? Basically the way you'd imagine.

Clawd gives the agent significant computer access, at your own risk. It uses an exec tool to run shell commands in one of three environments:

sandbox: the default, where commands run in a Docker container;

directly on the host machine;

on remote devices.

Beyond that, Clawd also has filesystem tools (read, write, edit),

a browser tool (Playwright-based, with semantic snapshots),

and process management (a process tool) for long-running background commands, killing processes, and so on.

Safety (or the Lack Thereof?)

Similar to Claude Code, there is a command allowlist: commands the user wants to review trigger an approval prompt (allow once, always allow, deny).

Some safe commands (e.g., jq, grep, cut, sort, uniq, head, tail, tr, wc) are pre-approved by default.

Dangerous shell constructs are blocked by default.
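An approval gate of this shape can be sketched as below. The policy, the regexes, and the `gateCommand` name are all invented for illustration; Clawdbot's real allowlist and blocked-construct rules differ:

```typescript
// Sketch of an exec-approval gate (policy and names are invented).
// Safe read-only commands pass, dangerous shell constructs are rejected,
// and everything else is escalated to the user for approval.
const PRE_APPROVED = new Set(["jq", "grep", "cut", "sort", "uniq", "head", "tail", "tr", "wc"]);
const DANGEROUS = [/rm\s+-rf\s+\//, /\bcurl\b.*\|\s*sh\b/, /&&|;|\$\(/];

type Verdict = "allow" | "block" | "ask_user";

function gateCommand(cmd: string): Verdict {
  // Block first: chaining or piping into a shell hides arbitrary commands.
  if (DANGEROUS.some((re) => re.test(cmd))) return "block";
  const binary = cmd.trim().split(/\s+/)[0];
  return PRE_APPROVED.has(binary) ? "allow" : "ask_user";
}
```

For example, `gateCommand("grep -r TODO src")` passes silently, a curl-pipe-to-shell one-liner is blocked, and an unknown binary falls through to the approval prompt, which is exactly the "trust the user's judgment" model described above.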

The safety mechanism is very similar to what Claude Code ships with.

The idea is to give the agent as much autonomy as the user allows.

Browser: Semantic Snapshots

The browser tool does not primarily rely on screenshots; instead it uses semantic snapshots, a text-based representation of the page's accessibility (ARIA) tree.

So an agent would see:

Link: http://x.com/i/article/2016908271227953152
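Rendering an accessibility tree into lines like the one above can be sketched as follows. The node shape is invented and far simpler than a real ARIA tree, but it shows why this beats screenshots: the agent reads labeled text lines instead of parsing pixels:

```typescript
// Sketch of rendering an accessibility tree as a text snapshot (node shape
// invented; real ARIA trees carry many more roles and attributes).
type AriaNode = { role: string; name: string; children?: AriaNode[] };

function renderSnapshot(node: AriaNode, depth = 0): string[] {
  // One indented "role: name" line per node, depth-first.
  const line = `${"  ".repeat(depth)}${node.role}: ${node.name}`;
  return [line, ...(node.children ?? []).flatMap((c) => renderSnapshot(c, depth + 1))];
}

const page: AriaNode = {
  role: "document",
  name: "timeline",
  children: [
    { role: "heading", name: "everyone talks about Clawdbot" },
    { role: "link", name: "http://x.com/i/article/2016908271227953152" },
  ],
};
const snapshot = renderSnapshot(page).join("\n");
```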


Related Note

everyone talks about Clawdbot, but here's how it works

  • Source: https://x.com/hesamation/status/2017038553058857413?s=46
  • Mirror: https://x.com/hesamation/status/2017038553058857413?s=46
  • Published: 2026-01-30T00:54:00+00:00
  • Saved: 2026-01-31


📋 Discussion Archive

Discussion in progress…