🧠 阿头学 · 💬 讨论题 · 💰投资

你本可以自己发明 OpenClaw

持久化 AI 助手的核心架构并非黑箱,而是 10 个递进的工程问题的组合解,从会话存储到多渠道网关,每一块都可以自己搭建,但生产级的复杂性远超 demo。

2026-02-13

核心观点

  • 网关模式重塑交互位置:AI 逻辑与消息渠道彻底解耦,让同一个智能体同时活在 Telegram、Discord、Slack 里,用户在任何地方的对话历史和记忆都统一,这是从"你去找 AI"到"AI 住进你的工作流"的质变。
  • 痛点驱动的组件映射:文章清晰地把每个实际问题映射到对应的架构决策(记不住事→JSONL 会话、只会说不会做→工具+Agent Loop、乱执行命令→权限白名单、对话太长→上下文压缩),这套映射表本身就是做 AI 产品的通用 checklist。
  • SOUL.md 把通用 AI 变成具体角色:用 Markdown 文件定义人格、边界、行为规则,比硬编码 System Prompt 更可维护,也更接近"给员工写岗位说明书"的管理逻辑,越具体的约束产出越稳定。
  • 多智能体通过共享文件协作而非直接通信:研究员 Scout 和管家 Jarvis 不需要互发消息,而是读写同一个记忆目录,这极大降低了多 Agent 系统的死锁风险和通信复杂度。
  • 从 demo 到生产的隐形成本被严重低估:文章强调"400 行代码就覆盖了关键组件",但刻意回避了多用户、多租户、分布式一致性、prompt injection 防护、渠道协议变化、断线重连等生产级工程的 90% 复杂性。

跟我们的关联

  • 对 ATou 意味着什么:如果你在做团队内部的 AI 助手或自动化工具,这套架构给了一个清晰的演进路径——先从单渠道 bot 开始,逐步加会话、工具、权限、多渠道,不需要一开始就选择复杂的 Agent 框架。下一步可以评估:你的团队是否需要多智能体分工(如研究员+执行者),以及如何设计共享记忆的隔离策略。
  • 对 Neta 意味着什么:这篇文章本质是 OpenClaw 的技术营销——通过"你自己也能发明"的叙事降低心理门槛,再以"生产级版本"收尾引导采用。但其中关于网关模式、权限三层框架、上下文压缩的设计思想确实有参考价值。下一步应该问:这套架构在你的具体场景(如客服多渠道统一、内部工作流自动化)中的成本-收益是否真的划算,还是用现成的 Bot 平台或托管服务更经济。
  • 对 Uota 意味着什么:如果你在评估 AI Agent 产品的成熟度或架构合理性,这篇文章提供了一个参考框架——看它是否覆盖了会话、工具、权限、多渠道、记忆、队列、心跳这些基本组件。但要警惕"简单化陷阱":真正的差异在于边界条件处理(并发、故障恢复、安全隔离),而不是核心逻辑的简洁性。
  • 通用启发:这套"痛点→组件"的映射方法可以用在任何复杂系统的架构设计上——不要一开始就堆砌功能,而是按实际遇到的问题逐步迭代,每次解决一个具体痛点,最后的架构往往比预先设计的更清晰、更可维护。

讨论引子

1. 网关模式的成本真的值得吗?文章假设"统一多渠道体验"是必需的,但在实际产品中,是否每个渠道都需要完整的会话、记忆、工具访问权限,还是可以按渠道特性做差异化设计(如 Telegram 上功能全,邮件上只读)?

2. 权限系统的"三层模型"在 prompt injection 面前有多脆弱?文章提到的"审批允许列表"依赖字符串匹配,但 LLM 极易被诱导执行看似安全实则危险的命令变体(如 `rm -rf / --dry-run` 绕过审批),这套防护在 24/7 运行的场景下是否足够,还是需要更强的隔离(如容器、虚拟机)?

3. 多智能体通过共享文件协作的隐患是什么?当多个智能体同时读写同一个记忆文件时,如何保证一致性和隐私隔离(如 Alice 的私密信息不被 Bob 的智能体读到)?文章完全没提这个问题。

OpenClaw 的强大之处,出奇地简单:它是一个网关,把一个 AI 智能体连接到你的消息应用,让它拥有与电脑交互的工具,并且能够在多次对话之间记住“你是谁”。

真正的复杂性来自同时处理多个渠道、管理持久会话、协调多个智能体,以及把整套系统做得足够可靠,能在你的机器上 24/7 运行。

这篇文章里,我会从零开始,一步步搭到 OpenClaw 的架构,展示你如何从第一性原理出发,只用一个消息 API、一个 LLM,以及让 AI 在聊天窗口之外真正有用的愿望,就能自己把它“发明”出来。

最终目标:理解“持久化 AI 助手”是如何工作的,这样你就能构建自己的版本。

首先,我们得把问题说清楚

当你在浏览器里使用 ChatGPT 或 Claude 时,会有一个根本限制:AI 被困在网页界面背后。你得去找它。它不会来找你。

想想你每天真正的沟通方式。你用 WhatsApp、Telegram、Discord、Slack、iMessage。对话发生在那里,而不是在某个单独的 AI 标签页里。

如果 AI 就……住在你的消息应用里,会怎样?如果你能像给朋友发短信一样给它发消息,而它能够:

记住你的偏好和过去的对话

在你的电脑上运行命令

替你浏览网页

代表你发送消息

按计划“醒来”,执行重复性任务

这就是 OpenClaw 做的事。它让 AI 活在你本来就沟通的地方,并且可以访问你的工具和你的上下文。

我们从零开始搭一个。

最简单可行的 Bot

先从绝对最小集合开始:一个能在 Telegram 上回复消息的 AI。

运行它,在 Telegram 上发一条消息,AI 就回复。很简单。

但这基本上就是 Claude 网页界面的更差版本。每条消息都是独立的。没有记忆。没有工具。没有人格。

那如果我们给它记忆呢?

目标:持久会话

我们这个简单 bot 的问题在于“无状态”。每条消息都是一段全新对话。你问它“我刚才说了什么?”,它完全不知道。

修复方法是会话(sessions)。为每个用户保留一份对话历史。

现在你就能进行真正的对话:

关键洞察是 JSONL 格式。每一行是一条消息。只追加写入。如果进程在写到一半时崩溃,最多丢一行。这正是 OpenClaw 用来保存会话转录的方式:
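原文此处的代码块没有在归档中保留下来。下面是按正文描述补写的一个最小示意(目录名与函数名均为假设,并非 OpenClaw 的实际实现):

```python
import json
from pathlib import Path

SESSIONS_DIR = Path("sessions")  # 会话目录(示意路径)

def session_path(session_key: str) -> Path:
    SESSIONS_DIR.mkdir(parents=True, exist_ok=True)
    return SESSIONS_DIR / f"{session_key}.jsonl"

def append_message(session_key: str, role: str, content: str) -> None:
    # 只追加写入:每行一条 JSON,进程写到一半崩溃时最多丢最后一行
    with session_path(session_key).open("a", encoding="utf-8") as f:
        f.write(json.dumps({"role": role, "content": content}, ensure_ascii=False) + "\n")

def load_session(session_key: str) -> list:
    path = session_path(session_key)
    if not path.exists():
        return []
    messages = []
    for line in path.read_text(encoding="utf-8").splitlines():
        try:
            messages.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # 跳过崩溃留下的半行
    return messages
```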

每个会话映射到一个文件。每个文件就是一段对话。重启进程,一切都还在。

但我们会碰到一个问题:对话会增长。最终它们会超过模型的上下文窗口。我们稍后会回来处理这个。

目标:加入人格(SOUL.md)

我们的 bot 能用了,但它没有人格。它只是一个通用 AI 助手。如果我们希望它是某个人呢?

OpenClaw 用 SOUL.md 解决:一个 Markdown 文件,用来定义智能体的身份、行为与边界。

这样你就不是在跟一个通用助手说话,而是在跟 Jarvis 说话。SOUL 会在每次 API 调用时作为 system prompt 注入。

在 OpenClaw 里,SOUL.md 放在智能体的 workspace 中:
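归档丢失了这里的示例。按正文"会话开始时加载、注入 system prompt"的描述,一个假设性的最小写法大致如下(默认人格与消息结构均为示意):

```python
from pathlib import Path

DEFAULT_SOUL = "You are a helpful assistant."  # 兜底人格(示意)

def load_soul(workspace: Path) -> str:
    # 会话开始时读取 workspace/SOUL.md,作为 system prompt 注入每次 API 调用
    soul_file = workspace / "SOUL.md"
    if soul_file.exists():
        return soul_file.read_text(encoding="utf-8")
    return DEFAULT_SOUL

def build_messages(soul: str, history: list) -> list:
    # system prompt 永远排在最前,后面跟会话历史
    return [{"role": "system", "content": soul}] + history
```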

它会在会话开始时加载,并注入 system prompt。你可以在里面写任何东西。给智能体写一个起源故事。定义它的核心哲学。列出它的行为规则。

SOUL 越具体,智能体行为就越一致。“要有帮助”太模糊。“做一个你真的愿意聊天的助手:需要时简洁,重要时详尽。不做企业腔。不谄媚。就……好。”——这会给模型更多可操作的约束。

目标:加入工具

只能说话的 bot 很受限。如果它能做事呢?

核心思路:给 AI 一组结构化工具,让它自己决定何时使用它们。

现在我们需要智能体循环(agent loop)。当 AI 想用某个工具时,我们执行工具,并把结果喂回去:
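原文的循环代码未被归档保留。下面用一个与具体 LLM SDK 解耦的示意版本还原这个模式:`call_llm` 返回的字典结构是本示例的假设,真实实现应换成所用 API 的 function-calling 格式:

```python
def run_agent_turn(messages, call_llm, execute_tool, max_steps=10):
    """智能体循环:调用 LLM → 执行其请求的工具 → 把结果喂回去,重复直到拿到纯文本回复。
    call_llm(messages) 返回 {"text": ...} 或 {"tool": 名称, "args": {...}}(接口为示意)。"""
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "tool" in reply:
            result = execute_tool(reply["tool"], reply["args"])
            messages.append({"role": "assistant", "tool": reply["tool"], "args": reply["args"]})
            messages.append({"role": "tool", "content": result})
            continue  # 带着工具结果再问一轮
        messages.append({"role": "assistant", "content": reply["text"]})
        return reply["text"]
    return "(达到最大工具调用步数)"
```

`max_steps` 上限可以防止模型陷入无限的工具调用循环。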

接着我们把 handle_message 更新为使用智能体循环,而不是直接调用 API:

现在你可以给你的 bot 发消息:

AI 自己决定用哪些工具、按什么顺序用,并把结果综合成自然语言回复。全程通过一条 Telegram 消息完成。

OpenClaw 的生产级工具目录要大得多——浏览器自动化、智能体间消息、子智能体生成等等——但每个工具都遵循同一个模式:一个 schema、一段描述、一个执行函数。

目标:权限控制

我们在从 Telegram 消息里执行命令。这太吓人了。如果有人拿到了你的 Telegram 账号,然后让 bot 执行 rm -rf / 呢?

我们需要一个权限系统。OpenClaw 的做法:一个审批允许列表,会记住你批准过什么。
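正文提到的 helper 代码在归档中丢失了。下面是一个按描述补写的示意(`exec-approvals.json` 的文件名来自正文,安全命令清单与函数名为假设):

```python
import json
from pathlib import Path

APPROVALS_FILE = Path("exec-approvals.json")     # 与正文一致的审批持久化文件
SAFE_COMMANDS = {"ls", "pwd", "date", "whoami"}  # 无需审批的安全命令(示意清单)

def load_approvals() -> set:
    if APPROVALS_FILE.exists():
        return set(json.loads(APPROVALS_FILE.read_text(encoding="utf-8")))
    return set()

def approve(command: str) -> None:
    # 审批持久化:同一个命令不会被问第二次
    approvals = load_approvals()
    approvals.add(command.strip())
    APPROVALS_FILE.write_text(json.dumps(sorted(approvals)), encoding="utf-8")

def is_allowed(command: str) -> bool:
    # 命令首词命中安全清单直接放行;否则必须精确命中已审批条目
    parts = command.strip().split()
    if parts and parts[0] in SAFE_COMMANDS:
        return True
    return command.strip() in load_approvals()
```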

我们把这些 helper 加到现有代码旁边,然后更新 `execute_tool` 里的 `run_command` 分支来使用它们:

现在更新 execute_tool 中的 run_command 分支:执行前先检查权限:

当命令是安全的或之前已批准过,它会立刻运行。否则,智能体会收到“permission denied”,并可以尝试别的方案。审批会持久化到 exec-approvals.json,所以同一个命令你不会被问两次。

OpenClaw 在此基础上扩展了 glob 模式(一次批准 git *),以及一个三层模型:“ask”(询问用户)、“record”(记录但放行)、“ignore”(自动放行)。

目标:网关

从这里开始就变得有意思了。到目前为止,我们只有一个 Telegram bot。但如果你还想在 Discord 上用?在 WhatsApp 上用?在 Slack 上用?

你可以为每个平台分别写一个 bot。但那样你会有分离的会话、分离的记忆、分离的配置。Telegram 上的 AI 不会知道你在 Discord 上聊过什么。

解决方案:网关。一个中心进程管理所有渠道。

看看我们已经有什么。我们的 run_agent_turn 函数根本不知道 Telegram 的存在。它接收消息并返回文本。这就是关键——智能体逻辑已经与渠道解耦了。

为了证明这一点,我们加第二个接口。我们在 Telegram bot 旁边再加一个简单的 HTTP API,它们都对同一个智能体、同一套会话说话:
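原文这里的 HTTP API 代码未被保留。用两个内存版 adapter 可以示意同一件事——渠道只是薄薄一层,核心入口共享同一套会话(字段名与回复内容均为假设):

```python
# 渠道无关的核心入口:Telegram 与 HTTP 只是两个薄 adapter,
# 都调用同一个 handle_message,因而共享同一套会话(此处用内存字典示意)
SESSIONS: dict = {}

def handle_message(user_id: str, text: str) -> str:
    history = SESSIONS.setdefault(user_id, [])
    history.append({"role": "user", "content": text})
    user_turns = sum(1 for m in history if m["role"] == "user")
    reply = f"这是你的第 {user_turns} 条消息"  # 真实实现中这里走智能体循环
    history.append({"role": "assistant", "content": reply})
    return reply

def on_telegram_update(update: dict) -> str:   # Telegram adapter(字段名为示意)
    return handle_message(str(update["from_id"]), update["text"])

def on_http_request(payload: dict) -> str:     # HTTP adapter(字段名为示意)
    return handle_message(payload["user_id"], payload["message"])
```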

试一下:先在 Telegram 上告诉 bot 你的名字。然后用同一个 user ID(你的 Telegram user ID)通过 HTTP 查询,证明会话是共享的:

同一个智能体,同一套会话,同一份记忆。两个不同接口。这就是网关模式。

下一步就是把它做成配置驱动——用一个 JSON 文件指定要启动哪些渠道、以及如何认证。

这正是 OpenClaw 做的:它的网关通过单一配置文件管理 Telegram、Discord、WhatsApp、Slack、Signal、iMessage 等等。它还支持可配置的会话作用域——按用户、按渠道,或单一共享会话——让同一个人在多渠道获得统一体验。我们暂时先保持简单:用 user ID 作为会话键。

目标:上下文压缩

还记得我们前面标记的“会话增长”问题吗?跟 bot 聊了几周之后,会话文件里有成千上万条消息。总 token 数超过模型上下文窗口。现在怎么办?

修复方法:总结旧消息,保留新消息。把这两个函数加到现有代码旁边:
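这两个函数的原文代码未被归档保留。一个按描述补写的示意如下(token 估算按"字符数除以 4"粗略近似,阈值与保留条数均为假设参数):

```python
def estimate_tokens(messages) -> int:
    # 粗略估算:字符数 / 4;真实实现应为估算误差预留安全边际
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages, summarize, max_tokens=1000, keep_recent=6):
    """超过阈值时:把较旧的消息交给 summarize() 生成摘要,只保留最近 keep_recent 条。"""
    if estimate_tokens(messages) <= max_tokens or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # 真实实现中这里调用 LLM 做摘要
    return [{"role": "system", "content": f"早前对话摘要:{summary}"}] + recent
```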

然后在 handle_message 顶部、加载会话之后,加入压缩检查:

试一下:为了不聊上几个小时来测试压缩,先临时把阈值调低:

聊 10–15 条消息,然后看旧消息被摘要替换。bot 仍然记得关键信息,但会话文件小了很多。

OpenClaw 的压缩更复杂——它按 token 数把消息切成块、分别摘要,并为估算误差预留安全边际——但核心思想完全一致。

目标:长期记忆

会话历史给你的是“对话记忆”。但如果你重置会话或新开一个会话呢?一切都没了。

我们需要一个独立的记忆系统——能跨会话重置而保存的持久知识。做法:给智能体工具,用文件存储来保存与搜索记忆。

把这两个工具加到 TOOLS 列表里:
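原文的工具定义与实现都没有保留下来。下面把 schema 与对应实现合在一个示意里(schema 结构仿照常见 function-calling 格式,目录名与函数名为假设):

```python
import json
from pathlib import Path

MEMORY_DIR = Path("memory")  # 记忆目录(示意路径)

# 工具 schema:名称、描述、参数(结构为示意)
MEMORY_TOOLS = [
    {"name": "save_memory", "description": "保存一条长期记忆",
     "parameters": {"type": "object", "properties": {"text": {"type": "string"}}}},
    {"name": "search_memory", "description": "按关键词搜索记忆",
     "parameters": {"type": "object", "properties": {"query": {"type": "string"}}}},
]

def save_memory(text: str) -> str:
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    with (MEMORY_DIR / "memories.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return "saved"

def search_memory(query: str) -> list:
    memfile = MEMORY_DIR / "memories.jsonl"
    if not memfile.exists():
        return []
    entries = [json.loads(line) for line in memfile.read_text(encoding="utf-8").splitlines()]
    # 朴素关键词匹配;生产版可换成向量检索做语义匹配
    return [e["text"] for e in entries if query.lower() in e["text"].lower()]
```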

把它们的分支加到 `execute_tool`:

最后,更新 SOUL,让智能体知道“记忆”这回事:

试一下:

记忆之所以能持久,是因为它存放在文件里,而不是会话里。重置会话、重启 bot——记忆仍然在。

OpenClaw 的生产级记忆用向量检索加 embeddings 做语义匹配(因此 “auth bug” 能匹配 “authentication issues”),但我们的关键词搜索作为入门已经很好用了。

目标:命令队列

这里有个微妙但致命的问题:当两条消息同时到达,会发生什么?

比如你在 Telegram 发“查一下我的日历”,同时又通过 HTTP API 发“天气怎么样”。两边都尝试加载同一个会话、都尝试追加写入,于是你得到损坏的数据。

修复很简单:按会话加锁。对同一个会话一次只处理一条消息。不同会话仍然可以并行。
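正文说的"五行初始化"在归档中丢失了。按描述补写的一个示意如下(`asyncio.Lock` 按会话键惰性创建,包装函数的接口为假设):

```python
import asyncio
from collections import defaultdict

# 按会话键惰性创建锁:同一会话串行处理,不同会话并行
session_locks: "defaultdict[str, asyncio.Lock]" = defaultdict(asyncio.Lock)

async def with_session_lock(session_key: str, coro_fn):
    # coro_fn 是零参函数,调用后返回真正要执行的协程
    async with session_locks[session_key]:
        return await coro_fn()
```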

现在用这把锁把 handle_message 的主体包起来:

对 /chat HTTP endpoint 也做同样的事:

就这样——五行初始化。同一用户的消息会排队。不同用户的消息并行跑。没有竞态条件。

OpenClaw 在此基础上做了 lane-based 队列(为消息、cron 任务、子智能体分不同 lane),这样心跳任务永远不会阻塞实时对话。

目标:Cron Jobs(心跳)

到目前为止,我们的智能体只有在你跟它说话时才会响应。但如果你希望它每天早上检查邮件呢?或者在会议前总结日历呢?

你需要定时执行。我们加心跳——定期触发智能体的循环任务。

关键洞察:每个心跳使用自己的会话键(cron:morning-briefing)。这样定时任务不会弄乱你的主对话历史。心跳调用的仍然是同一个 run_agent_turn 函数——它只是另一条消息,只不过由定时器触发,而不是由人触发。
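原文的心跳代码没有保留。下面用一个可测试的纯函数示意"到期即触发、会话键独立"的核心逻辑(此处用秒级间隔代替 cron 表达式,属于简化假设):

```python
def due_heartbeats(heartbeats, last_run: dict, now: float) -> list:
    """返回当前到期的心跳任务。每个任务使用独立会话键 cron:<name>,
    因此定时任务不会混进主对话历史。heartbeats 为 (名称, 间隔秒数, 提示语) 三元组。"""
    fired = []
    for name, interval_sec, prompt in heartbeats:
        if now - last_run.get(name, float("-inf")) >= interval_sec:
            fired.append((f"cron:{name}", prompt))
            last_run[name] = now  # 记录本次触发时间
    return fired
```

真实的调度循环只需定期调用这个函数,把返回的每条 `(会话键, 提示语)` 交给同一个 `run_agent_turn`。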

试一下:为了测试,把计划改成每分钟运行一次:

你会在终端看到心跳触发,智能体会响应。测试结束后再改回每日计划。

OpenClaw 支持完整的 cron 表达式(30 7 * * *),并把心跳路由到单独的命令队列 lane,确保它永远不会阻塞实时消息。

目标:多智能体

一个智能体很有用。但当任务越来越多,你会发现一个人格与一套工具很难把所有事都做得好。研究助理需要的指令和通用助理不同。

修复方法:多套智能体配置 + 路由。每个智能体有自己的 SOUL、自己的会话;你根据消息在它们之间切换。
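路由部分的代码同样未被归档保留。一个按描述补写的朴素示意如下(智能体名称沿用正文的 Jarvis / Scout,关键词与文件路径为假设):

```python
AGENTS = {
    # 每个智能体 = 自己的 SOUL 文件 + 自己的会话键前缀(配置为示意)
    "jarvis": {"soul": "souls/jarvis.md", "keywords": []},
    "scout":  {"soul": "souls/scout.md", "keywords": ["研究", "调研", "research"]},
}

def route_agent(text: str) -> str:
    # 朴素路由:命中研究类关键词走 Scout,否则默认 Jarvis
    lowered = text.lower()
    for name, cfg in AGENTS.items():
        if any(k in lowered for k in cfg["keywords"]):
            return name
    return "jarvis"

def session_key_for(agent: str, user_id: str) -> str:
    return f"{agent}:{user_id}"  # 各智能体的会话彼此隔离
```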

更新 handle_message:把消息路由到正确的智能体:

试一下:

每个智能体有自己的对话历史,但它们共享同一个记忆目录。Scout 保存研究发现;Jarvis 之后可以搜索并取回。它们通过共享文件协作,而不需要直接互发消息。

OpenClaw 在此基础上扩展了子智能体生成(父智能体可以生成一个子智能体来做聚焦任务)和智能体间消息,但核心模式相同:每个智能体本质上就是 SOUL + 会话 + 工具。

把一切拼起来

我们把构建出的所有东西合成一份可运行的脚本。这是一个干净的独立 REPL,包含教程里的每个特性:会话、SOUL、工具、权限、压缩、记忆、命令队列、cron、多智能体路由。

我在这里整理了一个约 400 行的 mini-openclaw:

https://gist.github.com/dabit3/86ee04a1c02c839409a02b20fe99a492

把它保存为 mini-openclaw.py 并运行:

一个会话大概长这样:

记忆跨会话持久存在。智能体通过共享记忆文件协作。命令需要审批。心跳在后台运行。一切都在约 400 行内完成。

我们学到了什么

从一个简单的 Telegram bot 出发,我们构建了持久化 AI 助手的每个关键组件:

持久会话(JSONL 文件):抗崩溃的对话记忆。每个会话一个文件,每行一条消息。重启进程,一切都还在。

SOUL.md(system prompt):一个人格文件,把通用 AI 变成具体智能体,使其行为、边界与风格保持一致。

工具 + 智能体循环:结构化的工具定义,让 AI 自己决定何时行动。智能体循环调用 LLM,执行任何被请求的工具,把结果反馈回去,并重复直到完成。

权限控制:安全命令的允许列表 + 持久化审批,让危险操作必须获得明确同意。

网关模式:一个中心智能体,多种接口。Telegram、HTTP 或任何其他渠道——都对同一套会话、同一份记忆说话。

上下文压缩:当对话长到超过上下文窗口,摘要旧消息并保留新消息。bot 既保留知识,又不触碰 token 上限。

长期记忆:基于文件的存储,配合保存与搜索工具。跨会话重置仍然存在的知识,任何智能体都能访问。

命令队列:按会话加锁,防止多条消息同时到达时产生竞态与数据损坏。

心跳:定时器驱动的智能体运行,每个任务有自己的隔离会话。智能体醒来,完成任务,然后继续“睡觉”。

多智能体路由:多套智能体配置,SOUL 与会话键各不相同,按消息内容路由。智能体通过共享记忆文件协作。

这些组件都源自一个个实际问题:

“AI 记不住任何东西” → 会话

“它回复得像个通用聊天机器人” → SOUL.md

“它只能说,不能做” → 工具 + 智能体循环

“它会不经询问就执行危险命令” → 权限控制

“我希望它存在于所有消息应用里” → 网关

“对话太长了” → 压缩

“它在不同会话之间会忘事” → 记忆

“两条消息同时到达会弄坏数据” → 命令队列

“我希望它能自动做事” → 心跳

“一个智能体很难把所有事都做好” → 多智能体

这就是你如何自己发明 OpenClaw。

继续深入

我们的原型覆盖了核心架构。下面是 OpenClaw 如何把每个想法扩展到生产级——当你用完基础后,值得继续探索的特性。

带语义快照的浏览器

大多数 AI 助手“看不见”网页。OpenClaw 通过 Playwright 给智能体一个浏览器,但它不发送截图(每张 5MB,token 成本高),而是用语义快照——页面可访问性树的文本表示:

每个可交互元素都会得到一个编号的 ref ID。智能体想点击某个东西时,它会说“click ref=1”——这会精确映射到一个元素。没有猜测,也不会出现“点一下上方那个蓝色按钮”。而且由于快照是文本而不是图片,它大约比截图小 100 倍,这意味着每页消耗的 token 少得多。
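为直观起见,可以用一棵玩具版可访问性树示意"带编号 ref 的文本快照"长什么样(节点结构与字段名均为本示例的假设,并非 Playwright 或 OpenClaw 的真实数据格式):

```python
def semantic_snapshot(node, lines=None, counter=None) -> str:
    """把(玩具版)可访问性树渲染成文本快照:可交互元素获得递增的 ref 编号,
    智能体之后用 "click ref=N" 精确指定要操作的元素。"""
    if lines is None:
        lines, counter = [], {"n": 0}
    label = f'{node["role"]} "{node.get("name", "")}"'
    if node.get("interactive"):
        counter["n"] += 1
        lines.append(f'[ref={counter["n"]}] {label}')
    else:
        lines.append(label)
    for child in node.get("children", []):
        semantic_snapshot(child, lines, counter)
    return "\n".join(lines)
```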

会话作用域与身份链接

我们的原型用 user ID 作为会话键。OpenClaw 支持可配置的作用域:

main(默认):所有私信共享一个会话——简单,适合单用户部署。

per-peer:每个人在所有渠道共享一个会话。

per-channel-peer:每个人在每个渠道各自有会话。

身份链接让你把同一个人的多渠道会话合并起来,因此 Alice 在 Telegram 和 Discord 上的对话可以共享同一份历史。

渠道插件系统

我们的原型硬编码了 Telegram + HTTP。OpenClaw 用插件式架构:每个渠道(Telegram、Discord、WhatsApp、Slack、Signal、iMessage)都是一个独立 adapter,把消息归一化为统一格式。新增一个渠道意味着只写一个 adapter,而不需要改任何智能体逻辑。

向量记忆检索

我们的关键词搜索能用,但会漏掉语义匹配(“auth bug” 不会匹配 “authentication issues”)。OpenClaw 的生产级记忆用混合方案:SQLite 上的向量检索(带 embedding 扩展)做语义相似度匹配,外加 FTS5 做精确关键词匹配。可配置的 embedding 提供方包括 OpenAI、本地模型、Gemini、Voyage。

子智能体生成

我们的多智能体是手动路由。OpenClaw 允许智能体以编程方式生成子智能体——父智能体调用 sessions_spawn,子智能体在自己的上下文中运行(带超时),并把结果返回父智能体。这让“深入研究某个主题”这类模式成为可能:主智能体把任务委派给专家,完成后再继续。

下一步

如果你想构建自己的版本:

从一个渠道开始:先把 Telegram 或 Discord bot 跑起来,并实现会话

逐步加入工具:从文件读写开始,再加入 shell 执行

需要时再加记忆:当会话会被重置时,你就会想要持久记忆

一个渠道不够用时再加渠道:网关模式会自然浮现

任务开始分化时再加智能体:不要一开始就搞 10 个,先从 2 个开始

或者直接用 OpenClaw。它是开源的,并处理了我们略过的各种边界情况。但现在,你已经知道它底层是怎么运作的了。

Link: http://x.com/i/article/2021347850656022528

What makes OpenClaw powerful is surprisingly simple: it's a gateway that connects an AI agent to your messaging apps, gives it tools to interact with your computer, and lets it remember who you are across conversations.

The complexity comes from handling multiple channels simultaneously, managing persistent sessions, coordinating multiple agents, and making the whole thing reliable enough to run 24/7 on your machine.

In this post, I'll start from scratch and build up to OpenClaw's architecture step by step, showing how you could have invented it yourself from first principles, using nothing but a messaging API, an LLM, and the desire to make AI actually useful outside the chat window.

End goal: understand how persistent AI assistants work, so you can build your own.

First, let's establish the problem

When you use ChatGPT or Claude in a browser, there's a fundamental limitation: the AI lives behind a web interface. You go to it. It doesn't come to you.

Think about how you actually communicate day-to-day. You use WhatsApp, Telegram, Discord, Slack, iMessage. Your conversations happen there, not in some separate AI tab.

What if the AI just... lived in your messaging apps? What if you could text it like a friend and it could:

Remember your preferences and past conversations

Run commands on your computer

Browse the web for you

Send messages on your behalf

Wake up on a schedule and do recurring tasks

This is what OpenClaw does. It's an AI that lives where you already communicate, with access to your tools and your context.

Let's build one from scratch.

The Simplest Possible Bot

Let's start with the absolute minimum: an AI that responds to messages on Telegram.

Run it, send a message on Telegram, and the AI responds. Simple.

But this is basically a worse version of the Claude web interface. Every message is independent. No memory. No tools. No personality.

What if we gave it memory?

Goal: Persistent Sessions

A problem with our simple bot is statelessness. Every message is a fresh conversation. Ask it "what did I say earlier?" and it has no idea.

The fix is sessions. Keep a conversation history per user.

Now you can have an actual conversation:

The key insight is the JSONL format. Each line is one message. Append-only. If the process crashes mid-write, you lose at most one line. This is exactly what OpenClaw uses for session transcripts:

Each session maps to a file. Each file is a conversation. Restart the process and everything is still there.

But we'll hit a problem: conversations grow. Eventually they'll exceed the model's context window. We'll come back to that.

Goal: Adding a Personality (SOUL.md)

Our bot works, but it has no personality. It's a generic AI assistant. What if we wanted it to be someone?

OpenClaw solves this with SOUL.md: a markdown file that defines the agent's identity, behavior, and boundaries.

Now instead of a generic assistant, you're talking to Jarvis. The SOUL gets injected as the system prompt on every API call.

In OpenClaw, the SOUL.md lives in the agent's workspace:

It gets loaded at session start and injected into the system prompt. You can write anything you want in there. Give the agent an origin story. Define its core philosophy. List its behavioral rules.

The more specific your SOUL, the more consistent the agent's behavior. "Be helpful" is vague. "Be the assistant you'd actually want to talk to. Concise when needed, thorough when it matters. Not a corporate drone. Not a sycophant. Just... good." - that gives the model something to work with.

Goal: Adding Tools

A bot that can only talk is limited. What if it could do things?

The core idea: give the AI structured tools and let it decide when to use them.

Now we need the agent loop. When the AI wants to use a tool, we execute it and feed the result back:

Now we update handle_message to use the agent loop instead of calling the API directly:

Now you can text your bot:

The AI decided which tools to use, in what order, and synthesized the results into a natural response. All through a Telegram message.

OpenClaw's production tool catalog is much larger - browser automation, inter-agent messaging, sub-agent spawning, and more - but every tool follows this exact pattern: a schema, a description, and an execution function.

Goal: Permission Controls

We're executing commands from Telegram messages. That's terrifying. What if someone gets access to your Telegram account and tells the bot to rm -rf /?

We need a permission system. OpenClaw's approach: an approval allowlist that remembers what you've approved.

We add these helpers alongside our existing code, then update the run_command case inside execute_tool to use them:

Now update the run_command case in execute_tool to check permissions before executing:

When a command is safe or previously approved, it runs immediately. When it's not, the agent gets told "permission denied" and can try a different approach. The approval gets persisted to exec-approvals.json, so you're never asked twice for the same command.

OpenClaw extends this with glob patterns (approve git * once) and a three-tier model: "ask" (prompt user), "record" (log but allow), and "ignore" (auto-allow).

Goal: The Gateway

Here's where it gets interesting. So far we have a Telegram bot. But what if you also want the AI on Discord? And WhatsApp? And Slack?

You could write separate bots for each platform. But then you'd have separate sessions, separate memory, separate configurations. The AI on Telegram wouldn't know what you discussed on Discord.

The solution: a gateway. One central process that manages all channels.

Look at what we already have. Our run_agent_turn function doesn't know anything about Telegram. It takes messages and returns text. That's the key - the agent logic is already decoupled from the channel.

To prove it, let's add a second interface. We'll add a simple HTTP API alongside our Telegram bot, both talking to the same agent and the same sessions:

Try it out: Tell the bot your name on Telegram. Then query via HTTP using the same user ID (your Telegram user ID) to prove the session is shared:

Same agent, same sessions, same memory. Two different interfaces. That's the gateway pattern.

The next step would be making this config-driven - a JSON file specifying which channels to start and how to authenticate them.

That's what OpenClaw does: its gateway manages Telegram, Discord, WhatsApp, Slack, Signal, iMessage, and more, all through a single config file. It also supports configurable session scoping - per-user, per-channel, or a single shared session - so the same person gets a unified experience across channels. We'll keep our simple user-ID-as-session-key approach for now.

Goal: Context Compaction

Remember the growing session problem we flagged earlier? After chatting with your bot for weeks, the session file has thousands of messages. The total token count exceeds the model's context window. Now what?

The fix: summarize old messages, keep recent ones. Add these two functions alongside your existing code:

Now add the compaction check at the top of handle_message, right after loading the session:

Try it out: To test compaction without chatting for hours, temporarily lower the threshold:

Have a conversation of 10-15 messages, then watch the old messages get replaced with a summary. The bot still remembers key facts, but the session file is much smaller.

OpenClaw's compaction is more sophisticated - it splits messages into chunks by token count, summarizes each chunk separately, and includes a safety margin for estimation inaccuracy - but the core idea is identical.

Goal: Long-Term Memory

Session history gives you conversation memory. But what happens when you reset a session or start a new one? Everything is gone.

We need a separate memory system - persistent knowledge that survives session resets. The approach: give the agent tools to save and search memories stored as files.

Add these two tools to the TOOLS list:

Add their cases to execute_tool:

Finally, update the SOUL so the agent knows about memory:

Try it out:

The memory persists because it's stored in files, not in the session. Reset the session, restart the bot - the memories are still there.

OpenClaw's production memory uses vector search with embeddings for semantic matching (so "auth bug" matches "authentication issues"), but our keyword search works well for getting started.

Goal: Command Queue

Here's a subtle but critical problem: what happens when two messages arrive at the same time?

Say you send "check my calendar" on Telegram and "what's the weather" via the HTTP API simultaneously. Both try to load the same session, both try to append to it, and you get corrupted data.

The fix is simple: a per-session lock. Only one message processes at a time for each session. Different sessions can still run in parallel.

Now wrap the body of handle_message with the lock:

Do the same for the /chat HTTP endpoint:

That's it - five lines of setup. Messages for the same user queue up. Messages for different users run in parallel. No race conditions.

OpenClaw extends this with lane-based queues (separate lanes for messages, cron jobs, and sub-agents) so heartbeats never block real-time conversations.

Goal: Cron Jobs (Heartbeats)

So far our agent only responds when you talk to it. But what if you want it to check your email every morning? Or summarize your calendar before meetings?

You need scheduled execution. Let's add heartbeats - recurring tasks that trigger the agent on a timer.

The key insight: each heartbeat uses its own session key (cron:morning-briefing). This keeps scheduled tasks from cluttering your main conversation history. The heartbeat calls the same run_agent_turn function - it's just another message, triggered by a timer instead of a human.

Try it out: For testing, change the schedule to run every minute:

You'll see the heartbeat fire in your terminal, and the agent will respond. Change it back to a daily schedule when you're done testing.

OpenClaw supports full cron expressions (30 7 * * *) and routes heartbeats through a separate command queue lane so they never block real-time messages.

Goal: Multi-Agent

One agent is useful. But as you add more tasks, you'll find a single personality and toolset can't cover everything well. A research assistant needs different instructions than a general assistant.

The fix: multiple agent configurations with routing. Each agent has its own SOUL, its own session, and you switch between them based on the message.

Update handle_message to route messages to the right agent:

Try it out:

Each agent has its own conversation history, but they share the same memory directory. Scout saves research findings; Jarvis can search for them later. They collaborate through shared files without needing direct messaging.

OpenClaw extends this with sub-agent spawning (a parent agent can spawn a child for a focused task) and inter-agent messaging, but the core pattern is the same: each agent is just a SOUL + session + tools.

Putting It All Together

Let's combine everything we've built into a single runnable script. This is a clean standalone REPL that includes every feature from the tutorial: sessions, SOUL, tools, permissions, compaction, memory, command queue, cron, and multi-agent routing.

I've put together a mini-openclaw in ~400 lines of code here:

https://gist.github.com/dabit3/86ee04a1c02c839409a02b20fe99a492

Save this as mini-openclaw.py and run it:

Here's what a session looks like:

The memory persists across sessions. Agents collaborate through shared memory files. Commands require approval. Heartbeats run in the background. All in ~400 lines.

What We've Learned

Starting from a simple Telegram bot, we built every major component of a persistent AI assistant:

Persistent sessions (JSONL files): Crash-safe conversation memory. Each session is one file, each line is one message. Restart the process and everything is still there.

SOUL.md (system prompt): A personality file that transforms a generic AI into a specific agent with consistent behavior, boundaries, and style.

Tools + Agent loop: Structured tool definitions that let the AI decide when to act. The agent loop calls the LLM, executes any requested tools, feeds results back, and repeats until done.

Permission controls: An allowlist of safe commands plus persistent approvals, so dangerous operations require explicit consent.

The gateway pattern: One central agent with multiple interfaces. Telegram, HTTP, or any other channel - they all talk to the same sessions and the same memory.

Context compaction: When conversations outgrow the context window, summarize old messages and keep recent ones. The bot keeps its knowledge without hitting token limits.

Long-term memory: File-based storage with save and search tools. Knowledge that survives session resets, accessible to any agent.

Command queue: Per-session locking to prevent race conditions when multiple messages arrive simultaneously.

Heartbeats: Scheduled agent runs on a timer, each with its own isolated session. The agent wakes up, does its task, and goes back to sleep.

Multi-agent routing: Multiple agent configurations with different SOULs and session keys, routed by message content. Agents collaborate through shared memory files.

Each of these emerged from a practical problem:

"The AI can't remember anything" → Sessions

"It responds like a generic chatbot" → SOUL.md

"It can only talk, not act" → Tools + Agent loop

"It runs dangerous commands without asking" → Permission controls

"I want it on all my messaging apps" → Gateway

"The conversation got too long" → Compaction

"It forgets things between sessions" → Memory

"Two messages at once corrupt the data" → Command queue

"I want it to do things automatically" → Heartbeats

"One agent can't do everything well" → Multi-agent

This is how you could have invented OpenClaw.

Going Further

Our prototype covers the core architecture. Here's how OpenClaw extends each idea for production use - features worth exploring once you've outgrown the basics.

Browser with Semantic Snapshots

Most AI assistants can't see the web. OpenClaw gives the agent a browser via Playwright, but instead of sending screenshots (5MB each, expensive in tokens), it uses semantic snapshots - a text representation of the page's accessibility tree:

Each interactive element gets a numbered ref ID. When the agent wants to click something, it says "click ref=1" - which maps to exactly one element on the page. No guessing, no "click the blue button near the top." And since the snapshot is text instead of an image, it's roughly 100x smaller than a screenshot, which means far fewer tokens per page.

Session Scoping & Identity Links

Our prototype uses user ID as the session key. OpenClaw supports configurable scoping:

main (default): All DMs share one session — simple, great for single-user setups.

per-peer: Each person gets one session across all channels.

per-channel-peer: Each person per channel gets their own session.

Identity links let you merge sessions across channels for the same person, so Alice's Telegram and Discord conversations share the same history.

Channel Plugin System

Our prototype hardcodes Telegram + HTTP. OpenClaw uses a plugin architecture where each channel (Telegram, Discord, WhatsApp, Slack, Signal, iMessage) is a separate adapter that normalizes messages into a common format. Adding a new channel means writing one adapter, not touching any agent logic.

Vector Memory Search

Our keyword search works, but misses semantic matches ("auth bug" won't match "authentication issues"). OpenClaw's production memory uses a hybrid approach: vector search via SQLite with embedding extensions for semantic similarity, plus FTS5 for exact keyword matches. Configurable embedding providers include OpenAI, local models, Gemini, and Voyage.

Sub-agent Spawning

Our multi-agent setup uses manual routing. OpenClaw lets agents spawn sub-agents programmatically - a parent agent calls sessions_spawn, the child runs in its own context with a timeout, and returns results to the parent. This enables patterns like "research this topic in depth" where the main agent delegates to a specialist and continues when it's done.

Next Steps

If you want to build your own:

Start with one channel: get a Telegram or Discord bot working with sessions

Add tools incrementally: start with file read/write, then add shell execution

Add memory when you need it: once sessions reset, you'll want persistent memory

Add channels when you outgrow one: the gateway pattern emerges naturally

Add agents when tasks specialize: don't start with 10 agents, start with 2

Or just use OpenClaw. It's open source and handles all the edge cases we glossed over. But now you know how it works under the hood.

Link: http://x.com/i/article/2021347850656022528

OpenClaw 的强大之处,出奇地简单:它是一个网关,把一个 AI 智能体连接到你的消息应用,让它拥有与电脑交互的工具,并且能够在多次对话之间记住“你是谁”。

真正的复杂性来自同时处理多个渠道、管理持久会话、协调多个智能体,以及把整套系统做得足够可靠,能在你的机器上 24/7 运行。

这篇文章里,我会从零开始,一步步搭到 OpenClaw 的架构,展示你如何从第一性原理出发,只用一个消息 API、一个 LLM,以及让 AI 在聊天窗口之外真正有用的愿望,就能自己把它“发明”出来。

最终目标:理解“持久化 AI 助手”是如何工作的,这样你就能构建自己的版本。

首先,我们得把问题说清楚

当你在浏览器里使用 ChatGPT 或 Claude 时,会有一个根本限制:AI 被困在网页界面背后。你得去找它。它不会来找你。

想想你每天真正的沟通方式。你用 WhatsApp、Telegram、Discord、Slack、iMessage。对话发生在那里,而不是在某个单独的 AI 标签页里。

如果 AI 就……住在你的消息应用里,会怎样?如果你能像给朋友发短信一样给它发消息,而它能够:

记住你的偏好和过去的对话

在你的电脑上运行命令

替你浏览网页

代表你发送消息

按计划“醒来”,执行重复性任务

这就是 OpenClaw 做的事。它让 AI 活在你本来就沟通的地方,并且可以访问你的工具和你的上下文。

我们从零开始搭一个。

最简单可行的 Bot

先从绝对最小集合开始:一个能在 Telegram 上回复消息的 AI。

运行它,在 Telegram 上发一条消息,AI 就回复。很简单。

但这基本上就是 Claude 网页界面的更差版本。每条消息都是独立的。没有记忆。没有工具。没有人格。

那如果我们给它记忆呢?

目标:持久会话

我们这个简单 bot 的问题在于“无状态”。每条消息都是一段全新对话。你问它“我刚才说了什么?”,它完全不知道。

修复方法是会话(sessions)。为每个用户保留一份对话历史。

现在你就能进行真正的对话:

关键洞察是 JSONL 格式。每一行是一条消息。只追加写入。如果进程在写到一半时崩溃,最多丢一行。这正是 OpenClaw 用来保存会话转录的方式:

每个会话映射到一个文件。每个文件就是一段对话。重启进程,一切都还在。

但我们会碰到一个问题:对话会增长。最终它们会超过模型的上下文窗口。我们稍后会回来处理这个。

目标:加入人格(SOUL.md)

我们的 bot 能用了,但它没有人格。它只是一个通用 AI 助手。如果我们希望它是某个人呢?

OpenClaw 用 SOUL.md 解决:一个 Markdown 文件,用来定义智能体的身份、行为与边界。

这样你就不是在跟一个通用助手说话,而是在跟 Jarvis 说话。SOUL 会在每次 API 调用时作为 system prompt 注入。

在 OpenClaw 里,SOUL.md 放在智能体的 workspace 中:

它会在会话开始时加载,并注入 system prompt。你可以在里面写任何东西。给智能体写一个起源故事。定义它的核心哲学。列出它的行为规则。

SOUL 越具体,智能体行为就越一致。“要有帮助”太模糊。“做一个你真的愿意聊天的助手:需要时简洁,重要时详尽。不做企业腔。不谄媚。就……好。”——这会给模型更多可操作的约束。

目标:加入工具

只能说话的 bot 很受限。如果它能做事呢?

核心思路:给 AI 一组结构化工具,让它自己决定何时使用它们。

现在我们需要智能体循环(agent loop)。当 AI 想用某个工具时,我们执行工具,并把结果喂回去:

接着我们把 handle_message 更新为使用智能体循环,而不是直接调用 API:

现在你可以给你的 bot 发消息:

AI 自己决定用哪些工具、按什么顺序用,并把结果综合成自然语言回复。全程通过一条 Telegram 消息完成。

OpenClaw 的生产级工具目录要大得多——浏览器自动化、智能体间消息、子智能体生成等等——但每个工具都遵循同一个模式:一个 schema、一段描述、一个执行函数。

目标:权限控制

我们在从 Telegram 消息里执行命令。这太吓人了。如果有人拿到了你的 Telegram 账号,然后让 bot 执行 rm -rf / 呢?

我们需要一个权限系统。OpenClaw 的做法:一个审批允许列表,会记住你批准过什么。

我们把这些 helper 加到现有代码旁边,然后更新 execute_toolrun_command 分支来使用它们:

现在更新 execute_tool 中的 run_command 分支:执行前先检查权限:

当命令是安全的或之前已批准过,它会立刻运行。否则,智能体会收到“permission denied”,并可以尝试别的方案。审批会持久化到 exec-approvals.json,所以同一个命令你不会被问两次。

OpenClaw 在此基础上扩展了 glob 模式(一次批准 git *),以及一个三层模型:“ask”(询问用户)、“record”(记录但放行)、“ignore”(自动放行)。

目标:网关

从这里开始就变得有意思了。到目前为止,我们只有一个 Telegram bot。但如果你还想在 Discord 上用?在 WhatsApp 上用?在 Slack 上用?

你可以为每个平台分别写一个 bot。但那样你会有分离的会话、分离的记忆、分离的配置。Telegram 上的 AI 不会知道你在 Discord 上聊过什么。

解决方案:网关。一个中心进程管理所有渠道。

看看我们已经有什么。我们的 run_agent_turn 函数根本不知道 Telegram 的存在。它接收消息并返回文本。这就是关键——智能体逻辑已经与渠道解耦了。

为了证明这一点,我们加第二个接口。我们在 Telegram bot 旁边再加一个简单的 HTTP API,它们都对同一个智能体、同一套会话说话:

试一下:先在 Telegram 上告诉 bot 你的名字。然后用同一个 user ID(你的 Telegram user ID)通过 HTTP 查询,证明会话是共享的:

同一个智能体,同一套会话,同一份记忆。两个不同接口。这就是网关模式。

下一步就是把它做成配置驱动——用一个 JSON 文件指定要启动哪些渠道、以及如何认证。

这正是 OpenClaw 做的:它的网关通过单一配置文件管理 Telegram、Discord、WhatsApp、Slack、Signal、iMessage 等等。它还支持可配置的会话作用域——按用户、按渠道,或单一共享会话——让同一个人在多渠道获得统一体验。我们暂时先保持简单:用 user ID 作为会话键。

目标:上下文压缩

还记得我们前面标记的“会话增长”问题吗?跟 bot 聊了几周之后,会话文件里有成千上万条消息。总 token 数超过模型上下文窗口。现在怎么办?

修复方法:总结旧消息,保留新消息。把这两个函数加到现有代码旁边:

然后在 handle_message 顶部、加载会话之后,加入压缩检查:

试一下:为了不聊上几个小时来测试压缩,先临时把阈值调低:

聊 10–15 条消息,然后看旧消息被摘要替换。bot 仍然记得关键信息,但会话文件小了很多。

OpenClaw 的压缩更复杂——它按 token 数把消息切成块、分别摘要,并为估算误差预留安全边际——但核心思想完全一致。

目标:长期记忆

会话历史给你的是“对话记忆”。但如果你重置会话或新开一个会话呢?一切都没了。

我们需要一个独立的记忆系统——能跨会话重置而保存的持久知识。做法:给智能体工具,用文件存储来保存与搜索记忆。

把这两个工具加到 TOOLS 列表里:

把它们的分支加到 execute_tool

最后,更新 SOUL,让智能体知道“记忆”这回事:

试一下:

记忆之所以能持久,是因为它存放在文件里,而不是会话里。重置会话、重启 bot——记忆仍然在。

OpenClaw 的生产级记忆用向量检索加 embeddings 做语义匹配(因此 “auth bug” 能匹配 “authentication issues”),但我们的关键词搜索作为入门已经很好用了。

目标:命令队列

这里有个微妙但致命的问题:当两条消息同时到达,会发生什么?

比如你在 Telegram 发“查一下我的日历”,同时又通过 HTTP API 发“天气怎么样”。两边都尝试加载同一个会话、都尝试追加写入,于是你得到损坏的数据。

修复很简单:按会话加锁。对同一个会话一次只处理一条消息。不同会话仍然可以并行。

现在用这把锁把 handle_message 的主体包起来:

/chat HTTP endpoint 也做同样的事:

就这样——五行初始化。同一用户的消息会排队。不同用户的消息并行跑。没有竞态条件。

OpenClaw 在此基础上做了 lane-based 队列(为消息、cron 任务、子智能体分不同 lane),这样心跳任务永远不会阻塞实时对话。

目标:Cron Jobs(心跳)

到目前为止,我们的智能体只有在你跟它说话时才会响应。但如果你希望它每天早上检查邮件呢?或者在会议前总结日历呢?

你需要定时执行。我们加心跳——定期触发智能体的循环任务。

关键洞察:每个心跳使用自己的会话键(cron:morning-briefing)。这样定时任务不会弄乱你的主对话历史。心跳调用的仍然是同一个 run_agent_turn 函数——它只是另一条消息,只不过由定时器触发,而不是由人触发。

试一下:为了测试,把计划改成每分钟运行一次:

你会在终端看到心跳触发,智能体会响应。测试结束后再改回每日计划。

OpenClaw 支持完整的 cron 表达式(30 7 * * *),并把心跳路由到单独的命令队列 lane,确保它永远不会阻塞实时消息。

目标:多智能体

一个智能体很有用。但当任务越来越多,你会发现一个人格与一套工具很难把所有事都做得好。研究助理需要的指令和通用助理不同。

修复方法:多套智能体配置 + 路由。每个智能体有自己的 SOUL、自己的会话;你根据消息在它们之间切换。

更新 handle_message:把消息路由到正确的智能体:

试一下:

每个智能体有自己的对话历史,但它们共享同一个记忆目录。Scout 保存研究发现;Jarvis 之后可以搜索并取回。它们通过共享文件协作,而不需要直接互发消息。

OpenClaw 在此基础上扩展了子智能体生成(父智能体可以生成一个子智能体来做聚焦任务)和智能体间消息,但核心模式相同:每个智能体本质上就是 SOUL + 会话 + 工具。

把一切拼起来

我们把构建出的所有东西合成一份可运行的脚本。这是一个干净的独立 REPL,包含教程里的每个特性:会话、SOUL、工具、权限、压缩、记忆、命令队列、cron、多智能体路由。

我在这里整理了一个约 400 行的 mini-openclaw:

https://gist.github.com/dabit3/86ee04a1c02c839409a02b20fe99a492

把它保存为 mini-openclaw.py 并运行:

一个会话大概长这样:

记忆跨会话持久存在。智能体通过共享记忆文件协作。命令需要审批。心跳在后台运行。一切都在约 400 行内完成。

我们学到了什么

从一个简单的 Telegram bot 出发,我们构建了持久化 AI 助手的每个关键组件:

持久会话(JSONL 文件):抗崩溃的对话记忆。每个会话一个文件,每行一条消息。重启进程,一切都还在。

SOUL.md(system prompt):一个人格文件,把通用 AI 变成具体智能体,使其行为、边界与风格保持一致。

工具 + 智能体循环:结构化的工具定义,让 AI 自己决定何时行动。智能体循环调用 LLM,执行任何被请求的工具,把结果反馈回去,并重复直到完成。

权限控制:安全命令的允许列表 + 持久化审批,让危险操作必须获得明确同意。

网关模式:一个中心智能体,多种接口。Telegram、HTTP 或任何其他渠道——都对同一套会话、同一份记忆说话。

上下文压缩:当对话长到超过上下文窗口,摘要旧消息并保留新消息。bot 既保留知识,又不触碰 token 上限。

长期记忆:基于文件的存储,配合保存与搜索工具。跨会话重置仍然存在的知识,任何智能体都能访问。

命令队列:按会话加锁,防止多条消息同时到达时产生竞态与数据损坏。

心跳:定时器驱动的智能体运行,每个任务有自己的隔离会话。智能体醒来,完成任务,然后继续“睡觉”。

多智能体路由:多套智能体配置,SOUL 与会话键各不相同,按消息内容路由。智能体通过共享记忆文件协作。

这些组件都源自一个个实际问题:

“AI 记不住任何东西” → 会话

“它回复得像个通用聊天机器人” → SOUL.md

“它只能说,不能做” → 工具 + 智能体循环

“它会不经询问就执行危险命令” → 权限控制

“我希望它存在于所有消息应用里” → 网关

“对话太长了” → 压缩

“它在不同会话之间会忘事” → 记忆

“两条消息同时到达会弄坏数据” → 命令队列

“我希望它能自动做事” → 心跳

“一个智能体很难把所有事都做好” → 多智能体

这就是你如何自己发明 OpenClaw。

继续深入

我们的原型覆盖了核心架构。下面是 OpenClaw 如何把每个想法扩展到生产级——当你用完基础后,值得继续探索的特性。

带语义快照的浏览器

大多数 AI 助手“看不见”网页。OpenClaw 通过 Playwright 给智能体一个浏览器,但它不发送截图(每张 5MB,token 成本高),而是用语义快照——页面可访问性树的文本表示:

每个可交互元素都会得到一个编号的 ref ID。智能体想点击某个东西时,它会说“click ref=1”——这会精确映射到一个元素。没有猜测,也不会出现“点一下上方那个蓝色按钮”。而且由于快照是文本而不是图片,它大约比截图小 100 倍,这意味着每页消耗的 token 少得多。

会话作用域与身份链接

我们的原型用 user ID 作为会话键。OpenClaw 支持可配置的作用域:

main(默认):所有私信共享一个会话——简单,适合单用户部署。

per-peer:每个人在所有渠道共享一个会话。

per-channel-peer:每个人在每个渠道各自有会话。

身份链接让你把同一个人的多渠道会话合并起来,因此 Alice 在 Telegram 和 Discord 上的对话可以共享同一份历史。

渠道插件系统

我们的原型硬编码了 Telegram + HTTP。OpenClaw 用插件式架构:每个渠道(Telegram、Discord、WhatsApp、Slack、Signal、iMessage)都是一个独立 adapter,把消息归一化为统一格式。新增一个渠道意味着只写一个 adapter,而不需要改任何智能体逻辑。

向量记忆检索

我们的关键词搜索能用,但会漏掉语义匹配(“auth bug” 不会匹配 “authentication issues”)。OpenClaw 的生产级记忆用混合方案:SQLite 上的向量检索(带 embedding 扩展)做语义相似度匹配,外加 FTS5 做精确关键词匹配。可配置的 embedding 提供方包括 OpenAI、本地模型、Gemini、Voyage。

子智能体生成

我们的多智能体是手动路由。OpenClaw 允许智能体以编程方式生成子智能体——父智能体调用 sessions_spawn,子智能体在自己的上下文中运行(带超时),并把结果返回父智能体。这让“深入研究某个主题”这类模式成为可能:主智能体把任务委派给专家,完成后再继续。

下一步

如果你想构建自己的版本:

从一个渠道开始:先把 Telegram 或 Discord bot 跑起来,并实现会话

逐步加入工具:从文件读写开始,再加入 shell 执行

需要时再加记忆:一旦会话被重置,你就会想要持久记忆

一个渠道不够用时再加渠道:网关模式会自然浮现

任务开始分化时再加智能体:不要一开始就搞 10 个,先从 2 个开始

或者直接用 OpenClaw。它是开源的,并处理了我们略过的各种边界情况。但现在,你已经知道它底层是怎么运作的了。

Link: http://x.com/i/article/2021347850656022528

相关笔记

What makes OpenClaw powerful is surprisingly simple: it's a gateway that connects an AI agent to your messaging apps, gives it tools to interact with your computer, and lets it remember who you are across conversations.

The complexity comes from handling multiple channels simultaneously, managing persistent sessions, coordinating multiple agents, and making the whole thing reliable enough to run 24/7 on your machine.

In this post, I'll start from scratch and build up to OpenClaw's architecture step by step, showing how you could have invented it yourself from first principles, using nothing but a messaging API, an LLM, and the desire to make AI actually useful outside the chat window.

End goal: understand how persistent AI assistants work, so you can build your own.

First, let's establish the problem

When you use ChatGPT or Claude in a browser, there's a fundamental limitation: the AI lives behind a web interface. You go to it. It doesn't come to you.

Think about how you actually communicate day-to-day. You use WhatsApp, Telegram, Discord, Slack, iMessage. Your conversations happen there, not in some separate AI tab.

What if the AI just... lived in your messaging apps? What if you could text it like a friend and it could:

Remember your preferences and past conversations

Run commands on your computer

Browse the web for you

Send messages on your behalf

Wake up on a schedule and do recurring tasks

This is what OpenClaw does. It's an AI that lives where you already communicate, with access to your tools and your context.

Let's build one from scratch.

The Simplest Possible Bot

Let's start with the absolute minimum: an AI that responds to messages on Telegram.

Run it, send a message on Telegram, and the AI responds. Simple.

But this is basically a worse version of the Claude web interface. Every message is independent. No memory. No tools. No personality.

What if we gave it memory?

Goal: Persistent Sessions

A problem with our simple bot is statelessness. Every message is a fresh conversation. Ask it "what did I say earlier?" and it has no idea.

The fix is sessions. Keep a conversation history per user.

Now you can have an actual conversation:

The key insight is the JSONL format. Each line is one message. Append-only. If the process crashes mid-write, you lose at most one line. This is exactly what OpenClaw uses for session transcripts:

Each session maps to a file. Each file is a conversation. Restart the process and everything is still there.
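As a concrete illustration, a minimal JSONL session store might look like this (the directory and helper names are my own, not OpenClaw's):

```python
import json
import os

SESSIONS_DIR = "sessions"  # hypothetical location; one .jsonl file per session

def append_message(session_id, role, content):
    # Append-only: one JSON object per line. If the process crashes
    # mid-write, at most the last (partial) line is lost.
    os.makedirs(SESSIONS_DIR, exist_ok=True)
    path = os.path.join(SESSIONS_DIR, f"{session_id}.jsonl")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"role": role, "content": content}) + "\n")

def load_session(session_id):
    # Rebuild the conversation by replaying the file, skipping blank lines.
    path = os.path.join(SESSIONS_DIR, f"{session_id}.jsonl")
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Restarting the process changes nothing: `load_session` replays the file and the history is back.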

But we'll hit a problem: conversations grow. Eventually they'll exceed the model's context window. We'll come back to that.

Goal: Adding a Personality (SOUL.md)

Our bot works, but it has no personality. It's a generic AI assistant. What if we wanted it to be someone?

OpenClaw solves this with SOUL.md: a markdown file that defines the agent's identity, behavior, and boundaries.

Now instead of a generic assistant, you're talking to Jarvis. The SOUL gets injected as the system prompt on every API call.

In OpenClaw, the SOUL.md lives in the agent's workspace:

It gets loaded at session start and injected into the system prompt. You can write anything you want in there. Give the agent an origin story. Define its core philosophy. List its behavioral rules.

The more specific your SOUL, the more consistent the agent's behavior. "Be helpful" is vague. "Be the assistant you'd actually want to talk to. Concise when needed, thorough when it matters. Not a corporate drone. Not a sycophant. Just... good." - that gives the model something to work with.
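A sketch of the loading step — the workspace layout and fallback prompt here are my own assumptions:

```python
from pathlib import Path

DEFAULT_SOUL = "You are a helpful assistant."  # fallback when no SOUL.md exists

def load_soul(workspace="."):
    # Read once at session start; the whole file becomes the system prompt.
    soul_path = Path(workspace) / "SOUL.md"
    if soul_path.exists():
        return soul_path.read_text(encoding="utf-8")
    return DEFAULT_SOUL

def build_messages(workspace, history):
    # System prompt first, then the running conversation.
    return [{"role": "system", "content": load_soul(workspace)}] + history
```

Because the personality is a plain file, editing it needs no code change or redeploy — restart the session and the agent picks up the new identity.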

Goal: Adding Tools

A bot that can only talk is limited. What if it could do things?

The core idea: give the AI structured tools and let it decide when to use them.

Now we need the agent loop. When the AI wants to use a tool, we execute it and feed the result back:
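In outline, the loop looks something like this. `call_llm` and `run_tool` are stand-ins for a real LLM client and tool dispatcher, and the reply shape is invented for the sketch:

```python
def run_agent_turn(messages, call_llm, run_tool, max_steps=10):
    # Loop: ask the model; if it requests a tool, execute it, feed the
    # result back, and ask again. Stop when it answers with plain text.
    for _ in range(max_steps):
        reply = call_llm(messages)
        if reply.get("tool") is None:
            return reply["content"]  # final answer, no more tool calls
        result = run_tool(reply["tool"], reply.get("args", {}))
        messages = messages + [
            {"role": "assistant", "content": f"[tool call: {reply['tool']}]"},
            {"role": "tool", "content": result},
        ]
    return "(stopped after max_steps tool calls)"
```

The cap on iterations matters: without `max_steps`, a confused model can loop on tool calls forever.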

Now we update handle_message to use the agent loop instead of calling the API directly:

Now you can text your bot:

The AI decided which tools to use, in what order, and synthesized the results into a natural response. All through a Telegram message.

OpenClaw's production tool catalog is much larger - browser automation, inter-agent messaging, sub-agent spawning, and more - but every tool follows this exact pattern: a schema, a description, and an execution function.

Goal: Permission Controls

We're executing commands from Telegram messages. That's terrifying. What if someone gets access to your Telegram account and tells the bot to rm -rf /?

We need a permission system. OpenClaw's approach: an approval allowlist that remembers what you've approved.

We add these helpers alongside our existing code, then update the run_command case inside execute_tool to use them:

Now update the run_command case in execute_tool to check permissions before executing:

When a command is safe or previously approved, it runs immediately. When it's not, the agent gets told "permission denied" and can try a different approach. The approval gets persisted to exec-approvals.json, so you're never asked twice for the same command.
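A stripped-down version of the idea — the safe list is illustrative, while the approvals filename follows the article:

```python
import json
import os
import shlex

SAFE_COMMANDS = {"ls", "pwd", "date", "whoami"}  # assumed allowlist
APPROVALS_FILE = "exec-approvals.json"           # name taken from the article

def _load_approvals():
    if os.path.exists(APPROVALS_FILE):
        with open(APPROVALS_FILE) as f:
            return set(json.load(f))
    return set()

def is_allowed(command):
    # Allowed if the binary is on the safe list or the exact command
    # string was approved before.
    parts = shlex.split(command)
    if not parts:
        return False
    return parts[0] in SAFE_COMMANDS or command in _load_approvals()

def approve(command):
    # Persist the approval so the user is never asked twice.
    approvals = _load_approvals()
    approvals.add(command)
    with open(APPROVALS_FILE, "w") as f:
        json.dump(sorted(approvals), f)
```

Note that exact string matching is deliberately conservative: `git status` and `git  status` are different keys, which is annoying but safer than fuzzy matching.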

OpenClaw extends this with glob patterns (approve git * once) and a three-tier model: "ask" (prompt user), "record" (log but allow), and "ignore" (auto-allow).

Goal: The Gateway

Here's where it gets interesting. So far we have a Telegram bot. But what if you also want the AI on Discord? And WhatsApp? And Slack?

You could write separate bots for each platform. But then you'd have separate sessions, separate memory, separate configurations. The AI on Telegram wouldn't know what you discussed on Discord.

The solution: a gateway. One central process that manages all channels.

Look at what we already have. Our run_agent_turn function doesn't know anything about Telegram. It takes messages and returns text. That's the key - the agent logic is already decoupled from the channel.

To prove it, let's add a second interface. We'll add a simple HTTP API alongside our Telegram bot, both talking to the same agent and the same sessions:

Try it out: Tell the bot your name on Telegram. Then query via HTTP using the same user ID (your Telegram user ID) to prove the session is shared:

Same agent, same sessions, same memory. Two different interfaces. That's the gateway pattern.
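The shape of the pattern, reduced to its essentials — an echo stub stands in for `run_agent_turn`, and the channel envelopes are hand-rolled approximations:

```python
def handle_message(sessions, user_id, text):
    # Shared agent core: channel-agnostic, keyed only by user ID.
    history = sessions.setdefault(user_id, [])
    history.append({"role": "user", "content": text})
    reply = f"echo: {text}"  # stand-in for run_agent_turn(...)
    history.append({"role": "assistant", "content": reply})
    return reply

def telegram_adapter(sessions, update):
    # Telegram-style envelope: {"message": {"from": {"id": ...}, "text": ...}}
    msg = update["message"]
    return handle_message(sessions, str(msg["from"]["id"]), msg["text"])

def http_adapter(sessions, request_json):
    # HTTP-style envelope: {"user_id": "...", "text": "..."}
    return handle_message(sessions, request_json["user_id"], request_json["text"])
```

Each adapter only translates its channel's envelope into `(user_id, text)` and back; the agent core never learns which channel a message came from.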

The next step would be making this config-driven - a JSON file specifying which channels to start and how to authenticate them.

That's what OpenClaw does: its gateway manages Telegram, Discord, WhatsApp, Slack, Signal, iMessage, and more, all through a single config file. It also supports configurable session scoping - per-user, per-channel, or a single shared session - so the same person gets a unified experience across channels. We'll keep our simple user-ID-as-session-key approach for now.

Goal: Context Compaction

Remember the growing session problem we flagged earlier? After chatting with your bot for weeks, the session file has thousands of messages. The total token count exceeds the model's context window. Now what?

The fix: summarize old messages, keep recent ones. Add these two functions alongside your existing code:
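A minimal sketch of the compaction step, with `summarize()` standing in for an LLM call and a made-up retention threshold:

```python
KEEP_RECENT = 4  # assumed threshold; tune to your context window

def compact(history, summarize, keep_recent=KEEP_RECENT):
    # Fold everything older than the last keep_recent messages into one
    # summary message; keep the recent tail verbatim.
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarize(old)
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```

The real trade-off lives inside `summarize`: the more aggressively you compress, the more detail the bot silently forgets.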

Now add the compaction check at the top of handle_message, right after loading the session:

Try it out: To test compaction without chatting for hours, temporarily lower the threshold:

Have a conversation of 10-15 messages, then watch the old messages get replaced with a summary. The bot still remembers key facts, but the session file is much smaller.

OpenClaw's compaction is more sophisticated - it splits messages into chunks by token count, summarizes each chunk separately, and includes a safety margin for estimation inaccuracy - but the core idea is identical.

Goal: Long-Term Memory

Session history gives you conversation memory. But what happens when you reset a session or start a new one? Everything is gone.

We need a separate memory system - persistent knowledge that survives session resets. The approach: give the agent tools to save and search memories stored as files.
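A sketch of what those two tools could execute under the hood — directory layout and naming are assumptions, not OpenClaw's actual storage format:

```python
import os
import time

MEMORY_DIR = "memory"  # assumed location; shared by every agent

def save_memory(text):
    # One markdown file per memory; the filename is just a timestamp.
    os.makedirs(MEMORY_DIR, exist_ok=True)
    path = os.path.join(MEMORY_DIR, f"{time.time_ns()}.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path

def search_memory(query):
    # Naive keyword search: a memory matches if it contains every query word.
    words = query.lower().split()
    hits = []
    if not os.path.isdir(MEMORY_DIR):
        return hits
    for name in sorted(os.listdir(MEMORY_DIR)):
        with open(os.path.join(MEMORY_DIR, name), encoding="utf-8") as f:
            text = f.read()
        if all(w in text.lower() for w in words):
            hits.append(text)
    return hits
```

Because the store is just files on disk, it survives session resets and process restarts for free, and any agent pointed at the same directory can read it.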

Add these two tools to the TOOLS list:

Add their cases to execute_tool:

Finally, update the SOUL so the agent knows about memory:

Try it out:

The memory persists because it's stored in files, not in the session. Reset the session, restart the bot - the memories are still there.

OpenClaw's production memory uses vector search with embeddings for semantic matching (so "auth bug" matches "authentication issues"), but our keyword search works well for getting started.

Goal: Command Queue

Here's a subtle but critical problem: what happens when two messages arrive at the same time?

Say you send "check my calendar" on Telegram and "what's the weather" via the HTTP API simultaneously. Both try to load the same session, both try to append to it, and you get corrupted data.

The fix is simple: a per-session lock. Only one message processes at a time for each session. Different sessions can still run in parallel.
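The mechanism, sketched with asyncio (the helper names are mine):

```python
import asyncio

# One asyncio.Lock per session key, created lazily on first use.
_locks = {}

def get_lock(session_id):
    if session_id not in _locks:
        _locks[session_id] = asyncio.Lock()
    return _locks[session_id]

async def with_session_lock(session_id, coro_fn):
    # Serializes work per session; different sessions still run in parallel.
    async with get_lock(session_id):
        return await coro_fn()
```

Because waiters on an `asyncio.Lock` are queued, concurrent messages for the same session form an implicit FIFO queue with no extra machinery.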

Now wrap the body of handle_message with the lock:

Do the same for the /chat HTTP endpoint:

That's it - five lines of setup. Messages for the same user queue up. Messages for different users run in parallel. No race conditions.

OpenClaw extends this with lane-based queues (separate lanes for messages, cron jobs, and sub-agents) so heartbeats never block real-time conversations.

Goal: Cron Jobs (Heartbeats)

So far our agent only responds when you talk to it. But what if you want it to check your email every morning? Or summarize your calendar before meetings?

You need scheduled execution. Let's add heartbeats - recurring tasks that trigger the agent on a timer.

The key insight: each heartbeat uses its own session key (cron:morning-briefing). This keeps scheduled tasks from cluttering your main conversation history. The heartbeat calls the same run_agent_turn function - it's just another message, triggered by a timer instead of a human.
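A toy version using a plain timer thread — the real system uses proper cron parsing and the command queue, but the shape is the same:

```python
import threading

def start_heartbeat(name, interval_seconds, run_turn):
    # Fires run_turn on a timer under its own session key
    # (e.g. "cron:morning-briefing"), so scheduled runs never
    # clutter the main conversation history.
    session_key = f"cron:{name}"
    stop = threading.Event()

    def loop():
        # Event.wait doubles as an interruptible sleep: it returns True
        # (and ends the loop) as soon as stop.set() is called.
        while not stop.wait(interval_seconds):
            run_turn(session_key, f"[heartbeat] run the '{name}' task now")

    threading.Thread(target=loop, daemon=True).start()
    return stop  # call stop.set() to cancel
```

From the agent's point of view a heartbeat is just another incoming message; only the trigger (a timer instead of a human) differs.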

Try it out: For testing, change the schedule to run every minute:

You'll see the heartbeat fire in your terminal, and the agent will respond. Change it back to a daily schedule when you're done testing.

OpenClaw supports full cron expressions (30 7 * * *) and routes heartbeats through a separate command queue lane so they never block real-time messages.

Goal: Multi-Agent

One agent is useful. But as you add more tasks, you'll find a single personality and toolset can't cover everything well. A research assistant needs different instructions than a general assistant.

The fix: multiple agent configurations with routing. Each agent has its own SOUL, its own session, and you switch between them based on the message.

Update handle_message to route messages to the right agent:

Try it out:

Each agent has its own conversation history, but they share the same memory directory. Scout saves research findings; Jarvis can search for them later. They collaborate through shared files without needing direct messaging.

OpenClaw extends this with sub-agent spawning (a parent agent can spawn a child for a focused task) and inter-agent messaging, but the core pattern is the same: each agent is just a SOUL + session + tools.

Putting It All Together

Let's combine everything we've built into a single runnable script. This is a clean standalone REPL that includes every feature from the tutorial: sessions, SOUL, tools, permissions, compaction, memory, command queue, cron, and multi-agent routing.

I've put together a mini-openclaw in ~400 lines of code here:

https://gist.github.com/dabit3/86ee04a1c02c839409a02b20fe99a492

Save this as mini-openclaw.py and run it:

Here's what a session looks like:

The memory persists across sessions. Agents collaborate through shared memory files. Commands require approval. Heartbeats run in the background. All in ~400 lines.

What We've Learned

Starting from a simple Telegram bot, we built every major component of a persistent AI assistant:

Persistent sessions (JSONL files): Crash-safe conversation memory. Each session is one file, each line is one message. Restart the process and everything is still there.

SOUL.md (system prompt): A personality file that transforms a generic AI into a specific agent with consistent behavior, boundaries, and style.

Tools + Agent loop: Structured tool definitions that let the AI decide when to act. The agent loop calls the LLM, executes any requested tools, feeds results back, and repeats until done.

Permission controls: An allowlist of safe commands plus persistent approvals, so dangerous operations require explicit consent.

The gateway pattern: One central agent with multiple interfaces. Telegram, HTTP, or any other channel - they all talk to the same sessions and the same memory.

Context compaction: When conversations outgrow the context window, summarize old messages and keep recent ones. The bot keeps its knowledge without hitting token limits.

Long-term memory: File-based storage with save and search tools. Knowledge that survives session resets, accessible to any agent.

Command queue: Per-session locking to prevent race conditions when multiple messages arrive simultaneously.

Heartbeats: Scheduled agent runs on a timer, each with its own isolated session. The agent wakes up, does its task, and goes back to sleep.

Multi-agent routing: Multiple agent configurations with different SOULs and session keys, routed by message content. Agents collaborate through shared memory files.

Each of these emerged from a practical problem:

"The AI can't remember anything" → Sessions

"It responds like a generic chatbot" → SOUL.md

"It can only talk, not act" → Tools + Agent loop

"It runs dangerous commands without asking" → Permission controls

"I want it on all my messaging apps" → Gateway

"The conversation got too long" → Compaction

"It forgets things between sessions" → Memory

"Two messages at once corrupt the data" → Command queue

"I want it to do things automatically" → Heartbeats

"One agent can't do everything well" → Multi-agent

This is how you could have invented OpenClaw.

Going Further

Our prototype covers the core architecture. Here's how OpenClaw extends each idea for production use - features worth exploring once you've outgrown the basics.

Browser with Semantic Snapshots

Most AI assistants can't see the web. OpenClaw gives the agent a browser via Playwright, but instead of sending screenshots (5MB each, expensive in tokens), it uses semantic snapshots - a text representation of the page's accessibility tree:

Each interactive element gets a numbered ref ID. When the agent wants to click something, it says "click ref=1" - which maps to exactly one element on the page. No guessing, no "click the blue button near the top." And since the snapshot is text instead of an image, it's roughly 100x smaller than a screenshot, which means far fewer tokens per page.
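To make the idea concrete, here is a toy flattener over a hand-written tree; the node shape is invented and far simpler than a real accessibility tree:

```python
def snapshot(node, lines=None, refs=None):
    # Walk the tree depth-first. Interactive elements get a numbered ref;
    # everything else contributes plain context text.
    if lines is None:
        lines, refs = [], []
    role, name = node.get("role", ""), node.get("name", "")
    if role in ("button", "link", "textbox"):
        refs.append(node)
        lines.append(f'[ref={len(refs)}] {role} "{name}"')
    elif name:
        lines.append(f'{role} "{name}"')
    for child in node.get("children", []):
        snapshot(child, lines, refs)
    return "\n".join(lines), refs
```

The agent only ever sees the text on the left; "click ref=2" resolves through the `refs` list back to a unique node, so there is no ambiguity about which element was meant.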

Session Scoping & Identity Links

Our prototype uses user ID as the session key. OpenClaw supports configurable scoping:

main (default): All DMs share one session — simple, great for single-user setups.

per-peer: Each person gets one session across all channels.

per-channel-peer: Each person per channel gets their own session.

Identity links let you merge sessions across channels for the same person, so Alice's Telegram and Discord conversations share the same history.
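The three scopes plus identity links reduce to one key-derivation function. The link table below is invented sample data, not OpenClaw's storage format:

```python
# Maps channel-specific IDs to one canonical person.
IDENTITY_LINKS = {"telegram:111": "alice", "discord:222": "alice"}

def session_key(scope, channel, peer_id):
    canonical = IDENTITY_LINKS.get(f"{channel}:{peer_id}", f"{channel}:{peer_id}")
    if scope == "main":
        return "main"                  # every DM lands in one shared session
    if scope == "per-peer":
        return canonical               # one session per person, all channels
    if scope == "per-channel-peer":
        return f"{channel}:{peer_id}"  # one session per person per channel
    raise ValueError(f"unknown scope: {scope}")
```

With `per-peer` scoping, Alice's Telegram and Discord messages resolve to the same canonical key and therefore the same history.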

Channel Plugin System

Our prototype hardcodes Telegram + HTTP. OpenClaw uses a plugin architecture where each channel (Telegram, Discord, WhatsApp, Slack, Signal, iMessage) is a separate adapter that normalizes messages into a common format. Adding a new channel means writing one adapter, not touching any agent logic.

Vector Memory Search

Our keyword search works, but misses semantic matches ("auth bug" won't match "authentication issues"). OpenClaw's production memory uses a hybrid approach: vector search via SQLite with embedding extensions for semantic similarity, plus FTS5 for exact keyword matches. Configurable embedding providers include OpenAI, local models, Gemini, and Voyage.

Sub-agent Spawning

Our multi-agent setup uses manual routing. OpenClaw lets agents spawn sub-agents programmatically - a parent agent calls sessions_spawn, the child runs in its own context with a timeout, and returns results to the parent. This enables patterns like "research this topic in depth" where the main agent delegates to a specialist and continues when it's done.

Next Steps

If you want to build your own:

Start with one channel: get a Telegram or Discord bot working with sessions

Add tools incrementally: start with file read/write, then add shell execution

Add memory when you need it: once sessions reset, you'll want persistent memory

Add channels when you outgrow one: the gateway pattern emerges naturally

Add agents when tasks specialize: don't start with 10 agents, start with 2

Or just use OpenClaw. It's open source and handles all the edge cases we glossed over. But now you know how it works under the hood.


📋 讨论归档

讨论进行中…