🧠 阿头学 · 🪞 Uota学

Agent memory is infrastructure, not a feature: stop treating chat logs as memory

The core of a memory system is not "store more" but "structure + decay + conflict resolution": design the agent as an operating system, not a chatbot.

2026-01-29 · Original link ↗

Key takeaways

  • Chat history ≠ memory. Most people stuff conversation logs into the context window and call it "memory," but that is only a short-term cache. Real memory requires extracting facts, layering storage, and actively updating. This distinction is first-principles level.
  • The two long-term memory architectures fight different battles. The file-based three-layer structure (Resources → Items → Categories) suits companion/assistant scenarios; the knowledge-graph + vector hybrid suits scenarios that need precise relationships (CRM, research). Picking the wrong architecture is more dangerous than having none.
  • Memory must decay, or the system rots. Nightly consolidation, weekly summarization, monthly re-indexing: this is not optional maintenance but a precondition for the system's survival. Without decay, the agent drowns in stale information.
  • Conflict resolution is the overlooked key. Embeddings measure similarity, not truth. When a user changes jobs, the vector store returns both the old and the new, contradictory facts. Without time awareness and conflict handling, the agent hallucinates a "stitched-together" wrong answer.
  • An agent is an operating system, not a chatbot. Process management, memory management, I/O management, garbage collection: this mental model matters more than any specific technical solution.

Relevance to us

  • Directly relevant to Uota's memory system: Uota currently uses a two-layer memory/ log + MEMORY.md structure, which is essentially a simplified version of the article's "file-based memory." That means we can draw on the three-layer architecture (Resources → Items → Categories) plus the decay mechanism to upgrade it. A next step could be a memory cron that performs nightly consolidation and weekly summarization.
  • Neta character-memory scenario: characters in an AI social product need to remember user preferences and relationships across sessions, so the article's graph-memory architecture (vector + knowledge-graph hybrid) applies directly to the design of Neta's character memory. Worth evaluating whether Neta's current character-memory architecture has conflict resolution and decay.

Discussion starters

  • With its current memory/ + MEMORY.md two-layer structure, has Uota already, without meaning to, taken the "file-based memory" route? Is what's missing just automated decay and consolidation?
  • If Neta characters are to deliver a "never forgets" companion experience, how are conflicts (changes in a user's views or status) resolved today? Is time decay being used at all?



Related note

how to build an agent that never forgets

  • Source: https://x.com/rohit4verse/status/2012925228159295810?s=46
  • Mirror: https://x.com/rohit4verse/status/2012925228159295810?s=46
  • Published: 2026-01-18T16:29:07+00:00
  • Saved: 2026-01-29

Content

3 months ago, I was rejected from a technical interview because I couldn’t build an agent that never forgets.

Every approach I knew worked… until it didn’t.

I walked into that room confident. I’d built chatbots. I understood embeddings. I knew how to use vector databases.

But when the interviewer asked me to design an agent that could remember a user’s preferences across weeks, not just within a single conversation, I froze.

My instinct was the standard playbook: Store everything in a vector database and retrieve similar conversations when needed.

The questions that killed me were simple: What about scale? After a thousand sessions, how do you handle conflicting data? How do you stop it from faking memories just to fill the gaps?

I had no answer.

That failure forced me to actually deep dive and find a solution:

Most tutorials about "agents with memory" are teaching how to implement RAG for memory.

The problem isn't embeddings. It isn't token limits. It isn't even retrieval.

The problem is that memory is infrastructure, not a feature.

Here is the entire system I built to solve it and the code I used to do it.

The Real Problem With "Standard" Memory

Here is what I thought memory meant: Keeping the conversation history and stuffing it into the context window.

That works for about 10 exchanges. Then the context window fills up.

So you truncate old messages. Now your agent forgets the user is vegan and recommends a steakhouse.

You realize conversation history isn't memory; it's just a chat log.

"Fine," I thought. "I'll embed every message and retrieve relevant ones using similarity search."

This worked better. For a while.

But after two weeks, the vector database had 500 entries. When the user asked, "What did I tell you about my work situation?" the retrieval system returned fragments from 12 different conversations.

The agent saw:

"I love my job" (Week 1)

"I'm thinking about quitting" (Week 2)

"My manager is supportive" (Week 1)

"My manager micromanages everything" (Week 2)

Which one is true?

The agent had no idea. It hallucinated a synthesis: "You love your supportive manager but you're thinking about quitting because of micromanagement."

Completely wrong. The user had switched jobs between Week 1 and Week 2.

This is the crucial realization: Embeddings measure similarity, not truth.

Vector databases have a blind spot: they don't understand time, context, or updates. They just spit back text that looks mathematically close to what you asked for. That isn’t remembering; it’s guessing.

The fix required a mental shift. Memory isn't a hard drive. It’s a process. You can't just store data; you have to give it a lifespan and let it evolve.

Short-Term Memory: The Solved Problem

Before tackling the hard part (long-term memory), we need to handle short-term continuity.

Short-term memory is the ability to remember what was said 30 seconds ago. This is actually a solved problem.

The solution is Checkpointing.

Every agent operates as a state machine. It receives input, updates internal state, calls tools, generates output, and updates state again. A checkpoint is a snapshot of this entire state at a specific moment.

This gives you three capabilities:

Determinism: Replay any conversation.

Recoverability: Resume exactly where you left off if the agent crashes.

Debuggability: Rewind to inspect the agent's "thoughts."

In production, I use Postgres-backed checkpointers. Here is the pattern:
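The original snippet is not preserved in this capture. As a stand-in, here is a minimal sketch of the checkpointing pattern with an in-memory store; the class and method names are mine, and a production version would back `save`/`load` with a Postgres table as the author describes:

```python
import copy
import time
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """The full agent state at one moment: messages plus working data."""
    messages: list = field(default_factory=list)
    scratchpad: dict = field(default_factory=dict)

class InMemoryCheckpointer:
    """Snapshot/restore of agent state, keyed by thread id.

    A Postgres-backed version would replace the dict with inserts into
    a checkpoints(thread_id, ts, state_json) table.
    """
    def __init__(self):
        self._store = {}  # thread_id -> list of (timestamp, state)

    def save(self, thread_id: str, state: AgentState) -> None:
        # Deep-copy so later mutations don't corrupt the snapshot.
        self._store.setdefault(thread_id, []).append(
            (time.time(), copy.deepcopy(state))
        )

    def load(self, thread_id: str) -> AgentState:
        # Recoverability: resume from the most recent checkpoint.
        return copy.deepcopy(self._store[thread_id][-1][1])

    def rewind(self, thread_id: str, steps_back: int) -> AgentState:
        # Debuggability: inspect an earlier snapshot of the agent's state.
        return copy.deepcopy(self._store[thread_id][-1 - steps_back][1])
```

Saving after every state transition gives the determinism, recoverability, and debuggability listed above.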

This handles the "now." But checkpoints are ephemeral. They don't build wisdom. For that, we need Long-Term Architectures.

Long-Term Memory Architectures

After months of failure, I found two architectures that actually work.

Architecture A: File-Based Memory (The Self-Organizing System)

This mimics how humans categorize knowledge. It works best for assistants, therapists, or companions.

The Three-Layer Hierarchy:

Layer 1: Resources (Raw Data). The source of truth. Unprocessed logs, uploads, transcripts. Immutable and timestamped.

Layer 2: Items (Atomic Facts). Discrete facts extracted from resources ("User prefers Python," "User is allergic to shellfish").

Layer 3: Categories (Evolving Summaries). The high-level context. Items are grouped into files like work_preferences.md or personal_life.md.

The Write Path: Active Memorization

When new information arrives, the system doesn't just file it away; it processes it. It pulls up the existing summary for that category and actively weaves the new detail into the narrative. This handles contradictions automatically: if a user mentions they’ve switched to Rust, the system doesn't just add 'Rust' to the list; it rewrites the profile to replace the old preference.
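A sketch of that write path, with the LLM calls stubbed out; `extract_fact` stands in for a real model-driven extractor, and all names here are mine, not the author's:

```python
def extract_fact(message: str):
    """Placeholder fact extractor. In the real system an LLM turns a raw
    message into an atomic (attribute, value) item; here two hard-coded
    patterns stand in for it."""
    if "switched to" in message:
        return ("preferred_language", message.split("switched to")[-1].strip(" ."))
    if "I prefer" in message:
        return ("preferred_language", message.split("I prefer")[-1].strip(" ."))
    return None

def memorize(category: dict, message: str) -> dict:
    """Active memorization: weave the new fact into the category profile,
    replacing a contradicted value instead of appending alongside it."""
    fact = extract_fact(message)
    if fact:
        attr, value = fact
        old = category.get(attr)
        if old and old != value:
            # Contradiction: archive the old value rather than keeping both.
            category.setdefault("history", []).append((attr, old))
        category[attr] = value
    return category
```

The key design choice is that a write is a rewrite: the profile always holds exactly one current value per attribute, and superseded values move to history instead of accumulating as contradictions.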

The Read Path (Tiered Retrieval): To save tokens, you don't pull everything.

Pull Category Summaries.

Ask LLM: "Is this enough?"

If yes -> Respond.

If no -> Drill down into specific items.
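The tiered read path above can be sketched as follows; `is_sufficient` is a crude keyword stand-in for the LLM's "is this enough?" check, and the function names are illustrative, not from the article:

```python
def is_sufficient(question: str, summary: str) -> bool:
    """Placeholder for the LLM sufficiency check: here we just test
    whether every substantive keyword of the question appears in the
    summary."""
    keywords = [w.strip("?.,!") for w in question.lower().split()]
    keywords = [w for w in keywords if len(w) > 3]
    return all(w in summary.lower() for w in keywords)

def tiered_retrieve(question: str, summaries: dict, items: dict) -> str:
    """Read path: try cheap category summaries first, drill into the
    per-category item lists only when no summary can answer."""
    for name, summary in summaries.items():
        if is_sufficient(question, summary):
            return summary  # tier 1 was enough; no extra tokens spent
    # Tier 2: fall back to the full item lists (more tokens, more detail).
    return "\n".join(fact for facts in items.values() for fact in facts)
```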

This works beautifully for narrative coherence. But it struggles with complex relationships. For that, you need graphs.

Architecture B: Context-Graph Memory (The Knowledge Web)

File-based memory struggles with complex relationships. For precise systems (CRM, Research), you need a Graph.

Hybrid Structure

Vector store for discovery, used to surface related or similar text.

Knowledge graph for precision, storing facts as subject–predicate–object relationships.

Conflict resolution

We also built in conflict resolution. If the graph currently says the user works at Google, but a new message places them at OpenAI, the system doesn't just add a second job. Instead, it recognizes the contradiction, archives the Google connection as 'past history,' and makes OpenAI the active employer.
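That supersede-don't-duplicate behavior can be sketched with a tiny triple store; the `FactGraph` class and its `FUNCTIONAL` predicate set are my illustrative naming, not the author's implementation:

```python
class FactGraph:
    """Tiny subject-predicate-object store with conflict resolution: a
    new fact for a functional predicate (one current value at a time,
    e.g. works_at) supersedes the old edge instead of coexisting."""

    FUNCTIONAL = {"works_at", "lives_in"}  # predicates with one live value

    def __init__(self):
        self.edges = []  # dicts: subject, predicate, object, current

    def assert_fact(self, subj, pred, obj):
        if pred in self.FUNCTIONAL:
            for e in self.edges:
                if e["subject"] == subj and e["predicate"] == pred and e["current"]:
                    e["current"] = False  # archive as past history
        self.edges.append({"subject": subj, "predicate": pred,
                           "object": obj, "current": True})

    def current(self, subj, pred):
        # Return the active value, ignoring archived history.
        for e in reversed(self.edges):
            if e["subject"] == subj and e["predicate"] == pred and e["current"]:
                return e["object"]
        return None
```

Nothing is deleted: the Google edge survives as history, so "where did the user work before?" remains answerable.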

Retrieval involves running parallel searches (Vector + Graph traversal) and merging the results.

Hybrid Search

Retrieval runs two searches in parallel:

Vector Search: Find semantically similar conversations.

Graph Traversal: Find entities connected to the query.

The results merge into a unified context. This prevents the "remembers everything but knows nothing" problem.
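A minimal sketch of the merge step, assuming toy inputs (the data shapes and the `hybrid_search` name are mine): vector hits arrive as scored text, graph hits as edges from the queried entity, and both land in one context list.

```python
def hybrid_search(vector_hits, graph, entity):
    """Merge candidates from the two parallel searches into one context.
    vector_hits: (text, similarity) pairs from the vector store.
    graph: dict entity -> list of (relation, other_entity) edges."""
    context = []
    # Path 1: semantically similar text, best match first.
    for text, score in sorted(vector_hits, key=lambda h: -h[1]):
        context.append(f"[similar] {text}")
    # Path 2: precise facts reachable from the queried entity.
    for relation, other in graph.get(entity, []):
        context.append(f"[fact] {entity} {relation} {other}")
    return context
```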

Memory refresh, Decay, and Cron jobs

Here is what nobody tells you: Memory must decay.

"Never forget" doesn't mean "remember every single token." It means "remember what matters."

If you don't prune your database, your agent becomes confused, slow, and expensive.

I run background Cron jobs to keep the system healthy:

Nightly Consolidation

Every night at 3 AM, a background process reviews the day's conversations. It looks for patterns the agent missed during live operation. It merges redundant memories. It promotes frequently-accessed items to higher-priority storage.

Weekly Summarization

Once a week, the system re-summarizes category files. It compresses old items into higher-level insights. It prunes memories that haven't been accessed in 90 days.

Monthly Re-indexing

On a monthly basis, we run a full re-index of the memory store.

Embeddings are rebuilt with the latest model version, and graph edges are adjusted based on real usage.

Anything that hasn’t been touched in a while gets archived.
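The pruning rule in the weekly pass can be sketched as a pure function over memory items; the 90-day threshold is the article's, the function and field names are mine:

```python
from datetime import datetime, timedelta

def weekly_maintenance(items, now=None, max_idle_days=90):
    """Weekly pass over memory items: prune anything not accessed within
    max_idle_days, keep the rest. Each item is a dict carrying a
    'last_accessed' datetime."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_idle_days)
    kept, pruned = [], []
    for item in items:
        (kept if item["last_accessed"] >= cutoff else pruned).append(item)
    return kept, pruned
```

Run from a cron job, the `pruned` list would be archived (or compressed into a summary) rather than discarded outright.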

This maintenance keeps memory systems healthy for months.

Without it, they rot.

How retrieval works at inference time

Most retrieval systems fail because they rely solely on vector similarity. That’s a mistake. A robust memory system works backwards from the constraints of the context window. It starts with a broad search using a synthesized query, not the raw user input. Then, it treats those search results as prospects, not answers. We filter those prospects through a "relevance scorer" and a "time-decay" function. This ensures that a slightly less relevant but highly recent memory often beats a perfect match from six months ago. The result is a prompt that contains only the 5-10 memory tokens that actually move the needle, rather than a wall of similar-sounding text.
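The recency-beats-stale-relevance behavior described above follows from scoring candidates as similarity discounted by age. A sketch under assumed parameters (the 30-day half-life and the function names are mine, not the author's):

```python
def score(candidate, now_days, half_life_days=30.0):
    """Rank a retrieval candidate by similarity discounted by age.
    Exponential decay: a memory loses half its weight every
    half_life_days."""
    age = now_days - candidate["day"]
    decay = 0.5 ** (age / half_life_days)
    return candidate["similarity"] * decay

def select_memories(candidates, now_days, k=5):
    """Filter prospects down to the top-k that survive relevance + decay,
    keeping the final prompt to a handful of memories."""
    ranked = sorted(candidates, key=lambda c: -score(c, now_days))
    return ranked[:k]
```

With these numbers, a six-month-old perfect match (similarity 1.0, decayed to ~0.016) loses to a five-day-old memory at similarity 0.8 (~0.71), exactly the behavior the paragraph describes.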

This ensures the agent sees only what it needs. Nothing more. Nothing less.

Why most people fail at this

After building this system, I understood why I failed that interview. Most implementations fail in production because they make five critical mistakes:

Mistake 1: Storing raw conversations forever Conversations are noisy. If you store every "um" and "like," your memory becomes polluted. Extract facts, not transcripts.

Mistake 2: Blind embedding usage Embeddings find similarity, not truth. "I love my job" and "I hate my job" embed very similarly. You need resolution logic.

Mistake 3: No memory decay Without decay, your agent drowns in the past. It remembers your vacation plans from two years ago but forgets your current deadline.

Mistake 4: No write rules If the agent writes to memory whenever it wants, it will write junk. Define explicit rules for what deserves to be remembered.

Mistake 5: Treating memory as chat history This is the fatal mistake. Chat history is ephemeral. Memory is a structured representation of what was learned.

The Mental model

The real breakthrough happened when we stopped looking at agents as simple chatbots and started treating them like operating systems. An agent needs the exact same capabilities:

Process Management: Track multiple concurrent tasks.

Memory Management: Allocate, update, and free knowledge.

I/O Management: Interface with tools and users.

Most importantly, it requires a sophisticated memory architecture. You need "RAM" for the fast, volatile context of the current conversation, but you also need a "hard drive": a persistent, indexed way to store knowledge that survives after the session ends. If you don't run regular maintenance on that memory, much like garbage collection, the system eventually breaks down.

The Before and After

Three months ago:

Today:

The difference between a chatbot and a companion is memory.

The difference between memory and good memory is architecture.

If you're building agents, this is no longer optional. Users expect persistence. They expect learning. They expect the agent to remember who they are.

Three months ago, I couldn't build this. Now I've shipped agents that remember customer preferences across thousands of sessions.

The interview rejection that felt like failure became the catalyst for understanding what production systems actually require.

Storage is cheap. Structure is hard. But structure is what transforms a stateless language model into something that genuinely never forgets.

The agents of tomorrow won't just have more parameters or better training data. They'll have memory systems that learn, evolve, and improve with every interaction.

And now you know how to build them.

Link: http://x.com/i/article/2008980750738313221

📋 Discussion archive

Discussion in progress…