🧠 阿头学 · 💬 讨论题

Claude Code 七层记忆架构:本质是缓存经济学,不是“会思考的记忆魔法”

这篇拆解最有价值的判断是:Claude Code 的核心竞争力不是“记忆更智能”,而是把长上下文问题改造成一套以 prompt cache 为中心的成本工程;但作者明显高估了架构优雅性,低估了复杂系统的失真与污染风险。

2026-04-01

核心观点

  • 本质是成本架构,不是玄学记忆 七层记忆听起来像能力飞跃,但从工具结果外存、微压缩、会话记忆到梦境整合,真正一以贯之的目标都是减少上下文重算和 API 开销;“梦境”这个命名有传播力,但本质仍是后台摘要、索引和清洗。
  • Prompt cache 才是系统第一性原理 文中最站得住的部分不是“七层”本身,而是它对字节级前缀一致性的偏执维护:冻结工具结果替换决策、复用渲染后的 system prompt、构造共享前缀的 fork、用 cache_edits 在服务端删内容而不改本地消息,这些都说明系统优先优化的是缓存命中率而非表面智能。
  • “先便宜后昂贵”的分层防御非常强 工具结果先持久化、旧结果再微压缩、会话中持续记笔记、最后才做全量压缩,这套漏斗式设计判断上是成熟的;它把“上下文快爆了再慌张总结”的粗暴模式,替换成了持续维护工作记忆的异步系统。
  • 长期记忆有价值,但污染风险被明显低估 自动记忆提取和梦境整合确实提高跨会话复用率,但任何让模型自动合并、删除、纠正文档的系统都天然有事实漂移风险;文章对这一点着墨不足,容易让人误以为“能整合”就等于“整合得准”。
  • 源码拆解扎实,但效果论证不够 作者在文件路径、阈值、锁机制、熔断器、互斥条件上给了大量硬证据,这使“系统怎么工作”基本可信;但“这样工作后任务成功率是否更高、摘要是否失真、长期记忆是否稳定”几乎没有数据支撑,这是文章最大的短板。

跟我们的关联

  • 对 ATou 意味着什么、下一步怎么用 对 ATou,这意味着做 agent 不能先迷信“长记忆”,而要先设计成本漏斗;下一步可以把自己的系统拆成“外存预览/轻压缩/会话笔记/长期记忆”四层,而不是一上来做统一 memory。
  • 对 Neta 意味着什么、下一步怎么用 对 Neta,这意味着产品评估不能只看模型效果,而要看缓存命中、摘要触发率、失败熔断是否健康;下一步应把 prompt cache hit rate、summary 频率、memory 污染率列成一等指标。
  • 对 Uota 意味着什么、下一步怎么用 对 Uota,这意味着“梦境”“长期记忆”这类命名很容易制造智能幻觉,分析时必须拆回工程动作;下一步讨论时应区分“叙事包装”与“真实能力”,尤其追问压缩后保真度。
  • 对三者共同意味着什么、下一步怎么用 三者都该接受一个更硬的判断:AI agent 的竞争不只是模型强弱,而是预算调度能力;下一步可以复用这篇里的方法论做任何 AI 工作流设计——先让便宜层挡住贵层,再给每一层加熔断和回退。

讨论引子

1. 如果 prompt cache 真是 agent 系统的第一性原理,那我们今天对“模型能力”的重视是否被高估了?
2. 会话记忆和梦境整合在工程上很优雅,但一旦发生错误固化,谁来承担“长期记忆污染”的代价?
3. 企业级复杂代码库里,把“架构与约定”排除在长期记忆之外,到底是节制,还是脱离真实场景?

对 Claude Code 外泄的 harness 内部所有记忆与上下文管理系统做一次完整逆向拆解,从亚毫秒级的轻量 token 修剪,到你睡觉时也会把记忆做整合的“梦境”系统。

1. 问题:无界世界里的有界上下文

LLM 有个根本限制,上下文窗口是固定的。Claude Code 通常运行在 200K token 的窗口里(加上 [1m] 后缀可扩到 1M)。一次编码会话很容易就超了,读几个文件,跑点 grep,来回改几轮,就到顶了。

Claude Code 用一套七层记忆架构解决这个问题,跨度从亚毫秒级 token 修剪到耗时数小时的后台“梦境”整合。每一层越往后越贵,但能力也越强。整个系统的设计目标是,让更便宜的层尽量挡住更昂贵的层,不让它们被触发。

Token 计数:地基

一切从知道用了多少 token 开始。权威函数是 src/utils/tokens.ts 里的 tokenCountWithEstimation():

标准 token 计数 = 上一次 API 响应的 usage.input_tokens + 对之后新增消息的粗略估算

粗估用的是一个很简单的启发式规则:大多数文本按 每 token 4 字节估,JSON(token 更密)按 每 token 2 字节估。图片和文档不管大小,都统一按 2,000 token 估算。
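按上面的启发式规则,可以写一个最小 TypeScript 示意(类型名与函数签名均为本文假设,并非源码):

```typescript
// 估算规则:普通文本每 token 约 4 字节,JSON 每 token 约 2 字节,
// 图片/文档不看大小,固定按 2000 token 估。
type Block =
  | { kind: 'text'; content: string }
  | { kind: 'json'; content: string }
  | { kind: 'image' | 'document' };

function estimateTokens(block: Block): number {
  switch (block.kind) {
    case 'text':
      return Math.ceil(Buffer.byteLength(block.content, 'utf8') / 4);
    case 'json':
      return Math.ceil(Buffer.byteLength(block.content, 'utf8') / 2);
    default:
      return 2000; // 图片与文档统一固定估算
  }
}

// 标准 token 计数 = 上一次 API 响应的 input_tokens + 新增消息的粗估
function tokenCountWithEstimation(
  lastUsageInputTokens: number,
  newBlocks: Block[],
): number {
  return newBlocks.reduce((sum, b) => sum + estimateTokens(b), lastUsageInputTokens);
}
```

这种“上次真实计数 + 增量粗估”的组合,避免了每一轮都做昂贵的精确 tokenize。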

上下文窗口如何确定

系统按一条优先级链路来决定可用的上下文窗口:

[1m] 模型后缀 → 模型能力查询 → 1M beta header → 环境变量覆盖 → 200K 默认值

有效上下文窗口会先扣掉 20K token 作为压缩输出的预留。窗口不能全用满,因为还得留位置生成那段救命的总结。
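这条优先级链路与 20K 预留可以勾勒成如下示意(参数名为假设,仅表达判定顺序):

```typescript
// 优先级:[1m] 模型后缀 → 模型能力查询 → 1M beta header → 环境变量覆盖 → 200K 默认值
function resolveContextWindow(opts: {
  modelName: string;
  capabilityWindow?: number;
  has1mBetaHeader?: boolean;
  envOverride?: number;
}): number {
  if (opts.modelName.endsWith('[1m]')) return 1_000_000;
  if (opts.capabilityWindow) return opts.capabilityWindow;
  if (opts.has1mBetaHeader) return 1_000_000;
  if (opts.envOverride) return opts.envOverride;
  return 200_000;
}

// 有效窗口 = 窗口 - 20K 压缩输出预留
const COMPACT_OUTPUT_RESERVE = 20_000;
function effectiveWindow(contextWindow: number): number {
  return contextWindow - COMPACT_OUTPUT_RESERVE;
}
```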

2. 架构总览:七层记忆

每一层由不同条件触发,成本也不同。系统的设计目标是:只要可能,就让第 N 层阻止第 N+1 层被触发。


3. 第 1 层:工具结果存储

  • File: src/utils/toolResultStorage.ts

  • Cost: 只有磁盘 I/O,没有 API 调用

  • When: 每次产生工具结果,立刻执行

问题

在代码库里跑一次 grep,很容易就返回 100KB+ 文本。大文件 cat 一下也可能有 50KB。这些结果会吞掉大量上下文,而且对话推进几分钟后就变得不再重要。

解法

每个工具结果在进上下文前都会走一套预算系统:

当结果超过阈值时:

  1. 完整结果写到磁盘 tool-results/<sessionId>/<toolUseId>.txt

  2. 预览(前 ~2KB)放进上下文,并包在 <persisted-output> 标签里

  3. 之后如果需要,模型可以用 Read 再把完整结果读回来

ContentReplacementState:缓存稳定的决策

这里有个关键细节:一旦某个工具结果被替换成预览,这个决定就会被 ContentReplacementState 冻结。后续每次 API 调用,同一个结果都会拿到同一个预览,这样才能保证提示前缀字节级一致,从而命中 prompt cache。这个状态还会被写进 transcript,所以即使恢复会话也能延续。

GrowthBook 覆盖

每个工具的阈值可以通过 tengu_satin_quoll 特性开关远程调整,让 Anthropic 不用发版就能给特定工具改持久化阈值。

4. 第 2 层:微压缩(Microcompaction)

  • File: src/services/compact/microCompact.ts

  • Cost: 零到极低的 API 成本

  • When: 每一轮对话,在 API 调用前

微压缩是最轻量的上下文减负。它不做总结,只是清掉那些大概率不会再用到的旧工具结果。

三种不同机制

a) 基于时间的微压缩

Trigger: 距离上一条 assistant 消息的空闲间隔超过阈值(默认:60 分钟)

Rationale: Anthropic 的服务端 prompt cache TTL 大概是 1 小时。如果 1 小时没发请求,缓存就过期了,整个提示前缀会被重新 tokenize。既然反正要重写,不如先把旧工具结果清掉,缩小要重写的体积。

Action: 把工具结果内容替换成 [Old tool result content cleared],同时至少保留最近 N 条结果(下限为 1)。

Configuration(通过 GrowthBook tengu_slate_heron):

TimeBasedMCConfig = {
  enabled: false,           // Master switch
  gapThresholdMinutes: 60,  // Trigger after 1h idle
  keepRecent: 5             // Keep last 5 tool results
}

b) 缓存微压缩(Cache-Editing API)

这是最有技术含量的一种。它不改本地消息(那会让 prompt cache 失效),而是用 API 的 cache_edits 机制,在不破坏前缀的前提下,从服务端缓存里把工具结果删掉。

工作方式:

  1. 工具结果出现时,会登记到全局 CachedMCState 里

  2. 当数量超过阈值,就挑出最旧的一批(扣掉“保留最近”的缓冲)准备删除

  3. 生成一个 cache_edits 块,并跟着下一次 API 请求一起发出去

  4. 服务端从缓存前缀里删除指定的工具结果

  5. 本地消息不变,删除只发生在 API 层

关键安全点:只在主线程运行。如果分叉出来的子代理(session_memory、agent_summary 等)去改全局状态,会把主线程的 cache editing 搞坏。
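“挑出最旧的一批、扣掉保留缓冲”的选择逻辑,可以用一个小示意表达(接口与阈值传参均为假设):

```typescript
// 登记在 CachedMCState 里的工具结果条目(字段名为假设)
interface CachedToolResult {
  toolUseId: string;
  registeredAt: number; // 登记时间戳,越小越旧
}

// 数量超过上限时,扣掉“保留最近 keepRecent 条”的缓冲,
// 返回要通过 cache_edits 在服务端删除的 toolUseId 列表
function selectForCacheEdit(
  results: CachedToolResult[],
  maxCount: number,
  keepRecent: number,
): string[] {
  if (results.length <= maxCount) return [];
  const sorted = [...results].sort((a, b) => a.registeredAt - b.registeredAt);
  const deletable = sorted.length - keepRecent;
  return deletable > 0 ? sorted.slice(0, deletable).map((r) => r.toolUseId) : [];
}
```

注意返回的只是删除候选:本地消息不动,真正的删除由下一次请求附带的 cache_edits 在 API 层完成。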

c) API 级上下文管理(apiMicrocompact.ts)

一种更新的服务端方案,通过 context_management 这个 API 参数来做:

ContextEditStrategy =
  | { type: 'clear_tool_uses_20250919',   // Clear old tool results
      trigger: { type: 'input_tokens', value: 180_000 },
      clear_at_least: { type: 'input_tokens', value: 140_000 } }
  | { type: 'clear_thinking_20251015',     // Clear old thinking blocks
      keep: { type: 'thinking_turns', value: 1 } | 'all' }

这让 API 服务器原生处理上下文管理,客户端不用再自己追踪或清理工具结果。

哪些工具结果允许被清理?

只有这些工具的结果会被清掉:

FileRead, Bash/Shell, Grep, Glob, WebSearch, WebFetch, FileEdit, FileWrite

缺席得很明显的有:thinking blocks、assistant 文本、用户消息、MCP 工具结果。

5. 第 3 层:会话记忆(Session Memory)

  • Files: src/services/SessionMemory/

  • Cost: 每次提取要跑一个分叉子代理的 API 调用

  • When: 对话过程中周期性触发(post-sampling hook)

核心想法

不等到上下文快满才慌张总结,而是持续维护笔记。这样一旦需要压缩,摘要已经在那儿了,不用再额外付一次昂贵的总结成本。

会话记忆模板

每个会话都会有一个 markdown 文件,路径是 ~/.claude/projects/<slug>/.claude/session-memory/<sessionId>.md,里面是结构化模板:

# Session Title
_A short and distinctive 5-10 word descriptive title_

# Current State
_What is actively being worked on right now?_

# Task specification
_What did the user ask to build?_

# Files and Functions
_Important files and their relevance_

# Workflow
_Bash commands usually run and their interpretation_

# Errors & Corrections
_Errors encountered and how they were fixed_

# Codebase and System Documentation
_Important system components and how they fit together_

# Learnings
_What has worked well? What has not?_

# Key results
_If the user asked for specific output, repeat it here_

# Worklog
_Step by step, what was attempted and done_

触发逻辑

会话记忆提取在同时满足两个条件时触发:

自上次提取以来 token 增长 ≥ minimumTokensBetweenUpdate AND(自上次提取以来工具调用次数 ≥ toolCallsBetweenUpdates OR 上一轮 assistant 没有工具调用)

token 门槛永远必须满足,即使工具调用门槛满足也不例外。上一轮无工具调用这条是为了捕捉自然的对话断点,也就是模型刚好结束了一段工作流程的时候。
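这条触发逻辑可以写成一个谓词示意(参数名为假设,仅表达“token 门槛必须满足,另外两条满足其一”的结构):

```typescript
function shouldExtractSessionMemory(s: {
  tokensSinceLast: number;          // 自上次提取以来的 token 增长
  toolCallsSinceLast: number;       // 自上次提取以来的工具调用次数
  lastTurnHadToolCalls: boolean;    // 上一轮 assistant 是否有工具调用
  minimumTokensBetweenUpdate: number;
  toolCallsBetweenUpdates: number;
}): boolean {
  // token 门槛永远必须满足
  const tokenGate = s.tokensSinceLast >= s.minimumTokensBetweenUpdate;
  // 工具调用门槛 OR 自然对话断点(上一轮无工具调用)
  const toolGate =
    s.toolCallsSinceLast >= s.toolCallsBetweenUpdates || !s.lastTurnHadToolCalls;
  return tokenGate && toolGate;
}
```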

提取如何执行

提取通过 runForkedAgent() 以分叉子代理执行:

  • querySource: 'session_memory'

  • 只允许用 FileEdit 改记忆文件(其他工具全部禁用)

  • 共享父代理的 prompt cache 以节省成本

  • 通过 sequential() 包装顺序执行,避免多次提取互相重叠

会话记忆压缩:兑现价值的地方

当 autocompact 触发时,会先尝试 trySessionMemoryCompaction():

  1. 检查会话记忆是否真的有内容(不是空模板)

  2. 直接把会话记忆 markdown 当作压缩摘要,不需要任何 API 调用

  3. 计算要保留的最近消息范围(从 lastSummarizedMessageId 向后扩,直到满足最小保留要求)

  4. 返回一个 CompactionResult,包含会话记忆摘要 + 保留的最近消息

Configuration:

SessionMemoryCompactConfig = {
  minTokens: 10_000,          // Minimum tokens to preserve
  minTextBlockMessages: 5,     // Minimum messages with text blocks
  maxTokens: 40_000            // Hard cap on preserved tokens
}

关键洞察:会话记忆压缩比完整压缩便宜太多,因为摘要已经存在了。没有 summarizer API 调用,没有额外的 prompt 构造,也没有输出 token 成本,只是把会话记忆文件注入为摘要。

6. 第 4 层:完整压缩(Full Compaction)

  • File: src/services/compact/compact.ts

  • Cost: 一次完整 API 调用(输入=整段对话,输出=摘要)

  • When: 上下文超过 autocompact 阈值且会话记忆压缩不可用

触发条件

effective context window = context window - 20K(预留给输出)
autocompact threshold = effective window - 13K(缓冲)

如果 tokenCountWithEstimation(messages) > autocompact threshold → 触发
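按正文的两级扣减,阈值计算可以写成(常量与函数名为示意,非源码):

```typescript
const OUTPUT_RESERVE = 20_000;     // 预留给压缩输出
const AUTOCOMPACT_BUFFER = 13_000; // 阈值缓冲

// autocompact 阈值 = 上下文窗口 - 20K 输出预留 - 13K 缓冲
function autocompactThreshold(contextWindow: number): number {
  return contextWindow - OUTPUT_RESERVE - AUTOCOMPACT_BUFFER;
}

function shouldAutocompact(estimatedTokens: number, contextWindow: number): boolean {
  return estimatedTokens > autocompactThreshold(contextWindow);
}
```

对 200K 窗口,阈值即 167K token。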

熔断器(Circuit Breaker)

连续失败 3 次后,autocompact 本次会话里就不再尝试了。之所以加这一条,是因为发现有 1,279 个会话出现了 50+ 次连续失败(单会话最高 3,272 次),导致全球每天白白浪费大约 25 万次 API 调用。
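这个熔断器本身非常简单,一个计数器就够了(类名与方法为示意,非源码):

```typescript
// 连续失败 3 次后,本会话内不再尝试 autocompact;成功一次即复位
class CompactionCircuitBreaker {
  private consecutiveFailures = 0;
  constructor(private readonly maxFailures = 3) {}

  canAttempt(): boolean {
    return this.consecutiveFailures < this.maxFailures;
  }
  recordFailure(): void {
    this.consecutiveFailures++;
  }
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }
}
```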

压缩算法

Step 1: 预处理

  • 执行用户配置的 PreCompact hooks

  • 从消息里去掉图片(用 [image] 标记替代)

  • 去掉 skill discovery/listing 附件(稍后再注入回来)

Step 2: 生成摘要 系统会分叉一个 summarizer 代理,用很详细的提示请求一个 9 段式摘要:

1. Primary Request and Intent
2. Key Technical Concepts
3. Files and Code Sections (with code snippets)
4. Errors and Fixes
5. Problem Solving
6. All User Messages (verbatim — critical for intent tracking)
7. Pending Tasks
8. Current Work
9. Optional Next Step

提示里用了一种很巧的两段式输出结构:

  • 先输出 <analysis> 块,作为草稿区让模型整理思路

  • 再输出 <summary> 块,作为最终结构化摘要

  • <analysis> 块在摘要进入上下文前会被剥离,它能提升摘要质量,但不会消耗压缩后的 token 预算

Step 3: 压缩后恢复关键上下文

压缩完成后,会把关键上下文再注入回来:

  • 最近读过的前 5 个文件(每个 5K tokens,总预算 50K)

  • 已调用的 skill 内容(每个 5K tokens,总预算 25K)

  • Plan 附件(如果在 plan mode)

  • 延迟加载的工具 schema、agent 列表、MCP 指令

  • SessionStart hooks 重新执行(恢复 CLAUDE.md 等)

  • 会话元数据重新追加,供 --resume 展示

Step 4: 边界消息

用一条 SystemCompactBoundaryMessage 标记压缩点:

compactMetadata = {
  type: 'auto' | 'manual',
  preCompactTokenCount: number,
  compactedMessageUuid: UUID,           // Last msg before boundary
  preCompactDiscoveredTools: string[],  // Loaded deferred tools
  preservedSegment?: {                  // Session memory path only
    headUuid, anchorUuid, tailUuid
  }
}

部分压缩(Partial Compaction)

两种方向变体,用更“手术式”的方式管理上下文:

  • from direction: 总结 pivot index 之后的消息,保留更早的消息不动。因为保留的前缀不变,所以能保住 prompt cache

  • up_to direction: 总结 pivot 之前的消息,保留更晚的消息。因为摘要会改写前缀,所以会打破缓存

Prompt-Too-Long 的恢复策略

如果压缩请求本身就触发 prompt-too-long(对话大到 summarizer 都吃不下):

  1. 用 groupMessagesByApiRound() 按 API 轮次分组

  2. 丢掉最老的分组,直到 token 缺口补齐(或者在缺口无法解析时丢掉 20% 的分组)

  3. 最多重试 3 次

  4. 全部失败就抛出 ERROR_MESSAGE_PROMPT_TOO_LONG
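“丢最旧的分组直到补齐缺口,缺口无法解析就丢 20%”这一步可以用示意代码表达(分组结构与 token 计数方式为假设):

```typescript
// 一个 API 轮次分组:一组消息及其估算 token 数
interface ApiRound {
  messages: string[];
  tokens: number;
}

// tokenGap 为 null 表示缺口无法从错误信息解析,此时一次丢 20% 的分组
function dropOldestRounds(rounds: ApiRound[], tokenGap: number | null): ApiRound[] {
  if (tokenGap === null) {
    const drop = Math.max(1, Math.ceil(rounds.length * 0.2));
    return rounds.slice(drop);
  }
  let freed = 0;
  let i = 0;
  while (i < rounds.length && freed < tokenGap) {
    freed += rounds[i].tokens; // 从最旧的分组开始累计释放的 token
    i++;
  }
  return rounds.slice(i);
}
```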

7. 第 5 层:自动记忆提取(Auto Memory Extraction)

  • File: src/services/extractMemories/extractMemories.ts

  • Cost: 一次分叉代理 API 调用

  • When: 每个完整 query loop 结束时触发(模型给出最终回复且没有工具调用)

目的

会话记忆记录的是当前会话的笔记,而自动记忆提取会把信息沉淀成跨会话可复用的长期知识,持久化在 ~/.claude/projects/<path>/memory/ 里。

记忆类型

有四类记忆,每类各有自己的保存规则。

记忆文件格式

---
name: testing-approach
description: User prefers integration tests over mocks after a prod incident
type: feedback
---

Integration tests must hit a real database, not mocks.

**Why:** Prior incident where mock/prod divergence masked a broken migration.

**How to apply:** When writing tests for database code, always use the test database helper.

明确不保存什么

提取提示会明确排除:

  • 代码风格、约定、架构(都能从代码推出来)

  • Git 历史(用 git log/git blame)

  • 调试结论(修复已经写进代码)

  • 任何写在 CLAUDE.md 里的内容

  • 临时性任务细节

与主代理互斥

如果主代理在当前回合已经写过记忆文件,就会跳过提取,避免后台代理重复劳动:

function hasMemoryWritesSince(messages, sinceUuid): boolean {
  // 扫描针对 auto-memory 路径的 Edit/Write tool_use 块
  // 如果主代理已经保存过记忆则返回 true
}

执行策略

提取提示会要求代理在有限回合预算内高效执行:

Turn 1: 并行发出所有可能需要更新文件的 FileRead
Turn 2: 并行发出所有 FileWrite/FileEdit
不要在多个回合之间交错读写。

MEMORY.md:索引文件

MEMORY.md 是索引,不是记忆倾倒。每条索引应该是一行,长度大约小于 150 字符:

  • 测试方式 — 生产事故后只做真实 DB 集成测试,不用 mock
  • 用户画像 — 资深 Go 工程师,刚上手 React,关注可观测性

硬限制是 200 行或 25KB,先到哪个算哪个。超过 200 行的部分在加载进 system prompt 时会被截断。
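这个双重上限的截断逻辑可以写成一个小示意(按“先裁行数上限,再逐行回退到字节上限”实现,具体裁剪顺序为假设):

```typescript
const MAX_LINES = 200;
const MAX_BYTES = 25_000;

// 加载进 system prompt 前,把 MEMORY.md 截断到 200 行 / 25KB 以内
function truncateMemoryIndex(content: string): string {
  let lines = content.split('\n');
  if (lines.length > MAX_LINES) lines = lines.slice(0, MAX_LINES);
  let out = lines.join('\n');
  while (Buffer.byteLength(out, 'utf8') > MAX_BYTES) {
    lines = lines.slice(0, -1); // 按行回退,直到字节数达标
    out = lines.join('\n');
  }
  return out;
}
```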

8. 第 6 层:梦境(Dreaming)

  • File: src/services/autoDream/autoDream.ts

  • Cost: 一次分叉代理 API 调用(可能多回合)

  • When: 后台运行,在累积了足够时间与会话之后触发

概念

梦境是一种跨会话记忆整合。后台进程会回顾过去的会话 transcript,持续改进 memory 目录。它很像生物的记忆巩固机制,睡眠时会回放、归类、整合当天经验,写入长期存储。

门禁序列(先便宜后昂贵)

梦境系统用的是级联门禁,每一关都比下一关更便宜,所以大多数回合会很早就退出。

锁机制

<memoryDir>/.consolidate-lock 这个锁文件一物两用:

Path: <autoMemPath>/.consolidate-lock
Body: Process PID(单行)
mtime: lastConsolidatedAt 时间戳(锁本身就是时间戳)

  • Acquire: 写入 PID → mtime = now。随后重读验证 PID(防竞态)。

  • Success: mtime 保持为 now(标记整合完成时间)。

  • Failure: rollbackConsolidationLock(priorMtime) 用 utimes() 回滚 mtime。

  • Stale: 如果 mtime 超过 60 分钟且 PID 不在运行 → 回收。

  • Crash recovery: 检测到死 PID → 下一个进程回收。
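其中“过期回收”的判定条件很小,可以单独抽出来看(函数名为假设):

```typescript
const STALE_MINUTES = 60;

// mtime 超过 60 分钟且持锁 PID 已不在运行 → 锁可回收
function isLockStale(mtimeMs: number, nowMs: number, pidAlive: boolean): boolean {
  const ageMinutes = (nowMs - mtimeMs) / 60_000;
  return ageMinutes > STALE_MINUTES && !pidAlive;
}
```

两个条件缺一不可:只看时间会误杀长时间运行的整合任务,只看 PID 会在 PID 被复用时误判。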

四阶段整合

梦境代理会收到一个结构化提示,定义了四个阶段:

Phase 1 — 定位:

  • ls memory 目录

  • 读 MEMORY.md 了解当前索引

  • 快速翻已有主题文件,避免造重复

Phase 2 — 收集最近信号:

  • 如果有的话,查看日记 logs/YYYY/MM/YYYY-MM-DD.md

  • 检查是否出现漂移的记忆(与当前代码库矛盾的事实)

  • 针对具体线索,窄范围 grep transcript:grep -rn "<narrow term>" transcripts/ --include="*.jsonl" | tail -50

  • “不要把 transcript 全读一遍。只看那些你已经怀疑确实重要的点。”

Phase 3 — 整合:

  • 写入或更新记忆文件

  • 把新信号合并进已有主题文件,而不是新建近重复文件

  • 把相对日期改成绝对日期("yesterday" → "2026-03-30")

  • 在源头删除已被推翻的事实

Phase 4 — 修剪与索引:

  • 更新 MEMORY.md,保证不超过 200 行 / 25KB

  • 去掉指向过期/错误/被替代记忆的索引

  • 把冗长的索引条目缩短(细节应该放进主题文件)

  • 解决文件之间的矛盾

工具约束

梦境代理在严格限制下运行:

  • Bash: 只允许只读命令(ls、find、grep、cat、stat、wc、head、tail)

  • Edit/Write: 只能写 memory 目录路径

  • 不允许 MCP 工具、不允许 Agent 工具、不允许破坏性操作

UI 展示

梦境会在 footer pill 里作为后台任务显示,有一个两段式状态机:

DreamPhase: 'starting' → 'updating'(第一次 Edit/Write 落地时)
Status: running → completed | failed | killed

用户可以在后台任务对话框里杀掉梦境任务,锁的 mtime 会回滚,下次会话还能重试。

进度跟踪

系统会监控梦境代理的每一轮 assistant 输出:

  • 抓取文本块用于 UI 展示

  • 统计 tool_use 块数量

  • 收集 Edit/Write 的文件路径用于完成摘要

  • 实时展示最多保留最近 30 轮

9. 第 7 层:跨代理通信

  • Files: src/utils/forkedAgent.ts, src/tools/AgentTool/, src/tools/SendMessageTool/

  • Cost: 随具体模式变化

  • When: 代理生成、后台任务、队友协作

分叉代理模式(Forked Agent Pattern)

Claude Code 里几乎所有后台操作(会话记忆、自动记忆、梦境、压缩、agent summary 等)都建立在分叉代理模式之上,这是底座:

CacheSafeParams = {
  systemPrompt: SystemPrompt,      // Must be byte-identical to parent
  userContext: { [k: string]: string },
  systemContext: { [k: string]: string },
  toolUseContext: ToolUseContext,   // Contains tools, model, options
  forkContextMessages: Message[],  // Parent's conversation (cache prefix)
}

分叉会创建一个隔离的上下文,并复制可变状态:

  • readFileState: 复制 LRU 缓存(避免互相污染)

  • abortController: 子控制器与父控制器联动

  • denialTracking: 全新追踪状态

  • ContentReplacementState: 复制(保住缓存稳定的替换决策)

但它又通过保持 cache-critical 参数一致来共享 prompt cache。API 看到同一段前缀,就会命中缓存。

Agent 工具:生成子代理

Agent 工具支持多种生成模式:

防递归分叉:分叉出来的子代理仍然带着 Agent 工具(保证工具定义一致以便共享缓存),但会检测对话历史里的 <fork_boilerplate_tag>,从而拒绝递归分叉。

为了共享缓存而构造的分叉消息:

所有分叉子代理都会生成字节级一致的 API 请求前缀:

  1. 完整复制父代理上一条 assistant 消息(包含全部 tool_use、thinking、text)

  2. 再加一条用户消息,里面包含:
     - 为每个 tool_use 准备相同的占位结果
     - 每个子代理各自的指令文本块(只有这一段不同)

→ 尽可能让并发分叉之间共享同一段 prompt cache

SendMessage:代理间通信

SendMessage 工具让代理之间能在运行时互相发消息:

SendMessage({
  to: 'research-agent',   // 或用 '' 广播,或 'uds:<path>'、'bridge:<id>'
  message: '检查第 5 节',
  summary: '请求复核该章节'
})

路由逻辑:

  1. 进程内按名字找子代理 → queuePendingMessage() → 在下一轮工具边界时被 drain

  2. 环境团队(基于进程) → writeToMailbox() → 文件邮箱

  3. 跨会话 → 通过 bridge/UDS 调 postInterClaudeMessage()

结构化消息用于生命周期控制:

  • shutdown_request / shutdown_response — 协调代理优雅退出

  • plan_approval_response — leader 同意或拒绝队友的 plan

代理记忆:跨调用的持久状态

代理可以在三种范围里维持跨调用的持久记忆:

Agent Summary:周期性的进度快照

对 coordinator 模式的子代理,会每 30 秒把对话分叉一次,生成一个 3–5 个词的进度摘要:

Good: "Reading runAgent.ts"
Good: "Fixing null check in validate.ts"
Bad: "Investigating the issue" (too vague)
Bad: "Analyzed the branch diff" (past tense)

使用 Haiku(最便宜的模型),并禁用所有工具,这就是纯文本生成任务。

10. Query Loop:如何拼在一起

File: src/query.ts

11. Prompt 缓存优化

Claude Code 架构里最精妙的部分之一,是它对 prompt cache 的偏执优化。几乎每个设计决定都会先看对缓存的影响。

问题是什么

Anthropic 的 API 会在服务端缓存 prompt 前缀(TTL 大约 1 小时)。缓存命中意味着只为新增 token 付费。缓存未命中则要把整个 prompt 重新 tokenize。对于 200K token,这个差别大概是每次请求 ~$0.003 和 ~$0.60 的差距。

保持缓存不破的模式

1. CacheSafeParams:每个分叉代理(会话记忆、压缩、梦境、提取)都会继承父代理完全相同的 system prompt、工具和消息前缀。分叉后的 API 请求前缀一致 → 缓存命中。

2. renderedSystemPrompt:分叉线程直接复用父代理已经渲染好的 system prompt 字节,避免重复渲染带来的差异(比如 GrowthBook flag 在两次渲染之间变了)。

3. ContentReplacementState 复制:工具结果的持久化决策会被冻结。同样的结果每次都会得到同样的预览 → 前缀稳定。

4. 缓存微压缩:用 cache_edits 改服务端缓存而不改本地前缀 → 不破缓存。

5. 分叉消息构造:所有分叉子代理拿到字节级一致的前缀,只有最后一段指令不同 → 并发分叉最大化共享缓存。

6. 压缩后的缓存断裂通知:压缩后调用 notifyCompaction() 重置缓存基线,这样预期中的压缩后缓存未命中不会被误判为异常。

缓存断裂检测

系统会通过 promptCacheBreakDetection.ts 主动监控非预期的缓存未命中,并标记出来排查。已知合理的缓存断裂(压缩、微压缩等)会提前登记,避免误报。

12. 关键数字

上下文阈值

工具结果预算

会话记忆

压缩

梦境

微压缩

13. 设计原则

1. 分层防御,先便宜后昂贵

每一层上下文管理,都是为了不让下一层更贵的机制触发:

  • 工具结果存储让微压缩需要清掉的东西更少

  • 微压缩避免触发会话记忆压缩

  • 会话记忆压缩避免触发完整压缩

  • 完整压缩避免上下文溢出错误

2. Prompt 缓存优先

几乎所有设计决定都把 prompt cache 影响放在前面。系统会用尽办法保持 API 请求前缀字节级一致,包括冻结的 ContentReplacementState、复用渲染后的 system prompt、cache_edits API、统一的分叉消息构造。

3. 隔离但共享

分叉代理复制可变状态来隔离(避免互相污染),同时共享 prompt cache 前缀(避免成本爆炸)。这是个微妙平衡,隔离过度会浪费缓存,共享过度会引入 bug。

4. 到处都是熔断器

  • Autocompact:三次失败止损

  • Dream scan:10 分钟节流

  • Dream lock:基于 PID 的互斥锁 + 过期回收

  • 会话记忆:顺序执行包装

  • 记忆提取:与主代理写入互斥

5. 优雅降级

每套系统失败时都会静默让位给下一层。会话记忆压缩失败就返回 null → 交给完整压缩。梦境拿锁失败 → 下次会话再试。提取记忆出错 → 只记录日志,不抛出。

6. 特性开关作为紧急刹车

几乎每个系统都被 GrowthBook 特性开关门控:

  • tengu_session_memory — 会话记忆

  • tengu_sm_compact — 会话记忆压缩

  • tengu_onyx_plover — 梦境

  • tengu_passport_quail — 自动记忆提取

  • tengu_slate_heron — 基于时间的微压缩

  • CACHED_MICROCOMPACT — 缓存编辑微压缩

  • CONTEXT_COLLAPSE — 上下文折叠

  • HISTORY_SNIP — 消息裁剪

只要某个系统出问题,就能不发版快速回滚。

7. 需要时就互斥

  • Context Collapse ↔ Autocompact(collapse 自己管上下文)

How Claude Code Manages Memory: A Deep Technical Analysis

A comprehensive reverse-engineering of every memory and context management system inside Claude Code's leaked harness — from lightweight token pruning to a "dreaming" system that consolidates memories while you sleep.


1. The Problem: Bounded Context in an Unbounded World

LLMs have a fundamental constraint: a fixed context window. Claude Code typically operates with a 200K token window (expandable to 1M with the [1m] suffix). A single coding session can easily blow past this — a few file reads, some grep results, a handful of edit cycles, and you're at the limit.

Claude Code solves this with a 7-layer memory architecture that spans from sub-millisecond token pruning to multi-hour background "dreaming" consolidation. Each layer is progressively more expensive but more powerful, and the system is designed so cheaper layers prevent the need for more expensive ones.


Token Counting: The Foundation

Everything starts with knowing how many tokens you've used. The canonical function is tokenCountWithEstimation() in src/utils/tokens.ts:

Canonical token count = last API response's usage.input_tokens + rough estimates for messages added since

The rough estimation uses a simple heuristic: 4 bytes per token for most text, 2 bytes per token for JSON (which tokenizes more densely). Images and documents get a flat 2,000 token estimate regardless of size.


Context Window Resolution

The system resolves the available context window through a priority chain:

[1m] model suffix → model capability lookup → 1M beta header → env override → 200K default

The effective context window subtracts a 20K token reserve for compaction output — you can't use the full window because you need room to generate the summary that saves you.


2. Architecture Overview: 7 Layers of Memory

Each layer is triggered by different conditions and has different costs. The system is designed so Layer N prevents Layer N+1 from firing whenever possible.



3. Layer 1: Tool Result Storage

  • File: src/utils/toolResultStorage.ts

  • Cost: Disk I/O only — no API calls

  • When: Every tool result, immediately

The Problem

A single grep across a codebase can return 100KB+ of text. A cat of a large file might be 50KB. These results consume massive context and become stale within minutes as the conversation moves on.

The Solution

Every tool result passes through a budget system before entering context:

When a result exceeds its threshold:

  1. The full result is written to disk at tool-results/<sessionId>/<toolUseId>.txt

  2. A preview (first ~2KB) is placed in context, wrapped in <persisted-output> tags

  3. The model can later use Read to access the full result if needed

ContentReplacementState: Cache-Stable Decisions

A critical subtlety: once a tool result is replaced with a preview, that decision is frozen in ContentReplacementState. On subsequent API calls, the same result gets the same preview — this ensures the prompt prefix remains byte-identical for prompt cache hits. This state even survives session resume by being persisted to the transcript.

GrowthBook Override

Per-tool thresholds can be remotely tuned via the tengu_satin_quoll feature flag — allowing Anthropic to adjust persistence thresholds for specific tools without a code deploy.


4. Layer 2: Microcompaction

  • File: src/services/compact/microCompact.ts

  • Cost: Zero to minimal API cost

  • When: Every turn, before the API call

Microcompaction is the lightest-weight context relief. It doesn't summarize anything — it just clears old tool results that are unlikely to be needed.

Three Distinct Mechanisms

a) Time-Based Microcompact

Trigger: Idle gap since last assistant message exceeds threshold (default: 60 minutes)

Rationale: Anthropic's server-side prompt cache has a ~1 hour TTL. If you haven't sent a request in an hour, the cache has expired and the entire prompt prefix will be re-tokenized from scratch. Since it's being rewritten anyway, clear old tool results first to shrink what gets rewritten.

Action: Replaces tool result content with [Old tool result content cleared], keeping at least the most recent N results (floor of 1).

Configuration (via GrowthBook tengu_slate_heron):

TimeBasedMCConfig = {
  enabled: false,           // Master switch
  gapThresholdMinutes: 60,  // Trigger after 1h idle
  keepRecent: 5             // Keep last 5 tool results
}

b) Cached Microcompact (Cache-Editing API)

This is the most technically interesting mechanism. Instead of modifying local messages (which would invalidate the prompt cache), it uses the API's cache_edits mechanism to delete tool results from the server-side cache without invalidating the prefix.

How it works:

  1. Tool results are registered in a global CachedMCState as they appear

  2. When the count exceeds a threshold, the oldest results (minus a "keep recent" buffer) are selected for deletion

  3. A cache_edits block is generated and sent alongside the next API request

  4. The server deletes the specified tool results from its cached prefix

  5. Local messages remain unchanged — the deletion is API-layer only

Critical safety: Only runs on the main thread. If forked subagents (session_memory, agent_summary, etc.) modified the global state, they'd corrupt the main thread's cache editing.

c) API-Level Context Management (apiMicrocompact.ts)

A newer server-side approach using the context_management API parameter:

ContextEditStrategy =
  | { type: 'clear_tool_uses_20250919',   // Clear old tool results
      trigger: { type: 'input_tokens', value: 180_000 },
      clear_at_least: { type: 'input_tokens', value: 140_000 } }
  | { type: 'clear_thinking_20251015',     // Clear old thinking blocks
      keep: { type: 'thinking_turns', value: 1 } | 'all' }

This tells the API server to handle context management natively — the client doesn't need to track or manage tool result clearing.

Which Tools Are Compactable?

Only results from these tools get cleared:

FileRead, Bash/Shell, Grep, Glob, WebSearch, WebFetch, FileEdit, FileWrite

Notably absent: thinking blocks, assistant text, user messages, MCP tool results.


5. Layer 3: Session Memory

  • Files: src/services/SessionMemory/

  • Cost: One forked agent API call per extraction

  • When: Periodically during conversation (post-sampling hook)

The Idea

Instead of waiting until context is full and then desperately trying to summarize everything, continuously maintain notes about the conversation. Then when compaction IS needed, you already have a summary ready — no expensive summarization call required.

Session Memory Template

Each session gets a markdown file at ~/.claude/projects/<slug>/.claude/session-memory/<sessionId>.md with a structured template:

# Session Title
_A short and distinctive 5-10 word descriptive title_

# Current State
_What is actively being worked on right now?_

# Task specification
_What did the user ask to build?_

# Files and Functions
_Important files and their relevance_

# Workflow
_Bash commands usually run and their interpretation_

# Errors & Corrections
_Errors encountered and how they were fixed_

# Codebase and System Documentation
_Important system components and how they fit together_

# Learnings
_What has worked well? What has not?_

# Key results
_If the user asked for specific output, repeat it here_

# Worklog
_Step by step, what was attempted and done_

Trigger Logic

Session memory extraction fires when both conditions are met:

Token growth since last extraction ≥ minimumTokensBetweenUpdate AND (tool calls since last extraction ≥ toolCallsBetweenUpdates OR no tool calls in the last assistant turn)

The token threshold is always required — even if the tool call threshold is met. The "no tool calls in last turn" clause captures natural conversation breaks where the model has finished a work sequence.

Extraction Execution

The extraction runs as a forked subagent via runForkedAgent():

  • querySource: 'session_memory'

  • Only allowed to use FileEdit on the memory file (all other tools denied)

  • Shares the parent's prompt cache for cost efficiency

  • Runs sequentially (via sequential() wrapper) to prevent overlapping extractions

Session Memory Compaction: The Payoff

When autocompact triggers, it first tries trySessionMemoryCompaction():

  1. Check if session memory has actual content (not just the empty template)

  2. Use the session memory markdown as the compaction summaryno API call needed

  3. Calculate which recent messages to keep (expanding backward from lastSummarizedMessageId to meet minimums)

  4. Return a CompactionResult with the session memory as summary + preserved recent messages

Configuration:

SessionMemoryCompactConfig = {
  minTokens: 10_000,          // Minimum tokens to preserve
  minTextBlockMessages: 5,     // Minimum messages with text blocks
  maxTokens: 40_000            // Hard cap on preserved tokens
}

The key insight: Session memory compaction is dramatically cheaper than full compaction because the summary already exists. No summarizer API call, no prompt construction, no output token cost. The session memory file is simply injected as the summary.


6. Layer 4: Full Compaction

  • File: src/services/compact/compact.ts

  • Cost: One full API call (input = entire conversation, output = summary)

  • When: Context exceeds autocompact threshold AND session memory compaction unavailable

Trigger

effective context window = context window - 20K (reserved for output)
autocompact threshold = effective window - 13K (buffer)

If tokenCountWithEstimation(messages) > autocompact threshold → trigger

Circuit Breaker

After 3 consecutive failures, autocompact stops trying for the rest of the session. This was added after discovering that 1,279 sessions had 50+ consecutive failures (up to 3,272 in a single session), wasting approximately 250K API calls per day globally.

The Compaction Algorithm

Step 1: Pre-processing

  • Execute user-configured PreCompact hooks

  • Strip images from messages (replaced with [image] markers)

  • Strip skill discovery/listing attachments (will be re-injected)

Step 2: Generate Summary The system forks a summarizer agent with a detailed prompt requesting a 9-section summary:

1. Primary Request and Intent
2. Key Technical Concepts
3. Files and Code Sections (with code snippets)
4. Errors and Fixes
5. Problem Solving
6. All User Messages (verbatim — critical for intent tracking)
7. Pending Tasks
8. Current Work
9. Optional Next Step

The prompt uses a clever two-phase output structure:

  • First: <analysis> block — a drafting scratchpad where the model organizes its thoughts

  • Then: <summary> block — the actual structured summary

  • The <analysis> block is stripped before the summary enters context — it improves summary quality without consuming post-compact tokens

Step 3: Post-compact Restoration

After compaction, critical context is re-injected:

  • Top 5 recently-read files (5K tokens each, 50K total budget)

  • Invoked skill content (5K tokens each, 25K total budget)

  • Plan attachment (if in plan mode)

  • Deferred tool schemas, agent listings, MCP instructions

  • SessionStart hooks re-execute (restores CLAUDE.md, etc.)

  • Session metadata re-appended for --resume display

Step 4: Boundary Message

A SystemCompactBoundaryMessage marks the compaction point:

compactMetadata = {
  type: 'auto' | 'manual',
  preCompactTokenCount: number,
  compactedMessageUuid: UUID,           // Last msg before boundary
  preCompactDiscoveredTools: string[],  // Loaded deferred tools
  preservedSegment?: {                  // Session memory path only
    headUuid, anchorUuid, tailUuid
  }
}

Partial Compaction

Two directional variants for more surgical context management:

  • from direction: Summarize messages AFTER a pivot index, keep earlier ones intact. Preserves prompt cache because the kept prefix is unchanged.

  • up_to direction: Summarize messages BEFORE pivot, keep later ones. Breaks cache because the summary changes the prefix.

Prompt-Too-Long Recovery

If the compaction request itself hits prompt-too-long (the conversation is so large even the summarizer can't process it):

  1. Group messages by API round via groupMessagesByApiRound()

  2. Drop the oldest groups until the token gap is covered (or 20% of groups if gap is unparseable)

  3. Retry up to 3 times

  4. If all retries fail → ERROR_MESSAGE_PROMPT_TOO_LONG thrown
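The recovery loop above can be sketched as follows. This is an illustrative reconstruction, not the actual source: the type and function names (`ApiRoundGroup`, `dropOldestGroups`) are invented, and only the drop policy (cover the token gap, or drop 20% of groups when the gap is unparseable) comes from the article.

```typescript
// Hypothetical sketch of the drop-oldest-groups recovery described above.
type ApiRoundGroup = { tokens: number; messages: string[] };

function dropOldestGroups(
  groups: ApiRoundGroup[],       // oldest first, as grouped by API round
  tokenGap: number | null,       // null when the gap could not be parsed from the error
): ApiRoundGroup[] {
  if (tokenGap === null) {
    // Unparseable gap: drop a flat 20% of the oldest groups.
    const dropCount = Math.max(1, Math.ceil(groups.length * 0.2));
    return groups.slice(dropCount);
  }
  // Otherwise drop oldest groups until the freed tokens cover the gap.
  let freed = 0;
  let i = 0;
  while (i < groups.length && freed < tokenGap) {
    freed += groups[i].tokens;
    i++;
  }
  return groups.slice(i);
}
```

A caller would invoke this up to 3 times, re-submitting the shrunken conversation to the summarizer after each pass.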

6. 第 4 层:完整压缩(Full Compaction)

  • File: src/services/compact/compact.ts

  • Cost: 一次完整 API 调用(输入=整段对话,输出=摘要)

  • When: 上下文超过 autocompact 阈值且会话记忆压缩不可用

触发条件

effective context window = context window - 20K(预留给输出)
autocompact threshold = effective window - 13K(缓冲)

如果 tokenCountWithEstimation(messages) > autocompact threshold → 触发

熔断器(Circuit Breaker)

连续失败 3 次后,autocompact 本次会话里就不再尝试了。之所以加这一条,是因为发现有 1,279 个会话出现了 50+ 次连续失败(单会话最高 3,272 次),导致全球每天白白浪费大约 25 万次 API 调用。

压缩算法

Step 1: 预处理

  • 执行用户配置的 PreCompact hooks

  • 从消息里去掉图片(用 [image] 标记替代)

  • 去掉 skill discovery/listing 附件(稍后再注入回来)

Step 2: 生成摘要 系统会分叉一个 summarizer 代理,用很详细的提示请求一个 9 段式摘要:

---
name: testing-approach
description: User prefers integration tests over mocks after a prod incident
type: feedback
---

Integration tests must hit a real database, not mocks.

**Why:** Prior incident where mock/prod divergence masked a broken migration.

**How to apply:** When writing tests for database code, always use the test database helper.

提示里用了一种很巧的两段式输出结构:

  • 先输出 <analysis> 块,作为草稿区让模型整理思路

  • 再输出 <summary> 块,作为最终结构化摘要

  • <analysis> 块在摘要进入上下文前会被剥离,它能提升摘要质量,但不会消耗压缩后的 token 预算

Step 3: 压缩后恢复关键上下文

压缩完成后,会把关键上下文再注入回来:

  • 最近读过的前 5 个文件(每个 5K tokens,总预算 50K)

  • 已调用的 skill 内容(每个 5K tokens,总预算 25K)

  • Plan 附件(如果在 plan mode)

  • 延迟加载的工具 schema、agent 列表、MCP 指令

  • SessionStart hooks 重新执行(恢复 CLAUDE.md 等)

  • 会话元数据重新追加,供 --resume 展示

Step 4: 边界消息

用一条 SystemCompactBoundaryMessage 标记压缩点:

部分压缩(Partial Compaction)

两种方向变体,用更“手术式”的方式管理上下文:

  • from direction: 总结 pivot index 之后的消息,保留更早的消息不动。因为保留的前缀不变,所以能保住 prompt cache

  • up_to direction: 总结 pivot 之前的消息,保留更晚的消息。因为摘要会改写前缀,所以会打破缓存

Prompt-Too-Long 的恢复策略

如果压缩请求本身就触发 prompt-too-long(对话大到 summarizer 都吃不下):

  1. 用 groupMessagesByApiRound() 按 API 轮次分组

  2. 丢掉最老的分组,直到 token 缺口补齐(或者在缺口无法解析时丢掉 20% 的分组)

  3. 最多重试 3 次

  4. 全部失败就抛出 ERROR_MESSAGE_PROMPT_TOO_LONG

7. Layer 5: Auto Memory Extraction

  • File: src/services/extractMemories/extractMemories.ts

  • Cost: One forked agent API call

  • When: End of each complete query loop (model produces final response with no tool calls)

Purpose

While Session Memory captures notes about the current session, Auto Memory Extraction builds durable, cross-session knowledge that persists in ~/.claude/projects/<path>/memory/.

Memory Types

Four types of memories, each with specific save criteria:

CacheSafeParams = {
  systemPrompt: SystemPrompt,      // Must be byte-identical to parent
  userContext: { [k: string]: string },
  systemContext: { [k: string]: string },
  toolUseContext: ToolUseContext,   // Contains tools, model, options
  forkContextMessages: Message[],  // Parent's conversation (cache prefix)
}

Memory File Format

TimeBasedMCConfig = {
  enabled: false,           // Master switch
  gapThresholdMinutes: 60,  // Trigger after 1h idle
  keepRecent: 5             // Keep last 5 tool results
}

What NOT to Save

The extraction prompt explicitly excludes:

  • Code patterns, conventions, architecture (derivable from code)

  • Git history (use git log/git blame)

  • Debugging solutions (the fix is in the code)

  • Anything in CLAUDE.md files

  • Ephemeral task details

Mutual Exclusivity with Main Agent

If the main agent already wrote memory files during the current turn, extraction is skipped. This prevents the background agent from duplicating work the main agent already did:

function hasMemoryWritesSince(messages, sinceUuid): boolean {
  // Scans for Edit/Write tool_use blocks targeting auto-memory paths
  // Returns true if main agent already saved memories
}

Execution Strategy

The extraction prompt instructs the agent to be efficient with its limited turn budget:

Turn 1: Issue all FileRead calls in parallel for files you might update
Turn 2: Issue all FileWrite/FileEdit calls in parallel
Do not interleave reads and writes across multiple turns.

MEMORY.md: The Index

MEMORY.md is an index file, not a memory dump. Each entry should be one line under ~150 characters:

  • Testing Approach — Real DB tests only, no mocks after prod incident
  • User Profile — Senior Go eng, new to React, focused on observability

Hard limits: 200 lines or 25KB — whichever is hit first. Lines beyond 200 are truncated when loaded into the system prompt.
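The two caps interact, so enforcement has to check both. A minimal sketch, assuming UTF-8 byte counting and a hypothetical helper name (the 200-line / 25KB limits are from the article; everything else is illustrative):

```typescript
// Illustrative enforcement of the MEMORY.md limits: 200 lines or 25KB,
// whichever is hit first. Helper name is hypothetical.
const MAX_LINES = 200;
const MAX_BYTES = 25 * 1024;

function truncateMemoryIndex(content: string): string {
  let lines = content.split("\n").slice(0, MAX_LINES); // line cap first
  let out = lines.join("\n");
  // Then enforce the byte cap by dropping trailing lines.
  while (Buffer.byteLength(out, "utf8") > MAX_BYTES && lines.length > 0) {
    lines = lines.slice(0, -1);
    out = lines.join("\n");
  }
  return out;
}
```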

7. 第 5 层:自动记忆提取(Auto Memory Extraction)

  • File: src/services/extractMemories/extractMemories.ts

  • Cost: 一次分叉代理 API 调用

  • When: 每个完整 query loop 结束时触发(模型给出最终回复且没有工具调用)

目的

会话记忆记录的是当前会话的笔记,而自动记忆提取会把信息沉淀成跨会话可复用的长期知识,持久化在 ~/.claude/projects/<path>/memory/ 里。

记忆类型

有四类记忆,每类都有自己的保存规则:

CacheSafeParams = {
  systemPrompt: SystemPrompt,      // Must be byte-identical to parent
  userContext: { [k: string]: string },
  systemContext: { [k: string]: string },
  toolUseContext: ToolUseContext,   // Contains tools, model, options
  forkContextMessages: Message[],  // Parent's conversation (cache prefix)
}

记忆文件格式

TimeBasedMCConfig = {
  enabled: false,           // Master switch
  gapThresholdMinutes: 60,  // Trigger after 1h idle
  keepRecent: 5             // Keep last 5 tool results
}

明确不保存什么

提取提示会明确排除:

  • 代码风格、约定、架构(都能从代码推出来)

  • Git 历史(用 git log/git blame)

  • 调试结论(修复已经写进代码)

  • 任何写在 CLAUDE.md 里的内容

  • 临时性任务细节

与主代理互斥

如果主代理在当前回合已经写过记忆文件,就会跳过提取,避免后台代理重复劳动:

function hasMemoryWritesSince(messages, sinceUuid): boolean {
  // 扫描针对 auto-memory 路径的 Edit/Write tool_use 块
  // 如果主代理已经保存过记忆则返回 true
}

执行策略

提取提示会要求代理在有限回合预算内高效执行:

Turn 1: 并行发出所有可能需要更新文件的 FileRead
Turn 2: 并行发出所有 FileWrite/FileEdit
不要在多个回合之间交错读写。

MEMORY.md:索引文件

MEMORY.md 是索引文件,不是记忆的堆放处。每条条目占一行,长度控制在约 150 字符以内:

  • 测试方式 — 生产事故后只做真实 DB 集成测试,不用 mock
  • 用户画像 — 资深 Go 工程师,刚上手 React,关注可观测性

硬限制是 200 行或 25KB,先到哪个算哪个。超过 200 行的部分在加载进 system prompt 时会被截断。

8. Layer 6: Dreaming

1. Primary Request and Intent
2. Key Technical Concepts
3. Files and Code Sections (with code snippets)
4. Errors and Fixes
5. Problem Solving
6. All User Messages (verbatim — critical for intent tracking)
7. Pending Tasks
8. Current Work
9. Optional Next Step
  • File: src/services/autoDream/autoDream.ts

  • Cost: One forked agent API call (potentially multi-turn)

  • When: Background, after sufficient time and sessions have accumulated

The Concept

Dreaming is cross-session memory consolidation — a background process that reviews past session transcripts and improves the memory directory. It's analogous to how biological memory consolidation happens during sleep: experiences from the day are reviewed, organized, and integrated into long-term storage.

Gate Sequence (Cheapest Check First)

The dream system uses a cascading gate design where each check is cheaper than the next, so most turns exit early:

The Lock Mechanism

The lock file at <memoryDir>/.consolidate-lock serves double duty:

Path: <autoMemPath>/.consolidate-lock
Body: Process PID (single line)
mtime: lastConsolidatedAt timestamp (the lock IS the timestamp)

  • Acquire: Write PID → mtime = now. Verify PID on re-read (race protection).

  • Success: mtime stays at now (marks consolidation time).

  • Failure: rollbackConsolidationLock(priorMtime) rewinds mtime via utimes().

  • Stale: If mtime > 60 minutes old AND PID is not running → reclaim.

  • Crash recovery: Dead PID detected → next process reclaims.
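The stale-reclaim rule from the list above can be isolated as a pure predicate. This is a hedged sketch: the function name is invented, and the PID-liveness check is injected as a parameter (in Node it would typically be probed with `process.kill(pid, 0)`), but the 60-minute-and-dead-PID condition follows the article.

```typescript
// Sketch of the stale-lock reclaim rule: a lock is reclaimable when its
// mtime is over 60 minutes old AND the PID in its body is not running.
const STALE_MS = 60 * 60 * 1000;

function isLockStale(
  lockMtimeMs: number,
  nowMs: number,
  pid: number,
  isPidAlive: (pid: number) => boolean, // injected so the rule stays testable
): boolean {
  const tooOld = nowMs - lockMtimeMs > STALE_MS;
  return tooOld && !isPidAlive(pid);
}
```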

Four-Phase Consolidation

The dream agent receives a structured prompt defining four phases:

Phase 1 — Orient:

  • ls the memory directory

  • Read MEMORY.md to understand the current index

  • Skim existing topic files to avoid creating duplicates

Phase 2 — Gather Recent Signal:

  • Review daily logs (logs/YYYY/MM/YYYY-MM-DD.md) if present

  • Check for drifted memories (facts that contradict current codebase)

  • Grep session transcripts narrowly for specific context: grep -rn "<narrow term>" transcripts/ --include="*.jsonl" | tail -50

  • "Don't exhaustively read transcripts. Look only for things you already suspect matter."

Phase 3 — Consolidate:

  • Write or update memory files

  • Merge new signal into existing topic files rather than creating near-duplicates

  • Convert relative dates to absolute ("yesterday" → "2026-03-30")

  • Delete contradicted facts at the source

Phase 4 — Prune and Index:

  • Update MEMORY.md to stay under 200 lines / 25KB

  • Remove pointers to stale/wrong/superseded memories

  • Shorten verbose index entries (detail belongs in topic files)

  • Resolve contradictions between files

Tool Constraints

The dream agent operates under strict restrictions:

  • Bash: Read-only commands only (ls, find, grep, cat, stat, wc, head, tail)

  • Edit/Write: Only to memory directory paths

  • No MCP tools, no Agent tool, no destructive operations

UI Surfacing

Dreams appear as background tasks in the footer pill, with a two-phase state machine:

DreamPhase: 'starting' → 'updating' (when first Edit/Write lands)
Status: running → completed | failed | killed

Users can kill a dream from the background tasks dialog — the lock mtime is rolled back so the next session can retry.

Progress Tracking

Each assistant turn from the dream agent is watched:

  • Text blocks captured for UI display

  • Tool_use blocks counted

  • Edit/Write file paths collected for the completion summary

  • Capped at 30 most recent turns for live display

8. 第 6 层:梦境(Dreaming)

1. Primary Request and Intent
2. Key Technical Concepts
3. Files and Code Sections (with code snippets)
4. Errors and Fixes
5. Problem Solving
6. All User Messages (verbatim — critical for intent tracking)
7. Pending Tasks
8. Current Work
9. Optional Next Step
  • File: src/services/autoDream/autoDream.ts

  • Cost: 一次分叉代理 API 调用(可能多回合)

  • When: 后台运行,在累积了足够时间与会话之后触发

概念

梦境是一种跨会话记忆整合。后台进程会回顾过去的会话 transcript,持续改进 memory 目录。它很像生物的记忆巩固机制,睡眠时会回放、归类、整合当天经验,写入长期存储。

门禁序列(先便宜后昂贵)

梦境系统用的是级联门禁,每一关都比下一关更便宜,所以大多数回合会很早就退出。

锁机制

<memoryDir>/.consolidate-lock 这个锁文件一物两用:

Path: <autoMemPath>/.consolidate-lock
Body: Process PID(单行)
mtime: lastConsolidatedAt 时间戳(锁本身就是时间戳)

  • Acquire: 写入 PID → mtime = now。随后重读验证 PID(防竞态)。

  • Success: mtime 保持为 now(标记整合完成时间)。

  • Failure: rollbackConsolidationLock(priorMtime) 用 utimes() 回滚 mtime。

  • Stale: 如果 mtime 超过 60 分钟且 PID 不在运行 → 回收。

  • Crash recovery: 检测到死 PID → 下一个进程回收。

四阶段整合

梦境代理会收到一个结构化提示,定义了四个阶段:

Phase 1 — 定位:

  • ls memory 目录

  • 读 MEMORY.md 了解当前索引

  • 快速翻已有主题文件,避免造重复

Phase 2 — 收集最近信号:

  • 如果有的话,查看日记 logs/YYYY/MM/YYYY-MM-DD.md

  • 检查是否出现漂移的记忆(与当前代码库矛盾的事实)

  • 针对具体线索,窄范围 grep transcript:grep -rn "<narrow term>" transcripts/ --include="*.jsonl" | tail -50

  • “不要把 transcript 全读一遍。只看那些你已经怀疑确实重要的点。”

Phase 3 — 整合:

  • 写入或更新记忆文件

  • 把新信号合并进已有主题文件,而不是新建近重复文件

  • 把相对日期改成绝对日期("yesterday" → "2026-03-30")

  • 在源头删除已被推翻的事实

Phase 4 — 修剪与索引:

  • 更新 MEMORY.md,保证不超过 200 行 / 25KB

  • 去掉指向过期/错误/被替代记忆的索引

  • 把冗长的索引条目缩短(细节应该放进主题文件)

  • 解决文件之间的矛盾

工具约束

梦境代理在严格限制下运行:

  • Bash: 只允许只读命令(ls、find、grep、cat、stat、wc、head、tail)

  • Edit/Write: 只能写 memory 目录路径

  • 不允许 MCP 工具、不允许 Agent 工具、不允许破坏性操作

UI 展示

梦境会在 footer pill 里作为后台任务显示,有一个两段式状态机:

DreamPhase: 'starting' → 'updating'(第一次 Edit/Write 落地时)
Status: running → completed | failed | killed

用户可以在后台任务对话框里杀掉梦境任务,锁的 mtime 会回滚,下次会话还能重试。

进度跟踪

系统会监控梦境代理的每一轮 assistant 输出:

  • 抓取文本块用于 UI 展示

  • 统计 tool_use 块数量

  • 收集 Edit/Write 的文件路径用于完成摘要

  • 实时展示最多保留最近 30 轮

9. Layer 7: Cross-Agent Communication

  • Files: src/utils/forkedAgent.ts, src/tools/AgentTool/, src/tools/SendMessageTool/

  • Cost: Varies by pattern

  • When: Agent spawning, background tasks, teammate coordination

The Forked Agent Pattern

Nearly every background operation in Claude Code (session memory, auto memory, dreaming, compaction, agent summaries) uses the forked agent pattern. This is the foundation:

ContextEditStrategy =
  | { type: 'clear_tool_uses_20250919',   // Clear old tool results
      trigger: { type: 'input_tokens', value: 180_000 },
      clear_at_least: { type: 'input_tokens', value: 140_000 } }
  | { type: 'clear_thinking_20251015',     // Clear old thinking blocks
      keep: { type: 'thinking_turns', value: 1 } | 'all' }

The fork creates an isolated context with cloned mutable state:

  • readFileState: Cloned LRU cache (prevents cross-contamination)

  • abortController: Child controller linked to parent

  • denialTracking: Fresh tracking state

  • ContentReplacementState: Cloned (preserves cache-stable decisions)

But it shares the prompt cache by keeping identical cache-critical parameters. The API sees the same prefix and serves a cache hit.
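The clone-versus-share split can be made concrete with a small sketch. The type shape and field names here are simplified stand-ins for the real `CacheSafeParams` / fork state (only `readFileState`, `forkContextMessages`, and the cache-critical-versus-mutable distinction come from the article):

```typescript
// Minimal sketch of the fork pattern: mutable state is cloned, while
// cache-critical fields are passed through unchanged so the API request
// prefix stays byte-identical.
type AgentContext = {
  systemPrompt: string;               // cache-critical: shared, never re-rendered
  forkContextMessages: string[];      // cache-critical: shared prefix
  readFileState: Map<string, string>; // mutable: must be cloned
};

function forkAgent(parent: AgentContext): AgentContext {
  return {
    systemPrompt: parent.systemPrompt,               // same bytes → cache hit
    forkContextMessages: parent.forkContextMessages, // same prefix → cache hit
    readFileState: new Map(parent.readFileState),    // clone → no contamination
  };
}
```

The point of the sketch: a child can freely mutate its `readFileState` without touching the parent, yet both sides present an identical prompt prefix to the API.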

Agent Tool: Spawning Sub-Agents

The Agent tool supports multiple spawning patterns:

Fork anti-recursion: Fork children keep the Agent tool in their tool pool (for cache-identical definitions) but detect the <fork_boilerplate_tag> in conversation history to reject recursive fork attempts.

Fork message construction for cache sharing:

All fork children produce byte-identical API request prefixes:
1. Full parent assistant message (all tool_use blocks, thinking, text)
2. Single user message with:
   - Identical placeholder result for every tool_use
   - Per-child directive text block (only this differs)
→ Maximum prompt cache sharing across concurrent forks

SendMessage: Inter-Agent Communication

The SendMessage tool enables runtime communication between agents:

SendMessage({
  to: 'research-agent',  // or '' for broadcast, 'uds:<path>', 'bridge:<id>'
  message: 'Check Section 5',
  summary: 'Requesting section review'
})

Routing logic:

  1. In-process subagent by name → queuePendingMessage() → drained at next tool round boundary

  2. Ambient team (process-based) → writeToMailbox() → file-based mailbox

  3. Cross-session → postInterClaudeMessage() via bridge/UDS
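The three tiers above form a simple dispatch. A hedged sketch follows; the handler names in the comments mirror the article, but the dispatcher itself and its signature are invented for illustration:

```typescript
// Hypothetical routing dispatcher for SendMessage's three delivery tiers.
type Route = "in_process" | "mailbox" | "cross_session";

function routeMessage(to: string, inProcessAgents: Set<string>): Route {
  if (to.startsWith("uds:") || to.startsWith("bridge:")) {
    return "cross_session"; // postInterClaudeMessage() via bridge/UDS
  }
  if (inProcessAgents.has(to)) {
    return "in_process";    // queuePendingMessage(), drained at tool round boundary
  }
  return "mailbox";         // writeToMailbox() for process-based ambient teams
}
```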

Structured messages for lifecycle control:

  • shutdown_request / shutdown_response — Graceful agent shutdown coordination

  • plan_approval_response — Leader approves/rejects teammate plans

9. 第 7 层:跨代理通信

  • Files: src/utils/forkedAgent.ts, src/tools/AgentTool/, src/tools/SendMessageTool/

  • Cost: 随具体模式变化

  • When: 代理生成、后台任务、队友协作

分叉代理模式(Forked Agent Pattern)

Claude Code 里几乎所有后台操作(会话记忆、自动记忆、梦境、压缩、agent summary 等)都建立在分叉代理模式之上,这是底座:

ContextEditStrategy =
  | { type: 'clear_tool_uses_20250919',   // Clear old tool results
      trigger: { type: 'input_tokens', value: 180_000 },
      clear_at_least: { type: 'input_tokens', value: 140_000 } }
  | { type: 'clear_thinking_20251015',     // Clear old thinking blocks
      keep: { type: 'thinking_turns', value: 1 } | 'all' }

分叉会创建一个隔离的上下文,并复制可变状态:

  • readFileState: 复制 LRU 缓存(避免互相污染)

  • abortController: 子控制器与父控制器联动

  • denialTracking: 全新追踪状态

  • ContentReplacementState: 复制(保住缓存稳定的替换决策)

但它又通过保持 cache-critical 参数一致来共享 prompt cache。API 看到同一段前缀,就会命中缓存。

Agent 工具:生成子代理

Agent 工具支持多种生成模式:

防递归分叉:分叉出来的子代理仍然带着 Agent 工具(保证工具定义一致以便共享缓存),但会检测对话历史里的 <fork_boilerplate_tag>,从而拒绝递归分叉。

为了共享缓存而构造的分叉消息:

所有分叉子代理都会生成字节级一致的 API 请求前缀:
1. 完整复制父代理上一条 assistant 消息(包含全部 tool_use、thinking、text)
2. 再加一条用户消息,里面包含:
   - 为每个 tool_use 准备相同的占位结果
   - 每个子代理各自的指令文本块(只有这一段不同)
→ 尽可能让并发分叉之间共享同一段 prompt cache

SendMessage:代理间通信

SendMessage 工具让代理之间能在运行时互相发消息:

SendMessage({
  to: 'research-agent',  // 或用 '' 广播,或 'uds:<path>'、'bridge:<id>'
  message: '检查第 5 节',
  summary: '请求复核该章节'
})

路由逻辑:

  1. 进程内按名字找子代理 → queuePendingMessage() → 在下一轮工具边界时被 drain

  2. 环境团队(基于进程) → writeToMailbox() → 文件邮箱

  3. 跨会话 → 通过 bridge/UDS 调 postInterClaudeMessage()

结构化消息用于生命周期控制:

  • shutdown_request / shutdown_response — 协调代理优雅退出

  • plan_approval_response — leader 同意或拒绝队友的 plan

Agent Memory: Persistent Cross-Invocation State

Agents can maintain persistent memory across invocations in three scopes:

Agent Summary: Periodic Progress Snapshots

For coordinator-mode sub-agents, a timer forks the conversation every 30 seconds to generate a 3-5 word progress summary:

Good: "Reading runAgent.ts"
Good: "Fixing null check in validate.ts"
Bad: "Investigating the issue" (too vague)
Bad: "Analyzed the branch diff" (past tense)

Uses Haiku (cheapest model) and denies all tools — it's a pure text generation task.

代理记忆:跨调用的持久状态

代理可以在三种范围里维持跨调用的持久记忆:

Agent Summary:周期性的进度快照

对 coordinator 模式的子代理,会每 30 秒把对话分叉一次,生成一个 3–5 个词的进度摘要:

Good: "Reading runAgent.ts"
Good: "Fixing null check in validate.ts"
Bad: "Investigating the issue"(太含糊)
Bad: "Analyzed the branch diff"(过去时)

使用 Haiku(最便宜的模型),并禁用所有工具,因为这本质上是一个纯文本生成任务。

10. The Query Loop: How It All Fits Together

File: src/query.ts

10. Query Loop:如何拼在一起

File: src/query.ts

11. Prompt Cache Optimization

One of the most sophisticated aspects of Claude Code's architecture is its obsessive prompt cache optimization. Nearly every design decision considers cache impact.

11. Prompt 缓存优化

Claude Code 架构里最精妙的部分之一,是它对 prompt cache 的偏执优化。几乎每个设计决定都会先看对缓存的影响。

The Problem

Anthropic's API caches prompt prefixes server-side (~1 hour TTL). A cache hit means you only pay for new tokens. A cache miss means re-tokenizing the entire prompt. At 200K tokens, that's the difference between ~$0.003 and ~$0.60 per request.
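The arithmetic behind those two figures is worth making explicit. The per-token prices below are assumptions chosen to reproduce the article's ~$0.60 vs ~$0.003 numbers, not official Anthropic pricing:

```typescript
// Back-of-envelope cache economics at a 200K-token prompt.
// Assumed: $3 per 1M input tokens on a miss, and an effective hit rate of
// ~0.5% of that, picked only to match the article's figures.
const TOKENS = 200_000;
const MISS_PER_TOKEN = 3 / 1_000_000;          // full input price (assumed)
const HIT_PER_TOKEN = MISS_PER_TOKEN * 0.005;  // assumed effective hit price

const missCost = TOKENS * MISS_PER_TOKEN; // ≈ $0.60 per request
const hitCost = TOKENS * HIT_PER_TOKEN;   // ≈ $0.003 per request
```

At a 200x cost ratio per request, even a small drop in cache hit rate dominates every other optimization, which explains the "obsessive" label.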

Cache-Preserving Patterns

1. CacheSafeParams: Every forked agent (session memory, compaction, dreaming, extraction) inherits the parent's exact system prompt, tools, and message prefix. The fork's API request has an identical prefix → cache hit.

2. renderedSystemPrompt: The fork threads the parent's already-rendered system prompt bytes, avoiding re-rendering divergence (e.g., GrowthBook flag value changes between renders).

3. ContentReplacementState cloning: Tool result persistence decisions are frozen. The same results get the same previews on every API call → stable prefix.

4. Cached microcompact: Uses cache_edits to modify the server cache without changing the local prefix → no cache break.

5. Fork message construction: All fork children get byte-identical prefixes. Only the final directive differs → maximum cache sharing across concurrent forks.

6. Post-compact cache break notification: After compaction, notifyCompaction() resets the cache baseline so the expected post-compact cache miss isn't flagged as an anomaly.

问题是什么

Anthropic 的 API 会在服务端缓存 prompt 前缀(TTL 大约 1 小时)。缓存命中意味着只为新增 token 付费;缓存未命中则要把整个 prompt 重新 tokenize。对 200K token 来说,这意味着每次请求的成本大约是 ~$0.003 与 ~$0.60 的差别。

保持缓存不破的模式

1. CacheSafeParams:每个分叉代理(会话记忆、压缩、梦境、提取)都会继承父代理完全相同的 system prompt、工具和消息前缀。分叉后的 API 请求前缀一致 → 缓存命中。

2. renderedSystemPrompt:分叉线程直接复用父代理已经渲染好的 system prompt 字节,避免重复渲染带来的差异(比如 GrowthBook flag 在两次渲染之间变了)。

3. ContentReplacementState 复制:工具结果的持久化决策会被冻结。同样的结果每次都会得到同样的预览 → 前缀稳定。

4. 缓存微压缩:用 cache_edits 改服务端缓存而不改本地前缀 → 不破缓存。

5. 分叉消息构造:所有分叉子代理拿到字节级一致的前缀,只有最后一段指令不同 → 并发分叉最大化共享缓存。

6. 压缩后的缓存断裂通知:压缩后调用 notifyCompaction() 重置缓存基线,这样预期中的压缩后缓存未命中不会被误判为异常。

Cache Break Detection

The system actively monitors for unexpected cache misses via promptCacheBreakDetection.ts, flagging them for investigation. Known-good cache breaks (compaction, microcompact, etc.) are pre-registered to avoid false positives.

缓存断裂检测

系统会通过 promptCacheBreakDetection.ts 主动监控非预期的缓存未命中,并标记出来排查。已知合理的缓存断裂(压缩、微压缩等)会提前登记,避免误报。

12. Key Numbers

Context Thresholds

Tool Result Budgets

Session Memory

Compaction

Dreaming

Microcompact

12. 关键数字

上下文阈值

工具结果预算

会话记忆

压缩

梦境

微压缩

13. Design Principles

13. 设计原则

1. Layered Defense, Cheapest First

Every context management layer is designed to prevent the next, more expensive layer from firing:

  • Tool result storage prevents microcompact from needing to clear as much

  • Microcompact prevents session memory compaction

  • Session memory compaction prevents full compaction

  • Full compaction prevents context overflow errors
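The cascade above amounts to a cost-ordered dispatch: try each layer in order of expense and stop at the first one that relieves enough pressure. A minimal sketch (layer names and the relief interface are illustrative, not the real code):

```typescript
// "Cheapest first" layered defense: each layer gets a chance to relieve
// context pressure before the next, more expensive layer fires.
type Layer = { name: string; tryRelieve: () => boolean };

function relieveContext(layers: Layer[]): string | null {
  for (const layer of layers) {
    if (layer.tryRelieve()) return layer.name; // a cheap layer stopped the cascade
  }
  return null; // nothing worked: a context overflow error surfaces
}
```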

1. 分层防御,先便宜后昂贵

每一层上下文管理,都是为了不让下一层更贵的机制触发:

  • 工具结果存储让微压缩需要清掉的东西更少

  • 微压缩避免触发会话记忆压缩

  • 会话记忆压缩避免触发完整压缩

  • 完整压缩避免上下文溢出错误

2. Prompt Cache Preservation

Almost every design decision considers prompt cache impact. The system goes to extraordinary lengths to keep API request prefixes byte-identical: frozen ContentReplacementState, rendered system prompt threading, cache_edits API, identical fork message construction.

2. Prompt 缓存优先

几乎所有设计决定都把 prompt cache 影响放在前面。系统会用尽办法保持 API 请求前缀字节级一致,包括冻结的 ContentReplacementState、复用渲染后的 system prompt、cache_edits API、统一的分叉消息构造。

3. Isolation with Sharing

Forked agents get cloned mutable state (preventing cross-contamination) but share the prompt cache prefix (preventing cost explosion). This is a careful balance — too much isolation wastes cache, too much sharing causes bugs.

3. 隔离但共享

分叉代理复制可变状态来隔离(避免互相污染),同时共享 prompt cache 前缀(避免成本爆炸)。这是个微妙平衡,隔离过度会浪费缓存,共享过度会引入 bug。

4. Circuit Breakers Everywhere

  • Autocompact: 3-strike limit

  • Dream scan: 10-minute throttle

  • Dream lock: PID-based mutex with stale detection

  • Session memory: Sequential execution wrapper

  • Extract memories: Mutual exclusivity with main agent writes

4. 到处都是熔断器

  • Autocompact:三次失败止损

  • Dream scan:10 分钟节流

  • Dream lock:基于 PID 的互斥锁 + 过期回收

  • 会话记忆:顺序执行包装

  • 记忆提取:与主代理写入互斥

5. Graceful Degradation

Each system fails silently and lets the next layer catch. Session memory compaction returns null on failure → full compaction runs. Dream lock acquisition fails → next session retries. Extract memories errors → logged, not thrown.
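The null-means-fall-through convention is the whole mechanism. A hedged sketch, with invented function names standing in for `trySessionMemoryCompaction()` and the full compaction path:

```typescript
// Graceful degradation as code: the cheap path returns null on failure
// instead of throwing, and the expensive path catches silently.
type CompactionResult = { summary: string };

function compact(
  trySessionMemory: () => CompactionResult | null, // cheap: pre-built summary
  runFullCompaction: () => CompactionResult,        // expensive: summarizer call
): CompactionResult {
  return trySessionMemory() ?? runFullCompaction(); // silent fallback, no throw
}
```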

5. 优雅降级

每套系统失败时都会静默让位给下一层。会话记忆压缩失败就返回 null → 交给完整压缩。梦境拿锁失败 → 下次会话再试。提取记忆出错 → 只记录日志,不抛出。

6. Feature Flags as Kill Switches

Nearly every system is gated by GrowthBook feature flags:

  • tengu_session_memory — session memory

  • tengu_sm_compact — session memory compaction

  • tengu_onyx_plover — dreaming

  • tengu_passport_quail — auto memory extraction

  • tengu_slate_heron — time-based microcompact

  • CACHED_MICROCOMPACT — cache-editing microcompact

  • CONTEXT_COLLAPSE — context collapse

  • HISTORY_SNIP — message snipping

This allows rapid rollback without code deploys if any system causes problems.

6. 特性开关作为紧急刹车

几乎每个系统都被 GrowthBook 特性开关门控:

  • tengu_session_memory — 会话记忆

  • tengu_sm_compact — 会话记忆压缩

  • tengu_onyx_plover — 梦境

  • tengu_passport_quail — 自动记忆提取

  • tengu_slate_heron — 基于时间的微压缩

  • CACHED_MICROCOMPACT — 缓存编辑微压缩

  • CONTEXT_COLLAPSE — 上下文折叠

  • HISTORY_SNIP — 消息裁剪

只要某个系统出问题,就能不发版快速回滚。

7. Mutual Exclusivity Where Needed

  • Context Collapse ↔ Autocompact (collapse manages context its own way)

  • Main agent memory writes ↔ Background extraction (prevents duplication)

  • Session memory compaction ↔ Full compaction (SM tried first, full is fallback)

  • Autocompact ↔ Subagent query sources (prevents deadlocks)

7. 需要时就互斥

  • Context Collapse ↔ Autocompact(collapse 自己管上下文)

  • 主代理写记忆 ↔ 后台记忆提取(避免重复劳动)

  • 会话记忆压缩 ↔ 完整压缩(先试会话记忆,完整压缩兜底)

  • Autocompact ↔ 子代理查询来源(避免死锁)

How Claude Code Manages Memory: A Deep Technical Analysis

A comprehensive reverse-engineering of every memory and context management system inside Claude Code's leaked harness — from lightweight token pruning to a "dreaming" system that consolidates memories while you sleep.

1. The Problem: Bounded Context in an Unbounded World

LLMs have a fundamental constraint: a fixed context window. Claude Code typically operates with a 200K token window (expandable to 1M with the [1m] suffix). A single coding session can easily blow past this — a few file reads, some grep results, a handful of edit cycles, and you're at the limit.

Claude Code solves this with a 7-layer memory architecture that spans from sub-millisecond token pruning to multi-hour background "dreaming" consolidation. Each layer is progressively more expensive but more powerful, and the system is designed so cheaper layers prevent the need for more expensive ones.

Token Counting: The Foundation

Everything starts with knowing how many tokens you've used. The canonical function is tokenCountWithEstimation() in src/utils/tokens.ts:

Canonical token count = last API response's usage.input_tokens + rough estimates for messages added since

The rough estimation uses a simple heuristic: 4 bytes per token for most text, 2 bytes per token for JSON (which tokenizes more densely). Images and documents get a flat 2,000 token estimate regardless of size.
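The heuristic is simple enough to reconstruct directly. This is a sketch under stated assumptions: the function and type names are invented (the real one lives inside `tokenCountWithEstimation()`), and rounding direction is assumed, but the 4-bytes, 2-bytes, and flat-2,000 rules are from the article:

```typescript
// Hedged reconstruction of the per-block estimation heuristic.
type Block =
  | { type: "text"; text: string }
  | { type: "json"; text: string }
  | { type: "image" }
  | { type: "document" };

function estimateTokens(block: Block): number {
  switch (block.type) {
    case "text":
      return Math.ceil(Buffer.byteLength(block.text, "utf8") / 4); // 4 bytes/token
    case "json":
      return Math.ceil(Buffer.byteLength(block.text, "utf8") / 2); // denser: 2 bytes/token
    case "image":
    case "document":
      return 2_000; // flat estimate regardless of size
  }
}
```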

Context Window Resolution

The system resolves the available context window through a priority chain:

[1m] model suffix → model capability lookup → 1M beta header → env override → 200K default

The effective context window subtracts a 20K token reserve for compaction output — you can't use the full window because you need room to generate the summary that saves you.
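Both rules above fit in a few lines. The option shape here is invented for illustration; the precedence order and the 20K reserve follow the article:

```typescript
// Illustrative resolution of the context window priority chain.
type WindowOpts = {
  modelHas1mSuffix?: boolean; // "[1m]" model suffix
  modelCapability?: number;   // per-model capability lookup
  oneMBetaHeader?: boolean;   // 1M beta header
  envOverride?: number;       // environment variable override
};

function resolveContextWindow(o: WindowOpts): number {
  if (o.modelHas1mSuffix) return 1_000_000;
  if (o.modelCapability) return o.modelCapability;
  if (o.oneMBetaHeader) return 1_000_000;
  if (o.envOverride) return o.envOverride;
  return 200_000; // default
}

// Effective window reserves 20K tokens for compaction output.
const effectiveWindow = (w: number) => w - 20_000;
```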

2. Architecture Overview: 7 Layers of Memory

Each layer is triggered by different conditions and has different costs. The system is designed so Layer N prevents Layer N+1 from firing whenever possible.

# Session Title
_A short and distinctive 5-10 word descriptive title_

# Current State
_What is actively being worked on right now?_

# Task specification
_What did the user ask to build?_

# Files and Functions
_Important files and their relevance_

# Workflow
_Bash commands usually run and their interpretation_

# Errors & Corrections
_Errors encountered and how they were fixed_

# Codebase and System Documentation
_Important system components and how they fit together_

# Learnings
_What has worked well? What has not?_

# Key results
_If the user asked for specific output, repeat it here_

# Worklog
_Step by step, what was attempted and done_

3. Layer 1: Tool Result Storage

  • File: src/utils/toolResultStorage.ts

  • Cost: Disk I/O only — no API calls

  • When: Every tool result, immediately

The Problem

A single grep across a codebase can return 100KB+ of text. A cat of a large file might be 50KB. These results consume massive context and become stale within minutes as the conversation moves on.

The Solution

Every tool result passes through a budget system before entering context:

When a result exceeds its threshold:

  1. The full result is written to disk at tool-results/<sessionId>/<toolUseId>.txt

  2. A preview (first ~2KB) is placed in context, wrapped in <persisted-output> tags

  3. The model can later use Read to access the full result if needed
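The budget decision in steps 1-2 can be sketched as a pure function. The `<persisted-output>` tag and the ~2KB preview come from the article; the function names and exact tag attributes are assumptions, and the disk write itself is elided:

```typescript
// Sketch of the persist-and-preview decision: results under the threshold
// enter context verbatim; larger ones are swapped for a tagged preview while
// the full text is written to tool-results/<sessionId>/<toolUseId>.txt.
const PREVIEW_BYTES = 2 * 1024; // "first ~2KB"

function previewFor(fullResult: string, diskPath: string): string {
  const head = fullResult.slice(0, PREVIEW_BYTES);
  return `<persisted-output path="${diskPath}">\n${head}\n</persisted-output>`;
}

function resultForContext(
  fullResult: string,
  thresholdBytes: number,
  diskPath: string, // where the full result was persisted
): string {
  if (Buffer.byteLength(fullResult, "utf8") <= thresholdBytes) {
    return fullResult; // within budget: no replacement
  }
  return previewFor(fullResult, diskPath); // full text lives on disk
}
```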

ContentReplacementState: Cache-Stable Decisions

A critical subtlety: once a tool result is replaced with a preview, that decision is frozen in ContentReplacementState. On subsequent API calls, the same result gets the same preview — this ensures the prompt prefix remains byte-identical for prompt cache hits. This state even survives session resume by being persisted to the transcript.

GrowthBook Override

Per-tool thresholds can be remotely tuned via the tengu_satin_quoll feature flag — allowing Anthropic to adjust persistence thresholds for specific tools without a code deploy.

4. Layer 2: Microcompaction

  • File: src/services/compact/microCompact.ts

  • Cost: Zero to minimal API cost

  • When: Every turn, before the API call

Microcompaction is the lightest-weight context relief. It doesn't summarize anything — it just clears old tool results that are unlikely to be needed.

Three Distinct Mechanisms

a) Time-Based Microcompact

Trigger: Idle gap since last assistant message exceeds threshold (default: 60 minutes)

Rationale: Anthropic's server-side prompt cache has a ~1 hour TTL. If you haven't sent a request in an hour, the cache has expired and the entire prompt prefix will be re-tokenized from scratch. Since it's being rewritten anyway, clear old tool results first to shrink what gets rewritten.

Action: Replaces tool result content with [Old tool result content cleared], keeping at least the most recent N results (floor of 1).

Configuration (via GrowthBook tengu_slate_heron):

b) Cached Microcompact (Cache-Editing API)

This is the most technically interesting mechanism. Instead of modifying local messages (which would invalidate the prompt cache), it uses the API's cache_edits mechanism to delete tool results from the server-side cache without invalidating the prefix.

How it works:

  1. Tool results are registered in a global CachedMCState as they appear

  2. When the count exceeds a threshold, the oldest results (minus a "keep recent" buffer) are selected for deletion

  3. A cache_edits block is generated and sent alongside the next API request

  4. The server deletes the specified tool results from its cached prefix

  5. Local messages remain unchanged — the deletion is API-layer only

Critical safety: Only runs on the main thread. If forked subagents (session_memory, agent_summary, etc.) modified the global state, they'd corrupt the main thread's cache editing.
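Step 2's selection rule is the interesting part: clear the oldest results but always protect a recent buffer. A hedged sketch with invented names (the real state lives in `CachedMCState`):

```typescript
// Sketch of cached-microcompact selection: once the registered count exceeds
// the threshold, mark the oldest results for server-side deletion, keeping
// the most recent `keepRecent` untouched.
function selectForCacheEdit(
  registeredIds: string[], // oldest first, in order of appearance
  countThreshold: number,
  keepRecent: number,
): string[] {
  if (registeredIds.length <= countThreshold) return []; // under threshold: no edit
  return registeredIds.slice(0, registeredIds.length - keepRecent);
}
```

The returned IDs would be packaged into a cache_edits block on the next request; the local messages never change, so the prefix stays cache-valid.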

c) API-Level Context Management (apiMicrocompact.ts)

A newer server-side approach using the context_management API parameter:

compactMetadata = {
  type: 'auto' | 'manual',
  preCompactTokenCount: number,
  compactedMessageUuid: UUID,           // Last msg before boundary
  preCompactDiscoveredTools: string[],  // Loaded deferred tools
  preservedSegment?: {                  // Session memory path only
    headUuid, anchorUuid, tailUuid
  }
}

This tells the API server to handle context management natively — the client doesn't need to track or manage tool result clearing.

Which Tools Are Compactable?

Only results from these tools get cleared:

FileRead, Bash/Shell, Grep, Glob, WebSearch, WebFetch, FileEdit, FileWrite

Notably absent: thinking blocks, assistant text, user messages, MCP tool results.

5. Layer 3: Session Memory

SessionMemoryCompactConfig = {
  minTokens: 10_000,          // Minimum tokens to preserve
  minTextBlockMessages: 5,     // Minimum messages with text blocks
  maxTokens: 40_000            // Hard cap on preserved tokens
}
  • Files: src/services/SessionMemory/

  • Cost: One forked agent API call per extraction

  • When: Periodically during conversation (post-sampling hook)

The Idea

Instead of waiting until context is full and then desperately trying to summarize everything, continuously maintain notes about the conversation. Then when compaction IS needed, you already have a summary ready — no expensive summarization call required.

Session Memory Template

Each session gets a markdown file at ~/.claude/projects/<slug>/.claude/session-memory/<sessionId>.md with a structured template:

Trigger Logic

Session memory extraction fires when both conditions are met:

Token growth since last extraction ≥ minimumTokensBetweenUpdate
AND (tool calls since last extraction ≥ toolCallsBetweenUpdates
     OR no tool calls in the last assistant turn)

The token threshold is always required — even if the tool call threshold is met. The "no tool calls in last turn" clause captures natural conversation breaks where the model has finished a work sequence.
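The predicate can be written out directly. The config field names below follow the article; the function name and argument shape are invented for the sketch:

```typescript
// The session memory trigger: the token gate is always required, the
// activity gate is satisfied by enough tool calls OR a natural break.
function shouldExtractSessionMemory(
  tokensSinceLast: number,
  toolCallsSinceLast: number,
  lastTurnHadToolCalls: boolean,
  cfg: { minimumTokensBetweenUpdate: number; toolCallsBetweenUpdates: number },
): boolean {
  const tokenGate = tokensSinceLast >= cfg.minimumTokensBetweenUpdate;
  const activityGate =
    toolCallsSinceLast >= cfg.toolCallsBetweenUpdates || !lastTurnHadToolCalls;
  return tokenGate && activityGate;
}
```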

Extraction Execution

The extraction runs as a forked subagent via runForkedAgent():

  • querySource: 'session_memory'

  • Only allowed to use FileEdit on the memory file (all other tools denied)

  • Shares the parent's prompt cache for cost efficiency

  • Runs sequentially (via sequential() wrapper) to prevent overlapping extractions

Session Memory Compaction: The Payoff

When autocompact triggers, it first tries trySessionMemoryCompaction():

  1. Check if session memory has actual content (not just the empty template)

  2. Use the session memory markdown as the compaction summaryno API call needed

  3. Calculate which recent messages to keep (expanding backward from lastSummarizedMessageId to meet minimums)

  4. Return a CompactionResult with the session memory as summary + preserved recent messages

Configuration:

The key insight: Session memory compaction is dramatically cheaper than full compaction because the summary already exists. No summarizer API call, no prompt construction, no output token cost. The session memory file is simply injected as the summary.

6. Layer 4: Full Compaction

  • File: src/services/compact/compact.ts

  • Cost: One full API call (input = entire conversation, output = summary)

  • When: Context exceeds autocompact threshold AND session memory compaction unavailable

Trigger

effective context window = context window - 20K (reserved for output)
autocompact threshold = effective window - 13K (buffer)

If tokenCountWithEstimation(messages) > autocompact threshold → trigger
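
The threshold arithmetic, as a sketch. The 20K output reserve and 13K buffer are the article's numbers; the function name is assumed:

```typescript
function autocompactThreshold(contextWindow: number): number {
  const OUTPUT_RESERVE = 20_000; // reserved for model output
  const BUFFER = 13_000;         // safety margin below the effective window
  const effectiveWindow = contextWindow - OUTPUT_RESERVE;
  return effectiveWindow - BUFFER;
}

// For the standard 200K window: 200_000 - 20_000 - 13_000 = 167_000 tokens.
const threshold = autocompactThreshold(200_000);
```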

Circuit Breaker

After 3 consecutive failures, autocompact stops trying for the rest of the session. This was added after discovering that 1,279 sessions had 50+ consecutive failures (up to 3,272 in a single session), wasting approximately 250K API calls per day globally.
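
The 3-strike rule can be modeled as a tiny circuit breaker. The class shape is a sketch; whether a later success resets the counter mid-session is an assumption, since the text only says the breaker opens after three consecutive failures:

```typescript
class AutocompactBreaker {
  private consecutiveFailures = 0;
  private readonly maxFailures = 3;

  canAttempt(): boolean {
    return this.consecutiveFailures < this.maxFailures;
  }
  recordFailure(): void {
    this.consecutiveFailures++;
  }
  recordSuccess(): void {
    // Assumed behavior: a success resets the streak ("consecutive" failures).
    this.consecutiveFailures = 0;
  }
}
```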

The Compaction Algorithm

Step 1: Pre-processing

  • Execute user-configured PreCompact hooks

  • Strip images from messages (replaced with [image] markers)

  • Strip skill discovery/listing attachments (will be re-injected)

Step 2: Generate Summary

The system forks a summarizer agent with a detailed prompt requesting a 9-section summary:

1. Primary Request and Intent
2. Key Technical Concepts
3. Files and Code Sections (with code snippets)
4. Errors and Fixes
5. Problem Solving
6. All User Messages (verbatim — critical for intent tracking)
7. Pending Tasks
8. Current Work
9. Optional Next Step

The prompt uses a clever two-phase output structure:

  • First: <analysis> block — a drafting scratchpad where the model organizes its thoughts

  • Then: <summary> block — the actual structured summary

  • The <analysis> block is stripped before the summary enters context — it improves summary quality without consuming post-compact tokens
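
Stripping the scratchpad before injection can be sketched like this. The tag names are from the article; the function name and the fallback behavior when the model omits tags are assumptions:

```typescript
function extractSummary(modelOutput: string): string {
  // Preferred: take only the <summary> body.
  const m = modelOutput.match(/<summary>([\s\S]*?)<\/summary>/);
  if (m) return m[1].trim();
  // Fallback (assumed): drop any <analysis> scratchpad and keep the rest.
  return modelOutput.replace(/<analysis>[\s\S]*?<\/analysis>/g, "").trim();
}
```

The scratchpad buys summary quality at generation time while costing zero post-compact context, since only the `<summary>` body survives.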

Step 3: Post-compact Restoration

After compaction, critical context is re-injected:

  • Top 5 recently-read files (5K tokens each, 50K total budget)

  • Invoked skill content (5K tokens each, 25K total budget)

  • Plan attachment (if in plan mode)

  • Deferred tool schemas, agent listings, MCP instructions

  • SessionStart hooks re-execute (restores CLAUDE.md, etc.)

  • Session metadata re-appended for --resume display

Step 4: Boundary Message

A SystemCompactBoundaryMessage marks the compaction point in the message history.

Partial Compaction

Two directional variants for more surgical context management:

  • from direction: Summarize messages AFTER a pivot index, keep earlier ones intact. Preserves prompt cache because the kept prefix is unchanged.

  • up_to direction: Summarize messages BEFORE pivot, keep later ones. Breaks cache because the summary changes the prefix.

Prompt-Too-Long Recovery

If the compaction request itself hits prompt-too-long (the conversation is so large even the summarizer can't process it):

  1. Group messages by API round via groupMessagesByApiRound()

  2. Drop the oldest groups until the token gap is covered (or 20% of groups if gap is unparseable)

  3. Retry up to 3 times

  4. If all retries fail → ERROR_MESSAGE_PROMPT_TOO_LONG thrown
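
The shedding step can be sketched as follows. `groupMessagesByApiRound()` is the source's name; the token accounting and the function below are simplified stand-ins:

```typescript
function shedOldestGroups(
  groups: string[][],                  // messages grouped by API round
  tokensOf: (g: string[]) => number,   // token estimate per group
  tokenGap: number | null,             // how far over the limit, if parseable
): string[][] {
  if (tokenGap === null) {
    // Gap unparseable: drop 20% of the oldest groups.
    return groups.slice(Math.ceil(groups.length * 0.2));
  }
  // Drop whole rounds from the front until the gap is covered.
  let shed = 0;
  let i = 0;
  while (i < groups.length && shed < tokenGap) {
    shed += tokensOf(groups[i]);
    i++;
  }
  return groups.slice(i);
}
```

Dropping whole API rounds rather than individual messages keeps tool_use blocks paired with their tool_result blocks, which the API requires.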

7. Layer 5: Auto Memory Extraction

  • File: src/services/extractMemories/extractMemories.ts

  • Cost: One forked agent API call

  • When: End of each complete query loop (model produces final response with no tool calls)

Purpose

While Session Memory captures notes about the current session, Auto Memory Extraction builds durable, cross-session knowledge that persists in ~/.claude/projects/<path>/memory/.

Memory Types

Four types of memories are defined, each with specific save criteria.

Memory File Format

Each memory is a markdown file with YAML frontmatter:

---
name: testing-approach
description: User prefers integration tests over mocks after a prod incident
type: feedback
---

Integration tests must hit a real database, not mocks.

**Why:** Prior incident where mock/prod divergence masked a broken migration.

**How to apply:** When writing tests for database code, always use the test database helper.

What NOT to Save

The extraction prompt explicitly excludes:

  • Code patterns, conventions, architecture (derivable from code)

  • Git history (use git log/git blame)

  • Debugging solutions (the fix is in the code)

  • Anything in CLAUDE.md files

  • Ephemeral task details

Mutual Exclusivity with Main Agent

If the main agent already wrote memory files during the current turn, extraction is skipped. This prevents the background agent from duplicating work the main agent already did:

function hasMemoryWritesSince(messages, sinceUuid): boolean {
  // Scans for Edit/Write tool_use blocks targeting auto-memory paths
  // Returns true if the main agent already saved memories
}

Execution Strategy

The extraction prompt instructs the agent to be efficient with its limited turn budget:

Turn 1: Issue all FileRead calls in parallel for files you might update
Turn 2: Issue all FileWrite/FileEdit calls in parallel
Do not interleave reads and writes across multiple turns.

MEMORY.md: The Index

MEMORY.md is an index file, not a memory dump. Each entry should be one line under ~150 characters:

  • Testing Approach — Real DB tests only, no mocks after prod incident
  • User Profile — Senior Go eng, new to React, focused on observability

Hard limits: 200 lines or 25KB — whichever is hit first. Lines beyond 200 are truncated when loaded into the system prompt.
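
The whichever-comes-first truncation can be sketched as below. The 200-line and 25KB limits are the article's; the function name is assumed, and measuring the byte limit on UTF-8 is an implementation choice of this sketch:

```typescript
function truncateMemoryIndex(content: string): string {
  const MAX_LINES = 200;
  const MAX_BYTES = 25 * 1024;
  let lines = content.split("\n").slice(0, MAX_LINES);
  let out = lines.join("\n");
  // Byte limit measured on UTF-8, since index entries may contain non-ASCII.
  while (lines.length > 0 && new TextEncoder().encode(out).length > MAX_BYTES) {
    lines = lines.slice(0, -1);
    out = lines.join("\n");
  }
  return out;
}
```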

8. Layer 6: Dreaming
  • File: src/services/autoDream/autoDream.ts

  • Cost: One forked agent API call (potentially multi-turn)

  • When: Background, after sufficient time and sessions have accumulated

The Concept

Dreaming is cross-session memory consolidation — a background process that reviews past session transcripts and improves the memory directory. It's analogous to how biological memory consolidation happens during sleep: experiences from the day are reviewed, organized, and integrated into long-term storage.

Gate Sequence (Cheapest Check First)

The dream system uses a cascading gate design where each check is cheaper than the next, so most turns exit early.

The Lock Mechanism

The lock file at <memoryDir>/.consolidate-lock serves double duty:

Path: <autoMemPath>/.consolidate-lock
Body: Process PID (single line)
mtime: lastConsolidatedAt timestamp (the lock IS the timestamp)

  • Acquire: Write PID → mtime = now. Verify PID on re-read (race protection).

  • Success: mtime stays at now (marks consolidation time).

  • Failure: rollbackConsolidationLock(priorMtime) rewinds mtime via utimes().

  • Stale: If mtime > 60 minutes old AND PID is not running → reclaim.

  • Crash recovery: Dead PID detected → next process reclaims.
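
The lock rules above can be modeled in memory like this. The real code works on an actual file via fs and utimes(); here the file is a plain object, and the function names are assumptions:

```typescript
interface LockFile {
  pid: number;     // file body: holder's process ID
  mtimeMs: number; // doubles as lastConsolidatedAt
}

const STALE_MS = 60 * 60 * 1000; // 60 minutes

function tryAcquire(
  lock: LockFile | null,
  myPid: number,
  now: number,
  isPidRunning: (pid: number) => boolean,
): LockFile | null {
  const reclaimable =
    lock === null ||
    // Stale: older than 60 minutes AND the holder is dead.
    (now - lock.mtimeMs > STALE_MS && !isPidRunning(lock.pid));
  if (!reclaimable) return null;
  // Acquire: write our PID, stamp mtime = now.
  return { pid: myPid, mtimeMs: now };
}

function rollback(lock: LockFile, priorMtimeMs: number): LockFile {
  // On failure, rewind mtime so the next session retries promptly.
  return { ...lock, mtimeMs: priorMtimeMs };
}
```

Using the mtime as both mutex timestamp and success marker means a failed run must explicitly roll it back, which is exactly what `rollbackConsolidationLock` does in the source.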

Four-Phase Consolidation

The dream agent receives a structured prompt defining four phases:

Phase 1 — Orient:

  • ls the memory directory

  • Read MEMORY.md to understand the current index

  • Skim existing topic files to avoid creating duplicates

Phase 2 — Gather Recent Signal:

  • Review daily logs (logs/YYYY/MM/YYYY-MM-DD.md) if present

  • Check for drifted memories (facts that contradict current codebase)

  • Grep session transcripts narrowly for specific context: grep -rn "<narrow term>" transcripts/ --include="*.jsonl" | tail -50

  • "Don't exhaustively read transcripts. Look only for things you already suspect matter."

Phase 3 — Consolidate:

  • Write or update memory files

  • Merge new signal into existing topic files rather than creating near-duplicates

  • Convert relative dates to absolute ("yesterday" → "2026-03-30")

  • Delete contradicted facts at the source

Phase 4 — Prune and Index:

  • Update MEMORY.md to stay under 200 lines / 25KB

  • Remove pointers to stale/wrong/superseded memories

  • Shorten verbose index entries (detail belongs in topic files)

  • Resolve contradictions between files

Tool Constraints

The dream agent operates under strict restrictions:

  • Bash: Read-only commands only (ls, find, grep, cat, stat, wc, head, tail)

  • Edit/Write: Only to memory directory paths

  • No MCP tools, no Agent tool, no destructive operations

UI Surfacing

Dreams appear as background tasks in the footer pill, with a two-phase state machine:

DreamPhase: 'starting' → 'updating' (when first Edit/Write lands)
Status: running → completed | failed | killed

Users can kill a dream from the background tasks dialog — the lock mtime is rolled back so the next session can retry.

Progress Tracking

Each assistant turn from the dream agent is watched:

  • Text blocks captured for UI display

  • Tool_use blocks counted

  • Edit/Write file paths collected for the completion summary

  • Capped at 30 most recent turns for live display

9. Layer 7: Cross-Agent Communication

  • Files: src/utils/forkedAgent.ts, src/tools/AgentTool/, src/tools/SendMessageTool/

  • Cost: Varies by pattern

  • When: Agent spawning, background tasks, teammate coordination

The Forked Agent Pattern

Nearly every background operation in Claude Code (session memory, auto memory, dreaming, compaction, agent summaries) uses the forked agent pattern. This is the foundation:

CacheSafeParams = {
  systemPrompt: SystemPrompt,      // Must be byte-identical to parent
  userContext: { [k: string]: string },
  systemContext: { [k: string]: string },
  toolUseContext: ToolUseContext,  // Contains tools, model, options
  forkContextMessages: Message[],  // Parent's conversation (cache prefix)
}

The fork creates an isolated context with cloned mutable state:

  • readFileState: Cloned LRU cache (prevents cross-contamination)

  • abortController: Child controller linked to parent

  • denialTracking: Fresh tracking state

  • ContentReplacementState: Cloned (preserves cache-stable decisions)

But it shares the prompt cache by keeping identical cache-critical parameters. The API sees the same prefix and serves a cache hit.

Agent Tool: Spawning Sub-Agents

The Agent tool supports multiple spawning patterns:

Fork anti-recursion: Fork children keep the Agent tool in their tool pool (for cache-identical definitions) but detect the <fork_boilerplate_tag> in conversation history to reject recursive fork attempts.

Fork message construction for cache sharing:

All fork children produce byte-identical API request prefixes:

1. Full parent assistant message (all tool_use blocks, thinking, text)
2. Single user message with:
   - Identical placeholder result for every tool_use
   - Per-child directive text block (only this differs)

→ Maximum prompt cache sharing across concurrent forks

SendMessage: Inter-Agent Communication

The SendMessage tool enables runtime communication between agents:

SendMessage({
  to: 'research-agent',  // or '' for broadcast, 'uds:<path>', 'bridge:<id>'
  message: 'Check Section 5',
  summary: 'Requesting section review'
})

Routing logic:

  1. In-process subagent by name → queuePendingMessage() → drained at next tool round boundary

  2. Ambient team (process-based) → writeToMailbox() → file-based mailbox

  3. Cross-session → postInterClaudeMessage() via bridge/UDS

Structured messages for lifecycle control:

  • shutdown_request / shutdown_response — Graceful agent shutdown coordination

  • plan_approval_response — Leader approves/rejects teammate plans

Agent Memory: Persistent Cross-Invocation State

Agents can maintain persistent memory across invocations in three scopes.

Agent Summary: Periodic Progress Snapshots

For coordinator-mode sub-agents, a timer forks the conversation every 30 seconds to generate a 3-5 word progress summary:

Good: "Reading runAgent.ts"
Good: "Fixing null check in validate.ts"
Bad: "Investigating the issue" (too vague)
Bad: "Analyzed the branch diff" (past tense)

Uses Haiku (cheapest model) and denies all tools — it's a pure text generation task.

10. The Query Loop: How It All Fits Together

File: src/query.ts

11. Prompt Cache Optimization

One of the most sophisticated aspects of Claude Code's architecture is its obsessive prompt cache optimization. Nearly every design decision considers cache impact.

The Problem

Anthropic's API caches prompt prefixes server-side (~1 hour TTL). A cache hit means you only pay for the new tokens after the cached prefix; a cache miss means reprocessing the entire prompt at full input price. At 200K tokens, that's the difference between ~$0.003 and ~$0.60 per request.
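
Reproducing the article's numbers as back-of-envelope arithmetic, using Sonnet-class illustrative pricing of $3 per million input tokens and treating cached tokens as free (the article's simplification; real cached reads are billed at a reduced rate, not zero):

```typescript
const PRICE_PER_MTOK = 3.0; // illustrative fresh-input price, USD per million tokens

function inputCost(totalTokens: number, cachedTokens: number): number {
  const billed = totalTokens - cachedTokens; // only uncached tokens billed (simplified)
  return (billed * PRICE_PER_MTOK) / 1_000_000;
}

const missCost = inputCost(200_000, 0);       // full reprocess: $0.60
const hitCost = inputCost(200_000, 199_000);  // only 1K new tokens: $0.003
```

The 200x spread between the two is why the harness treats an unexpected cache break as a bug, not a performance footnote.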

Cache-Preserving Patterns

1. CacheSafeParams: Every forked agent (session memory, compaction, dreaming, extraction) inherits the parent's exact system prompt, tools, and message prefix. The fork's API request has an identical prefix → cache hit.

2. renderedSystemPrompt: The fork threads the parent's already-rendered system prompt bytes, avoiding re-rendering divergence (e.g., GrowthBook flag value changes between renders).

3. ContentReplacementState cloning: Tool result persistence decisions are frozen. The same results get the same previews on every API call → stable prefix.

4. Cached microcompact: Uses cache_edits to modify the server cache without changing the local prefix → no cache break.

5. Fork message construction: All fork children get byte-identical prefixes. Only the final directive differs → maximum cache sharing across concurrent forks.

6. Post-compact cache break notification: After compaction, notifyCompaction() resets the cache baseline so the expected post-compact cache miss isn't flagged as an anomaly.

Cache Break Detection

The system actively monitors for unexpected cache misses via promptCacheBreakDetection.ts, flagging them for investigation. Known-good cache breaks (compaction, microcompact, etc.) are pre-registered to avoid false positives.
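
The pre-registration pattern can be sketched as a small registry; the class and method names below are assumptions, not the source's API:

```typescript
class CacheBreakDetector {
  private expected = new Set<string>();

  // Known-good breaks (compaction, microcompact, ...) register before the break lands.
  registerExpectedBreak(reason: string): void {
    this.expected.add(reason);
  }

  // On a cache miss, consume a matching registration or flag an anomaly.
  onCacheMiss(reason: string): "expected" | "anomaly" {
    return this.expected.delete(reason) ? "expected" : "anomaly";
  }
}
```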

12. Key Numbers

Context Thresholds

Tool Result Budgets

Session Memory

Compaction

Dreaming

Microcompact

TimeBasedMCConfig (time-based microcompact) and ContextEditStrategy (server-side context editing), as they appear in the harness:

TimeBasedMCConfig = {
  enabled: false,           // Master switch
  gapThresholdMinutes: 60,  // Trigger after 1h idle
  keepRecent: 5             // Keep last 5 tool results
}

ContextEditStrategy =
  | { type: 'clear_tool_uses_20250919',   // Clear old tool results
      trigger: { type: 'input_tokens', value: 180_000 },
      clear_at_least: { type: 'input_tokens', value: 140_000 } }
  | { type: 'clear_thinking_20251015',    // Clear old thinking blocks
      keep: { type: 'thinking_turns', value: 1 } | 'all' }

13. Design Principles

1. Layered Defense, Cheapest First

Every context management layer is designed to prevent the next, more expensive layer from firing:

  • Tool result storage prevents microcompact from needing to clear as much

  • Microcompact prevents session memory compaction

  • Session memory compaction prevents full compaction

  • Full compaction prevents context overflow errors

2. Prompt Cache Preservation

Almost every design decision considers prompt cache impact. The system goes to extraordinary lengths to keep API request prefixes byte-identical: frozen ContentReplacementState, rendered system prompt threading, cache_edits API, identical fork message construction.

3. Isolation with Sharing

Forked agents get cloned mutable state (preventing cross-contamination) but share the prompt cache prefix (preventing cost explosion). This is a careful balance — too much isolation wastes cache, too much sharing causes bugs.

4. Circuit Breakers Everywhere

  • Autocompact: 3-strike limit

  • Dream scan: 10-minute throttle

  • Dream lock: PID-based mutex with stale detection

  • Session memory: Sequential execution wrapper

  • Extract memories: Mutual exclusivity with main agent writes

5. Graceful Degradation

Each system fails silently and lets the next layer catch. Session memory compaction returns null on failure → full compaction runs. Dream lock acquisition fails → next session retries. Extract memories errors → logged, not thrown.

6. Feature Flags as Kill Switches

Nearly every system is gated by GrowthBook feature flags:

  • tengu_session_memory — session memory

  • tengu_sm_compact — session memory compaction

  • tengu_onyx_plover — dreaming

  • tengu_passport_quail — auto memory extraction

  • tengu_slate_heron — time-based microcompact

  • CACHED_MICROCOMPACT — cache-editing microcompact

  • CONTEXT_COLLAPSE — context collapse

  • HISTORY_SNIP — message snipping

This allows rapid rollback without code deploys if any system causes problems.

7. Mutual Exclusivity Where Needed

  • Context Collapse ↔ Autocompact (collapse manages context its own way)

  • Main agent memory writes ↔ Background extraction (prevents duplication)

  • Session memory compaction ↔ Full compaction (SM tried first, full is fallback)

  • Autocompact ↔ Subagent query sources (prevents deadlocks)
