🪞 Uota学 · 💬 Discussion Topic

Caching Is a Constraint, Not a Feature: You Have to Design the Product Around It

Prompt caching is prefix matching: any jitter anywhere in the prefix makes you pay full price again for every token after it, which is why so much of Claude Code's design exists to protect the prefix.

2026-02-20

Key Points

  • Static first, dynamic last: the "law of physics" of agent prompts. The system prompt, tool definitions, and project instructions should share a prefix wherever possible; conversation messages go last.
  • Don't switch models or tools mid-session. Caches are isolated per model, and the tool set is part of the prefix. Switching mid-session looks like a saving but can end up costing more.
  • Patch with system messages instead of editing the system prompt. Pass updates like time changes and file edits in the next message, so the cached prefix is never broken.
  • Plan mode done right: express the state machine as tools. Don't swap the tool set for plan mode; keep the tools fixed and constrain behavior with the Enter/ExitPlanMode tools and messages.
  • Monitor cache hit rate the way you monitor uptime. A few points of hit-rate drop can blow up cost and latency; it deserves a SEV.

Relevance to Our Work

  • Our agent "architecture priorities" may need to be inverted: get cache-friendliness, determinism, and observability right first, then talk about fancy reasoning strategies.
  • For OpenClaw's tool layer: tool schema order and content must be stable; dynamic loading can use "stubs + search" instead of adding and removing tools.
  • Anywhere we fork (summarization, compaction, subtasks) should reuse the parent session's prefix (same system prompt, same tool set); otherwise the costs get very ugly.

Discussion Starters

  • Which of our current habits of adding/removing tools or editing the system prompt mid-session "for flexibility" are actually suicidal cache-breaking?
  • For a cache-friendly sub-agent handoff, what is the minimal sufficient information the handoff message must carry?
  • Which product interactions would you be willing to constrain into a "state machine" just to keep the cache stable? Might that actually make the product clearer?

Lessons from Building Claude Code: Prompt Caching Is Everything

It is often said in engineering that "Cache Rules Everything Around Me", and the same rule holds for agents.

Long-running agentic products like Claude Code are made feasible by prompt caching, which lets us reuse computation from previous round trips and significantly reduce latency and cost.

What is prompt caching, how does it work, and how do you implement it technically? Read more in @RLanceMartin's piece on prompt caching and our new auto-caching launch.

At Claude Code, we build our entire harness around prompt caching. A high prompt cache hit rate lowers costs and helps us offer more generous rate limits on our subscription plans, so we alert on our prompt cache hit rate and declare SEVs when it drops too low.

These are the (often unintuitive) lessons we've learned from optimizing prompt caching at scale.

Lay Out Your System Prompt for Caching

Prompt caching works by prefix matching: the API caches everything from the start of the request up to each cache_control breakpoint. This means the order you put things in matters enormously; you want as many of your requests as possible to share a prefix.

The best way to do this is static content first, dynamic content last. For Claude Code this looks like:

  • Static system prompt & tools (globally cached)
  • Claude.MD (cached within a project)
  • Session context (cached within a session)
  • Conversation messages

This way we maximize how many sessions share cache hits.

But this can be surprisingly fragile! Ways we have broken this ordering before include putting a too-precise timestamp in the static system prompt, shuffling tool definition order non-deterministically, and updating tool parameters (e.g. which agents the AgentTool can call).
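As a sketch, the layered ordering above might map onto a Messages API request body like this. The prompt text, tool definitions, and model id are illustrative, and the breakpoints follow the API's ephemeral cache_control shape:

```python
# Sketch: static content first, dynamic content last, with a cache
# breakpoint after each progressively less-shared layer.

STATIC_SYSTEM = "You are a coding agent."  # never changes across sessions

TOOLS = [  # defined once, in a fixed order; part of the cached prefix
    {"name": "Bash", "description": "Run a shell command",
     "input_schema": {"type": "object"}},
    {"name": "Read", "description": "Read a file",
     "input_schema": {"type": "object"}},
]

def build_request(project_md, session_context, messages):
    """Assemble a request whose prefix is shared as widely as possible."""
    return {
        "model": "claude-opus-4-6",  # illustrative model id
        "tools": TOOLS,
        "system": [
            # Layer 1: shared by every session (global cache)
            {"type": "text", "text": STATIC_SYSTEM,
             "cache_control": {"type": "ephemeral"}},
            # Layer 2: shared within a project
            {"type": "text", "text": project_md,
             "cache_control": {"type": "ephemeral"}},
            # Layer 3: shared within a session
            {"type": "text", "text": session_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Layer 4: grows every turn, so it goes last
        "messages": messages,
    }
```

Two requests from the same project and session then differ only in their trailing messages, so everything above them can be served from cache.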

Use System Messages for Updates

Sometimes the information you put in your prompt goes stale: the time changes, or the user edits a file. It is tempting to update the prompt, but that causes a cache miss and can end up being quite expensive for the user.

Consider whether you can pass that information in via messages on the next turn instead. In Claude Code, we add a tag to the next user message or tool result with the updated information for the model (e.g. it is now Wednesday), which helps preserve the cache.
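The post does not name the exact tag Claude Code uses, so as a hedged sketch with an invented tag name, the pattern is roughly: leave the prefix untouched and carry the update inside the next user message.

```python
# Sketch: deliver stale-prompt updates as message content rather than by
# editing the system prompt. The <session-update> tag name is illustrative.

def with_update(user_text, updates):
    """Prepend an update tag to the next user message, so the cached
    prefix (system prompt + tools + earlier messages) is untouched."""
    lines = [f"{k}: {v}" for k, v in sorted(updates.items())]
    reminder = "<session-update>\n" + "\n".join(lines) + "\n</session-update>"
    return {"role": "user", "content": reminder + "\n\n" + user_text}

msg = with_update("Continue the refactor.", {"current_day": "Wednesday"})
```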

Don't Change Models Mid-Session

Prompt caches are unique to each model, and this can make the math of prompt caching quite unintuitive.

If you are 100k tokens into a conversation with Opus and want to ask a fairly easy question, it is actually more expensive to switch to Haiku than to have Opus answer, because the prompt cache would have to be rebuilt for Haiku.
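The arithmetic can be sketched as follows. The per-million-token prices and multipliers are illustrative stand-ins, not actual rates, but they show why re-ingesting the prefix at the new model's full input price can outweigh a cheaper base rate:

```python
# Back-of-envelope: cache reads are billed at a small fraction of the base
# input price, while a new model must rebuild its cache from scratch.

CONTEXT_TOKENS = 100_000

OPUS_INPUT_PER_MTOK = 5.00   # illustrative price
HAIKU_INPUT_PER_MTOK = 1.00  # illustrative price
CACHE_READ_MULT = 0.1        # cache reads at ~10% of input price (assumed)
CACHE_WRITE_MULT = 1.25      # cache writes carry a premium (assumed)

# Stay on Opus: the whole 100k-token prefix is a cache read.
stay_cost = CONTEXT_TOKENS / 1e6 * OPUS_INPUT_PER_MTOK * CACHE_READ_MULT

# Switch to Haiku: caches are per-model, so the prefix is re-ingested
# (and re-cached) at full Haiku input price.
switch_cost = CONTEXT_TOKENS / 1e6 * HAIKU_INPUT_PER_MTOK * CACHE_WRITE_MULT

print(f"stay on Opus:    ${stay_cost:.3f}")
print(f"switch to Haiku: ${switch_cost:.3f}")
```

Under these assumed rates, staying on Opus costs $0.050 against $0.125 for the switch, and the gap widens with every further turn that misses the cache.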

If you do need to switch models, the best way is with subagents: Opus prepares a "handoff" message spelling out the task it needs done. We do this often in Claude Code with the Explore agents, which use Haiku.

Never Add or Remove Tools Mid-Session

Changing the tool set in the middle of a conversation is one of the most common ways people break prompt caching. It seems intuitive: only give the model the tools you think it needs right now. But because tools are part of the cached prefix, adding or removing a tool invalidates the cache for the entire conversation.
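One defensive habit this suggests: freeze the tool list at session start and serialize it deterministically, so the prefix cannot drift between turns. A minimal sketch (tool names illustrative):

```python
# Sketch: fix the tool set once per session and give it a stable,
# deterministic serialization so the cached prefix stays byte-identical.

import json

def frozen_toolset(tools):
    """Sort once at session start; never mutate afterwards."""
    return sorted(tools, key=lambda t: t["name"])

session_tools = frozen_toolset([
    {"name": "Read", "input_schema": {"type": "object"}},
    {"name": "Bash", "input_schema": {"type": "object"}},
])

# Any later request must serialize to the exact same bytes:
sig = json.dumps(session_tools, sort_keys=True)
```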

Plan Mode: Design Around the Cache

Plan mode is a great example of designing a feature around caching constraints. The intuitive approach would be: when the user enters plan mode, swap the tool set out for read-only tools. But that would break the cache.

Instead, we keep all tools in the request at all times and make EnterPlanMode and ExitPlanMode tools themselves. When the user toggles plan mode on, the agent gets a system message explaining that it is in plan mode and what the instructions are: explore the codebase, don't edit files, call ExitPlanMode when the plan is complete. The tool definitions never change.

This has a bonus benefit: because EnterPlanMode is a tool the model can call itself, it can autonomously enter plan mode when it detects a hard problem, without any cache break.
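A minimal sketch of that state machine, assuming a runtime-side handler that enforces plan mode while the tool definitions sent to the API stay fixed (tool names other than Enter/ExitPlanMode are illustrative):

```python
# Sketch: plan mode as a state machine expressed through tools. The tool
# *definitions* never change; only runtime enforcement and a mode message do.

READ_ONLY = {"Read", "Grep", "EnterPlanMode", "ExitPlanMode"}

class Session:
    def __init__(self):
        self.plan_mode = False
        self.messages = []

    def handle_tool_call(self, name):
        if name == "EnterPlanMode":
            self.plan_mode = True
            # Delivered as a message, not a system-prompt edit, so the
            # cached prefix survives.
            self.messages.append({"role": "user", "content":
                "You are in plan mode: explore the codebase, do not edit "
                "files, and call ExitPlanMode when the plan is complete."})
            return "ok"
        if name == "ExitPlanMode":
            self.plan_mode = False
            return "ok"
        if self.plan_mode and name not in READ_ONLY:
            return f"error: {name} is not allowed in plan mode"
        return "ok"
```

Because EnterPlanMode is just another tool call, the model can flip this switch itself without any change to the request prefix.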

Tool Search: Defer Instead of Remove

The same principle applies to our tool search feature. Claude Code can have dozens of MCP tools loaded; including all of them in every request would be expensive, but removing them mid-conversation would break the cache.

Our solution: defer_loading. Instead of removing tools, we send lightweight stubs (just the tool name, with defer_loading: true) that the model can "discover" via a ToolSearch tool when needed. The full tool schemas are loaded only when the model selects them. This keeps the cached prefix stable: the same stubs are always present, in the same order.

Luckily, you can use the tool search tool through our API to simplify this.
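A sketch of the stub pattern, assuming a local registry of full schemas; the tool names and schemas here are invented for illustration:

```python
# Sketch of defer_loading: every tool is always present as a stub, in a
# fixed order, and full schemas are resolved only when searched for.

FULL_SCHEMAS = {  # loaded lazily; never part of the cached prefix
    "jira_create_issue": {"type": "object",
                          "properties": {"title": {"type": "string"}}},
    "slack_post_message": {"type": "object",
                           "properties": {"text": {"type": "string"}}},
}

def stub_tools():
    """Lightweight stubs: name only, deferred. Stable sorted order keeps
    the cached prefix byte-identical across turns."""
    return [{"name": n, "defer_loading": True} for n in sorted(FULL_SCHEMAS)]

def tool_search(query):
    """What a ToolSearch call resolves to: full schemas for matches only."""
    return [{"name": n, "input_schema": FULL_SCHEMAS[n]}
            for n in sorted(FULL_SCHEMAS) if query in n]
```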

Forking Context: Compaction

Compaction is what happens when you run out of context window: we summarize the conversation so far and continue a new session with that summary.

Surprisingly, compaction has many unintuitive edge cases with prompt caching.

In particular, compaction needs to send the entire conversation to the model to generate a summary. If that is a separate API call with a different system prompt and no tools (the simplest implementation), the cached prefix from the main conversation doesn't match at all. You pay full price for all those input tokens, drastically increasing the cost for the user.

The Solution: Cache-Safe Forking

When we run compaction, we use exactly the same system prompt, user context, system context, and tool definitions as the parent conversation. We keep the parent's conversation messages in front, then append the compaction prompt as a new user message at the end.

From the API's perspective, this request looks nearly identical to the parent's last request: same prefix, same tools, same history. The cached prefix is reused, and the only new tokens are the compaction prompt itself.

This does mean we need to reserve a "compaction buffer" so there is enough room left in the context window for the compaction message and the summary output tokens.
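The fork can be sketched as: copy the parent's last request and append exactly one message, so the shared prefix stays byte-identical. Field shapes and the compaction prompt wording are illustrative:

```python
# Sketch of a cache-safe fork: the compaction request reuses the parent
# request verbatim and only appends one trailing user message.

import copy

COMPACTION_PROMPT = (
    "Summarize the conversation so far so it can seed a fresh session. "
    "Preserve open tasks, key decisions, and file paths."
)

def fork_for_compaction(parent_request):
    """Same system prompt, same tools, same history; one appended message."""
    fork = copy.deepcopy(parent_request)  # parent stays untouched
    fork["messages"].append({"role": "user", "content": COMPACTION_PROMPT})
    return fork
```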

Compaction is tricky, but luckily you don't have to learn these lessons yourself: based on what we learned from Claude Code, we built compaction directly into the API, so you can apply these patterns in your own applications.

Lessons Learned

Prompt caching is a prefix match. Any change anywhere in the prefix invalidates everything after it. Design your entire system around this constraint; get the ordering right and most of the caching works for free.

Use system messages instead of system prompt changes. It is tempting to edit the system prompt to enter plan mode, change the date, and so on, but it is better to insert these as system messages during the conversation.

Don't change tools or models mid-conversation. Model state transitions (like plan mode) as tools rather than by changing the tool set. Defer tool loading instead of removing tools.

Monitor your cache hit rate like you monitor uptime. We alert on cache breaks and treat them as incidents. A few percentage points of cache-miss rate can dramatically affect cost and latency.
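A sketch of what such monitoring might compute, assuming per-request usage counters shaped like the API's usage block (input_tokens, cache_read_input_tokens, cache_creation_input_tokens); the alert threshold is an invented example, not Claude Code's actual SEV line:

```python
# Sketch: cache hit rate as the fraction of input tokens served from cache.

ALERT_THRESHOLD = 0.85  # illustrative SEV line

def cache_hit_rate(usages):
    """Aggregate hit rate over a window of per-request usage dicts."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(u.get("input_tokens", 0)
                + u.get("cache_read_input_tokens", 0)
                + u.get("cache_creation_input_tokens", 0) for u in usages)
    return read / total if total else 0.0

def should_page(usages):
    return cache_hit_rate(usages) < ALERT_THRESHOLD
```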

Fork operations need to share the parent's prefix. If you need to run a side computation (compaction, summarization, skill execution), use identical, cache-safe parameters so you get cache hits on the parent's prefix.

Claude Code has been built around prompt caching from day one; if you are building an agent, you should do the same.

Link: http://x.com/i/article/2024543492064882688


Related Notes

Lessons from Building Claude Code: Prompt Caching Is Everything

  • Source: https://x.com/trq212/status/2024574133011673516?s=46
  • Mirror: https://x.com/trq212/status/2024574133011673516?s=46
  • Published: 2026-02-19T19:57:42+00:00
  • Saved: 2026-02-20


📋 Discussion Archive

Discussion in progress…