
Building agents? Don't reach for "smarter" first; treat cache hit rate as your lifeline

The cost and latency of a long-context agent are largely determined by its repeated prefixes. Prompt Caching drops those tokens to 10% of the price, and auto-caching cuts the engineering effort as well.

2026-02-20 · Original link ↗

Key points

  • Caching is not an optimization, it is a precondition for viability. The stateless Messages API forces you to resend tools, instructions, and history every turn; without caching you pay full price for a large block of repeated tokens on every turn.
  • The mechanics: cache_control breakpoints plus exact hash matching. A single changed character causes a miss, so "repeatable and deterministic" matters more than a "pretty" prompt.
  • The value of auto-caching: it removes the mental overhead of moving breakpoints by hand. The breakpoint automatically follows the last cacheable block, which suits turn-based / loop agents.
  • The production metric to watch: cache hit rate. You can even treat it as a first-class KPI, as Manus and Claude Code do (a drop in hit rate means a cost explosion).

Relevance to us

  • OpenClaw and our internal agent stack should bake "prefix stability" into the architecture: leave the system prompt, tool definition order, and static instructions untouched; ship changes as follow-up message patches rather than edits to the prefix.
  • Don't optimize cost by staring at per-token model prices alone: in high-frequency, long-session scenarios, the leverage from cache hit rate can be an order of magnitude larger than switching models.
  • We need an engineering convention: deterministically ordered tool lists, no timestamps or random data in the system prompt, and no rewriting of history.
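
The "deterministically ordered tool lists" convention can be sketched in a few lines. `stable_tool_block` is a hypothetical helper, not part of any SDK:

```python
import json

def stable_tool_block(tools):
    """Sort tool definitions by name so the serialized prefix is byte-identical
    across requests. Any reordering would change the hash and break the cache."""
    return sorted(tools, key=lambda t: t["name"])

tools_a = [{"name": "write_file", "description": "Write a file"},
           {"name": "read_file", "description": "Read a file"}]
tools_b = list(reversed(tools_a))  # same tools, different arrival order

# After sorting, both serialize to the same bytes, hence the same cache hash.
assert json.dumps(stable_tool_block(tools_a)) == json.dumps(stable_tool_block(tools_b))
```

The same reasoning applies to anything else in the prefix: any value that varies per request (timestamps, request IDs) must live after the cached region, not inside it.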

Discussion starters

  • Which of our agent flows currently "resend a pile of static context every turn"? Can we split them into a stable prefix plus incremental patches?
  • If we make cache hit rate a core metric, how should we instrument and alert on it? What threshold counts as an "incident"?
  • Engineering-wise, which kinds of flexibility (e.g., dynamically adding or removing tools, dynamically editing the system prompt) are we willing to trade away for cache stability?
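
As a starting point for instrumentation, a hit-rate metric can be derived from per-request usage counters. The field names below follow Anthropic's documented usage object (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); verify them against your SDK version:

```python
def cache_hit_rate(usages):
    """Fraction of input tokens served from the prompt cache, given a list of
    per-request usage dicts in the Anthropic API's shape (assumed field names)."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(u.get("input_tokens", 0)
                + u.get("cache_creation_input_tokens", 0)
                + u.get("cache_read_input_tokens", 0)
                for u in usages)
    return read / total if total else 0.0

# Example: two turns; the second turn reads the prefix cached by the first.
usages = [
    {"input_tokens": 1200, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 0},
    {"input_tokens": 200, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 1200},
]
rate = cache_hit_rate(usages)
```

An alert on a rolling window of this value (e.g., sustained drop below your baseline) would catch prefix-breaking regressions early; the exact threshold is a question for the team.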

Prompt auto-caching with Claude

TL;DR: Prompt caching is a great way to save cost + latency when using Claude. Input tokens that use the prompt cache are 10% the cost of non-cached tokens. Auto-caching was just added to the API, which makes it easier to cache your prompt with a single cache_control parameter in the API request (docs here). Also, check out @trq212's deep dive on Claude Code's use of prompt caching and useful tips for cache-friendly prompt design.

The case for caching

Many AI applications ingest the same context across turns. For example, agents perform actions in a loop. Each action produces new context. Claude's messages API is stateless, which means it doesn't remember past actions. The agent harness needs to package new context with past actions, tool descriptions, and general instructions at each turn.

This means most of the context is the same across turns. But, without caching, you pay for the entire context window every turn. Why not just re-use the shared context? That's what prompt caching does. You can see on the pricing page that cached tokens are 10% the cost of base input tokens. With caching, you only pay in full for each new context block once.
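
A rough back-of-the-envelope comparison, using purely hypothetical prices (check the pricing page for real numbers) and ignoring any cache-write premium:

```python
# Hypothetical per-token prices for illustration only, not real pricing:
# base input $3.00 per million tokens; cached input at 10% of that.
base_price = 3.00 / 1_000_000
cached_price = 0.10 * base_price

prefix_tokens = 50_000   # stable prefix: system prompt + tools + history
turns = 20

# Without caching: the full prefix is billed at base price every turn.
uncached = turns * prefix_tokens * base_price

# With caching: full price once (the cache write), 10% on each later turn.
cached = prefix_tokens * base_price + (turns - 1) * prefix_tokens * cached_price

print(f"uncached ${uncached:.2f} vs cached ${cached:.3f}")
```

Under these assumed numbers, caching cuts the prefix cost from $3.00 to about $0.44 over twenty turns, and the gap widens as the prefix grows or the session runs longer.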

@peakji from Manus called out the cache hit rate as the single most important metric for a production AI agent. @trq212 has noted that prompt caching is critical for long running / token-heavy agents like Claude Code.

How it works

There are some great resources (e.g., here from @sankalp or here from @kipply) on the details of LLM inference and caching. In general, LLM inference pipelines typically use a prefill phase that processes the prompt and a decode phase that generates output tokens.

The intuition behind caching is that the prefill computation can be performed once, saved (e.g., cached), and then re-used if (part of) a future prompt is identical. Inference libraries / frameworks like vLLM and SGLang use different approaches to achieve this central idea.
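
The core idea can be illustrated with a toy content-addressed cache. This is a sketch of the prefix-reuse concept only, not how vLLM or SGLang actually store KV state:

```python
import hashlib

# Toy prefix cache: key the (expensive) prefill result by a hash of the
# exact prompt prefix; an identical prefix later skips the recomputation.
_cache = {}

def prefill(prefix_blocks):
    key = hashlib.sha256("\x00".join(prefix_blocks).encode()).hexdigest()
    if key in _cache:
        return _cache[key], True          # cache hit: reuse saved state
    state = f"kv-state-for-{key[:8]}"     # stand-in for real KV tensors
    _cache[key] = state
    return state, False                   # cache miss: computed and stored

_, hit1 = prefill(["system prompt", "tool defs"])
_, hit2 = prefill(["system prompt", "tool defs"])    # identical prefix: hit
_, hit3 = prefill(["system prompt!", "tool defs"])   # one character off: miss
```

Note how a single changed character produces a different hash and therefore a miss; this is the same exact-match behavior the Claude API's hashing imposes.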

Usage with Claude

Caching with the Claude messages API uses a cache_control breakpoint, which can be placed at any block in your prompt. This tells Claude two things.

First, it is a "write point" telling Claude to cache all blocks up to and including this one. This creates a cryptographic hash of all the content blocks up to that breakpoint. This is scoped to your workspace.

Second, it tells Claude to search backward at most 20 blocks from the breakpoint to find any prior cache write matches ("hits"). The hash requires identical content. One character difference will produce a different hash and a cache miss. If there's a match, the cache is used in prefill.
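
A minimal sketch of what such a request body looks like, following the block shapes in the prompt-caching docs (the model name is a placeholder; verify field shapes against the current API reference):

```python
# Manual cache_control breakpoint on the last static block of the prefix.
request = {
    "model": "claude-sonnet-4-5",   # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a coding agent. <long static instructions>",
            # Write point: everything up to and including this block is
            # hashed and cached; identical future prefixes hit the cache.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize the repo layout."}],
}
```

Everything before the breakpoint should be static; anything that varies per request belongs after it.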

Still, there are challenges with caching. For turn-based apps (e.g., agents), you have to move the breakpoint to the latest block as the conversation progresses. The API now addresses this with auto-caching. You can place a single cache_control parameter in your request to the Claude messages API.

With auto-caching, the cache breakpoint moves to the last cacheable block in your request. As your conversation grows, the breakpoint moves with it automatically. This still works with block-level caching if you want to set breakpoints (e.g., on your system prompt or other context blocks).
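
For contrast, the manual pattern that auto-caching replaces looks roughly like this. `with_trailing_breakpoint` is a hypothetical helper that re-pins the breakpoint to the last block before every request:

```python
def with_trailing_breakpoint(messages):
    """Return a copy of the message list with cache_control attached to the
    final content block, so the whole growing prefix gets cached each turn."""
    msgs = [dict(m) for m in messages]
    last = dict(msgs[-1])
    content = last["content"]
    if isinstance(content, str):
        # Normalize to block form so cache_control has somewhere to attach.
        content = [{"type": "text", "text": content}]
    else:
        content = [dict(b) for b in content]
    content[-1] = {**content[-1], "cache_control": {"type": "ephemeral"}}
    last["content"] = content
    msgs[-1] = last
    return msgs

history = [{"role": "user", "content": "turn 1"},
           {"role": "assistant", "content": "ok"},
           {"role": "user", "content": "turn 2"}]
out = with_trailing_breakpoint(history)
```

With auto-caching this bookkeeping disappears: the API pins the breakpoint to the last cacheable block for you.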

Another challenge is designing your prompt to maximize cache hits. For example, if you edit the history (see below) you risk breaking the cache.

This is a problem that we've tackled with Claude Code! @trq212 just shared a number of useful insights on prompt design with caching in mind.

Link: http://x.com/i/article/2024515623544639493

Related note

Prompt auto-caching with Claude

  • Source: https://x.com/rlancemartin/status/2024573404888911886?s=46
  • Mirror: https://x.com/rlancemartin/status/2024573404888911886?s=46
  • Published: 2026-02-19T19:54:49+00:00
  • Saved: 2026-02-20

📋 Discussion archive

Discussion in progress…