🧠 阿头学 · 💬 讨论题

程序化工具调用——Agent 架构从"对话驱动"转向"代码驱动"的关键一步

Anthropic 用 PTC 把 Agent 的工具编排从"模型每步都要回传上下文"改成"模型写代码在容器内一次跑完"，思路对了，但官方只秀收益不谈新增复杂度和失败模式，是一篇包装精良的产品 PR。
打开原文 ↗

2026-02-28 原文链接 ↗

阅读简报

双语对照

完整翻译

原文

讨论归档

核心观点

工具的本质是"控制面"而非"能力扩展" 文章重新定义了工具的价值：封装一个动作为工具，不是因为模型做不到，而是为了在该动作周围插入 UX、护栏、并发控制、可观测性和自主性管理。这个框架比"给模型多加几个 API"的思路高一个层次。
PTC 把中间结果留在容器内，只把最终输出交给模型 传统模式下每次工具调用结果都序列化回上下文，导致 token 膨胀和噪声干扰。PTC 让 Claude 生成编排代码，在沙箱容器内直接调用工具并处理中间数据，模型只看最终结论。搜索场景下 token 减少 24%、准确率提升 11%，机制上说得通。
"控制"与"可组合性"的权衡是真问题 工具给你控制力但牺牲编排灵活性，纯代码给你灵活性但丢失治理能力。PTC 试图两头兼顾：工具处理器仍在沙箱外作为控制面，代码在沙箱内做高效编排。架构思路是严肃的工程回答，不只是营销口号。
但"模型看不到中间过程"是一把双刃剑 中间步骤不回传意味着模型无法实时感知容器内的错误或异常，调试难度上升，复杂分支任务中可能"盲跑"完整个错误流程。文章对此完全回避。
基准数据缺乏控制变量 引用的 +11% 准确率来自"Opus 4.6 + PTC"，但没有说明对照组是"Opus 4.6 不启用 PTC"还是旧版本模型，无法区分模型升级本身和 PTC 各自的贡献。

跟我们的关联

对 ATou 做 Agent 产品的启发：设计工具时先问"我需要控制什么"，再决定"封装什么为工具"。五维判断表（UX / 护栏 / 并发 / 可观测性 / 自主性）可以直接作为团队内部的工具设计 checklist，避免无脑堆工具。
对 Neta 的技术架构意义：如果当前有"多轮工具调用反复回传 LLM"的工作流（尤其是搜索、数据清洗类），可以参考 PTC 的三层架构（LLM → 代码容器 → 工具处理器）做改造，把信息清洗下沉到代码层，把战略判断留给模型，token 成本和延迟都能显著下降。
对团队管理的映射：文中"如果框架能撤销某个动作，就可以给 Agent 更多自主权"这一原则，等价于管理学里的授权逻辑——授权边界不取决于能力，而取决于系统对错误的容错和回滚能力。设计 AI 员工权限体系时可以直接套用。
对投资判断的提醒：LMArena 排名波动极快，竞品几周内就可能用类似编排优化追平。PTC 的架构思路有长期价值，但"Search Arena 第一"这个标签的保质期很短。

讨论引子

1. 在高风险场景（金融交易、基础设施变更）中，你愿意让 Agent 在容器内"盲跑"多步再给你最终结果，还是宁愿多花 token 让每一步都回传可审计？这条线怎么画？ 2. PTC 让模型从"对话式调度员"变成"写脚本的工程师"，但模型生成的编排代码本身可能有 bug。当编排逻辑出错时，谁来兜底？这是否意味着我们需要一套"代码审计工具"来审计 AI 写的编排代码？ 3. 如果"工具 = 控制面"这个定义成立，那现在市面上大量 Agent 框架里动辄几十个工具的设计是不是根本方向错了？

TL;DR – 在 Claude Opus/Sonnet 4.6 中，程序化工具调用（PTC）是一项很有意思的能力。不同于每次工具调用都要往返 Claude 的上下文，Claude 会编写代码，在容器内直接编排并调用工具。中间步骤的工具结果会返回给代码，而不是回到 Claude 的上下文窗口。这能减少 token 消耗，并提升多步任务（如搜索）的性能。最近，启用 PTC 的 Opus 4.6 在 LMArena 的搜索基准上拿到第 1。查看我们的文档，了解更多关于 PTC 以及默认使用 PTC 的新 web search 工具。

“电脑使用”是 Claude 最核心的能力之一。只要给 Claude 一个 bash 工具，就能打开巨大的行动空间，并引出一个常见问题：bash 就够了吗？又该如何决定要给一个 agent 配哪些其他工具？

动作是 Claude 与世界交互的方式。工具则是一种用声明式方法来指定 Claude 可以执行哪些动作的机制。API 允许你通过提供工具名、描述和输入参数来为 Claude 添加工具。

https://www.anthropic.com/engineering/building-effective-agents

如果 Claude 想调用某个工具，它会返回一个包含工具参数的 JSON 对象以供执行。工具处理器（例如 MCP server、你写的代码等）会运行该工具并把上下文传回。如果你把这套流程跑在一个循环里，就得到一个 agent。比如，bash 工具会通过生成一个包含命令的 JSON 对象来产出 bash 命令，然后把它交给 bash 工具处理器执行：

何时使用工具

让 Claude 在循环中使用 bash 工具，本身就是一个电脑使用型 agent。这也是 Claude Code 的核心。但 Claude Code 并不只是用 bash。它把工具当作某些动作周围的 控制面（control surface）。可以看看 @trq212 对这些点的拆解。把某个动作“升级”为工具，在以下几种情况下往往是合理的：

用户体验（UX）： @trq212 提到 AskUserQuestion 工具。这个例子说明，在需要捕获特定动作并以特定方式呈现给用户时，工具很有用。
护栏（Guardrails）。 有些动作需要护栏。例如，文件编辑工具可以做一次 staleness check，用来验证文件自模型上次读取以来是否发生变化。
并发控制。 有时按并发安全性对动作分组很有用（例如，只读工具可以并行运行）。
可观测性。 将特定动作隔离出来便于记录日志也很有用（例如，衡量延迟或 token 消耗）。
自主性。 你可能希望按自主级别对动作分组。如果运行框架（harness）能够撤销某个动作，就可以更自由地批准该动作。

工具的问题

工具在“控制”与“可组合性”之间做了权衡。设想把三个动作都做成工具调用：每次工具调用的上下文都会返回给 Claude。每一次往返都会带来延迟，把工具结果序列化进上下文（例如，即便下一步只需要 5 行，也可能把上千行都塞进来），并且引入一次额外的推理步骤。随着动作数量增加，这种组合成本会越来越高。

程序化工具调用

Claude 正在开发一种能力，把代码的可组合性与工具的控制面结合起来。Claude 可以进行 程序化工具调用（PTC，文档见此处）：你照常定义工具；但 Claude 不再逐个调用，而是把它们组合成函数，并在代码执行容器中运行。每个函数的输出会 返回到容器，而不是进入 Claude 的上下文窗口。

https://x.com/arena/status/2027095484834398512?s=20

当代码调用一个工具（例如 await web_search(query)）时，容器会暂停。这次调用会以带类型的 tool-use 事件跨过沙箱边界。它的执行方式与模型直接调用工具并无二致（例如通过 handler、MCP server 等）。但结果会返回给正在运行的代码，而不是回到 Claude 的上下文窗口。代码会按照 Claude 指定的控制流来处理它（例如再调用另一个工具、过滤数据、累积结果）。只有最终输出才会到达 Claude。

在 Opus 4.6 上，我们观察到 PTC 在非编码评测（例如用于网页搜索的 BrowseComp 与 DeepsearchQA）中提升了 token 效率与性能。举例来说，与其把 50 条原始搜索结果塞进上下文让 Claude 推理，不如让代码以程序方式解析、过滤并交叉引用结果。这样能保留相关信息并丢弃其余部分（例如动态过滤）。在 BrowseComp 和 DeepsearchQA 上，它在平均减少 24% 输入 token 的同时，将准确率提升了平均 11%。启用 PTC 的 Opus 4.6 目前在 LMarena 的 Search Arena 排名第 1。

https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf

鉴于这些收益，PTC 现已内置到 API 的 web search tool 中，用于在使用搜索时提升性能并节省 token：

https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool

PTC 是一种在保留工具控制面的同时，获得代码执行好处（例如可组合性）的方法：工具实现运行在沙箱边界之外、也就是你这一侧，而不是在沙箱内部。每一次调用中，工具处理器仍然作为控制面处在中间，可以检查、拒绝、记录日志，或排队等待人工批准。但与此同时，它也让 Claude 能够在代码中更流畅地编排各种动作。

链接：http://x.com/i/article/2027070316661653504

TL;DR – Programmatic tool calling (PTC) is an interesting capability in Claude Opus/Sonnet 4.6. Instead of making tool calls that each round-trip through Claude's context, Claude writes code that can orchestrate tool calls directly inside a container. Intermediate tool results return to the code, not Claude’s context window. This reduces token usage and improves performance on multi-step tasks like search. Opus 4.6 with PTC recently scored #1 on LMArena’s search benchmark. See our docs to learn more about PTC and our new web search tool that uses PTC by default.

Computer use is one of Claude’s most central capabilities. Just giving Claude a bash tool opens up a broad action space and leads to a common question: is bash all you need? And how to decide what other tools to give an agent?

Actions are how Claude interacts with the world. Tools are a way to declaratively specify the actions that Claude can take. The API lets you add tools by giving Claude a tool name, description, and input arguments.

https://www.anthropic.com/engineering/building-effective-agents

If Claude wants to call a tool, it will respond with a JSON object of tool arguments to run. A tool handler (e.g., MCP server, code you write, etc) runs the tool and passes back context. If you run this in a loop, you have an agent. For example, the bash tool produced bash commands by generating a JSON object with the command. It is passed to a bash tool handler to execute:

When to use tools

Claude with a bash tool running in a loop is a computer-use agent. This is central to Claude Code. But Claude Code doesn’t just use bash. It uses tools as a control surface around certain actions. See @trq212's breakdown on these points. Promoting an action to a tool can make sense in a few cases:

**UX: ** @trq212 talks about the AskUserQuestion tool. This examples shows that tools are useful in cases where specific actions need to be caught and rendered to the user in a particular way.
Guardrails. Some actions need guardrails. For example, a file edit tool can run a staleness check to verify that the file hasn't changed since the model last read it.
Concurrency control. Sometimes it's useful to group actions by concurrency safety (e.g., read-only tools can run in parallel).
Observability. It can be useful to isolate specific actions for logging (e.g., measuring latency or token usage).
Autonomy. You may want to group actions by autonomy-level. If the harness can undo an action, it can approve the action more freely.

The problem with tools

Tools trade-off control with composability. Consider three actions as tool calls. The context from each tool call is returned back to Claude. Each round trip costs latency, serializes the tool result into context (e.g., it will pass thousands of rows even if the next step only needs five), and introduces a reasoning step. The composition tax grows with the number of actions.

Programmatic tool calling

Claude is developing a capability that unites the composability of code with the control surface of tools. Claude can perform programmatic tool calling (PTC, see docs here): you can define tools, as usual. But rather than calling them individually, Claude can compose them as functions and run them in a code execution container. The output of each function returns to the container rather than to Claude’s context window.

https://x.com/arena/status/2027095484834398512?s=20

When the code calls a tool (e.g., await web_search(query)) the container pauses. The call crosses the sandbox boundary as a typed tool-use event. It is fulfilled just as if the model directly called the tool (e.g., via a handler, an MCP server etc). But the result returns to the running code, not to Claude's context window. The code processes it following the control flow that Claude specified (e.g., calls another tool, filters the data, accumulates results). Only the final output reaches Claude.

With Opus 4.6, we’ve seen gains in token efficiency and performance on non-coding evals (e.g., BrowseComp and DeepsearchQA for web search) with PTC. For example, rather than pulling 50 raw search results into context for Claude to reason over, the code can parse, filter, and cross-reference results programmatically. This keeps what's relevant and discards the rest (e.g., dynamic filtering). Across BrowseComp and DeepsearchQA, it improved accuracy by an average of 11% while using 24% fewer input tokens. Opus 4.6 with PTC is currently #1 in LMarena’s Search Arena.

https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf

With these gains in mind, PTC is now built-into to the web search tool on the API to boost performance and save token when using search:

https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool

PTC is a way to get the benefit of code execution (e.g., composability) while preserving the control surface of tools: tool implementations run on your side of the sandbox, not inside it. The tool handlers still sit in the middle of every call as a control surface, able to inspect, reject, log, or queue for human approval. But it allows Claude to fluently orchestrate actions in code.

Link: http://x.com/i/article/2027070316661653504

https://www.anthropic.com/engineering/building-effective-agents

何时使用工具

用户体验（UX）： @trq212 提到 AskUserQuestion 工具。这个例子说明，在需要捕获特定动作并以特定方式呈现给用户时，工具很有用。
护栏（Guardrails）。 有些动作需要护栏。例如，文件编辑工具可以做一次 staleness check，用来验证文件自模型上次读取以来是否发生变化。
并发控制。 有时按并发安全性对动作分组很有用（例如，只读工具可以并行运行）。
可观测性。 将特定动作隔离出来便于记录日志也很有用（例如，衡量延迟或 token 消耗）。
自主性。 你可能希望按自主级别对动作分组。如果运行框架（harness）能够撤销某个动作，就可以更自由地批准该动作。

工具的问题

程序化工具调用

https://x.com/arena/status/2027095484834398512?s=20

https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf

鉴于这些收益，PTC 现已内置到 API 的 web search tool 中，用于在使用搜索时提升性能并节省 token：

https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool

链接：http://x.com/i/article/2027070316661653504

程序化工具调用——Agent 架构从"对话驱动"转向"代码驱动"的关键一步

核心观点

跟我们的关联

讨论引子

相关笔记

📋 讨论归档