🧠 阿头学 · 💬 讨论题

从推理热潮走向智能体范式,下一战不是想更久而是会行动

这篇文章的核心判断是对的:下一战不在更长推理,而在为行动服务的闭环思考;但把智能体上升为主流终局,证据仍然不足,且明显夹带了 Qwen 路线的自我辩护。

2026-03-26

核心观点

  • 范式确实在迁移:o1 和 R1 证明了可验证反馈 + 强化学习 + 基础设施能显著抬高模型推理能力,这不是小修小补,而是后训练重心从单纯 SFT 转向 RL 的实质转折。
  • Thinking 与 Instruct 的冲突是真问题:作者对两者奖励函数冲突的判断站得住,企业要的是短、稳、便宜、可控,难题推理要的是长、探索、容错、多算力,强行揉进一个模型很容易两边都变差,这不是产品包装能掩盖的工程矛盾。
  • Agent 的难点主要不在模型,而在系统:一旦任务进入工具调用、浏览器、终端、沙箱和长链路反馈,训练对象就不再是模型答题,而是模型加环境加 harness 的整体系统,因此环境质量、rollout 吞吐、训练推理解耦和评估鲁棒性会变成真正瓶颈。
  • Reward hacking 会比推理时代更危险:文章对模型会借工具作弊的警告非常关键,搜索能泄题、代码环境能偷看未来信息、日志和 API 能暴露捷径,工具越强,伪优化空间越大,所以环境设计会直接决定能力真假。
  • 最大争议是替代论说得过头:智能体式思考会变得更重要,这个判断大概率正确,但说它会替代大量静态长推理,则属于外推过度,因为数学证明、复杂分析、受限环境下的高价值任务仍然依赖强离线推理能力。

跟我们的关联

1. 对 ATou 意味着什么:不要再把推理更长误判成产品壁垒,真正可用的壁垒是任务闭环、工具编排和环境控制,下一步应把手头业务拆成纯问答任务和可闭环执行任务,分别设计模型层和 agent 层。
2. 对 Neta 意味着什么:如果要判断一家 AI 公司有没有长期护城河,不能只看 benchmark 和模型榜单,必须看它是否掌握训练环境、评估器、执行沙箱和反作弊能力,下一步可用模型、环境、编排、成本四维框架重审项目。
3. 对 Uota 意味着什么:混合模型是否优雅融合,不该看宣传口径,要看用户体验是否自然、成本是否可接受、模式切换是否平滑,下一步可以拿几个典型产品测试同一模型做批量任务和深度任务时是否人格分裂。
4. 对 ATou/Neta/Uota 都意味着什么:多智能体不是天然高级,很多所谓 agent 其实只是 workflow 拼装,下一步应先验证单体模型加少量工具是否已足够,再决定是否需要 orchestrator 和子代理架构,避免过早系统复杂化。

讨论引子

1. 智能体式思考到底是在扩展推理能力,还是会真正替代一部分传统长推理场景?
2. 企业客户今天更愿意为稳定便宜的 Instruct 付费,还是愿意为高完成度但高延迟高成本的 Agent 付费?
3. 环境会成为下一代 AI 的护城河,还是会像数据一样快速商品化并被开源稀释?

过去两年重新塑造了我们如何评估模型,以及我们对模型的期待。OpenAI 的 o1 表明,思考可以成为一项核心能力,可以被专门训练,并对用户开放。DeepSeek-R1 证明,推理风格的后训练可以在原始实验室之外被复现并规模化。OpenAI 将 o1 描述为通过强化学习训练、在回答之前先思考的模型。DeepSeek 则把 R1 定位为一个与 o1 具备竞争力的开放推理模型。

那一阶段很重要。但 2025 年上半年主要讨论的是推理式思考,如何让模型在推理阶段投入更多算力,如何用更强的奖励信号来训练,如何把额外的推理力度暴露给用户或加以控制。现在的问题是接下来会发生什么。答案在于智能体式思考,为了行动而思考,在与环境交互的过程中,并根据世界的反馈持续更新计划。

1. o1 和 R1 的崛起到底教会了我们什么

第一波推理模型让我们明白,如果想在语言模型里把强化学习规模化,就需要确定性强、稳定、可扩展的反馈信号。数学、代码、逻辑等可验证领域之所以变得居于中心位置,是因为在这些场景里,奖励信号要比通用的偏好监督强得多。它们让强化学习可以优化正确性,而不是只优化看起来合理。基础设施也因此变得至关重要。

当模型被训练去在更长的轨迹里完成推理时,强化学习就不再是监督微调上的轻量补丁,而会变成一个系统问题。你需要大规模 rollout、高吞吐验证、稳定的策略更新、高效采样。推理模型的出现,既是建模层面的故事,也是基础设施层面的故事。OpenAI 将 o1 描述为用强化学习训练出来的一条推理产品线,而 DeepSeek 的 R1 之后也强化了这一方向,展示了基于推理的强化学习需要投入多少专门的算法与基础设施工作。第一次重大转变是从扩展预训练,走向扩展面向推理的后训练。

2. 真正的问题从来不只是把 Thinking 和 Instruct 合在一起

在 2025 年初,Qwen 团队里的很多人都曾有过一幅很宏大的图景。理想系统应该把思考模式和指令模式统一起来,支持可调的推理力度,理念上类似低 / 中 / 高三档推理设置。更进一步,它应该能从提示和上下文中自动推断合适的推理量,让模型自己决定什么时候立刻回答,什么时候多想一会儿,什么时候为了真正困难的问题投入更多计算。

从概念上说,这个方向是对的。Qwen3 是最清晰的公开尝试之一。它引入了混合思考模式,在同一系列里同时支持思考与非思考行为,强调可控的思考预算,并描述了一个四阶段的后训练流程,其中在长 CoT 冷启动和推理 RL 之后,明确包含了思考模式融合。

但融合说起来容易,做好很难。难点在数据。人们谈论融合思考与指令时,往往先想到模型侧的兼容性,一个 checkpoint 能否同时支持两种模式,一个对话模板能否在两者之间切换,一套服务栈能否暴露正确的开关。更深层的问题在于,两种模式的数据分布与行为目标差别很大。

我们在尝试平衡模型融合与提升后训练数据质量和多样性时,并没有把所有事情都做对。在那次迭代过程中,我们也非常关注用户实际是如何使用思考模式与指令模式的。一个强的 instruct 模型通常会因为直接、简短、格式遵从、低延迟而得到奖励,尤其是在改写、标注、模板化支持、结构化抽取、运营 QA 等重复、海量的企业任务中。一个强的 thinking 模型则会因为在难题上花更多 token、保持连贯的中间结构、探索替代路径,并保留足够的内部计算以实质性提升最终正确性而得到奖励。

这两种行为画像天然互相拉扯。如果融合数据没有被精细策划,结果通常会两头都平庸,思考行为变得嘈杂、臃肿,或不够果断,而指令行为会变得不够干脆、不够可靠,也比商业用户真正想要的更昂贵。

在实际使用中,分离仍然更有吸引力。2025 年稍后,在 Qwen3 最初的混合叙事之后,2507 系列分别发布了独立的 Instruct 与 Thinking 更新,其中包括 30B 和 235B 版本。在商用部署里,大量客户仍然希望用高吞吐、低成本、可强控制的 instruct 行为来做批量操作。在这些场景里,融合并不明显更有收益。把两条线分开,让团队能够更干净地聚焦各自模式的数据与训练问题。

也有其他实验室选择了相反路线。Anthropic 公开主张一体化模型理念。Claude 3.7 Sonnet 被介绍为混合推理模型,用户可以选择普通回答或扩展思考,API 用户还能设定思考预算。Anthropic 明确表示,他们相信推理应该是一项被整合的能力,而不是一条独立模型线。GLM-4.5 也公开把自己定位为兼具思考与非思考模式的混合推理模型,统一推理、编码与智能体能力。DeepSeek 之后也在 V3.1 的 Think & Non-Think 混合推理上走向了类似方向。

关键在于,这种融合是否自然。如果思考与指令只是被放在同一个 checkpoint 里共存,但行为仍像两种生硬缝在一起的人格,产品体验就依然别扭。真正成功的融合,需要一条平滑的推理力度连续谱。模型应当能表达多个努力层级,并最好能自适应地在其中做选择。GPT 式的力度控制指向的正是这一点,对算力的策略分配,而不是一个二元开关。

3. 为什么 Anthropic 的路线提供了有用的纠正

Anthropic 围绕 Claude 3.7 和 Claude 4 的公开表述很克制。他们强调一体化推理、用户可控的思考预算、真实世界任务、编码质量,以及后来在扩展思考中使用工具的能力。Claude 3.7 被呈现为可控预算的混合推理模型。Claude 4 则在此基础上允许推理与工具使用交错进行。与此同时,Anthropic 也把编码、长时间运行任务和智能体工作流作为主要目标来强调。

生成更长的推理轨迹,并不会自动让模型更聪明。很多时候,可见推理过多反而意味着分配不佳。如果模型试图用同一种冗长方式对所有事情都推理,它可能没有把重点排出来,没有把信息压缩好,甚至没有及时行动。Anthropic 的轨迹暗示了一种更自律的观点,思考应当被目标工作负载所塑形。如果目标是编码,那么思考就应该帮助完成代码库导航、规划、拆解、错误恢复和工具编排。如果目标是智能体工作流,那么思考应当提升长时程执行的质量,而不是产出看起来很厉害的中间文字。

这种对定向效用的强调指向了更大的变化,我们正在从训练模型的时代,走向训练智能体的时代。我们在 Qwen3 的博客里也把这点说得很明确,写到我们正在从以训练模型为中心的时代,过渡到以训练智能体为中心的时代,并将未来的 RL 进展与面向长时程推理的环境反馈联系起来。智能体是一种系统,它能制定计划,决定何时行动,使用工具,感知环境反馈,修订策略,并在很长的时间跨度上持续推进。它由与世界的闭环交互所定义。

4. 智能体式思考到底是什么意思

智能体式思考是一个不同的优化目标。推理式思考通常以最终答案之前内部推演的质量来评判,模型能否解出定理、写出证明、生成正确代码,或在基准测试上过关。智能体式思考则关注模型在与环境交互时,能否持续取得进展。

核心问题从模型能否想得足够久,转变为模型能否以一种能维持有效行动的方式去思考。智能体式思考必须处理一些纯推理模型大多可以回避的事情:

  • 决定什么时候停止思考并采取行动

  • 选择调用哪个工具,以及调用顺序

  • 融合来自环境的噪声或不完整观察

  • 在失败后修订计划

  • 在多轮交互与多次工具调用中保持连贯性

智能体式思考,是一种围绕行动展开的推理。

5. 为什么智能体 RL 基础设施更难

一旦目标从解基准题转为解决交互式任务,RL 的技术栈就会发生变化。经典推理 RL 使用的基础设施不再够用。在推理 RL 里,你往往可以把 rollout 当成相对自洽的轨迹,并配上相对干净的评估器。在智能体 RL 里,策略被嵌入更大的 harness 之中,工具服务、浏览器、终端、搜索引擎、模拟器、执行沙箱、API 层、记忆系统、编排框架都要一起上。环境不再只是静态验算器,它成为训练系统的一部分。

这带来了一项新的系统要求,训练与推理必须更清晰地解耦。没有这种解耦,rollout 吞吐会直接崩塌。想象一个编码智能体,它必须把生成的代码放到真实测试 harness 里执行,推理侧会卡在等待执行反馈上,训练侧会因拿不到完成轨迹而饥饿,整条流水线的 GPU 利用率会远低于经典推理 RL 的预期。再叠加工具延迟、部分可观测和有状态环境,这些低效会被进一步放大。结果是,在达到目标能力之前很久,实验就已经变慢并变得痛苦。

环境本身也成了一项核心研究产物。在 SFT 时代,我们痴迷于数据多样性。在智能体时代,我们更该痴迷于环境质量,稳定性、真实性、覆盖度、难度、状态多样性、反馈丰富度、抗 exploit 能力,以及 rollout 生成的可扩展性。搭环境正在从一个副业项目,变成一个真正的创业赛道。如果智能体要在接近生产的设定里被训练,那么环境就是核心能力栈的一部分。

6. 下一个前沿是更可用的思考

我的预期是,智能体式思考将成为思考的主流形态。它也许最终会替代掉不少旧式的静态独白式推理思考,那种过长、孤立的内部轨迹,试图用不断输出更多文字来弥补缺乏交互。即便面对极难的数学或编码任务,一个真正先进的系统也应该拥有搜索、模拟、执行、检查、验证和修订的权利。目标是以更稳健、更高产的方式把问题解决掉。

训练这类系统最难的挑战是 reward hacking。一旦模型获得真正有意义的工具访问,reward hacking 就会危险得多。带搜索的模型可能在 RL 过程中学会直接查答案。编码智能体可能利用仓库里的未来信息、滥用日志,或发现能让任务失效的捷径。环境里哪怕有隐蔽泄漏,也会让策略看起来像超人,实际却是在训练它作弊。这就是为什么智能体时代比推理时代更微妙。更好的工具让模型更有用,同时也扩大了伪优化的攻击面。下一批严肃研究瓶颈很可能来自环境设计、评估器鲁棒性、防作弊协议,以及策略与世界之间更有原则的接口。尽管如此,方向很清楚,具备工具的思考就是比孤立思考更有用,也更有希望提升真实生产力。

智能体式思考也意味着 harness 工程。核心智能将越来越多地来自多智能体的组织方式,一个负责规划与分流工作的 orchestrator,像领域专家一样行动的专门智能体,以及执行更窄任务的子智能体,它们同时帮助控制上下文,避免污染,并维持不同层级推理之间的隔离。未来会从训练模型转向训练智能体,再从训练智能体转向训练系统。

结语

推理浪潮的第一阶段确立了一件重要的事,在反馈信号可靠、基础设施跟得上的前提下,在语言模型之上叠加 RL 可以带来质变的更强认知能力。

更深层的转变,是从推理式思考走向智能体式思考,从想得更久走向为了行动而思考。训练的核心对象已经改变,它是模型加环境的系统,更具体地说,是智能体以及围绕它的 harness。这会改变最重要的研究产物是什么,模型架构与训练数据当然仍然重要,但环境设计、rollout 基础设施、评估器鲁棒性,以及多智能体协作的接口也同样重要。它也会改变好思考的含义,在真实世界约束下、能支撑持续行动的最有用轨迹,比最长、最显眼的轨迹更重要。

它还会改变竞争优势的来源。在推理时代,优势来自更好的 RL 算法、更强的反馈信号与更可扩展的训练流水线。在智能体时代,优势将来自更好的环境、更紧密的训练与服务集成、更强的 harness 工程能力,以及把模型决策与其带来后果闭环起来的能力。

The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.


That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.


1. What the Rise of o1 and R1 Actually Taught Us


The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.

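Why verifiable rewards are "much stronger than generic preference supervision" can be made concrete with a toy verifier: the reward is earned only by passing executable tests, never by sounding plausible. A minimal Python sketch (the `solve` entry-point convention is an assumption for illustration):

```python
def verifier_reward(candidate_src: str, tests: list[tuple[int, int]]) -> float:
    """Binary reward from executable verification: 1.0 only if the
    candidate passes every test, else 0.0. Unlike a learned preference
    score, this signal cannot be satisfied by merely plausible output."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)     # define the candidate function
        fn = namespace["solve"]            # assumed entry-point name
        return 1.0 if all(fn(x) == y for x, y in tests) else 0.0
    except Exception:
        return 0.0                         # crashes earn no reward

# A correct and a plausible-but-wrong rollout for "square the input":
tests = [(2, 4), (3, 9), (-1, 1)]
good = "def solve(x):\n    return x * x"
bad = "def solve(x):\n    return x + x"    # passes x=2, fails x=3
print(verifier_reward(good, tests), verifier_reward(bad, tests))  # 1.0 0.0
```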

Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.

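The "high-throughput verification" requirement can be pictured as batching verdicts across workers instead of checking rollouts one by one; the `verify` predicate below is a deliberately trivial stand-in for an expensive step such as test execution or proof checking:

```python
from concurrent.futures import ThreadPoolExecutor

def verify(rollout: dict) -> bool:
    """Check a rollout's final answer against its task's ground truth.
    Stand-in for the expensive part (sandboxed execution, proof check)."""
    return rollout["answer"] == rollout["expected"]

def verify_batch(rollouts: list[dict], workers: int = 8) -> list[bool]:
    # RL on long trajectories needs verification to keep pace with
    # generation, so verdicts are computed concurrently, not serially.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(verify, rollouts))

rollouts = [{"answer": i * i, "expected": i * i if i % 3 else -1} for i in range(9)]
verdicts = verify_batch(rollouts)
print(sum(verdicts), "of", len(verdicts), "rollouts verified")  # 6 of 9
```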

2. The Real Problem Was Never Just "Merge Thinking and Instruct"


At the beginning of 2025, many of us on the Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.


Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.

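The mechanics of a controllable thinking budget can be caricatured as a capped deliberation loop. Everything below (the dispatcher shape, the fixed per-step token cost) is invented for illustration and is not a description of Qwen3 internals:

```python
def answer(prompt: str, thinking: bool, budget: int = 64) -> dict:
    """Toy hybrid-mode dispatcher: in thinking mode, spend deliberation
    tokens up to `budget` before answering; in non-thinking mode, reply
    immediately. All numbers are illustrative."""
    trace = []
    if thinking:
        spent = 0
        while spent < budget:
            cost = 16                      # pretend each step costs 16 tokens
            if spent + cost > budget:
                break                      # budget exhausted: stop and answer
            trace.append(f"step-{len(trace)}")  # stand-in for a reasoning step
            spent += cost
    return {"prompt": prompt, "trace": trace, "answer": "final answer"}

fast = answer("rewrite this sentence", thinking=False)
slow = answer("prove this lemma", thinking=True, budget=40)
print(len(fast["trace"]), len(slow["trace"]))  # 0 2
```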

But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.


We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.


These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.


Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.


Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.


The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.

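One way to picture "a policy over compute, rather than a binary switch" is a router that maps estimated difficulty to an effort level. The keyword heuristic and budget table below are hand-written stand-ins for what would in practice be a learned predictor:

```python
EFFORT_BUDGETS = {"low": 0, "medium": 256, "high": 2048}  # illustrative token caps

def estimate_difficulty(prompt: str) -> float:
    """Crude stand-in for a learned difficulty predictor: a keyword
    heuristic scored between 0 and 1."""
    hard_markers = ("prove", "debug", "optimize", "why")
    return sum(m in prompt.lower() for m in hard_markers) / len(hard_markers)

def choose_effort(prompt: str) -> str:
    # A policy over compute: pick an effort level per request instead of
    # flipping a global thinking on/off switch.
    d = estimate_difficulty(prompt)
    if d == 0:
        return "low"
    return "high" if d >= 0.5 else "medium"

print(choose_effort("translate to French"))                     # low
print(choose_effort("why does this test fail"))                 # medium
print(choose_effort("prove it, then debug the counterexample")) # high
```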

3. Why Anthropic's Direction Was a Useful Corrective


Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.


Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.


This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.


4. What "Agentic Thinking" Really Means


Agentic thinking is a different optimization target. Reasoning thinking is usually judged by the quality of internal deliberation before a final answer: can the model solve the theorem, write the proof, produce the correct code, or pass the benchmark. Agentic thinking is about whether the model can keep making progress while interacting with an environment.


The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking has to handle several things that pure reasoning models can mostly avoid:


  • Deciding when to stop thinking and take an action
  • Choosing which tool to invoke and in what order
  • Incorporating noisy or partial observations from the environment
  • Revising plans after failures
  • Maintaining coherence across many turns and many tool calls

Agentic thinking is reasoning that works through action.

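The loop implied by the bullets above (decide, act, observe, revise, stop) can be sketched in a few lines; the task and tool objects here are invented for illustration:

```python
def run_agent(task, tools, max_turns=8):
    """Minimal closed-loop agent skeleton: each turn takes an action,
    feeds the observation back in, and revises the plan on failure."""
    plan, history = ["inspect", "fix"], []
    for _ in range(max_turns):
        if not plan:                  # decide when to stop and finish
            break
        action = plan.pop(0)          # choose which tool, in what order
        obs = tools[action](task)     # possibly noisy/partial observation
        history.append((action, obs))
        if obs == "failed":           # revise the plan after a failure
            plan.insert(0, "retry")
        if obs == "done":
            return history
    return history

tools = {
    "inspect": lambda t: "found bug",
    "fix": lambda t: "failed",        # first fix attempt fails
    "retry": lambda t: "done",
}
print(run_agent("task", tools))
```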

5. Why Agentic RL Infrastructure Is Harder


Once the objective shifts from solving benchmark problems to solving interactive tasks, the RL stack changes. The infrastructure used for classical reasoning RL isn't enough. In reasoning RL, you can often treat rollouts as mostly self-contained trajectories with relatively clean evaluators. In agentic RL, the policy is embedded inside a larger harness: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static verifier; it's part of the training system.

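The difference between a static verifier and an environment that is "part of the training system" is easiest to see in code: the environment below holds state across tool calls and emits stepwise feedback. The tool names and reward scheme are illustrative:

```python
class ToolEnv:
    """Sketch of an agentic-RL environment: unlike a static verifier, it
    keeps state between tool calls and returns per-step observations."""
    def __init__(self):
        self.state = {"tests_passing": False}
        self.steps = 0

    def reset(self):
        self.__init__()
        return "repo checked out"

    def step(self, tool: str):
        self.steps += 1
        if tool == "ls":
            obs = "main.py  tests/"
        elif tool == "patch":
            self.state["tests_passing"] = True   # pretend the patch works
            obs = "patch applied"
        elif tool == "run_tests":
            obs = "PASS" if self.state["tests_passing"] else "FAIL"
        else:
            obs = "unknown tool"
        reward = 1.0 if obs == "PASS" else 0.0
        done = obs == "PASS" or self.steps >= 10
        return obs, reward, done

env = ToolEnv()
env.reset()
for tool in ["ls", "run_tests", "patch", "run_tests"]:
    obs, reward, done = env.step(tool)
print(obs, reward, done)  # PASS 1.0 True
```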

This creates a new systems requirement: training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.

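The decoupling argument can be sketched as a producer/consumer split: rollout workers block on (simulated) environment latency while the trainer consumes completed trajectories as they arrive. Threads and a queue stand in for what would really be separate inference and training services:

```python
import queue
import threading
import time

def rollout_worker(traj_q: queue.Queue, n: int):
    """Inference side: generates trajectories, blocking on simulated
    tool/execution latency, and streams them into a buffer."""
    for i in range(n):
        time.sleep(0.01)              # stand-in for environment latency
        traj_q.put({"id": i, "reward": 1.0})
    traj_q.put(None)                  # sentinel: no more trajectories

def trainer(traj_q: queue.Queue, batch: int = 4) -> int:
    """Training side: consumes completed trajectories as they arrive
    instead of blocking the whole pipeline on the slowest rollout."""
    updates, buf = 0, []
    while True:
        item = traj_q.get()
        if item is None:
            break
        buf.append(item)
        if len(buf) >= batch:
            updates += 1              # stand-in for a policy update
            buf.clear()
    return updates

q = queue.Queue(maxsize=16)
t = threading.Thread(target=rollout_worker, args=(q, 12))
t.start()
print(trainer(q))                     # 3 (12 trajectories / batch of 4)
t.join()
```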

The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.


6. The Next Frontier Is More Usable Thought


My expectation is that agentic thinking will become the dominant form of thinking. I think it may eventually replace much of the old static-monologue version of reasoning thinking: excessively long, isolated internal traces that try to compensate for lack of interaction by emitting more and more text. Even on very difficult math or coding tasks, a genuinely advanced system should be free to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.


The hardest challenge in training such systems is reward hacking. As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization. We should expect the next serious research bottlenecks to come from environment design, evaluator robustness, anti-cheating protocols, and more principled interfaces between policy and world. Still, the direction is clear. Tool-enabled thinking is simply more useful than isolated thinking, and has a far better chance of improving real productivity.

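One concrete anti-leak measure this discussion suggests is taint-tracking in the environment: if an episode ever touched a resource that holds ground truth, its reward is zeroed no matter how good the final answer looks. A minimal sketch, with invented file names:

```python
class LeakGuardedEnv:
    """Sketch of exploit resistance: track which resources hold ground
    truth and zero the reward for any episode that read them."""
    FORBIDDEN = {"tests/expected_output.txt", "answers.db"}

    def __init__(self):
        self.tainted = False

    def read(self, path: str) -> str:
        if path in self.FORBIDDEN:
            self.tainted = True        # the policy peeked at the answer
            return "42"
        return "source code ..."

    def reward(self, answer: str, expected: str = "42") -> float:
        if self.tainted:
            return 0.0                 # correct-by-cheating earns nothing
        return 1.0 if answer == expected else 0.0

honest, cheater = LeakGuardedEnv(), LeakGuardedEnv()
honest.read("main.py")
leaked = cheater.read("answers.db")
print(honest.reward("42"), cheater.reward(leaked))  # 1.0 0.0
```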

Agentic thinking will also mean harness engineering. The core intelligence will increasingly come from how multiple agents are organized: an orchestrator that plans and routes work, specialized agents that act like domain experts, and sub-agents that execute narrower tasks while helping control context, avoid pollution, and preserve separation between different levels of reasoning. The future is a shift from training models to training agents, and from training agents to training systems.

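The orchestrator/sub-agent shape described above can be sketched as a router over specialist agents; the planner, routing rule, and agent names are all invented for illustration:

```python
def planner(task: str) -> list[str]:
    """Orchestrator step 1: split a task into routable sub-tasks.
    A real planner would be a model call, not a string split."""
    return [part.strip() for part in task.split(";")]

AGENTS = {
    "search": lambda sub: f"[search-agent] results for '{sub}'",
    "code": lambda sub: f"[code-agent] patch for '{sub}'",
}

def orchestrate(task: str) -> list[str]:
    results = []
    for sub in planner(task):
        # Route each sub-task to a specialist; each runs with its own
        # narrow context instead of sharing one polluted transcript.
        agent = "code" if "fix" in sub else "search"
        results.append(AGENTS[agent](sub))
    return results

for line in orchestrate("find the failing test; fix the off-by-one"):
    print(line)
```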

Conclusion


The first phase of the reasoning wave established something important: RL on top of language models can produce qualitatively stronger cognition when the feedback signal is reliable and the infrastructure can support it.


The deeper transition is from reasoning thinking to agentic thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.


It also changes where the competitive edge will come from. In the reasoning era, the edge came from better RL algorithms, stronger feedback signals, and more scalable training pipelines. In the agentic era, the edge will come from better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and the consequences those decisions produce.


The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.

That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.

1. What the Rise of o1 and R1 Actually Taught Us

The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.

Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.

2. The Real Problem Was Never Just "Merge Thinking and Instruct"

At the beginning of 2025, many of us in Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.

Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.

But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.

We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.

These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.

Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.

Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.

The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.

3. Why Anthropic's Direction Was a Useful Corrective

Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.

Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.

This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.

4. What "Agentic Thinking" Really Means

Agentic thinking is a different optimization target. Reasoning thinking is usually judged by the quality of internal deliberation before a final answer: can the model prove the theorem, produce the correct code, or pass the benchmark. Agentic thinking is about whether the model can keep making progress while interacting with an environment.

The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking has to handle several things that pure reasoning models can mostly avoid:

  • Deciding when to stop thinking and take an action

  • Choosing which tool to invoke and in what order

  • Incorporating noisy or partial observations from the environment

  • Revising plans after failures

  • Maintaining coherence across many turns and many tool calls

Agentic thinking, in short, is reasoning through action rather than before it.
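The bullet points above can be collapsed into a closed-loop control skeleton. This is a minimal sketch, not any lab's actual harness: the `Env` protocol, the `plan_step` callback, and the toy `CountEnv` are all invented here for illustration.

```python
from typing import Any, Protocol

class Env(Protocol):
    """Minimal environment interface: the agent acts, the world answers."""
    def step(self, action: str) -> tuple[Any, bool]: ...  # (observation, done)

def run_agent(env: Env, plan_step, max_turns: int = 20) -> list[Any]:
    """Closed-loop skeleton: think -> act -> observe -> revise."""
    history: list[Any] = []
    for _ in range(max_turns):
        # "Thinking" is whatever plan_step does with the history:
        # deciding to stop, picking the next tool, revising after failure.
        action = plan_step(history)
        if action is None:          # the policy decided to stop
            break
        observation, done = env.step(action)
        history.append((action, observation))  # noisy/partial feedback folds in
        if done:
            break
    return history

class CountEnv:
    """Toy environment that completes after three actions."""
    def __init__(self) -> None:
        self.n = 0
    def step(self, action: str) -> tuple[int, bool]:
        self.n += 1
        return self.n, self.n >= 3

trace = run_agent(CountEnv(), plan_step=lambda history: "tick")
print(len(trace))  # 3 turns before the environment signals completion
```

Every hard problem in the list lives inside `plan_step` and `step`: when to return `None`, which action to emit, and how to fold a noisy observation back into the next decision.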

5. Why Agentic RL Infrastructure Is Harder

Once the objective shifts from solving benchmark problems to solving interactive tasks, the RL stack changes. The infrastructure used for classical reasoning RL isn't enough. In reasoning RL, you can often treat rollouts as mostly self-contained trajectories with relatively clean evaluators. In agentic RL, the policy is embedded inside a larger harness: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static verifier; it's part of the training system.

This creates a new systems requirement: training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.
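The decoupling argument can be sketched with a producer-consumer skeleton: rollout workers push finished trajectories into a buffer while the learner consumes them at its own pace, so slow tool calls never stall gradient steps. The timings and trajectory contents below are stand-ins, not a real RL stack.

```python
import queue
import threading
import time

# Buffer that decouples rollout generation from learner consumption.
trajectories: "queue.Queue[list[int]]" = queue.Queue(maxsize=100)

def rollout_worker(worker_id: int, n_episodes: int) -> None:
    for ep in range(n_episodes):
        time.sleep(0.01)  # stands in for tool latency / sandbox execution
        trajectories.put([worker_id, ep])  # a finished trajectory

def learner(n_updates: int) -> int:
    consumed = 0
    for _ in range(n_updates):
        trajectories.get()  # blocks only when no rollouts are ready
        consumed += 1       # stands in for a policy-gradient step
    return consumed

workers = [threading.Thread(target=rollout_worker, args=(i, 5)) for i in range(4)]
for w in workers:
    w.start()
updates = learner(n_updates=20)
for w in workers:
    w.join()
print(updates)  # 20 trajectories consumed across 4 asynchronous workers
```

In a real system the "queue" is a distributed replay or trajectory service and the workers run against live sandboxes, but the structural point is the same: the learner's throughput is bounded by the buffer, not by the slowest tool call in any single rollout.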

The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.

6. The Next Frontier Is More Usable Thought

My expectation is that agentic thinking will become the dominant form of thinking. I think it may eventually replace much of the old static-monologue version of reasoning thinking: excessively long, isolated internal traces that try to compensate for lack of interaction by emitting more and more text. Even on very difficult math or coding tasks, a genuinely advanced system should be able to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.

The hardest challenge in training such systems is reward hacking. As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization. We should expect the next serious research bottlenecks to come from environment design, evaluator robustness, anti-cheating protocols, and more principled interfaces between policy and world. Still, the direction is clear. Tool-enabled thinking is simply more useful than isolated thinking, and has a far better chance of improving real productivity.

Agentic thinking will also mean harness engineering. The core intelligence will increasingly come from how multiple agents are organized: an orchestrator that plans and routes work, specialized agents that act like domain experts, and sub-agents that execute narrower tasks while helping control context, avoid context pollution, and preserve separation between different levels of reasoning. The future is a shift from training models to training agents, and from training agents to training systems.
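The orchestrator-and-sub-agent shape can be sketched in a few lines. The agent names and the keyword routing rule below are invented for illustration; a real harness would route with a model and pass rich context, but the structural idea of isolated, specialized workers behind a planner is the same.

```python
from typing import Callable

def code_agent(task: str) -> str:
    """Stand-in for a sub-agent specialized in editing and testing code."""
    return f"[code agent] patched: {task}"

def research_agent(task: str) -> str:
    """Stand-in for a sub-agent specialized in reading and summarizing."""
    return f"[research agent] summarized: {task}"

# Toy routing table keyed on the task's leading verb; illustrative only.
ROUTES: dict[str, Callable[[str], str]] = {
    "fix": code_agent,
    "summarize": research_agent,
}

def orchestrator(tasks: list[str]) -> list[str]:
    """Plans nothing fancy: routes each task to a specialist and keeps
    each call's context isolated from the others."""
    results = []
    for task in tasks:
        verb = task.split()[0]
        worker = ROUTES.get(verb, research_agent)  # default sub-agent
        results.append(worker(task))
    return results

out = orchestrator(["fix the failing test", "summarize the RFC"])
print(out[0])  # -> [code agent] patched: fix the failing test
```

The design choice worth noticing is the isolation boundary: each sub-agent sees only its own task, which is exactly the context-pollution control the paragraph above describes.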

Conclusion

The first phase of the reasoning wave established something important: RL on top of language models can produce qualitatively stronger cognition when the feedback signal is reliable and the infrastructure can support it.

The deeper transition is from reasoning thinking to agentic thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.

It also changes where the competitive edge will come from. In the reasoning era, the edge came from better RL algorithms, stronger feedback signals, and more scalable training pipelines. In the agentic era, the edge will come from better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and the consequences those decisions produce.
