
Forge: Systems Breakthroughs and Engineering Practice for Agentic Reinforcement Learning

MiniMax claims that, through system decoupling, scheduling optimization, and composite rewards, it achieves RL training at a scale of millions of samples per day across on the order of a hundred thousand scaffolds with 200k contexts; however, the core innovations are mostly engineering composition rather than algorithmic breakthroughs, and they lack independent verification.
2026-02-13

Key Takeaways

  • System decoupling is the real engineering value. Physically isolating agent logic from the training infrastructure via a Gateway + Data Pool lets arbitrary white-box/black-box scaffolds plug in seamlessly. This resolves the fundamental tension in industrial-grade agent training between scaffold diversity and model optimization, and is currently the most pragmatic architectural choice.
  • Windowed FIFO is a heuristic that balances a real trade-off. In asynchronous scheduling, pure FIFO causes head-of-line blocking while pure greedy fetching causes distribution shift. The windowed constraint finds a middle ground between the two, but the window size W is set largely by experience, there is no auto-tuning logic, and the stability guarantee remains weak.
  • Turning context management into an action is the most insightful algorithmic improvement. Modeling context pruning/compression as explicit model actions inside RL training fundamentally resolves the inference-training distribution mismatch and is more efficient than simply extending context length, yet this point is buried under engineering detail in the post.
  • Prefix tree merging and KV cache optimization hit a real bottleneck. In multi-turn agent RL, large numbers of requests genuinely share long common prefixes, and eliminating the duplicated computation necessarily brings order-of-magnitude efficiency gains in both theory and practice; however, the baseline for the claimed 40x speedup is unclear and may be measured against an extremely naive scheme.
  • The composite reward framework's contribution to long-horizon credit assignment is overstated. Process rewards, time rewards, and Reward-to-go are all reasonable variance-reduction techniques, but they cannot fundamentally resolve the fragility of identifying which action within a 200k context actually caused success, and the task-completion-time reward is highly prone to reward hacking (sacrificing correctness for speed).

Relevance to Us

  • What it means for ATou: If you are building an agent product, Forge's middleware architecture provides a clear reference: do not hard-wire model logic into scaffold state; prioritize replayable trajectories, computable rewards, and standardized interfaces. A next step is to try the "context management as an explicit action" design in your own agent framework, teaching the model to actively clear useless history rather than passively accept the framework's truncation.
  • What it means for Neta: The article exposes a hidden killer in RL system design: distribution shift is often more fatal than a lack of compute. The lesson of Windowed FIFO is that, to keep the model's "cognitive diet" balanced, the system must deliberately throttle the output rate of easy tasks and force waiting for hard samples. This logic transfers to any task-scheduling or prioritization scenario, including team workflows.
  • What it means for Uota: Prefix tree merging and KV cache optimization are the infrastructure that sustains long-context RL training, but they are engineering optimizations, not new theory. More noteworthy is the introduction of efficiency-aware rewards: folding task completion time into the objective function marks a shift in agent evaluation from correctness to productivity, which directly determines the ROI of overseas SaaS agent products.
  • What it means for general decision-making: The objective `J = Throughput × Sample Efficiency` can be abstracted into a general framework: for any "intelligent system / team efficiency" problem, ask three questions. Will this raise throughput? Will it raise output per unit of input? Will it harm stability or convergence? Use this yardstick to evaluate any new proposal and avoid the trap of optimizing one dimension while severely damaging the others.

Discussion Starters

1. How should the window size W in Windowed FIFO be auto-tuned? Under extreme long-tail tasks (rollouts lasting hours), would a fixed W still stall the queue, and is dynamic adjustment or per-task prioritization required?

2. How does the task-completion-time reward avoid reward hacking? Chasing speed, might the model favor short but wrong answers or skip necessary reasoning steps? The post does not address how the weight conflict between correctness and speed is balanced.

3. What exactly does the "stable improvement" for black-box agents mean? With a fully opaque scaffold, credit assignment relies only on the I/O data collected by the Gateway; is its efficiency far below the white-box case, and how large is the performance variance across scaffolds?


Scaling RL for complex, real-world agents confronts a fundamental trilemma: balancing system throughput, training stability, and agent flexibility. These conflicting constraints have long impeded the application of large-scale RL in industrial-grade systems.

In this post, we reveal how we resolved this "impossible triangle" through a holistic approach in our internal RL framework Forge, combining flexible system architecture, algorithmic design, optimized asynchronous scheduling, and extreme training-inference efficiency. By leveraging standardized interaction protocols, Forge supports the training of arbitrary agent scaffolds, enabling the massive-scale RL that culminates in the breakthrough capabilities of the MiniMax M2.5 model.

Throughout the construction of MiniMax M2.5, our RL system navigated over a hundred thousand distinct real-world agent scaffolds and environments. Operating with context lengths of up to 200k, the system maintained a daily processing throughput on the scale of millions of samples, realizing consistent reward convergence and genuine capability improvements in the underlying model. Integrated with our CISPO algorithm and composite reward framework, M2.5 pushes the frontier for efficient and reliable real-world productivity, achieving our mission "Intelligence with Everyone".

  1. Problem Formulation

Before delving into our architectural design, we first formulate the optimization objective of our Agent RL system as the maximization of the Effective Agent (A) Training Yield (J), defined as:

\begin{aligned}
\max_{\theta}\; J(\theta) = {}& \text{Throughput}(\mathcal{A}) \times \text{Sample Efficiency}(\mathcal{A}) \\
\text{s.t.}\quad & \forall\, \mathcal{A} \in \Omega_{\text{agent}} \quad (\text{Arbitrary Agent}) \\
& \mathbb{E}[\text{Update Variance}] < \delta \quad (\text{Stability}) \\
& \mathbb{E}\big[\lvert J^{(T)} - J^{*} \rvert\big] < \epsilon \quad (\text{Convergence})
\end{aligned}

where System Throughput refers to the raw number of tokens processed per second, bottlenecked by 4 components of the whole RL system: rollout, training, data processing and I/O. Sample Efficiency refers to the average performance improvement for each sample determined by data distribution, data quality, algorithm efficiency, and off-policy-ness. We choose our specific constraints using proxy indicators for both stability and convergence considerations, as noted in the equation. Achieving maximal J is hindered by three structural challenges, which we explain in detail below.

1.1 Agent Extensibility and Framework Flexibility

Current RL paradigms impose a "Glass Ceiling" on agent complexity due to two structural flaws:

Restricted Agent Autonomy: Standard frameworks treat agents as white-box functions with a shared state between agent and trainer. This rigidity makes it difficult to model complex cognitive architectures (e.g., dynamic Context Management, Multi-Agent collaboration) and therefore prevents the model's capabilities from generalizing effectively to arbitrary black-box agents that lack these assumed structural constraints.

Token Consistency Barrier: Existing TITO (Token-In-Token-Out) architectures force Agents to be coupled deeply with the underlying token logic. Maintaining strict consistency between the Inference Abstraction (high-level logic) and the Training Representation (token-level data) under complex Context Management (CM) is computationally prohibitive.

1.2 System Efficiency and Compute Redundancy

Agent rollout completion times exhibit extreme variance, ranging from seconds (simple API calls) to hours (complex reasoning chains). This creates a scheduling deadlock:

Asynchronous Controller: Systems face a critical trade-off between hardware efficiency and training stability: while Strict FIFO/Sync scheduling suffers from the Straggler Effect, where a single high-latency task causes Head-of-Line (HoL) Blocking and idles the cluster, Greedy/FFFO modes maximize throughput at the cost of a severe Data Distribution Shift. This shift creates a non-stationary training environment—initially dominated by short, "easy" tasks and later by clustered "hard" tasks—leading to optimization instability and gradient oscillation.

Prefix Redundancy: In Agent scenarios, the interplay between tokenizer mechanics and inherent Context Management naturally results in a substantial volume of requests sharing identical prefixes. This redundancy causes significant computational waste during training, thereby introducing distinct engineering challenges.

1.3 Algorithmic Challenges: Credit Assignment and Optimization Stability

Sparse Rewards and High Gradient Variance: Agentic tasks typically involve extended horizons with delayed feedback, where a single outcome depends on a sequence of thousands of actions. Assigning credit to specific tokens or tool invocations within a 200k context window is mathematically precarious. This sparsity leads to a low signal-to-noise ratio in the return calculation, causing high gradient variance that destabilizes the training of large-scale models.

Latency-Agnostic Optimization: Traditional RL objectives focus solely on correctness (step-wise or outcome rewards) while ignoring the wall-clock execution cost. In real-world agentic scenarios, multiple valid trajectories exist, but they differ significantly in latency due to tool execution overhead and serial processing. Standard paradigms fail to incentivize parallelism or efficient tool usage, resulting in functionally correct but practically sluggish agents.

  2. System Architecture and Agent RL Paradigm

To alleviate the "Efficiency vs. Off-Policyness" trade-off and minimize redundancy, we introduce the following architectural innovations.

2.1 RL System Design

To achieve a truly scalable architecture, we move beyond specific implementations to a generalized "Middleware" design. This decouples the Agent's reasoning logic from the underlying training infrastructure.

Our RL system is made up of 3 modules:

Agent Side: This layer abstracts the General Agent—comprising both white-box and black-box architectures—and its operational environment. It orchestrates recursive environmental interactions, allowing the Agent to function as a pure trajectory producer. By decoupling environmental feedback from system overhead, the Agent can focus exclusively on core business logic (such as Context Management and Reasoning Chains), remaining agnostic to the underlying training and inference mechanics.

Middleware Abstraction Layer: Acting as the bridge, this layer physically isolates the Agent Side from the Training/Inference Side, including the Gateway server and the Data Pool.

Gateway Server: It serves as a standardized communication gateway that processes completion requests between the agent and the LLM. By utilizing common standard protocols, this server effectively isolates the complexities of the actual underlying model from the agent's high-level behavioral logic.

Data Pool: As a distributed data store, the Data Pool asynchronously collects rollout trajectories and reports from the agent. It serves as a buffer that decouples generation from training, allowing users to apply flexible data processing and batching strategies for training efficiency and algorithmic needs.

Training and Inference Side: This layer manages the heavy computational lifting, consisting of the LLM Engine and Train Engine.

Rollout Engine: Dedicated to high-throughput token generation, responding to requests forwarded by the Middleware.

Train Engine: Consumes processed token sequences from the Data Pool to update the policy. It maintains synchronization with the LLM Engine to ensure the agent explores using the latest policy distribution.
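The three-module split above can be sketched as a minimal interface. Everything here (the class names and the dict-shaped trajectory record) is an illustrative assumption rather than Forge's actual API; the sketch only shows how a Gateway can log trajectories into a Data Pool as a side effect while the agent sees nothing but a completion call:

```python
import queue

class DataPool:
    """Asynchronous buffer decoupling trajectory generation from training."""
    def __init__(self):
        self._trajectories = queue.Queue()

    def put(self, trajectory):
        self._trajectories.put(trajectory)

    def get_batch(self, n):
        """Drain up to n trajectories for the Train Engine."""
        out = []
        while len(out) < n and not self._trajectories.empty():
            out.append(self._trajectories.get())
        return out

class Gateway:
    """Standardized completion endpoint; agents never touch the engines directly."""
    def __init__(self, rollout_engine, data_pool):
        self.engine = rollout_engine
        self.pool = data_pool

    def complete(self, session_id, messages):
        response = self.engine(messages)       # forwarded to the Rollout Engine
        self.pool.put({"session": session_id,  # trajectory captured as a side effect,
                       "messages": list(messages),
                       "response": response})  # invisible to the agent
        return response

# A black-box agent only ever sees the `complete` call:
pool = DataPool()
gw = Gateway(rollout_engine=lambda msgs: "ok: " + msgs[-1], data_pool=pool)
print(gw.complete("s1", ["list files"]))  # → ok: list files
print(len(pool.get_batch(10)))            # → 1
```

Because the agent depends only on the completion protocol, the same loop works whether the backend is a training-time rollout engine or a production inference server.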

During offline evaluation, we observed significant performance discrepancies attributed to differences in the scaffolds. Leveraging the modular design of our RL framework, we can conduct training using an extensive array of scaffolds without requiring internal modifications to the Agent. This approach effectively enables the model to generalize across diverse scaffolds, a.k.a. environments. Our architecture achieves a complete decoupling of engines and agents, ensuring the seamless integration of various agents. In total, we have integrated hundreds of types of scaffolds and thousands of distinct tool invocation formats.

2.2 White-Box Agent RL for Context Management (CM)

For white-box agents, comprehensive scaffold design and augmentation allow us to directly observe and optimize the model's performance on specific agent architectures. In the development of MiniMax M2.5, we specifically addressed several critical issues that plagued previous models during long-horizon tasks requiring active context management (such as DeepSearch):

The Challenge of Context Rot: As the number of interaction turns increases, the accumulation of intermediate reasoning steps and redundant observations creates an "attention dilution" effect. This accumulated noise causes the model to lose focus on critical information, even when operating strictly within its absolute context window limits.

Inference-Training Mismatch: While context management can effectively extend the interaction horizon and boost agent performance in long-context scenarios, applying it exclusively during inference introduces a severe distribution shift from the RL training data. This discrepancy forces the model to abruptly adapt to unexpected context transitions and process unfamiliar long-context structures on the fly, ultimately degrading its overall performance.

To resolve this distribution shift and maintain reasoning fidelity, we integrate the CM mechanism directly into the RL interaction loop, effectively treating Context Management as a functional action that drives state transitions:

CM-Driven State Transitions: We model CM as an explicit agent action, with context transitions naturally embedded within the environment's dynamics. The state transition from S_t to S_{t+1} implicitly encapsulates the context-switching logic, effectively folding context adaptation directly into the model's training objective.

Adaptive Reasoning Patterns: By optimizing the policy π within this framework, the model learns to internalize the distribution shift. This prompts the emergence of robust reasoning patterns that inherently prioritize "state-critical" tokens.

Context-Aware Management Strategy: Under this paradigm, the model is trained to anticipate potential context management operations and shifts during the RL generation process. By actively retaining task-critical information while pruning irrelevant contextual noise, the model significantly enhances its performance when deployed within Context-Management Agent frameworks.
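As a toy illustration of "context management as an action" (the action schema and the trigger policy here are invented for the sketch, not M2.5's actual action space): when the policy emits a `compress_context` action, the edit happens inside the state transition itself, so the training data contains exactly the post-compression context the model will face at inference.

```python
def agent_step(policy, state):
    """One environment step where context management is itself an action."""
    action = policy(state["context"])
    if action["type"] == "compress_context":
        # The context edit is part of the transition S_t -> S_{t+1}, so the
        # post-compression context is exactly what the trainer observes.
        keep = action["keep_last"]
        state = {**state, "context": state["context"][:1] + state["context"][-keep:]}
    else:
        state = {**state, "context": state["context"] + [action["output"]]}
    return state, action

# Toy policy: compress whenever the context exceeds 4 entries.
policy = lambda ctx: ({"type": "compress_context", "keep_last": 2} if len(ctx) > 4
                      else {"type": "respond", "output": f"turn-{len(ctx)}"})

state = {"context": ["system"]}
for _ in range(6):
    state, _ = agent_step(policy, state)
print(state["context"])  # → ['system', 'turn-3', 'turn-4', 'turn-3']
```

Running inference with the same transition rule means the model never encounters a context shape it was not trained on.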

2.3 Black-box Agent RL: Robustness Across Heterogeneous Scaffolds

In practical deployment, a significant portion of our user base operates proprietary or complex agent architectures that function as "Black Boxes." We have observed that model performance often varies drastically depending on the underlying agent scaffold, as standard training paradigms fail to generalize across different cognitive architectures. To address this, we validated our framework through a dedicated Black-box Agent Experiment, ensuring consistent optimization regardless of the agent's internal opacity.

Non-Intrusive Integration and Compatibility: Forge remains completely agnostic to the agent's internal implementation details. Agents simply route their requests to the RL service Gateway, and the framework automatically handles data collection and training under the hood. Consequently, during actual RL training, Forge seamlessly supports arbitrary context manipulations (such as memory compression and history rewriting) alongside any complex internal Agent Loop (e.g., Deep Think, Multi-Agent architectures).

Multi-Scaffold Generalization: By decoupling the training loop from the agent's internal state, MiniMax M2.5 achieves broad compatibility with a vast array of black-box agents. This adaptability spans everything from code-centric agents heavily reliant on Sandbox and Model Context Protocol (MCP) environments—for instance, training our OpenCode Agent entirely as a black box—to agents employing aggressive context reduction strategies, such as Truncate BC. Empirical results demonstrate that this approach delivers consistent, stable improvements even across completely opaque black-box systems.

  3. Engineering Optimization

3.1 Hybrid Scheduling Strategy: Windowed FIFO

To resolve the conflict between System Throughput and Distributional Consistency, we introduce Windowed FIFO. This strategy imposes a sliding constraint on the Training Scheduler, acting as a "middle ground" between strict synchronous ordering and greedy asynchronous execution.

The core logic governs how the Training Scheduler fetches samples from the global generation queue. Even if a large batch of requests (e.g., Generation Batch Size N) is submitted, the scheduler is restricted to a visibility window of size W (e.g., W=0.3N).

Restricted Visibility Scope: Let the generation queue be denoted as Q = [T_0, T_1, ..., T_{N-1}], with the current head at index i. The Training Scheduler is strictly limited to fetching completed trajectories from the range [T_i, T_{i+W-1}].

Local "Greedy" Disorder (Within Window): Inside the active window [T_i, T_{i+W-1}], the scheduler can retrieve any completed trajectory immediately. This mitigates the Head-of-Line (HoL) blocking effect, as fast tasks within the window do not need to wait for the absolute first task to finish.

Global "Strict" Blocking (Window Boundary): Crucially, even if a task at index j>i+W (outside the window) is completed—common for simple, fast tasks in a large generation batch—the scheduler is forbidden from fetching it.

Constraint Implementation: The window slides forward (i→i+1) only as tasks at the head are consumed. This mechanism effectively forces the scheduler to wait for "stragglers" (complex, long-horizon tasks) within the current window, preventing the training distribution from drifting towards "fast and easy" samples found later in the queue.
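The fetch rule above can be sketched in a few lines of bookkeeping (the post does not publish scheduler code, so the data structures are assumptions for illustration):

```python
def windowed_fifo_fetch(done, head, window):
    """One scheduler step under Windowed FIFO.

    done:   set of completed trajectory indices in the generation queue
    head:   current queue-head index i
    window: window size W; only [i, i+W-1] is visible to the trainer
    """
    # Local "greedy" disorder: consume any finished trajectory inside the window.
    fetched = [j for j in range(head, head + window) if j in done]
    done.difference_update(fetched)
    # Global "strict" blocking: finished tasks at j >= head + W stay untouched;
    # the window slides forward only as the head itself is consumed.
    consumed = set(fetched)
    while head in consumed:
        head += 1
    return fetched, head

done = {0, 2, 7}   # trajectory 7 finished early but lies far past the window
fetched, head = windowed_fifo_fetch(done, head=0, window=3)
print(fetched, head, sorted(done))  # → [0, 2] 1 [7]
```

Note how trajectory 7 remains unfetched even though it is complete: the scheduler must wait for the stragglers at indices 1, 3, ... before the window can slide that far, which is exactly what keeps the training distribution from drifting toward fast, easy samples.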

3.2 Accelerating Agent Trajectory Training with Prefix Tree Merging

In the training of agents, datasets typically consist of extensive multi-turn dialogue samples. Structurally, these samples exhibit a high degree of overlap.

The Challenge: Redundancy in Traditional Methods

Prefix Overlap: In naive multi-turn dialogues, messages are sequentially appended. Given a consistent tokenizer, multiple completions sharing the same history could theoretically be merged.

Complex Context Management: Agents often employ sophisticated context management strategies, such as discarding irrelevant intermediate results or performing self-summarization. Consequently, distinct completions frequently share extensive common prefixes.

Limitations of the Naive Approach: Traditional training methods treat each sample as an independent entity, repeatedly recalculating these common prefixes. In long-context scenarios, this computational redundancy results in a massive waste of TFLOPS and severely constrains training throughput.

Prefix Tree Merging

To eliminate this redundancy, we propose a Prefix Tree Merging scheme, transforming the training process from "linear processing" to a "tree-structured" approach.

Prefix Tree Merge: Addressing the complex context management in Agent scenarios (as illustrated by the "long common context"), multiple completions can be merged into a single prefix tree at the sample level—even if subsequent responses differ slightly or belong to different sampling branches—provided they share an underlying prefix.

By utilizing attention primitives (such as Magi Attention), we ensure that the logical execution remains consistent with a standard forward pass. Following the forward pass, the prefix tree is deconstructed based on metadata to compute the loss normally, ensuring zero impact on downstream logic.

By eliminating redundant prefix prefilling, this solution achieves a 40x training speedup and significantly reduces memory overhead to support longer sequences or larger batch sizes, all while guaranteeing strict mathematical equivalence to standard methods with zero impact on loss computation or metrics.
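The arithmetic behind the saving can be seen with a toy trie. This illustrates only the counting argument, not Forge's implementation; realizing the saving in an actual forward pass additionally requires attention-kernel support such as the Magi Attention primitives mentioned above.

```python
def prefix_tree_stats(sequences):
    """Insert token sequences into a trie and compare total vs. unique tokens.

    `total` is what naive per-sample training would prefill; `unique` is what
    a prefix-merged forward pass has to compute.
    """
    root, unique, total = {}, 0, 0
    for seq in sequences:
        node = root
        for tok in seq:
            total += 1
            if tok not in node:
                node[tok] = {}
                unique += 1      # this token position is computed only once
            node = node[tok]
    return total, unique

# Three completions sharing a 1000-token common context, diverging at the end:
common = list(range(1000))
seqs = [common + [1001, 1002], common + [2001], common + [3001, 3002, 3003]]
total, unique = prefix_tree_stats(seqs)
print(total, unique, round(total / unique, 2))  # → 3006 1006 2.99
```

With more branches per shared context, or with group-level rollouts that all start from the same prompt, the ratio grows accordingly, which is why the achievable speedup depends so strongly on the baseline.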

3.3 Extreme Inference Acceleration

We optimize the generation pipeline through three architectural innovations:

MTP-based Speculative Decoding: Instead of static draft models, we use Multi-Token Prediction (MTP) heads continuously fine-tuned via Top-K KL loss. This ensures alignment with the evolving RL policy, sustaining high acceptance rates and significant speedup by mitigating distribution shifts.

Heterogeneous PD Disaggregation: We decouple Prefill and Decode to eliminate PD interference in mixed MoE scheduling and allow for independent parallelism strategies for each instance, simultaneously maximizing global throughput and optimizing tail latency for long-horizon tasks.

Global L3 KV Cache Pool: To prevent redundant prefilling in multi-turn agent RL and maximize prefix cache hit rate with group-level rollout, we introduce a DFS-backed Global L3 Cache. A cost-aware scheduler dynamically routes requests by weighing queuing delay against cache migration costs, maximizing cache locality without overloading instances.
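A cost-aware router of the kind described can be sketched as a scoring rule over candidate instances. The field names and the linear cost model below are assumptions for illustration, not MiniMax's scheduler:

```python
def route_request(prefix_tokens, instances):
    """Pick the instance minimizing estimated queuing delay plus the prefill
    cost of tokens missing from its local KV cache (illustrative sketch)."""
    def cost(inst):
        cached = inst["cached_prefix_len"]
        missing = max(0, len(prefix_tokens) - cached)  # tokens to re-prefill
        return inst["queue_delay_ms"] + missing * inst["ms_per_prefill_token"]
    return min(instances, key=cost)

instances = [
    {"name": "a", "cached_prefix_len": 0,    "queue_delay_ms": 5,   "ms_per_prefill_token": 0.02},
    {"name": "b", "cached_prefix_len": 8000, "queue_delay_ms": 120, "ms_per_prefill_token": 0.02},
]
req = list(range(10_000))
print(route_request(req, instances)["name"])  # → b
```

Here the busier instance "b" still wins because its warm cache saves 8000 tokens of prefill, which outweighs its longer queue: the trade-off between queuing delay and cache locality that the cost-aware scheduler is balancing.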

  4. Scalable Agent RL Algorithm

4.1 RL Algorithm

We leverage CISPO as the core algorithm, specifically adapted for the characteristics of long-horizon Agents.

Unified Mixed-Domain Training: Unlike multi-stage reinforcement learning, which often leads to negative transfer or interference between domains, we adopt a unified training strategy. We mix tasks across Reasoning, General QA, and Agent domains simultaneously. This joint training approach mitigates the performance degradation typically seen in sequential training and significantly enhances the model's generalizability across diverse tasks.
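Operationally, the unified strategy amounts to sampling every batch from all three domains at once rather than in sequential stages. A toy batch sampler (the pool contents and mixing weights are made up for illustration):

```python
import random

def mixed_domain_batch(pools, weights, batch_size, rng=None):
    """Draw one training batch mixing all domains simultaneously."""
    rng = rng or random.Random(0)
    domains = list(pools)
    picks = rng.choices(domains, weights=weights, k=batch_size)  # domain per slot
    return [rng.choice(pools[d]) for d in picks]

pools = {"reasoning": ["r1", "r2"], "general_qa": ["q1", "q2"], "agent": ["a1", "a2"]}
batch = mixed_domain_batch(pools, weights=[1, 1, 2], batch_size=8)
print(len(batch), set(batch) <= {"r1", "r2", "q1", "q2", "a1", "a2"})  # → 8 True
```

Because every gradient step sees all domains, no later stage can overwrite what an earlier stage taught, which is the failure mode of sequential curricula.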

4.2 Dense and Efficiency-Aware Reward

We propose a composite reward framework designed to tackle the credit assignment challenges of ultra-long contexts (up to 200k) while ensuring training stability:

Process Reward: To provide dense feedback, we target intermediate behaviors (e.g., penalizing language mixing or specific tool invocation errors) rather than relying solely on the final outcome.

Task Completion Time Reward: In agentic scenarios, multiple trajectories exist for task completion. The total duration depends not only on token generation but also on the latency associated with specific tool execution and sub-agent invocations. Since completion time is critical to the actual user experience, we incorporate relative completion time as a reward signal. This incentivizes the agent to leverage parallelism, thereby accelerating task execution.

Reward-to-go for Variance Reduction: Standard sparse rewards often lead to high gradient variance in long-horizon tasks. We employ the Reward-to-go formulation to normalize returns. This effectively reduces gradient variance and improves the precision of credit assignment, stabilizing the optimization process.
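Reward-to-go in its standard form is shown below; the exact normalization M2.5 applies on top of it is not specified in the post, and gamma = 1 is used here for simplicity.

```python
def reward_to_go(rewards, gamma=1.0):
    """Per-step return R_t = sum over k >= t of gamma^(k-t) * r_k."""
    rtg, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the final step back
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Sparse outcome reward at the final step plus a small process reward mid-trajectory:
print(reward_to_go([0.0, 0.2, 0.0, 1.0]))  # → [1.2, 1.2, 1.0, 1.0]
```

Each action is credited only with rewards that follow it, so early actions are not penalized or rewarded for noise they could not have caused, which is where the variance reduction comes from.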

  5. Conclusions

We have successfully addressed the "impossible triangle" of scaling RL for agents. Through Forge, we achieved a breakthrough in RL system throughput while ensuring robust generalization across arbitrary Agent scaffolds. By integrating this flexible architecture with our stable CISPO algorithm, we enabled the massive-scale training behind MiniMax M2.5. This holistic approach overcomes previous constraints, delivering efficient, real-world agent capabilities and advancing our mission of "Intelligence with Everyone."

Link: http://x.com/i/article/2022169816556331008

将强化学习(RL)扩展到复杂的真实世界智能体时,会遭遇一个根本性的“三难困境”:系统吞吐、训练稳定性与智能体灵活性的平衡。这些彼此冲突的约束,长期以来阻碍了大规模 RL 在工业级系统中的落地。

在本文中,我们将揭示:我们如何在内部 RL 框架 Forge 中,以一种整体性的方式解决这“一道不可能三角”——融合灵活的系统架构、算法设计、优化的异步调度,以及极致的训推效率。借助标准化的交互协议,Forge 支持训练任意智能体脚手架(agent scaffold),从而支撑起大规模 RL,最终沉淀为 MiniMax M2.5 模型的突破性能力。

在构建 MiniMax M2.5 的全过程中,我们的 RL 系统穿越了十万级不同的真实世界智能体脚手架与环境。在最长可达 200k 的上下文长度下,系统仍能保持百万级样本/天的处理吞吐,实现奖励的稳定收敛,并带来底层模型能力的真实提升。结合我们的 CISPO 算法与复合奖励框架,M2.5 将高效、可靠的真实世界生产力边界向前推进,践行我们的使命 “Intelligence with Everyone”(让智能惠及每个人)。

  1. 问题表述

在深入介绍架构设计之前,我们先将智能体 RL 系统的优化目标表述为:最大化有效智能体(A)训练产出(J),其定义为:

\begin{aligned} \max_{\theta} J(\mathcal{\theta}) = & \text{Throughput}(\mathcal{A}) \times \text{Sample Efficiency}(\mathcal{A}) \ \text{s.t.} \quad & \forall \mathcal{A} \in \Omega_{\text{agent}} \quad (\text{Arbitrary Agent}) \ & \mathbb{E}[\text{Update Variance}] < \delta \quad (\text{Stability}) \ & \mathbb{E}[|J^{(T)} - J^*|] < \epsilon \quad (\text{Convergence}) \end{aligned}

max⁡θJ(θ)=Throughput(A)×Sample Efficiency(A)s.t.∀A∈Ωagent(Arbitrary Agent)E[Update Variance]<δ(Stability)E[∥J(T)−J∗∥]<ϵ(Convergence)\begin{aligned} \max_{\theta} J(\mathcal{\theta}) = & \text{Throughput}(\mathcal{A}) \times \text{Sample Efficiency}(\mathcal{A}) \ \text{s.t.} \quad & \forall \mathcal{A} \in \Omega_{\text{agent}} \quad (\text{Arbitrary Agent}) \ & \mathbb{E}[\text{Update Variance}] < \delta \quad (\text{Stability}) \ & \mathbb{E}[|J^{(T)} - J^*|] < \epsilon \quad (\text{Convergence}) \end{aligned}

其中,System Throughput 指系统每秒处理的 token 原始数量,其瓶颈来自整个 RL 系统的 4 个组成部分:rollout、训练、数据处理与 I/O。Sample Efficiency 指每个样本带来的平均性能提升,受数据分布、数据质量、算法效率以及离策略程度(off-policy-ness)影响。我们用代理指标分别刻画稳定性与收敛性约束,如上式所示。要实现最大的 J,会遭遇三类结构性挑战,下面逐一展开。

1.1 智能体可扩展性与框架灵活性

现有 RL 范式由于两处结构性缺陷,对智能体复杂度形成了“玻璃天花板”:

受限的智能体自主性:标准框架通常将智能体视为白盒函数,并让智能体与训练器共享状态。这种僵化使得复杂认知架构(例如动态上下文管理、多智能体协作)难以建模,从而也阻碍了模型能力在任意黑盒智能体上有效泛化——因为这些框架预设了特定的结构约束。

Token 一致性屏障:现有的 TITO(Token-In-Token-Out)架构迫使智能体与底层 token 逻辑深度耦合。在复杂的上下文管理(CM)下,要在推理抽象(高层逻辑)与训练表征(token 级数据)之间维持严格一致,在计算上代价高到难以承受。

1.2 系统效率与计算冗余

智能体 rollout 的完成时间呈现极端方差:从几秒(简单 API 调用)到数小时(复杂推理链)不等。这会造成调度层面的僵局:

异步控制器:系统面临硬件效率与训练稳定性的关键权衡:严格 FIFO/同步调度会遭遇拖尾效应(Straggler Effect)——单个高延迟任务引发队首阻塞(Head-of-Line, HoL Blocking),导致集群空转;而贪心/FFFO 模式虽能最大化吞吐,却会带来严重的数据分布偏移(Data Distribution Shift)。这种偏移会制造非平稳训练环境——早期由短而“简单”的任务主导,后期则被聚集的“困难”任务主导——从而引发优化不稳定与梯度振荡。

前缀冗余:在智能体场景中,tokenizer 机制与固有的上下文管理天然会产生大量共享相同前缀的请求。这种冗余会在训练中造成显著的计算浪费,并引出独特的工程挑战。

1.3 算法挑战:信用分配与优化稳定性

稀疏奖励与高梯度方差:智能体任务通常具有很长的时域与延迟反馈,一个结果往往依赖于成千上万步动作序列。在 200k 的上下文窗口中,将信用分配到特定 token 或某次工具调用,在数学上非常脆弱。奖励稀疏会使回报计算的信噪比偏低,进而导致高梯度方差,破坏大规模模型训练的稳定性。

对延迟不敏感的优化:传统 RL 目标通常只关注正确性(逐步奖励或结果奖励),而忽略墙钟时间(wall-clock)的执行成本。在真实世界的智能体场景中,完成同一任务往往存在多条有效轨迹,但由于工具执行开销与串行处理等原因,它们在延迟上可能差异巨大。标准范式无法有效激励并行性或高效的工具使用,最终得到“功能正确但实践迟缓”的智能体。

2. 系统架构与智能体 RL 范式

为缓解“效率 vs. 离策略程度(Off-Policyness)”之间的权衡并最小化冗余,我们引入以下架构创新。

2.1 RL 系统设计

为了实现真正可扩展的架构,我们不再局限于某个具体实现,而是采用一种更通用的“中间件(Middleware)”设计,从而将智能体的推理逻辑与底层训练基础设施解耦。

我们的 RL 系统由 3 个模块组成:

Agent Side:这一层抽象了通用智能体(包含白盒与黑盒架构)及其运行环境。它编排递归式的环境交互,使智能体成为纯粹的轨迹生产者。通过将环境反馈与系统开销解耦,智能体可以专注于核心业务逻辑(如上下文管理与推理链),对底层训练与推理机制保持无感。

Middleware Abstraction Layer:作为桥梁,这一层在物理上隔离 Agent Side 与 Training/Inference Side,其中包括 Gateway server 与 Data Pool。

Gateway Server:作为标准化通信网关,处理智能体与 LLM 之间的 completion 请求。通过使用通用的标准协议,该服务器有效隔离了实际底层模型的复杂性与智能体的高层行为逻辑。

Data Pool:作为分布式数据存储,Data Pool 以异步方式收集来自智能体的 rollout 轨迹与报告。它充当解耦生成与训练的缓冲区,使用户能够灵活应用数据处理与 batching 策略,以提升训练效率并满足算法需求。

Training and Inference Side:这一层负责重计算,包含 LLM Engine 与 Train Engine。

Rollout Engine:面向高吞吐 token 生成,响应由 Middleware 转发的请求。

Train Engine:从 Data Pool 消费处理后的 token 序列以更新策略;同时与 LLM Engine 保持同步,确保智能体用最新策略分布进行探索。

在离线评测中,我们观察到由脚手架差异导致的显著性能差距。借助 RL 框架的模块化设计,我们可以在不修改智能体内部的前提下,使用大规模脚手架阵列进行训练,从而有效推动模型在多样脚手架(亦即环境)上实现泛化。我们的架构实现了引擎与智能体的完全解耦,保证各类智能体能够无缝接入。迄今为止,我们已集成数百种脚手架类型,以及数千种不同的工具调用格式。
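上述“Gateway + Data Pool”的解耦可以用一个极简 Python 草图示意(假设性实现,类名、字段与回显模型均为虚构,非 Forge 官方接口):智能体只依赖一个 completion 接口,轨迹采集完全发生在网关背后。

```python
from dataclasses import dataclass, field
from typing import List
import queue

# 示意性草图:Gateway 转发 completion 请求并记录 I/O,
# Data Pool 异步缓冲轨迹,供训练端按需消费。

@dataclass
class Turn:
    prompt: str
    response: str

@dataclass
class Trajectory:
    task_id: str
    turns: List[Turn] = field(default_factory=list)
    reward: float = 0.0

class DataPool:
    """解耦生成与训练的异步缓冲区。"""
    def __init__(self):
        self._open = {}
        self._ready = queue.Queue()

    def record(self, task_id: str, turn: Turn):
        self._open.setdefault(task_id, Trajectory(task_id)).turns.append(turn)

    def finish(self, task_id: str, reward: float):
        traj = self._open.pop(task_id)
        traj.reward = reward
        self._ready.put(traj)        # 训练调度器从这里取样

class Gateway:
    """标准化通信网关:智能体只看到一个 completion 接口。"""
    def __init__(self, llm, data_pool):
        self.llm = llm               # 任意底层模型(对脚手架而言是黑盒)
        self.data_pool = data_pool

    def completion(self, task_id: str, prompt: str) -> str:
        response = self.llm(prompt)
        # 无侵入采集:脚手架无需感知训练侧的存在
        self.data_pool.record(task_id, Turn(prompt, response))
        return response

# 用法示意:任意黑盒智能体循环只依赖 gateway.completion
pool = DataPool()
gw = Gateway(llm=lambda p: f"echo:{p}", data_pool=pool)
out = gw.completion("t1", "列出当前目录")
pool.finish("t1", reward=1.0)
```

由于轨迹以 (prompt, response) 对的形式在中间层落盘,回放与奖励计算都不需要侵入脚手架内部——这正是黑盒智能体也能接入训练的原因。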

2.2 面向上下文管理(CM)的白盒智能体 RL

对于白盒智能体,完善的脚手架设计与增强使我们能够直接观测并优化模型在特定智能体架构上的表现。在 MiniMax M2.5 的研发中,我们重点解决了以往模型在需要主动上下文管理的长时域任务(例如 DeepSearch)中暴露的若干关键问题:

上下文腐烂(Context Rot)挑战:随着交互轮次增加,中间推理步骤与冗余观测不断累积,会产生“注意力稀释”效应。这些噪声积累会让模型偏离关键信息,即使其仍严格处于绝对上下文窗口限制之内。

推理-训练不匹配:上下文管理在推理阶段确实能有效延长交互时域并提升长上下文场景下的智能体表现,但如果只在推理中使用,会与 RL 训练数据产生严重分布偏移。这种差异迫使模型在运行时突然适应意外的上下文跃迁、并即时处理陌生的长上下文结构,最终反而削弱整体性能。

为解决分布偏移并维持推理保真度,我们将 CM 机制直接纳入 RL 交互闭环,等价地把上下文管理视作驱动状态转移的功能性动作:

CM 驱动的状态转移:我们将 CM 建模为显式的智能体动作,使上下文转移自然地嵌入环境动力学中。从 S_t 到 S_{t+1} 的状态转移隐式包含了上下文切换逻辑,从而把上下文适配直接折叠进模型的训练目标。

自适应推理模式:在该框架下优化策略 π,模型学会内化分布偏移,进而涌现出稳健的推理模式,天然优先关注“状态关键(state-critical)”的 token。

上下文感知的管理策略:在这一范式下,模型被训练为在 RL 生成过程中预判潜在的上下文管理操作与切换。通过主动保留任务关键信息并裁剪无关的上下文噪声,模型在部署于上下文管理型智能体框架时能够显著提升性能。
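“把 CM 当作显式动作”的交互闭环可以用如下玩具代码示意(假设性草图:动作类型、字段名与裁剪策略均为虚构,仅说明 CM 动作与普通动作共享同一条状态转移路径):

```python
# 示意性草图(非官方实现):把上下文裁剪建模为显式动作,
# 使 S_t -> S_{t+1} 的转移天然包含上下文切换,进而进入 RL 轨迹。

def toy_policy(state):
    # 玩具策略:上下文超过 3 条消息时发出 CM 动作,否则继续作答
    if len(state["context"]) > 3:
        keep = {m["id"] for m in state["context"] if m["critical"]}
        return {"type": "cm_prune", "keep_ids": keep}
    return {"type": "respond", "text": "..."}

def env_step(state, action):
    # CM 动作与普通动作走同一条状态转移路径,因此训练与推理分布一致
    if action["type"] == "cm_prune":
        kept = [m for m in state["context"] if m["id"] in action["keep_ids"]]
        return {**state, "context": kept}
    new_msg = {"id": len(state["context"]), "critical": False,
               "text": action["text"]}
    return {**state, "context": state["context"] + [new_msg]}

state = {"context": [
    {"id": 0, "critical": True,  "text": "任务目标"},
    {"id": 1, "critical": False, "text": "中间观测 A"},
    {"id": 2, "critical": False, "text": "中间观测 B"},
    {"id": 3, "critical": True,  "text": "关键线索"},
]}
action = toy_policy(state)
state = env_step(state, action)   # 裁剪后只保留 state-critical 消息
```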

2.3 黑盒智能体 RL:跨异构脚手架的鲁棒性

在实际部署中,我们相当一部分用户使用的是专有或复杂的智能体架构,它们以“黑盒”形式运行。我们观察到,模型表现往往会随底层智能体脚手架而剧烈波动,因为标准训练范式难以跨不同认知架构泛化。为此,我们通过一项专门的黑盒智能体实验验证了框架能力,确保无论智能体内部是否透明,都能获得一致的优化效果。

无侵入集成与兼容性:Forge 对智能体内部实现细节完全无感。智能体只需将请求路由到 RL 服务 Gateway,框架便会在后台自动完成数据采集与训练。因此,在实际 RL 训练中,Forge 可无缝支持任意上下文操控(如记忆压缩、历史重写),并兼容任何复杂的内部 Agent Loop(例如 Deep Think、多智能体架构)。

多脚手架泛化:通过将训练闭环与智能体内部状态解耦,MiniMax M2.5 能与海量黑盒智能体广泛兼容。这种适配覆盖从高度依赖 Sandbox 与 Model Context Protocol(MCP)环境的代码智能体(例如我们将 OpenCode Agent 完全按黑盒方式训练),到采用激进上下文压缩策略的智能体(如 Truncate BC)。实证结果显示,即便面对完全不透明的黑盒系统,该方法仍能带来一致且稳定的提升。

3. 工程优化

3.1 混合调度策略:窗口化 FIFO(Windowed FIFO)

为解决系统吞吐与分布一致性之间的冲突,我们引入 Windowed FIFO。该策略对训练调度器施加滑动约束,作为严格同步排序与贪心异步执行之间的“折中地带”。

其核心逻辑决定了训练调度器如何从全局生成队列中取样。即使一次提交了大量请求(例如生成批大小为 N),调度器也只能看到大小为 W 的可见窗口(例如 W=0.3N)。

受限可见范围:设生成队列为 Q = [T_0, T_1, ..., T_{N-1}],当前队首位于索引 i。训练调度器严格限制只能从 [T_i, T_{i+W-1}] 区间内抓取已完成的轨迹。

局部“贪心”无序(窗口内):在活动窗口 [T_i, T_{i+W-1}] 内,调度器可以立即取走任意已完成轨迹。这缓解了队首阻塞(HoL blocking),因为窗口内的快任务无需等待绝对队首任务完成。

全局“严格”阻塞(窗口边界):关键在于,即便队列中索引 j ≥ i+W(窗口外)的任务已经完成——在大生成批中这对简单快任务很常见——调度器也被禁止取走它。

约束实现:窗口只有在队首任务被消费后才会前移(i→i+1)。该机制有效迫使调度器在当前窗口内等待“拖尾者”(复杂、长时域任务),避免训练分布向队列后部的“快而易”样本漂移。
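按上述四条规则,Windowed FIFO 的取样逻辑可以还原为一个极简调度器草图(窗口推进细节为按文意的假设性实现,非 Forge 实际代码):

```python
# Windowed FIFO 最小示意:调度器只能从队首 i 起的 W 个任务中
# 取已完成轨迹;队首任务被消费后窗口才前移。

class WindowedFIFO:
    def __init__(self, n_tasks: int, window: int):
        self.n = n_tasks
        self.w = window
        self.head = 0            # 当前队首索引 i
        self.done = set()        # 已完成、尚未被消费的任务索引
        self.consumed = set()

    def complete(self, idx: int):
        self.done.add(idx)

    def fetch(self):
        """取走窗口 [i, i+W-1] 内任意已完成任务;窗口外的禁止取走。"""
        visible = range(self.head, min(self.head + self.w, self.n))
        for idx in visible:
            if idx in self.done:
                self.done.remove(idx)
                self.consumed.add(idx)
                # 队首被消费后窗口才前移,强制等待窗口内的拖尾任务
                while self.head in self.consumed:
                    self.head += 1
                return idx
        return None              # 窗口内暂无完成任务:等待 straggler

sched = WindowedFIFO(n_tasks=10, window=3)
sched.complete(5)               # 窗口外(索引 ≥ 3)的快任务已完成
assert sched.fetch() is None    # 但不允许取走,避免分布向“快而易”漂移
sched.complete(1)
assert sched.fetch() == 1       # 窗口内可乱序取走,缓解队首阻塞
```

可见 W 同时决定了“允许多少乱序”(吞吐)与“允许多少分布偏移”(稳定性),这正是文中所说需要经验调参的权衡点。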

3.2 通过前缀树合并加速智能体轨迹训练

在智能体训练中,数据集通常由大量多轮对话样本构成。从结构上看,这些样本高度重叠。

挑战:传统方法中的冗余

前缀重叠:在朴素的多轮对话中,消息按顺序不断追加。在 tokenizer 一致的前提下,多个共享相同历史的 completion 理论上可以合并。

复杂上下文管理:智能体常采用复杂的上下文管理策略,例如丢弃无关中间结果或进行自我摘要。因此,不同 completion 往往共享大量公共前缀。

朴素方案的局限:传统训练将每个样本视为独立实体,反复重算这些公共前缀。在长上下文场景中,这种计算冗余会造成巨量 TFLOPS 浪费,并严重限制训练吞吐。

前缀树合并(Prefix Tree Merging)

为消除冗余,我们提出前缀树合并方案,将训练过程从“线性处理”转为“树结构”处理。

前缀树合并:针对智能体场景下复杂上下文管理(如图示的“长公共上下文”),只要多个 completion 共享底层前缀,即便后续回复略有差异或属于不同采样分支,也可在样本层面合并成一棵前缀树。

借助注意力原语(如 Magi Attention),我们确保逻辑执行与标准前向计算保持一致。前向计算结束后,根据元数据对前缀树进行拆解,按常规方式计算 loss,从而对下游逻辑零影响。

通过消除冗余的前缀 prefilling,该方案实现了 40 倍训练加速,并显著降低内存开销,以支持更长序列或更大 batch size;同时保证与标准方法严格数学等价,对 loss 计算与指标零影响。
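抛开注意力内核细节(如 Magi Attention),前缀树合并节省的计算量可以用一个简化草图估算——把每条序列插入一棵 trie,每个树节点只需 prefill 一次(函数名与数据均为示意假设):

```python
# 示意性草图:把共享前缀的多条训练序列合并成一棵前缀树,
# 统计去重后真正需要前向计算的 token 数。

def merge_prefix_tree(sequences):
    """返回合并后需要 prefill 的 token 总数(每个树节点只算一次)。"""
    root = {}
    unique_tokens = 0
    for seq in sequences:
        node = root
        for tok in seq:
            if tok not in node:
                node[tok] = {}
                unique_tokens += 1
            node = node[tok]
    return unique_tokens

# 三条 rollout 共享长公共前缀(如同一 group 的不同采样分支)
common = list(range(1000))            # 长公共上下文
seqs = [common + [2001, 2002],
        common + [3001],
        common + [4001, 4002, 4003]]

naive = sum(len(s) for s in seqs)     # 逐条独立 prefill:3006 token
merged = merge_prefix_tree(seqs)      # 树上 prefill:1006 token
assert (naive, merged) == (3006, 1006)
```

加速比随公共前缀占序列总长的比例增大而提高,这也解释了为何该优化在 group-level 采样与长上下文场景中收益最大。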

3.3 极致推理加速

我们通过三项架构创新优化生成流水线:

基于 MTP 的推测解码(Speculative Decoding):不同于静态草稿模型,我们使用 Multi-Token Prediction(MTP)头,并通过 Top-K KL loss 持续微调。这确保其与不断演化的 RL 策略对齐,在缓解分布偏移的同时维持高接受率,从而实现显著加速。

异构 PD 解耦(Heterogeneous PD Disaggregation):我们将 Prefill 与 Decode 解耦,消除二者在混合 MoE 调度中的相互干扰,并允许对每个实例采用独立的并行策略,从而同时最大化全局吞吐并优化长时域任务的尾延迟(tail latency)。

全局 L3 KV 缓存池:为避免多轮智能体 RL 中的冗余 prefilling,并在 group-level rollout 中最大化前缀缓存命中率,我们引入一个由 DFS 支撑的全局 L3 Cache。成本感知调度器会动态路由请求,权衡排队延迟与缓存迁移成本,在不压垮实例的前提下最大化缓存局部性。
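其中“成本感知调度器”的权衡可以用一个极简路由函数示意(成本模型、字段与数值均为假设,非实际实现):

```python
# 示意性草图:在排队延迟与缓存迁移成本之间权衡,
# 将请求路由到总成本最低的推理实例。

def route(request_prefix_len, instances):
    """返回估计成本最低的实例。"""
    def cost(inst):
        # 命中的前缀无需重新 prefill;未命中的部分按 token 计迁移成本
        hit = min(inst["cached_prefix"], request_prefix_len)
        migrate = (request_prefix_len - hit) * inst["load_cost_per_tok"]
        return inst["queue_delay"] + migrate
    return min(instances, key=cost)

instances = [
    {"name": "A", "cached_prefix": 0,
     "queue_delay": 0.1, "load_cost_per_tok": 1e-4},
    {"name": "B", "cached_prefix": 90_000,
     "queue_delay": 0.5, "load_cost_per_tok": 1e-4},
]
# 100k 前缀的请求:实例 B 虽排队更久,但缓存命中省下的 prefill 更多
assert route(100_000, instances)["name"] == "B"
```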

4. 可扩展的智能体 RL 算法

4.1 RL 算法

我们以 CISPO 作为核心算法,并针对长时域智能体的特性做了专门适配。

统一的混合域训练:不同于多阶段强化学习(往往导致跨域负迁移或干扰),我们采用统一训练策略:同时混合 Reasoning、通用 QA 与 Agent 三个域的任务进行训练。这种联合训练缓解了顺序训练中常见的性能退化,并显著增强模型在多样任务上的泛化能力。

4.2 稠密且效率感知的奖励

我们提出一套复合奖励框架,旨在应对超长上下文(最高 200k)带来的信用分配难题,并保证训练稳定性:

过程奖励(Process Reward):为提供稠密反馈,我们针对中间行为设计奖励信号(例如惩罚语言混用或特定工具调用错误),而非仅依赖最终结果。

任务完成时间奖励(Task Completion Time Reward):在智能体场景中,完成任务往往存在多条轨迹。总耗时不仅取决于 token 生成,还取决于特定工具执行与子智能体调用所带来的延迟。由于完成时间直接影响真实用户体验,我们将相对完成时间纳入奖励信号,以激励智能体利用并行性,从而加速任务执行。

用于降方差的 Reward-to-go:标准稀疏奖励在长时域任务中常导致高梯度方差。我们采用 Reward-to-go 形式对回报进行归一化,从而有效降低梯度方差、提升信用分配精度,并稳定优化过程。
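Reward-to-go 本身是标准技巧,其与过程奖励、完成时间奖励的组合可以用如下草图示意(权重、折扣与数值均为示例,非 Forge 实际超参):

```python
# 复合奖励中 Reward-to-go 的教科书式实现:
# G_t = sum_{k>=t} gamma^{k-t} r_k,每一步的信用等于其后所有奖励之和。

def reward_to_go(rewards, gamma=1.0):
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# 稀疏结果奖励 + 稠密过程奖励 + 完成时间奖励的组合示意
process = [0.1, -0.2, 0.1, 0.0]   # 中间行为奖励(如惩罚语言混用)
outcome = [0.0, 0.0, 0.0, 1.0]    # 结果奖励只出现在最后一步
time_bonus = 0.2                   # 相对完成时间奖励(越快越高)

rewards = [p + o for p, o in zip(process, outcome)]
rewards[-1] += time_bonus
g = reward_to_go(rewards)
assert abs(g[0] - 1.2) < 1e-9
```

注意时间奖励只加在轨迹末端,经 reward-to-go 回传后仍会影响每一步的信用——这既是它能激励并行化的原因,也是正确性与速度权重需要谨慎平衡的原因。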

5. 结论

我们成功解决了面向智能体扩展 RL 的“不可能三角”。通过 Forge,我们在实现 RL 系统吞吐突破的同时,确保了对任意智能体脚手架的稳健泛化。将这一灵活架构与稳定的 CISPO 算法结合,我们支撑了 MiniMax M2.5 背后的大规模训练。这种整体性方法突破了以往限制,交付了高效的真实世界智能体能力,并推进我们的使命 “Intelligence with Everyone”(让智能惠及每个人)。

链接:http://x.com/i/article/2022169816556331008

相关笔记

Scaling RL for complex, real-world agents confronts a fundamental trilemma: balancing system throughput, training stability, and agent flexibility. These conflicting constraints have long impeded the application of large-scale RL in industrial-grade systems.

In this post, we reveal how we resolved this "impossible triangle" through a holistic approach in our internal RL framework Forge, combining flexible system architecture, algorithmic design, optimized asynchronous scheduling, and extreme training-inference efficiency. By leveraging standardized interaction protocols, Forge supports the training of arbitrary agent scaffolds, enabling the massive-scale RL that culminates in the breakthrough capabilities of the MiniMax M2.5 model.

Throughout the construction of MiniMax M2.5, our RL system navigated over a hundred thousand distinct real-world agent scaffolds and environments. Operating with context lengths of up to 200k, the system maintained a daily processing throughput on the scale of millions of samples, realizing consistent reward convergence and genuine capability improvements in the underlying model. Integrated with our CISPO algorithm and composite reward framework, M2.5 pushes the frontier for efficient and reliable real-world productivity, achieving our mission "Intelligence with Everyone".

1. Problem Formulation

Before delving into our architectural design, we first formulate the optimization objective of our Agent RL system to be the maximization of the Effective Agent(A) Training Yield (J), defined as:

\begin{aligned}
\max_{\theta} J(\theta) = {} & \text{Throughput}(\mathcal{A}) \times \text{Sample Efficiency}(\mathcal{A}) \\
\text{s.t.} \quad & \forall \mathcal{A} \in \Omega_{\text{agent}} \quad (\text{Arbitrary Agent}) \\
& \mathbb{E}[\text{Update Variance}] < \delta \quad (\text{Stability}) \\
& \mathbb{E}\big[\,|J^{(T)} - J^*|\,\big] < \epsilon \quad (\text{Convergence})
\end{aligned}

where System Throughput refers to the raw number of tokens processed per second, bottlenecked by 4 components of the whole RL system: rollout, training, data processing and I/O. Sample Efficiency refers to the average performance improvement for each sample determined by data distribution, data quality, algorithm efficiency, and off-policy-ness. We choose our specific constraints using proxy indicators for both stability and convergence considerations, as noted in the equation. Achieving maximal J is hindered by three structural challenges, which we explain in detail below.

1.1 Agent Extensibility and Framework Flexibility

Current RL paradigms impose a "Glass Ceiling" on agent complexity due to two structural flaws:

Restricted Agent Autonomy: Standard frameworks treat agents as white-box functions with a shared state between agent and trainer. This rigidity makes it difficult to model complex cognitive architectures (e.g., dynamic Context Management, Multi-Agent collaboration) and therefore prevents model capabilities from generalizing effectively to arbitrary black-box agents that lack these assumed structural constraints.

Token Consistency Barrier: Existing TITO (Token-In-Token-Out) architectures force Agents to be coupled deeply with the underlying token logic. Maintaining strict consistency between the Inference Abstraction (high-level logic) and the Training Representation (token-level data) under complex Context Management (CM) is computationally prohibitive.

1.2 System Efficiency and Compute Redundancy

Agent rollout completion times exhibit extreme variance, ranging from seconds (simple API calls) to hours (complex reasoning chains). This creates a scheduling deadlock:

Asynchronous Controller: Systems face a critical trade-off between hardware efficiency and training stability: while Strict FIFO/Sync scheduling suffers from the Straggler Effect, where a single high-latency task causes Head-of-Line (HoL) Blocking and idles the cluster, Greedy/FFFO modes maximize throughput at the cost of a severe Data Distribution Shift. This shift creates a non-stationary training environment—initially dominated by short, "easy" tasks and later by clustered "hard" tasks—leading to optimization instability and gradient oscillation.

Prefix Redundancy: In Agent scenarios, the interplay between tokenizer mechanics and inherent Context Management naturally results in a substantial volume of requests sharing identical prefixes. This redundancy causes significant computational waste during training, thereby introducing distinct engineering challenges.

1.3 Algorithmic Challenges: Credit Assignment and Optimization Stability

Sparse Rewards and High Gradient Variance: Agentic tasks typically involve extended horizons with delayed feedback, where a single outcome depends on a sequence of thousands of actions. Assigning credit to specific tokens or tool invocations within a 200k context window is mathematically precarious. This sparsity leads to a low signal-to-noise ratio in the return calculation, causing high gradient variance that destabilizes the training of large-scale models.

Latency-Agnostic Optimization: Traditional RL objectives focus solely on correctness (step-wise or outcome rewards) while ignoring the wall-clock execution cost. In real-world agentic scenarios, multiple valid trajectories exist, but they differ significantly in latency due to tool execution overhead and serial processing. Standard paradigms fail to incentivize parallelism or efficient tool usage, resulting in functionally correct but practically sluggish agents.

2. System Architecture and Agent RL Paradigm

To alleviate the "Efficiency vs. Off-Policyness" trade-off and minimize redundancy, we introduce the following architectural innovations.

2.1 RL System Design

To achieve a truly scalable architecture, we move beyond specific implementations to a generalized "Middleware" design. This decouples the Agent's reasoning logic from the underlying training infrastructure.

Our RL system is made up of 3 modules:

Agent Side: This layer abstracts the General Agent—comprising both white-box and black-box architectures—and its operational environment. It orchestrates recursive environmental interactions, allowing the Agent to function as a pure trajectory producer. By decoupling environmental feedback from system overhead, the Agent can focus exclusively on core business logic (such as Context Management and Reasoning Chains), remaining agnostic to the underlying training and inference mechanics.

Middleware Abstraction Layer: Acting as the bridge, this layer physically isolates the Agent Side from the Training/Inference Side, including the Gateway server and the Data Pool.

Gateway Server: It serves as a standardized communication gateway that processes completion requests between the agent and the LLM. By utilizing common standard protocols, this server effectively isolates the complexities of the actual underlying model from the agent's high-level behavioral logic.

Data Pool: As a distributed data storage, Data Pool asynchronously collects rollout trajectories and reports from the agent. It serves as a buffer that decouples generation and training, allowing users to apply flexible data processing and batching strategy for training efficiency and algorithmic usage.

Training and Inference Side: This layer manages the heavy computational lifting, consisting of the LLM Engine and Train Engine.

Rollout Engine: Dedicated to high-throughput token generation, responding to requests forwarded by the Middleware.

Train Engine: Consumes processed token sequences from the Data Pool to update the policy. It maintains synchronization with the LLM Engine to ensure the agent explores using the latest policy distribution.

During offline evaluation, we observed significant performance discrepancies attributed to differences in the scaffolds. Leveraging the modular design of our RL framework, we can conduct training using an extensive array of scaffolds without requiring internal modifications to the Agent. This approach effectively enables the model to generalize across diverse scaffolds, a.k.a. environments. Our architecture achieves a complete decoupling of engines and agents, ensuring the seamless integration of various agents. In total, we have integrated hundreds of types of scaffolds and thousands of distinct tool invocation formats.

2.2 White-Box Agent RL for Context Management (CM)

For white-box agents, comprehensive scaffold design and augmentation allow us to directly observe and optimize the model's performance on specific agent architectures. In the development of MiniMax M2.5, we specifically addressed several critical issues that plagued previous models during long-horizon tasks requiring active context management (such as DeepSearch):

The Challenge of Context Rot: As the number of interaction turns increases, the accumulation of intermediate reasoning steps and redundant observations creates an "attention dilution" effect. This accumulated noise causes the model to lose focus on critical information, even when operating strictly within its absolute context window limits.

Inference-Training Mismatch: While context management can effectively extend the interaction horizon and boost agent performance in long-context scenarios, applying it exclusively during inference introduces a severe distribution shift from the RL training data. This discrepancy forces the model to abruptly adapt to unexpected context transitions and process unfamiliar long-context structures on the fly, ultimately degrading its overall performance.

To resolve this distribution shift and maintain reasoning fidelity, we integrate the CM mechanism directly into the RL interaction loop, effectively treating Context Management as a functional action that drives state transitions:

CM-Driven State Transitions: We model CM as an explicit agent action, with context transitions naturally embedded within the environment's dynamics. The state transition from S_t to S_{t+1} implicitly encapsulates the context-switching logic, effectively folding context adaptation directly into the model's training objective.

Adaptive Reasoning Patterns: By optimizing the policy π within this framework, the model learns to internalize the distribution shift. This prompts the emergence of robust reasoning patterns that inherently prioritize "state-critical" tokens.

Context-Aware Management Strategy: Under this paradigm, the model is trained to anticipate potential context management operations and shifts during the RL generation process. By actively retaining task-critical information while pruning irrelevant contextual noise, the model significantly enhances its performance when deployed within Context-Management Agent frameworks.

2.3 Black-box Agent RL: Robustness Across Heterogeneous Scaffolds

In practical deployment, a significant portion of our user base operates proprietary or complex agent architectures that function as "Black Boxes." We have observed that model performance often varies drastically depending on the underlying agent scaffold, as standard training paradigms fail to generalize across different cognitive architectures. To address this, we validated our framework through a dedicated Black-box Agent Experiment, ensuring consistent optimization regardless of the agent's internal opacity.

Non-Intrusive Integration and Compatibility: Forge remains completely agnostic to the agent's internal implementation details. Agents simply route their requests to the RL service Gateway, and the framework automatically handles data collection and training under the hood. Consequently, during actual RL training, Forge seamlessly supports arbitrary context manipulations (such as memory compression and history rewriting) alongside any complex internal Agent Loop (e.g., Deep Think, Multi-Agent architectures).

Multi-Scaffold Generalization: By decoupling the training loop from the agent's internal state, MiniMax M2.5 achieves broad compatibility with a vast array of black-box agents. This adaptability spans everything from code-centric agents heavily reliant on Sandbox and Model Context Protocol (MCP) environments—for instance, training our OpenCode Agent entirely as a black box—to agents employing aggressive context reduction strategies, such as Truncate BC. Empirical results demonstrate that this approach delivers consistent, stable improvements even across completely opaque black-box systems.

3. Engineering Optimization

3.1 Hybrid Scheduling Strategy: Windowed FIFO

To resolve the conflict between System Throughput and Distributional Consistency, we introduce Windowed FIFO. This strategy imposes a sliding constraint on the Training Scheduler, acting as a "middle ground" between strict synchronous ordering and greedy asynchronous execution.

The core logic governs how the Training Scheduler fetches samples from the global generation queue. Even if a large batch of requests (e.g., Generation Batch Size N) is submitted, the scheduler is restricted to a visibility window of size W (e.g., W=0.3N).

Restricted Visibility Scope: Let the generation queue be denoted as Q = [T_0, T_1, ..., T_{N-1}], with the current head at index i. The Training Scheduler is strictly limited to fetching completed trajectories from the range [T_i, T_{i+W-1}].

Local "Greedy" Disorder (Within Window): Inside the active window [T_i, T_{i+W-1}], the scheduler can retrieve any completed trajectory immediately. This mitigates the Head-of-Line (HoL) blocking effect, as fast tasks within the window do not need to wait for the absolute first task to finish.

Global "Strict" Blocking (Window Boundary): Crucially, even if a task at index j ≥ i+W (outside the window) is completed—common for simple, fast tasks in a large generation batch—the scheduler is forbidden from fetching it.

Constraint Implementation: The window slides forward (i→i+1) only as tasks at the head are consumed. This mechanism effectively forces the scheduler to wait for "stragglers" (complex, long-horizon tasks) within the current window, preventing the training distribution from drifting towards "fast and easy" samples found later in the queue.

3.2 Accelerating Agent Trajectory Training with Prefix Tree Merging

In the training of agents, datasets typically consist of extensive multi-turn dialogue samples. Structurally, these samples exhibit a high degree of overlap.

The Challenge: Redundancy in Traditional Methods

Prefix Overlap: In naive multi-turn dialogues, messages are sequentially appended. Given a consistent tokenizer, multiple completions sharing the same history could theoretically be merged.

Complex Context Management: Agents often employ sophisticated context management strategies, such as discarding irrelevant intermediate results or performing self-summarization. Consequently, distinct completions frequently share extensive common prefixes.

Limitations of the Naive Approach: Traditional training methods treat each sample as an independent entity, repeatedly recalculating these common prefixes. In long-context scenarios, this computational redundancy results in a massive waste of TFLOPS and severely constrains training throughput.

Prefix Tree Merging

To eliminate this redundancy, we propose a Prefix Tree Merging scheme, transforming the training process from "linear processing" to a "tree-structured" approach.

Prefix Tree Merge: Addressing the complex context management in Agent scenarios (as illustrated by the "long common context"), multiple completions can be merged into a single prefix tree at the sample level—even if subsequent responses differ slightly or belong to different sampling branches—provided they share an underlying prefix.

By utilizing attention primitives (such as Magi Attention), we ensure that the logical execution remains consistent with a standard forward pass. Following the forward pass, the prefix tree is deconstructed based on metadata to compute the loss normally, ensuring zero impact on downstream logic.

By eliminating redundant prefix prefilling, this solution achieves a 40x training speedup and significantly reduces memory overhead to support longer sequences or larger batch sizes, all while guaranteeing strict mathematical equivalence to standard methods with zero impact on loss computation or metrics.

3.3 Extreme Inference Acceleration

We optimize the generation pipeline through three architectural innovations:

MTP-based Speculative Decoding: Instead of static draft models, we use Multi-Token Prediction (MTP) heads continuously fine-tuned via Top-K KL loss. This ensures alignment with the evolving RL policy, sustaining high acceptance rates and significant speedup by mitigating distribution shifts.

Heterogeneous PD Disaggregation: We decouple Prefill and Decode to eliminate PD interference in mixed MoE scheduling and allow for independent parallelism strategies for each instance, simultaneously maximizing global throughput and optimizing tail latency for long-horizon tasks.

Global L3 KV Cache Pool: To prevent redundant prefilling in multi-turn agent RL and maximize prefix cache hit rate with group-level rollout, we introduce a DFS-backed Global L3 Cache. A cost-aware scheduler dynamically routes requests by weighing queuing delay against cache migration costs, maximizing cache locality without overloading instances.

4. Scalable Agent RL Algorithm

4.1 RL Algorithm

We leverage CISPO as the core algorithm, specifically adapted for the characteristics of long-horizon Agents.

Unified Mixed-Domain Training: Unlike multi-stage reinforcement learning, which often leads to negative transfer or interference between domains, we adopt a unified training strategy. We mix tasks across Reasoning, General QA, and Agent domains simultaneously. This joint training approach mitigates the performance degradation typically seen in sequential training and significantly enhances the model's generalizability across diverse tasks.

4.2 Dense and Efficiency-Aware Reward

We propose a composite reward framework designed to tackle the credit assignment challenges of ultra-long contexts (up to 200k) while ensuring training stability:

Process Reward: To provide dense feedback, we target intermediate behaviors (e.g., penalizing language mixing or specific tool invocation errors) rather than relying solely on the final outcome.

Task Completion Time Reward: In agentic scenarios, multiple trajectories exist for task completion. The total duration depends not only on token generation but also on the latency associated with specific tool execution and sub-agent invocations. Since completion time is critical to the actual user experience, we incorporate relative completion time as a reward signal. This incentivizes the agent to leverage parallelism, thereby accelerating task execution.

Reward-to-go for Variance Reduction: Standard sparse rewards often lead to high gradient variance in long-horizon tasks. We employ the Reward-to-go formulation to normalize returns. This effectively reduces gradient variance and improves the precision of credit assignment, stabilizing the optimization process.

5. Conclusions

We have successfully addressed the "impossible triangle" of scaling RL for agents. Through Forge, we achieved a breakthrough in RL system throughput while ensuring robust generalization across arbitrary Agent scaffolds. By integrating this flexible architecture with our stable CISPO algorithm, we enabled the massive-scale training behind MiniMax M2.5. This holistic approach overcomes previous constraints, delivering efficient, real-world agent capabilities and advancing our mission of "Intelligence with Everyone."

Link: http://x.com/i/article/2022169816556331008
