🧠 阿头学 · 💬 Discussion Topics

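Cross-Datacenter KVCache: Prefill-as-a-Service Is Redrawing the Boundaries of LLM Deployment

The article's central claim: once hybrid attention shrinks the KVCache to a "transportable" size, LLM prefill can plausibly be outsourced across datacenters. What actually makes this work, however, is not the model alone but system-level scheduling that offloads only long, uncached requests.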
2026-04-19

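Key Takeaways

The deployment boundary is a KV-transfer wall, not a compute wall. The authors get this right: PD disaggregation has been mature for some time, yet it has stayed locked inside a single datacenter's RDMA domain. The root cause is not that prefill and decode cannot be split. It is that the KVCache produced by dense attention is too large, so cross-cluster transfer eats up the gains. This framing holds up, and it is the most valuable part of the paper.

Hybrid attention is not a minor optimization; it rewrites deployment economics. The data in the paper show that hybrid models cut KV throughput by roughly an order of magnitude relative to dense models: at 32K tokens, MiMo-V2-Flash is about 13× lower than MiniMax-M2.5 (4.66 Gbps vs. 59.93 Gbps). "Commodity Ethernet may be enough" is therefore no longer wishful thinking but is entering the range of engineering feasibility. This judgment holds.

The key is selective offloading, not wholesale outsourcing. The authors' strongest systems judgment is that not every prefill is worth doing remotely: short requests and cache-hit requests lose money on the remote path, and only long, uncached requests are large enough to amortize the cross-cluster KV transfer cost. This conclusion matches systems intuition and is far more credible than the naive "ship all prefill elsewhere" design.

The gains come from stacking model changes and scheduling mechanisms, not from a single innovation. The paper does not oversell "smaller KV" as a cure-all. It states explicitly that smaller KV is only a necessary condition; a usable system also needs length-threshold routing, bandwidth-aware scheduling, cache-aware placement, and dual-timescale resource reallocation. This shows that the authors understand the complexity of real systems.

The numbers are attractive, but the evidence is not yet strong enough for an industry-wide verdict. A 54% throughput gain, a 64% reduction in P90 TTFT, and an average cross-cluster bandwidth of 13 Gbps all point in a strong direction. But they come from an internal 1T-parameter model, internal profiling, and a model-driven case study, not from publicly reproducible validation in real cross-IDC production. "Promising" is a fair verdict; "the next-generation paradigm has arrived" overstates it.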

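Relevance to Us

What it means for ATou, and how to use it next. For ATou, the lesson is that judging AI infrastructure opportunities means looking beyond GPUs and tokens/s to whether state can actually be moved. A next step is to adopt "necessary vs. sufficient condition" as an analysis template and apply it to projects that claim model progress alone will rewrite system architecture.

What it means for Neta, and how to use it next. For Neta, the article suggests that inference infrastructure in the agent era will look more like a staged supply chain than a monolithic inference engine. A next step is to treat "long-context preparation goes remote, low-latency interaction stays local" as a working hypothesis for agent architectures, and to watch who actually has the routing, caching, and state-orchestration capabilities.

What it means for Uota, and how to use it next. For Uota, this is not merely a technical detail but a model for dividing work across organizations and products: not all heavy lifting should be handled in one place, and only the parts with clear marginal gains are worth outsourcing. A next step is to abstract "selective offloading" into a general strategy for discussing team structure, product tiering, and global operations.

What it means for investment and strategy, and how to use it next. The article points to a direction worth betting on: future infrastructure advantage may come from cross-domain state orchestration rather than only from more expensive chips. The next step is to check whether three kinds of players close the loop together (KV-friendly models, cross-cluster scheduling systems, and phase-specialized hardware) rather than evaluating any one layer in isolation.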

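Discussion Starters

1. If mainstream models do not keep moving toward KV-friendly hybrid attention, will the PrfaaS logic quickly stop holding?
2. Is the real limit on cross-datacenter KVCache insufficient average bandwidth, or the difficulty of handling tail latency, jitter, and failures?
3. For agent products, will "long context remote, interaction local" become the default architecture, or will it fit only a few high-value scenarios?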

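Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Ruoyu Qin¹,², Weiran He¹, Yaoyu Wang¹, Zheming Li¹, Xinran Xu¹, Yongwei Wu², Weimin Zheng², Mingxing Zhang². ¹Moonshot AI, ²Tsinghua University. Corresponding author: zhang_mingxing@mail.tsinghua.edu.cn

Abstract

Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still set by KVCache transfer. In conventional dense-attention models, prefill generates so much KVCache traffic that prefill and decode must stay tightly coupled inside one high-bandwidth network domain, which limits heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transfer increasingly plausible. Smaller KVCache alone, however, does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A design that naively externalizes all prefill can therefore still suffer from congestion, unstable queueing, and poor utilization.

This paper presents Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture. It selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. The design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, so prefill and decode capacity can scale independently across loosely coupled clusters. In a case study with an internal 1T-parameter hybrid model, a heterogeneous deployment augmented with PrfaaS achieves 54% and 32% higher serving throughput than the homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.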

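1 Introduction

Prefill-decode (PD) disaggregation has become the dominant deployment paradigm for large-scale LLM serving because it separates two fundamentally different phases of inference: prefill is primarily compute-bound, whereas decode is primarily memory-bandwidth-bound. At Moonshot AI, Mooncake [22] helped bring this shift into practice by treating KVCache as a first-class system resource, and the direction has since spread across the wider serving ecosystem through our collaboration with open frameworks such as vLLM [29], SGLang [7], and Dynamo [20]. In principle, PD disaggregation should also unlock a more ambitious goal: heterogeneous serving, in which prefill runs on compute-dense accelerators and decode runs on bandwidth-optimized accelerators. Hardware roadmaps are already moving in this direction. NVIDIA's Rubin CPX [19] explicitly targets high-throughput long-context prefill, while architectures such as Groq's Language Processing Unit (LPU) [1, 8] emphasize the extreme memory bandwidth that decode requires.

In practice, however, this heterogeneous vision remains hard to realize, because current PD disaggregation still relies on a strong networking assumption: when prefill and decode are placed on different nodes, the KVCache produced by prefill must be transferred fast enough to avoid stalling computation. In conventional deployments, this effectively confines both phases to the same tightly coupled, high-bandwidth network domain, typically a single datacenter-scale RDMA fabric. Inside such a homogeneous cluster, PD disaggregation works well. The difficulty is that this single-datacenter paradigm does not extend naturally to heterogeneous serving. Accelerator resources are usually pooled by both chip type and physical location, so compute-oriented and inference-oriented hardware are often unavailable within the same tightly coupled domain. This creates a strong incentive to split prefill and decode across cluster boundaries and move prefill to faster compute hardware, reducing cost and latency for long-context requests. The advantage materializes only if KVCache transfer stays cheap enough. Once prefill and decode no longer share a high-bandwidth fabric, the generated KVCache must cross a slower inter-cluster link. If that transfer is too expensive, it erases the prefill-side gain and becomes the new bottleneck. Even when two clusters are geographically close, requiring them to share one RDMA-scale fabric is operationally rigid and often unrealistic. Worse, once a heterogeneous deployment is forced into one tightly coupled cluster, its prefill-to-decode hardware ratio is hard to adjust as traffic patterns change. As a result, current PD deployments still fall short of the flexibility that heterogeneous disaggregation is supposed to provide.

The core obstacle is therefore KVCache transfer. Recent hybrid-attention architectures change the picture here. Emerging models [26, 27, 31, 3, 11, 2, 18] interleave a small number of full-attention layers with a larger number of linear-complexity or bounded-state layers, such as Kimi Delta Attention (KDA) [26], Sliding Window Attention (SWA) [5], and related mechanisms. Compared with dense-attention architectures, this design substantially reduces KVCache growth, often by an order of magnitude, which makes cross-datacenter KVCache transfer plausible. But plausible is not yet practical: a design that naively externalizes all prefill still suffers from bursty arrivals, skewed request lengths, uneven prefix-cache distribution, and fluctuating inter-cluster bandwidth. Hybrid architectures relieve the KVCache bottleneck but do not remove the need for system design; rather, they create an opportunity that system design must exploit.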

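(a) Status quo: tightly coupled single-cluster inference.

(b) PrfaaS: multi-cluster disaggregated inference via cross-datacenter KVCache.

Figure 1: Comparison of two deployment paradigms for PD-disaggregated LLM serving.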

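This observation leads to the key design principle of this paper: Prefill-as-a-Service (PrfaaS) via cross-datacenter KVCache. As Figure 1 shows, PrfaaS does not force heterogeneous accelerators into one RDMA island. Instead, it uses cheap, high-throughput compute to build standalone clusters dedicated to long-context prefill. PrfaaS also does not disaggregate every request: it offloads only long, uncached prefill to these compute-dense prefill clusters, while short requests stay on the local PD path. The resulting KVCache is then transferred over commodity Ethernet to a decode-capable PD cluster. This design reflects the underlying system reality: the motivation to split off prefill is strong, but even the reduced KVCache of hybrid models is not cheap enough to transfer indiscriminately. What makes the design viable is selective offloading. It concentrates cross-cluster transfer on the requests that gain the most from prefill acceleration and avoids the inefficiency of sending short requests down a bandwidth-limited path.

For this design to work, scheduling and cache management must explicitly handle the remaining system challenges. Because bandwidth is still limited even after KVCache shrinks, PrfaaS uses length-threshold routing to offload only sufficiently long requests; a bandwidth-aware scheduler that reacts to changing link conditions before congestion builds up; and a global KVCache manager with a hybrid prefix cache pool that jointly considers request length, cache placement, and available cross-cluster bandwidth. These mechanisms make cross-cluster heterogeneous serving hold up under realistic conditions: they preserve the benefits of PD disaggregation without requiring heterogeneous accelerators to share one low-latency RDMA fabric, and they let compute-oriented prefill capacity and bandwidth-oriented decode capacity scale independently across loosely coupled clusters, datacenters, or regions.

This flexibility directly addresses deployment constraints that are hard to solve in practice, including accelerator classes that cannot be co-located, hardware asymmetry between cloud regions, and opportunistic remote capacity. We evaluate the idea in a case study using an internal 1T-parameter hybrid model that follows the Kimi Linear architecture [26]. The heterogeneous deployment consists of a standalone PrfaaS cluster for long-context prefill and a conventional PD cluster for decode and short prefill. It achieves 54% and 32% higher serving throughput than the homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth per machine. These results indicate that a KVCache-efficient model architecture is a necessary condition for cross-datacenter heterogeneous serving, but not a sufficient one. What makes deployment practical is the combination of model-side KVCache reduction with system-side selective offloading and bandwidth-aware scheduling. Together they turn cross-datacenter PD disaggregation from an attractive idea into a practically usable serving architecture.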

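2 Background

2.1 The Bandwidth Wall in Conventional PD Disaggregation

Prefill-decode (PD) disaggregation has become the standard system abstraction for large-scale LLM serving because it cleanly separates two fundamentally different phases of inference: prefill is dominated by arithmetic throughput, whereas decode is dominated by memory bandwidth. The separation improves utilization and enables phase-specific optimization, but it is not free. Once prefill and decode sit on different nodes, every request must export its KVCache from the prefill side to the decode side, turning what used to be an on-device state handoff into a cross-node transfer problem. In practice, this transfer requirement is exactly what confines today's PD deployments to a single datacenter and ties them to RDMA-class scale-out networks.

Table 1: Representative model configurations. Type A denotes linear-complexity blocks; Type B denotes quadratic-complexity full-attention blocks.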

Model	Attention Type A	Attention Type B	A:B Ratio	Model Params
Kimi Linear [26]	KDA [26]	MLA [12]	3:1	48B
MiMo-V2-Flash [31]	SWA [5]	GQA [4]	5:1	309B
Qwen3.5-397B [27]	GDN [33]	GQA	3:1	397B
Ring-2.5-1T [3]	Lightning [23]	MLA	7:1	1T
MiniMax-M2.5 [16]	–	GQA	–	229B
Qwen3-235B [32]	–	GQA	–	235B

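Under latency-sensitive serving constraints, the KVCache produced by a prefill instance can be sent asynchronously to maximize compute utilization. To keep GPUs from idling, the egress bandwidth $B_{\text{out}}$ of the prefill cluster must exceed the aggregate rate at which the cluster generates KVCache. Because this aggregate rate grows linearly with the number of instances, the constraint reduces to the KV throughput of a single model instance, which we define as

$$\Phi_{\text{kv}}(l)=\frac{S_{\text{kv}}(l)}{T_{\text{prefill}}(l)}, \tag{1}$$

where $S_{\text{kv}}(l)$ is the KVCache size of a request of length $l$ and $T_{\text{prefill}}(l)$ is the corresponding prefill latency. The value of this metric depends heavily on the model architecture. Table 1 summarizes the configurations considered in this paper. For conventional dense-attention architectures, this transfer demand is the dominant system constraint. In standard Transformer-style attention, the KVCache grows linearly with context length and can reach tens of GB. Figure 2 shows the KV throughput of MiniMax-M2.5, a representative dense model with GQA, at different input lengths. The bottleneck is clear: for a 32K-token request, a single MiniMax-M2.5 instance produces KVCache at about 60 Gbps, so the required egress bandwidth far exceeds the cross-datacenter Ethernet capacity of a typical machine. This is why conventional PD disaggregation remains operationally tied to tightly integrated network domains. The network budget is so large that moving prefill and decode onto a looser interconnect, let alone across datacenters, is simply unrealistic.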

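This network coupling also prevents heterogeneous serving from scaling cleanly. Phase-specialized chips already exist: hardware such as Rubin CPX targets prefill throughput, while LPU-style designs target decode bandwidth. However, high-performance interconnects remain tightly coupled to machine form factor and deployment environment, so connecting different hardware types at RDMA-class bandwidth usually requires custom engineering. Worse, once heterogeneous hardware is forced into one tightly coupled cluster, the system inherits a fixed prefill-to-decode hardware ratio. In production traffic, the request mix, request volume, and prefix-cache hit rate fluctuate continuously, so one side of the pipeline inevitably ends up over-provisioned while the other becomes the bottleneck. In a homogeneous cluster, any machine can switch dynamically between prefill and decode roles as load changes. A heterogeneous cluster has no such flexibility: chips specialized for prefill cannot serve decode and vice versa, which causes severe load imbalance and stranded capacity. The result is higher operational complexity and limited real-world adoption of heterogeneous PD outside bespoke or low-throughput settings.

Figure 2: KV throughput of MiniMax-M2.5 at different input lengths on an 8×H200 instance.

Table 2: Prefill latency and KV throughput characteristics of different attention mechanisms. Lower is better for both metrics.

Mechanism	Prefill Latency	KV Throughput
GQA	High	High
MLA	High	Low
Sparse Attention	Low	High
SWA	Low	Low
Linear Attention	Low	Low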

2.2 Hybrid Attention Changes the PD Deployment Boundary

Table 3: KV throughput $\Phi_{\text{kv}}$ (Gbps) at different input lengths. All models are benchmarked on 8×H200 with SGLang v0.5.9 [7].

Seq Len	Kimi Linear	MiMo-V2-Flash	Qwen3.5-397B	Ring-2.5-1T	MiniMax-M2.5	Qwen3-235B
1K	1.19	0.82	4.13	7.27	4.94	4.12
8K	2.29	2.85	6.28	4.47	32.87	22.42
32K	3.87	4.66	8.25	2.59	59.93	33.35
128K	4.88	4.71	7.47	1.46	47.82	21.50

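What changes this picture is not just a new scheduler but a shift in model architecture. As LLMs move toward longer contexts, the cost of conventional MHA becomes increasingly hard to bear, pushing the industry broadly toward KVCache-friendly designs. Table 2 classifies mainstream attention variants along two dimensions: prefill latency ($T_{\text{prefill}}$) and KV throughput ($\Phi_{\text{kv}}$). Under long-context workloads, full-attention mechanisms such as GQA and MLA retain quadratic complexity, so prefill is expensive. Sparse attention [6] reduces computation and can lower prefill latency, but it still has to ship a sequence-length-dependent KVCache to the decode instance, so KV throughput remains the dominant bottleneck. By contrast, linear attention and SWA keep compute cost linear, and their bounded state size significantly lowers KV throughput.

A growing number of flagship open-source models adopt linear attention [26, 27, 3, 18] or SWA [31, 11, 2] and combine them into hybrid stacks that interleave a few full-attention layers with many linear-complexity layers. Representative examples include Qwen3.5-397B with a 3:1 linear-to-full ratio, MiMo-V2-Flash with a 5:1 SWA-to-full ratio, and Ring-2.5-1T with a 7:1 linear-to-full ratio. In these architectures, only the full-attention layers produce a KVCache that grows with sequence length; the linear-complexity layers maintain a fixed-size recurrent state whose footprint becomes negligible at long context.

Equation (1) quantifies how model architecture determines the bandwidth demand of PD disaggregation. Table 3 benchmarks $\Phi_{\text{kv}}$ for several recent open dense and hybrid models. Compared with dense models of similar scale, hybrid models have a markedly lower $\Phi_{\text{kv}}$, meaning that each unit of compute produces far less state that must later cross the network. At 32K tokens, MiMo-V2-Flash has a KV throughput of 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13× reduction. Qwen3.5-397B reaches 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4× reduction. In addition, for Ring-2.5-1T, MLA provides roughly 4.5× compression over GQA, and the 7:1 hybrid ratio contributes a further reduction of about 8×, for an overall KV memory saving of about 36×.

The key systems implication is not only cheaper inference: lower KV throughput moves the network boundary within which PD disaggregation can be deployed from RDMA-class fabrics to commodity Ethernet. With dense-attention models, prefill produces too much state too quickly, so the network becomes a hard coupling between phases. With hybrid models, prefill still does a great deal of work but emits a much smaller KVCache. This does not mean cross-datacenter PD is now cheap enough to transfer anything at will. More precisely, it opens a qualitatively different operating regime: cross-cluster KVCache transfer becomes feasible for selected requests and is therefore worth optimizing at the system level.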

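2.3 From Intra-Datacenter PD to the Prefill-as-a-Service Paradigm

In an intra-datacenter PD deployment, prefill and decode nodes communicate over a high-bandwidth, fully connected network such as RDMA, so the network is far from being the bottleneck relative to prefill compute. In cross-cluster PD, however, the relationship between inter-cluster bandwidth and the model's KV throughput directly determines whether cross-datacenter KVCache is feasible. The cluster-level bandwidth demand follows from the per-instance KV throughput. For an $N$-GPU prefill cluster, the minimum egress bandwidth is

$$B_{\text{out}}=\frac{N}{P}\cdot\frac{\mathbb{E}[S_{\text{kv}}]}{\mathbb{E}[T_{\text{prefill}}]}\approx\frac{N}{P}\cdot\Phi_{\text{kv}}(L_{\text{avg}}), \tag{2}$$

where $P$ is the parallelism degree (the number of GPUs per instance) and $L_{\text{avg}}$ is the average uncached input length of the requests actually offloaded to the PrfaaS cluster. Notably, $B_{\text{out}}$ depends not only on $\Phi_{\text{kv}}$, which is set by model architecture and hardware, but also on $L_{\text{avg}}$, which is shaped jointly by the request-length distribution, the prefix-cache hit rate, and the routing policy. This dependence is exactly why a hybrid model alone is not enough. On the model side, a KVCache-friendly architecture lowers $\Phi_{\text{kv}}$. On the system side, selective offloading and a bandwidth-aware scheduler that accounts for both bandwidth constraints and KVCache locality (§3.4) decide which requests consume the cross-datacenter budget in the first place, keeping $B_{\text{out}}$ within the available inter-datacenter bandwidth.

To make the analysis concrete, consider a prefill cluster of 512 H200 GPUs with $L_{\text{avg}}=32\text{K}$. In this configuration, MiniMax-M2.5 and Qwen3-235B need 3.8 Tbps and 2.1 Tbps of egress bandwidth, respectively, which effectively locks the deployment into a tightly integrated single-cluster network. In contrast, models with hybrid architectures reduce KV throughput by an order of magnitude, bringing the bandwidth demand down to a level that modern inter-datacenter links can sustain. Ring-2.5-1T needs about 170 Gbps of dedicated line capacity. If longer requests (for example, 128K tokens) are routed to the PrfaaS cluster, the demand drops further to below 100 Gbps. Even at the scale of a 10,000-GPU datacenter, the total egress bandwidth is only about 1.8 Tbps, well within the capacity of physical inter-datacenter links.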
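The arithmetic behind these figures can be checked directly from Equation (2) and Table 3. The sketch below is only an illustration: the function name `egress_gbps` is hypothetical, and it assumes 8 GPUs per instance (the 8×H200 setup used for Table 3).

```python
# Per-instance KV throughput in Gbps, copied from Table 3.
PHI_KV_GBPS = {
    "MiniMax-M2.5": {"32K": 59.93, "128K": 47.82},
    "Qwen3-235B": {"32K": 33.35, "128K": 21.50},
    "Ring-2.5-1T": {"32K": 2.59, "128K": 1.46},
}


def egress_gbps(num_gpus: int, gpus_per_instance: int, phi_kv_gbps: float) -> float:
    """Equation (2): B_out ~= (N / P) * Phi_kv(L_avg), in Gbps."""
    return num_gpus / gpus_per_instance * phi_kv_gbps


if __name__ == "__main__":
    # 512 H200 GPUs, 8 GPUs per instance (assumed), L_avg = 32K.
    for model, phi in PHI_KV_GBPS.items():
        b_out = egress_gbps(512, 8, phi["32K"])
        print(f"{model}: {b_out:.0f} Gbps at L_avg = 32K")
    # Ring-2.5-1T with 128K requests, at 512 GPUs and at 10,000 GPUs.
    print(egress_gbps(512, 8, PHI_KV_GBPS["Ring-2.5-1T"]["128K"]))
    print(egress_gbps(10_000, 8, PHI_KV_GBPS["Ring-2.5-1T"]["128K"]))
```

With 64 instances, the dense models land at roughly 3.8 Tbps and 2.1 Tbps, while Ring-2.5-1T stays under 170 Gbps at 32K and under 100 Gbps at 128K, matching the figures quoted above.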

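This is the system inflection point described earlier. Once KV throughput is low enough, heterogeneous serving no longer has to be built the awkward way, by cramming different accelerator types behind one RDMA island. Instead, prefill can be selectively externalized to standalone, compute-dense Prefill-as-a-Service clusters while decode stays in a conventional, bandwidth-optimized PD cluster. The question changes from "how do we force heterogeneous hardware into one tightly coupled deployment?" to "how do we identify the requests worth offloading and move their KVCache efficiently between loosely coupled clusters?" In this sense, KVCache-friendly model architectures, together with selective offloading and bandwidth-aware scheduling, jointly allow heterogeneous PD disaggregation to extend beyond a single datacenter.

3 Disaggregation over Cross-Datacenter KVCache

3.1 Overview

Figure 3: Deployment topology of the PrfaaS-PD architecture.

The core idea of cross-datacenter KVCache is not to externalize every prefill, but to selectively extend disaggregated LLM serving beyond a single cluster boundary when remote prefill acceleration is worth the transfer cost. We realize this through the PrfaaS-PD architecture, which uses cross-datacenter KVCache to decouple prefill and decode across loosely connected clusters and serves the requests whose long, uncached prefill benefits most from faster compute. As Figure 3 shows, a dedicated PrfaaS cluster runs compute-intensive long-context prefill on cost-effective, high-throughput accelerators and streams the resulting KVCache over commodity Ethernet to the local PD cluster, while short or bandwidth-unfriendly requests stay on the local PD path.

The PrfaaS-PD architecture has three subsystems: compute, network, and storage. The compute subsystem consists of multiple clusters, each containing only homogeneous hardware, because different chip types are usually hard to co-locate in the same facility. Clusters fall into two categories. A local PD cluster runs PD-disaggregated serving and can complete inference for a request end to end. A PrfaaS cluster provides selective remote prefill capacity for requests whose incremental uncached length exceeds the routing threshold. After prefill, the resulting KVCache is transferred to the local PD cluster for decode. The network subsystem has two tiers: the intra-cluster network uses RDMA for latency-sensitive collective communication and PD KVCache transfer, while inter-cluster links rely on VPC peering or dedicated lines to carry cross-datacenter KVCache. The storage subsystem lives inside each cluster and builds a distributed hybrid prefix cache pool across all machines (§3.2). A global KVCache manager maintains KVCache metadata for all clusters. On top of these infrastructure components, a global scheduler routes requests to the appropriate cluster and node according to request characteristics, network conditions, and cache distribution, maximizing end-to-end throughput under the cross-cluster bandwidth constraint (see §3.4).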

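3.2 Hybrid Prefix Cache Pool

A prefix cache pool lets the serving system offload KVCache to distributed host memory and SSDs within the cluster, greatly increasing the prefix-cache hit rate. Conventional prefix cache pools, however, are designed for a single KVCache type and match and evict at token or block granularity. In hybrid models, the recurrent state of linear-attention or SWA layers is request-level: its size is independent of input length, and it can be reused only when the cached length matches exactly. By contrast, the KVCache of full-attention layers is block-level: it grows linearly with input length and supports partial prefix matching. This heterogeneity challenges the conventional paradigm of storing KVCache uniformly across all layers.

Building on vLLM's hybrid KVCache manager [28], we construct a hybrid prefix cache system designed for cross-cluster KVCache transfer, as shown in Figure 4. Linear states and full-attention KVCache are managed by separate KVCache groups with aligned block sizes, so that all groups allocate and free blocks from a shared KVCache pool. On top of this shared pool, cache blocks are divided into two categories: prefix-cache blocks and transfer-cache blocks. A prefix-cache block must be completely filled before it can be reused across requests. A transfer-cache block holds the KVCache produced at the tail of a prefill request for PD-disaggregated transfer; the pool discards these blocks once the transfer completes.

When a new request arrives, the global KVCache manager computes prefix-match information for each cluster, and the request router uses it to choose a prefill cluster and a cache-affine node within it. Beyond routing, the KVCache manager also rebalances caches to relieve hotspots. When enough inter-cluster bandwidth is available, cross-cluster cache transfer is also possible (see §3.4.3).

Figure 4: Hybrid prefix cache pool. Linear states and full-attention KVCache are managed by separate groups backed by a unified block pool. Blocks are divided into prefix cache (intra-cluster only, block-aligned) and transfer cache (cross-cluster, discarded after transfer).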

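3.3 PrfaaS-PD Disaggregation

On top of conventional intra-cluster PD disaggregation, we introduce PrfaaS clusters as a selective extension that improves deployment scalability and lowers cost without forcing every request onto the cross-cluster path. A PrfaaS-PD system can contain multiple PrfaaS clusters and local PD clusters, whose sizes and ratios are determined by hardware capability, network bandwidth, and request traffic.

Each PrfaaS cluster acts as a stateless KVCache producer whose effective throughput is the smaller of its prefill compute rate and its network egress bandwidth. Not all requests benefit equally from being offloaded to a PrfaaS cluster. Short-context prefill is typically memory-bound or communication-bound rather than compute-bound, so its arithmetic utilization is low and it cannot make full use of the compute-dense accelerators in a PrfaaS cluster. We therefore keep prefill nodes in the local PD cluster and adopt a length-based routing policy. Let $l$ denote the incremental prefill length of a request (excluding any cached prefix) and $t$ the routing threshold. When $l > t$, the request is routed to a PrfaaS cluster, and the resulting KVCache is transferred to a decode node after prefill completes. When $l\leq t$, the request is handled by a prefill node inside the PD cluster. As agentic workloads become more common, most requests are incremental prefill requests with prefix-cache hits. For these, the global KVCache manager tracks where every cache entry is stored and ensures that only the incremental part crosses clusters. When making routing decisions, the scheduler must jointly consider cache affinity and cluster bandwidth (see §3.4.3).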
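The length-based routing rule can be written in a few lines. This is a minimal sketch rather than the authors' implementation: the function name `route_prefill` and the string labels are hypothetical, and the threshold value used in the example is the 19.4K figure reported later in the case study, taken here as 19,400 tokens.

```python
def route_prefill(total_len: int, cached_len: int, threshold: int) -> str:
    """Route on the incremental (uncached) prefill length l = total - cached.

    Returns "prfaas" when l > t and "pd-p" when l <= t, following the rule
    described in the text. Labels and signature are illustrative assumptions.
    """
    incremental_len = max(total_len - cached_len, 0)
    return "prfaas" if incremental_len > threshold else "pd-p"


if __name__ == "__main__":
    t = 19_400  # assumed reading of "19.4K" tokens
    print(route_prefill(64_000, 0, t))       # long and uncached
    print(route_prefill(64_000, 60_000, t))  # long input, mostly cached
    print(route_prefill(8_000, 0, t))        # short request
```

Note that a long request with a large cached prefix stays local: only the uncached increment counts toward the threshold.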

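Realizing PrfaaS-PD disaggregation in practice requires stable, high-throughput Ethernet transport. Although hybrid model architectures substantially reduce the nominal bandwidth demand, bursty traffic and uneven link utilization can still cause congestion, increasing queueing delay and reducing KVCache delivery efficiency. Our design goal is therefore to smooth transfer traffic and sustain high link utilization without triggering persistent congestion. To this end, we combine layer-wise prefill pipelining, which overlaps KVCache generation with transfer; multi-connection TCP transport, which makes full use of the available bandwidth; and congestion monitoring integrated with the scheduler, which detects packet-loss and retransmission signals early and prevents congestion from building up.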
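To see why layer-wise pipelining matters, the following sketch compares a serial "compute everything, then send" schedule with one that ships each layer's KVCache as soon as it is produced. It is a toy model under stated assumptions: every layer has the same compute time and the same transfer time, the link carries one layer at a time, and the function name `kv_delivery_time` is hypothetical. The paper does not describe its pipeline at this level of detail.

```python
def kv_delivery_time(num_layers: int, compute_s: float, transfer_s: float,
                     pipelined: bool = True) -> float:
    """Time until the last layer's KVCache has arrived at the decode side."""
    if not pipelined:
        # Finish all prefill compute first, then send every layer in turn.
        return num_layers * (compute_s + transfer_s)
    compute_done = 0.0  # when the current layer's KVCache becomes available
    link_free = 0.0     # when the link finishes its previous transfer
    for _ in range(num_layers):
        compute_done += compute_s
        start = max(compute_done, link_free)
        link_free = start + transfer_s
    return link_free


if __name__ == "__main__":
    # Compute-bound case: transfer hides almost entirely behind compute.
    print(kv_delivery_time(4, 1.0, 0.5), kv_delivery_time(4, 1.0, 0.5, pipelined=False))
    # Bandwidth-bound case: the link is the slower stage and sets the pace.
    print(kv_delivery_time(4, 1.0, 2.0))
```

In the compute-bound case only the final layer's transfer is exposed, which mirrors the min(compute, egress) throughput behaviour that the model in §3.4.1 assumes.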

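3.4 Modeling and Scheduling

A PrfaaS-PD system has three roles: PrfaaS prefill, PD-P (prefill nodes inside the PD cluster), and PD-D (decode nodes inside the PD cluster). To derive scheduling policies, we build an analytical throughput model that captures the compute and bandwidth constraints of each role and use it to guide short-term routing and long-term resource allocation.

3.4.1 Throughput Model

Table 4: Notation used in the PrfaaS-PD throughput model.

Traffic	Meaning	System	Meaning
$\Lambda$	Request arrival rate (throughput)	$N_{\text{prfaas}}$	Number of PrfaaS prefill instances
$L$	Uncached input length (random variable)	$N_{p}$, $N_{d}$	Number of PD-P / PD-D instances
$t$	Routing threshold	$B_{\text{out}}$	PrfaaS egress bandwidth
$l_{\text{long}}$	$\mathbb{E}[L\mid L>t]$, mean length at PrfaaS	$\mathit{BS}_{\max}$	Maximum decode batch size
$l_{\text{short}}$	$\mathbb{E}[L\mid L\leq t]$, mean length at PD-P	$T_{\text{prefill}}(l)$	Prefill time for length $l$
$p$	$P(L>t)$, fraction routed to PrfaaS	$T_{\text{decode}}$	Decode time per step
$L_{\text{out}}$	Mean output length	$\Theta_{\text{prfaas}}$	PrfaaS throughput (req/s)
$S_{\text{kv}}(l)$	KVCache size for length $l$	$\Theta_{\text{pd-p}}$, $\Theta_{\text{pd-d}}$	PD-P / PD-D throughput (req/s)

Using the notation in Table 4, we model the steady-state throughput of a PrfaaS-PD system. For tractability, we approximate all PrfaaS requests by a representative length $l_{\text{long}}=\mathbb{E}[L\mid L>t]$ with service time $T_{\text{prefill}}(l_{\text{long}})$, and all PD-P requests by $l_{\text{short}}=\mathbb{E}[L\mid L\leq t]$ with service time $T_{\text{prefill}}(l_{\text{short}})$. Requests arrive uniformly at an aggregate rate $\Lambda$. A fraction $p=P(L>t)$ of requests is routed to PrfaaS, and the remaining $1-p$ are handled by PD-P.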

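Each PrfaaS request passes through two pipeline stages: prefill computation and KVCache transfer. With layer-wise prefill pipelining, the throughput of the PrfaaS cluster is set by the slower of compute and egress transfer:

$$\Theta_{\text{prfaas}}=\min\!\left(\frac{N_{\text{prfaas}}}{T_{\text{prefill}}(l_{\text{long}})},\;\frac{B_{\text{out}}}{S_{\text{kv}}(l_{\text{long}})}\right). \tag{3}$$

Because intra-cluster RDMA bandwidth is not a bottleneck, PD-P throughput is determined by compute capacity alone:

$$\Theta_{\text{pd-p}}=\frac{N_{p}}{T_{\text{prefill}}(l_{\text{short}})}. \tag{4}$$

The throughput of the decode stage (PD-D) is

$$\Theta_{\text{pd-d}}=\frac{N_{d}\cdot\mathit{BS}_{\max}}{T_{\text{decode}}\cdot L_{\text{out}}}, \tag{5}$$

where $\mathit{BS}_{\max}$ and $T_{\text{decode}}$ are treated as constants determined by the SLO.

PrfaaS and PD-P act as upstream producers that prefill disjoint subsets of requests (fractions $p$ and $1-p$, respectively), while PD-D is the only downstream consumer. Together they form a converging pipeline:

[Diagram: a Request is routed with probability $p$ to PrfaaS Prefill, whose output reaches PD-D Decode over Ethernet, and with probability $1-p$ to PD-P Prefill, whose output reaches PD-D Decode over RDMA.]

End-to-end system throughput is limited by the slowest stage, taking the routing split into account:

$$\Lambda_{\max}=\min\!\left(\frac{\Theta_{\text{prfaas}}}{p},\;\frac{\Theta_{\text{pd-p}}}{1-p},\;\Theta_{\text{pd-d}}\right). \tag{6}$$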
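Equations (3) to (6) translate directly into code. The sketch below is an illustration with hypothetical function names. Two inputs in the usage example are assumptions, not values stated in the source: a per-step decode time of 1/40 s (one reading of the 40 tokens/s SLO in §4.1) and a maximum decode batch size of 20, chosen because it is consistent with the $\Theta_{\text{pd-d}}$ values later reported in Table 6.

```python
import math


def theta_prfaas(n_prfaas: int, t_prefill_long_s: float,
                 b_out_bytes_per_s: float, s_kv_long_bytes: float) -> float:
    """Equation (3): the slower of prefill compute and egress transfer (req/s)."""
    return min(n_prfaas / t_prefill_long_s, b_out_bytes_per_s / s_kv_long_bytes)


def theta_pd_p(n_p: int, t_prefill_short_s: float) -> float:
    """Equation (4): PD-P throughput is compute-bound (req/s)."""
    return n_p / t_prefill_short_s


def theta_pd_d(n_d: int, bs_max: int, t_decode_s: float, l_out: float) -> float:
    """Equation (5): decode throughput (req/s)."""
    return n_d * bs_max / (t_decode_s * l_out)


def lambda_max(th_prfaas: float, th_pd_p: float, th_pd_d: float, p: float) -> float:
    """Equation (6): end-to-end throughput under routing split p."""
    prfaas_cap = th_prfaas / p if p > 0 else math.inf
    pd_p_cap = th_pd_p / (1 - p) if p < 1 else math.inf
    return min(prfaas_cap, pd_p_cap, th_pd_d)


if __name__ == "__main__":
    # Stage throughputs and p = 0.496 as reported in Table 6 and Section 4.3.1.
    print(lambda_max(1.61, 1.64, 3.91, 0.496))
    # Assumed BS_max = 20 and T_decode = 1/40 s with L_out = 1024 and N_d = 5.
    print(theta_pd_d(5, 20, 1 / 40, 1024))
```

Plugging in the reported stage throughputs gives a bottleneck of about 3.25 req/s, in line with the 3.24 req/s listed for PrfaaS-PD in Table 6.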

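3.4.2 Throughput-Optimal Configuration

Given fixed hardware resources ($N_{\text{prfaas}}$ and $N_{p}+N_{d}$) and network bandwidth $B_{\text{out}}$, we seek two decision variables that maximize $\Lambda_{\max}$: the routing threshold $t$ (which determines $p$, $l_{\text{long}}$, and $l_{\text{short}}$) and the prefill-to-decode ratio $N_{p}/N_{d}$ within the PD cluster.

The threshold $t$ controls the load trade-off between PrfaaS and PD-P. Raising $t$ restricts PrfaaS to longer requests, for which $T_{\text{prefill}}(l)$ grows faster than $S_{\text{kv}}(l)$ (near-quadratic prefill versus linear KVCache size). This lowers per-instance KV throughput and relieves bandwidth pressure. Conversely, lowering $t$ floods PrfaaS with shorter requests, whose higher KV throughput makes a bandwidth bottleneck more likely. The optimal $t$ balances PrfaaS and PD-P throughput so that the two stages saturate at nearly the same point:

$$\frac{\Theta_{\text{prfaas}}}{p}=\frac{\Theta_{\text{pd-p}}}{1-p}. \tag{7}$$

For a fixed cluster size $N_{p}+N_{d}$ (in the short term, the number of machines in a datacenter is fixed), the ratio $N_{p}/N_{d}$ should balance aggregate producer throughput against consumer throughput. Too few prefill nodes starve the decode stage of KVCache, while too many leave prefill capacity idle. The optimal ratio satisfies

$$\Theta_{\text{prfaas}}+\Theta_{\text{pd-p}}=\Theta_{\text{pd-d}}. \tag{8}$$

Equations (7) and (8) constrain the two unknowns, $t$ and $N_{p}/N_{d}$. Because $\Theta_{\text{prfaas}}/p$ decreases monotonically as $p$ increases (that is, as $t$ decreases) while $\Theta_{\text{pd-p}}/(1-p)$ increases, a grid search over $t$ and $N_{p}/N_{d}$ finds the optimal operating point efficiently.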
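A grid search of this kind can be sketched as follows. The code is a simplified illustration, not the authors' solver: the function name `grid_search`, the callable-based profile interface, and all numbers in the usage example (a four-request workload and linear latency and size profiles) are hypothetical and chosen only so the result is easy to verify by hand.

```python
from statistics import mean


def grid_search(lengths, thresholds, n_prfaas, n_pd_total,
                t_prefill_prfaas, t_prefill_pd, s_kv_bytes,
                b_out_bytes_per_s, decode_rate_per_instance):
    """Return (lambda_max, t, n_p, n_d) maximizing Equation (6) over the grid."""
    best = None
    for t in thresholds:
        long_reqs = [l for l in lengths if l > t]
        short_reqs = [l for l in lengths if l <= t]
        if not long_reqs or not short_reqs:
            continue  # this sketch skips thresholds that empty one path
        p = len(long_reqs) / len(lengths)
        l_long, l_short = mean(long_reqs), mean(short_reqs)
        th_prfaas = min(n_prfaas / t_prefill_prfaas(l_long),
                        b_out_bytes_per_s / s_kv_bytes(l_long))  # Eq. (3)
        for n_p in range(1, n_pd_total):
            n_d = n_pd_total - n_p
            th_pd_p = n_p / t_prefill_pd(l_short)                # Eq. (4)
            th_pd_d = n_d * decode_rate_per_instance             # Eq. (5)
            lam = min(th_prfaas / p, th_pd_p / (1 - p), th_pd_d)  # Eq. (6)
            if best is None or lam > best[0]:
                best = (lam, t, n_p, n_d)
    return best


if __name__ == "__main__":
    result = grid_search(
        lengths=[1_000, 2_000, 30_000, 40_000],   # toy workload (hypothetical)
        thresholds=[1_500, 10_000, 35_000],
        n_prfaas=4, n_pd_total=8,
        t_prefill_prfaas=lambda l: l / 20_000,    # toy latency profiles
        t_prefill_pd=lambda l: l / 10_000,
        s_kv_bytes=lambda l: l * 20_000,          # toy KVCache size profile
        b_out_bytes_per_s=12.5e9,                 # 100 Gbps
        decode_rate_per_instance=0.78125,
    )
    print(result)
```

A production solver would use measured profiles such as those in Table 5 and the empirical distribution of incremental prefill lengths instead of these toy inputs.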

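3.4.3 Dual-Timescale Scheduling

Scheduling is what turns cross-datacenter disaggregation from architecturally feasible into practically usable. In theory, hybrid model architectures lower KV throughput enough for commodity Ethernet links to carry cross-cluster transfers, and the steady-state analysis above maximizes cluster throughput by optimizing $t$ and $N_{p}/N_{d}$. In practice, traffic variation and burstiness cause transient congestion and queue buildup at the PrfaaS egress. Bulk transfers of the large prefix caches maintained by each cluster put further pressure on cross-cluster links. To handle this dynamic environment, we design a dual-timescale scheduling algorithm that treats cross-cluster bandwidth and throughput as the primary constraints, separates the factors that govern request routing and resource allocation into short-term and long-term classes, and provides a dedicated policy for each.

Short term: bandwidth- and cache-aware routing.

The PrfaaS cluster has a bandwidth-imposed throughput ceiling of $B_{\text{out}}/S_{\text{kv}}(l_{\text{long}})$. As the cluster approaches this ceiling, congestion starts to accumulate on the egress link. The scheduler therefore continuously monitors PrfaaS egress utilization and request queue depth. When utilization approaches a threshold or queueing rises sharply, a short-term routing adjustment is triggered.

At initialization or on a policy update, the scheduler profiles the current compute capacity and egress bandwidth and then searches for the optimal threshold $t$ based on the distribution of incremental prefill lengths after prefix matching. By routing only sufficiently long requests to PrfaaS, the scheduler lowers the per-request bandwidth demand and avoids congestion near the bandwidth ceiling.

For requests with prefix-cache hits, the scheduler must jointly consider cache affinity and bandwidth availability. Let $l_{\text{total}}$ denote the total input length of a request, and let $l_{\text{prfaas}}$ and $l_{\text{pd}}$ denote the cached prefix lengths in the PrfaaS and PD clusters, respectively. Routing depends on whether bandwidth or compute is the binding constraint. When bandwidth is scarce, the prefix cache in each cluster is evaluated independently: if $l_{\text{total}}-l_{\text{pd}}\leq t$, the request is prefilled locally in PD-P; otherwise it is offloaded to PrfaaS. When bandwidth is abundant, compute becomes the scarce resource, and cross-cluster cache transfer can reduce redundant computation. The scheduler considers the best cache across all clusters by setting $l_{\text{prefix}}=\max(l_{\text{prfaas}},\,l_{\text{pd}})$; if $l_{\text{total}}-l_{\text{prefix}}\leq t$, the request is prefilled in PD-P, and otherwise it goes to PrfaaS. When the cluster holding the longer cache is not the cluster chosen for compute, a cross-cluster cache transfer is performed.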
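The two routing regimes can be captured in a short function. This is a sketch of the rule as described in the paragraph above, with hypothetical names (`route_with_cache`, `RouteDecision`). One detail is an assumption on my part: in the bandwidth-scarce branch, a request sent to PrfaaS is taken to reuse only the prefix already cached in the PrfaaS cluster, since the text says each cluster's cache is evaluated independently.

```python
from typing import NamedTuple


class RouteDecision(NamedTuple):
    target: str                 # "pd-p" or "prfaas"
    reused_prefix: int          # cached prefix length the prefill can start from
    cross_cluster_copy: bool    # whether cached blocks must cross clusters first


def route_with_cache(l_total: int, l_prfaas: int, l_pd: int,
                     threshold: int, bandwidth_scarce: bool) -> RouteDecision:
    if bandwidth_scarce:
        # Evaluate each cluster's cache on its own; never move cached blocks.
        if l_total - l_pd <= threshold:
            return RouteDecision("pd-p", l_pd, False)
        return RouteDecision("prfaas", l_prfaas, False)  # assumption, see lead-in
    # Bandwidth is abundant: use the best cached prefix across clusters.
    l_prefix = max(l_prfaas, l_pd)
    target = "pd-p" if l_total - l_prefix <= threshold else "prfaas"
    local_prefix = l_pd if target == "pd-p" else l_prfaas
    return RouteDecision(target, l_prefix, local_prefix < l_prefix)


if __name__ == "__main__":
    t = 19_400
    # The longer prefix lives in the PrfaaS cluster; the request is 50K tokens.
    print(route_with_cache(50_000, 40_000, 0, t, bandwidth_scarce=True))
    print(route_with_cache(50_000, 40_000, 0, t, bandwidth_scarce=False))
```

The same request is offloaded when bandwidth is scarce but prefilled locally, after copying the cached prefix, when bandwidth is abundant.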

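Long term: traffic-driven allocation re-optimization.

On longer timescales, traffic changes create persistent imbalance between pipeline stages. When $\Theta_{\text{prfaas}}+\Theta_{\text{pd-p}}\ll\Theta_{\text{pd-d}}$, prefill is the system bottleneck; when $\Theta_{\text{prfaas}}+\Theta_{\text{pd-p}}\gg\Theta_{\text{pd-d}}$, decode is the system bottleneck. The scheduler identifies the binding constraint by monitoring queue depth and utilization at each stage. Because traffic on this timescale changes slowly and is usually periodic, the scheduler periodically re-evaluates the load balance and converts nodes inside the PD cluster between prefill and decode roles, adjusting $N_{p}$ and $N_{d}$ to restore the optimality conditions in Equations (7) and (8). As instance counts change, the routing threshold $t$ is re-optimized accordingly.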
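One simple way to picture the long-term loop is a greedy step that moves at most one PD instance between roles whenever doing so raises the bottleneck between producers and the consumer. This is an illustrative simplification, not the paper's algorithm: the function names are hypothetical, per-instance rates are assumed to scale linearly, $\Theta_{\text{prfaas}}$ is held fixed, and the routing threshold is not re-optimized. The per-instance rates in the example are derived by dividing the Table 6 stage throughputs by their instance counts.

```python
def pipeline_throughput(n_p: int, n_d: int, theta_prfaas: float,
                        pd_p_rate: float, pd_d_rate: float) -> float:
    """Bottleneck between aggregate producers and the decode consumer (req/s)."""
    producers = theta_prfaas + n_p * pd_p_rate
    consumer = n_d * pd_d_rate
    return min(producers, consumer)


def rebalance_step(n_p: int, n_d: int, theta_prfaas: float,
                   pd_p_rate: float, pd_d_rate: float) -> tuple:
    """Move at most one PD instance between roles if that raises throughput."""
    candidates = [(n_p, n_d)]  # listed first so ties keep the current split
    if n_d > 1:
        candidates.append((n_p + 1, n_d - 1))
    if n_p > 1:
        candidates.append((n_p - 1, n_d + 1))
    return max(candidates, key=lambda c: pipeline_throughput(
        c[0], c[1], theta_prfaas, pd_p_rate, pd_d_rate))


if __name__ == "__main__":
    # Rates derived from Table 6: 1.64 / 3 per PD-P and 3.91 / 5 per PD-D instance.
    rates = (1.61, 1.64 / 3, 3.91 / 5)
    split = (1, 7)
    for _ in range(4):
        split = rebalance_step(*split, *rates)
        print(split)
```

Starting from a decode-heavy split, repeated steps settle at 3 prefill and 5 decode instances, the allocation the case study reports as optimal.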

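4 Case Study: Bandwidth Demand of the PrfaaS-PD Architecture

This section uses an internal 1T-parameter hybrid-architecture model as a case study to evaluate whether selective PrfaaS offloading, under realistic hardware and deployment conditions, can keep cross-datacenter KVCache transfer within a practical bandwidth budget while improving system throughput. Following the notation in Table 4, we apply the profiling-based throughput model from §3.4 and solve for the two key parameters that maximize system throughput: the routing threshold $t$ and the prefill-to-decode instance ratio $N_{p}/N_{d}$ in the local PD cluster. In the final configuration, the heterogeneous PrfaaS-PD deployment achieves 54% higher throughput and 64% lower P90 TTFT than a homogeneous PD-only baseline, and 32% higher throughput than a naive heterogeneous deployment without scheduling. The average egress bandwidth of the PrfaaS cluster is only 13 Gbps, well within Ethernet capacity.

4.1 Setup

Table 5: KVCache size $S_{\text{kv}}$, prefill latency $T_{\text{prefill}}$, and KV throughput $\Phi_{\text{kv}}$ of the internal 1T hybrid model at different input lengths. Prefill latency is benchmarked on 8×H200 with an internal build of vLLM [29].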

Seq Len	KVCache Size	Prefill Latency	KV Throughput
1K	190.8 MiB	0.44 s	3.61 Gbps
8K	308.9 MiB	0.72 s	3.59 Gbps
32K	701.3 MiB	1.84 s	3.19 Gbps
128K	2316.3 MiB	7.40 s	2.62 Gbps

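We deploy two clusters connected by a VPC network that provides about 100 Gbps of aggregate cross-cluster bandwidth. The PrfaaS cluster consists of 32 H200 GPUs with higher compute throughput, dedicated to long-context prefill requests with $L > t$. The local PD cluster consists of 64 H20 GPUs running in conventional PD-disaggregated mode, with 800 Gbps of RDMA interconnect per node and a prefill-to-decode ratio that can be adjusted to request traffic. Note that although H200 and H20 differ in price, the PrfaaS cluster only needs high compute throughput; in a production deployment, it could be replaced with more cost-effective accelerators of comparable compute capability. As the baseline, we use a homogeneous PD cluster of 96 H20 GPUs.

To reflect realistic serving demand, the workload uses an internal 1T-parameter hybrid model whose architecture follows Kimi Linear [26], with layers interleaved at a KDA:MLA ratio of 3:1. The model is deployed with 8 GPUs per instance, and prefill and decode are profiled separately with an internal build of vLLM. Table 5 lists the model's KVCache size $S_{\text{kv}}$, prefill latency $T_{\text{prefill}}$, and KV throughput $\Phi_{\text{kv}}$ at different input lengths.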
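The KV throughput column of Table 5 follows from Equation (1) applied to the other two columns. The short check below makes the unit conversion explicit; the function name is hypothetical, and it assumes 1 MiB = 2^20 bytes and 1 Gbps = 10^9 bit/s.

```python
MIB = 1024 * 1024  # bytes per MiB

# (KVCache size in MiB, prefill latency in s, reported KV throughput in Gbps)
TABLE_5 = {
    "1K": (190.8, 0.44, 3.61),
    "8K": (308.9, 0.72, 3.59),
    "32K": (701.3, 1.84, 3.19),
    "128K": (2316.3, 7.40, 2.62),
}


def kv_throughput_gbps(size_mib: float, latency_s: float) -> float:
    """Equation (1): Phi_kv = S_kv / T_prefill, converted to Gbps."""
    return size_mib * MIB * 8 / latency_s / 1e9


if __name__ == "__main__":
    for seq_len, (size_mib, latency_s, reported) in TABLE_5.items():
        computed = kv_throughput_gbps(size_mib, latency_s)
        print(f"{seq_len}: computed {computed:.2f} Gbps, reported {reported:.2f} Gbps")
```

The recomputed values agree with the table to within about 0.03 Gbps, which is consistent with rounding of the latency column.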

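Request input lengths follow a truncated log-normal distribution ($\mu=9.90$, $\sigma=1.00$, truncated to $[128,\,128\text{K}]$) with a mean of about 27K tokens, reflecting the long-context distribution seen in real-world workloads. The output length is fixed at 1024 tokens, and the SLO (without speculative decoding) is set to 40 tokens/s. All throughput and bandwidth results reported below are obtained by feeding the measured profiling data into the throughput model of §3.4.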
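The workload statistics used by the throughput model ($p$, $l_{\text{long}}$, $l_{\text{short}}$) can be derived in closed form from this distribution. The sketch below does so with the standard normal CDF. It rests on two assumptions the source does not spell out: "K" is read as 1024 tokens, and truncation means conditioning on the interval (renormalizing) rather than clipping. The threshold of 19.4K is the value reported in §4.2, and the function names are hypothetical.

```python
import math


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncated_lognormal_split(mu: float, sigma: float, lo: float, hi: float, t: float) -> dict:
    """Mean, P(L > t), E[L | L > t], E[L | L <= t] for a log-normal on [lo, hi]."""
    a, b, c = ((math.log(v) - mu) / sigma for v in (lo, hi, t))
    scale = math.exp(mu + sigma ** 2 / 2.0)

    def mass(x0: float, x1: float) -> float:
        return std_normal_cdf(x1) - std_normal_cdf(x0)

    def partial_mean(x0: float, x1: float) -> float:
        # E[L; x0 < z < x1] for a log-normal, in standardized coordinates.
        return scale * (std_normal_cdf(x1 - sigma) - std_normal_cdf(x0 - sigma))

    return {
        "mean": partial_mean(a, b) / mass(a, b),
        "p": mass(c, b) / mass(a, b),
        "l_long": partial_mean(c, b) / mass(c, b),
        "l_short": partial_mean(a, c) / mass(a, c),
    }


if __name__ == "__main__":
    K = 1024  # assumed meaning of "K"
    stats = truncated_lognormal_split(9.90, 1.00, 128, 128 * K, 19.4 * K)
    for name, value in stats.items():
        print(name, round(value, 3))
```

Under these assumptions the mean comes out near 27K tokens, a little under half of the requests exceed the threshold, and the offloaded subset averages in the mid-40K range, close to the figures the case study reports. Small differences are expected because the exact truncation convention is not specified.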

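4.2 Throughput Modeling and Solution

(a) Searching the prefill/decode allocation.

(b) Searching the routing threshold $t$.

Figure 5: Illustration of the grid search over the two optimization variables. (a) Fix $t$ at its optimal value and search the prefill/decode instance split in the local PD cluster. (b) Fix $N_{p}=3$, $N_{d}=5$ and search $t$.

Using the PrfaaS-PD throughput model from §3.4, we optimize the routing threshold $t$ and the prefill/decode allocation in the PD cluster to maximize $\Lambda_{\max}$. We solve this with an exhaustive two-dimensional grid search. The optimal configuration is listed in the second column of Table 6.

Figure 5 illustrates the search, fixing one variable and sweeping the other. Figure 5(a) shows that with the threshold $t$ fixed (and hence $\Theta_{\text{prfaas}}$ fixed), system throughput peaks when prefill and decode throughput are roughly balanced. The optimal allocation for the local PD cluster is $N_{p}=3$ and $N_{d}=5$. Figure 5(b) fixes $N_{p}=3$ and $N_{d}=5$, so the overall $\Lambda_{\max}$ is determined by $\min(\Theta_{\text{prfaas}}/p,\;\Theta_{\text{pd-p}}/(1-p))$. The optimum lies at the intersection of the two curves, giving $t=19.4$K. At this operating point, about 50% of requests (the longer ones) are offloaded to the PrfaaS cluster, making full use of the high-compute-throughput accelerators.

Table 6: Comparison of the optimal configurations for PrfaaS-PD, homogeneous PD, and naive heterogeneous PD deployments.

Metric	PrfaaS-PD	Homogeneous PD	Naive Heterogeneous PD
Threshold $t$	19.4K	–	–
$N_{\text{prfaas}}$ / $N_{p}$ / $N_{d}$	4 / 3 / 5	– / 9 / 3	4 / – / 8
Mean / P90 TTFT (s)	2.22 / 3.51	4.44 / 9.73	1.74 / 3.51
$\Theta_{\text{prfaas}}$ / $\Theta_{\text{pd-p}}$ / $\Theta_{\text{pd-d}}$ (req/s)	1.61 / 1.64 / 3.91	– / 2.11 / 2.35	2.45 / – / 6.25
$\Lambda_{\max}$ (req/s)	3.24	2.11	2.45
Ratio	1.54×	1.00×	1.16×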

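4.3 Results

4.3.1 Cross-Datacenter Bandwidth Utilization

A key premise of the PrfaaS-PD architecture is that the inter-cluster network link can sustain the KVCache transfer demand. We evaluate this by measuring the KV throughput of the PrfaaS cluster under the deployed workload distribution.

With the routing threshold set to $t=19.4$K, 49.6% of requests are routed to PrfaaS, and the offloaded subset has $\mathbb{E}[L\mid L>t]\approx 44$K tokens. The aggregate PrfaaS egress load is therefore about 13 Gbps, consuming only 13% of the Ethernet link. This confirms that, with appropriate scheduling, the KVCache of a hybrid-architecture model can be carried over commodity Ethernet with ample headroom. By contrast, a conventional Transformer with full KV heads produces a far larger KVCache volume and may require RDMA-class interconnects.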
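The order of magnitude of this egress figure can be reproduced from numbers given elsewhere in the paper. The sketch below multiplies the PrfaaS request rate from Table 6 by the KVCache size at the mean offloaded length. The linear interpolation of $S_{\text{kv}}$ between the 32K and 128K rows of Table 5 is my assumption, as are the function names; the paper does not state how it evaluates $S_{\text{kv}}$ at 44K.

```python
MIB = 1024 * 1024  # bytes per MiB


def s_kv_mib(length_k: float) -> float:
    """Linearly interpolate KVCache size between the 32K and 128K rows of Table 5."""
    lo_len, lo_size = 32.0, 701.3
    hi_len, hi_size = 128.0, 2316.3
    return lo_size + (length_k - lo_len) / (hi_len - lo_len) * (hi_size - lo_size)


def egress_load_gbps(req_per_s: float, length_k: float) -> float:
    """Aggregate egress load = request rate * KVCache size per request, in Gbps."""
    return req_per_s * s_kv_mib(length_k) * MIB * 8 / 1e9


if __name__ == "__main__":
    load = egress_load_gbps(1.61, 44.0)  # Theta_prfaas and E[L | L > t] from the text
    print(f"{load:.1f} Gbps, {load / 100:.0%} of a 100 Gbps link")
```

This back-of-the-envelope estimate lands near 12 Gbps, the same ballpark as the roughly 13 Gbps the case study reports; the gap is within what rounding of the inputs and the interpolation assumption could explain.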

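4.3.2 Comparison with Homogeneous PD

We apply the same throughput modeling method to the homogeneous PD baseline of 96 H20 GPUs. The results appear in the third column of Table 6. In the homogeneous cluster, system throughput is likewise maximized when prefill and decode throughput are balanced, and the resulting optimal allocation is 9 prefill instances and 3 decode instances.

Thanks to the higher compute throughput of the PrfaaS cluster, the PrfaaS-PD configuration needs two fewer prefill instances in total (4 PrfaaS plus 3 PD-P, versus 9 in the baseline, per Table 6), which frees capacity for two more decode instances. Overall system throughput increases by 54%.

Another key benefit of PrfaaS-PD is lower TTFT, especially for long-context requests. In the homogeneous baseline, long and short requests compete for prefill capacity, which drives up queueing delay. In PrfaaS-PD, long requests are offloaded to a dedicated high-throughput PrfaaS cluster, and even after accounting for cross-cluster transfer latency, prefill completes markedly faster than it would inside the PD cluster. As Table 6 shows, mean TTFT and P90 TTFT fall by 50% and 64%, respectively, relative to the homogeneous baseline.

4.3.3 Comparison with Naive Heterogeneous PD

The comparison with homogeneous PD shows the basic performance advantage of heterogeneous deployment. The comparison with naive heterogeneous PD further highlights how much scheduling matters in a heterogeneous system. The naive heterogeneous PD configuration applies no scheduling optimization: all prefill is assigned to the H200 GPUs and all decode to the H20 GPUs, with no length-based routing and no load balancing. As Table 6 shows, naive heterogeneous PD reaches only 1.16× the throughput of the homogeneous baseline, roughly 25% below PrfaaS-PD (2.45 vs. 3.24 req/s). The degradation comes from a severe imbalance between prefill and decode throughput and, more fundamentally, from treating heterogeneous prefill as a universal path instead of selectively offloading only the requests that benefit most from PrfaaS.

4.4 Summary

This case study shows that, for hybrid-architecture models, cross-datacenter KVCache becomes practical when it is combined with selective PrfaaS offloading rather than applied indiscriminately.

In terms of feasibility, KVCache transfer for the hybrid model consumes only 13% of a 100 Gbps Ethernet link. This is well within the capability of commodity Ethernet infrastructure and far below the bandwidth that RDMA interconnects are provisioned for, confirming that cross-datacenter KVCache transfer is feasible.

In terms of effectiveness, the PrfaaS-PD configuration (32 H200 GPUs for PrfaaS and 64 H20 GPUs for local PD) achieves 54% higher throughput and 64% lower P90 TTFT than the 96-H20 homogeneous PD-only baseline; at equal cost, the throughput gain is about 15%. These gains come from offloading compute-intensive long-context prefill to high-throughput PrfaaS accelerators. H200 and H20 are only a representative hardware pairing here, not the only viable one. Cost-effective prefill-specialized chips could further reduce deployment cost in production.

In addition, the current PrfaaS cluster is compute-bound and retains ample bandwidth headroom. With larger deployments or higher-bandwidth dedicated lines, the PrfaaS cluster can be scaled up further for additional throughput. For example, in an IDC-scale deployment with thousands of PrfaaS GPUs, the aggregate egress bandwidth needed for KVCache transfer is still only on the order of 1 Tbps, well within the capability of modern datacenter networks, leaving room to improve throughput and resource efficiency further.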

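5 Discussion

Cross-datacenter KVCache extends PD disaggregation from a single tightly coupled cluster to loosely connected heterogeneous clusters. Its practicality depends on coordinated progress in model architecture, system design, and hardware. This section discusses how these trends reinforce one another and what they mean for next-generation LLM serving systems.

KVCache-friendly model architectures.

As context windows keep growing, KVCache accounts for a rising share of inference cost in both storage and transfer. Architectural techniques such as MLA, sliding window attention, and linear attention have shown that KVCache size can be reduced substantially without sacrificing model capability. Going forward, model co-design will likely optimize not only FLOPs but also KVCache transfer volume. Such improvements directly lower the latency and bandwidth cost of cross-datacenter KVCache and widen the range of deployments in which PrfaaS-PD is cost-effective.

KVCache compression and reuse.

Beyond architectural innovation, a growing body of work reduces LLM inference cost at the algorithm or system level, focusing on KVCache compression and reuse. Methods such as H2O [35] and KIVI [14] selectively evict or quantize KVCache entries to shrink the memory footprint. CacheGen [13] applies conventional compression techniques to reduce KVCache transfer volume, while CacheBlend [34] and FusionRAG [30] support reuse of approximately matching KVCache across requests. These techniques complement KVCache-friendly model design: by lowering effective memory pressure and network traffic, they make cross-datacenter KVCache more robust under production workloads.

Phase-specialized inference hardware.

The decoupling of prefill and decode is also starting to appear in hardware design. Prefill is compute-intensive, whereas decode is dominated by memory bandwidth. Recent chip roadmaps are increasingly specialized by phase: NVIDIA Rubin CPX [19] emphasizes compute throughput for prefill, while chips such as the LPU [8] and Taalas HC1 [25] offer extremely high memory bandwidth for fast decode. Cross-datacenter KVCache fits this trend naturally. It removes the requirement that heterogeneous chips be deployed within the same tightly coupled network domain, so operators can size prefill and decode clusters independently and place each phase on the hardware that suits it best.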

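6 Related Work

System optimization for online LLM inference has gradually moved from monolithic, single-cluster engines toward disaggregated, heterogeneity-aware, KVCache-centric pipelines. On one hand, disaggregated serving splits each request into a compute-intensive prefill phase and a memory-bandwidth-intensive decode phase to reduce interference between phases and allow independent scaling. Splitwise [21] and DistServe [37] model PD disaggregation from the perspectives of cost/power and goodput, respectively, and propose corresponding deployment, scheduling, and placement strategies, showing that under suitable SLO and hardware constraints PD disaggregation can raise throughput and lower cost at the same time. On the other hand, as cluster hardware and interconnects become more heterogeneous, Helix [15], Hetis [17], and LLM-PQ [36] bring heterogeneous GPUs and networks and phase-level differences into the optimization space, obtaining throughput or latency gains through phase-specialized hardware placement. Meanwhile, DynamoLLM [24] and FREESH [9] take the perspective of energy, cost, and carbon efficiency and emphasize system-level resource reconfiguration and cross-domain scheduling subject to serving SLOs. More importantly, KVCache has evolved from per-request transient state into a first-class system resource: Mooncake [22] introduces a global KVCache pool to improve reuse across nodes and requests; CacheBlend [34] and FusionRAG [30] significantly reduce TTFT by fusing and reusing non-prefix KVCache; and KIVI [14], KVQuant [10], and H2O [35] further shrink KVCache volume through quantization or importance-based eviction to improve long-context serving capacity. However, no existing work jointly optimizes cross-datacenter prefill offloading, heterogeneous deployment, and bandwidth- and cache-aware scheduling in one unified system. This paper approaches the problem from the perspective of cross-datacenter KVCache, combining disaggregated inference with heterogeneous resource orchestration to build a low-cost, high-throughput serving system.

7 Conclusion

To address the practical deployment challenges of heterogeneous disaggregated inference, we propose the concept of cross-datacenter KVCache, extending disaggregated serving from a single homogeneous cluster to cross-cluster heterogeneous deployment. On this basis, we design the PrfaaS-PD disaggregation architecture, which raises system serving throughput at low cost through heterogeneous PrfaaS clusters connected over commodity Ethernet. We envision that the cross-datacenter KVCache paradigm will co-evolve with next-generation models, hardware, and networks to support efficient LLM serving at scale.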

References

[1] D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker, T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell, et al. (2020) Think fast: a tensor streaming processor (tsp) for accelerating deep learning workloads. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 145–158.
[2] S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025) Gpt-oss-120b gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.
[3] I. AI (2026) Ring-2.5-1t. https://github.com/inclusionAI/Ring-V2.5
[4] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023) Gqa: training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901.
[5] I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
[6] R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
[7] L. Corp. (2026) SGLang. https://github.com/sgl-project/sglang
[8] Groq (2025) What is a language processing unit?. https://groq.com/blog/the-groq-lpu-explained
[9] X. He, Z. Fang, J. Lian, D. H. Tsang, B. Zhang, and Y. Chen (2025) FREESH: fair, resource-and energy-efficient scheduling for llm serving on heterogeneous gpus. arXiv preprint arXiv:2511.00807.
[10] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami (2024) Kvquant: towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems 37, pp. 1270–1303.
[11] A. Huang, A. Li, A. Kong, B. Wang, B. Jiao, B. Dong, B. Wang, B. Chen, B. Li, B. Ma, et al. (2026) Step 3.5 flash: open frontier-level intelligence with 11b active parameters. arXiv preprint arXiv:2602.10604.
[12] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. (2024) Deepseek-v2: a strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.
[13] Y. Liu, H. Li, Y. Cheng, S. Ray, Y. Huang, Q. Zhang, K. Du, J. Yao, S. Lu, G. Ananthanarayanan, et al. (2024) Cachegen: kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pp. 38–56.
[14] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu (2024) KIVI: a tuning-free asymmetric 2bit quantization for kv cache. In International Conference on Machine Learning, pp. 32332–32344.
[15] Y. Mei, Y. Zhuang, X. Miao, J. Yang, Z. Jia, and R. Vinayak (2025) Helix: serving large language models over heterogeneous gpus and network via max-flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 586–602.
[16] Minimax (2026) MiniMax m2.5: built for real-world productivity. https://www.minimax.io/news/minimax-m25
[17] Z. Mo, J. Liao, H. Xu, Z. Zhou, and C. Xu (2025) Hetis: serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1710–1724.
[18] NVIDIA (2025) Nemotron 3 Nano: open, efficient mixture-of-experts hybrid Mamba-Transformer model for Agentic reasoning. Technical report.
[19] NVIDIA (2025) NVIDIA rubin cpx accelerates inference performance and efficiency for 1m+ token context workloads. https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/
[20] NVIDIA (2026) Dynamo. https://github.com/ai-dynamo/dynamo
[21] P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini (2024) Splitwise: efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 118–132.
[22] R. Qin, Z. Li, W. He, J. Cui, H. Tang, F. Ren, T. Ma, S. Cai, Y. Zhang, M. Zhang, et al. (2024) Mooncake: a kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage.
[23] Z. Qin, W. Sun, D. Li, X. Shen, W. Sun, and Y. Zhong (2024) Lightning attention-2: a free lunch for handling unlimited sequence lengths in large language models. arXiv:2401.04658.
[24] J. Stojkovic, C. Zhang, Í. Goiri, J. Torrellas, and E. Choukse (2025) Dynamollm: designing llm inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1348–1362.
[25] Taalas (2025) Taalas hc1. https://taalas.com/products
[26] K. Team, Y. Zhang, Z. Lin, X. Yao, J. Hu, F. Meng, C. Liu, X. Men, S. Yang, Z. Li, et al. (2025) Kimi linear: an expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692.
[27] Q. Team (2026) Qwen3.5: towards native multimodal agents. https://qwen.ai/blog?id=qwen3.5
[28] vLLM Team at IBM (2025) Hybrid models as first-class citizens in vLLM. PyTorch Blog, November 2025. https://pytorch.org/blog/hybrid-models-as-first-class-citizens-in-vllm/
[29] vLLM Team (2026) VLLM. https://github.com/vllm-project/vllm
[30] J. Wang, W. Xie, M. Zhang, B. Zhang, J. Dong, Y. Zhu, C. Lin, J. Tang, Y. Han, Z. Ai, et al. (2026) From prefix cache to fusion rag cache: accelerating llm inference in retrieval-augmented generation. arXiv preprint arXiv:2601.12904.
[31] B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. (2026) Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780.
[32] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[33] S. Yang, J. Kautz, and A. Hatamizadeh (2025) Gated delta networks: improving mamba2 with delta rule. In International Conference on Learning Representations (ICLR).
[34] J. Yao, H. Li, Y. Liu, S. Ray, Y. Cheng, Q. Zhang, K. Du, S. Lu, and J. Jiang (2025) Cacheblend: fast large language model serving for rag with cached knowledge fusion. In Proceedings of the twentieth European conference on computer systems, pp. 94–109.
[35] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. (2023) H2o: heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, pp. 34661–34710.
[36] J. Zhao, B. Wan, C. Wu, Y. Peng, and H. Lin (2024) Llm-pq: serving llm on heterogeneous clusters with phase-aware partition and adaptive quantization. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 460–462.
[37] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang (2024) DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 193–210.

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Ruoyu Qin1,2 Weiran He1 Yaoyu Wang1 Zheming Li1 Xinran Xu1 Yongwei Wu2 Weimin Zheng2 Mingxing Zhang2 1Moonshot AI 2Tsinghua University Corresponding author: zhang_mingxing@mail.tsinghua.edu.cn

Abstract

Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates huge KVCache traffic that keeps prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization.

We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.

1 Introduction

Prefill-decode (PD) disaggregation has become the dominant deployment paradigm for large-scale LLM serving because it separates two fundamentally different phases of inference: prefill is primarily compute-intensive, whereas decode is primarily memory-bandwidth-intensive. At Moonshot AI, Mooncake [22] helped push this shift into practice by treating KVCache as a first-class systems resource, a direction that has since propagated across the broader serving ecosystem via our collaboration with open frameworks such as vLLM [29], SGLang [7], and Dynamo [20]. In principle, PD disaggregation should also unlock a more ambitious goal: heterogeneous serving, in which prefill runs on compute-dense accelerators and decode runs on bandwidth-optimized accelerators. Hardware roadmaps are already moving in this direction. NVIDIA’s Rubin CPX [19] explicitly targets high-throughput long-context prefill, while architectures such as Groq’s Language Processing Unit (LPU) [1, 8] emphasize the extreme memory bandwidth required for decode.

In practice, however, this heterogeneous vision remains difficult to realize because current PD disaggregation still relies on a strong networking assumption: when prefill and decode are placed on different nodes, the KVCache produced by prefill must be transferred quickly enough to avoid stalling computation. In conventional deployments, this effectively confines both phases to the same tightly coupled, high-bandwidth network domain, typically a single datacenter-scale RDMA fabric, where PD disaggregation works well inside a homogeneous cluster. The difficulty is that this single-datacenter paradigm does not extend naturally to heterogeneous serving. Accelerator resources are usually pooled by both chip type and physical location, so compute-oriented and inference-oriented hardware are often unavailable within the same tightly coupled domain. This creates a strong incentive to separate prefill from decode across cluster boundaries, which can reduce costs and latency for long-context requests by moving prefill to faster compute hardware. However, this advantage materializes only if KVCache transfer remains sufficiently cheap. Once prefill and decode no longer share the same high-bandwidth fabric, the generated KVCache must traverse a slower inter-cluster link. If that transfer cost is too high, it erases the prefill-side gain and becomes the new bottleneck. Even when two clusters are geographically close, requiring them to share a single RDMA-scale fabric is operationally rigid and often unrealistic. Worse, once heterogeneous deployment is forced into one tightly coupled cluster, its prefill-to-decode hardware ratio becomes difficult to adapt as traffic patterns evolve. As a result, current PD deployments still fall short of the flexibility that heterogeneous disaggregation is supposed to provide.

The central obstacle, therefore, is KVCache transfer. Recent hybrid-attention architectures change this picture in an important way. Emerging models [26, 27, 31, 3, 11, 2, 18] interleave a small number of full-attention layers with a larger number of linear-complexity or bounded-state layers, such as Kimi Delta Attention (KDA) [26], Sliding Window Attention (SWA) [5], and related mechanisms. This design substantially reduces KVCache growth relative to dense-attention architectures, often by an order of magnitude, thereby making cross-datacenter KVCache transfer plausible. But plausibility is not yet practicality: a naive design that externalizes all prefills would still suffer from bursty arrivals, skewed request lengths, uneven prefix-cache distribution, and fluctuating inter-cluster bandwidth. Hybrid architectures relax the KVCache bottleneck, but they do not eliminate the need for system design; rather, they create the opportunity that system design must exploit.

(a) Status quo: Tightly coupled single-cluster inference.

(b) PrfaaS: Multi-cluster disaggregated inference via cross-datacenter KVCache.

Figure 1: Comparison of two deployment paradigms for PD-disaggregated LLM serving.

This observation leads to the key design principle of this paper: Prefill-as-a-Service (PrfaaS) via cross-datacenter KVCache. As illustrated in Figure 1, instead of forcing heterogeneous accelerators into a single RDMA island, PrfaaS constructs standalone clusters dedicated to long-context prefill using inexpensive, high-throughput compute. Moreover, rather than sending every request across clusters, PrfaaS offloads only long uncached prefills to these compute-dense prefill clusters, while short requests remain on the local PD path. The resulting KVCache is then transferred over commodity Ethernet to decode-capable PD clusters. This design reflects the underlying systems reality: the motivation to split prefill is strong, but the reduced KVCache of hybrid models is still not cheap enough to justify indiscriminate transfer. What makes the design feasible is selective offloading, which concentrates cross-cluster transfer on the requests for which prefill acceleration matters most, while avoiding the inefficiency of sending short requests through a bandwidth-constrained path.

Making this design work requires scheduling and cache management that explicitly address the remaining systems challenges. Considering that bandwidth remains constrained even after KVCache reduction, PrfaaS uses length-based threshold routing to offload only sufficiently long requests, a bandwidth-aware scheduler to react to fluctuating link conditions before congestion accumulates, and a global KVCache manager with a hybrid prefix-cache pool to account jointly for request length, cache placement, and available cross-cluster bandwidth. These mechanisms make cross-cluster heterogeneous serving viable in realistic environments: they preserve the benefits of PD disaggregation without requiring heterogeneous accelerators to share the same low-latency RDMA fabric, and they allow compute-oriented prefill capacity and bandwidth-oriented decode capacity to scale independently across loosely coupled clusters, datacenters, or regions.

This flexibility directly addresses deployment constraints that are otherwise difficult to resolve in practice, including non-colocated accelerator classes, hardware asymmetry across cloud regions, and opportunistic remote capacity. We evaluate this idea through a case study using an internal 1T-parameter hybrid model following the Kimi Linear architecture [26]. With a heterogeneous deployment consisting of a standalone PrfaaS cluster for long-context prefill and a conventional PD cluster for decode and short prefills, the system achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth per machine. These results show that KVCache-efficient model architectures are necessary but not sufficient for cross-datacenter heterogeneous serving. What makes the deployment practical is the combination of model-side KVCache reduction with system-side selective offloading and bandwidth-aware scheduling. Together, they turn cross-datacenter PD disaggregation from an appealing idea into a realistic serving architecture.

2 Background

2.1 The Bandwidth Wall in Conventional PD Disaggregation

Prefill-decode (PD) disaggregation has become the standard systems abstraction for large-scale LLM serving because it cleanly separates two fundamentally different phases of inference: prefill is dominated by arithmetic throughput, while decode is dominated by memory bandwidth. That separation improves utilization and enables phase-specific optimization. But it does not come for free. Once prefill and decode are placed on different nodes, every request must export its KVCache from the prefill side to the decode side, turning what was previously an on-device state transition into a cross-node transport problem. In practice, that transport requirement is what keeps today’s PD deployments confined to a single data center and attached to RDMA-class scale-out networks.

Table 1: Configurations of representative models. Type A denotes the linear-complexity block, and Type B denotes the quadratic-complexity full attention block.

Model	Attention Type A	Attention Type B	A:B Ratio	Model Params
Kimi Linear [26]	KDA [26]	MLA [12]	3:1	48B
MiMo-V2-Flash [31]	SWA [5]	GQA [4]	5:1	309B
Qwen3.5-397B [27]	GDN [33]	GQA	3:1	397B
Ring-2.5-1T [3]	Lightning [23]	MLA	7:1	1T
MiniMax-M2.5 [16]	–	GQA	–	229B
Qwen3-235B [32]	–	GQA	–	235B

Under latency-sensitive serving constraints, KVCache produced by prefill instances can be shipped asynchronously to maximize compute utilization. To avoid GPU idling, the egress bandwidth $B_{\text{out}}$ of a prefill cluster must exceed the aggregate rate at which the cluster produces KVCache. Because this aggregate rate scales linearly with the number of instances, the binding constraint reduces to the KV throughput of a single model instance, which we define as

$$\Phi_{\text{kv}}(l)=\frac{S_{\text{kv}}(l)}{T_{\text{prefill}}(l)}, \tag{1}$$

where $S_{\text{kv}}(l)$ is the KVCache size for a request of length $l$ and $T_{\text{prefill}}(l)$ is the corresponding prefill latency. The value of this metric depends largely on model architecture. Table 1 summarizes the configurations considered in this paper. For conventional dense-attention architectures, this transport demand is a dominant systems constraint. In standard Transformer-style attention, KVCache grows linearly with context length and can reach tens of gigabytes. Figure 2 shows the KV throughput of MiniMax-M2.5, a representative dense model with GQA, at various input lengths. The bottleneck is stark: for a 32K-token request, a single MiniMax-M2.5 instance produces KVCache at roughly 60 Gbps, requiring egress bandwidth that far exceeds the cross-datacenter Ethernet capacity of a typical machine. This is precisely why conventional PD disaggregation remains operationally tied to tightly integrated network domains. The network budget is so large that moving prefill and decode across looser interconnects, let alone across data centers, is simply not realistic.
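As a small worked illustration of Equation (1), the sketch below converts a KVCache size and a prefill latency into KV throughput in Gbps. The function name and the sample numbers (7.5 GB of KVCache emitted by a 1.0 s prefill) are hypothetical, chosen only so that the result lands near the roughly 60 Gbps quoted above; they are not measurements from the paper.

```python
def kv_throughput_gbps(kv_bytes: float, prefill_seconds: float) -> float:
    """Equation (1): Phi_kv = S_kv / T_prefill, reported in Gbps (1e9 bits/s)."""
    if prefill_seconds <= 0:
        raise ValueError("prefill_seconds must be positive")
    return kv_bytes * 8 / prefill_seconds / 1e9


# Hypothetical request: 7.5 GB of KVCache emitted by a 1.0 s prefill.
print(kv_throughput_gbps(7.5e9, 1.0))
```

Because the metric is a ratio, it falls whenever prefill time grows faster than KVCache size, which is why longer requests can be cheaper to ship per unit of compute.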

That network coupling also prevents heterogeneous serving from scaling cleanly. Specialized chips already exist for each phase: hardware such as Rubin CPX targets prefill throughput, while LPU-style designs target decode bandwidth. Yet high-performance interconnects remain tightly coupled to machine form factors and deployment environments, so connecting unlike hardware at RDMA-class bandwidth generally requires bespoke engineering. Worse, once heterogeneous hardware is forced into a single tightly coupled cluster, the system inherits a fixed prefill-to-decode hardware ratio. In production traffic, request mix, request volume, and prefix-cache hit rate fluctuate continuously, so one side of the pipeline inevitably becomes overprovisioned while the other becomes the bottleneck. In a homogeneous cluster, any machine can be dynamically reassigned between prefill and decode roles as load shifts. A heterogeneous cluster offers no such flexibility: a chip specialized for prefill cannot serve decode and vice versa, leading to severe load imbalance and stranded capacity. The result is higher operational complexity and limited real-world adoption of heterogeneous PD beyond bespoke or low-throughput scenarios.

Figure 2: KV throughput of MiniMax-M2.5 on an 8$\times$H200 instance at various input lengths.

Table 2: Prefill latency and KV throughput characteristics of different attention mechanisms. Lower is better for both metrics.

Mechanism	Prefill Latency	KV Throughput
GQA	High	High
MLA	High	Low
Sparse Attention	Low	High
SWA	Low	Low
Linear Attention	Low	Low

2.2 Hybrid Attention Changes the PD Deployment Boundary

Table 3: KV throughput $\Phi_{\text{kv}}$ (Gbps) at various input lengths. All models are benchmarked on 8$\times$H200 with SGLang v0.5.9 [7]. Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T are hybrid models; MiniMax-M2.5 and Qwen3-235B are dense models.

Seq Len	Kimi Linear	MiMo-V2-Flash	Qwen3.5-397B	Ring-2.5-1T	MiniMax-M2.5	Qwen3-235B
1K	1.19	0.82	4.13	7.27	4.94	4.12
8K	2.29	2.85	6.28	4.47	32.87	22.42
32K	3.87	4.66	8.25	2.59	59.93	33.35
128K	4.88	4.71	7.47	1.46	47.82	21.50

What changes this picture is not a new scheduler alone, but a shift in model architecture. As LLMs move toward longer contexts, the cost of conventional MHA becomes increasingly untenable, prompting a broad transition toward KVCache-friendly designs. Table 2 categorizes mainstream attention improvements along two dimensions: prefill latency ($T_{\text{prefill}}$) and KV throughput ($\Phi_{\text{kv}}$). Under long-context workloads, full attention mechanisms such as GQA and MLA retain quadratic complexity, resulting in high prefill cost. Sparse attention [6] reduces the amount of computation and can lower prefill latency, but it still requires transferring sequence-length-dependent KVCache to decode instances, leaving KV throughput as the dominant bottleneck. By contrast, linear attention and SWA maintain linear computation cost, while their bounded state size substantially reduces KV throughput.

A growing number of flagship open-source models adopt linear attention [26, 27, 3, 18] or SWA [31, 11, 2], combining these mechanisms into hybrid stacks that interleave a small number of full-attention layers with a larger number of linear-complexity layers. Representative examples include Qwen3.5-397B at a 3:1 linear-to-full ratio, MiMo-V2-Flash at a 5:1 SWA-to-full ratio, and Ring-2.5-1T at a 7:1 linear-to-full ratio. In these architectures, only the full-attention layers produce KVCache that scales with sequence length, while the linear-complexity layers maintain fixed-size recurrent state whose footprint becomes negligible in the long-context regime.

Equation (1) quantifies how model architecture governs the bandwidth demand of PD disaggregation. Table 3 benchmarks $\Phi_{\text{kv}}$ for several recent open-source dense and hybrid models. Compared with dense models of similar size, hybrid models exhibit a sharp reduction in $\Phi_{\text{kv}}$, meaning that each unit of compute generates far less state that must later traverse the network. At 32K tokens, MiMo-V2-Flash achieves a KV throughput of 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13$\times$ reduction. Qwen3.5-397B reaches 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4$\times$ reduction. For Ring-2.5-1T, MLA contributes roughly a 4.5$\times$ compression over GQA, while the 7:1 hybrid ratio contributes another ${\sim}8\times$ reduction, yielding an overall KV memory saving of roughly 36$\times$.

The key systems implication is not merely lower inference cost, but a reduced KV throughput that shifts the deployable network boundary of PD disaggregation from RDMA-class fabrics to commodity Ethernet. In dense-attention models, prefill emits too much state too quickly, so the network becomes the hard coupling between phases. In hybrid models, prefill still performs substantial work, but the emitted KVCache is dramatically smaller. This does not make cross-datacenter PD free enough for indiscriminate transfer. Rather, it opens a qualitatively different operating regime in which cross-cluster KVCache transport becomes plausible for selected requests and therefore worth optimizing at the system level.

2.3 From Intra-Datacenter PD to Prefill-as-a-Service Paradigm

In intra-datacenter PD deployments, prefill and decode nodes communicate over high-bandwidth, fully meshed networks such as RDMA, so the network is far from being a bottleneck relative to prefill computation. In cross-cluster PD, however, the relationship between inter-cluster bandwidth and model KV throughput directly determines whether cross-datacenter KVCache is feasible. The cluster-level bandwidth requirement follows from the per-instance KV throughput. For an $N$-GPU prefill cluster, the minimum egress bandwidth is

$$B_{\text{out}}=\frac{N}{P}\cdot\frac{\mathbb{E}[S_{\text{kv}}]}{\mathbb{E}[T_{\text{prefill}}]}\approx\frac{N}{P}\cdot\Phi_{\text{kv}}(L_{\text{avg}}), \tag{2}$$

where $P$ is the parallelism degree (GPUs per instance) and $L_{\text{avg}}$ is the average uncached input length of the requests actually offloaded to the PrfaaS cluster. Notably, $B_{\text{out}}$ depends not only on $\Phi_{\text{kv}}$, which is governed by model architecture and hardware, but also on $L_{\text{avg}}$, which is shaped by request length distribution, prefix-cache hit rate, and routing policy. This dependence is exactly why hybrid models alone are not enough. On the model side, adopting KVCache-friendly architectures reduces $\Phi_{\text{kv}}$. On the system side, selective offloading and a bandwidth-aware scheduler that accounts for bandwidth constraints and KVCache locality (§3.4) determine which requests consume the cross-datacenter budget in the first place, keeping $B_{\text{out}}$ within the available inter-datacenter bandwidth envelope.

To make the analysis concrete, we consider a prefill cluster comprising 512 H200 GPUs with $L_{\text{avg}}=32\text{K}$. Under this configuration, MiniMax-M2.5 and Qwen3-235B respectively require 3.8 Tbps and 2.1 Tbps of egress bandwidth, which effectively locks deployment into a tightly integrated single-cluster fabric. By contrast, models that employ hybrid architectures see their KV throughput drop by an order of magnitude, bringing the bandwidth demand to a level that modern inter-datacenter links can sustain. Ring-2.5-1T requires roughly 170 Gbps of dedicated line capacity. Moreover, by routing even longer requests (e.g., 128K tokens) to the PrfaaS cluster for processing, the bandwidth demand falls further to below 100 Gbps. Even at the scale of a 10,000-GPU datacenter, the aggregate egress bandwidth totals only about 1.8 Tbps, well within the capacity of physical inter-datacenter links.
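The cluster-level figures in this paragraph can be reproduced from Equation (2) and the per-instance values in Table 3. The sketch below is only a back-of-the-envelope check; it assumes $P=8$ GPUs per instance (inferred from the 8$\times$H200 benchmark setup, not stated explicitly for this example), and the function and dictionary names are hypothetical.

```python
# Per-instance KV throughput in Gbps, copied from Table 3.
PHI_KV_GBPS = {
    ("MiniMax-M2.5", "32K"): 59.93,
    ("Qwen3-235B", "32K"): 33.35,
    ("Ring-2.5-1T", "32K"): 2.59,
    ("Ring-2.5-1T", "128K"): 1.46,
}


def egress_gbps(n_gpus: int, gpus_per_instance: int, phi_kv_gbps: float) -> float:
    """Equation (2): B_out ~= (N / P) * Phi_kv(L_avg)."""
    return n_gpus / gpus_per_instance * phi_kv_gbps


# 512-GPU prefill cluster, assuming 8 GPUs per model instance.
for (model, seq_len), phi in PHI_KV_GBPS.items():
    print(f"{model} @ {seq_len}: {egress_gbps(512, 8, phi):.0f} Gbps")

# 10,000-GPU datacenter serving Ring-2.5-1T with 128K-token offloaded requests.
print(f"10k GPUs: {egress_gbps(10_000, 8, 1.46) / 1000:.2f} Tbps")
```

Under these assumptions the dense models land near 3.8 Tbps and 2.1 Tbps, while Ring-2.5-1T stays in the range of 90 to 170 Gbps, in line with the numbers quoted in the text.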

That is the systems turning point described before. Once KV throughput falls far enough, heterogeneous serving no longer has to be implemented solely as an awkward co-location of unlike accelerators behind the same RDMA island. Instead, prefill can be selectively externalized into standalone, compute-dense Prefill-as-a-Service clusters, while decode remains in conventional bandwidth-optimized PD clusters. The problem then shifts from “how do we force heterogeneous hardware into one tightly coupled deployment?” to “how do we identify the requests worth offloading and transport their KVCache across loosely coupled clusters efficiently?” In that sense, KVCache-friendly model architectures, together with selective offloading and bandwidth-aware scheduling, jointly enable heterogeneous PD disaggregation to scale beyond a single datacenter.

3 Disaggregation over Cross-Datacenter KVCache

3.1 Overview

Figure 3: Deployment topology of the PrfaaS-PD architecture.

The core idea of cross-datacenter KVCache is not to externalize every prefill, but to selectively extend disaggregated LLM serving beyond the boundary of a single cluster when remote prefill acceleration is worth the transfer cost. We realize this vision through the PrfaaS-PD architecture, which leverages cross-datacenter KVCache to decouple prefill and decode across loosely connected clusters for requests whose long uncached prefills benefit most from faster compute. As shown in Figure 3, dedicated PrfaaS clusters perform compute-intensive long-context prefill on cost-effective, high-throughput accelerators and stream the resulting KVCache to local PD clusters via commodity Ethernet, while short or bandwidth-unfriendly requests remain on the local PD path.

The PrfaaS-PD architecture comprises three subsystems: compute, network, and storage. The compute subsystem consists of multiple clusters, each containing only homogeneous hardware, as different chip types are typically difficult to co-locate within the same facility. Clusters fall into two categories. Local PD clusters perform PD-disaggregated serving and can complete inference for a request end to end. PrfaaS clusters provide selective remote prefill capacity for requests whose incremental uncached length exceeds a routing threshold. After prefill, the resulting KVCache is transferred to a local PD cluster for decode. The network subsystem spans two tiers: intra-cluster networks use RDMA for latency-sensitive collective communication and PD KVCache transfers, while inter-cluster links rely on VPC peering or dedicated lines for cross-datacenter KVCache transfer. The storage subsystem resides within each cluster, building a distributed hybrid prefix cache pool (§3.2) across all machines. A global KVCache manager maintains KVCache metadata across all clusters. On top of these infrastructure components, a global scheduler routes requests to clusters and nodes based on request characteristics, network conditions, and cache distribution, maximizing end-to-end throughput under cross-cluster bandwidth constraints, as detailed in §3.4.

3.2 Hybrid Prefix Cache Pool

Prefix cache pools allow the serving system to offload KVCache to distributed host memory and SSDs within a cluster, substantially increasing the prefix cache hit rate. Conventional prefix cache pools, however, are designed for a single KVCache type and perform matching and eviction at the token or block level. In hybrid models, the recurrent states of linear attention or SWA layers are request-level: their size is independent of input length, and they can only be reused when the cached length matches exactly. In contrast, the KVCache of full attention layers is block-level: it grows linearly with input length and supports partial prefix matching. This heterogeneity challenges the conventional all-layer-uniform KVCache storage paradigm.
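One way to read the two matching rules together is sketched below. It assumes (this is an interpretation, not something the text spells out) that a prefix of a hybrid model is reusable only at lengths where the full-attention blocks cover the prefix and a recurrent-state snapshot exists at exactly that length. The function name, the notion of "state checkpoints", and the sample values are all hypothetical.

```python
def reusable_prefix_len(cached_full_attn_tokens: int,
                        state_checkpoints: list[int],
                        block_size: int) -> int:
    """Longest prefix usable by both cache types (illustrative assumption).

    - Full-attention KVCache is block-level, so any whole number of cached
      blocks can be reused (partial prefix matching).
    - Linear/SWA state is request-level, so it is only reusable at lengths
      where a snapshot was stored (exact-length matching).
    """
    # Largest block-aligned length covered by full-attention KVCache.
    covered = cached_full_attn_tokens // block_size * block_size
    # Snapshots that are block-aligned and fully covered by cached blocks.
    candidates = [c for c in state_checkpoints
                  if c <= covered and c % block_size == 0]
    return max(candidates, default=0)


# Hypothetical cache contents: 1000 tokens of full-attention KV,
# state snapshots stored at 256, 512, and 2048 tokens, 128-token blocks.
print(reusable_prefix_len(1000, [256, 512, 2048], 128))
```

Under this reading, the effective hit length is bounded by the sparser of the two caches, which is one reason a uniform per-block cache layout fits hybrid models poorly.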

Based on vLLM’s hybrid KVCache manager [28], we build a hybrid prefix cache system tailored for cross-cluster KVCache transfer, as shown in Figure 4. Linear states and full-attention KVCache are managed by separate KVCache groups with aligned block sizes, allowing all groups to allocate and free blocks from a shared KVCache pool. On top of this shared pool, we partition cache blocks into two categories: prefix-cache blocks and transfer-cache blocks. Prefix-cache blocks must be fully populated before they can be reused across requests. Transfer-cache blocks hold the KVCache produced at the tail of a prefill request for PD-disaggregated transfer, and the cache pool discards them once the transfer completes.

When a new request arrives, the global KVCache manager computes prefix-match information for every cluster, and the request router uses this information to select the prefill cluster and the cache-affine node within it. Beyond routing, the KVCache manager also performs cache rebalancing to mitigate hotspots. When sufficient inter-cluster bandwidth is available, cross-cluster cache transfer is feasible as well, as discussed in §3.4.3.

Figure 4: Hybrid prefix cache pool. Linear states and full-attention KVCache are managed by separate groups backed by a unified block pool. Blocks are categorized as prefix-cache (intra-cluster only, block-aligned) or transfer-cache (cross-cluster, discarded after transfer).

3.3 PrfaaS-PD Disaggregation

Building on conventional intra-cluster PD disaggregation, we introduce PrfaaS clusters as a selective extension that increases deployment scalability and lowers cost without forcing all requests onto the cross-cluster path. A PrfaaS-PD system may contain several PrfaaS and local PD clusters, whose sizes and ratios are determined by hardware capabilities, network bandwidth, and request traffic.

Each PrfaaS cluster functions as a stateless KVCache producer whose effective throughput equals the minimum of its prefill computation rate and its network egress bandwidth. Not all requests benefit equally from being offloaded to a PrfaaS cluster. Short-context prefill is typically memory- or communication-bound rather than compute-bound, resulting in low arithmetic utilization that cannot fully exploit the compute-dense accelerators in PrfaaS clusters. We therefore retain prefill nodes within the local PD cluster and apply a length-based routing policy. Let $l$ denote the incremental prefill length of a request (excluding any cached prefix) and $t$ a routing threshold. When $l > t$, the request is routed to a PrfaaS cluster, and the resulting KVCache is transferred to a decode node upon completion. When $l\leq t$, the request is handled by a prefill node within the PD cluster. With the growing adoption of agentic workloads, the majority of requests are incremental prefills with prefix cache hits. For such requests, the global KVCache manager tracks the storage locations of all cached entries, ensuring that only the incremental portion is transferred across clusters. The scheduler must jointly consider cache affinity and cluster bandwidth when making routing decisions, as discussed in §3.4.3.
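The length-based policy above ($l > t$ goes to PrfaaS, $l \leq t$ stays on PD-P) can be sketched in a few lines. The `Request` type, the field names, and the 16,000-token threshold in the usage lines are hypothetical; the paper does not prescribe an interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    total_len: int   # total input tokens
    cached_len: int  # tokens already covered by a prefix-cache hit


def incremental_len(req: Request) -> int:
    """Incremental prefill length l: the uncached part of the input."""
    return max(req.total_len - req.cached_len, 0)


def route(req: Request, threshold: int) -> str:
    """Route to PrfaaS when l > t, otherwise keep the request on local PD-P."""
    return "prfaas" if incremental_len(req) > threshold else "pd-p"


# Hypothetical threshold of 16,000 tokens.
print(route(Request(total_len=100_000, cached_len=0), 16_000))       # long, uncached
print(route(Request(total_len=100_000, cached_len=95_000), 16_000))  # mostly cached
```

Routing on the incremental length rather than the total length is what keeps long but mostly cached agentic requests on the local path.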

Realizing PrfaaS-PD disaggregation in practice requires stable, high-throughput Ethernet transport. Although hybrid model architectures substantially reduce the nominal bandwidth requirement, bursty traffic and uneven link utilization can still cause congestion, increasing queuing delay and degrading KVCache delivery efficiency. Our design therefore aims to smooth transfer traffic and sustain high link utilization without inducing persistent congestion. To this end, we combine layer-wise prefill pipelining to overlap KVCache generation with transmission, multi-connection TCP transport to fully utilize the available bandwidth, and congestion monitoring integrated with the scheduler to detect loss and retransmission signals early and prevent congestion accumulation.

3.4 Modeling and Scheduling

A PrfaaS-PD system comprises three roles: PrfaaS prefill, PD-P (the prefill nodes within a PD cluster), and PD-D (the decode nodes within a PD cluster). To derive scheduling policies, we construct an analytical throughput model that captures the compute and bandwidth constraints of each role and use it to guide both short-term routing and long-term resource allocation.

3.4.1 Throughput Model

Table 4: Notation used in the PrfaaS-PD throughput model.

Group	Symbol	Meaning
Traffic	$\Lambda$	Request arrival rate (throughput)
Traffic	$L$	Uncached input length (r.v.)
Traffic	$t$	Routing threshold
Traffic	$l_{\text{long}}$	$\mathbb{E}[L\mid L>t]$, mean PrfaaS length
Traffic	$l_{\text{short}}$	$\mathbb{E}[L\mid L\leq t]$, mean PD-P length
Traffic	$p$	$P(L>t)$, fraction to PrfaaS
Traffic	$L_{\text{out}}$	Mean output length
Traffic	$S_{\text{kv}}(l)$	KVCache size for length $l$
System	$N_{\text{prfaas}}$	PrfaaS prefill instances
System	$N_{p}$, $N_{d}$	PD-P / PD-D instances
System	$B_{\text{out}}$	PrfaaS egress bandwidth
System	$\mathit{BS}_{\max}$	Max decode batch size
System	$T_{\text{prefill}}(l)$	Prefill time for length $l$
System	$T_{\text{decode}}$	Per-step decode time
System	$\Theta_{\text{prfaas}}$	PrfaaS throughput (req/s)
System	$\Theta_{\text{pd-p}}$, $\Theta_{\text{pd-d}}$	PD-P / PD-D throughput (req/s)

We model the steady-state throughput of the PrfaaS-PD system using the notation in Table 4. For tractability, we approximate all PrfaaS requests by a representative length $l_{\text{long}}=\mathbb{E}[L\mid L>t]$ with service time $T_{\text{prefill}}(l_{\text{long}})$, and all PD-P requests by $l_{\text{short}}=\mathbb{E}[L\mid L\leq t]$ with service time $T_{\text{prefill}}(l_{\text{short}})$. Requests arrive uniformly at aggregate rate $\Lambda$. A fraction $p=P(L>t)$ of requests are routed to PrfaaS, and the remaining $1-p$ are handled by PD-P.

Each PrfaaS request undergoes two pipelined phases: prefill computation and KVCache transfer. Through layer-wise prefill pipelining, the PrfaaS cluster throughput is determined by the slower of compute and egress transfer:

$$\Theta_{\text{prfaas}}=\min\!\left(\frac{N_{\text{prfaas}}}{T_{\text{prefill}}(l_{\text{long}})},\;\frac{B_{\text{out}}}{S_{\text{kv}}(l_{\text{long}})}\right). \tag{3}$$

Since intra-cluster RDMA bandwidth is not a bottleneck, PD-P throughput is determined solely by compute capacity:

$$\Theta_{\text{pd-p}}=\frac{N_{p}}{T_{\text{prefill}}(l_{\text{short}})}. \tag{4}$$

The decode-stage (PD-D) throughput is

$$\Theta_{\text{pd-d}}=\frac{N_{d}\cdot\mathit{BS}_{\max}}{T_{\text{decode}}\cdot L_{\text{out}}}, \tag{5}$$

where $\mathit{BS}_{\max}$ and $T_{\text{decode}}$ are treated as SLO-governed constants.

PrfaaS and PD-P serve as upstream producers, each prefilling a disjoint subset of requests (fractions $p$ and $1-p$, respectively), while PD-D is the sole downstream consumer. Together, they form a converging pipeline:

[Diagram: Request → PrfaaS Prefill (fraction $p$) → Ethernet → PD-D Decode; Request → PD-P Prefill (fraction $1-p$) → RDMA → PD-D Decode]

The end-to-end system throughput is limited by the slowest stage, accounting for the routing split:

$$\Lambda_{\max}=\min\!\left(\frac{\Theta_{\text{prfaas}}}{p},\;\frac{\Theta_{\text{pd-p}}}{1-p},\;\Theta_{\text{pd-d}}\right). \tag{6}$$
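Equations (3) through (6) translate almost directly into code. The sketch below is one possible transcription, with bandwidth expressed in Gbit/s and KVCache size in Gbit so that their ratio is in requests per second; the function names and every number in the usage example are hypothetical and were picked so that all three stages balance.

```python
def prfaas_throughput(n_prfaas, t_prefill_long, b_out, s_kv_long):
    """Equation (3): slower of prefill compute and egress transfer (req/s)."""
    return min(n_prfaas / t_prefill_long, b_out / s_kv_long)


def pdp_throughput(n_p, t_prefill_short):
    """Equation (4): PD-P is compute-bound (req/s)."""
    return n_p / t_prefill_short


def pdd_throughput(n_d, bs_max, t_decode, l_out):
    """Equation (5): decode throughput (req/s)."""
    return n_d * bs_max / (t_decode * l_out)


def max_arrival_rate(theta_prfaas, theta_pdp, theta_pdd, p):
    """Equation (6): the slowest stage bounds the sustainable arrival rate."""
    stages = [theta_pdd]
    if p > 0:  # skip the PrfaaS term when nothing is offloaded
        stages.append(theta_prfaas / p)
    if p < 1:  # skip the PD-P term when everything is offloaded
        stages.append(theta_pdp / (1 - p))
    return min(stages)


# Hypothetical operating point.
theta_prfaas = prfaas_throughput(n_prfaas=4, t_prefill_long=2.0,
                                 b_out=100.0, s_kv_long=10.0)
theta_pdp = pdp_throughput(n_p=3, t_prefill_short=0.5)
theta_pdd = pdd_throughput(n_d=5, bs_max=64, t_decode=0.04, l_out=1000)
print(max_arrival_rate(theta_prfaas, theta_pdp, theta_pdd, p=0.25))
```

In this made-up configuration PrfaaS is compute-bound (2 req/s against a 10 req/s bandwidth ceiling), and each of the three terms in Equation (6) evaluates to about 8 req/s, so no stage is idle.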

3.4.2 Throughput-Optimal Configuration

Given fixed hardware resources ($N_{\text{prfaas}}$, $N_{p}+N_{d}$) and network bandwidth $B_{\text{out}}$, we seek two decision variables that maximize $\Lambda_{\max}$: the routing threshold $t$ (which determines $p$, $l_{\text{long}}$, and $l_{\text{short}}$) and the PD-cluster prefill-to-decode ratio $N_{p}/N_{d}$.

The threshold $t$ governs the trade-off between PrfaaS and PD-P load. Increasing $t$ restricts PrfaaS to longer requests, for which $T_{\text{prefill}}(l)$ grows faster than $S_{\text{kv}}(l)$ (near-quadratic prefill versus linear KVCache size). This lowers the per-instance KV throughput and eases bandwidth pressure. Conversely, decreasing $t$ floods PrfaaS with shorter requests whose high KV throughput is more likely to trigger the bandwidth bottleneck. The optimal $t$ balances PrfaaS and PD-P throughput so that both stages approach saturation simultaneously:

$$\frac{\Theta_{\text{prfaas}}}{p}=\frac{\Theta_{\text{pd-p}}}{1-p}. \tag{7}$$

For a fixed cluster size $N_{p}+N_{d}$ (the number of machines in a datacenter is constant in the short term), the ratio $N_{p}/N_{d}$ should balance the aggregate producer throughput against the consumer throughput. Too few prefill nodes starve the decode stage of KVCache, while too many leave prefill capacity idle. The optimal ratio satisfies

$$\Theta_{\text{prfaas}}+\Theta_{\text{pd-p}}=\Theta_{\text{pd-d}}. \tag{8}$$

Equations (7) and (8) constrain two unknowns, $t$ and $N_{p}/N_{d}$. Because $\Theta_{\text{prfaas}}/p$ decreases monotonically with $p$ (and hence with decreasing $t$) while $\Theta_{\text{pd-p}}/(1-p)$ increases, a grid search over $t$ and $N_{p}/N_{d}$ efficiently finds the optimal operating point.
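A grid search of this kind might look like the sketch below, which maximizes $\Lambda_{\max}$ from Equation (6) directly over candidate thresholds and PD splits. Everything numeric here is a stand-in: the prefill-time and KVCache-size curves, the log-normal length sample, the instance counts, the bandwidth, and the single `decode_rate` constant (which collapses $\mathit{BS}_{\max}/(T_{\text{decode}}\cdot L_{\text{out}})$ per decode instance) are hypothetical and not taken from the paper's profiling.

```python
import random


def t_prefill(l: float) -> float:
    """Hypothetical prefill time (s): linear term plus a quadratic term."""
    return 2e-5 * l + 2e-10 * l * l


def s_kv(l: float) -> float:
    """Hypothetical KVCache size (Gbit), linear in the uncached length."""
    return 1e-4 * l


def lambda_max(lengths, t, n_prfaas, n_p, n_d, b_out, decode_rate):
    """Equation (6) for one (t, N_p, N_d) candidate."""
    long = [l for l in lengths if l > t]
    short = [l for l in lengths if l <= t]
    p = len(long) / len(lengths)
    stages = [n_d * decode_rate]  # Theta_pd-d
    if long:
        l_long = sum(long) / len(long)
        theta_prfaas = min(n_prfaas / t_prefill(l_long), b_out / s_kv(l_long))
        stages.append(theta_prfaas / p)
    if short:
        l_short = sum(short) / len(short)
        stages.append(n_p / t_prefill(l_short) / (1 - p))
    return min(stages)


def grid_search(lengths, thresholds, n_prfaas, n_pd_total, b_out, decode_rate):
    """Return (Lambda_max, t, N_p, N_d) for the best grid point."""
    best = None
    for t in thresholds:
        for n_p in range(n_pd_total):  # always keeps at least one decode instance
            n_d = n_pd_total - n_p
            lam = lambda_max(lengths, t, n_prfaas, n_p, n_d, b_out, decode_rate)
            if best is None or lam > best[0]:
                best = (lam, t, n_p, n_d)
    return best


# Hypothetical workload: clipped log-normal uncached lengths.
rng = random.Random(0)
lengths = [min(max(rng.lognormvariate(9.0, 1.0), 128.0), 131072.0)
           for _ in range(2000)]
thresholds = [2048, 4096, 8192, 16384, 32768, 65536]
best = grid_search(lengths, thresholds, n_prfaas=4, n_pd_total=8,
                   b_out=100.0, decode_rate=2.0)
print(best)
```

The search space is small (a handful of thresholds times the number of PD instances), so exhaustive evaluation is cheap enough to rerun whenever the length distribution or the available bandwidth is re-profiled.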

3.4.3 Dual-Timescale Scheduling

Scheduling is central to making cross-datacenter disaggregation practical rather than merely architecturally possible. In theory, hybrid model architectures reduce KV throughput enough that commodity Ethernet links can sustain cross-cluster transfers, and the steady-state analysis above can maximize cluster throughput by optimizing $t$ and $N_{p}/N_{d}$. In practice, however, traffic variations and burstiness can cause transient congestion and queue buildup at the PrfaaS egress. Moreover, clusters maintain large prefix caches whose bulk transfers can further strain cross-cluster links. To handle this dynamic environment, we design a dual-timescale scheduling algorithm that treats cross-cluster bandwidth and throughput as the primary constraints, separating the factors governing request routing and resource allocation into short-term and long-term categories with a dedicated strategy for each.

Short-term: bandwidth- and cache-aware routing.

The PrfaaS cluster has a bandwidth-imposed throughput ceiling $B_{\text{out}}/S_{\text{kv}}(l_{\text{long}})$. As the cluster approaches this ceiling, congestion builds at the egress link. The scheduler therefore continuously monitors the PrfaaS egress utilization and request queue depth. When utilization approaches a threshold or queuing surges, a short-term routing adjustment is triggered.

At initialization or upon a policy update, the scheduler profiles current compute capacity and egress bandwidth, then searches for the optimal threshold $t$ based on the incremental prefill-length distribution after prefix matching. By routing only sufficiently long requests to PrfaaS, the scheduler reduces per-request bandwidth demand and avoids congestion near the bandwidth ceiling.

For requests with prefix cache hits, the scheduler must jointly consider cache affinity and bandwidth availability. Let $l_{\text{total}}$ denote the total input length of a request, and let $l_{\text{prfaas}}$ and $l_{\text{pd}}$ denote the cached prefix lengths in the PrfaaS and PD clusters, respectively. The routing depends on whether bandwidth or compute is the binding constraint. When bandwidth is scarce, prefix caches in each cluster are evaluated independently: if $l_{\text{total}}-l_{\text{pd}}\leq t$, the request is prefilled locally in PD-P, and otherwise offloaded to PrfaaS. When bandwidth is abundant, compute becomes the scarce resource, and cross-cluster cache transfer can reduce redundant computation. The scheduler considers the best cache across all clusters by letting $l_{\text{prefix}}=\max(l_{\text{prfaas}},\,l_{\text{pd}})$; if $l_{\text{total}}-l_{\text{prefix}}\leq t$, the request is prefilled in PD-P; otherwise it goes to PrfaaS. When the cluster with the longer cache differs from the compute cluster, a cross-cluster cache transfer is performed.
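The routing rule in the paragraph above can be written as a small decision function. The sketch below is one possible reading of that rule: the names `Request`, `Decision`, and `route` are hypothetical, and the `bandwidth_scarce` flag stands in for whatever signal the scheduler derives from egress utilization and queue depth.

```python
from dataclasses import dataclass

@dataclass
class Request:
    l_total: int    # total input length in tokens
    l_prfaas: int   # cached prefix length in the PrfaaS cluster
    l_pd: int       # cached prefix length in the PD cluster

@dataclass
class Decision:
    target: str                         # "PD-P" or "PrfaaS"
    cross_cluster_cache_transfer: bool  # True if the prefix cache must be moved

def route(req: Request, t: int, bandwidth_scarce: bool) -> Decision:
    if bandwidth_scarce:
        # Each cluster only uses its own cache, so no cache crosses the link.
        target = "PD-P" if req.l_total - req.l_pd <= t else "PrfaaS"
        return Decision(target, cross_cluster_cache_transfer=False)
    # Bandwidth is abundant: use the longest cached prefix in either cluster.
    l_prefix = max(req.l_prfaas, req.l_pd)
    target = "PD-P" if req.l_total - l_prefix <= t else "PrfaaS"
    local = req.l_pd if target == "PD-P" else req.l_prfaas
    # A transfer is needed only when the other cluster holds the longer prefix.
    return Decision(target, cross_cluster_cache_transfer=l_prefix > local)

req = Request(l_total=60_000, l_prfaas=50_000, l_pd=0)
print(route(req, t=19_400, bandwidth_scarce=True))
print(route(req, t=19_400, bandwidth_scarce=False))
```

For this example request, the scarce-bandwidth branch offloads to PrfaaS with no cache transfer. The abundant-bandwidth branch keeps the request in PD-P and pulls the 50,000-token prefix from the PrfaaS cluster.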

Long-term: traffic-driven allocation re-optimization.

Over longer timescales, shifts in traffic volume create persistent imbalances between pipeline stages. When $\Theta_{\text{prfaas}}+\Theta_{\text{pd-p}}\ll\Theta_{\text{pd-d}}$, prefill is the system bottleneck, and when $\Theta_{\text{prfaas}}+\Theta_{\text{pd-p}}\gg\Theta_{\text{pd-d}}$, decode is. The scheduler identifies the binding constraint by monitoring queue depth and utilization at each stage. Since traffic variations at this timescale are gradual and often periodic, the scheduler periodically re-evaluates load balance and converts nodes between prefill and decode roles within the PD cluster, adjusting $N_{p}$ and $N_{d}$ to restore the optimality conditions of Equations (7) and (8). As instance counts change, the routing threshold $t$ is re-optimized accordingly.
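The role conversion step can be illustrated with a short sketch. It assumes the threshold $t$ is held fixed during the conversion, so that each PD-P and PD-D instance contributes a constant rate. The function name `rebalance` and all numeric rates are hypothetical.

```python
def rebalance(n_pd, theta_prfaas, rate_p, rate_d):
    """Choose the N_p / N_d split of n_pd instances that maximizes
    min(producer, consumer) throughput, i.e. comes closest to Eq. (8)."""
    best = None
    for n_p in range(1, n_pd):
        n_d = n_pd - n_p
        producer = theta_prfaas + n_p * rate_p   # PrfaaS plus PD-P prefill, req/s
        consumer = n_d * rate_d                  # PD-D decode, req/s
        score = min(producer, consumer)
        if best is None or score > best[0]:
            best = (score, n_p, n_d)
    return best

# Hypothetical per-instance rates in req/s.
print(rebalance(8, theta_prfaas=1.6, rate_p=0.55, rate_d=0.78))  # split (3, 5)
# If outputs grow longer and the decode rate per instance drops, one node
# moves from prefill to decode.
print(rebalance(8, theta_prfaas=1.6, rate_p=0.55, rate_d=0.50))  # split (2, 6)
```

After such a conversion, the threshold would be searched again, since a different $N_{p}$ shifts the balance point of Equation (7).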

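3 Disaggregation over Cross-Datacenter KVCache

3.1 Overview

Figure 3: Deployment topology of the PrfaaS-PD architecture.

3.2 Hybrid Prefix Cache Pool

3.3 PrfaaS-PD Disaggregation

3.4 Modeling and Scheduling

3.4.1 Throughput Model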

Table 4: Notation used in the PrfaaS-PD throughput model.

Symbol (traffic)	Meaning	Symbol (system)	Meaning
$\Lambda$	Request arrival rate (throughput)	$N_{\text{prfaas}}$	Number of PrfaaS prefill instances
$L$	Uncached input length (random variable)	$N_{p}$, $N_{d}$	Number of PD-P / PD-D instances
$t$	Routing threshold	$B_{\text{out}}$	PrfaaS egress bandwidth
$l_{\text{long}}$	$\mathbb{E}[L\mid L>t]$, mean PrfaaS length	$\mathit{BS}_{\max}$	Maximum decode batch size
$l_{\text{short}}$	$\mathbb{E}[L\mid L\leq t]$, mean PD-P length	$T_{\text{prefill}}(l)$	Prefill time for length $l$
$p$	$P(L>t)$, fraction routed to PrfaaS	$T_{\text{decode}}$	Per-step decode time
$L_{\text{out}}$	Mean output length	$\Theta_{\text{prfaas}}$	PrfaaS throughput (req/s)
$S_{\text{kv}}(l)$	KVCache size for length $l$	$\Theta_{\text{pd-p}}$, $\Theta_{\text{pd-d}}$	PD-P / PD-D throughput (req/s)

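Each PrfaaS request passes through two pipeline stages: prefill computation and KVCache transfer. With layer-wise prefill pipelining, the throughput of the PrfaaS cluster is determined by the slower of computation and egress transfer:

$$\Theta_{\text{prfaas}}=\min\left(\frac{N_{\text{prfaas}}}{T_{\text{prefill}}(l_{\text{long}})},\;\frac{B_{\text{out}}}{S_{\text{kv}}(l_{\text{long}})}\right). \tag{3}$$

Since intra-cluster RDMA bandwidth is not a bottleneck, PD-P throughput is determined solely by compute capacity:

$$\Theta_{\text{pd-p}}=\frac{N_{p}}{T_{\text{prefill}}(l_{\text{short}})}. \tag{4}$$

The throughput of the decode stage (PD-D) is

$$\Theta_{\text{pd-d}}=\frac{N_{d}\cdot\mathit{BS}_{\max}}{T_{\text{decode}}\cdot L_{\text{out}}}, \tag{5}$$

where $\mathit{BS}_{\max}$ and $T_{\text{decode}}$ are treated as constants determined by the SLO.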

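[Figure: request flow through the PrfaaS-PD pipeline. A fraction $p$ of requests is prefilled in PrfaaS and a fraction $1-p$ in PD-P; both feed PD-D decode, over Ethernet and RDMA respectively.]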


4 Case Study: Bandwidth Demand of PrfaaS-PD Architecture

In this section, we use an internal 1T-parameter hybrid-architecture model as a case study to evaluate whether selective PrfaaS offloading can keep cross-datacenter KVCache transfer within a realistic bandwidth budget while improving system throughput under realistic hardware and deployment settings. Following the notation in Table 4, we apply the profiling-based throughput model from §3.4 to solve for two key parameters that maximize system throughput: the routing threshold $t$ and the prefill-to-decode instance ratio $N_{p}/N_{d}$ within the local PD cluster. Under the resulting configuration, the heterogeneous PrfaaS-PD deployment achieves 54% higher throughput and 64% lower P90 TTFT than a homogeneous PD-only baseline, and 32% higher throughput than a naive heterogeneous deployment without scheduling. The average PrfaaS cluster egress is only 13 Gbps, well within the Ethernet capacity.

4.1 Setup

Table 5: KVCache size $S_{\text{kv}}$, prefill latency $T_{\text{prefill}}$, and KV throughput $\Phi_{\text{kv}}$ of the internal 1T hybrid model at various input lengths. Prefill latency is benchmarked on 8×H200 with in-house vLLM [29].

Seq Len	KVCache Size	Prefill Latency	KV Throughput
1K	190.8 MiB	0.44 s	3.61 Gbps
8K	308.9 MiB	0.72 s	3.59 Gbps
32K	701.3 MiB	1.84 s	3.19 Gbps
128K	2316.3 MiB	7.40 s	2.62 Gbps

We deploy two clusters connected by a VPC network, providing an aggregate cross-cluster bandwidth of approximately 100 Gbps. The PrfaaS cluster consists of 32 H200 GPUs with higher compute throughput, dedicated to long-context prefill requests with $L>t$. The local PD cluster consists of 64 H20 GPUs operating in conventional PD-disaggregated mode with 800 Gbps RDMA interconnect per node, where the prefill-to-decode ratio can be adjusted according to request traffic. Note that although H200 and H20 have different price points, the PrfaaS cluster only requires high compute throughput. In production deployments, cost-effective accelerators with comparable compute capability can serve as substitutes. As a baseline, we use a homogeneous PD cluster of 96 H20 GPUs.

To reflect realistic serving requirements, the workload uses an internal hybrid model with 1T parameters whose architecture follows Kimi Linear [26], employing an interleaved KDA:MLA layer structure at a 3:1 ratio. The model is deployed at 8 GPUs per instance and profiled separately for prefill and decode using in-house vLLM. Table 5 lists the KVCache size $S_{\text{kv}}$, prefill latency $T_{\text{prefill}}$, and KV throughput $\Phi_{\text{kv}}$ for this model at various input lengths.
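As a consistency check on Table 5, the short script below recomputes the KV throughput column from the other two. It assumes that $\Phi_{\text{kv}}$ is KVCache size divided by prefill latency and that 1 MiB is $2^{20}$ bytes. Neither convention is restated in this section, so both are assumptions of the sketch.

```python
# Table 5 rows: label -> (KVCache size in MiB, prefill latency in s, reported Gbps)
TABLE_5 = {
    "1K": (190.8, 0.44, 3.61),
    "8K": (308.9, 0.72, 3.59),
    "32K": (701.3, 1.84, 3.19),
    "128K": (2316.3, 7.40, 2.62),
}

def kv_throughput_gbps(size_mib, latency_s):
    """KVCache bits produced per second of prefill, in Gbps (1 MiB = 2**20 bytes)."""
    return size_mib * 2**20 * 8 / latency_s / 1e9

for label, (size_mib, latency_s, reported) in TABLE_5.items():
    computed = kv_throughput_gbps(size_mib, latency_s)
    print(f"{label:>4}: computed {computed:.2f} Gbps, reported {reported:.2f} Gbps")
```

Under these assumptions the computed values agree with the reported column to within a few hundredths of a Gbps. Multiplying the 32K and 128K rates by four continuously busy instances gives roughly 10 to 13 Gbps, the same order as the 13 Gbps average egress quoted for the PrfaaS cluster.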

Request input lengths follow a truncated log-normal distribution ($\mu=9.90$, $\sigma=1.00$, truncated to $[128,\,128\text{K}]$) with a mean of approximately 27K tokens, reflecting the long-context distribution characteristic of real-world workloads. The output length is fixed at 1024 tokens, and the SLO (excluding speculative decoding) is set to 40 tokens/s. All throughput and bandwidth results reported below are derived by feeding the measured profiling data into the throughput model of §3.4.

4.2 Throughput Modeling and Solution

(a) Search over prefill/decode allocation.

(b) Search over routing threshold $t$.

Figure 5: Illustration of the grid search process for the two optimization variables. (a) fixes $t$ at the optimum and searches over the prefill/decode instance split within the local PD cluster. (b) fixes $N_{p}=3$, $N_{d}=5$ and searches over $t$.

Using the PrfaaS-PD throughput model from §3.4, we optimize the routing threshold $t$ and the PD-cluster prefill/decode allocation to maximize $\Lambda_{\max}$. We solve them by exhaustive two-dimensional grid search. The optimal configuration is listed in the second column of Table 6.

Figure 5 illustrates the search process, fixing one variable and searching over the other. Figure 5(a) shows that, for a fixed threshold $t$ (and hence fixed $\Theta_{\text{prfaas}}$), the system throughput peaks when prefill and decode throughput are approximately balanced. The optimal allocation in the local PD cluster is $N_{p}=3$ and $N_{d}=5$. Figure 5(b) fixes $N_{p}=3$ and $N_{d}=5$, so the overall $\Lambda_{\max}$ is determined by $\min(\Theta_{\text{prfaas}}/p,\;\Theta_{\text{pd-p}}/(1-p))$. The optimum occurs at the intersection of the two curves, yielding $t=19.4$K. At this operating point, approximately 50% of all requests (the longer ones) are offloaded to the PrfaaS cluster, fully utilizing the high-compute-throughput accelerators.

Table 6: Comparison of optimal configurations across PrfaaS-PD, homogeneous PD, and naive heterogeneous PD deployments.

Metric	PrfaaS-PD	Homogeneous PD	Naive Heterogeneous PD
Threshold $t$	19.4K	—	—
$N_{\text{prfaas}}$ / $N_{p}$ / $N_{d}$	4 / 3 / 5	— / 9 / 3	4 / — / 8
Mean / P90 TTFT (s)	2.22 / 3.51	4.44 / 9.73	1.74 / 3.51
$\Theta_{\text{prfaas}}$ / $\Theta_{\text{pd-p}}$ / $\Theta_{\text{pd-d}}$ (req/s)	1.61 / 1.64 / 3.91	— / 2.11 / 2.35	2.45 / — / 6.25
$\Lambda_{\max}$ (req/s)	3.24	2.11	2.45
Ratio	1.54×	1.00×	1.16×
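The $\Lambda_{\max}$ row of Table 6 can be cross-checked against Equation (6) using the per-stage throughputs in the same table. The sketch below is an illustrative check, not part of the original evaluation. It takes $p=0.496$ from the routed fraction reported in §4.3.1 and treats the homogeneous and naive deployments as the limiting cases $p=0$ and $p=1$, which is an assumption about how those baselines map onto the model.

```python
def lambda_max(theta_prfaas, theta_pdp, theta_pdd, p):
    """Eq. (6); a prefill stage that receives no traffic (p = 0 or p = 1) is skipped."""
    stages = [theta_pdd]
    if p > 0:
        stages.append(theta_prfaas / p)
    if p < 1:
        stages.append(theta_pdp / (1 - p))
    return min(stages)

# name -> (theta_prfaas, theta_pdp, theta_pdd, p, reported Lambda_max), all in req/s
ROWS = {
    "PrfaaS-PD": (1.61, 1.64, 3.91, 0.496, 3.24),
    "Homogeneous PD": (0.0, 2.11, 2.35, 0.0, 2.11),
    "Naive heterogeneous PD": (2.45, 0.0, 6.25, 1.0, 2.45),
}

baseline = ROWS["Homogeneous PD"][4]
for name, (t_prfaas, t_pdp, t_pdd, p, reported) in ROWS.items():
    model = lambda_max(t_prfaas, t_pdp, t_pdd, p)
    print(f"{name}: Eq. (6) gives {model:.2f} req/s, "
          f"reported {reported:.2f}, ratio {reported / baseline:.2f}x")
```

The recomputed values match the table up to rounding of the published inputs, and the ratios reproduce the 1.54× and 1.16× entries. In every deployment the binding stage is prefill rather than decode.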

4.3 Result Analysis

4.3.1 Cross-Datacenter Bandwidth Utilization

A key prerequisite for the PrfaaS-PD architecture is that the inter-cluster network link can sustain the KVCache transfer demand. We evaluate this by measuring the KV throughput of the PrfaaS cluster under the deployed workload distribution.

With the routing threshold set to $t=19.4$K, 49.6% of requests are routed to PrfaaS, and the offloaded subset has $\mathbb{E}[L\mid L>t]\approx 44$K tokens. The aggregate PrfaaS egress load is therefore approximately 13 Gbps, consuming only 13% of the Ethernet link. This confirms that, given proper scheduling, the KVCache of a hybrid-architecture model can be transported over commodity Ethernet with substantial headroom. By contrast, conventional Transformer models with full KV heads would produce significantly larger KVCache volumes, potentially requiring RDMA-class interconnects.
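The routed fraction and conditional mean quoted above can be recomputed in closed form from the truncated log-normal workload of §4.1. The sketch below is a cross-check under stated assumptions: it reads the upper truncation bound 128K as $128\times1024$ tokens and the threshold 19.4K as 19,400 tokens. Other readings of "K" shift the results by about one percentage point.

```python
import math

MU, SIGMA = 9.90, 1.00
LO, HI = 128, 128 * 1024   # truncation bounds in tokens (assumes K = 1024)
T = 19_400                 # routing threshold in tokens (assumes 19.4K = 19,400)

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z(x, shift=0.0):
    return (math.log(x) - MU) / SIGMA - shift

def prob(a, b):
    """P(a < L <= b) for the untruncated log-normal."""
    return phi(z(b)) - phi(z(a))

def partial_mean(a, b):
    """E[L; a < L <= b] for the untruncated log-normal."""
    return math.exp(MU + SIGMA**2 / 2) * (phi(z(b, SIGMA)) - phi(z(a, SIGMA)))

mass = prob(LO, HI)
p = prob(T, HI) / mass                       # fraction routed to PrfaaS
mean_all = partial_mean(LO, HI) / mass       # mean input length
l_long = partial_mean(T, HI) / prob(T, HI)   # E[L | L > t]
l_short = partial_mean(LO, T) / prob(LO, T)  # E[L | L <= t]

print(f"p = {p:.3f}")
print(f"mean = {mean_all / 1024:.1f}K, l_long = {l_long / 1024:.1f}K, "
      f"l_short = {l_short / 1024:.1f}K")
```

Under these assumptions the script reproduces a routed fraction of roughly one half, a mean input length near 27K, and a conditional mean near 44K, in line with the figures reported in the text.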

4.3.2 Comparison with Homogeneous PD

We apply the same throughput modeling methodology to the homogeneous PD baseline of 96 H20 GPUs. The results are shown in the third column of Table 6. In the homogeneous cluster, the system throughput is also maximized when prefill and decode throughput are balanced, yielding an optimal allocation of 9 prefill instances and 3 decode instances.

Thanks to the superior compute throughput of the PrfaaS cluster, the PrfaaS-PD configuration requires two fewer prefill instances in the local PD cluster, freeing capacity for additional decode slots. The overall system throughput improves by 54%.

Another key benefit of PrfaaS-PD is the reduction in TTFT, particularly for long-context requests. In the homogeneous baseline, long requests compete with short requests for prefill capacity, inflating queuing delays. In PrfaaS-PD, long requests are offloaded to the dedicated high-throughput PrfaaS cluster, where prefill completes significantly faster than in the PD cluster even after accounting for cross-cluster transfer latency. As shown in Table 6, the mean and P90 TTFT are reduced by 50% and 64%, respectively, compared to the homogeneous baseline.

4.3.3 Comparison with Naive Heterogeneous PD

The comparison with homogeneous PD demonstrates the fundamental performance advantage of heterogeneous deployment. The comparison with naive heterogeneous PD further highlights the importance of scheduling in heterogeneous systems. In the naive heterogeneous PD configuration, no scheduling optimization is applied: all prefill is assigned to H200 GPUs and all decode to H20 GPUs, without length-based routing or load balancing. As shown in Table 6, the naive heterogeneous PD achieves only $1.16\times$ throughput over the homogeneous baseline, a 25% reduction compared to PrfaaS-PD. This degradation stems from the severe imbalance between prefill and decode throughput, and more fundamentally from treating heterogeneous prefill as a universal path rather than selectively offloading only the requests that benefit most from PrfaaS.

4.4 Summary

This case study demonstrates that for hybrid-architecture models, cross-datacenter KVCache becomes practical when it is paired with selective PrfaaS offloading rather than applied indiscriminately.

Regarding feasibility, the KVCache transfer of a hybrid model consumes only 13% of a 100 Gbps Ethernet link. This is well within reach of commodity Ethernet infrastructure and far below the bandwidth demands of RDMA interconnects, confirming the viability of cross-datacenter KVCache transfer.

Regarding effectiveness, the PrfaaS-PD configuration (32 H200 GPUs for PrfaaS, 64 H20 GPUs for local PD) achieves 54% higher throughput and 64% lower P90 TTFT over a 96-H20 homogeneous PD-only baseline; at equal cost, the throughput gain is approximately 15%. These gains stem from offloading compute-intensive long-context prefill to high-throughput PrfaaS accelerators. We note that H200 and H20 serve here as a representative hardware pair, not the sole viable combination. Cost-effective prefill-specialized chips can further reduce deployment cost in production.

In addition, the PrfaaS cluster is currently compute-bound with ample bandwidth headroom. Under larger-scale deployments or higher-bandwidth dedicated lines, the PrfaaS cluster can be further expanded to yield additional throughput gains. For example, in IDC-scale deployments with thousands of PrfaaS GPUs, the aggregate egress bandwidth required for KVCache transfer remains on the order of 1 Tbps, well within the capacity of modern datacenter fabrics, enabling further improvements in throughput and resource efficiency.


5 Discussion

Cross-datacenter KVCache extends PD disaggregation from a single tightly coupled cluster to loosely connected heterogeneous clusters. Its practicality depends on coordinated progress across model architecture, system design, and hardware. In this section, we discuss how these trends reinforce one another and what they imply for next-generation LLM serving systems.

KVCache-friendly model architecture.

As context windows continue to grow, KVCache increasingly dominates inference cost in storage and transfer. Architectural techniques such as MLA, sliding window attention, and linear attention have already shown that KVCache size can be reduced substantially without sacrificing model capability. Going forward, model co-design is likely to optimize not only FLOPs but also KVCache transfer volume. These improvements directly reduce the latency and bandwidth cost of cross-datacenter KVCache, expanding the range of deployments in which PrfaaS-PD is cost-effective.

KVCache compression and reuse.

Beyond architectural innovations, a growing body of work reduces LLM inference cost through KVCache compression and reuse at the algorithm or system level. Methods such as H2O [35] and KIVI [14] selectively evict or quantize KVCache entries to shrink memory footprints. CacheGen [13] applies conventional compression techniques to reduce KVCache transfer volume, while CacheBlend [34] and FusionRAG [30] enable reuse of approximately matched KVCache across requests. Together, these techniques complement KVCache-friendly model design by reducing effective memory pressure and network traffic, thereby making cross-datacenter KVCache more robust under production workloads.

Phase-specialized inference hardware.

The disaggregation between prefill and decode is also becoming visible in hardware design. Prefill is compute-intensive, whereas decode is dominated by memory bandwidth. Recent chip roadmaps are increasingly phase-specialized: NVIDIA Rubin CPX [19] emphasizes compute throughput for prefill, while chips such as the LPU [8] and Taalas HC1 [25] feature extremely high memory bandwidth for fast decode. Cross-datacenter KVCache fits this trend naturally. It removes the requirement that heterogeneous chips be deployed within the same tightly coupled network domain, allowing operators to size prefill and decode clusters independently and to deploy each phase on the hardware best suited to it.


6 Related Work

System optimization for online LLM inference has gradually shifted from monolithic single-cluster engines toward disaggregated, heterogeneity-aware, and KVCache-centric pipelines. On one hand, disaggregated serving splits each request into a compute-intensive prefill phase and a memory-bandwidth-intensive decode phase to reduce inter-phase interference and allow independent scaling. Splitwise [21] and DistServe [37] formulate PD disaggregation from cost/power and goodput perspectives, respectively, with corresponding deployment, scheduling, and placement strategies, showing that PD disaggregation can simultaneously improve throughput and reduce cost under appropriate SLO and hardware constraints. On the other hand, as cluster hardware and interconnects become increasingly heterogeneous, Helix [15], Hetis [17], and LLM-PQ [36] incorporate heterogeneous GPUs/networks and phase-level differences into the optimization space to achieve throughput or latency gains through phase-specialized hardware placement. Meanwhile, DynamoLLM [24] and FREESH [9] emphasize system-level resource reconfiguration and cross-domain scheduling from the perspective of energy, cost, and carbon efficiency while meeting serving SLOs. More critically, KVCache has evolved from per-request ephemeral state into a first-class system resource: Mooncake [22] introduces a global KVCache pool to improve cross-node and cross-request reuse; CacheBlend [34] and FusionRAG [30] significantly reduce TTFT through non-prefix KVCache fusion reuse; KIVI [14], KVQuant [10], and H2O [35] further shrink KVCache volume via quantization or importance-based eviction to improve long-context serviceability. However, no prior work jointly optimizes cross-datacenter prefill offloading, heterogeneous deployment, and bandwidth/cache-aware scheduling within a unified system. 
This paper approaches the problem from a cross-datacenter KVCache perspective, combining disaggregated inference with heterogeneous resource orchestration to build a low-cost, high-throughput serving system.


7 Conclusion

To address the practical deployment challenges of heterogeneous disaggregated inference, we propose the concept of cross-datacenter KVCache, extending disaggregated serving from single homogeneous clusters to cross-cluster heterogeneous deployments. On this basis, we design the PrfaaS-PD disaggregation architecture, which augments system serving throughput at low cost through heterogeneous PrfaaS clusters connected via commodity Ethernet. We envision that the cross-datacenter KVCache paradigm will co-evolve with next-generation models, hardware, and networks to enable highly efficient LLM serving at scale.


References

[1] D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker, T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell, et al. (2020) Think fast: a tensor streaming processor (tsp) for accelerating deep learning workloads. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 145–158. Cited by: §1.
[2] S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025) Gpt-oss-120b gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Cited by: §1, §2.2.
[3] I. AI (2026) Ring-2.5-1t. Note: https://github.com/inclusionAI/Ring-V2.5 Cited by: §1, §2.2, Table 1.
[4] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023) Gqa: training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901. Cited by: Table 1.
[5] I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. Cited by: §1, Table 1.
[6] R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. Cited by: §2.2.
[7] L. Corp. (2026) SGLang. Note: https://github.com/sgl-project/sglang Cited by: §1, Table 3, Table 3.
[8] Groq (2025) What is a language processing unit?. Note: https://groq.com/blog/the-groq-lpu-explained Cited by: §1, §5.
[9] X. He, Z. Fang, J. Lian, D. H. Tsang, B. Zhang, and Y. Chen (2025) FREESH: fair, resource-and energy-efficient scheduling for llm serving on heterogeneous gpus. arXiv preprint arXiv:2511.00807. Cited by: §6.
[10] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami (2024) Kvquant: towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems 37, pp. 1270–1303. Cited by: §6.
[11] A. Huang, A. Li, A. Kong, B. Wang, B. Jiao, B. Dong, B. Wang, B. Chen, B. Li, B. Ma, et al. (2026) Step 3.5 flash: open frontier-level intelligence with 11b active parameters. arXiv preprint arXiv:2602.10604. Cited by: §1, §2.2.
[12] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. (2024) Deepseek-v2: a strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434. Cited by: Table 1.
[13] Y. Liu, H. Li, Y. Cheng, S. Ray, Y. Huang, Q. Zhang, K. Du, J. Yao, S. Lu, G. Ananthanarayanan, et al. (2024) Cachegen: kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pp. 38–56. Cited by: §5.
[14] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu (2024) KIVI: a tuning-free asymmetric 2bit quantization for kv cache. In International Conference on Machine Learning, pp. 32332–32344. Cited by: §5, §6.
[15] Y. Mei, Y. Zhuang, X. Miao, J. Yang, Z. Jia, and R. Vinayak (2025) Helix: serving large language models over heterogeneous gpus and network via max-flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 586–602. Cited by: §6.
[16] Minimax (2026) MiniMax m2.5: built for real-world productivity. Note: https://www.minimax.io/news/minimax-m25 Cited by: Table 1.
[17] Z. Mo, J. Liao, H. Xu, Z. Zhou, and C. Xu (2025) Hetis: serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1710–1724. Cited by: §6.
[18] NVIDIA (2025) Nemotron 3 Nano: open, efficient mixture-of-experts hybrid Mamba-Transformer model for Agentic reasoning. Note: Technical report. Cited by: §1, §2.2.
[19] NVIDIA (2025) NVIDIA rubin cpx accelerates inference performance and efficiency for 1m+ token context workloads. Note: https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/ Cited by: §1, §5.
[20] NVIDIA (2026) Dynamo. Note: https://github.com/ai-dynamo/dynamo Cited by: §1.
[21] P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini (2024) Splitwise: efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 118–132. Cited by: §6.
[22] R. Qin, Z. Li, W. He, J. Cui, H. Tang, F. Ren, T. Ma, S. Cai, Y. Zhang, M. Zhang, et al. (2024) Mooncake: a kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage. Cited by: §1, §6.
[23] Z. Qin, W. Sun, D. Li, X. Shen, W. Sun, and Y. Zhong (2024) Lightning attention-2: a free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658. Cited by: Table 1.
[24] J. Stojkovic, C. Zhang, Í. Goiri, J. Torrellas, and E. Choukse (2025) Dynamollm: designing llm inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1348–1362. Cited by: §6.
[25] Taalas (2025) Taalas hc1. Note: https://taalas.com/products Cited by: §5.
[26] K. Team, Y. Zhang, Z. Lin, X. Yao, J. Hu, F. Meng, C. Liu, X. Men, S. Yang, Z. Li, et al. (2025) Kimi linear: an expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692. Cited by: §1, §1, §2.2, Table 1, Table 1, §4.1.
[27] Q. Team (2026) Qwen3.5: towards native multimodal agents. Note: https://qwen.ai/blog?id=qwen3.5 Cited by: §1, §2.2, Table 1.
[28] vLLM Team at IBM (2025) Hybrid models as first-class citizens in vLLM. Note: https://pytorch.org/blog/hybrid-models-as-first-class-citizens-in-vllm/ PyTorch Blog, November 2025. Cited by: §3.2.
[29] vLLM Team (2026) VLLM. Note: https://github.com/vllm-project/vllm Cited by: §1, Table 5, Table 5.
[30] J. Wang, W. Xie, M. Zhang, B. Zhang, J. Dong, Y. Zhu, C. Lin, J. Tang, Y. Han, Z. Ai, et al. (2026) From prefix cache to fusion rag cache: accelerating llm inference in retrieval-augmented generation. arXiv preprint arXiv:2601.12904. Cited by: §5, §6.
[31] B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. (2026) Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780. Cited by: §1, §2.2, Table 1.
[32] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: Table 1.
[33] S. Yang, J. Kautz, and A. Hatamizadeh (2025) Gated delta networks: improving mamba2 with delta rule. In International Conference on Learning Representations (ICLR), Cited by: Table 1.
[34] J. Yao, H. Li, Y. Liu, S. Ray, Y. Cheng, Q. Zhang, K. Du, S. Lu, and J. Jiang (2025) Cacheblend: fast large language model serving for rag with cached knowledge fusion. In Proceedings of the twentieth European conference on computer systems, pp. 94–109. Cited by: §5, §6.
[35] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. (2023) H2o: heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, pp. 34661–34710. Cited by: §5, §6.
[36] J. Zhao, B. Wan, C. Wu, Y. Peng, and H. Lin (2024) Llm-pq: serving llm on heterogeneous clusters with phase-aware partition and adaptive quantization. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 460–462. Cited by: §6.
[37] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang (2024) DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 193–210. Cited by: §6.


参考文献

[1] D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker, T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell, et al. (2020) Think fast: a tensor streaming processor (tsp) for accelerating deep learning workloads. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 145–158. 被引用于：§1。
[2] S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025) Gpt-oss-120b gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. 被引用于：§1, §2.2。
[3] I. AI (2026) Ring-2.5-1t. 注：https://github.com/inclusionAI/Ring-V2.5 被引用于：§1, §2.2, Table 1。
[4] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023) Gqa: training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901. 被引用于：Table 1。
[5] I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. 被引用于：§1, Table 1。
[6] R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. 被引用于：§2.2。
[7] L. Corp. (2026) SGLang. 注：https://github.com/sgl-project/sglang 被引用于：§1, Table 3, Table 3。
[8] Groq (2025) What is a language processing unit?. 注：https://groq.com/blog/the-groq-lpu-explained 被引用于：§1, §5。
[9] X. He, Z. Fang, J. Lian, D. H. Tsang, B. Zhang, and Y. Chen (2025) FREESH: fair, resource-and energy-efficient scheduling for llm serving on heterogeneous gpus. arXiv preprint arXiv:2511.00807. 被引用于：§6。
[10] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami (2024) Kvquant: towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems 37, pp. 1270–1303. 被引用于：§6。
[11] A. Huang, A. Li, A. Kong, B. Wang, B. Jiao, B. Dong, B. Wang, B. Chen, B. Li, B. Ma, et al. (2026) Step 3.5 flash: open frontier-level intelligence with 11b active parameters. arXiv preprint arXiv:2602.10604. 被引用于：§1, §2.2。
[12] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. (2024) Deepseek-v2: a strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434. 被引用于：Table 1。
[13] Y. Liu, H. Li, Y. Cheng, S. Ray, Y. Huang, Q. Zhang, K. Du, J. Yao, S. Lu, G. Ananthanarayanan, et al. (2024) Cachegen: kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pp. 38–56. 被引用于：§5。
[14] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu (2024) KIVI: a tuning-free asymmetric 2bit quantization for kv cache. In International Conference on Machine Learning, pp. 32332–32344. 被引用于：§5, §6。
[15] Y. Mei, Y. Zhuang, X. Miao, J. Yang, Z. Jia, and R. Vinayak (2025) Helix: serving large language models over heterogeneous gpus and network via max-flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 586–602. 被引用于：§6。
[16] Minimax (2026) MiniMax m2.5: built for real-world productivity. 注：https://www.minimax.io/news/minimax-m25 被引用于：Table 1。
[17] Z. Mo, J. Liao, H. Xu, Z. Zhou, and C. Xu (2025) Hetis: serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1710–1724. 被引用于：§6。
[18] NVIDIA (2025) Nemotron 3 Nano: open, efficient mixture-of-experts hybrid Mamba-Transformer model for Agentic reasoning. 注：Technical report 外部链接：Link 被引用于：§1, §2.2。
[19] NVIDIA (2025) NVIDIA rubin cpx accelerates inference performance and efficiency for 1m+ token context workloads. 注：https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/ 被引用于：§1, §5。
[20] NVIDIA (2026) Dynamo. 注：https://github.com/ai-dynamo/dynamo 被引用于：§1。
[21] P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini (2024) Splitwise: efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 118–132. 被引用于：§6。
[22] R. Qin, Z. Li, W. He, J. Cui, H. Tang, F. Ren, T. Ma, S. Cai, Y. Zhang, M. Zhang, et al. (2024) Mooncake: a kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage. 被引用于：§1, §6。
[23] Z. Qin, W. Sun, D. Li, X. Shen, W. Sun, and Y. Zhong (2024) Lightning attention-2: a free lunch for handling unlimited sequence lengths in large language models. 外部链接：2401.04658 被引用于：Table 1。
[24] J. Stojkovic, C. Zhang, Í. Goiri, J. Torrellas, and E. Choukse (2025) Dynamollm: designing llm inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1348–1362. 被引用于：§6。
[25] Taalas (2025) Taalas hc1. 注：https://taalas.com/products 被引用于：§5。
[26] K. Team, Y. Zhang, Z. Lin, X. Yao, J. Hu, F. Meng, C. Liu, X. Men, S. Yang, Z. Li, et al. (2025) Kimi linear: an expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692. 被引用于：§1, §1, §2.2, Table 1, Table 1, §4.1。
[27] Q. Team (2026) Qwen3.5: towards native multimodal agents. 注：https://qwen.ai/blog?id=qwen3.5 被引用于：§1, §2.2, Table 1。
[28] vLLM Team at IBM (2025) Hybrid models as first-class citizens in vLLM. 注：https://pytorch.org/blog/hybrid-models-as-first-class-citizens-in-vllm/PyTorch Blog, November 2025 被引用于：§3.2。
[29] vLLM Team (2026) VLLM. 注：https://github.com/vllm-project/vllm 被引用于：§1, Table 5, Table 5。
[30] J. Wang, W. Xie, M. Zhang, B. Zhang, J. Dong, Y. Zhu, C. Lin, J. Tang, Y. Han, Z. Ai, et al. (2026) From prefix cache to fusion rag cache: accelerating llm inference in retrieval-augmented generation. arXiv preprint arXiv:2601.12904. 被引用于：§5, §6。
[31] B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. (2026) Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780. 被引用于：§1, §2.2, Table 1。
[32] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. 被引用于：Table 1。
[33] S. Yang, J. Kautz, and A. Hatamizadeh (2025) Gated delta networks: improving mamba2 with delta rule. In International Conference on Learning Representations (ICLR), 被引用于：Table 1。
[34] J. Yao, H. Li, Y. Liu, S. Ray, Y. Cheng, Q. Zhang, K. Du, S. Lu, and J. Jiang (2025) Cacheblend: fast large language model serving for rag with cached knowledge fusion. In Proceedings of the twentieth European conference on computer systems, pp. 94–109. 被引用于：§5, §6。
[35] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. (2023) H2o: heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, pp. 34661–34710. 被引用于：§5, §6。
[36] J. Zhao, B. Wan, C. Wu, Y. Peng, and H. Lin (2024) Llm-pq: serving llm on heterogeneous clusters with phase-aware partition and adaptive quantization. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 460–462. 被引用于：§6。
[37] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang (2024) DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 193–210. 被引用于：§6。

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Ruoyu Qin¹,², Weiran He¹, Yaoyu Wang¹, Zheming Li¹, Xinran Xu¹, Yongwei Wu², Weimin Zheng², Mingxing Zhang². ¹Moonshot AI, ²Tsinghua University. Corresponding author: zhang_mingxing@mail.tsinghua.edu.cn

Abstract

1 Introduction

(a) Status quo: Tightly coupled single-cluster inference.

(b) PrfaaS: Multi-cluster disaggregated inference via cross-datacenter KVCache.

Figure 1: Comparison of two deployment paradigms for PD-disaggregated LLM serving.

2 Background

2.1 The Bandwidth Wall in Conventional PD Disaggregation

$$\Phi_{\text{kv}}(l)=\frac{S_{\text{kv}}(l)}{T_{\text{prefill}}(l)}, \tag{1}$$
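
As a hedged illustration of Eq. (1), the sketch below computes KV throughput as the KVCache bytes a request produces divided by its prefill time. The function name `kv_throughput_gbps`, the conversion to Gbit/s, and the sample numbers are assumptions made for illustration; they are not measurements from the paper.

```python
def kv_throughput_gbps(s_kv_bytes: float, t_prefill_s: float) -> float:
    """Eq. (1): Phi_kv(l) = S_kv(l) / T_prefill(l), converted to Gbit/s."""
    if t_prefill_s <= 0:
        raise ValueError("prefill time must be positive")
    return s_kv_bytes * 8 / t_prefill_s / 1e9


# Hypothetical request: prefill emits 2 GB of KVCache in 4 s.
example = kv_throughput_gbps(s_kv_bytes=2e9, t_prefill_s=4.0)
print(f"{example:.1f} Gbit/s")
```

A smaller S_kv at the same prefill time lowers this rate proportionally, which is the quantity a cross-cluster link has to sustain per prefill instance.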

2.2 Hybrid Attention Changes the PD Deployment Boundary

2.3 From Intra-Datacenter PD to Prefill-as-a-Service Paradigm

$$B_{\text{out}}=\frac{N}{P}\cdot\frac{\mathbb{E}[S_{\text{kv}}]}{\mathbb{E}[T_{\text{prefill}}]}\approx\frac{N}{P}\cdot\Phi_{\text{kv}}(L_{\text{avg}}), \tag{2}$$
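
Eq. (2) can be sketched numerically as follows. This excerpt does not define N and P, so the code assumes N is the number of prefill GPUs and P the number of GPUs per prefill instance, making N/P the instance count. That reading, the function name `egress_bandwidth_gbps`, and all numbers are assumptions.

```python
def egress_bandwidth_gbps(n: int, p: int, mean_s_kv_bytes: float,
                          mean_t_prefill_s: float) -> float:
    """Eq. (2): B_out = (N / P) * E[S_kv] / E[T_prefill], in Gbit/s.

    Assumption: N / P is the number of prefill instances sending KVCache.
    """
    if p <= 0 or mean_t_prefill_s <= 0:
        raise ValueError("P and the mean prefill time must be positive")
    instances = n / p
    return instances * mean_s_kv_bytes * 8 / mean_t_prefill_s / 1e9


# Hypothetical cluster: 64 GPUs, 8 per instance, 1 GB KVCache per 2 s prefill.
demand = egress_bandwidth_gbps(n=64, p=8, mean_s_kv_bytes=1e9, mean_t_prefill_s=2.0)
print(f"{demand:.1f} Gbit/s")
```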

3 Disaggregation over Cross-Datacenter KVCache

3.1 Overview

Figure 3: Deployment topology of the PrfaaS-PD architecture.

3.2 Hybrid Prefix Cache Pool

3.3 PrfaaS-PD Disaggregation

3.4 Modeling and Scheduling

3.4.1 Throughput Model

$$\Theta_{\text{prfaas}}=\min\!\left(\frac{N_{\text{prfaas}}}{T_{\text{prefill}}(l_{\text{long}})},\;\frac{B_{\text{out}}}{S_{\text{kv}}(l_{\text{long}})}\right). \tag{3}$$

Since intra-cluster RDMA bandwidth is not a bottleneck, PD-P throughput is determined solely by compute capacity:

$$\Theta_{\text{pd-p}}=\frac{N_{p}}{T_{\text{prefill}}(l_{\text{short}})}. \tag{4}$$

The decode-stage (PD-D) throughput is

$$\Theta_{\text{pd-d}}=\frac{N_{d}\cdot\mathit{BS}_{\max}}{T_{\text{decode}}\cdot L_{\text{out}}}, \tag{5}$$

where $\mathit{BS}_{\max}$ and $T_{\text{decode}}$ are treated as SLO-governed constants.

[Figure: request routing. A fraction p of requests goes to PrfaaS Prefill and a fraction 1−p to PD-P Prefill; both feed PD-D Decode. Link labels: Ethernet, RDMA.]

The end-to-end system throughput is limited by the slowest stage, accounting for the routing split:

$$\Lambda_{\max}=\min\!\left(\frac{\Theta_{\text{prfaas}}}{p},\;\frac{\Theta_{\text{pd-p}}}{1-p},\;\Theta_{\text{pd-d}}\right). \tag{6}$$
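
The throughput model in Eqs. (3) to (6) can be sketched as a few small functions. This is a minimal illustration, not the authors' implementation: the `Capacity` container, the field names, the reading of T_decode as time per decode step, and all numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    n_prfaas: float         # N_prfaas: PrfaaS prefill capacity units
    n_p: float              # N_p: local PD prefill capacity units
    n_d: float              # N_d: local PD decode capacity units
    t_prefill_long: float   # T_prefill(l_long), seconds per long request
    t_prefill_short: float  # T_prefill(l_short), seconds per short request
    s_kv_long: float        # S_kv(l_long), bytes per long request
    b_out: float            # B_out, cross-cluster egress in bytes per second
    bs_max: float           # BS_max, SLO-governed decode batch size
    t_decode: float         # T_decode, assumed seconds per decode step
    l_out: float            # L_out, output tokens per request


def theta_prfaas(c: Capacity) -> float:
    """Eq. (3): the smaller of the compute limit and the egress limit."""
    return min(c.n_prfaas / c.t_prefill_long, c.b_out / c.s_kv_long)


def theta_pd_p(c: Capacity) -> float:
    """Eq. (4): local prefill is bounded by compute only."""
    return c.n_p / c.t_prefill_short


def theta_pd_d(c: Capacity) -> float:
    """Eq. (5): decode throughput in requests per second."""
    return c.n_d * c.bs_max / (c.t_decode * c.l_out)


def lambda_max(c: Capacity, p: float) -> float:
    """Eq. (6): end-to-end throughput when a fraction p goes to PrfaaS."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return min(theta_prfaas(c) / p, theta_pd_p(c) / (1.0 - p), theta_pd_d(c))


# Hypothetical numbers, chosen only to exercise the formulas.
cfg = Capacity(n_prfaas=4, n_p=2, n_d=8, t_prefill_long=8.0,
               t_prefill_short=0.5, s_kv_long=2e9, b_out=2e9,
               bs_max=32, t_decode=0.05, l_out=512)
print(theta_prfaas(cfg), theta_pd_p(cfg), theta_pd_d(cfg), lambda_max(cfg, 0.25))
```

With these made-up inputs the PrfaaS stage is compute-bound rather than bandwidth-bound, and at p = 0.25 it is also the stage that caps end-to-end throughput.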

3.4.2 Throughput-Optimal Configuration

$$\frac{\Theta_{\text{prfaas}}}{p}=\frac{\Theta_{\text{pd-p}}}{1-p}. \tag{7}$$

$$\Theta_{\text{prfaas}}+\Theta_{\text{pd-p}}=\Theta_{\text{pd-d}}. \tag{8}$$
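
Cross-multiplying Eq. (7) gives p = Θ_prfaas / (Θ_prfaas + Θ_pd-p). At that split, both prefill terms in Eq. (6) equal Θ_prfaas + Θ_pd-p, which is the sum Eq. (8) compares with Θ_pd-d. The sketch below works through that algebra; the helper names and the sample values are hypothetical.

```python
import math


def balanced_split(theta_prfaas: float, theta_pd_p: float) -> float:
    """Solve Eq. (7) for p: theta_prfaas / p == theta_pd_p / (1 - p)."""
    if theta_prfaas <= 0 or theta_pd_p <= 0:
        raise ValueError("both prefill throughputs must be positive")
    return theta_prfaas / (theta_prfaas + theta_pd_p)


def decode_is_balanced(theta_prfaas: float, theta_pd_p: float,
                       theta_pd_d: float, rel_tol: float = 1e-9) -> bool:
    """Check Eq. (8): total prefill throughput matches decode throughput."""
    return math.isclose(theta_prfaas + theta_pd_p, theta_pd_d, rel_tol=rel_tol)


# Hypothetical stage throughputs in requests per second.
p_star = balanced_split(2.0, 6.0)
print(p_star)                             # 0.25
print(2.0 / p_star, 6.0 / (1 - p_star))   # both prefill terms equal 8.0
print(decode_is_balanced(2.0, 6.0, 8.0))  # True
```

In practice the stage throughputs depend on the allocation and the routing threshold, so a real configuration search would re-evaluate them per candidate rather than treat them as fixed inputs.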

3.4.3 Dual-Timescale Scheduling

Short-term: bandwidth- and cache-aware routing.

Long-term: traffic-driven allocation re-optimization.

4 Case Study: Bandwidth Demand of PrfaaS-PD Architecture

4.1 Setup

4.2 Throughput Modeling and Solution

(a) Search over prefill/decode allocation.

(b) Search over routing threshold $t$.

4.3 Result Analysis

4.3.1 Cross-Datacenter Bandwidth Utilization

4.3.2 Comparison with Homogeneous PD

4.3.3 Comparison with Naive Heterogeneous PD

4.4 Summary

This case study demonstrates that for hybrid-architecture models, cross-datacenter KVCache becomes practical when it is paired with selective PrfaaS offloading rather than applied indiscriminately.

5 Discussion

KVCache-friendly model architecture.

KVCache compression and reuse.

Phase-specialized inference hardware.

6 Related Work

7 Conclusion

References

[1] D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker, T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell, et al. (2020) Think fast: a tensor streaming processor (tsp) for accelerating deep learning workloads. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 145–158. Cited by: §1.
[2] S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025) Gpt-oss-120b gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Cited by: §1, §2.2.
[3] I. AI (2026) Ring-2.5-1t. Note: https://github.com/inclusionAI/Ring-V2.5 Cited by: §1, §2.2, Table 1.
[4] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023) Gqa: training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901. Cited by: Table 1.
[5] I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. Cited by: §1, Table 1.
[6] R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. Cited by: §2.2.
[7] L. Corp. (2026) SGLang. Note: https://github.com/sgl-project/sglang Cited by: §1, Table 3, Table 3.
[8] Groq (2025) What is a language processing unit?. Note: https://groq.com/blog/the-groq-lpu-explained Cited by: §1, §5.
[9] X. He, Z. Fang, J. Lian, D. H. Tsang, B. Zhang, and Y. Chen (2025) FREESH: fair, resource-and energy-efficient scheduling for llm serving on heterogeneous gpus. arXiv preprint arXiv:2511.00807. Cited by: §6.
[10] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami (2024) Kvquant: towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems 37, pp. 1270–1303. Cited by: §6.
[11] A. Huang, A. Li, A. Kong, B. Wang, B. Jiao, B. Dong, B. Wang, B. Chen, B. Li, B. Ma, et al. (2026) Step 3.5 flash: open frontier-level intelligence with 11b active parameters. arXiv preprint arXiv:2602.10604. Cited by: §1, §2.2.
[12] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. (2024) Deepseek-v2: a strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434. Cited by: Table 1.
[13] Y. Liu, H. Li, Y. Cheng, S. Ray, Y. Huang, Q. Zhang, K. Du, J. Yao, S. Lu, G. Ananthanarayanan, et al. (2024) Cachegen: kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pp. 38–56. Cited by: §5.
[14] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu (2024) KIVI: a tuning-free asymmetric 2bit quantization for kv cache. In International Conference on Machine Learning, pp. 32332–32344. Cited by: §5, §6.
[15] Y. Mei, Y. Zhuang, X. Miao, J. Yang, Z. Jia, and R. Vinayak (2025) Helix: serving large language models over heterogeneous gpus and network via max-flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 586–602. Cited by: §6.
[16] Minimax (2026) MiniMax m2.5: built for real-world productivity. Note: https://www.minimax.io/news/minimax-m25 Cited by: Table 1.
[17] Z. Mo, J. Liao, H. Xu, Z. Zhou, and C. Xu (2025) Hetis: serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1710–1724. Cited by: §6.
[18] NVIDIA (2025) Nemotron 3 Nano: open, efficient mixture-of-experts hybrid Mamba-Transformer model for Agentic reasoning. Note: Technical report External Links: Link Cited by: §1, §2.2.
[19] NVIDIA (2025) NVIDIA rubin cpx accelerates inference performance and efficiency for 1m+ token context workloads. Note: https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/ Cited by: §1, §5.
[20] NVIDIA (2026) Dynamo. Note: https://github.com/ai-dynamo/dynamo Cited by: §1.
[21] P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini (2024) Splitwise: efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 118–132. Cited by: §6.
[22] R. Qin, Z. Li, W. He, J. Cui, H. Tang, F. Ren, T. Ma, S. Cai, Y. Zhang, M. Zhang, et al. (2024) Mooncake: a kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage. Cited by: §1, §6.
[23] Z. Qin, W. Sun, D. Li, X. Shen, W. Sun, and Y. Zhong (2024) Lightning attention-2: a free lunch for handling unlimited sequence lengths in large language models. External Links: 2401.04658 Cited by: Table 1.
[24] J. Stojkovic, C. Zhang, Í. Goiri, J. Torrellas, and E. Choukse (2025) Dynamollm: designing llm inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1348–1362. Cited by: §6.
[25] Taalas (2025) Taalas hc1. Note: https://taalas.com/products Cited by: §5.
[26] K. Team, Y. Zhang, Z. Lin, X. Yao, J. Hu, F. Meng, C. Liu, X. Men, S. Yang, Z. Li, et al. (2025) Kimi linear: an expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692. Cited by: §1, §1, §2.2, Table 1, Table 1, §4.1.
[27] Q. Team (2026) Qwen3.5: towards native multimodal agents. Note: https://qwen.ai/blog?id=qwen3.5 Cited by: §1, §2.2, Table 1.
[28] vLLM Team at IBM (2025) Hybrid models as first-class citizens in vLLM. Note: https://pytorch.org/blog/hybrid-models-as-first-class-citizens-in-vllm/ PyTorch Blog, November 2025. Cited by: §3.2.
[29] vLLM Team (2026) VLLM. Note: https://github.com/vllm-project/vllm Cited by: §1, Table 5, Table 5.
[30] J. Wang, W. Xie, M. Zhang, B. Zhang, J. Dong, Y. Zhu, C. Lin, J. Tang, Y. Han, Z. Ai, et al. (2026) From prefix cache to fusion rag cache: accelerating llm inference in retrieval-augmented generation. arXiv preprint arXiv:2601.12904. Cited by: §5, §6.
[31] B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. (2026) Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780. Cited by: §1, §2.2, Table 1.
[32] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: Table 1.
[33] S. Yang, J. Kautz, and A. Hatamizadeh (2025) Gated delta networks: improving mamba2 with delta rule. In International Conference on Learning Representations (ICLR). Cited by: Table 1.
[34] J. Yao, H. Li, Y. Liu, S. Ray, Y. Cheng, Q. Zhang, K. Du, S. Lu, and J. Jiang (2025) Cacheblend: fast large language model serving for rag with cached knowledge fusion. In Proceedings of the twentieth European conference on computer systems, pp. 94–109. Cited by: §5, §6.
[35] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, et al. (2023) H2o: heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, pp. 34661–34710. Cited by: §5, §6.
[36] J. Zhao, B. Wan, C. Wu, Y. Peng, and H. Lin (2024) Llm-pq: serving llm on heterogeneous clusters with phase-aware partition and adaptive quantization. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 460–462. Cited by: §6.
[37] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang (2024) DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 193–210. Cited by: §6.
