Back to list
🧠 ATou Learning · 💬 Discussion topic

TurboQuant: Repricing the AI cost structure with extreme quantization that removes hidden overhead

TurboQuant tackles not the old question of "how to squeeze out a bit more precision" but the real bottleneck of "how to eliminate the hidden memory tax of quantization itself". The direction is strong, but the official framing of "near-lossless" as "zero loss" clearly overreaches.

2026-03-25 · Original article ↗

Key takeaways

  • The problem is well chosen. The method targets the most expensive memory/bandwidth bottleneck in large-model inference, especially the KV cache and vector-search indexes; that judgment holds, because long-context and semantic-retrieval costs are increasingly dominated by cache and vector storage.
  • The method is genuinely novel, not a rehash. TurboQuant is not plain low-bit quantization; it combines PolarQuant's "change of coordinate system" compression with QJL's 1-bit residual correction. This "compress the bulk of the signal, then apply ultra-cheap error correction" structure is original, not routine engineering tuning.
  • The experimental results are attractive, but over-extrapolated. The article claims a 3-bit KV cache, at least 6x memory compression, and up to 8x attention-logits speedup on the H100; these results would be highly valuable if reproduced, but the evidence comes mainly from specific models, benchmarks, and hardware, so it does not directly support "broadly applicable to all AI systems".
  • The biggest controversy: the "zero accuracy loss" claim is not credible. Compressing 16/32-bit floats down to 3 bits while claiming "zero accuracy loss" or "no compromise in model performance" is scientific over-promising; the accurate statement is only "nearly lossless on some tasks".
  • The commercial potential is real, but the moat is undecided. If it lands stably in mainstream inference frameworks and retrieval systems, the value will show up directly in inference cost, GPU memory footprint, and index efficiency; for now it is a strong research result rather than a validated industry standard.

Relevance to us

  • What this means for ATou: future AI product competition will hinge not only on model capability but on "who has the lower memory cost at the same quality"; a next step is to make "compression metadata overhead" a hard metric for evaluating any agent or long-context solution, rather than looking only at model scores.
  • What this means for Neta: this piece shows that low-level infrastructure optimization feeds back to shape upper-layer product form, especially for long-memory agents and large-scale retrieval; a next step is an analysis framework of "cost per query × serviceable context length × retrieval recall" for screening technical directions.
  • What this means for Uota: what transfers best here is not the formulas but the "two-stage optimization" pattern: spend the bulk of resources solving the 80%-95% core problem, then correct the remainder at very low cost; this structure can be applied to content production, workflow design, and team QA, instead of chasing end-to-end high precision from the start.
  • What this means for investment judgment: quantization compression, KV cache optimization, and vector-database acceleration remain tracks with real value, not fake demand; the next step is to watch who can reproduce results across hardware, frameworks, and models, not just produce pretty curves on H100 + JAX.

Discussion starters

1. If TurboQuant only holds up convincingly on high-end cards like the H100, is it still a technology that "changes the industry's cost structure"?
2. Will the long-term value of KV cache compression be eroded quickly by new attention mechanisms, state-space models, or native low-precision hardware support?
3. Rather than "compressing further", is the more critical question "who can integrate compression stably into mainstream inference stacks while preserving end-to-end gains"?

TurboQuant: Redefining AI efficiency with extreme compression

Amir Zandieh, Research Scientist, and Vahab Mirrokni, VP and Google Fellow, Google Research

We introduce a set of advanced, theoretically grounded quantization algorithms that enable massive compression for large language models and vector search engines.


Vectors are the fundamental way AI models understand and process information. Small vectors describe simple attributes, such as a point in a graph, while "high-dimensional" vectors capture complex information such as the features of an image, the meaning of a word, or the properties of a dataset. High-dimensional vectors are incredibly powerful, but they also consume vast amounts of memory, leading to bottlenecks in the key-value cache, a high-speed "digital cheat sheet" that stores frequently used information under simple labels so a computer can retrieve it instantly without having to search through a slow, massive database.

Vector quantization is a powerful, classical data compression technique that reduces the size of high-dimensional vectors. This optimization addresses two critical facets of AI: it enhances vector search, the high-speed technology powering large-scale AI and search engines, by enabling faster similarity lookups; and it helps unclog key-value cache bottlenecks by reducing the size of key-value pairs, which enables faster similarity searches and lowers memory costs. However, traditional vector quantization usually introduces its own "memory overhead", as most methods require calculating and storing (in full precision) quantization constants for every small block of data. This overhead can add 1 or 2 extra bits per number, partially defeating the purpose of vector quantization.
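To make that overhead concrete, here is a toy block-wise quantizer sketch (not the paper's method; all names and block sizes are illustrative): each block of values stores its own full-precision scale, and `effective_bits_per_value` shows how that stored constant amortizes into the 1-2 extra bits per number described above.

```python
import numpy as np

def blockwise_quantize(x, bits=3, block=32):
    """Toy block-wise scalar quantizer: each block stores its own
    full-precision scale, i.e., the per-block quantization constant."""
    levels = 2 ** bits - 1
    x = np.asarray(x, dtype=float).reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) + 1e-12  # kept in full precision
    codes = np.round((x / scale + 1) / 2 * levels).astype(np.uint8)
    return codes, scale

def effective_bits_per_value(bits=3, block=32, scale_bits=32):
    """The hidden memory tax: the stored scale, amortized over its block."""
    return bits + scale_bits / block

# A "3-bit" quantizer with 32-value blocks and a 32-bit scale really costs
# 4 bits per value; shrinking the blocks to 16 values pushes it to 5 bits.
```

Avoiding this amortized constant entirely is exactly the "zero overhead" that QJL and PolarQuant aim for.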

Today, we introduce TurboQuant (to be presented at ICLR 2026), a compression algorithm that optimally addresses the challenge of memory overhead in vector quantization. We also present Quantized Johnson-Lindenstrauss (QJL) and PolarQuant (to be presented at AISTATS 2026), which TurboQuant uses to achieve its results. In testing, all three techniques showed great promise for reducing key-value bottlenecks without sacrificing AI model performance. This has potentially profound implications for all compression-reliant use cases, including and especially in the domains of search and AI.

How TurboQuant works

TurboQuant is a compression method that achieves a high reduction in model size with zero accuracy loss, making it ideal for supporting both key-value (KV) cache compression and vector search. It accomplishes this via two key steps:

  • High-quality compression (the PolarQuant method): TurboQuant starts by randomly rotating the data vectors. This clever step simplifies the data's geometry, making it easy to apply a standard, high-quality quantizer (a tool that maps a large set of continuous values, like precise decimals, to a smaller, discrete set of symbols or numbers, like integers; examples include audio quantization and JPEG compression) to each part of the vector individually. This first stage uses most of the compression power (the majority of the bits) to capture the main concept and strength of the original vector.

  • Eliminating hidden errors: TurboQuant uses a small, residual amount of compression power (just 1 bit) to apply the QJL algorithm to the tiny amount of error left over from the first stage. The QJL stage acts as a mathematical error-checker that eliminates bias, leading to a more accurate attention score.
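The two-step structure can be sketched as follows. This is a simplified illustration of "rotate, coarsely quantize, then spend one extra bit on the residual", not the paper's exact algorithm (the real method uses PolarQuant for stage one and QJL's unbiased estimator for stage two); the scalar quantizer and the quarter-step correction below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def two_stage_encode(x, R, bits=3):
    """Stage 1: rotate, then coarsely quantize each coordinate.
    Stage 2: keep only the sign of the residual (1 extra bit/coordinate)."""
    y = R @ x
    levels = 2 ** bits - 1
    scale = np.abs(y).max()
    codes = np.round((y / scale + 1) / 2 * levels)
    coarse = (codes / levels * 2 - 1) * scale
    resid_sign = np.sign(y - coarse)  # the cheap 1-bit correction
    return codes, resid_sign, scale

def two_stage_decode(codes, resid_sign, scale, R, bits=3):
    levels = 2 ** bits - 1
    coarse = (codes / levels * 2 - 1) * scale
    spacing = 2 * scale / levels
    # Nudge each coordinate a quarter grid step toward the true value.
    return R.T @ (coarse + resid_sign * spacing / 4)
```

Decoding with the sign correction typically roughly halves the reconstruction error of the coarse stage alone, which is the intuition behind spending only one extra bit.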

To fully understand how TurboQuant achieves this efficiency, we take a closer look at how the QJL and PolarQuant algorithms work.

QJL: The zero-overhead, 1-bit trick

QJL uses a mathematical technique called the Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points. It reduces each resulting vector number to a single sign bit (+1 or -1). This algorithm essentially creates a high-speed shorthand that requires zero memory overhead. To maintain accuracy, QJL uses a special estimator that strategically balances a high-precision query with the low-precision, simplified data. This allows the model to accurately calculate the attention score (the process used to decide which parts of its input are important and which parts can be safely ignored).
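Assuming the standard sign-bit Johnson-Lindenstrauss construction, a toy version of this idea might look like the sketch below. The constant `sqrt(pi/2)` follows from the identity `E[sign(g) * g] = sqrt(2/pi)` for a standard Gaussian `g`; the estimator in the paper may differ in details, so treat this as an illustration of the "full-precision query against 1-bit keys" balance, not the published algorithm.

```python
import numpy as np

def qjl_encode(k, S):
    """Project the key with a Gaussian JL matrix and keep only the sign
    bits, plus the key's norm: one scalar total, not per-block overhead."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, k_signs, k_norm, S):
    """Estimate <q, k> from a full-precision query and 1-bit key signs.
    For Gaussian rows s, E[sign(<s, k>) * <s, q>] = sqrt(2/pi) * <q, k> / ||k||,
    so multiplying by ||k|| * sqrt(pi/2) / m inverts the constant."""
    m = S.shape[0]
    return k_norm * np.sqrt(np.pi / 2) / m * ((S @ q) @ k_signs)

rng = np.random.default_rng(1)
d, m = 64, 8192
S = rng.normal(size=(m, d))
q, k = rng.normal(size=d), rng.normal(size=d)
k_signs, k_norm = qjl_encode(k, S)
est = qjl_inner_product(q, k_signs, k_norm, S)  # close to q @ k
```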

PolarQuant: A new "angle" on compression

PolarQuant addresses the memory overhead problem using a completely different approach. Instead of describing a memory vector with standard Cartesian coordinates (i.e., X, Y, Z) that indicate the distance along each axis, PolarQuant converts the vector into polar coordinates. This is comparable to replacing "Go 3 blocks East, 4 blocks North" with "Go 5 blocks total at a 37-degree angle". This results in two pieces of information: the radius, which signifies how strong the core data is, and the angle, which indicates the data's direction or meaning. Because the pattern of the angles is known and highly concentrated, the model no longer needs to perform the expensive data normalization step: it maps data onto a fixed, predictable "circular" grid where the boundaries are already known, rather than a "square" grid where the boundaries change constantly. This allows PolarQuant to eliminate the memory overhead that traditional methods must carry.
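A minimal sketch of the pairing idea, under the assumption of a simple uniform angle grid (the actual codebook in the paper may differ): because every angle lies in the fixed range [-pi, pi), the quantization grid is known in advance and needs no stored per-block normalization constants.

```python
import numpy as np

def polar_encode(v, angle_bits=6):
    """Pair up coordinates and store each pair as (radius, angle code).
    Angles live in the fixed range [-pi, pi), so a uniform grid of
    2**angle_bits bins needs no per-block constants."""
    x, y = v[0::2], v[1::2]
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    n = 2 ** angle_bits
    codes = np.floor((theta + np.pi) / (2 * np.pi) * n).astype(int) % n
    return r, codes

def polar_decode(r, codes, angle_bits=6):
    n = 2 ** angle_bits
    theta = (codes + 0.5) / n * 2 * np.pi - np.pi  # centre of each angular bin
    out = np.empty(2 * len(r))
    out[0::2], out[1::2] = r * np.cos(theta), r * np.sin(theta)
    return out
```

With 6-bit angle codes the worst-case angular error is half a bin (pi/64 radians), so the reconstruction preserves each pair's radius exactly and its direction to within about 5%.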


PolarQuant acts as a high-efficiency compression bridge, converting Cartesian inputs into a compact polar "shorthand" for storage and processing. The mechanism begins by grouping pairs of coordinates from a d-dimensional vector and mapping them onto a polar coordinate system. Radii are then gathered in pairs for recursive polar transformations, a process that repeats until the data is distilled into a single final radius and a collection of descriptive angles.
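The recursive step described above can be illustrated in a few lines of NumPy. This toy version records only the angles and the final radius, without quantizing anything, to show that the transform is exact: the single radius left at the end is just the vector's norm, and a d-dimensional vector yields d-1 angles in total.

```python
import numpy as np

def recursive_polar(v):
    """Pair values into (radius, angle), then pair the resulting radii
    again, until a single radius remains. Requires len(v) to be a power
    of two. The final radius equals ||v||; everything else is angles."""
    angles = []
    r = np.asarray(v, dtype=float)
    while len(r) > 1:
        x, y = r[0::2], r[1::2]
        angles.append(np.arctan2(y, x))
        r = np.hypot(x, y)  # these radii get paired on the next pass
    return r[0], angles
```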

Experiments and results

We rigorously evaluated all three algorithms across standard long-context benchmarks, including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using open-source LLMs (Gemma and Mistral). The experimental data demonstrate that TurboQuant achieves optimal scoring performance in terms of both dot product distortion and recall while simultaneously minimizing the key-value (KV) memory footprint. The chart below shows the aggregated performance scores across diverse tasks, including question answering, code generation, and summarization, for TurboQuant, PolarQuant, and the KIVI baseline.

TurboQuant demonstrates robust KV cache compression performance across the LongBench benchmark relative to various compression methods on the Llama-3.1-8B-Instruct model (bitwidths are indicated in brackets).

The results for long-context "needle-in-a-haystack" tasks (i.e., tests designed to see if a model can find one specific, tiny piece of information buried inside a massive amount of text) are shown below. Again, TurboQuant achieves perfect downstream results across all benchmarks while reducing the key-value memory size by a factor of at least 6x. PolarQuant is also nearly lossless on this task.
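As a sanity check on the "at least 6x" figure, here is a back-of-the-envelope KV cache calculation. The model shape below (32 layers, 8 KV heads, head dimension 128, a Llama-3.1-8B-like configuration) is an assumption for illustration, not a figure from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Total bytes for the keys + values of one sequence across all layers."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys and values
    return values * bits / 8

# Assumed Llama-3.1-8B-like shape at a 128k-token context.
fp16 = kv_cache_bytes(32, 8, 128, seq_len=128_000, bits=16)  # ~16.8 GB
q3 = kv_cache_bytes(32, 8, 128, seq_len=128_000, bits=3)     # ~3.1 GB
ratio = fp16 / q3
# The 16-bit -> 3-bit ratio alone is 16/3, about 5.3x; the "at least 6x"
# figure additionally reflects that TurboQuant avoids the 1-2 bits/value
# of per-block constants that other low-bit schemes pay.
```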

TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and without compromising model accuracy, all while achieving a faster runtime than the original LLMs (Gemma and Mistral). It is exceptionally efficient to implement and incurs negligible runtime overhead. The following plot illustrates the speedup in computing attention logits using TurboQuant: specifically, 4-bit TurboQuant achieves up to an 8x performance increase over 32-bit unquantized keys on H100 GPU accelerators.

TurboQuant illustrates a substantial performance increase in computing attention logits within the key-value cache across various bit-width levels, measured relative to a highly optimized JAX baseline.

This makes it ideal for supporting use cases like vector search, where it dramatically speeds up the index-building process. We evaluated TurboQuant's efficacy in high-dimensional vector search against state-of-the-art methods (PQ and RabbiQ) using the 1@k recall ratio, which measures how frequently the algorithm captures the true top inner-product result within its top-k approximations. TurboQuant consistently achieves superior recall ratios compared to baseline methods, despite those baselines utilizing inefficient large codebooks and dataset-specific tuning (figure below). This confirms TurboQuant's robustness and efficiency for high-dimensional search tasks.
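The 1@k recall ratio described here is straightforward to compute; the function below is an illustrative implementation of that definition, not code from the paper.

```python
import numpy as np

def recall_1_at_k(db, queries, approx_scores, k=10):
    """1@k recall: the fraction of queries whose true top-1 inner-product
    item appears among the top-k items ranked by the approximate scores."""
    hits = 0
    for q, approx in zip(queries, approx_scores):
        true_top = int(np.argmax(db @ q))     # exact top-1 by inner product
        top_k = np.argsort(approx)[::-1][:k]  # approximate top-k candidates
        hits += true_top in top_k
    return hits / len(queries)
```

In practice `approx_scores` would come from scoring queries against the quantized database; feeding in the exact scores recovers a recall of 1.0, which is the easy upper bound a quantizer is measured against.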

TurboQuant demonstrates robust retrieval performance, achieving the optimal 1@k recall ratio on the GloVe dataset (d=200) relative to various state-of-the-art quantization baselines.

TurboQuant demonstrates a transformative shift in high-dimensional search. By setting a new benchmark for achievable speed, it delivers near-optimal distortion rates in a data-oblivious manner. This allows our nearest-neighbor engines to operate with the efficiency of a 3-bit system while maintaining the precision of much heavier models. See the paper for more details.

Looking ahead

TurboQuant, QJL, and PolarQuant are more than just practical engineering solutions; they're fundamental algorithmic contributions backed by strong theoretical proofs. These methods don't just work well in real-world applications; they are provably efficient and operate near theoretical lower bounds. This rigorous foundation is what makes them robust and trustworthy for critical, large-scale systems.

While a major application is solving the key-value cache bottleneck in models like Gemini, the impact of efficient, online vector quantization extends even further. For example, modern search is evolving beyond keywords to understand intent and meaning. This requires vector search: the ability to find the "nearest" or most semantically similar items in a database of billions of vectors.

Techniques like TurboQuant are critical for this mission. They allow for building and querying large vector indices with minimal memory, near-zero preprocessing time, and state-of-the-art accuracy. This makes semantic search at Google's scale faster and more efficient. As AI becomes more integrated into all products, from LLMs to semantic search, this work in fundamental vector quantization will be more critical than ever.

Acknowledgements

This line of research was conducted in collaboration with Praneeth Kacham, researcher at Google; Insu Han, Assistant Professor at KAIST; Majid Daliri, PhD student at NYU; Lars Gottesbüren, researcher at Google; and Rajesh Jayaram, researcher at Google.


📋 Discussion archive

Discussion in progress…