🪞 Uota学 · 💬 讨论题

LLM 评测是个秩 2 矩阵——你跑的 100 个 benchmark 里 95 个是废的

LLM 评测矩阵的有效秩只有 2,用 5 个基准就能把另外 44 个预测到 5 分以内——大多数 benchmark 在测同一件事。

2026-02-26

核心观点

  • 评测冗余不是直觉,是数学事实。 作者用 83 个模型 × 49 个基准构建矩阵,在最大的完整子矩阵上做 SVD:第一主成分就捕获了 71% 的谱,而在预测任务上秩 2 表现最好。秩 2 意味着:所有这些花哨的 benchmark 本质上只在测两个维度——"通用能力"和"前沿推理新鲜度"。这不是"差不多",是结构性冗余。
  • 5 个 benchmark 就够了,线性代数免费打赢万亿参数模型。 BenchPress 用稀疏回归 + 秩 2 SVD 的凸组合,在留出集上中位误差 7%。更狠的是:同样的任务让 Claude Sonnet 来做,花了 $0.84,结果还不如免费的线性代数(5.8% vs 6.1%)。Claude 的世界知识约等于 5 个基准分数的信息量——之后数学就赢了。
  • "抗预测"的 benchmark 才是真正有价值的。 SimpleBench、ARC-AGI-1、Terminal-Bench 这些预测误差最高的基准,恰恰是在测矩阵其余部分没捕捉到的能力。如果你的新 benchmark 和已有的高度相关,它基本没增加信息。真正值得做的评测应该和现有一切正交。这对设计评测的人是当头一棒。
  • Claude Code + Codex 两天搞出 workshop 论文级成果,GPU 花费 $0。 作者自嘲"我主要就是写 prompt",但这恰恰是重点:问对问题 + 正确编排 AI 工具 = 以前需要一个研究团队几周的产出。而且他让 Claude Code 和 Codex 互相审计代码,两个 agent 甚至吵起来了。

跟我们的关联

🪞Uota — 这篇直接影响我怎么评估模型。以后做模型选型不需要跑完所有 benchmark,用 BenchPress 的最小集 {HLE, AIME 2025, LiveCodeBench, SWE-bench Verified, SimpleQA} 做快速筛选,只对候选模型跑关键差异化的 benchmark。可以把 predict.py 集成到工具链里。

👤ATou — "重要性从'你能不能得到答案'转移到'你问的是不是对的问题'"——这句话就是 Context Engineer 的核心命题。作者两天用 AI 工具链产出 workshop 论文级成果,这是 ATou 追求的"指挥 AI 的 top 0.0001%"的活体案例。值得拆解他的工作流:Claude Code 写代码 → Codex 审计 → 互相 debate → 人类做判断。

🧠Neta — 如果 Neta 内部有模型评测需求(比如选基座模型、评估微调效果),可以用这套方法省掉大量评测成本。对 20 人特种部队来说,$0 的评测预测 vs $5000 跑一次 SWE-bench,选择很明显。

讨论引子

💭 如果 LLM 能力真的只有两个有效维度,那我们花在"差异化评测"上的精力是不是大部分都浪费了?更进一步:模型厂商发 50 个 benchmark 分数的发布博客,是不是本质上在用冗余信息制造"我们很强"的幻觉?

💭 作者说 Claude 拿到越多数据反而预测越差("检索增强式退化")——这个现象在我们日常用 AI 时是不是也存在?给 LLM 塞越多 context 是否有时候反而干扰了它的先验判断?

💭 "两天 + prompt engineering = workshop 论文"——如果这成为常态,学术研究的护城河在哪里?是问对问题的能力,还是对领域的深度直觉,还是两者都不是?

你不必跑完每一项评测

我用 Claude Code 搭了 BenchPress——一个 $0 成本的基准预测系统;用 Codex 给它做 bug 审计;再用 Claude Sonnet 花 $1 试着“打败”它。结论是:LLM 评测的秩低得惊人(事实上是秩 2),只用 5 个基准,就能把另外 44 个基准预测到 5 分以内(很多时候误差更小)。

大多数评测大概……是冗余的。

我先举个例子,说明为什么你本就该预期如此。

下面是一张表:三款前沿模型、三项基准。

你觉得缺失的那一格会是多少?

也许在 92–94 左右?

听起来差不多——答案是 93。

你其实没必要把 GPT-5.2 跑一遍 GPQA-D(那会带来一笔不小的 $ 成本)。是什么让你能猜得八九不离十?希望你不觉得我这个假设有点武断:很可能只是因为这几行几乎一模一样。事实上,至少在评测上,这些模型几乎一模一样。

我想强调的是:这种相似性几乎无处不在——同一代的模型、规模相近的模型、测试相似能力且几乎完美相关的评测(例如 AIME vs HMMT),以及对 GPT-5.2 很难的基准,对 Claude Opus 4.6 也很难(例如 Terminal-Bench 2.0)。无论你从哪个方向去“测量”这些模型和基准,总能看到冗余。

于是,一个简单的问题浮上心头:能否利用这种……低维结构来预测评测成绩?

我脑子里的每个原子都在说可以,所以我想验证一下。

于是,花了 Claude Code 和 Codex 几千万个 token 之后……

不过先来一段简短的历史课。

拿下百万美元矩阵的老把戏

我读研时(以及 15 年前所有 EE 理论方向的小孩)特别迷压缩感知和矩阵补全(嗨 @beenwrekt!)。这套理论的核心是:只观察矩阵的少量条目,如果你运气好且真实矩阵近似低秩,那么只用出人意料少的观测,就能通过最经典的线性代数算法的各种变体——奇异值分解(Singular Value Decomposition,SVD)——把剩下的条目恢复出来。

当年甚至有一个著名的 $1M Netflix Prize:2006 年,Netflix 公开了一个大约 1 亿条(user, movie)评分的矩阵,悬赏 $1M(大概相当于在 Lambda 上用 1024 张 H100 跑 2.5 周),奖给能把缺失评分的预测效果比 Netflix 自家算法提高 10% 的团队(大致如此)。比赛持续了三年,吸引了数万支队伍;虽然确实有一支队伍拿走了那 $1M,但 Netflix 最终并没有把冠军方案上线!不过如果我没记错,这场比赛几乎"发明"了现代推荐系统,也开启了几千人的 ML 职业生涯。

在我们的(model, eval)设置里,矩阵补全会这样运作。

其实挺简单。

假设每个模型都能用一个只有两个维度的向量来描述,比如:它做通用任务有多强,以及它做困难、全新推理有多强。每个基准也同样用两个维度来描述,比如:它在多大程度上测试通用知识 vs. 在多大程度上测试推理能力。最终的模型得分就是这两个向量的点积。

比如 GPT-4o 的 s = (8, 2):通用任务强,推理弱。AIME 2025 的 b = (1, 8),也就是主要在测推理。那么 GPT-4o 在 AIME 2025 上的得分就是 8×1 + 2×8 = 24。(其实我觉得这还挺接近它的真实分数;我都没打算这样“对上” :D。)再比如如果 MMLU 的 b = (9, 1),也就是主要测通用知识,那么 GPT-4o 在 MMLU 上的得分就是 8×9 + 2×1 = 74。
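上面的算术可以用几行代码复现(向量数值是正文假设的玩具数字,不是真实测得的分数):

```python
import numpy as np

# 正文假设的玩具向量:(通用能力, 推理能力)
s_gpt4o = np.array([8.0, 2.0])   # GPT-4o:通用强、推理弱
b_aime = np.array([1.0, 8.0])    # AIME 2025:主要测推理
b_mmlu = np.array([9.0, 1.0])    # MMLU:主要测通用知识

# 秩 2 模型:预测分数 = 模型向量与基准向量的点积
aime_score = float(s_gpt4o @ b_aime)  # 8×1 + 2×8 = 24
mmlu_score = float(s_gpt4o @ b_mmlu)  # 8×9 + 2×1 = 74
print(aime_score, mmlu_score)
```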

这个事后看起来显而易见的洞见是:每个(model, eval)条目,都可以由两个二维向量的内积生成。我们稍后会看到,这两个维度确实往往能把数据解释得相当不错!

所以游戏规则是:观察一些真实的(model, eval)分数,解出隐藏向量,再预测其余条目。

83 × 49

但首先,我们需要一个矩阵!

我让 Claude 去搜索并核验从 2025 年 1 月到现在所有它能找到的(model, eval)组合。它找到了 49 个基准上的 83 个模型,每一格都附了来源 URL。我又让 Codex 审计这些发现是否存在幻觉,并通过独立核验逐条确认每个条目都有效。

最终矩阵的填充率是 34%:全部 ~4k 个格子里,约 1.4k 格有真实数值。其余缺失是因为 Claude 找不到公开报道的分数。大概长这样:

这 83 个模型来自 21 家实验室(熟悉的那几家),以及 49 个基准(也都是熟悉的那几项)。

但这个矩阵真的低秩吗?很容易检查:找一个足够大的、完全观测到的子矩阵,直接跑 SVD。我找到的最大完整块是 31 个模型 × 10 个基准,其奇异值大概是这样:

第一主成分就捕获了谱的 71%。前三个捕获了 >90%。所以嘛——和我这辈子见过的每一个矩阵一样,这个也差不多是秩 3 😊
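这个"找完整子矩阵、跑 SVD、看谱"的检查本身很容易写。原始矩阵不在手边,下面用一个"秩 2 + 噪声"的合成 31×10 矩阵演示同样的步骤(数据是虚构的,仅作示意):

```python
import numpy as np

rng = np.random.default_rng(0)
# 合成一个近似秩 2 的完整子矩阵:低秩结构 + 少量噪声
U = rng.normal(size=(31, 2))
V = rng.normal(size=(10, 2))
M = U @ V.T + 0.05 * rng.normal(size=(31, 10))

s = np.linalg.svd(M, compute_uv=False)    # 奇异值,从大到小
energy = np.cumsum(s**2) / np.sum(s**2)   # 前 k 个成分累计捕获的谱能量
print(energy[:3])                          # 对近似秩 2 的数据,前两项应接近 1
```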

如果你人生里做过一次矩阵补全,你已经完全知道这篇文章后面会怎么发展了。

不过我仍然怀疑矩阵补全在这里是否真能奏效,因为它通常用在更大的矩阵上。但用 Claude Code 和 Codex 验证一下能花多少钱呢,对吧?

我想知道:对于一个给定的模型,你需要先拿到多少个基准分数,才能把剩下的都预测出来?

挺可爱的问题。我们来看看!

矩阵补全在 83×49 上能用吗?

算是能用——而且这个"算是"成色相当不错!最好的方法(我现在称之为 BenchPress,细节见下一节)在留出集上预测的绝对误差中位数是 7%。具体一点,举几个例子:

Agentic、竞赛数学、编程、小型开放权重模型,都能控制在两三分之内;我会在 GitHub 仓库里分享全部结果。但也有一些很丑的例子:

差了 45+ 分。嘶。我们稍后再解释原因。

BenchPress:你的评测套件的 0 美元近似

最终的预测模型只有两个极其简单的“配料”。

配料 1:稀疏回归。对每个缺失格子,找出最能预测目标的 5 个基准。具体做法:要预测 AIME 2025,就拿另一个基准(比如 MMLU-Pro),把所有同时有这两项分数的模型画成点(x = 它的 MMLU-Pro 分数,y = 它的 AIME 2025 分数),然后在 logit 空间里对这些点拟合一条直线。拟合得越好,说明 MMLU-Pro 对 AIME 2025 的信息量越大。对其余 48 个基准都这么做,保留拟合最好的 5 个,然后按拟合质量加权平均它们的预测。
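下面是配料 1 的一个最小可运行草图。数据组织方式和函数名都是我为演示虚构的(缺失格子用 np.nan 表示),真实实现以仓库代码为准:

```python
import numpy as np

def logit(p, eps=1e-3):
    """把 0-100 的分数映到 logit 空间;夹紧避免 0/100 溢出。"""
    p = np.clip(np.asarray(p, float) / 100.0, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 100.0 / (1.0 + np.exp(-z))

def predict_cell(M, model_i, bench_j, k=5):
    """用与基准 j 线性关系拟合最好的 k 个基准,加权预测 M[model_i, bench_j]。"""
    candidates = []
    for b in range(M.shape[1]):
        if b == bench_j or np.isnan(M[model_i, b]):
            continue
        both = ~np.isnan(M[:, b]) & ~np.isnan(M[:, bench_j])  # 两项都有分的模型
        if both.sum() < 3:
            continue
        lx, ly = logit(M[both, b]), logit(M[both, bench_j])
        slope, intercept = np.polyfit(lx, ly, 1)              # logit 空间拟合直线
        resid = ly - (slope * lx + intercept)
        r2 = 1.0 - resid.var() / ly.var() if ly.var() > 0 else 0.0
        pred = sigmoid(slope * logit(M[model_i, b]) + intercept)
        candidates.append((max(r2, 0.0), pred))
    top = sorted(candidates, reverse=True)[:k]                # 只留拟合最好的 k 个
    if not top:
        return float(np.nanmean(M[:, bench_j]))               # 退化情形:猜列均值
    w = np.array([r for r, _ in top])
    p = np.array([p for _, p in top])
    if w.sum() == 0:
        w = np.ones_like(w)
    return float(np.average(p, weights=w))
```

在一个"logit 空间里严格线性"的合成矩阵上隐藏一格再预测,这个草图能几乎精确还原;真实数据当然没这么干净。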

配料 2:秩 2 的 SVD。把每个模型想象成 49 维“基准空间”里的一个点。秩 2 的 SVD 就是在说:找一张最优的二维平面穿过这些点,使得点投影到这张平面后尽可能接近真实分数。一旦有了这张平面,每个模型在平面上都有一个位置(2 个坐标),每个基准在平面上也对应一个方向(2 个坐标),它们的点积就是预测分数。

为什么选秩 2?因为我们把秩从 1 扫到 8,秩 2 胜出。秩 3 已经会差半分左右;秩 8 的绝对误差会飙到 13%。我上面分享的谱里第三个奇异值看起来“像真的”,但在预测上它低于噪声底。

这里有个小问题:66% 的坐标是缺失的,没法直接用。矩阵补全里有个标准算法:先用 0(在我们这里用列均值)把缺失填上,然后做 SVD,把结果投影到奇异向量上,只用投影去替换缺失项、同时把观测到的分数钉死不动,如此迭代,直到投影不再变化。

还有一个重要细节:Claude Code 发现所有事情都应该在 logit 空间里做:logit 变换 logit(p) = log(p/(1-p)) 会拉伸接近 0% 和 100% 的分数、压缩中间区域,让 88% 和 92% 之间的差距获得它应得的权重。这个改动把最终精度提升了大约 10%。我敢肯定这在老派矩阵补全里是个标准小技巧。
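把迭代 SVD 补全和 logit 技巧合在一起,一个最小可运行草图长这样(缺失用 np.nan 表示;svd_impute 这个名字和具体参数是我虚构的,细节以仓库代码为准):

```python
import numpy as np

def logit(p, eps=1e-3):
    p = np.clip(np.asarray(p, float) / 100.0, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 100.0 / (1.0 + np.exp(-z))

def svd_impute(M, rank=2, iters=200, tol=1e-8):
    """迭代低秩补全:列均值初始化 → 截断 SVD 投影 → 只替换缺失项,
    观测项钉死不动,直到投影不再变化。全程在 logit 空间进行。"""
    Z = logit(M)                           # 缺失的 nan 原样传播
    missing = np.isnan(Z)
    col_mean = np.nanmean(Z, axis=0)
    X = np.where(missing, col_mean, Z)     # 缺失处先填列均值
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # 秩 rank 投影
        X_new = np.where(missing, low, Z)            # 观测值保持不动
        if np.max(np.abs(X_new - X)) < tol:
            return sigmoid(X_new)
        X = X_new
    return sigmoid(X)
```

在一个"真实秩 2"的合成矩阵上隐藏 30% 条目再补全,隐藏条目的中位绝对误差通常远小于 5 分;真实矩阵只是近似低秩,误差自然更大。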

把两种配料做一个凸组合:每个条目里 60% 来自回归 + 40% 来自 SVD。为什么?因为 Claude Code 扫了各种组合,这个最好!

没有用神经网络,也没有用元数据。我们(这里的“我们”指 Claude)试过用 KNN 替代 SVD,但 KNN 和回归都属于局部方法,会犯相关性很强的错误。换成 SVD 带来了 14% 的提升。

还有个我让 Claude 试、我也以为会有效的东西:给 BenchPress 喂模型元数据,比如参数量、提供方、推理模式、是否开放权重。结果更差。“从 DeepSeek 蒸馏得到的 14B 推理模型”这类信息,已经被模型在各项分数上的模式捕获了。酷!

另外,Claude 还试了 NMF、PMF、nuclear norm 和 ALS,都不如老老实实的 SVD。就爱看这个!

需要多少个基准?五个。

做个实验:拿一个模型,把它所有分数都藏起来,然后按随机顺序一次揭开一个。每揭开一个,就重新构建 BenchPress 去预测剩下的。误差会多快降下来?

答案是:很快。只知道 1 个分数时(对所有模型取中位数!)误差大约是 12%。知道 5 个分数时降到 9%。再往后会逐渐边际收益递减:单个预测会更准,但中位数趋于稳定,因为总有少数难预测、且始终很噪的基准把中位数“托住”。

实践上的结论是:5 个基准足以对模型在其它所有基准上的表现做出不错的预测。

下面是过去一年里几款热门模型的示例:预测误差 vs. 已揭示测试数。

事实上,把 ARC-AGI 去掉误差还会更好 :D 我觉得这对 SVD 是好消息,对 ARC-AGI 是坏消息。

如果你只能选 5 个来跑,用贪心前向选择会得到:

{HLE, AIME 2025, LiveCodeBench, SWE-bench Verified, SimpleQA}

它们覆盖了四个类别,也覆盖了两个主成分。在严格留出下,误差大约是 7.8%,而完整方法是 7%。不完美,但确实是一套实用的最小评测集。
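贪心前向选择本身只有几行。下面是一个自包含的示意(误差函数 toy_error 和里面的"信息维度"完全是虚构的,仅用来演示"冗余基准不会被选中"这件事):

```python
def greedy_forward_select(benchmarks, holdout_error, k=5):
    """贪心前向选择:每一步加入使留出误差下降最多的那个基准。"""
    chosen = []
    for _ in range(min(k, len(benchmarks))):
        best = min((b for b in benchmarks if b not in chosen),
                   key=lambda b: holdout_error(chosen + [b]))
        chosen.append(best)
    return chosen

# 玩具误差模型(纯属虚构):每个基准覆盖若干"信息维度",
# 误差只取决于已覆盖的维度数——冗余基准(如 AIME vs HMMT)不再降误差
info = {"HLE": {"reasoning"}, "AIME 2025": {"math"}, "HMMT": {"math"},
        "MMLU": {"knowledge"}, "SimpleQA": {"knowledge", "facts"}}

def toy_error(subset):
    covered = set().union(*(info[b] for b in subset)) if subset else set()
    return 10.0 - 2.0 * len(covered)

picks = greedy_forward_select(list(info), toy_error, k=3)
print(picks)  # SimpleQA 先入选(一次覆盖 2 维);HMMT 因与 AIME 冗余永不入选
```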

这在我看来真的挺酷!

秩 2 里的那个 2 到底意味着什么?

SVD 会给你两样东西:每个基准在每个成分上的权重,以及每个模型在每个成分上的得分。基准权重告诉你这个成分在“测什么”;模型得分告诉你每个模型在这条轴上的位置。

当你提取前两个奇异向量(左右各一个)时,会出现一些有意思的东西。接下来两段里,我把这些奇异向量称为“成分”。

第一对成分(对应最大奇异值)看起来描述的是通用能力:权重最大的基准是 GPQA-D、LiveCodeBench、MMLU-Pro、HumanEval 和 MMLU。每个模型在这条轴上都有一个分数:Gemini 3.1 Pro、GPT-5.2、Gemini 3 Pro、Opus 4.6、Kimi K2.5 得分最高。另一端是:Qwen3-0.6B、OLMo 2 13B、Qwen3-1.7B、DeepSeek-R1-Distill-1.5B、Falcon3-10B。它几乎像是在把前沿模型 vs. 小模型分类,而且有意思的是,也在区分闭源 vs. 开源。

第二对成分(对应第二大奇异值)看起来描述的是更难、更“新”的推理任务,以及旧 vs. 新一代前沿模型:权重最大的基准是 SimpleQA、ARC-AGI-2、HLE、ARC-AGI-1 和 FrontierMath——这些都是前沿模型擅长的困难新题。像 MATH-500 和 MMLU 这类基准则在相反方向“拉满”。在模型一侧,Gemini 3.1 Pro、Opus 4.6、Gemini 3 Pro 和 GPT-5.2 得分最高:它们都是最新的前沿模型。另一端是:Claude 3.7 Sonnet、DeepSeek-R1、Gemini 2.5 Flash、o3-mini——它们在 2025 年初到年中还算前沿。模型成分 2 几乎是在衡量“前沿的新鲜度”!

非常有意思!

BenchPress 最差的地方在哪里?

最难的基准往往有两个特征之一:要么分数分布是双峰的(ARC-AGI、IMO、USAMO),要么和其它基准的相关性很弱(Terminal-Bench、Arena-Hard)。这种方法依赖于基准与基准之间的结构;当结构不存在时,“特征”就帮不上预测。

所以最“抗预测”的基准是:SimpleBench、ARC-AGI-1 和 Terminal-Bench——预测它们的误差很高。这些才是有趣的,因为它们测到了矩阵其余部分没捕捉到的东西。

竞赛数学(IMO、USAMO)是双峰的:模型要么会做奥赛题,要么不会。我想你确实没法对阶跃函数做插值吧!
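"没法对阶跃函数做插值"可以用一个极小的演示看出来(数据纯属示意):对 0/100 的阶跃分数做线性拟合,阶跃点附近会错得离谱:

```python
import numpy as np

x = np.linspace(0, 1, 9)             # 某个"能力"维度
y = np.where(x > 0.5, 100.0, 0.0)    # 阶跃:要么会做奥赛题,要么不会
coef = np.polyfit(x, y, 1)           # 线性(可插值)模型
resid = np.abs(np.polyval(coef, x) - y)
print(resid.max())                    # 阶跃点附近最大残差约 44 分
```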

等等!Claude 能在猜分上打败 BenchPress 吗?

好,接下来这部分不算纯科学,因为其中一些模型发布时点早于 Sonnet 的训练截止,所以 Claude 可能只是“知道”某些分数。但我还是很好奇:如果你不使用 SVD 和回归这类预测器,而是用一个参数多到离谱的大模型来做这件事,会怎样?也就是 Claude 😊

我们花了 $0.84,把部分填充的矩阵作为一个巨大的 CSV 喂给 Claude Sonnet,让它预测缺失项。相同的留出方式,相同的评估。

结果是:在公平对比下,BenchPress 略胜 Claude——绝对误差中位数 5.8% vs 6.1%。线性代数免费、不到一秒,就能在“补全电子表格”这件事上打赢一个万亿参数模型。我太爱了。

在 k=0(已知分数为 0,只给模型名字)时,Claude 就能把 Gemini 3.1 Pro 的各项基准预测到 2.5 分以内!!BenchPress 在 k=0 只能猜列均值,会差 17 分。但到 k=5 时 BenchPress 就追上来了。

结论是:Claude 的世界知识大约相当于 5 个基准分数的信息量。

之后,线性代数就赢了!

奇怪的是:给 Claude 的数据越多,它反而越差。k=0 时它对 Gemini 3.1 Pro 的误差是 2.5%,但当你把揭示的数据点增加到 15 个时,误差会扩大到 4.7%。怪!提供部分分数似乎在 Claude 的先验知识和提示词之间制造了冲突。我们不妨把这叫作一种“检索增强式退化”。

而对 BenchPress 来说,几乎总是数据越多预测越好——这才像一个有自尊的 2000 年代早期预测器该有的样子。

什么时候该用这个?

要回答“我的模型在基准 X 上表现如何?”大概有三种拿到数字的方式:

方案 A:直接跑基准。成本:$5 到 $100k?(我确切知道对某些模型和 harness,TB 2.0 可能是 O(100k))。但!你得到的是真实分数。

方案 B:从矩阵里预测。成本:~$0。但和几乎所有免费的东西一样,它也有问题:你的分数大概能落在 ±5 分以内。不算好,也不算太糟。

方案 C:用更便宜的模型当代理去跑?成本:比 A 低。但你拿到的是“另一个模型”的正确答案。嗯……好像也不太对?

严肃地说,决策取决于成本和可接受误差。如果 SWE-bench 要花 $5,000,而 6% 误差足以做“放行/不放行”的决定,那我猜就直接预测吧。如果 AIME 2026 只要 $2,那就跑它。

我被说服了吗?嗯……

对前沿实验室来说,老实说可能并不会。他们大概无论如何都会把所有评测都跑一遍。但对那些想训练新模型的小团队来说,有一个快速的 sanity check 也许会有用?我不知道。

但这不重要,因为这件事试起来很有趣!

有点好笑的是:在几乎任何“你有个不完整矩阵要补全”的场景里,最后它几乎总会变成秩 2 或秩 5。矩阵补全万岁!

也许更深一层的含义是:我们称之为“评测”的东西,大多是在一遍又一遍地测同样两件事(别忘了我们说的是秩 2),而那些逃离这种模式的基准,恰恰是在测试新能力(Terminal-Bench 的朋友们你们好!)。

但我们还是把自己说服了:每个模型都得去爬 100 个评测。

所以这也许能给正在设计新基准的人一个信息:

如果你的评测和已有评测高度相关,它基本没增加多少信息。真正有价值的基准,是那些与其它一切几乎正交的。下面是:每个基准能被矩阵其余部分预测到什么程度:

想在你自己的模型上试试吗?

完整矩阵、全部 1,375 个带引用的分数、BenchPress 代码,以及一个 predict.py CLI(输入 5 个基准分数,预测另外 44 个)都已开源在:

https://github.com/anadim/llm-benchmark-matrix

后记

我在这上面花了两天,疯狂给 Claude Code 和 Codex 喂提示词,GPU 花费是 0$,最后得到的结果,如果我写得更严肃一点,甚至可以当一篇像样的 workshop 论文。

我到现在也不知道该怎么理解这件事。一部分我觉得这很惊人,也能推动更多研究;另一部分我又在想:当每个人都拥有这种能力时会怎样?重要性是否会从“你能不能得到答案”,转移到“你问的是不是对的问题”。

我问对了吗?我不知道!但我玩得很开心。

总之,矩阵补全万岁。

玩得开心!


这份(model, eval)矩阵来自官方模型发布、技术报告、排行榜以及评测平台。每个条目都有来源 URL。代码由 Claude Code 编写,Codex 审计;审计意见又被 Claude Code 反驳,随后 Codex 再反驳回去。它们有点吵起来了,我还真对它们说了“别吵了,要建设性一点 lol”。

我做了什么?我主要就是写提示词。差不多吧。

如果你使用这项工作,请引用:

@misc{papailiopoulos2026benchpress,
  title={You Don't Need to Run Every Eval},
  author={Dimitris Papailiopoulos},
  year={2026},
  url={https://github.com/anadim/llm-benchmark-matrix},
  note={Blog post}
}

链接:http://x.com/i/article/2026523085545857024

相关笔记

I used Claude Code to build BenchPress a $0 benchmark prediction system, Codex to audit it for bugs, and Claude Sonnet to try to beat it for $1. Here's what I found: LLM evals are so low-rank (in fact rank 2) that 5 benchmarks can predict the other 44 to within 5 points (and many times less).

我用 Claude Code 搭了 BenchPress——一个 $0 成本的基准预测系统;用 Codex 给它做 bug 审计;再用 Claude Sonnet 花 $1 试着“打败”它。结论是:LLM 评测的秩低得惊人(事实上是秩 2),只用 5 个基准,就能把另外 44 个基准预测到 5 分以内(很多时候误差更小)。

Most evals are approximately... redundant.

大多数评测大概……是冗余的。

Let me give you an example for why one should expect this.

我先举个例子,说明为什么你本就该预期如此。

Here's a table with three frontier models and three benchmarks.

下面是一张表:三款前沿模型、三项基准。

What would you guess the missing entry is?

你觉得缺失的那一格会是多少?

Maybe around 92-94?

也许在 92–94 左右?

Sounds about right, it's 93.

听起来差不多——答案是 93。

You didn't need to evaluate GPT-5.2 on GPQA-D which would have a nontrivial $ cost. What made you guess reasonably well? I hope you won’t find me presumptuous to assume that it’s likely the fact that the rows are nearly identical. In fact, these models, on evals at least, are nearly identical.

你其实没必要把 GPT-5.2 跑一遍 GPQA-D(那会带来一笔不小的 $ 成本)。是什么让你能猜得八九不离十?我希望你不会觉得我自作聪明:很可能只是因为这几行几乎一模一样。事实上,至少在评测上,这些模型几乎一模一样。

The point that I want to stress is that there's a ton of similarity that shows up everywhere, across models from the same generation, models of similar sizes, evals that test similar skills that correlate almost perfectly (e.g., AIME vs HMMT), and a benchmark that's hard for GPT-5.2 is hard for Claude Opus 4.6 (e.g., Terminal-Bench 2.0). There's redundancy in basically every direction you can think of measuring things about these models and benchmarks.

我想强调的是:这种相似性几乎无处不在——同一代的模型、规模相近的模型、测试相似能力且几乎完美相关的评测(例如 AIME vs HMMT),以及对 GPT-5.2 很难的基准,对 Claude Opus 4.6 也很难(例如 Terminal-Bench 2.0)。无论你从哪个方向去“测量”这些模型和基准,总能看到冗余。

So, a simple question crept in my mind: can you exploit this … low-dimensional structure to predict evals?

于是,一个简单的问题浮上心头:能否利用这种……低维结构来预测评测成绩?

Every atom in my brain said yes, and so I wanted to check.

我脑子里的每个原子都在说可以,所以我想验证一下。

So here goes, a few tens of millions of Claude Code and Codex tokens later...

于是,花了 Claude Code 和 Codex 几千万个 token 之后……

But a short history lesson first.

不过先来一段简短的历史课。

An old trick to obtain a million dollar matrix

拿下百万美元矩阵的老把戏

When I was a grad student, I (and every other EE theory kid 15 years ago) was really into compressed sensing and matrix completion (hi @beenwrekt!). The whole point of that theory is: observe a few entries of a matrix, and if you're lucky and the ground truth matrix is approximately low rank, you can recover the rest from surprisingly few measurements by doing variants of the most classic linear algebra algorithm: Singular Value Decomposition.

我读研时(以及 15 年前所有 EE 理论方向的小孩)特别迷压缩感知和矩阵补全(嗨 @beenwrekt!)。这套理论的核心是:只观察矩阵的少量条目,如果你运气好且真实矩阵近似低秩,那么只用出人意料少的观测,就能通过最经典的线性代数算法的各种变体——奇异值分解(Singular Value Decomposition,SVD)——把剩下的条目恢复出来。

There was even a famous $1M Netflix Prize for this: in 2006, Netflix released a matrix of about 100 million (user, movie) ratings and offered $1M (so like 2.5 weeks of 1024 H100s on Lambda) to whoever could predict the missing ratings 10% better than Netflix's own algorithm, or something like this. The competition ran for three years, drew tens of thousands of teams, and even though a team did get the $1M, Netflix never actually deployed the winning solution! But if I'm not wrong, the competition basically invented modern recommender systems and launched a few thousand ML careers.

当年甚至有一个著名的 $1M Netflix Prize:2006 年,Netflix 公开了一个大约 1 亿条(user, movie)评分的矩阵,并提供 $1M(大概相当于在 Lambda 上用 1024 张 H100 跑 2.5 周)给任何能把缺失评分的预测效果比 Netflix 自家算法提高 10% 的人之类的。比赛持续了三年,吸引了数万支队伍;虽然确实有人拿走了那 $1M,但 Netflix 最终并没有把冠军方案上线!不过如果我没记错,这场比赛几乎“发明”了现代推荐系统,也开启了几千人的 ML 职业生涯。

Here's how matrix completion would work in our (model, eval) setting.

在我们的(model, eval)设置里,矩阵补全会这样运作。

It's kind of simple.

其实挺简单。

Suppose each model can be described by a vector of two entries say, how good it is at general tasks and how good it is at hard novel reasoning. And each benchmark is described the same way, e.g., how much it tests general knowledge vs. how much it tests reasoning. The final model score is their dot product.

假设每个模型都能用一个只有两个维度的向量来描述,比如:它做通用任务有多强,以及它做困难、全新推理有多强。每个基准也同样用两个维度来描述,比如:它在多大程度上测试通用知识 vs. 在多大程度上测试推理能力。最终的模型得分就是这两个向量的点积。

Say GPT-4o has s = (8, 2): strong on general tasks, weak on reasoning. AIME 2025 has b = (1, 8), that is it's mostly testing reasoning. Then GPT-4o on AIME 2025 is 8×1 + 2×8 = 24. (Actually, I think that's close to its true score; didn't even plan this :D.) Now if MMLU has b = (9, 1), i.e., mostly general knowledge, then GPT-4o on MMLU is 8×9 + 2×1 = 74.

比如 GPT-4o 的 s = (8, 2):通用任务强,推理弱。AIME 2025 的 b = (1, 8),也就是主要在测推理。那么 GPT-4o 在 AIME 2025 上的得分就是 8×1 + 2×8 = 24。(其实我觉得这还挺接近它的真实分数;我都没打算这样“对上” :D。)再比如如果 MMLU 的 b = (9, 1),也就是主要测通用知识,那么 GPT-4o 在 MMLU 上的得分就是 8×9 + 2×1 = 74。

The obvious in hindsight insight is that every (model, eval) entry can be produced by the inner product of two 2-dimensional vectors. We'll see later that these two dimensions actually tend to explain the data pretty well!

这个事后看起来显而易见的洞见是:每个(model, eval)条目,都可以由两个二维向量的内积生成。我们稍后会看到,这两个维度确实往往能把数据解释得相当不错!

So the game is: observe a few real (model, eval) scores, solve for the hidden vectors, predict the rest.

所以游戏规则是:观察一些真实的(model, eval)分数,解出隐藏向量,再预测其余条目。

83 x 49

83 × 49

But first, we need a matrix!

但首先,我们需要一个矩阵!

I asked Claude to search and verify all possible (model, eval) pairs it could find between January 2025 and now. It found 83 models across 49 benchmarks, every entry cited with a source URL. I asked Codex to audit the findings for hallucinations and to double-check every entry is valid by independently verifying them.

我让 Claude 去搜索并核验从 2025 年 1 月到现在所有它能找到的(model, eval)组合。它找到了 49 个基准上的 83 个模型,每一格都附了来源 URL。我又让 Codex 审计这些发现是否存在幻觉,并通过独立核验逐条确认每个条目都有效。

The final matrix is 34% filled, meaning roughly 1,4k out of ~4k cells have a ground truth number. The rest are missing because Claude couldn't find a reported score. This is what it looks like:

最终矩阵的填充率是 34%,也就是说大约 ~4k 个格子里,有约 1.4k 格有真实数值。其余缺失是因为 Claude 找不到公开报道的分数。大概长这样:

The 83 models span 21 labs (the usual suspects), and 49 benchmarks (also the usual suspects).

这 83 个模型来自 21 家实验室(熟悉的那几家),以及 49 个基准(也都是熟悉的那几项)。

But, is this matrix actually low-rank? Easy to check: find a large fully-observed submatrix and just run SVD on it. The biggest complete block I found is 31 models × 10 benchmarks, and here's how the singular values look:

但这个矩阵真的低秩吗?很容易检查:找一个足够大的、完全观测到的子矩阵,直接跑 SVD。我找到的最大完整块是 31 个模型 × 10 个基准,其奇异值大概是这样:

The first component alone captures 71% of the spectrum. The first three capture >90%. So yeah, as with every single matrix I've ever seen in my whole life, this one is also kinda rank-3 😊

第一主成分就捕获了谱的 71%。前三个捕获了 >90%。所以嘛——和我这辈子见过的每一个矩阵一样,这个也差不多是秩 3 😊

If you've ever done matrix completion in your life, you already know exactly how the rest of this post goes.

如果你人生里做过一次矩阵补全,你已经完全知道这篇文章后面会怎么发展了。

But, I was still skeptical that matrix completion would work, as it typically operates on much larger matrices. But what's the cost of checking with Claude Code and Codex, right?

不过我仍然怀疑矩阵补全在这里是否真能奏效,因为它通常用在更大的矩阵上。但用 Claude Code 和 Codex 验证一下能花多少钱呢,对吧?

I wanted to know: how many benchmark scores do you need for a given model before you can predict the rest?

我想知道:对于一个给定的模型,你需要先拿到多少个基准分数,才能把剩下的都预测出来?

Cute question. Let’s find out!

挺可爱的问题。我们来看看!

Does matrix completion work on 83x49?

矩阵补全在 83×49 上能用吗?

It kind of does! And for good values of "kind of". The best method which I will now call BenchPress (details in the next section) predicts held-out scores with 7% median, absolute error. To make that concrete, some examples:

多少算能用——它确实能用!最好的方法(我现在称之为 BenchPress,细节见下一节)在留出集上的预测,绝对误差中位数是 7%。具体一点,举几个例子:

Agentic, competition math, coding, small open-weight model are all within a couple points, and I will share all the results in the github repo. But we got some ugly ones too:

Agentic、竞赛数学、编程、小型开放权重模型,都能控制在两三分之内;我会在 GitHub 仓库里分享全部结果。但也有一些很丑的例子:

Off by 45+ points. Ouch. We'll come back to why.

差了 45+ 分。嘶。我们稍后再解释原因。

BenchPress: A 0$ approximation to your eval suite

BenchPress:你的评测套件的 0 美元近似

The final prediction model has two dead simple ingredients.

最终的预测模型只有两个极其简单的“配料”。

Ingredient 1: sparse regression. For each missing cell, find the 5 benchmarks that best predict the target. Concretely: to predict AIME 2025, take another benchmark like MMLU-Pro, plot every model that has scores on both as a point (x = its MMLU-Pro score, y = its AIME 2025 score), and fit a straight line through those points in logit space. The better the line fits, the more MMLU-Pro tells you about AIME 2025. Do this for all 48 other benchmarks, keep the 5 where the line fits best, and average their predictions weighted by fit quality.

配料 1:稀疏回归。对每个缺失格子,找出最能预测目标的 5 个基准。具体做法:要预测 AIME 2025,就拿另一个基准(比如 MMLU-Pro),把所有同时有这两项分数的模型画成点(x = 它的 MMLU-Pro 分数,y = 它的 AIME 2025 分数),然后在 logit 空间里对这些点拟合一条直线。拟合得越好,说明 MMLU-Pro 对 AIME 2025 的信息量越大。对其余 48 个基准都这么做,保留拟合最好的 5 个,然后按拟合质量加权平均它们的预测。

Ingredient 2: rank-2 SVD. Imagine every model is a point in 49-dimensional benchmarks space. Rank-2 SVD says: find the best 2D plane through these points such that the projections onto that plane are as close to the real scores as possible. Once you have the plane, every model gets a position on it (2 coordinates) and every benchmark gets a direction on it (2 coordinates) and their dot product gives the predicted score.

配料 2:秩 2 的 SVD。把每个模型想象成 49 维“基准空间”里的一个点。秩 2 的 SVD 就是在说:找一张最优的二维平面穿过这些点,使得点投影到这张平面后尽可能接近真实分数。一旦有了这张平面,每个模型在平面上都有一个位置(2 个坐标),每个基准在平面上也对应一个方向(2 个坐标),它们的点积就是预测分数。

Why rank 2? Because we swept ranks 1 through 8 and rank 2 wins. Rank 3 is already worse by half a point, rank 8 blows up to 13% in absolute error. The third singular value looks real in the spectrum that I shared above, but it's below the noise floor for prediction.

为什么选秩 2?因为我们把秩从 1 扫到 8,秩 2 胜出。秩 3 已经会差半分左右;秩 8 的绝对误差会飙到 13%。我上面分享的谱里第三个奇异值看起来“像真的”,但在预测上它低于噪声底。

There's a small problem: 66% of the coordinates are missing for this to work out of the box. A standard algorithm in matrix completion says to start with zeros, or in our case with column averages, then do SVD, project things onto the singular vectors, replace only the missing entries with the projections while keeping observed scores pinned, and repeat until the projections stop changing.

这里有个小问题:66% 的坐标是缺失的,没法直接用。矩阵补全里有个标准算法:先用 0(在我们这里用列均值)把缺失填上,然后做 SVD,把结果投影到奇异向量上,只用投影去替换缺失项、同时把观测到的分数钉死不动,如此迭代,直到投影不再变化。

One important detail: Claude Code found that everything should happen in logit space: the logit transform, logit(p) = log(p/(1-p)), stretches scores near 0% and 100% and compresses the middle, so the gap between 88% and 92% gets the weight it deserves. This fix improved final accuracy by 10% or so. I’m sure it’s a standard trick in old school matrix completion.

还有一个重要细节:Claude Code 发现所有事情都应该在 logit 空间里做:logit 变换 logit(p) = log(p/(1-p)) 会拉伸接近 0% 和 100% 的分数、压缩中间区域,让 88% 和 92% 之间的差距获得它应得的权重。这个改动把最终精度提升了大约 10%。我敢肯定这在老派矩阵补全里是个标准小技巧。

A convex combination of the two ingredients: 60% of the entry comes from regression + 40% from the SVD. Why? Because Claude Code did a sweep of combinations and that was the best!

把两种配料做一个凸组合:每个条目里 60% 来自回归 + 40% 来自 SVD。为什么?因为 Claude Code 扫了各种组合,这个最好!

No neural networks or metadata used. We (by we I mean Claude) tried KNN instead of SVD, but KNN and the regression are both local methods that make correlated mistakes. Swapping in SVD gave a 14% improvement.

没有用神经网络,也没有用元数据。我们(这里的“我们”指 Claude)试过用 KNN 替代 SVD,但 KNN 和回归都属于局部方法,会犯相关性很强的错误。换成 SVD 带来了 14% 的提升。

One thing Claude tried that I asked for and also expected to work: feeding BenchPress model metadata e.g., parameter count, provider, reasoning mode, open-weight status. It made things worse. Everything in "14B reasoning model distilled from DeepSeek" is already captured by the model's pattern across scores. Cool!

还有个我让 Claude 试、我也以为会有效的东西:给 BenchPress 喂模型元数据,比如参数量、提供方、推理模式、是否开放权重。结果更差。“从 DeepSeek 蒸馏得到的 14B 推理模型”这类信息,已经被模型在各项分数上的模式捕获了。酷!

Also, Claude tried NMF, PMF, nuclear norm, and ALS, and none was better than good old SVD. You love to see it!

另外,Claude 还试了 NMF、PMF、nuclear norm 和 ALS,都不如老老实实的 SVD。就爱看这个!

How many benchmarks do you need? Five.

需要多少个基准?五个。

Here's an experiment: take a model, hide all its scores, then reveal them one at a time in random order. After each reveal, build BenchPress and use it to predict the rest. How quickly does the error come down?

做个实验:拿一个模型,把它所有分数都藏起来,然后按随机顺序一次揭开一个。每揭开一个,就重新构建 BenchPress 去预测剩下的。误差会多快降下来?

The answer is: fast. With just 1 known score, median error (across all models!) is about 12%. By 5 known scores it drops to 9%. After that at some point it’s diminishing returns as the precision of individual predictions keeps improving but the median stabilizes because a few hard-to-predict benchmarks that stay noisy.

答案是:很快。只知道 1 个分数时(对所有模型取中位数!)误差大约是 12%。知道 5 个分数时降到 9%。再往后会逐渐边际收益递减:单个预测会更准,但中位数趋于稳定,因为总有少数难预测、且始终很噪的基准把中位数“托住”。

The practical takeaway is that 5 benchmarks are enough to get a good prediction of how the model performs on everything else.

实践上的结论是:5 个基准足以对模型在其它所有基准上的表现做出不错的预测。

Here are some example prediction error vs number of revealed tests for a few popular models from the past year.

下面是过去一年里几款热门模型的示例:预测误差 vs. 已揭示测试数。

In fact, if you remove ARC-AGIs the error improves :D I'd say that's a good sign for SVD, bad sign for ARC-AGI.

事实上,把 ARC-AGI 去掉误差还会更好 :D 我觉得这对 SVD 是好消息,对 ARC-AGI 是坏消息。

If you can only pick 5 to eval on, greedy forward selection gives you:

如果你只能选 5 个来跑,用贪心前向选择会得到:

{HLE, AIME 2025, LiveCodeBench, SWE-bench Verified, SimpleQA}

{HLE, AIME 2025, LiveCodeBench, SWE-bench Verified, SimpleQA}

They span four categories and cover both principal components. Under proper holdout they get about 7.8% error, versus 7% for the full method. Not everything, but a practical minimum eval set.

它们覆盖了四个类别,也覆盖了两个主成分。在严格留出下,误差大约是 7.8%,而完整方法是 7%。不完美,但确实是一套实用的最小评测集。

This seems legitimately cool to me!

这在我看来真的挺酷!

What does the 2 in rank-2 mean?

秩 2 里的那个 2 到底意味着什么?

The SVD gives you two things: a weight for each benchmark and a score for each model, on each component. The benchmark weights tell you what the component measures. The model scores tell you where each model falls on that axis.

SVD 会给你两样东西:每个基准在每个成分上的权重,以及每个模型在每个成分上的得分。基准权重告诉你这个成分在“测什么”;模型得分告诉你每个模型在这条轴上的位置。

When you extract the 2 top singular vectors (left and right) you get something interesting. I’ll refer to these singular vectors as “components” in the next two paragraphs.

当你提取前两个奇异向量(左右各一个)时,会出现一些有意思的东西。接下来两段里,我把这些奇异向量称为“成分”。

The first pair of components (associated with the max singular value) seem to be about general capability: The benchmarks with the largest weights are GPQA-D, LiveCodeBench, MMLU-Pro, HumanEval, and MMLU. Every model gets a score on this axis: Gemini 3.1 Pro, GPT-5.2, Gemini 3 Pro, Opus 4.6, Kimi K2.5 score highest. At the other end: Qwen3-0.6B, OLMo 2 13B, Qwen3-1.7B, DeepSeek-R1-Distill-1.5B, Falcon3-10B. It's almost a classifier for frontier vs. small models and, interestingly, closed vs. open.

第一对成分(对应最大奇异值)看起来描述的是通用能力:权重最大的基准是 GPQA-D、LiveCodeBench、MMLU-Pro、HumanEval 和 MMLU。每个模型在这条轴上都有一个分数:Gemini 3.1 Pro、GPT-5.2、Gemini 3 Pro、Opus 4.6、Kimi K2.5 得分最高。另一端是:Qwen3-0.6B、OLMo 2 13B、Qwen3-1.7B、DeepSeek-R1-Distill-1.5B、Falcon3-10B。它几乎像是在把前沿模型 vs. 小模型分类,而且有意思的是,也在区分闭源 vs. 开源。

The second pair of components (associated with the second largest singular value) seems to be about harder, new reasoning tasks, and old vs new frontier models: The benchmarks with the largest weights are SimpleQA, ARC-AGI-2, HLE, ARC-AGI-1, and FrontierMath, these are the hard, novel stuff that frontier models do well on. Benchmarks like MATH-500 and MMLU max out in the opposite direction. On the model side, Gemini 3.1 Pro, Opus 4.6, Gemini 3 Pro, and GPT-5.2 score highest: these are the latest frontier models. At the other end: Claude 3.7 Sonnet, DeepSeek-R1, Gemini 2.5 Flash, o3-mini. These were interestingly frontier in early-mid 2025. Model component 2 is almost a "recency of frontier" measure!

第二对成分(对应第二大奇异值)看起来描述的是更难、更“新”的推理任务,以及旧 vs. 新一代前沿模型:权重最大的基准是 SimpleQA、ARC-AGI-2、HLE、ARC-AGI-1 和 FrontierMath——这些都是前沿模型擅长的困难新题。像 MATH-500 和 MMLU 这类基准则在相反方向“拉满”。在模型一侧,Gemini 3.1 Pro、Opus 4.6、Gemini 3 Pro 和 GPT-5.2 得分最高:它们都是最新的前沿模型。另一端是:Claude 3.7 Sonnet、DeepSeek-R1、Gemini 2.5 Flash、o3-mini——它们在 2025 年初到年中还算前沿。模型成分 2 几乎是在衡量“前沿的新鲜度”!

Very interesting!

非常有意思!

Where does BenchPress do the worst?

BenchPress 最差的地方在哪里?

The hardest benchmarks share two traits: either bimodal score distributions (ARC-AGI, IMO, USAMO) or weak correlation with other benchmarks (Terminal-Bench, Arena-Hard). The method relies on benchmark-to-benchmark structure and when it’s not there, the “feature” doesn’t help prediction.

最难的基准往往有两个特征之一:要么分数分布是双峰的(ARC-AGI、IMO、USAMO),要么和其它基准的相关性很弱(Terminal-Bench、Arena-Hard)。这种方法依赖于基准与基准之间的结构;当结构不存在时,“特征”就帮不上预测。

So benchmarks resisting prediction: SimpleBench, ARC-AGI-1 and Terminal-Bench with high errors for predicting them. These are the interesting ones, because they measure something the rest of the matrix doesn't capture.

所以最“抗预测”的基准是:SimpleBench、ARC-AGI-1 和 Terminal-Bench——预测它们的误差很高。这些才是有趣的,因为它们测到了矩阵其余部分没捕捉到的东西。

Competition math (IMO, USAMO) is bimodal: a model either solves olympiad problems or it doesn't. You can't interpolate a step function I guess!

竞赛数学(IMO、USAMO)是双峰的:模型要么会做奥赛题,要么不会。我想你确实没法对阶跃函数做插值吧!

Wait! Can Claude beat BenchPress at guessing scores?

等等!Claude 能在猜分上打败 BenchPress 吗?

OK, this next part isn't purely science because some of these models were released before Sonnet's training cutoff, so Claude might just "know" certain scores. But I was curious: what if you don't use SVD and regression-type predictors and instead use a one-gazillion-parameter model for that? In this case, Claude 😊

好,接下来这部分不算纯科学,因为其中一些模型发布时点早于 Sonnet 的训练截止,所以 Claude 可能只是“知道”某些分数。但我还是很好奇:如果你不使用 SVD 和回归这类预测器,而是用一个参数多到离谱的大模型来做这件事,会怎样?也就是 Claude 😊

For $0.84, we gave Claude Sonnet the partially-filled matrix as a big CSV and asked it to predict the missing entries. Same holdout, same evaluation.

我们花了 $0.84,把部分填充的矩阵作为一个巨大的 CSV 喂给 Claude Sonnet,让它预测缺失项。相同的留出方式,相同的评估。

The result: BenchPress edges out Claude on a fair comparison, 5.8% vs 6.1% median error. Linear algebra beats a trillion-parameter model at filling in a spreadsheet, for free, in under a second. I love it.

结果是:在公平对比下,BenchPress 略胜 Claude——绝对误差中位数 5.8% vs 6.1%。线性代数免费、不到一秒,就能在“补全电子表格”这件事上打赢一个万亿参数模型。我太爱了。

At k=0. zero known scores, just the model's name Claude already predicts Gemini 3.1 Pro's benchmarks to within 2.5 points!! BenchPress at k=0 can only guess column averages, and is off by 17 points. But by k=5 BenchPress catches up.

在 k=0(已知分数为 0,只给模型名字)时,Claude 就能把 Gemini 3.1 Pro 的各项基准预测到 2.5 分以内!!BenchPress 在 k=0 只能猜列均值,会差 17 分。但到 k=5 时 BenchPress 就追上来了。

The takeaway: Claude's world knowledge is worth about 5 benchmark scores of information.

结论是:Claude 的世界知识大约相当于 5 个基准分数的信息量。

After that, linear algebra wins!

之后,线性代数就赢了!

Here's the weird part: Claude actually gets worse as you give it more data. At k=0 it nails Gemini 3.1 Pro at 2.5% error but as you increase to 15 revealed data points the gap increases to 4.7%. Weird! Providing partial scores seems to create a conflict between Claude's prior knowledge and the prompt. Let’s call it a kind of "retrieval-augmented degradation”.

奇怪的是:给 Claude 的数据越多,它反而越差。k=0 时它对 Gemini 3.1 Pro 的误差是 2.5%,但当你把揭示的数据点增加到 15 个时,误差会扩大到 4.7%。怪!提供部分分数似乎在 Claude 的先验知识和提示词之间制造了冲突。我们不妨把这叫作一种“检索增强式退化”。

For BenchPress almost always more data monotonically means better predictions, as you'd expect from any self-respecting early 2000s predictor.

而对 BenchPress 来说,几乎总是数据越多预测越好——这才像一个有自尊的 2000 年代早期预测器该有的样子。

When should you use this?

什么时候该用这个?

Three options for getting a number on "how does my model do on benchmark X":

要回答“我的模型在基准 X 上表现如何?”大概有三种拿到数字的方式:

Option A: Run the benchmark. Cost: $5 to $100k? (I know for a fact TB 2.0 can be O(100k) for some models and harnesses). BUT! You get the true score.

方案 A:直接跑基准。成本:$5 到 $100k?(我确切知道对某些模型和 harness,TB 2.0 可能是 O(100k))。但!你得到的是真实分数。

Option B: Predict from the matrix. Cost: ~$0. But with almost everything that is free, it has its issues, in this case your score is within about 5 points. Not great, not terrible.

方案 B:从矩阵里预测。成本:~$0。但和几乎所有免费的东西一样,它也有问题:你的分数大概能落在 ±5 分以内。不算好,也不算太糟。

Option C: Run a cheaper model as proxy? Cost: less than A. But you get the right answer about the wrong model. Hmm maybe not?

方案 C:用更便宜的模型当代理去跑?成本:比 A 低。但你拿到的是“另一个模型”的正确答案。嗯……好像也不太对?

But to be serious, the decision is a function of cost and error tolerance. If SWE-bench costs $5,000, and 6% error is fine for a go/no-go, just predict it I suppose. If AIME 2026 costs $2, run it.

严肃地说,决策取决于成本和可接受误差。如果 SWE-bench 要花 $5,000,而 6% 误差足以做“放行/不放行”的决定,那我猜就直接预测吧。如果 AIME 2026 只要 $2,那就跑它。

Am I convinced? Hmmm.

我被说服了吗?嗯……

For frontier labs, honestly probably not really. They'll run every eval regardless, I would assume. But for smaller groups trying to train a new model, having a quick sanity check could be helpful? I don't know.

对前沿实验室来说,老实说可能并不会。他们大概无论如何都会把所有评测都跑一遍。但对那些想训练新模型的小团队来说,有一个快速的 sanity check 也许会有用?我不知道。

But it doesn't matter, because this was fun to try!

但这不重要,因为这件事试起来很有趣!

It's kind of hilarious that in pretty much any setting where you have an incomplete matrix to complete, it almost always turns out to rank 2 or 5. Matrix completion for the win!

有点好笑的是:在几乎任何“你有个不完整矩阵要补全”的场景里,最后它几乎总会变成秩 2 或秩 5。矩阵补全万岁!

Perhaps the deeper meaning is that most of what we call "evaluation" is measuring the same two things (because we said rank 2, remember?) over and over, and the benchmarks that escape this pattern are the ones testing new capabilities (hello friends at Terminal-Bench!).

也许更深一层的含义是:我们称之为“评测”的东西,大多是在一遍又一遍地测同样两件事(别忘了我们说的是秩 2),而那些逃离这种模式的基准,恰恰是在测试新能力(Terminal-Bench 的朋友们你们好!)。

But we've nevertheless just convinced ourselves we need to climb 100 evals for every model.

但我们还是把自己说服了:每个模型都得去爬 100 个评测。

So maybe there's a message here for folks trying to come up with new benchmarks too:

所以这也许能给正在设计新基准的人一个信息:

if your eval correlates with existing ones, it's not adding much. The valuable benchmarks are the ones that are nearly orthogonal to everything else. Here's how predictable each benchmark is from the rest of the matrix:

如果你的评测和已有评测高度相关,它基本没增加多少信息。真正有价值的基准,是那些与其它一切几乎正交的。下面是:每个基准能被矩阵其余部分预测到什么程度:

Want to try it on your own model?

想在你自己的模型上试试吗?

The full matrix, all 1,375 cited scores, the BenchPress code, and a predict.py CLI that takes 5 benchmark scores and predicts the other 44 are open-sourced at

完整矩阵、全部 1,375 个带引用的分数、BenchPress 代码,以及一个 predict.py CLI(输入 5 个基准分数,预测另外 44 个)都已开源在:

https://github.com/anadim/llm-benchmark-matrix

https://github.com/anadim/llm-benchmark-matrix

Coda

后记

I spent two days on this, mass-prompting Claude Code and Codex, spent $0 on GPUs, and ended up with a result that would have been a decent workshop paper, if I were writing things more seriously.

我在这上面花了两天,疯狂给 Claude Code 和 Codex 喂提示词,GPU 花费是 $0,最后得到的结果,如果我写得更严肃一点,甚至可以当一篇像样的 workshop 论文。

I still don't know what to make of that. Part of me thinks it's amazing, and enables more research. Part of me wonders what happens when everyone has that, and whether the thing that matters shifts from "can you get the answer" to "did you ask the right question."

我到现在也不知道该怎么理解这件事。一部分我觉得这很惊人,也能推动更多研究;另一部分我又在想:当每个人都拥有这种能力时会怎样?重要性是否会从“你能不能得到答案”,转移到“你问的是不是对的问题”。

Did I? I don't know! But I had fun finding out.

我问对了吗?我不知道!但我玩得很开心。

Anyway. Matrix completion for the win.

总之,矩阵补全万岁。

Have fun!

玩得开心!



The (model, eval) matrix was assembled from official model announcements, technical reports, leaderboards, and evaluation platforms. Every entry has a source URL. The code was written by Claude Code, audited by Codex, with the audit rebutted by Claude Code and counter-rebutted by Codex. They kind of had a bit of a fight about it, and I literally told them don't fight, be constructive lol.

这份(model, eval)矩阵来自官方模型发布、技术报告、排行榜以及评测平台。每个条目都有来源 URL。代码由 Claude Code 编写,Codex 审计;审计意见又被 Claude Code 反驳,随后 Codex 再反驳回去。它们有点吵起来了,我还真对它们说了“别吵了,要建设性一点 lol”。

What did I do? I mostly gave prompts. Sort of.

我做了什么?我主要就是写提示词。差不多吧。

If you use this work, please cite:

如果你使用这项工作,请引用:

@misc{papailiopoulos2026benchpress,

@misc{papailiopoulos2026benchpress,

title={You Don't Need to Run Every Eval},

title={You Don't Need to Run Every Eval},

author={Dimitris Papailiopoulos},

author={Dimitris Papailiopoulos},

year={2026},

year={2026},

url={https://github.com/anadim/llm-benchmark-matrix},

url={https://github.com/anadim/llm-benchmark-matrix},

note={Blog post}

note={Blog post}

}

}

Link: http://x.com/i/article/2026523085545857024

链接:http://x.com/i/article/2026523085545857024

相关笔记

You Don't Need to Run Every Eval

  • Source: https://x.com/dimitrispapail/status/2026531440414925307?s=46
  • Mirror: https://x.com/dimitrispapail/status/2026531440414925307?s=46
  • Published: 2026-02-25T05:35:21+00:00
  • Saved: 2026-02-26

Content

I used Claude Code to build BenchPress a $0 benchmark prediction system, Codex to audit it for bugs, and Claude Sonnet to try to beat it for $1. Here's what I found: LLM evals are so low-rank (in fact rank 2) that 5 benchmarks can predict the other 44 to within 5 points (and many times less).

Most evals are approximately... redundant.

Let me give you an example for why one should expect this.

Here's a table with three frontier models and three benchmarks.

What would you guess the missing entry is?

Maybe around 92-94?

Sounds about right, it's 93.

You didn't need to evaluate GPT-5.2 on GPQA-D, which would have a nontrivial $ cost. What made you guess reasonably well? I hope you won't find it presumptuous of me to assume it's likely because the rows are nearly identical. In fact, these models, on evals at least, are nearly identical.

The point that I want to stress is that there's a ton of similarity that shows up everywhere: across models from the same generation, models of similar sizes, and evals that test similar skills and correlate almost perfectly (e.g., AIME vs. HMMT). And a benchmark that's hard for GPT-5.2 tends to be hard for Claude Opus 4.6 too (e.g., Terminal-Bench 2.0). There's redundancy in basically every direction you can think of measuring things about these models and benchmarks.

So, a simple question crept into my mind: can you exploit this … low-dimensional structure to predict evals?

Every atom in my brain said yes, and so I wanted to check.

So here goes, a few tens of millions of Claude Code and Codex tokens later...

But a short history lesson first.

An old trick to obtain a million dollar matrix

When I was a grad student, I (and every other EE theory kid 15 years ago) was really into compressed sensing and matrix completion (hi @beenwrekt!). The whole point of that theory is: observe a few entries of a matrix, and if you're lucky and the ground truth matrix is approximately low rank, you can recover the rest from surprisingly few measurements by doing variants of the most classic linear algebra algorithm: Singular Value Decomposition.

There was even a famous $1M Netflix Prize for this: in 2006, Netflix released a matrix of about 100 million (user, movie) ratings and offered $1M (so like 2.5 weeks of 1024 H100s on Lambda) to whoever could predict the missing ratings 10% better than Netflix's own algorithm, or something like this. The competition ran for three years, drew tens of thousands of teams, and even though a team did get the $1M, Netflix never actually deployed the winning solution! But if I'm not wrong, the competition basically invented modern recommender systems and launched a few thousand ML careers.

Here's how matrix completion would work in our (model, eval) setting.

It's kind of simple.

Suppose each model can be described by a vector of two entries, say: how good it is at general tasks and how good it is at hard novel reasoning. And each benchmark is described the same way, e.g., how much it tests general knowledge vs. how much it tests reasoning. The final model score is their dot product.

Say GPT-4o has s = (8, 2): strong on general tasks, weak on reasoning. AIME 2025 has b = (1, 8), that is it's mostly testing reasoning. Then GPT-4o on AIME 2025 is 8×1 + 2×8 = 24. (Actually, I think that's close to its true score; didn't even plan this :D.) Now if MMLU has b = (9, 1), i.e., mostly general knowledge, then GPT-4o on MMLU is 8×9 + 2×1 = 74.

The obvious-in-hindsight insight is that every (model, eval) entry can be produced by the inner product of two 2-dimensional vectors. We'll see later that these two dimensions actually tend to explain the data pretty well!

So the game is: observe a few real (model, eval) scores, solve for the hidden vectors, predict the rest.
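As a sanity check, the toy factorization above can be written out in a few lines. The vectors are the post's illustrative numbers, not fitted values:

```python
import numpy as np

# Hypothetical 2-D latent vectors from the toy example above:
# each model is (general ability, hard reasoning), each benchmark
# is (how much it tests general knowledge, how much it tests reasoning).
models = {"GPT-4o": np.array([8.0, 2.0])}
benchmarks = {
    "AIME 2025": np.array([1.0, 8.0]),  # mostly reasoning
    "MMLU": np.array([9.0, 1.0]),       # mostly general knowledge
}

def predicted_score(model: str, bench: str) -> float:
    """In a rank-2 world, every (model, eval) score is a dot product."""
    return float(models[model] @ benchmarks[bench])

print(predicted_score("GPT-4o", "AIME 2025"))  # 8*1 + 2*8 = 24
print(predicted_score("GPT-4o", "MMLU"))       # 8*9 + 2*1 = 74
```

Fitting the hidden vectors from partially observed scores is exactly what the rest of the post does.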

83 x 49

But first, we need a matrix!

I asked Claude to search and verify all possible (model, eval) pairs it could find between January 2025 and now. It found 83 models across 49 benchmarks, every entry cited with a source URL. I asked Codex to audit the findings for hallucinations and to double-check every entry is valid by independently verifying them.

The final matrix is 34% filled, meaning roughly 1.4k out of ~4k cells have a ground-truth number. The rest are missing because Claude couldn't find a reported score. This is what it looks like:

The 83 models span 21 labs (the usual suspects), and 49 benchmarks (also the usual suspects).

But, is this matrix actually low-rank? Easy to check: find a large fully-observed submatrix and just run SVD on it. The biggest complete block I found is 31 models × 10 benchmarks, and here's how the singular values look:

The first component alone captures 71% of the spectrum. The first three capture >90%. So yeah, as with every single matrix I've ever seen in my whole life, this one is also kinda rank-3 😊
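The spectrum check is easy to reproduce. The real 31×10 block isn't included here, so the sketch below uses a noisy rank-2 stand-in; on such a matrix the top components should similarly dominate:

```python
import numpy as np

def spectrum_share(M: np.ndarray, k: int) -> float:
    """Fraction of the singular-value mass captured by the top-k components."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(s[:k].sum() / s.sum())

# Synthetic stand-in for a fully observed 31x10 block: exact rank 2
# plus small noise, so the spectrum should concentrate up top.
rng = np.random.default_rng(0)
M = rng.normal(size=(31, 2)) @ rng.normal(size=(2, 10)) \
    + 0.01 * rng.normal(size=(31, 10))
print(spectrum_share(M, 1), spectrum_share(M, 3))
```

The same one-liner on the real block is what produces the 71% / >90% numbers quoted above.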

If you've ever done matrix completion in your life, you already know exactly how the rest of this post goes.

But, I was still skeptical that matrix completion would work, as it typically operates on much larger matrices. But what's the cost of checking with Claude Code and Codex, right?

I wanted to know: how many benchmark scores do you need for a given model before you can predict the rest?

Cute question. Let’s find out!

Does matrix completion work on 83x49?

It kind of does! And for good values of "kind of". The best method, which I will now call BenchPress (details in the next section), predicts held-out scores with 7% median absolute error. To make that concrete, some examples:

Agentic, competition math, coding, small open-weight model are all within a couple points, and I will share all the results in the github repo. But we got some ugly ones too:

Off by 45+ points. Ouch. We'll come back to why.

BenchPress: A $0 approximation to your eval suite

The final prediction model has two dead simple ingredients.

Ingredient 1: sparse regression. For each missing cell, find the 5 benchmarks that best predict the target. Concretely: to predict AIME 2025, take another benchmark like MMLU-Pro, plot every model that has scores on both as a point (x = its MMLU-Pro score, y = its AIME 2025 score), and fit a straight line through those points in logit space. The better the line fits, the more MMLU-Pro tells you about AIME 2025. Do this for all 48 other benchmarks, keep the 5 where the line fits best, and average their predictions weighted by fit quality.
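A minimal sketch of ingredient 1, using made-up scores for two correlated benchmarks (the scores and benchmark names here are purely hypothetical). The logit transform described later in the post is included, since the fits happen in logit space:

```python
import numpy as np

def logit(p):
    """Map scores in (0, 1) to the real line; stretches the extremes."""
    p = np.clip(p, 1e-3, 1 - 1e-3)
    return np.log(p / (1 - p))

def pairwise_fit(x, y):
    """Fit y ~ a*x + b in logit space over the models scored on both
    benchmarks; return predictions and R^2 as the fit-quality weight."""
    lx, ly = logit(x), logit(y)
    a, b = np.polyfit(lx, ly, 1)
    pred = a * lx + b
    ss_res = np.sum((ly - pred) ** 2)
    ss_tot = np.sum((ly - ly.mean()) ** 2)
    return pred, 1 - ss_res / ss_tot

# Hypothetical fractional scores for 6 models on two correlated benchmarks.
mmlu_pro = np.array([0.55, 0.62, 0.70, 0.78, 0.84, 0.90])
aime = np.array([0.20, 0.30, 0.45, 0.60, 0.75, 0.88])
pred, r2 = pairwise_fit(mmlu_pro, aime)
print(round(float(r2), 3))
```

In the full method this fit would be repeated against all 48 other benchmarks, keeping the 5 best-fitting lines and averaging their predictions weighted by fit quality.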

Ingredient 2: rank-2 SVD. Imagine every model is a point in 49-dimensional benchmarks space. Rank-2 SVD says: find the best 2D plane through these points such that the projections onto that plane are as close to the real scores as possible. Once you have the plane, every model gets a position on it (2 coordinates) and every benchmark gets a direction on it (2 coordinates) and their dot product gives the predicted score.

Why rank 2? Because we swept ranks 1 through 8 and rank 2 wins. Rank 3 is already worse by half a point, rank 8 blows up to 13% in absolute error. The third singular value looks real in the spectrum that I shared above, but it's below the noise floor for prediction.

There's a small problem: 66% of the entries are missing, so this doesn't work out of the box. A standard recipe in matrix completion says to start with zeros, or in our case with column averages, then do SVD, project onto the singular vectors, replace only the missing entries with the projections while keeping observed scores pinned, and repeat until the projections stop changing.
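That imputation loop is a few lines of NumPy. The sketch below is a generic hard-impute on a synthetic noiseless rank-2 matrix, not the BenchPress implementation:

```python
import numpy as np

def complete_rank_k(M_obs, mask, k=2, iters=200):
    """Iterative hard-impute: initialize missing cells with column means
    of the observed entries, then alternate truncated SVD with re-imputing
    only the missing cells. Observed entries stay pinned throughout."""
    X = np.where(mask, M_obs, np.nan)
    col_means = np.nanmean(X, axis=0)
    X = np.where(mask, M_obs, col_means[None, :])
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k]
        X = np.where(mask, M_obs, low_rank)
    return X

# Synthetic exact-rank-2 ground truth with ~40% of entries hidden.
rng = np.random.default_rng(1)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 12))
mask = rng.random(truth.shape) > 0.4
filled = complete_rank_k(truth, mask, k=2)

col_means = np.nanmean(np.where(mask, truth, np.nan), axis=0)
baseline = np.where(mask, truth, col_means[None, :])
err = np.abs(filled - truth)[~mask].mean()
err_base = np.abs(baseline - truth)[~mask].mean()
print(round(float(err), 4), round(float(err_base), 4))
```

On noiseless low-rank data the loop recovers the hidden entries far better than the column-mean initialization it starts from.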

One important detail: Claude Code found that everything should happen in logit space: the logit transform, logit(p) = log(p/(1-p)), stretches scores near 0% and 100% and compresses the middle, so the gap between 88% and 92% gets the weight it deserves. This fix improved final accuracy by 10% or so. I’m sure it’s a standard trick in old school matrix completion.

A convex combination of the two ingredients: 60% of the entry comes from regression + 40% from the SVD. Why? Because Claude Code did a sweep of combinations and that was the best!

No neural networks or metadata used. We (by we I mean Claude) tried KNN instead of SVD, but KNN and the regression are both local methods that make correlated mistakes. Swapping in SVD gave a 14% improvement.

One thing Claude tried that I asked for and also expected to work: feeding BenchPress model metadata e.g., parameter count, provider, reasoning mode, open-weight status. It made things worse. Everything in "14B reasoning model distilled from DeepSeek" is already captured by the model's pattern across scores. Cool!

Also, Claude tried NMF, PMF, nuclear norm, and ALS, and none was better than good old SVD. You love to see it!

How many benchmarks do you need? Five.

Here's an experiment: take a model, hide all its scores, then reveal them one at a time in random order. After each reveal, build BenchPress and use it to predict the rest. How quickly does the error come down?

The answer is: fast. With just 1 known score, median error (across all models!) is about 12%. By 5 known scores it drops to 9%. After that it's diminishing returns: the precision of individual predictions keeps improving, but the median stabilizes because of a few hard-to-predict benchmarks that stay noisy.

The practical takeaway is that 5 benchmarks are enough to get a good prediction of how the model performs on everything else.
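The reveal-one-at-a-time experiment can be sketched with a deliberately simple predictor (column means plus the model's mean offset on revealed benchmarks, a crude stand-in for BenchPress) on a synthetic additive score matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
difficulty = rng.uniform(0.2, 0.8, 20)  # per-benchmark baseline score
ability = rng.uniform(-0.2, 0.2, 15)    # per-model offset
# Additive (hence low-rank) synthetic matrix: score = ability + difficulty.
scores = ability[:, None] + difficulty[None, :]

def predict_row(scores, row, revealed):
    """Crude predictor: column means shifted by the model's mean
    offset on whatever benchmarks have been revealed so far."""
    col_means = scores.mean(axis=0)
    shift = (scores[row, revealed] - col_means[revealed]).mean() \
        if len(revealed) else 0.0
    return col_means + shift

row = 0
order = rng.permutation(scores.shape[1])  # random reveal order
errors = {}
for k in [0, 1, 5, 10]:
    revealed, hidden = order[:k], order[k:]
    pred = predict_row(scores, row, revealed)
    errors[k] = np.median(np.abs(pred[hidden] - scores[row, hidden]))
print({k: round(float(v), 4) for k, v in errors.items()})
```

On this exactly-additive toy the error collapses after a single reveal; on the real matrix the curve flattens around 9% instead, because a few benchmarks stay noisy.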

Here are some example curves of prediction error vs. number of revealed tests for a few popular models from the past year.

In fact, if you remove ARC-AGIs the error improves :D I'd say that's a good sign for SVD, bad sign for ARC-AGI.

If you can only pick 5 to eval on, greedy forward selection gives you:

{HLE, AIME 2025, LiveCodeBench, SWE-bench Verified, SimpleQA}

They span four categories and cover both principal components. Under proper holdout they get about 7.8% error, versus 7% for the full method. Not everything, but a practical minimum eval set.
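Greedy forward selection itself is straightforward. Here's a sketch that repeatedly adds the column minimizing least-squares reconstruction error of the full matrix, run on a rank-3 toy matrix rather than the real one:

```python
import numpy as np

def greedy_select(M, n_pick=5):
    """Greedy forward selection: at each step, add the benchmark (column)
    that most reduces the squared error when all columns are predicted by
    least squares from the chosen subset (plus a bias term)."""
    n_models, n_bench = M.shape
    chosen = []
    for _ in range(n_pick):
        best_j, best_err = None, np.inf
        for j in range(n_bench):
            if j in chosen:
                continue
            X = np.column_stack([M[:, chosen + [j]], np.ones(n_models)])
            coef, *_ = np.linalg.lstsq(X, M, rcond=None)
            err = np.mean((X @ coef - M) ** 2)
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen

rng = np.random.default_rng(3)
M = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 12))  # rank-3 toy matrix
picked = greedy_select(M, n_pick=3)
print(picked)
```

The toy version minimizes in-sample reconstruction error; the reported 7.8% figure comes from proper holdout, which is the fair way to score the selected set.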

This seems legitimately cool to me!

What does the 2 in rank-2 mean?

The SVD gives you two things: a weight for each benchmark and a score for each model, on each component. The benchmark weights tell you what the component measures. The model scores tell you where each model falls on that axis.

When you extract the 2 top singular vectors (left and right) you get something interesting. I’ll refer to these singular vectors as “components” in the next two paragraphs.

The first pair of components (associated with the max singular value) seem to be about general capability: The benchmarks with the largest weights are GPQA-D, LiveCodeBench, MMLU-Pro, HumanEval, and MMLU. Every model gets a score on this axis: Gemini 3.1 Pro, GPT-5.2, Gemini 3 Pro, Opus 4.6, Kimi K2.5 score highest. At the other end: Qwen3-0.6B, OLMo 2 13B, Qwen3-1.7B, DeepSeek-R1-Distill-1.5B, Falcon3-10B. It's almost a classifier for frontier vs. small models and, interestingly, closed vs. open.

The second pair of components (associated with the second largest singular value) seems to be about harder, new reasoning tasks, and old vs new frontier models: The benchmarks with the largest weights are SimpleQA, ARC-AGI-2, HLE, ARC-AGI-1, and FrontierMath, these are the hard, novel stuff that frontier models do well on. Benchmarks like MATH-500 and MMLU max out in the opposite direction. On the model side, Gemini 3.1 Pro, Opus 4.6, Gemini 3 Pro, and GPT-5.2 score highest: these are the latest frontier models. At the other end: Claude 3.7 Sonnet, DeepSeek-R1, Gemini 2.5 Flash, o3-mini. These were interestingly frontier in early-mid 2025. Model component 2 is almost a "recency of frontier" measure!

Very interesting!

Where does BenchPress do the worst?

The hardest benchmarks share two traits: either bimodal score distributions (ARC-AGI, IMO, USAMO) or weak correlation with other benchmarks (Terminal-Bench, Arena-Hard). The method relies on benchmark-to-benchmark structure and when it’s not there, the “feature” doesn’t help prediction.

So the benchmarks resisting prediction are SimpleBench, ARC-AGI-1, and Terminal-Bench, all with high prediction errors. These are the interesting ones, because they measure something the rest of the matrix doesn't capture.

Competition math (IMO, USAMO) is bimodal: a model either solves olympiad problems or it doesn't. You can't interpolate a step function I guess!

Wait! Can Claude beat BenchPress at guessing scores?

OK, this next part isn't purely science because some of these models were released before Sonnet's training cutoff, so Claude might just "know" certain scores. But I was curious: what if you don't use SVD and regression-type predictors and instead use a one-gazillion-parameter model for that? In this case, Claude 😊

For $0.84, we gave Claude Sonnet the partially-filled matrix as a big CSV and asked it to predict the missing entries. Same holdout, same evaluation.

The result: BenchPress edges out Claude on a fair comparison, 5.8% vs 6.1% median error. Linear algebra beats a trillion-parameter model at filling in a spreadsheet, for free, in under a second. I love it.

At k=0 (zero known scores, just the model's name), Claude already predicts Gemini 3.1 Pro's benchmarks to within 2.5 points!! BenchPress at k=0 can only guess column averages, and is off by 17 points. But by k=5 BenchPress catches up.

The takeaway: Claude's world knowledge is worth about 5 benchmark scores of information.

After that, linear algebra wins!

Here's the weird part: Claude actually gets worse as you give it more data. At k=0 it nails Gemini 3.1 Pro at 2.5% error, but as you reveal up to 15 data points its error grows to 4.7%. Weird! Providing partial scores seems to create a conflict between Claude's prior knowledge and the prompt. Let's call it a kind of "retrieval-augmented degradation".

For BenchPress, more data almost always means monotonically better predictions, as you'd expect from any self-respecting early-2000s predictor.

When should you use this?

Three options for getting a number on "how does my model do on benchmark X":

Option A: Run the benchmark. Cost: $5 to $100k? (I know for a fact TB 2.0 can be O($100k) for some models and harnesses). BUT! You get the true score.

Option B: Predict from the matrix. Cost: ~$0. But as with almost everything that is free, it has its issues: in this case your score is only within about 5 points. Not great, not terrible.

Option C: Run a cheaper model as proxy? Cost: less than A. But you get the right answer about the wrong model. Hmm maybe not?

But to be serious, the decision is a function of cost and error tolerance. If SWE-bench costs $5,000, and 6% error is fine for a go/no-go, just predict it I suppose. If AIME 2026 costs $2, run it.

Am I convinced? Hmmm.

For frontier labs, honestly probably not really. They'll run every eval regardless, I would assume. But for smaller groups trying to train a new model, having a quick sanity check could be helpful? I don't know.

But it doesn't matter, because this was fun to try!

It's kind of hilarious that in pretty much any setting where you have an incomplete matrix to complete, it almost always turns out to be rank 2 or 5. Matrix completion for the win!

Perhaps the deeper meaning is that most of what we call "evaluation" is measuring the same two things (because we said rank 2, remember?) over and over, and the benchmarks that escape this pattern are the ones testing new capabilities (hello friends at Terminal-Bench!).

But we've nevertheless just convinced ourselves we need to climb 100 evals for every model.

So maybe there's a message here for folks trying to come up with new benchmarks too:

if your eval correlates with existing ones, it's not adding much. The valuable benchmarks are the ones that are nearly orthogonal to everything else. Here's how predictable each benchmark is from the rest of the matrix:

Want to try it on your own model?

The full matrix, all 1,375 cited scores, the BenchPress code, and a predict.py CLI that takes 5 benchmark scores and predicts the other 44 are open-sourced at

https://github.com/anadim/llm-benchmark-matrix

Coda

I spent two days on this, mass-prompting Claude Code and Codex, spent $0 on GPUs, and ended up with a result that would have been a decent workshop paper, if I were writing things more seriously.

I still don't know what to make of that. Part of me thinks it's amazing, and enables more research. Part of me wonders what happens when everyone has that, and whether the thing that matters shifts from "can you get the answer" to "did you ask the right question."

Did I? I don't know! But I had fun finding out.

Anyway. Matrix completion for the win.

Have fun!


The (model, eval) matrix was assembled from official model announcements, technical reports, leaderboards, and evaluation platforms. Every entry has a source URL. The code was written by Claude Code, audited by Codex, with the audit rebutted by Claude Code and counter-rebutted by Codex. They kind of had a bit of a fight about it, and I literally told them don't fight, be constructive lol.

What did I do? I mostly gave prompts. Sort of.

If you use this work, please cite:

@misc{papailiopoulos2026benchpress,

title={You Don't Need to Run Every Eval},

author={Dimitris Papailiopoulos},

year={2026},

url={https://github.com/anadim/llm-benchmark-matrix},

note={Blog post}

}

Link: http://x.com/i/article/2026523085545857024

📋 讨论归档

讨论进行中…