🧠 阿头学 · 💬 Discussion topic

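Model capability evaluations must include test-time compute as a variable

Existing benchmarks break down because they ignore test-time compute; evaluation must shift to performance-versus-compute curves to reflect models' real capabilities and safety risks.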

2026-06-09

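Core points

Benchmarks are breaking down: a single score hides how much a result depends on compute, so real performance cannot be compared across budgets.
Safety-evaluation blind spot: safety evaluations that ignore the inference budget cannot guard against high-compute attacks; a state actor could spend more than $10 million of inference to push past the limits.
Rebuilding the evaluation standard: put token count, dollar cost, or wall-clock time on the x-axis and plot a performance scaling curve instead of recording a scalar endpoint.
Extrapolation is unreliable: inferring high-budget risk from low-budget evaluations is logically shaky, because the performance plateau may not be predictable at all.

How this relates to us

For ATou, this means higher safety-compliance costs; the next step is to build multi-tier compute stress tests into the release process.
For Neta, this changes how benchmarks should be read; the next step is to reject single-score leaderboards and analyze the performance-cost slope instead.
For Uota, this means procurement criteria need adjusting; the next step is to require vendors to provide marginal-utility curves at different budgets.

Discussion prompts

If the performance plateau disappears entirely, how do we determine a model's capability ceiling?
When high-budget evaluation is unaffordable, is extrapolating high-budget risk from low-budget runs reliable?
Token costs differ widely across vendors; how do we establish a unified measure of compute?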

tl;dr: As LLMs become more capable, benchmark performance is increasingly a function of test-time compute. In fact, we likely don't know what the capability ceiling is for modern LLMs because it's too expensive to measure. We should change LLM evaluations to account for that by measuring performance vs tokens, cost, or time.

The day GPT-5.5 was released, the initial reaction was skepticism. The benchmark numbers were better, but not by much:

However, within hours, once people had time to play around with the model, it became clear that it was a step-change compared to GPT-5.4. The classic "benchmark grid" clearly wasn't telling the full story. Why is that?

The reason becomes clearer when we compare GPT-5.5 to 5.4 with tokens on the x-axis:

GPT-5.5 wasn't being evaluated at the same token budget (or dollar budget) as 5.4. Once we control for test-time compute, 5.5 looks substantially stronger than 5.4.
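As a minimal sketch of what "controlling for test-time compute" could look like in practice, the snippet below compares two score-versus-token curves at the same budget by interpolating linearly in log10(tokens). The function name `score_at_budget`, the choice of log-linear interpolation, and every number in the example are illustrative assumptions, not measurements of any real model.

```python
import math


def score_at_budget(curve, budget):
    """Interpolate a score at `budget` tokens, linearly in log10(tokens).

    `curve` is a list of (tokens, score) pairs. Budgets outside the measured
    range raise ValueError instead of being silently extrapolated.
    """
    pts = sorted(curve)
    if len(pts) < 2:
        raise ValueError("need at least two measured budgets")
    if not pts[0][0] <= budget <= pts[-1][0]:
        raise ValueError("budget outside the measured range")
    for (t0, s0), (t1, s1) in zip(pts, pts[1:]):
        if t0 < t1 and t0 <= budget <= t1:
            frac = (math.log10(budget) - math.log10(t0)) / (
                math.log10(t1) - math.log10(t0)
            )
            return s0 + frac * (s1 - s0)
    raise ValueError("curve needs at least two distinct budgets")


# Hypothetical curves: (tokens per task, benchmark score).
older_model = [(1e4, 0.30), (1e5, 0.40), (1e6, 0.45)]
newer_model = [(1e4, 0.32), (1e5, 0.50), (1e6, 0.65)]

matched_budget = 3e5  # compare both models at the same token budget
gap = score_at_budget(newer_model, matched_budget) - score_at_budget(
    older_model, matched_budget
)
print(f"score gap at {matched_budget:.0e} tokens: {gap:.3f}")
```

Refusing to extrapolate is a deliberate choice in this sketch: a matched-budget comparison is only meaningful where both models were actually measured.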

Frequently when I discuss this, people ask why we don't just evaluate with a harness that pushes test-time compute until performance plateaus. The problem is that, empirically, the plateau is very far out. Sometimes we may not observe a plateau at all within practical budgets. Here's @karpathy's autoresearch experiment, where the performance continues to improve even after hundreds of experiments:

And here is the @AISecurityInst's cyber eval, where performance for Mythos and GPT-5.5 continues to improve rapidly even after 100M tokens:

Notice that the stronger the model, the larger its performance improvement over time. It seems likely that as models become stronger they become more effective at operating over longer horizons. The point of plateau is pushed out, and may even disappear.

For this reason, I believe the proper way to evaluate models is with a performance vs test-time compute plot, with either tokens, cost, or wall-clock time on the x-axis. A few benchmarks have already moved in this direction. For example, ARC-AGI measures score vs cost.
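One way such a curve could be built from raw rollouts is sketched below. It assumes each rollout logs the cumulative token count at which the task was first solved (or `None` if it never was), and reports the fraction solved within each budget. The function name `performance_curve`, the logging format, and the sample data are hypothetical.

```python
from bisect import bisect_right


def performance_curve(solve_tokens, budgets):
    """Return [(budget, fraction of rollouts solved within budget), ...].

    `solve_tokens` holds, per rollout, the cumulative tokens used when the
    task was first solved, or None if it was never solved.
    """
    n = len(solve_tokens)
    if n == 0:
        raise ValueError("need at least one rollout")
    solved = sorted(t for t in solve_tokens if t is not None)
    # bisect_right counts how many solves happened at or below each budget.
    return [(b, bisect_right(solved, b) / n) for b in budgets]


# Hypothetical rollouts: three solved at different costs, one never solved.
rollouts = [1_000, 50_000, None, 2_000_000]
for budget, frac in performance_curve(rollouts, [10_000, 100_000, 10_000_000]):
    print(f"{budget:>12,} tokens -> {frac:.2f} solved")
```

The same function works with dollars or seconds in place of tokens, as long as the logged values and the budgets share a unit.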

Another reasonable option is to set an explicit token/time/cost budget and communicate it to the model. That mirrors how humans are evaluated in settings like the SAT or the International Mathematical Olympiad.

Each x-axis has tradeoffs. Tokens are not directly comparable across models because tokenizers, speeds, and per-token costs differ. Dollars depend on implementation details such as batching and hardware utilization, so cost and latency can trade off. Finally, wall-clock time is an imperfect measurement because multi-agent techniques like best-of-N can scale test-time compute without significantly increasing latency. Still, any of these curves is more informative than a single scalar.
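The best-of-N point can be made concrete with a toy model. The sketch below assumes N independent samples run fully in parallel and a perfect verifier that picks a correct sample whenever one exists, so the success rate is 1 - (1 - p)^N. The function name `best_of_n`, the price, and all other numbers are illustrative assumptions.

```python
def best_of_n(p_single, tokens_per_sample, seconds_per_sample, n, usd_per_million_tokens):
    """Toy accounting for best-of-N under idealized assumptions.

    Assumes independent samples, a perfect verifier, and full parallelism,
    so tokens and dollars grow with n while wall-clock time does not.
    """
    total_tokens = n * tokens_per_sample
    return {
        "n": n,
        "success": 1.0 - (1.0 - p_single) ** n,
        "tokens": total_tokens,
        "usd": total_tokens / 1_000_000 * usd_per_million_tokens,
        "wall_clock_s": seconds_per_sample,
    }


# Hypothetical task: 20% single-sample success, 10k tokens and 60 s per sample.
for n in (1, 4, 16, 64):
    print(best_of_n(0.2, 10_000, 60.0, n, usd_per_million_tokens=10.0))
```

Under these assumptions the three x-axes tell different stories about the same runs: tokens and dollars scale with N while latency stays flat, which is why wall-clock time alone can understate test-time compute.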

Implications for AI Preparedness

Before a frontier model is released, labs typically evaluate cyber, bio, and other misuse risks. If a model crosses a capability threshold, then release may be delayed until mitigations are in place. But if capability is a function of inference compute, then at what inference budget should safety evaluations be run?

In practice, most safety evaluations for model releases do not consider how much inference compute the model is given. The release of Gemini 3 Deep Think, and the resulting outcry, is a useful example.

When Gemini 3 Deep Think was released, its benchmark scores were much higher than previous models. However, no model card evaluating its risks was released alongside it.

This led to outrage from some in the AI safety community.

https://x.com/@karpathy

In my opinion, the criticism of DeepMind's release missed the deeper issue: that AI labs and safety orgs don't consistently account for test-time compute when evaluating models for release.

Deep Think appears likely to be a scaffold of other models that do have system cards. Anyone externally could likely reproduce such a scaffold. In other words, it seems likely that the capabilities of Deep Think were available anyway to anyone willing to pay for Deep Think amounts of inference, by scaffolding a bunch of model queries together. Deep Think just makes that more convenient for the casual user.

In my opinion, the real outrage should have been that when Gemini 3 and other models were released, their system cards did not measure benchmark performance as a function of test-time compute. In my ideal world, model evaluations would look something like this:

A dedicated state actor could apply more than $10 million of inference to a single task. But evaluating a model typically involves thousands if not millions of rollouts, so evaluating at such high compute budgets for every rollout would be impractical. Fortunately, performance seems to scale somewhat predictably with the amount of inference compute applied. For this reason, we could evaluate at relatively low inference budgets and then project (with uncertainty) what capabilities might be at much higher budgets.
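A rough sketch of projecting with uncertainty follows. It fits score = a + b * log10(tokens) to low-budget measurements by ordinary least squares and uses a bootstrap over the measured points to get an interval for a much larger budget. The log-linear form, the bootstrap, the 5%-95% interval, and the data are all assumptions made for illustration; the text only says performance scales "somewhat predictably", and real curves may bend or saturate.

```python
import math
import random


def fit_log_linear(points):
    """Least-squares fit of score = a + b * log10(tokens); returns (a, b)."""
    xs = [math.log10(t) for t, _ in points]
    ys = [s for _, s in points]
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct budgets")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def project(points, target_tokens, n_boot=2000, seed=0):
    """Return (estimate, low, high) for the score at `target_tokens`.

    low/high are the 5th and 95th percentiles of bootstrap refits.
    Scores are clipped to [0, 1].
    """
    rng = random.Random(seed)

    def predict(pts):
        a, b = fit_log_linear(pts)
        return min(1.0, max(0.0, a + b * math.log10(target_tokens)))

    estimate = predict(points)
    samples = []
    while len(samples) < n_boot:
        resample = [rng.choice(points) for _ in points]
        if len({t for t, _ in resample}) < 2:
            continue  # a fit needs at least two distinct budgets
        samples.append(predict(resample))
    samples.sort()
    return estimate, samples[int(0.05 * (n_boot - 1))], samples[int(0.95 * (n_boot - 1))]


# Hypothetical low-budget measurements: (tokens per task, score).
measured = [(1e4, 0.12), (3e4, 0.15), (1e5, 0.21), (3e5, 0.24), (1e6, 0.31)]
est, low, high = project(measured, target_tokens=1e9)
print(f"projected score at 1e9 tokens: {est:.2f} (bootstrap 5%-95%: {low:.2f}-{high:.2f})")
```

The interval here only reflects noise in the measured points. It says nothing about whether the log-linear shape holds three orders of magnitude beyond the data, which is the larger source of uncertainty.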

Long-horizon evaluations can introduce complexities that may not always be addressed with extrapolation from smaller budgets. For example, it may turn out that the only way to confidently evaluate misalignment in an AI agent at a 1-year horizon is to actually run the agent for a year. AI labs may soon find themselves in a strange position where the operating horizon of their agents exceeds the development cycle of new models. At that point, it may be impossible to finish evaluations of a model over its maximum operating lifetime ahead of release without delaying the release of the model.

Specific Recommendations

Concretely, I recommend the following to the AI community:

AI labs should publish benchmark performance of newly released models with tokens, cost, or time on an x-axis. At a minimum, labs should report the inference budget used to achieve a scalar benchmark result.
Benchmarks should track inference usage on leaderboards, or have an explicit token/cost/time budget. Many benchmarks have already shifted in this direction, but it is not yet standard practice.
Preparedness Frameworks and Responsible Scaling Policies should explicitly account for inference compute when determining whether a model crosses a safety threshold. Additionally, evaluations should estimate capabilities at multiple inference budgets, including projections from smaller-budget runs with stated uncertainty.
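The minimum reporting requirement in the first recommendation could be enforced at the data level. The sketch below is one hypothetical record format in which a score cannot be stored without at least one inference-budget field; the class name `BenchmarkResult` and its field names are assumptions, not an existing leaderboard schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkResult:
    """A scalar score that must carry the inference budget used to get it."""

    model: str
    benchmark: str
    score: float
    tokens_per_task: Optional[float] = None
    usd_per_task: Optional[float] = None
    seconds_per_task: Optional[float] = None

    def __post_init__(self):
        budget_fields = (self.tokens_per_task, self.usd_per_task, self.seconds_per_task)
        if all(value is None for value in budget_fields):
            raise ValueError("report at least one inference-budget field with the score")


# Hypothetical entry; the model and benchmark names are placeholders.
entry = BenchmarkResult(
    model="example-model",
    benchmark="example-benchmark",
    score=0.62,
    tokens_per_task=250_000,
    usd_per_task=3.10,
)
print(asdict(entry))
```

A list of such records for the same model and benchmark at different budgets is already a performance-versus-compute curve.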

If you've followed me for a while, this whole article might seem like nothing new. We've known since the o1 announcement in September 2024 that the performance of reasoning models scales with more inference compute.

And yet, nearly two years later, frontier AI labs still commonly report single-number benchmark results for their new model releases; AI safety orgs are still surprised when a scaffold achieves better performance by using 100x the inference budget; and Preparedness Frameworks and RSPs still often ignore inference compute usage when determining whether a model reaches a critical capability level.

The most recent models are able to leverage test-time compute better than ever, pushing the performance plateau even farther out. If this trend continues, which I fully expect, benchmark scores that don’t account for inference compute usage will become less informative each model release cycle. For this reason, it is time to treat inference budget as a first-class part of both capability measurement and safety policy.
