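🧠 阿头学 · 💬 Discussion topic

Anthropic argues that AI has begun to accelerate its own R&D, but a key gap still separates this from true recursive self-improvement

The article's best-supported claim is that "AI is rapidly absorbing the execution-layer work of AI R&D." Extrapolating that trend directly to "recursive self-improvement is near" remains under-evidenced, and the piece plainly carries Anthropic's capability showcase and governance agenda.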

2026-06-05

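Key points

  • Execution-layer automation is already a fact. The internal and external evidence Anthropic presents is largely enough to support a strong claim: code generation, debugging, experiment execution, and research assistance are no longer "human-led, AI-assisted" but are shifting toward "AI writes, humans review."
  • The real change is that the organizational bottleneck has moved upstream. The most valuable thing in the piece is not a headline figure like "80% of code is written by Claude" but the shift of the bottleneck from "who does the work" to "what to do, how to review it, and where to stop." Future organizational competitiveness will depend more on judgment, review bandwidth, and process design than on raw execution capacity.
  • The argument for recursive self-improvement still does not cross the gap. The article shows that AI is getting better at well-specified tasks and can even suggest good next steps on narrow open problems. It does not show that AI can autonomously choose high-value directions, keep producing paradigm-level breakthroughs, or reliably build a stronger successor. The authors extend a trend; they do not close the argument.
  • The internal data is strong, and so is the risk of self-validation. Anthropic's internal samples, attribution methodology, model-based judges, and employees' subjective estimates form a persuasive but non-independent body of evidence. The data can indicate direction, but it should not be accepted without reservation as an industry-wide conclusion.
  • This is not a pure research article but a "capability showcase plus policy positioning." Anthropic is showing its lead in agent-driven R&D while promoting a governance framework of "slow down or pause when necessary." The dual purpose is not hypocritical, but it should be seen clearly: the piece both reports reality and shapes public opinion and policy boundaries.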

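How this relates to us

  • What it means for ATou, and what to do next. For ATou, the update to make is not "can AI write code?" but "which parts of the organization can already be handed a goal without a process?" The next step is to split current work into three layers (execution, approach, direction) and fully hand the high-frequency, verifiable, clearly closed-loop execution layer to agents first.
  • What it means for Neta, and what to do next. For Neta, the article shows that judgment is becoming a scarcer asset than output. The next step is not to chase more content or more experiments but to build a high-quality judgment mechanism for how to choose topics, how to accept results, and when to call a halt.
  • What it means for Uota, and what to do next. For Uota, the article reinforces a reality: the market value of many "professional skills" will be compressed, while taste, interpretation, relationships, trust, and adjudication will become more expensive. The next step is to shift the discussion from "whom will AI replace?" to "which human capabilities become scarcer because AI is everywhere?"
  • What it means for investment, and what to do next. The most direct investment implication is not to chase the "AGI narrative" but to revalue the infrastructure, review tools, and workflow systems that can absorb automated R&D execution. Two kinds of opportunity deserve attention: agent execution platforms, and tools for the new bottlenecks in review, verification, security, and governance.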

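Discussion prompts

  1. If AI has already taken over much of the execution layer, is "research taste" really the scarcest and hardest-to-copy capability in future organizations?
  2. The data Anthropic presents is enough to support "compounding efficiency gains," but is it really enough to support "recursive self-improvement is near"?
  3. As humans do less hands-on work and more review and adjudication, could skill atrophy become a new systemic risk?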

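For most of AI's history, humans drove every step in its development cycle. But at Anthropic, we are delegating a growing share of AI development to AI systems themselves, which is speeding up our work.

Taken far enough, and given enough compute, that trend points to an AI system capable of fully autonomously designing and developing its own successor. This is called _recursive self-improvement_. We are not there yet, and recursive self-improvement is not inevitable. But it could come sooner than most institutions are prepared for.

Using public benchmarks and previously unreported data from within Anthropic, The Anthropic Institute is showing that AI is already accelerating the development of AI systems. To take just one example: today, Anthropic engineers on average ship 8x as much code per quarter as they did from 2021-2025.

The technical trends discussed in this piece suggest that AI systems are going to become much more capable in coming years. These trends have huge implications. AI that can build itself would be a major development in the history of technology, one that could bring enormous good for the world in science, healthcare, and beyond. But full recursive self-improvement also might increase the risks of humans losing control over AI systems. If systems are capable of fully building their own successors, the ways we secure them, monitor them, and shape their behavior all grow much more important.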

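2021–2023

Building the first Claude

In the early days, work at Anthropic looked like work at any other tech company: people writing code and docs on laptops.

2023–2025

Chatbots

People used early chatbots to help with parts of the process, like generating short code snippets and copying the output into text editors.

2025–2026

Coding agents

As the agents became more capable, they were able to write and edit code on their own, sometimes entire files.

Today

Autonomous agents

Agents can now run code themselves and delegate hours of work to other agents.

20XX?

Closing the loop

In the future, agents could become capable enough to build and train models themselves. If this happens, future versions of Claude could be continuously improved by Claude itself.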

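Evidence from the outside world

The rate at which AI models improve is accelerating. The length of tasks that they can reliably complete on their own has been doubling roughly every four months, up from an earlier trend of doubling every seven months. In March 2024, Claude Opus 3 could complete software tasks that take humans about four minutes to complete. A year later, Claude Sonnet 3.7 managed tasks that took about an hour and a half. A year after that, Claude Opus 4.6 managed 12-hour tasks.[1] If this trend holds, tasks that take a skilled person days could come into range this year. In 2027, AI systems could be capable of tasks that take a person weeks.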
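
As a rough check on the doubling-time claim, the three task horizons quoted above can be converted into an implied doubling time. This is a minimal sketch, assuming smooth exponential growth between consecutive data points and treating each "year" as exactly 12 months; `doubling_time_months` is an illustrative helper, not something taken from the source or from METR.

```python
import math

def doubling_time_months(start_minutes: float, end_minutes: float,
                         elapsed_months: float) -> float:
    """Months per doubling, assuming exponential growth between two points."""
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_months / doublings

# Task horizons quoted in the text, in minutes of human working time.
opus_3 = 4           # March 2024
sonnet_3_7 = 90      # about a year later (an hour and a half)
opus_4_6 = 12 * 60   # about a year after that (12 hours)

first_year = doubling_time_months(opus_3, sonnet_3_7, 12)
second_year = doubling_time_months(sonnet_3_7, opus_4_6, 12)
print(f"first interval:  {first_year:.1f} months per doubling")
print(f"second interval: {second_year:.1f} months per doubling")
```

Under those assumptions the Sonnet 3.7 to Opus 4.6 interval works out to 4.0 months per doubling, in line with the headline figure, while the first interval comes out shorter. With only three rounded data points, neither number should be read as precise.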

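The same pattern appears on coding and research benchmarks. Benchmarks measure the performance of models in a given domain, and they're "saturated" when models achieve close to 100% performance.[2] SWE-bench is a standard test of real-world software engineering: it hands a model an actual open-source codebase and a real bug report, and asks it to write a code change that fixes the issue and passes the project's own tests. Models have gone from scoring in the low single digits to saturating the benchmark in two years.

CORE-Bench tests whether a model can reproduce existing research, a prerequisite for conducting original research. It gives an AI model the code and data behind a published paper, and asks it to rerun everything and confirm it can replicate the paper's results. AI systems went from succeeding at reproducing the results roughly 20% of the time in 2024 to saturating the benchmark fifteen months later. METR, which runs the benchmark measuring how well models can complete long-duration tasks, found that Claude Mythos Preview could work for "at least" 16 hours and was "at the upper end of what [METR] can measure without new tasks."

Public benchmarks say a lot about the capabilities of these systems. But they can't reveal the impact AI systems are having on speeding up AI development itself. For that, we need direct evidence from within AI companies like Anthropic.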

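Evidence from within Anthropic

Building a frontier model takes two broad categories of work. There is _engineering_: writing the code, standing up the infrastructure, and overseeing the model training. And there is _research_: deciding what experiments to run, interpreting what comes back, and figuring out which ideas to try next.

Across both engineering and research, the picture is consistent. In engineering, Claude can be handed an underspecified problem and figure out how to solve it; humans supply the goal, but they no longer need to supply the method. In research, Claude can already match or outperform skilled humans at executing a well-specified experiment. However, large performance gaps persist when it comes to Claude exercising judgement in choosing goals in both engineering and research. That's the gap between AI today and a future system that could autonomously design its own successor.

It's common for employees at Anthropic to receive more open-ended and important tasks as they gain more experience. Early on, they execute a task someone else specified, like, "The export button isn't working, please fix it." With experience, they're handed a goal and design the approach themselves, such as, "Investigate why the network slows down under heavy load." At the most senior levels, they are deciding which problems are worth working on at all: "What should the team build next quarter?" We can use internal Anthropic data to see how far Claude has come in being able to handle these different kinds of tasks.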

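Claude writes a significant proportion of Anthropic's code. As of May 2026, more than 80% of the code we merge into Anthropic's codebase was authored by Claude.[3] Before Claude Code launched in research preview in February 2025, this number was in the low single digits. That shift also shows up in the amount of output per engineer. Lines of code merged per engineer per day stayed constant through Anthropic's first four years (2021-2024), then began to climb in 2025 when Claude began to run code rather than just suggesting it for an engineer to copy and paste. The slope steepened again in 2026 when models began to work autonomously over longer time horizons. These two inflection points are shown in the chart below. In the second quarter of 2026, the typical engineer was merging 8× as much code per day as they were in 2024.[4] This is because much of the code is written by Claude, with the engineer directing and reviewing, rather than typing it themselves.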

Image 1: Bar graph showing code contributed per person, per quarter, starting in Q2 2021 and ending in Q2 2026. The graph notes the release dates of eight different models: Claude 1, Claude 2, Claude 3, Claude 4, Claude Code, Claude Sonnet 4.5, Claude Opus 4.5, Claude Mythos Preview (internal access), and Claude Mythos Preview.

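A caveat: lines of code is an imperfect measure, as it measures quantity over quality. So 8× lines of code per engineer per day in the second quarter of 2026 is almost certainly an overstatement of the true productivity gain. Nonetheless, it indicates an acceleration. At Anthropic, we don't reward people for how many lines of code they write; rather, team members are producing more code simply because they're using AI systems to write more code.

The increase in lines of code written lines up with subjective impressions of large productivity increases. In a March 2026 poll of 130 employees from across Anthropic research teams, the median respondent estimated that they produced around 4x as much output with Mythos Preview as they would have without access to any AI models, on the kinds of projects they would have been working on regardless.[5] We expect that the true degree of uplift in March was somewhat lower.[6] Nevertheless, we find the overall claim plausible, and in line with our other observations: a significant fraction of Anthropic technical staff is accomplishing their core work multiple times faster than they could without AI assistance.

We also see evidence that people at Anthropic are using Claude to do work that simply wouldn't have happened otherwise, like building exploratory tooling and addressing long-deferred cleanup. For example, in April 2026, Claude shipped over 800 fixes that reduced a class of API errors by a factor of one thousand. The engineer overseeing Claude estimated that a human would have taken four years to complete this work; solving other people's bugs is slow and painstaking, and humans struggle to hold that much unfamiliar context in their head at once.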

I started leaning hard into Claudifying about a year ago. That's been a crazy adventure and it's now been ~5 months since I last wrote any code myself.

Anthropic employee*

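The code that Claude writes is "good" and improving. "Good code" means two things: it works, and it is written in a manner that allows another engineer to understand it and build upon it. On the first criterion, the evidence is clear. The rate at which Anthropic staff correct, redirect, or take over mid-task from Claude has been falling steadily for a year, including on the most complex and open-ended tasks. This means problems with no clear specification, where the engineer isn't sure what the answer looks like. This is evident in Claude's success rate over time on tasks of different difficulties, as shown in the graph below. Claude writes code that works.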

Image 2: Line graph showing the Claude Code session success rate on four different types of tasks—trivial tasks, routine tasks, substantial tasks, and open-ended problems—with six different models: Claude Sonnet 4.5, Claude Opus 4.5, Claude Opus 4.6, Mythos Preview (internal access), Mythos Preview, and Claude Opus 4.7.

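How to read this: Session success is determined by a Claude judge; a session is deemed successful if the Claude Code agent clearly succeeded at the user's tasks without requiring corrections. Changes in workloads can lead to short-term fluctuations in success rates.

On the most open-ended tasks, Claude's success rate reached 76% in May 2026, up 50 percentage points in six months. To give an example of tasks in this difficulty tier, a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than a short text description and cluster access. Working through the running jobs and testing one environment setting at a time, Claude isolated the single obscure debugging flag that was triggering the crash, reproduced it reliably, and confirmed a fix. In about two hours, Claude delivered what would normally be two to three days of work.

The second criterion is writing code that another engineer can understand and build on. Here the gap between humans and AI persists, but is closing fast. There isn't full consensus among staff at Anthropic, but many believe that Claude-written code was still worse in quality than human-written code at Anthropic in late 2025, and is roughly at parity today. We expect it to be better within the year.

This has changed the way that Anthropic now reviews its own code. Proposed changes to our codebase are now read by an automated Claude reviewer that looks for bugs, security flaws, and other defects before they can merge. Using this tool, we ran a retrospective analysis, and found that an automated Claude review of every change to our codebase would have caught roughly a third of the bugs behind past incidents on claude.ai before they ever reached production. The engineers who wrote that code are among the best in the world at building these systems. Claude is now catching the mistakes that they missed.

Claude-written code was somewhat worse than human-written code at Anthropic in late 2025, is roughly at parity today, and we expect it to be strictly better within the year.

Claude is good at running experiments to hit a goal that someone else has set. Every time Anthropic releases a model, we run the same test: we give Claude some code that trains a small AI model, and ask it to make that code run as fast as possible while still passing the same correctness checks. The goal and the success metrics are fixed in advance, so Claude's job is to find speedups by rewriting the code, running it, timing it, and repeating. It's a miniature version of an experimental research loop. In May 2025, Claude Opus 4 averaged a ~3x speedup over the starting code. By April 2026, Claude Mythos Preview was achieving ~52x. For calibration, a skilled human researcher would need four to eight hours to reach 4x.[7] In this part of the research workflow, optimizing steps within a clearly defined experiment, Claude has gone from super helpful to superhuman in under a year.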
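
The rewrite, run, time, repeat loop described above can be pictured as a small benchmarking harness. The sketch below is a toy under stated assumptions, not Anthropic's actual test: `best_time`, `measure_speedup`, and the two sum-of-squares functions are hypothetical stand-ins for the real training code and its correctness checks.

```python
import time
from typing import Callable, Sequence

Workload = Callable[[Sequence[int]], float]

def best_time(fn: Workload, data: Sequence[int], repeats: int = 5) -> float:
    """Fastest wall-clock time over several runs, which dampens timing noise."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        timings.append(time.perf_counter() - start)
    return min(timings)

def measure_speedup(baseline: Workload, candidate: Workload,
                    data: Sequence[int], repeats: int = 5) -> float:
    """Return baseline time divided by candidate time.

    A candidate that changes the result is rejected outright, mirroring the
    fixed correctness check described in the text.
    """
    if baseline(data) != candidate(data):
        raise ValueError("candidate fails the correctness check")
    return best_time(baseline, data, repeats) / best_time(candidate, data, repeats)

# Hypothetical "starting code": an explicit Python loop.
def loop_sum_of_squares(xs: Sequence[int]) -> float:
    total = 0
    for x in xs:
        total += x * x
    return float(total)

# Hypothetical "rewritten code": the same result via the built-in sum().
def builtin_sum_of_squares(xs: Sequence[int]) -> float:
    return float(sum(x * x for x in xs))

data = list(range(100_000))
speedup = measure_speedup(loop_sum_of_squares, builtin_sum_of_squares, data)
print(f"speedup: {speedup:.2f}x")
```

The measured ratio depends on the machine and the interpreter, so no particular value should be expected here. The point of the design is that the goal and the success metric are fixed before any rewriting starts.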

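The shape of stuff today is roughly "humans have ideas, and the models are able to implement, test and evaluate them an [order of magnitude] faster than before."

Claude is getting better at proposing its own experiments. In April 2026, Anthropic published the first demonstration of Claude running an open-ended research project end to end. Claude-powered agents were given an open problem in AI safety (roughly, can a weaker model reliably supervise a stronger one?) and were left to solve it. This involved proposing hypotheses, testing them, sharing findings with parallel agents, and iterating. The task has a clear performance "floor" and "ceiling": the floor is how well the weak supervisor would do on its own; the ceiling is how the strong model does when trained on correct answers. Two human researchers, over about a week, recovered roughly 23% of that gap; the agents recovered 97% over 800 cumulative hours and used roughly $18,000 in compute. There are some caveats to this work: the result didn't transfer cleanly to production-scale models, and humans still chose the problem and created the scoring rubric. But within those bounds, the agents designed every experiment themselves. Direction-setting was the only meaningful role a human played.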
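
The "floor" and "ceiling" framing corresponds to a simple normalized score. The sketch below assumes the fraction of the gap recovered is computed as (score - floor) / (ceiling - floor). The accuracy values are hypothetical, chosen only so that the outputs land near the 23% and 97% quoted above, and `gap_recovered` is an illustrative name rather than anything from the source.

```python
def gap_recovered(score: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap closed by a given score."""
    if ceiling <= floor:
        raise ValueError("ceiling must be greater than floor")
    return (score - floor) / (ceiling - floor)

# Hypothetical accuracies; only the resulting percentages come from the text.
FLOOR, CEILING = 0.60, 0.90
humans = gap_recovered(0.669, FLOOR, CEILING)
agents = gap_recovered(0.891, FLOOR, CEILING)
print(f"humans: {humans:.0%}, agents: {agents:.0%}")

# From the figures in the text: roughly $18,000 over 800 cumulative agent-hours.
cost_per_agent_hour = 18_000 / 800
print(f"${cost_per_agent_hour:.2f} per agent-hour")
```

A score at the floor maps to 0 and a score at the ceiling maps to 1, which is what makes results from different tasks comparable on one scale.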

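Over a day or two, Claude did all of this with almost no help from me. If [a junior colleague] had come back with these results in the same amount of time, I would have been somewhat impressed. The future is already here.

Claude is getting better at steering research sessions toward findings. We examined real Claude Code sessions from January through March 2026 in which Anthropic researchers were working with Claude on open-ended investigative questions, such as figuring out why a training run kept crashing or why a model scored poorly on a benchmark. In each case, we identified a moment where the researcher took a detour: they pursued a direction that sent the session off course for a while before it got back on track. We then showed different versions of Claude _only_ the part of the session before the detour and asked what they would do next. Finally, another Claude that could see how the whole session turned out judged whether the AI's proposed next step or the human's actual next step was better.[8]

Because we deliberately selected these moments (129 in total) and knew in advance that the human's choice left room for improvement, this is not a like-for-like comparison between model judgment and human judgment. The value of these moments is that they form a realistic and challenging set of situations where the right next step is not obvious, and where the human's choice serves as a useful yardstick for comparing model performance over time. By this measure, Opus 4.5, our best model in November 2025, beat the human's choice 51% of the time; by April 2026, Mythos Preview had raised that to 64%. Day-to-day research is essentially a chain of such "what should I do next?" judgments, so this is a relevant indicator of whether models will be able to push investigations forward independently. We treat the result as an early signal that AI systems are getting better at the kind of judgment that AI research depends on.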
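
With only 129 hand-picked moments, the sampling uncertainty around these win rates is worth keeping in mind. The sketch below is an illustration and not part of the source analysis: it assumes the moments are independent and uses a normal approximation to the binomial. `win_rate_interval` is a hypothetical helper, and the source does not report confidence intervals.

```python
import math

def win_rate_interval(win_rate: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a win rate observed over n comparisons."""
    standard_error = math.sqrt(win_rate * (1.0 - win_rate) / n)
    return win_rate - z * standard_error, win_rate + z * standard_error

N_MOMENTS = 129  # number of selected moments, from the text
for name, rate in [("Opus 4.5", 0.51), ("Mythos Preview", 0.64)]:
    low, high = win_rate_interval(rate, N_MOMENTS)
    print(f"{name}: {rate:.0%} (roughly {low:.0%} to {high:.0%})")
```

Under these assumptions the interval for 51% includes 50%, while the interval for 64% lies above it, which fits the text's framing of the result as an early signal rather than a settled finding.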

Image 3: Bar graph with the header "Can the model pick a better next step than the human?" The bar graph shows the performance of nine different models: Claude 3 Haiku, Claude Sonnet 4, Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.5, Claude Sonnet 4.6, Claude Opus 4.6, Claude Opus 4.7, and Claude Mythos Preview.

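How to read this: The practical ceiling line measures an "ideal" answer, written by a model that can see the entire session, including how it ultimately ended.

For now, at least, humans' comparative advantage is still the ability to see the bigger picture and to think beyond the boundaries of the task at hand.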

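What might the future of work at Anthropic look like?

This evidence suggests that the human role is shrinking at every stage of the AI development process. Once human-written and AI-written code reach parity in quality, humans will stop writing code by hand entirely and only review it. But if humans cannot review code as fast as Claude produces it, human review becomes the new bottleneck in AI development. Likewise, once Claude can run experiments on its own, the question becomes which of those experiments are worth running. In short, _execution_ itself (writing code, running experiments, producing results) consumes almost no human time anymore, although it still consumes compute.

Where humans still hold a comparative advantage is research taste and judgment: which problems matter, which results can be trusted, and when a line of work has hit a dead end.

Work and life used to be something like a gift economy, held together by small favors between people. "Can you help me get this script running?" ... Each one created a small debt of gratitude and a little mutual understanding. [Claude] is faster and creates no debt, but each time is also a lost opportunity for human collaboration.

On days when everything goes well, it's tempting to feel that nothing I do matters, that everything is automated and done better and faster than I could do it. But there are also days when everything breaks and it's unclear why, and then I suddenly realize I have no idea what I've actually been doing all along.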

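What if we're wrong?

A natural objection to the evidence above is that the work still in human hands, deciding which problems to work on, is the part that matters most. Without that judgment, Claude remains a powerful assistant rather than a system that can drive AI progress on its own.

Whether today's training methods and architectures can unlock this capability is genuinely unclear. But AI progress has rarely depended on flashes of insight. Recent AI history has had such moments, such as the Transformer architecture or mixture-of-experts models, but paradigm-shifting ideas of this kind tend to arrive only once every few years. In between, most progress is incremental: we scale something up, see where it breaks, fix it, and try again. That happens to be the workflow Claude is best at today. Edison said that genius is 1% inspiration and 99% perspiration, and we are now seeing that 99% become increasingly automated. It is increasingly clear that much of the work that advances the frontier can be automated. Research progress at scale is mainly a function of tools and resources, which determine how fast you can run experiments, how many you can run at once, and how quickly you get results.

Even if we assume Claude never acquires good research taste, a conservative reading of this evidence still implies that the acceleration will keep compounding. If humans spend most of their time on the small, single-digit percentage of work that is direction-setting and hand the rest to Claude, then each engineer or researcher is effectively directing far more work than before. The evidence we see indicates that people at Anthropic are not only moving faster but also covering more ground. In practical terms, AI has already made Anthropic move much faster than it did before effective AI tools existed.

A less conservative reading is that the early, still narrow evidence of Claude's research judgment shows that this capability is itself improving. "Research taste" may simply be another capability that AI starts out bad at and gradually gets good at. We have seen the same pattern with other qualitative capabilities, such as AI systems being able to explain why a joke is funny, demonstrate theory of mind, and solve linguistic puzzles.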

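Possible futures

What happens next depends on two things: first, whether the trends continue; second, if they do, what we choose to do. We can imagine at least three future scenarios.

  1. The trends stall, but today's AI capabilities diffuse widely. This piece contains many exponential trajectories, but they may turn out to be S-curves. We may be approaching the bend, where returns to scale weaken and the curve first straightens and then flattens. The judgment that separates a competent researcher from a great one may not be something that can be obtained by scaling training inputs such as compute and data. If so, breaking through this bottleneck would require a new idea, such as a new architecture that replaces the Transformer architecture used by all current frontier models. Another possibility is that the binding constraint on AI progress is not the model but the supply chain. Advancing and diffusing the frontier may require more energy and compute than the world currently has. The limiting factor could be the pace of chip fabrication, grid expansion, or interconnect bandwidth rather than intelligence itself. Nor can we rule out an exogenous shock to the AI ecosystem that slows everything down significantly, such as a sudden reduction in compute or power supply. Either would slow progress and make it more expensive for labs to keep investing. Or there may be other obstacles to progress that we have not foreseen.

Even if model capabilities stayed at today's level forever, we would expect the world to change substantially. Project Glasswing is an early signal. In its first few weeks, Mythos Preview found more than ten thousand high- and critical-severity software vulnerabilities in some of the world's most important systems, so many that the bottleneck in cyber defense has shifted from finding vulnerabilities to patching them fast enough. Meanwhile, we are still in the early stages of today's models diffusing into the broader economy. In the future, a 100-person company will increasingly be able to do the work of what used to be a 1,000-person company, because each employee will direct a pyramid of agents.

We include this scenario for completeness, but we do not think it is likely. Every capability we can currently measure, including ones that feel "softer," such as code quality and open-ended task success rates, has so far followed the same curve. We have not yet seen that curve begin to bend. Of the three futures we consider, this one would give governments and society the most time to adapt. We worry more about the next two, because they would arrive much faster and leave almost no room for preparation.

  2. AI labs keep obtaining compounding efficiency gains. In this scenario, AI development is largely automated, but humans still set research directions and adjudicate results. Over time, organizations that use AI systems become more and more efficient, so we can expect everyone in them to gain a significant productivity multiplier. A 100-person company could do the work of a 10,000-person or even 100,000-person organization. This would transform knowledge work and government services, but it could also be put to harmful uses, from authoritarian surveillance of entire populations to manipulation tailored to each individual and influence operations run at a scale no human team could match. The role of humans at companies like Anthropic would change too. People would partner with AI systems to amplify research capacity, generate new insights, and jointly build the systems used to verify whether AI outputs can be trusted. The evidence we lay out here suggests that we are likely heading toward this scenario. But accelerating one part of a process often just moves the bottleneck elsewhere. Overall speed is ultimately limited by the parts that were not accelerated. In computing this is known as Amdahl's law, and the same logic applies to organizations. Anthropic has already run into a classic instance of Amdahl's law: as we push more and more code through the organization, human code review has become the new bottleneck.

This friction is not limited to engineering. As Anthropic employees work with highly capable models, new ideas, projects, tools, and simulations have proliferated far beyond our capacity to pursue them. The speed at which an organization identifies and fixes such bottlenecks may itself become a capability that improves over time, and it may well become the most important capability any organization has.

  3. AI systems become capable of full recursive self-improvement and begin building their own successors. If the technical trends driving capabilities continue, and AI systems develop the kind of transformative creativity that humans have, then it is plausible that AI systems will design and refine themselves. In this world, the pace of AI development would be determined entirely by the compute available to AI systems, or by the rate at which efficiency improvements in training or inference algorithms are discovered. Humans would play a much-reduced role in their development, and most human effort would likely shift to overseeing, validating, and checking an ever-growing "virtual lab" run by AI systems. We expect that systems able to automate AI research and development would also have capabilities that transfer to other scientific fields, and so would begin to trigger revolutions there as well.

In this future, whether the alignment problem ends up solved is the part we are least sure of. Models may already be aligned enough, and have good enough research taste, that they discover and implement solutions we have not yet found. They may also be wise enough to stop development when something looks wrong. Alternatively, the rare instances of misalignment in today's models could compound as models build their own successors, becoming more frequent yet harder to understand, until we lose control of them. It is also possible that we will fail to build, integrate, or validate the tools that would help us tell which trend line we are actually on.

We do not have good intuitions for what such a world would look like, because our current economy is still driven by humans and human-made tools. By its nature, a world driven by rapid recursive self-improvement could come to be dominated by these self-improving models, because their capabilities would exceed humans' across the board and would diffuse into the broader economy. If human labor is no longer competitive, it is hard to predict what the economy will look like.

Even if model development does become fully automated and recursive, we still cannot predict what this would mean for most people's daily lives. Amdahl's law applies here too. Recursive intelligence may let us quickly realize many of the benefits described in Machines of Loving Grace, at least in some domains. We expect that embodied intelligence, meaning robotics, may follow closely behind recursive intelligence and along a similar path, delivering rising returns as costs fall. More intelligence may help us build things in the physical world faster, run clinical trials of life-saving drugs more efficiently, and develop new forms of collaboration.

But achieving recursive improvement alone does not mean that industrial production, social organization, or the way markets work will change immediately. More intelligence cannot know how a drug will turn out before decades of real-world use, cannot hold an election earlier than the constitution allows, and cannot turn strangers into old friends over a weekend. For most people, the felt pace of this future will still be set by those bottlenecks, even if upstream labs are racing ahead at the speed of compute. That collision point, where recursive intelligence that builds itself ever faster meets a world made of people, relationships, and governance, is another part of this future that we cannot predict.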
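
Amdahl's law, invoked twice in the scenarios above, can be written as overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of the work that gets accelerated and s is the speedup on that fraction. The sketch below uses made-up values of p and s purely for illustration; none of them are measurements from the text, and `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only part of a process is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)

# Hypothetical case: 90% of the work (execution) is sped up, while the other
# 10% (for example, human review) runs at its original pace.
for factor in (10, 100, 1_000_000):
    print(f"{factor:>9}x on 90% of the work -> {amdahl_speedup(0.9, factor):.2f}x overall")
```

However large the factor becomes, the overall gain in this example stays below 10x, because the unaccelerated 10% sets the ceiling at 1 / (1 - p). This is the sense in which review, rather than code generation, becomes the bottleneck.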

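What should we do?

If we could effectively slow the development of this technology and give ourselves more time to deal with its enormous implications, we think that would most likely be a good thing. But if slowing down merely lets the least cautious actors catch up technologically, it could make everyone less safe. Without a global coordination mechanism, companies and governments will have to make hard decisions about safety under competitive and geopolitical pressure.

We think it would be good for the world to have the _option_ to slow or even temporarily pause frontier AI development, so that social structures and alignment research can keep pace with the technology. The Anthropic Institute will work with many others to conduct research and take action to help build the systems that would make a credible slowdown or pause possible. These systems would let frontier AI developers verify that other actors around the world have actually stopped or slowed, and that no bad actor can quietly race ahead under cover of a coordinated slowdown. If such systems existed, we expect that we would slow or temporarily pause, provided that other developers at or near the frontier verifiably did the same.

A meaningful slowdown or pause would require multiple well-resourced labs at or near the frontier, across multiple countries, to agree to the same conditions and stop. It would also require every party to be able to verify that the others had actually stopped. Because of the distinctive features of AI systems, even "detectability," the arms-control standard that sits below "verifiability," is much harder to achieve than for other technologies. Training runs are easier to hide than missile silos, their inputs are general-purpose, and the incentive to defect quietly is enormous, because whoever keeps going while others pause may inherit the lead. A credible pause mechanism must also specify what triggers the pause, under what conditions it is lifted, and who adjudicates.

In principle, none of this is impossible. The world has built verification regimes for other complex technologies, such as the INF Treaty. But those regimes took decades to build the necessary infrastructure and trust, and we do not have that long. By contrast, a unilateral pause by a single lab could be done immediately, but it would achieve far less. It would only change who is in the lead, without creating the broader deliberative process that is currently missing.

Over the coming months, we will convene discussions in which policymakers, researchers, civil society, and other AI companies can jointly address some of the questions raised here, especially around full recursive self-improvement and how to create better options for coordination and deliberation. We will publish the results. Now is the window for studying these questions together, and people outside AI companies should be part of the deliberation.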

Marina Favaro and Jack Clark co-wrote this piece, with editorial support from Santi Ruiz. Shan Carter, Romello Goodman, and Nikki Makagiansar produced the visualizations using data collected by Brian Calvert and Jun Shern Chan. Daniel Freeman, Jim Baker, Max Young, Sarah Pollack, Francesco Mosconi, Holden Karnofsky, Andy Jones, Kevin Troy, Anton Korinek, Meg Tong, Andrew Ho, Dan Altman, Drake Thomas, Jack Shen, Sasha de Marigny, and Avital Balwit provided feedback.


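  1. METR's headline metric measures the time horizon at which AI systems reach 50% reliability on a set of tasks, though the trend line looks the same at 80% reliability.
  2. Especially as benchmarks move toward more open-ended formats and harder tasks, such as olympiad-level mathematics, they tend to saturate below 100% because of errors in the question and answer sets themselves, such as ambiguous wording or problems with no solution.
  3. Anthropic leadership has publicly estimated that 90% or more of our code is written by Claude, including scripts and experimental code. The >80% figure here measures the share of lines merged to production that can be attributed to Claude. It is a more conservative measure for two reasons: our attribution pipeline has gaps, and the lines not attributed to Claude also include auto-generated code and other artifacts that were not written by hand.
  4. This surge in code volume is straining infrastructure that everyone shares. GitHub, the platform on which much of the world's software is built, saw roughly 1 billion commits over all of 2025; by mid-2026 the figure had reached 275 million per week, which at that rate comes to roughly 14 billion for the year. The company's COO has said it is working extremely hard to scale up and keep pace.
  5. For more detail on the survey methodology, see Section 2.3.5 of the Claude Opus 4.7 System Card.
  6. Many respondents may not have thought carefully about the various biases and subtleties in how the question was defined, and recent research from METR suggests that developers' estimates of AI productivity gains can be too high.
  7. How large the speedup can ultimately be depends heavily on how much room for optimization the starting code leaves, so it should not be read directly as a real-world training speedup, and the absolute multiple is not the number to anchor on. What is more informative is that this setup allows like-for-like comparisons, both between models (from ~3x to ~52x over the past year) and against skilled humans (~4x in four to eight hours on the same task).
  8. To check for judge bias, we ran the same test on a separate set of 127 moments in which the human's next step was already strong, rather than leaving room for improvement as in the original dataset. In that set, the model's suggestion was judged better only about 20% of the time.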

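  * The Anthropic employee quotes in this piece come from internal discussions and are used with permission. They reflect personal views as of May 2026 and do not represent the company's official position.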

For most of AI’s history, humans drove every step in its development cycle. But at Anthropic, we are delegating a growing share of AI development to AI systems themselves, which is speeding up our work.

在 AI 历史上的大部分时间里,它开发周期中的每一步都由人类推动。但在 Anthropic,我们正把越来越多的 AI 开发工作交给 AI 系统自己去做,这正在加快我们的工作速度。

Taken far enough, and given enough compute, that trend points to an AI system capable of fully autonomously designing and developing its own successor. This is called recursive self-improvement. We are not there yet, and recursive self-improvement is not inevitable. But it could come sooner than most institutions are prepared for.

如果这种趋势继续推进下去,并且拥有足够的算力,它将指向一种能够完全自主设计并开发自己后继版本的 AI 系统。这被称为_递归式自我改进_。我们还没有走到那一步,而且递归式自我改进也并非必然会发生。但它到来的时间,可能比大多数机构准备好的时间更早。

Using public benchmarks and previously unreported data from within Anthropic, The Anthropic Institute is showing that AI is already accelerating the development of AI systems. To take just one example: today, Anthropic engineers on average ship 8x as much code per quarter as they did from 2021-2025.

借助公开基准测试和此前未公开的 Anthropic 内部数据,Anthropic Institute 正在展示,AI 已经开始加速 AI 系统本身的开发。只举一个例子,今天 Anthropic 的工程师平均每个季度交付的代码量,是 2021 年到 2025 年期间的 8 倍。

The technical trends discussed in this piece suggest that AI systems are going to become much more capable in coming years. These trends have huge implications. AI that can build itself would be a major development in the history of technology—one that could bring enormous good for the world in science, healthcare, and beyond. But full recursive self-improvement also might increase the risks of humans losing control over AI systems. If systems are capable of fully building their own successors, the ways we secure them, monitor them, and shape their behavior all grow much more important.

本文讨论的技术趋势表明,AI 系统在未来几年会变得强大得多。这些趋势影响巨大。能够构建自身的 AI,将会是技术史上的一个重大进展,它可能在科学、医疗等领域为世界带来巨大的好处。但完全的递归式自我改进,也可能增加人类失去对 AI 系统控制的风险。如果系统真的能够完全构建自己的后继版本,那么我们如何保障它们的安全、如何监测它们、如何塑造它们的行为,就都会变得更加重要。

2021–2023

2021–2023

Building the first Claude

构建第一个 Claude

In the early days, work at Anthropic looked like work at any other tech company: people writing code and docs on laptops.

在早期,Anthropic 的工作看起来和其他科技公司并没有区别,都是人在笔记本电脑上写代码、写文档。

2023–2025

2023–2025

Chatbots

聊天机器人

People used early chatbots to help with parts of the process, like generating short code snippets and copying the output into text editors.

人们开始使用早期聊天机器人来帮助完成流程中的某些部分,比如生成简短的代码片段,再把输出复制到文本编辑器里。

2025–2026

2025–2026

Coding agents

编程智能体

As the agents became more capable, they were able to write and edit code on their own, sometimes entire files.

随着智能体能力增强,它们开始能够自己编写和编辑代码,有时甚至能完成整份文件。

Today

今天

Autonomous agents

自主智能体

Agents can now run code themselves and delegate hours of work to other agents.

智能体现在已经可以自己运行代码,并把数小时的工作分派给其他智能体。

20XX?

20XX?

Closing the loop

闭环完成

In the future, agents could become capable enough to build and train models themselves. If this happens, future versions of Claude could be continuously improved by Claude itself.

在未来,智能体可能会强大到足以自行构建并训练模型。如果这真的发生,Claude 的未来版本就可能由 Claude 自己持续改进。

Evidence from the outside world

来自外部世界的证据

The rate at which AI models improve is accelerating. The length of tasks that they can reliably complete on their own has been doubling roughly every four months, up from an earlier trend of doubling every seven months. In March 2024, Claude Opus 3 could complete software tasks that take humans about four minutes to complete. A year later, Claude Sonnet 3.7 managed tasks that took about an hour and a half. A year after that, Claude Opus 4.6 managed 12-hour tasks.1 If this trend holds, tasks that take a skilled person days could come into range this year. In 2027, AI systems could be capable of tasks that take a person weeks.

AI 模型能力提升的速度正在加快。它们能够稳定独立完成的任务时长,大约每四个月翻一倍,而更早之前的趋势是每七个月翻一倍。在 2024 年 3 月,Claude Opus 3 能完成大约需要人类四分钟的软件任务。一年后,Claude Sonnet 3.7 能处理大约需要一个半小时的任务。再过一年,Claude Opus 4.6 能处理 12 小时任务。1 如果这个趋势保持下去,那么那些需要熟练人员花上几天的任务,今年就可能进入 AI 的能力范围。到了 2027 年,AI 系统可能就能胜任需要人类花上几周的任务。

The same pattern appears on coding and research benchmarks. Benchmarks measure the performance of models in a given domain, and they’re “saturated” when models achieve close to 100% performance.2SWE-bench is a standard test of real-world software engineering: it hands a model an actual open-source codebase and a real bug report, and asks it to write a code change that fixes the issue and passes the project’s own tests. Models have gone from scoring in the low single digits to saturating the benchmark in two years.

同样的模式也出现在编程和研究基准测试上。基准测试用于衡量模型在某个领域的表现,当模型成绩接近 100% 时,就被称为“饱和”。2 SWE-bench 是一项现实世界软件工程的标准测试。它会给模型一个真实的开源代码库和一份真实的 bug 报告,要求模型写出代码修改来修复问题,并通过项目自身的测试。两年间,模型成绩从个位数低分一路上升到让这个基准几乎饱和。

CORE-Bench tests whether a model can reproduce existing research, a prerequisite for them to conduct original research. It gives an AI model the code and data behind a published paper, and asks it to rerun everything and confirm it can replicate the paper’s results. AI systems went from succeeding at reproducing the results roughly 20% of the time in 2024 to saturating the benchmark fifteen months later. METR, which runs the benchmark measuring how well models can complete long-duration tasks, found that Claude Mythos Preview could work for “at least” 16 hours and was “at the upper end of what [METR] can measure without new tasks.”

CORE-Bench 测试模型能否复现实验研究,而这正是它们进行原创研究的前提。它会把一篇已发表论文背后的代码和数据交给 AI 模型,要求它重新运行一切,并确认能否复现论文结果。2024 年时,AI 系统成功复现结果的比例大约只有 20%,十五个月后,这项基准就已经被做到了饱和。负责测量模型完成长时任务能力的 METR 发现,Claude Mythos Preview 可以工作“至少”16 小时,而且已经“达到了 [METR] 在不设计新任务的前提下所能测量的上限附近”。

Public benchmarks say a lot about the capabilities of these systems. But they can’t reveal the impact AI systems are having on speeding up AI development itself. For that, we need direct evidence from within AI companies like Anthropic.

公开基准测试能告诉我们很多关于这些系统能力的信息。但它们无法揭示 AI 系统究竟在多大程度上加快了 AI 开发本身。要回答这个问题,我们需要来自 Anthropic 这类 AI 公司内部的直接证据。

Evidence from within Anthropic

来自 Anthropic 内部的证据

Building a frontier model takes two broad categories of work. There is engineering: writing the code, standing up the infrastructure, and overseeing the model training. And there is research: deciding what experiments to run, interpreting what comes back, and figuring out which ideas to try next.

构建一个前沿模型,大体上需要两类工作。一类是_工程_,包括编写代码、搭建基础设施、监督模型训练。另一类是_研究_,包括决定做什么实验、解读实验结果、判断下一步该尝试什么想法。

Across both engineering and research, the picture is consistent. In engineering, Claude can be handed an underspecified problem and figure out how to solve it; humans supply the goal, but they no longer need to supply the method. In research, Claude can already match or outperform skilled humans at executing a well-specified experiment. However, large performance gaps persist when it comes to Claude exercising judgement in choosing goals in both engineering and research. That’s the gap between AI today and a future system that could autonomously design its own successor.

无论在工程还是研究中,整体图景都很一致。在工程上,Claude 已经可以接到一个定义并不充分的问题,然后自己想办法解决;人类给出目标,但不再需要给出方法。在研究上,Claude 在执行定义明确的实验时,已经能达到甚至超过熟练人类的水平。不过,一旦涉及在工程和研究中自己判断该选择什么目标,Claude 与人类之间仍然存在明显差距。这正是今天的 AI 与未来那种能够自主设计自己后继版本的系统之间的差距。

It’s common for employees at Anthropic to receive more open-ended and important tasks as they gain more experience. Early on, they execute a task someone else specified, like, “The export button isn’t working, please fix it.” With experience, they’re handed a goal and design the approach themselves, such as, “Investigate why the network slows down under heavy load.” At the most senior levels, they are deciding which problems are worth working on at all: “What should the team build next quarter?” We can use internal Anthropic data to see how far Claude has come in being able to handle these different kinds of tasks.

在 Anthropic,员工随着经验增长,通常会接到更开放、也更重要的任务。刚开始时,他们执行的是别人已经定义好的任务,比如,导出按钮坏了,请修好它。 有了经验以后,他们会拿到一个目标,然后自己设计方案,比如,调查为什么网络在高负载下会变慢。 到了最高级别,他们要决定的已经是哪些问题根本值得去做,比如,团队下个季度应该构建什么? 我们可以用 Anthropic 的内部数据,来看 Claude 在处理这些不同类型任务上的进展到底到了什么程度。

Claude writes a significant proportion of Anthropic’s code.As of May 2026, more than 80% of the code we merge into Anthropic’s codebase was authored by Claude.3 Before Claude Code launched in research preview in February 2025, this number was in the low single digits. That shift also shows up in the amount of output per engineer. Lines of code merged per engineer per day stayed constant through Anthropic’s first four years (2021-2024), then began to climb upward in 2025 when Claude began to run code rather than just suggesting it for an engineer to copy and paste. The slope steepened again in 2026 when models began to work autonomously over longer time horizons. These two inflection points are shown in the chart below. In the second quarter of 2026, the typical engineer was merging 8× as much code per day as they were in 2024.4 This is because much of the code is written by Claude, with the engineer directing and reviewing, rather than typing it themselves.

Claude 写出了 Anthropic 相当大一部分代码。 截至 2026 年 5 月,合并进 Anthropic 代码库的代码中,超过 80% 是由 Claude 编写的。3 在 Claude Code 于 2025 年 2 月以 research preview 形式发布之前,这个数字还只是个位数低段。这种变化也体现在每位工程师的产出上。在 Anthropic 的前四年里,也就是 2021 年到 2024 年,每位工程师每天合并的代码行数基本保持不变;到了 2025 年,Claude 开始自己运行代码,而不只是给工程师提建议让他们复制粘贴,代码产出就开始上升。到 2026 年,模型开始能在更长的时间范围内自主工作,增长斜率又一次变陡。下图标出了这两个拐点。到 2026 年第二季度,典型工程师每天合并的代码量,是 2024 年的 8 倍。4 这是因为大量代码实际上由 Claude 完成,而工程师主要负责指挥和审查,而不是亲手逐行输入。

Image 1: Bar graph showing code contributed per person, per quarter, starting in Q2 2021 and ending in Q2 2026. The graph notes the release dates of eight different models: Claude 1, Claude 2, Claude 3, Claude 4, Claude Code, Claude Sonnet 4.5, Claude Opus 4.5, Claude Mythos Preview (internal access), and Claude Mythos Preview.

A caveat: Lines of code is an imperfect measure, as it measures quantity over quality. So 8×lines of code/engineer/day in the second quarter of 2026 is almost certainly an overstatement of the true productivity gain. Nonetheless, it indicates an acceleration. At Anthropic, we don’t reward people for how many lines of code they write; rather, team members are producing more code simply because they’re using AI systems to write more code.

这里需要说明一点。代码行数并不是完美指标,因为它衡量的是数量,而不是质量。所以,2026 年第二季度_8_×代码行数/工程师/天,几乎肯定夸大了真实的生产率提升。但它仍然表明了一种加速。在 Anthropic,我们不会因为某人写了多少行代码就奖励他们。团队成员之所以产出更多代码,只是因为他们在使用 AI 系统写出更多代码。

The increase in lines of code written lines up with subjective impressions of large productivity increases. In a March 2026 poll of 130 employees from across Anthropic research teams, the median respondent estimated that they produced around 4x as much output with Mythos Preview as they would have without access to any AI models, on the kinds of projects they would have been working on regardless.5 We expect that the true degree of uplift in March was somewhat lower.6 Nevertheless, we find the overall claim plausible, and in line with our other observations: a significant fraction of Anthropic technical staff is accomplishing their core work multiple times faster than they could without AI assistance.

代码行数的增长,也和人们对生产率大幅提高的主观感受相吻合。在 2026 年 3 月一项面向 Anthropic 各研究团队 130 名员工的调查中,回答者的中位数估计是,在使用 Mythos Preview 的情况下,他们在原本就会做的那类项目上,产出大约是完全没有 AI 模型辅助时的 4 倍。5 我们认为,3 月份真实的提升幅度可能比这个数字要低一些。6 不过,我们认为整体判断是可信的,也与其他观察一致。Anthropic 相当一部分技术人员,确实正在以没有 AI 帮助时数倍的速度完成核心工作。

We also see evidence that people at Anthropic are using Claude to do work that simply wouldn’t have happened otherwise, like building exploratory tooling and addressing long-deferred cleanup. For example, in April 2026, Claude shipped over 800 fixes that reduced a class of API errors by a factor of one thousand. The engineer overseeing Claude estimated that a human would have taken four years to complete this work; solving other people’s bugs is slow and painstaking, and humans struggle to hold that much unfamiliar context in their head at once.

我们还看到一些证据表明,Anthropic 的员工正在用 Claude 去做那些本来根本不会发生的工作,比如构建探索性工具,或者处理长期被拖延的清理工作。举个例子,2026 年 4 月,Claude 交付了 800 多个修复,把某一类 API 错误降低到了原来的千分之一。负责监督 Claude 的工程师估计,一个人类要完成这项工作需要四年时间;修别人留下的 bug 既慢又磨人,而且人类很难同时在脑子里装下这么多陌生上下文。

I started leaning hard into Claudifying about a year ago. That’s been a crazy adventure and it’s now been ~5 months since I last wrote any code myself.

大约一年前开始,越来越重度地把工作 Claude 化。这段经历非常疯狂,而到现在已经差不多有五个月没亲手写过任何代码了。

Anthropic employee*

Anthropic 员工*

The code that Claude writes is “good” and improving.“Good code” means two things: it works, and it is written in a manner that allows another engineer to understand it and build upon it. On the first criterion, the evidence is clear. The rate at which Anthropic staff correct, redirect, or take over mid-task from Claude has been falling steadily for a year, including on the most complex and open-ended tasks. This means problems with no clear specification, where the engineer isn’t sure what the answer looks like. This is evident in Claude’s success rate over time on tasks of different difficulties, as shown in the graph below. Claude writes code that works.

Claude 写出来的代码是“好的”,而且还在持续变好。 “好代码”包含两层意思。一层是它能工作。另一层是它的写法能让别的工程师看懂,并在此基础上继续开发。对于第一条,证据很明确。过去一年里,Anthropic 员工在 Claude 工作中途需要纠正它、重定向它,或者接管任务的频率一直在稳步下降,哪怕是在最复杂、最开放的任务上也是如此。这里说的是那些没有清晰规格的问题,也就是工程师自己都不确定答案到底长什么样。下图展示了 Claude 在不同难度任务上的成功率随时间变化的情况。这说明 Claude 写出的代码是能工作的。

Image 2: Line graph showing the Claude Code session success rate on four different types of tasks—trivial tasks, routine tasks, substantial tasks, and open-ended problems—with six different models: Claude Sonnet 4.5, Claude Opus 4.5, Claude Opus 4.6, Mythos Preview (internal access), Mythos Preview, and Claude Opus 4.7.

How to read this:Session success is determined by a Claude judge; a session is deemed successful if the Claude Code agent clearly succeeded at the user’s tasks without requiring corrections. Changes in workloads can lead to short-term fluctuations in success rates.

如何阅读这张图: 会话成功由一个 Claude 裁判判定;如果 Claude Code 智能体在不需要纠正的情况下,清楚地完成了用户任务,那么这次会话就被视为成功。工作负载变化可能导致成功率短期波动。

On the most open-ended tasks, Claude’s success rate reached 76% in May 2026, up 50 percentage points in six months. To give an example of tasks in this difficulty tier, a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than some text content and cluster access. Working through the running jobs and testing one environment setting at a time, Claude isolated the single obscure debugging flag that was triggering the crash, reproduced it reliably, and confirmed a fix. In about two hours, Claude delivered what would normally be two to three days of work.

在最开放的任务中,Claude 的成功率在 2026 年 5 月达到了 76%,六个月内提升了 50 个百分点。举一个这个难度层级的任务例子,一次常规升级导致数以万计的训练任务崩溃。一位工程师只给了 Claude 一点文字说明和集群访问权限,就把它指向了这个在线事故。Claude 在运行中的任务间逐步排查,每次测试一个环境设置,最终定位出是某个隐蔽的调试标志触发了崩溃,还稳定复现了问题,并确认了修复方案。大约两个小时内,Claude 交付了通常需要两到三天才能完成的工作。

The second criterion is writing code that another engineer can understand and build on. Here the gap between humans and AI persists, but is closing fast. There isn’t full consensus among staff at Anthropic, but many believe that the Claude-written code was still worse in quality than human-written code at Anthropic in late 2025, and is roughly at parity today. We expect it to be better within the year.

第二条标准,是写出让其他工程师能看懂并继续开发的代码。在这一点上,人类和 AI 之间的差距仍然存在,但正在迅速缩小。Anthropic 内部对此并没有完全一致的看法,但很多人认为,2025 年底时,Claude 写的代码质量仍然不如 Anthropic 工程师写的代码,而今天大致已经到了持平水平。我们预计,它会在今年之内超过人类。

This has changed the way that Anthropic now reviews its own code. Proposed changes to our codebase are now read by an automated Claude reviewer that looks for bugs, security flaws, and other defects before it can merge. Using this tool, we ran a retrospective analysis, and found that an automated Claude review of every change to our codebase would have caught roughly a third of the bugs behind past incidents on claude.ai before they ever reached production. The engineers who wrote that code are among the best in the world at building these systems. Claude is now catching the mistakes that they missed.

这已经改变了 Anthropic 如今审查自身代码的方式。现在,我们代码库中的修改提案,在合并之前会先经过一个自动化的 Claude 审查器,专门检查 bug、安全漏洞和其他缺陷。利用这个工具,我们做了一次回顾性分析,发现如果 Anthropic 的代码库每一次变更都接受自动化 Claude 审查,那么过去 claude.ai 线上事故背后大约三分之一的 bug,本来可以在进入生产环境前就被发现。写出这些代码的工程师,本就是世界上最擅长构建这类系统的一群人。现在,Claude 已经开始捕捉他们错过的错误。

Claude-written code was somewhat worse than human-written code at Anthropic in late 2025, is roughly at parity today, and we expect it to be strictly better within the year.

2025 年底时,Claude 写的代码比 Anthropic 人类写的代码稍差一些,如今大致已经持平,我们预计它会在今年之内明确超过人类。

Claude is good at running experiments to hit a goal that someone else has set.Every time Anthropic releases a model, we run the same test: we give Claude some code that trains a small AI model, and ask it to make that code run as fast as possible while still passing the same correctness checks. The goal and the success metrics are fixed in advance, so Claude’s job is to find speedups by rewriting the code, running it, timing it, and repeating. It’s a miniature version of an experimental research loop. In May 2025, Claude Opus 4 averaged a ~3x speedup over the starting code. By April 2026, Claude Mythos Preview was achieving ~52x. For calibration, a skilled human researcher would need four to eight hours to reach 4x.7 In this part of the research workflow—optimizing steps within a clearly defined experiment—Claude has gone from super helpful to superhuman in under a year.

Claude 很擅长去执行别人设定目标的实验。 每次 Anthropic 发布新模型时,我们都会做同一个测试。我们会给 Claude 一段用于训练小型 AI 模型的代码,并要求它在仍然通过相同正确性检查的前提下,把这段代码跑得尽可能快。目标和成功指标都是预先固定的,因此 Claude 的任务就是通过重写代码、运行、计时、重复这个循环来寻找加速点。这是一个微型版的实验研究闭环。到 2025 年 5 月,Claude Opus 4 平均能把初始代码加速约 3 倍。到 2026 年 4 月,Claude Mythos Preview 已经能做到约 52 倍。作为参照,熟练的人类研究员要做到 4 倍加速,通常需要四到八个小时。⁷ 在研究流程的这一部分,也就是在定义明确的实验里优化具体步骤,Claude 只用了不到一年,就从超级有帮助变成了超越人类。
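The loop described above (rewrite, run, time, repeat, all under a fixed correctness check) can be sketched as a small harness. This is a minimal illustration only, not Anthropic's actual benchmark: the function names, the toy sum-of-squares workload, and the tolerance are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, Sequence

Workload = Callable[[Sequence[int]], int]


def best_time(fn: Workload, data: Sequence[int], repeats: int = 5) -> float:
    """Return the fastest wall-clock time over several runs of fn(data)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        timings.append(time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return max(min(timings), 1e-12)


def measure_speedup(baseline: Workload, candidate: Workload,
                    data: Sequence[int], repeats: int = 5) -> Optional[float]:
    """Speedup of candidate over baseline, or None if the outputs differ."""
    if baseline(data) != candidate(data):
        return None  # a faster but wrong rewrite does not count
    return best_time(baseline, data, repeats) / best_time(candidate, data, repeats)


# Hypothetical stand-ins for "starting code" and two attempted rewrites.
def slow_sum_of_squares(values: Sequence[int]) -> int:
    total = 0
    for v in values:
        total += v * v
    return total


def fast_sum_of_squares(values: Sequence[int]) -> int:
    return sum(v * v for v in values)


def wrong_sum_of_squares(values: Sequence[int]) -> int:
    return sum(values)  # fast, but fails the correctness check
```

Calling `measure_speedup(slow_sum_of_squares, fast_sum_of_squares, list(range(1000)))` returns a timing ratio that varies by machine, while the `wrong_sum_of_squares` rewrite is rejected with `None`. As footnote 7 notes, the size of any such ratio depends heavily on how much slack the starting code leaves.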

The shape of stuff today is roughly ‘humans have ideas, and the models are able to implement, test and evaluate them an [order of magnitude] faster than before.’

今天的形态大致是,人类提出想法,而模型能以比以前快一个数量级的速度把它们实现、测试和评估出来。

Claude is getting better at proposing its own experiments. In April 2026, Anthropic published the first demonstration of Claude running an open-ended research project end to end. Claude-powered agents were given an open problem in AI safety (roughly, can a weaker model reliably supervise a stronger one?) and were left to solve it. This involved proposing hypotheses, testing them, sharing findings with parallel agents, and iterating. The task has a clear performance “floor” and “ceiling”: the floor is how well the weak supervisor would do on its own; the ceiling is how the strong model does when trained on correct answers. Two human researchers, over about a week, recovered roughly 23% of that gap; the agents recovered 97% over 800 cumulative hours and used roughly $18,000 in compute. There are some caveats to this work; the result didn’t transfer cleanly to production-scale models, and humans still chose the problem and created the scoring rubric. But within those bounds, the agents designed every experiment themselves. Direction-setting was the only meaningful role a human played.

Claude 正越来越擅长提出自己的实验。 2026 年 4 月,Anthropic 发布了第一项展示 Claude 端到端运行开放式研究项目的成果。由 Claude 驱动的智能体被交给了一个 AI 安全领域的开放问题,大致是,较弱的模型能否可靠地监督更强的模型? 然后就让它们自己去解决。这包括提出假设、做实验、与并行智能体共享发现,再不断迭代。这个任务有明确的表现“地板”和“天花板”。地板是弱监督者单独工作的表现,天花板则是强模型在基于正确答案训练后的表现。两位人类研究员花了大约一周时间,填补了这个差距的约 23%;这些智能体在累计 800 小时、约 18,000 美元算力成本下,填补了 97%。这项工作当然也有一些限制。结果并没有顺利迁移到生产规模模型上,而且问题是由人类选定的,评分标准也是人类制定的。但在这些边界之内,所有实验都是智能体自己设计的。唯一真正由人类承担的角色,就是设定方向。
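One natural way to read "recovered X% of that gap" is as a linear fraction between the floor and the ceiling. The sketch below assumes that reading; the function name and the example scores are hypothetical and are chosen only so the fractions land near the 23% and 97% quoted above.

```python
def gap_recovered(score: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap closed by `score`.

    0.0 means no better than the weak supervisor alone (the floor);
    1.0 means as good as the strong model trained on correct answers.
    """
    if ceiling <= floor:
        raise ValueError("ceiling must be strictly greater than floor")
    return (score - floor) / (ceiling - floor)


# Hypothetical accuracies, for illustration only.
FLOOR, CEILING = 0.60, 0.80
human_fraction = gap_recovered(0.646, FLOOR, CEILING)  # about 0.23
agent_fraction = gap_recovered(0.794, FLOOR, CEILING)  # about 0.97
```

Normalising this way makes results comparable across tasks with different raw score ranges, but it says nothing about whether a result transfers to larger models, which is one of the caveats the text itself raises.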

Claude did all of this with pretty minimal help from me over the course of 1-2 days. I think if [a junior colleague] came back to me with results like this in the same span of time, I would be mildly impressed. The future is now.

在一两天时间里,Claude 在几乎不需要我帮忙的情况下完成了这一切。如果是[一个初级同事]在同样时间内拿着这样的结果回来,我会有点佩服。未来已经来了。

Claude is getting better at steering research sessions towards research findings. We examined real Claude Code sessions (between January and March 2026) where Anthropic researchers were working with Claude on an open-ended investigative problem, like figuring out why a training run kept crashing, or why a model scored poorly on a benchmark. In each case, we found a moment where the researcher took a detour: they pursued a direction that sent the session sideways before it eventually got back on track. We then showed various Claude models only the work from before the session went off-course and asked what it would do next. A separate Claude that was able to see how the session eventually turned out then judged whether the AI or the human suggested the better next step.⁸

Claude 正越来越擅长把研究会话引向研究发现。 我们检查了真实的 Claude Code 会话,时间范围是 2026 年 1 月到 3 月。当时 Anthropic 的研究员正在和 Claude 一起处理开放式调查问题,比如弄清为什么某次训练总是崩溃,或者为什么模型在某个基准上分数很低。在每个案例里,我们都找到了研究员绕远路的时刻。也就是他们追了一个方向,导致会话一度偏航,后来才重新回到正轨。接着,我们把会话偏航之前的内容,_只_给不同版本的 Claude 看,并问它接下来会怎么做。随后,再由另一个能够看到整段会话最终结果的 Claude 来判断,AI 提出的下一步,还是人类当时的下一步,更好。⁸

Because we deliberately picked moments (n=129) where we know the human’s choice had room for improvement, this isn’t a like-for-like comparison between model and human judgement. What these moments give us is a set of realistic, challenging situations where the right next step is not obvious, and where the human’s choice serves as a useful yardstick to compare model performance over time. On this measure, our best model in November 2025 (Opus 4.5) beat the human choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%. The day-to-day work of research is largely a chain of these next-step decisions, making this a relevant measure of the model’s ability to eventually run an investigation of its own. We view this result as an early signal that AI systems are getting better at making the kinds of judgement calls that AI research depends on.

因为我们是刻意挑选这些时刻的,共有 129 个,而且我们事先知道人类当时的选择是有改进空间的,所以这并不是模型判断和人类判断之间完全对等的比较。这些时刻的价值在于,它们构成了一组现实而有挑战性的情境,在那里正确的下一步并不明显,而人类的选择又恰好能作为一个有用的标尺,让我们比较模型表现如何随时间变化。按这个标准,我们在 2025 年 11 月最好的模型 Opus 4.5,有 51% 的时间能胜过人类的选择;到 2026 年 4 月,Mythos Preview 这个比例上升到了 64%。研究工作的日常,本质上就是由这样一个个“下一步该怎么做”的判断串联起来的,所以这也是衡量模型未来是否能够独立推进调查的一个相关指标。我们把这个结果看作一个早期信号,说明 AI 系统正在变得更擅长做出 AI 研究所依赖的那类判断。

Image 3: Bar graph with the header "Can the model pick a better next step than the human?" The bar graph shows the performance of nine different models: Claude 3 Haiku, Claude Sonnet 4, Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.5, Claude Sonnet 4.6, Claude Opus 4.6, Claude Opus 4.7, and Claude Mythos Preview.

How to read this: The practical ceiling line measures an "ideal" answer written by a model that could see the whole session (including how it ended).

如何阅读这张图: 实际天花板线衡量的是一种“理想”答案,这个答案由一个能够看到整段会话全貌,包括它最终如何结束的模型写出。

The comparative advantage of humans as of right now is still in seeing the bigger picture and thinking beyond the confines of the immediate task.

至少在当下,人类的相对优势仍然在于能够看到更大的图景,并能跳出眼前任务的边界去思考。

What might the future of work at Anthropic look like?

Anthropic 的未来工作方式可能会是什么样?

The evidence suggests that the human role is narrowing at each step in the AI development process. Once human- and AI-authored code quality reach parity, humans will stop writing code entirely, and shift to only reviewing it. But if they can’t review code as quickly as Claude can generate it, human review will become the bottleneck to AI development. Similarly, once Claude can run experiments, the question shifts towards “Which of these experiments is worth running?” Put simply: the doing (i.e., writing the code, running the experiment, producing the result) now costs almost nothing in human time, even if it still has costs in compute.

这些证据表明,在 AI 开发流程的每一个环节里,人类承担的角色都在收缩。一旦人类写的代码与 AI 写的代码在质量上达到持平,人类就会彻底停止亲手写代码,只负责审查。但如果人类审查代码的速度赶不上 Claude 生成代码的速度,那么人类审查就会变成 AI 开发的新瓶颈。同样,一旦 Claude 能独立跑实验,问题就会转向,这些实验里哪些值得做。简单说,_执行_本身,也就是写代码、跑实验、产出结果,几乎已经不再消耗人类时间,尽管它仍然会消耗算力。

An area of human comparative advantage, for now, is research taste and judgment, including choosing which problems matter, which results to trust, and when an approach is a dead end.

目前人类仍有相对优势的领域,是研究品味和判断力,包括哪些问题重要、哪些结果可信、什么时候一条路线已经走死。

Work (and life) ran on a gift economy of small favors between humans. ‘Can you help me get this script running?’ [...] each one created a little debt, a little mutual awareness. [Claude is] faster, it creates zero debt, but each of these is a lost bid for human collaboration.

工作和生活过去像是一种由人与人之间小帮忙支撑起来的礼物经济。能帮我把这个脚本跑起来吗?……每一次都会形成一点点人情债,也会带来一点点相互理解。[Claude] 更快,不会形成任何人情债,但每一次也都意味着一次人类协作机会的流失。

On days where everything works well, I can’t help but think nothing I do matters, everything is automated and better and faster than I ever will be. But then there are days where everything breaks and I don't understand why and I realize I have no idea what I’ve been up to anymore.

在一切顺利的日子里,会忍不住觉得,自己做什么都不重要,所有东西都被自动化了,而且比自己更好、更快。但也有那种一切都坏掉、又搞不清为什么的日子,这时就会突然意识到,已经完全不知道自己到底一直在做什么了。

What if we’re wrong?

如果我们错了呢?

A natural objection to the evidence presented above is that the work that is still in human hands—choosing which problems to work on—is what matters most. Without that judgment, Claude is a capable assistant, but not a system that could drive AI progress on its own.

对于上面这些证据,一个很自然的反对意见是,那些仍然掌握在人类手中的工作,也就是决定该做哪些问题,才是最重要的。没有这种判断力,Claude 仍然只是一个强大的助手,而不是一个能够独自推动 AI 进展的系统。

It is genuinely unclear whether today’s training methods and architectures could unlock that capacity. But AI is rarely advanced by “eureka!” moments. There have been a few of these in AI’s recent history, like the Transformer architecture, or mixture-of-experts models, but paradigm-shifting ideas arrive years apart. In between, most progress is incremental: we scale something up, see what breaks, fix it, and try again. That is exactly the kind of workflow Claude now excels at. Edison said that genius is 1% inspiration and 99% perspiration. But we see perspiration becoming increasingly automated. It’s becoming clear that much of what advances the frontier is automatable; large-scale research progress is mostly a function of tools and resources, which dictate how fast you can run experiments, how many you can run at once, and how quickly you can get results.

今天的训练方法和架构究竟能不能解锁这种能力,确实并不清楚。但 AI 的进步,很少靠的是“灵光一现”。近些年 AI 历史里当然有过这样的时刻,比如 Transformer 架构,或者混合专家模型,但这类改变范式的想法往往几年才出现一次。在这中间,大多数进展都是渐进式的。我们把某样东西放大,看看哪里会坏,修好它,再试一次。而这恰好就是 Claude 现在最擅长的工作流。爱迪生说,天才是 1% 的灵感和 99% 的汗水。而现在,我们看到那 99% 的汗水正在越来越自动化。越来越清楚的一点是,推动前沿进步的大量工作其实是可以自动化的。大规模研究进展,主要是工具和资源的函数,它们决定你能多快做实验,能同时做多少实验,又能多快拿到结果。

Even if we suppose that Claude never achieves good research taste, a conservative reading of our evidence still implies compounding acceleration. If humans spend most of their time on the single-digit fraction of work that is direction-setting, while Claude handles the rest, that means each engineer or researcher is steering far more work than before. The evidence we see suggests that people at Anthropic are both moving faster and covering a broader surface. In practice, this means that AI already makes Anthropic move much faster than it did before the advent of effective AI tools.

就算我们假设 Claude 永远都得不到好的研究品味,对这些证据的一种保守解读,仍然意味着加速会不断复合。如果人类把大部分时间都花在那一小部分个位数比例的方向设定工作上,而剩下的事情都交给 Claude,那么每位工程师或研究员实际上都在调度比以前多得多的工作。我们看到的证据表明,Anthropic 的人不仅推进速度更快,而且覆盖的范围也更广。落到实际层面,就是 AI 已经让 Anthropic 的前进速度远远快于有效 AI 工具出现之前。

The less conservative reading is that the early evidence on Claude’s improving research judgment—narrow as it is today—is an indicator that this capability is improving as well. “Research taste” might be just another AI capability that AI systems fail at for a time, then get good at. We’ve seen a similar pattern with other qualitative skills, like AI systems being able to explain why a joke is funny, demonstrate theory of mind, and solve linguistic riddles.

一种不那么保守的解读是,Claude 在研究判断力上那些虽然还很窄、但已经出现的早期证据,正说明这种能力本身也在提升。所谓“研究品味”,也许只不过是另一种 AI 起初做不好、后来逐渐做好的能力。类似的模式,我们已经在其他定性能力上见过,比如 AI 系统能够解释为什么一个笑话好笑,能够展示心智理论,能够解出语言谜题。

Possible futures

可能的未来

What happens next depends on two things: whether the trend continues, and what we choose to do if it does. We can imagine at least three future scenarios:

接下来会发生什么,取决于两件事。第一,趋势会不会延续。第二,如果延续,我们选择怎么做。我们至少可以想象三种未来场景。

  1. The trend stalls, but today’s AI capabilities are widely diffused. This article features many exponential trajectories. But these trajectories may actually turn out to be S-curves. We may be approaching the bend in the curve, where returns to scale diminish and the line straightens, then flattens. The judgment that separates a competent researcher from a great one might be a capability that cannot come from scaling up training inputs like compute and data. If so, getting past this bottleneck would require a new idea, like an architectural approach that supplants the Transformer architecture that all current frontier models use. Alternately, the binding constraint to AI progress could be in the supply chain, not the model: advancing and diffusing the frontier may require more energy and compute than presently exists. The pace of chip fabrication, grid expansion, or interconnect bandwidth may be the constraint, rather than intelligence itself. We also cannot rule out an exogenous shock to the AI ecosystem that dramatically slows things, like a sudden diminishment in the supply of compute or electricity, either of which would slow progress and make forward investment by labs more expensive. Or we may not be anticipating some other barrier to progress.
  1. 趋势停滞,但今天的 AI 能力被广泛扩散。 本文中出现了许多指数型轨迹。但这些轨迹也可能最终只是 S 曲线。我们也许正在接近曲线弯折的地方,在那里规模收益会减弱,曲线先变直,再变平。把一个合格研究员和一个伟大研究员区分开的那种判断力,可能并不是靠扩大训练输入,比如算力和数据,就能得到的能力。如果真是这样,那么要突破这个瓶颈,就需要一个新想法,比如一种能够取代当前所有前沿模型都在用的 Transformer 架构的新架构方案。 另一种可能是,限制 AI 进展的关键约束不在模型,而在供应链。推进并扩散前沿,所需的能源和算力,也许比现有世界所拥有的还要多。限制因素可能是芯片制造速度、电网扩张速度,或者互连带宽,而不是智能本身。我们也不能排除 AI 生态会遭遇外生冲击,从而显著放缓一切,比如算力或电力供应突然减少。这两者都会拖慢进展,并使实验室继续向前投入的成本变得更高。又或者,我们还没有预见到其他阻碍进步的障碍。

Even if model capabilities were frozen at today’s level, we would expect major changes to occur in the world. Project Glasswing is one early sign: in its first weeks, Mythos Preview found more than ten thousand high- and critical-severity software vulnerabilities across the world’s most important systems—enough that the bottleneck in cyber defense has already shifted from finding vulnerabilities to patching them fast enough. And we are still early in the diffusion of today’s models into the wider economy, where a 100-person company can increasingly do the work of a 1,000-person one, because each employee will sit atop a pyramid of agents.

即便模型能力永远停留在今天的水平,我们也预期世界会发生重大变化。Project Glasswing 就是一个早期信号。在最初几周里,Mythos Preview 在全球最重要的一些系统中发现了超过一万个高危和严重级别的软件漏洞,多到网络防御的瓶颈已经从发现漏洞,转移到了能不能足够快地修补漏洞。与此同时,我们仍处在今天这些模型向更广泛经济体系扩散的早期阶段。未来,一家 100 人公司越来越可能完成过去 1,000 人公司的工作,因为每个员工头上都将叠着一座由智能体构成的金字塔。

We include this scenario for completeness, but we don’t believe it’s likely. Every capability we can measure, including those that feel “squishier,” like quality of code and success on open-ended tasks, has so far followed the same curve. We have not yet seen that curve bend. Of the three futures we consider, this one would give governments and societies the most time to adapt. We are more worried about the next two, which would move faster and leave far less room for preparation.

出于完整性考虑,我们把这个场景列了出来,但我们并不认为它很可能发生。我们目前能测量到的每一种能力,包括那些感觉更“软”的能力,比如代码质量和开放式任务成功率,到现在为止都沿着同一条曲线在走。我们还没有看到那条曲线开始弯折。在我们考虑的三种未来里,这一种会给政府和社会最多的适应时间。我们更担心后面两种,因为它们会快得多,也几乎不给准备留下空间。

  2. AI labs continue to see compounding efficiency gains. In this scenario, AI development becomes substantially automated, but humans continue to set research directions and judge results. Organizations that use AI systems would become much more efficient as time goes on, so we could expect to see significant productivity multipliers on each person in this organization. 100-person companies could do the work of 10,000- or 100,000-person organizations. This would revolutionize knowledge work and government services, but could also be turned to harmful ends, from authoritarian surveillance of whole populations to influence operations that tailor manipulation to each individual and run at a scale no human team could match. The role of humans at companies like Anthropic would shift. People would partner with AI systems to scale up research and generate new insights, and together they would build the systems needed to verify that AI outputs can be trusted. The evidence we’ve laid out here suggests that we’re likely heading into this scenario. But speeding up one part of a process often just shifts the bottleneck elsewhere: overall pace is capped by the parts that haven’t sped up. In computing, this is known as Amdahl’s law, and the same logic can apply to organizations. Anthropic has already encountered one signature of Amdahl’s law: as we’ve begun to push more code around the organization, human code review has become a new bottleneck.
  2. AI 实验室持续获得复合式效率提升。 在这个场景中,AI 开发会在很大程度上实现自动化,但研究方向仍由人类设定,结果仍由人类裁决。随着时间推移,使用 AI 系统的组织会变得越来越高效,因此我们可以预期,组织中的每个人都会获得显著的生产率倍增。100 人公司能完成 10,000 人,甚至 100,000 人组织的工作。这会彻底改变知识工作和政府服务,但也可能被用于有害用途,从对整个人口进行威权式监控,到为每个个体量身定制操控信息、并以任何人类团队都无法匹敌的规模运行影响行动。像 Anthropic 这样的公司里,人类的角色也会发生变化。人们会与 AI 系统结成伙伴关系,放大研究能力,生成新洞见,并共同构建那些用来验证 AI 输出是否可信的系统。 我们在这里列出的证据表明,我们很可能正在走向这个场景。但流程中某一部分的加速,往往只是把瓶颈转移到别处。整体速度最终还是受那些没有被加速的部分限制。在计算领域,这被称为Amdahl 定律,同样的逻辑也适用于组织。Anthropic 已经遇到了 Amdahl 定律的一种典型表现。随着我们开始在整个组织里推动越来越多的代码流动,人类代码审查已经变成了新的瓶颈。

We’ve also encountered this friction outside engineering. There has been an explosion of new ideas, initiatives, tools, and simulations, as a result of Anthropic employees working with highly capable models—far more than we have the capacity to pursue. The rate at which organizations can spot and fix these bottlenecks may be a skill that improves over time, and it may become the most important skill for any organization.

这种摩擦不只出现在工程领域。Anthropic 员工与高能力模型协作之后,新的想法、新的项目、新的工具和新的模拟出现了爆发式增长,远远超过了我们的推进能力。组织识别并修复这些瓶颈的速度,本身可能会成为一种会随时间提升的能力,而且很可能会变成任何组织最重要的能力。
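Amdahl's law, referenced above, has a simple closed form: if a fraction p of the work is sped up by a factor s and the rest is unchanged, the overall speedup is 1 / ((1 - p) + p / s). The sketch below applies that formula; the 90% and 10x figures in the comments are illustrative assumptions, not measurements from the article.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work runs
    `factor` times faster and the remainder runs at the old speed."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)


# Illustrative assumption: 90% of the work (say, writing code) becomes
# 10x faster, while the other 10% (say, human review) does not change.
overall = amdahl_speedup(0.9, 10.0)      # roughly 5.3x, not 10x

# Even if the accelerated part became almost free, the untouched 10%
# caps the overall gain just below 10x.
near_limit = amdahl_speedup(0.9, 1e12)
```

This is why the unaccelerated step, here human code review, ends up setting the pace: making the fast part faster yields shrinking returns until the slow part itself is addressed.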

  3. AI systems themselves become capable of full recursive self-improvement, and begin building their successors. If technical trends in advancing capabilities continue, and AI systems are able to develop the capabilities inherent to transformative human ingenuity, then it is plausible that AI systems could design and refine themselves. In this world, the pace of progress in AI development becomes determined entirely by the availability of compute (or the speed of discovering various efficiencies in algorithmic training or inference) for AI systems. Humans play a substantially diminished role in their development, likely moving most of our effort towards oversight, validation, and verification of an expanding “virtual lab” run by AI systems. We expect that systems capable of automated AI research and development would have skills that would transfer to the rest of science, allowing them to begin to revolutionize other fields.
  3. AI 系统本身具备完全递归式自我改进能力,并开始构建自己的后继版本。 如果推动能力前进的技术趋势继续下去,并且 AI 系统能够发展出那种属于人类变革性创造力的能力,那么 AI 系统设计并打磨自己的可能性就是可信的。 在这个世界里,AI 开发进展的速度将完全由 AI 系统可用的算力决定,或者说,由发现训练或推理算法效率提升的速度决定。人类在它们的发展中只扮演大幅弱化的角色,大部分精力可能都会转向对一个由 AI 系统运行、不断扩大的“虚拟实验室”进行监督、验证和核查。我们预计,能够自动进行 AI 研究与开发的系统,也会拥有可迁移到其他科学领域的能力,从而开始在其他领域引发革命。

How the alignment problem gets solved—or not—in this future is something we are least certain about. Models could prove to be sufficiently aligned and capable enough of research taste that they discover and implement novel solutions that we have not yet reached. They could also be sufficiently wise to halt development if not. Alternatively, the rare occurrences of misalignment present in today’s models could compound as the models build their successors, growing more frequent but less understood until we lose control of them. It’s possible that we can’t build, integrate, and verify the tools that we’d need to understand which trendline we are actually on.

在这个未来里,对齐问题最终会被解决,还是不会,是我们最没有把握的部分。模型可能已经足够对齐,而且研究品味也足够好,以至于它们会发现并实施那些我们至今还没找到的新方案。它们也可能足够明智,在情况不对时选择停止开发。另一种可能是,今天模型中那些少见的失调现象,会在模型构建自己后继版本的过程中不断叠加,变得更频繁,却又更难理解,直到我们失去对它们的控制。也有可能,我们根本造不出、整合不好、验证不了那些能帮助我们判断自己到底正处在哪条趋势线上的工具。

We do not have good intuitions for what this world would look like, because our economy is currently driven by humans and human-built tools. By its nature, a world driven by fast recursive self-improvement could become dominated by the self-improving model as its capabilities fully eclipse those of humans and the model proliferates across the broader economy. It is difficult to predict what the economy looks like if human labor stops being competitive.

我们对这样的世界会是什么样,并没有好的直觉,因为我们当前的经济仍然由人类和人类制造的工具驱动。按其本性,一个由快速递归式自我改进驱动的世界,可能会被这种自我改进的模型所主导,因为它的能力会全面超越人类,而且会扩散到更广泛的经济之中。如果人类劳动不再具有竞争力,那么经济会是什么样,很难预测。

Even if model development became fully automated and recursive, we can’t predict what that would mean for most humans’ daily lives. Amdahl’s law applies here as well. Recursive intelligence could lead to achieving many of the benefits outlined in Machines of Loving Grace, quickly in some domains. We expect that embodied intelligence (i.e., robotics) might quickly follow recursive intelligence, and follow a similar path of increasing returns at decreasing cost. More powerful intelligence might help us build things in the physical world more quickly, run more productive clinical trials of lifesaving drugs, and develop novel forms of coordination.

即便模型开发真的变成完全自动化和递归式的,我们仍然无法预测这对大多数人的日常生活到底意味着什么。Amdahl 定律在这里同样成立。递归式智能可能会让我们快速实现 Machines of Loving Grace 中描绘的许多好处,至少在某些领域如此。我们预计,具身智能,也就是机器人技术,可能会很快跟上递归式智能,并沿着类似路径,在成本下降的同时获得越来越高的回报。更强的智能,可能帮助我们更快地在物理世界里造东西,更高效地开展拯救生命药物的临床试验,也可能发展出新的协作形式。

But achieving recursive improvement alone does not suggest an immediate change in how industrial production occurs, societies organize, or markets function. More intelligence can’t learn what a drug does over decades of use, can’t hold elections sooner than a constitution dictates, and can’t turn a stranger into an old friend in a weekend. For most people, the felt pace of this future will still be set by the bottlenecks, even if the laboratory upstream runs at the speed of compute. That collision, where recursive intelligence building itself ever faster meets the world of humans, relationships, and governance, is another part of this future we can’t predict.

但仅仅实现递归式改进,并不意味着工业生产方式、社会组织方式或市场运作方式会立刻改变。更多的智能,不能在几十年的实际使用之前就知道一种药物最终会怎样,不能比宪法规定的时间更早举行选举,也不能在一个周末之内把陌生人变成老朋友。对大多数人来说,这个未来的体感速度,仍然会由那些瓶颈决定,哪怕上游实验室已经按算力速度狂奔。那个碰撞点,也就是不断更快构建自己的递归式智能,与人类、关系和治理构成的世界相撞的地方,也是这个未来里我们无法预测的另一部分。

What should we do?

我们该做什么?

If it were possible to effectively slow the development of this technology to give ourselves more time to deal with its immense implications, we think that would likely be a good thing. But if a slowdown simply lets the least cautious actors catch up technologically, it could leave everyone less safe. Without a global coordination mechanism, companies and governments will have to make difficult decisions about safety while under competitive and geopolitical pressures.

如果我们能够有效放慢这项技术的发展速度,给自己更多时间去处理它那些巨大的影响,我们认为这大概率是件好事。但如果减速只是让最不谨慎的参与者在技术上追上来,那反而可能让所有人都更不安全。在没有全球协调机制的情况下,公司和政府将不得不在竞争压力和地缘政治压力下,对安全问题做出艰难决定。

We believe it would be good for the world to have the option to slow or temporarily pause frontier AI development to enable societal structures and alignment research to keep up with the advance of the technology. The Anthropic Institute will conduct research—in collaboration with many others—and take actions to help build the systems that a credible slowdown or pause would require. These systems would enable frontier AI developers to verify that others globally have actually stopped or slowed, and that a bad actor could not use the auspices of a coordinated slowdown to jump ahead in secret. If such systems existed, we expect that we would slow down or temporarily pause, if other developers at or near the frontier also did so in a verifiable manner.

我们认为,世界若能拥有一种_选择权_,可以减慢甚至暂时暂停前沿 AI 的开发,好让社会结构和对齐研究跟上技术推进的速度,那会是一件好事。Anthropic Institute 将与许多其他方合作开展研究,并采取行动,帮助建立那些让可信减速或暂停成为可能的系统。这些系统将使前沿 AI 开发者能够验证,全球其他参与者是否真的已经停止或放慢,也能验证某个坏行为者无法借着协调减速的名义偷偷抢跑。如果这样的系统存在,我们预计,只要其他位于或接近前沿的开发者也以可验证方式这样做,我们就会放慢或暂时暂停。

A meaningful slowdown or pause would require multiple well-resourced labs at or near the frontier, in multiple countries, agreeing to stop under the same conditions. It would also require that each can verify that the others have actually stopped. Due to the unique characteristics of AI systems, the detectability (a lower standard than verifiability) element of this arms control problem is much more challenging than with other technologies. Training runs are far easier to conceal than missile silos, their inputs are general-purpose, and the incentive to defect quietly is enormous, because whoever continues while others pause could inherit the lead. A credible pause also has to specify what triggers it, what lifts it, and who adjudicates.

一场真正有意义的减速或暂停,需要多个资源充足、位于或接近前沿的实验室,在多个国家中,就相同条件达成一致并停止。它还要求各方都能验证其他方确实停下来了。由于 AI 系统的独特特征,这个军控问题中的“可探测性”,其标准低于“可验证性”,其实比其他技术难得多。训练任务比导弹发射井更容易隐藏,它们的输入又是通用型的,而悄悄违约的诱因极其巨大,因为当别人暂停时,谁继续前进,谁就可能继承领先位置。一个可信的暂停机制,还必须说明是什么触发暂停,什么条件下解除暂停,以及由谁来裁决。

None of this is necessarily impossible in principle—the world has built verification regimes for other complex technologies (e.g., the Intermediate-Range Nuclear Forces Treaty)—but those regimes took decades to build both the infrastructure and the trust. We don’t have that long. A unilateral pause by one lab, by contrast, is achievable immediately, but accomplishes much less: it would change who the front-runner is, but it would not create the wider deliberative process that is currently missing.

从原理上说,这一切未必不可能。世界曾为其他复杂技术建立过验证机制,比如《中导条约》。但这些机制花了几十年时间,才建立起所需的基础设施和信任。而我们没有那么长时间。相比之下,单个实验室单方面暂停,是可以立刻做到的,但它达成的效果也小得多。它只会改变谁是领跑者,却不会创造出当前真正缺失的那种更广泛的审议过程。

In the coming months, we will organize conversations where policymakers, researchers, civil society, and other AI companies can help answer some of the questions this piece raises, especially around full recursive self-improvement and how to create better options for coordination and deliberation. We’ll publish what comes out of it. The window to investigate the questions together is here, and people outside AI companies should be involved in this deliberation.

在接下来的几个月里,我们会组织一些讨论,让政策制定者、研究人员、公民社会以及其他 AI 公司,共同回答本文提出的一些问题,尤其是围绕完全递归式自我改进,以及如何创造更好的协调和审议选项。我们会把讨论结果发布出来。现在正是一起研究这些问题的窗口期,而 AI 公司之外的人,也应该参与这场审议。

Marina Favaro and Jack Clark co-authored this piece, with editorial support from Santi Ruiz. Shan Carter, Romello Goodman, and Nikki Makagiansar created the visuals from data collected by Brian Calvert and Jun Shern Chan. Daniel Freeman, Jim Baker, Max Young, Sarah Pollack, Francesco Mosconi, Holden Karnofsky, Andy Jones, Kevin Troy, Anton Korinek, Meg Tong, Andrew Ho, Dan Altman, Drake Thomas, Jack Shen, Sasha de Marigny, and Avital Balwit provided feedback.

Marina Favaro 和 Jack Clark 共同撰写了本文,Santi Ruiz 提供了编辑支持。Shan Carter、Romello Goodman 和 Nikki Makagiansar 使用 Brian Calvert 与 Jun Shern Chan 收集的数据制作了文中的可视化。Daniel Freeman、Jim Baker、Max Young、Sarah Pollack、Francesco Mosconi、Holden Karnofsky、Andy Jones、Kevin Troy、Anton Korinek、Meg Tong、Andrew Ho、Dan Altman、Drake Thomas、Jack Shen、Sasha de Marigny 和 Avital Balwit 提供了反馈。



  1. METR’s key measure tells you the time horizon over which AI systems can be 50% reliable at a basket of tasks, though the trendline looks the same at 80% reliability.
  2. Especially as they shift toward more open-ended formats and more difficult tasks (e.g., Olympiad-level mathematics), benchmarks often saturate below 100% due to errors in the question and answer sets like ambiguous problem statements and unsolvable questions.
  3. Anthropic leadership have publicly estimated that 90% or more of our code is written by Claude, including scripts and experimental code. Our >80% figure measures the share of lines merged to production that can be attributed to Claude. This is a more conservative measurement in two ways: our attribution pipeline has gaps, and the lines not attributed to Claude include auto-generated code and other artifacts that were not hand-written by humans either.
  4. This surge in code production is straining the infrastructure everyone shares. GitHub—the platform most of the world’s software is built on—saw roughly one billion code commits in all of 2025; by mid-2026 it saw 275 million a week, on pace for roughly 14 billion over the year. The company’s COO has said that it is “pushing incredibly hard” on capacity just to keep up.
  5. Additional details on the methodology of this survey are discussed in section 2.3.5 of the Claude Opus 4.7 System Card.
  6. Many respondents may not have thought carefully about how to account for various biases or subtleties in the question definition, and recent research by METR shows that developer estimates of AI productivity uplift can be overestimated.
  7. How large the speedup gets depends heavily on how much room for improvement the starting code leaves, and it should not be read as a real-world training speedup. So the absolute multiple is not the figure to anchor on here. What is more informative is the like-for-like comparison that this experimental setup makes possible, both across models (~3x to ~52x over the past year) and against a skilled human (~4x in four to eight hours on the same task).
  8. As a check on judge bias, we ran the same test on a separate set of 127 moments where the human’s next move was already strong (as opposed to the original set, where the human’s direction had room for improvement). There, the models’ suggestions were judged better only about 20% of the time.
  1. METR 的关键指标,衡量的是 AI 系统在一组任务上达到 50% 可靠性时所对应的时间跨度,不过在 80% 可靠性下,趋势线看起来也是一样的。
  2. 尤其是当基准测试转向更开放的形式和更困难的任务时,比如奥赛级数学,基准往往会在低于 100% 的地方饱和,原因在于题目和答案集中本身存在错误,比如题意模糊,或题目根本无解。
  3. Anthropic 管理层曾公开估计,我们 90% 或更多的代码都是由 Claude 编写的,包括脚本和实验性代码。我们这里的 >80% 数字,衡量的是合并到生产环境中的代码行里,有多少可以归因于 Claude。这是一种更保守的度量,原因有两点。我们的归因流水线本身存在缺口,而那些没有归因给 Claude 的代码行中,也包括自动生成的代码和其他并非人类手写的产物。
  4. 这波代码产量激增,正在挤压所有人共享的基础设施。GitHub 作为全球大部分软件构建所依赖的平台,在整个 2025 年大约看到了 10 亿次代码提交;到 2026 年年中,这个数字已经达到每周 2.75 亿次,照此速度全年大约会达到 140 亿次。该公司的 COO 表示,为了跟上这个速度,他们“正在极其拼命地”扩容。
  5. 关于这项调查方法的更多细节,可见 Claude Opus 4.7 System Card 第 2.3.5 节。
  6. 许多受访者可能并没有认真考虑如何处理问题定义中的各种偏差和细微之处,而 METR 最近的研究 表明,开发者对 AI 生产率提升的估计可能会偏高。
  7. 加速倍数最终能有多大,高度取决于初始代码留下了多少优化空间,因此不应把它直接理解为现实世界中的训练加速倍数。所以这里真正不该锚定的是那个绝对倍数。更有信息量的是,这种实验设置让我们能够做出同口径比较,无论是模型之间的比较,也就是过去一年从约 3 倍到约 52 倍,还是与熟练人类之间的比较,也就是在同样任务上用四到八小时做到约 4 倍。
  8. 为了检查裁判偏差,我们又在另一组 127 个时刻上跑了同样的测试。在这些时刻里,人类的下一步本来就已经很强,而不是像原始数据集中那样还有改进空间。在那组数据中,模型建议被判定为更好的比例只有大约 20%。
  • Quotes from Anthropic employees throughout this article are drawn from internal discussions and used with permission. They reflect individual views as of May 2026, not official company positions.
  • 本文中的 Anthropic 员工引语均来自内部讨论,并经许可使用。它们反映的是截至 2026 年 5 月的个人观点,并不代表公司的正式立场。

For most of AI’s history, humans drove every step in its development cycle. But at Anthropic, we are delegating a growing share of AI development to AI systems themselves, which is speeding up our work.

Taken far enough, and given enough compute, that trend points to an AI system capable of fully autonomously designing and developing its own successor. This is called recursive self-improvement. We are not there yet, and recursive self-improvement is not inevitable. But it could come sooner than most institutions are prepared for.

Using public benchmarks and previously unreported data from within Anthropic, The Anthropic Institute is showing that AI is already accelerating the development of AI systems. To take just one example: today, Anthropic engineers on average ship 8x as much code per quarter as they did from 2021-2025.

The technical trends discussed in this piece suggest that AI systems are going to become much more capable in coming years. These trends have huge implications. AI that can build itself would be a major development in the history of technology—one that could bring enormous good for the world in science, healthcare, and beyond. But full recursive self-improvement also might increase the risks of humans losing control over AI systems. If systems are capable of fully building their own successors, the ways we secure them, monitor them, and shape their behavior all grow much more important.

2021–2023

Building the first Claude

In the early days, work at Anthropic looked like work at any other tech company: people writing code and docs on laptops.

2023–2025

Chatbots

People used early chatbots to help with parts of the process, like generating short code snippets and copying the output into text editors.

2025–2026

Coding agents

As the agents became more capable, they were able to write and edit code on their own, sometimes entire files.

Today

Autonomous agents

Agents can now run code themselves and delegate hours of work to other agents.

20XX?

Closing the loop

In the future, agents could become capable enough to build and train models themselves. If this happens, future versions of Claude could be continuously improved by Claude itself.

Evidence from the outside world

The rate at which AI models improve is accelerating. The length of tasks that they can reliably complete on their own has been doubling roughly every four months, up from an earlier trend of doubling every seven months. In March 2024, Claude Opus 3 could complete software tasks that take humans about four minutes to complete. A year later, Claude Sonnet 3.7 managed tasks that took about an hour and a half. A year after that, Claude Opus 4.6 managed 12-hour tasks.¹ If this trend holds, tasks that take a skilled person days could come into range this year. In 2027, AI systems could be capable of tasks that take a person weeks.
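The doubling-time arithmetic behind this paragraph can be checked with a short sketch. It assumes clean exponential growth between two data points and treats the rounded figures quoted above ("about four minutes", "about an hour and a half", "12-hour") as exact, so the outputs are rough; the function names are hypothetical.

```python
import math


def doubling_time_months(start_minutes: float, end_minutes: float,
                         elapsed_months: float) -> float:
    """Doubling time implied by growth from start to end over the period."""
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_months / doublings


def projected_horizon_minutes(current_minutes: float, months_ahead: float,
                              doubling_months: float) -> float:
    """Task length reached if the horizon keeps doubling at a fixed rate."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


# Figures quoted in the paragraph above, converted to minutes.
first_year = doubling_time_months(4, 90, 12)     # 4 min -> 1.5 h
second_year = doubling_time_months(90, 720, 12)  # 1.5 h -> 12 h, i.e. 4.0 months

# Extrapolating a four-month doubling one more year from 12-hour tasks
# gives 5760 minutes, i.e. 96 hours of human task time.
one_year_on = projected_horizon_minutes(720, 12, 4)
```

The 96-hour figure is an extrapolation under a constant doubling time, which is exactly the assumption the text flags with "if this trend holds".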

The same pattern appears on coding and research benchmarks. Benchmarks measure the performance of models in a given domain, and they’re “saturated” when models achieve close to 100% performance.2 SWE-bench is a standard test of real-world software engineering: it hands a model an actual open-source codebase and a real bug report, and asks it to write a code change that fixes the issue and passes the project’s own tests. Models have gone from scoring in the low single digits to saturating the benchmark in two years.

CORE-Bench tests whether a model can reproduce existing research, a prerequisite for conducting original research. It gives an AI model the code and data behind a published paper, and asks it to rerun everything and confirm it can replicate the paper’s results. AI systems went from reproducing the results roughly 20% of the time in 2024 to saturating the benchmark fifteen months later. METR, which runs the benchmark measuring how well models can complete long-duration tasks, found that Claude Mythos Preview could work for “at least” 16 hours and was “at the upper end of what [METR] can measure without new tasks.”

Public benchmarks say a lot about the capabilities of these systems. But they can’t reveal the impact AI systems are having on speeding up AI development itself. For that, we need direct evidence from within AI companies like Anthropic.

Evidence from within Anthropic

Building a frontier model takes two broad categories of work. There is engineering: writing the code, standing up the infrastructure, and overseeing the model training. And there is research: deciding what experiments to run, interpreting what comes back, and figuring out which ideas to try next.

Across both engineering and research, the picture is consistent. In engineering, Claude can be handed an underspecified problem and figure out how to solve it; humans supply the goal, but they no longer need to supply the method. In research, Claude can already match or outperform skilled humans at executing a well-specified experiment. However, large performance gaps persist when it comes to Claude exercising judgement in choosing goals in both engineering and research. That’s the gap between AI today and a future system that could autonomously design its own successor.

It’s common for employees at Anthropic to receive more open-ended and important tasks as they gain more experience. Early on, they execute a task someone else specified, like, “The export button isn’t working, please fix it.” With experience, they’re handed a goal and design the approach themselves, such as, “Investigate why the network slows down under heavy load.” At the most senior levels, they are deciding which problems are worth working on at all: “What should the team build next quarter?” We can use internal Anthropic data to see how far Claude has come in being able to handle these different kinds of tasks.

Claude writes a significant proportion of Anthropic’s code. As of May 2026, more than 80% of the code we merge into Anthropic’s codebase was authored by Claude.3 Before Claude Code launched in research preview in February 2025, this number was in the low single digits. That shift also shows up in the amount of output per engineer. Lines of code merged per engineer per day stayed constant through Anthropic’s first four years (2021-2024), then began to climb upward in 2025 when Claude began to run code rather than just suggesting it for an engineer to copy and paste. The slope steepened again in 2026 when models began to work autonomously over longer time horizons. These two inflection points are shown in the chart below. In the second quarter of 2026, the typical engineer was merging 8× as much code per day as they were in 2024.4 This is because much of the code is written by Claude, with the engineer directing and reviewing, rather than typing it themselves.

Image 1: Bar graph showing code contributed per person, per quarter, starting in Q2 2021 and ending in Q2 2026. The graph notes the dates of nine releases: Claude 1, Claude 2, Claude 3, Claude 4, Claude Code, Claude Sonnet 4.5, Claude Opus 4.5, Claude Mythos Preview (internal access), and Claude Mythos Preview.

A caveat: lines of code is an imperfect measure, as it captures quantity rather than quality. So the 8× increase in lines of code per engineer per day in the second quarter of 2026 almost certainly overstates the true productivity gain. Nonetheless, it indicates an acceleration. At Anthropic, we don’t reward people for how many lines of code they write; rather, team members are producing more code simply because they’re using AI systems to write more code.

The increase in lines of code written lines up with subjective impressions of large productivity increases. In a March 2026 poll of 130 employees from across Anthropic research teams, the median respondent estimated that they produced around 4x as much output with Mythos Preview as they would have without access to any AI models, on the kinds of projects they would have been working on regardless.5 We expect that the true degree of uplift in March was somewhat lower.6 Nevertheless, we find the overall claim plausible, and in line with our other observations: a significant fraction of Anthropic technical staff is accomplishing their core work multiple times faster than they could without AI assistance.

We also see evidence that people at Anthropic are using Claude to do work that simply wouldn’t have happened otherwise, like building exploratory tooling and addressing long-deferred cleanup. For example, in April 2026, Claude shipped over 800 fixes that reduced a class of API errors by a factor of one thousand. The engineer overseeing Claude estimated that a human would have taken four years to complete this work; solving other people’s bugs is slow and painstaking, and humans struggle to hold that much unfamiliar context in their head at once.

I started leaning hard into Claudifying about a year ago. That’s been a crazy adventure and it’s now been ~5 months since I last wrote any code myself.

Anthropic employee*

The code that Claude writes is “good” and improving. “Good code” means two things: it works, and it is written in a manner that allows another engineer to understand it and build upon it. On the first criterion, the evidence is clear. The rate at which Anthropic staff correct, redirect, or take over mid-task from Claude has been falling steadily for a year, including on the most complex and open-ended tasks: problems with no clear specification, where the engineer isn’t sure what the answer looks like. This is evident in Claude’s success rate over time on tasks of different difficulties, as shown in the graph below. Claude writes code that works.

Image 2: Line graph showing the Claude Code session success rate on four different types of tasks—trivial tasks, routine tasks, substantial tasks, and open-ended problems—with six different models: Claude Sonnet 4.5, Claude Opus 4.5, Claude Opus 4.6, Mythos Preview (internal access), Mythos Preview, and Claude Opus 4.7.

How to read this: Session success is determined by a Claude judge; a session is deemed successful if the Claude Code agent clearly succeeded at the user’s tasks without requiring corrections. Changes in workloads can lead to short-term fluctuations in success rates.

On the most open-ended tasks, Claude’s success rate reached 76% in May 2026, up 50 percentage points in six months. To give an example of tasks in this difficulty tier, a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than some text content and cluster access. Working through the running jobs and testing one environment setting at a time, Claude isolated the single obscure debugging flag that was triggering the crash, reproduced it reliably, and confirmed a fix. In about two hours, Claude delivered what would normally be two to three days of work.

The second criterion is writing code that another engineer can understand and build on. Here the gap between humans and AI persists, but is closing fast. There isn’t full consensus among staff at Anthropic, but many believe that the Claude-written code was still worse in quality than human-written code at Anthropic in late 2025, and is roughly at parity today. We expect it to be better within the year.

This has changed the way that Anthropic now reviews its own code. Proposed changes to our codebase are now read by an automated Claude reviewer that looks for bugs, security flaws, and other defects before it can merge. Using this tool, we ran a retrospective analysis, and found that an automated Claude review of every change to our codebase would have caught roughly a third of the bugs behind past incidents on claude.ai before they ever reached production. The engineers who wrote that code are among the best in the world at building these systems. Claude is now catching the mistakes that they missed.

Claude-written code was somewhat worse than human-written code at Anthropic in late 2025, is roughly at parity today, and we expect it to be strictly better within the year.

Claude is good at running experiments to hit a goal that someone else has set. Every time Anthropic releases a model, we run the same test: we give Claude some code that trains a small AI model, and ask it to make that code run as fast as possible while still passing the same correctness checks. The goal and the success metrics are fixed in advance, so Claude’s job is to find speedups by rewriting the code, running it, timing it, and repeating. It’s a miniature version of an experimental research loop. In May 2025, Claude Opus 4 averaged a ~3x speedup over the starting code. By April 2026, Claude Mythos Preview was achieving ~52x. For calibration, a skilled human researcher would need four to eight hours to reach 4x.7 In this part of the research workflow, optimizing steps within a clearly defined experiment, Claude has gone from super helpful to superhuman in under a year.

The shape of stuff today is roughly ‘humans have ideas, and the models are able to implement, test and evaluate them an [order of magnitude] faster than before.’

Claude is getting better at proposing its own experiments. In April 2026, Anthropic published the first demonstration of Claude running an open-ended research project end to end. Claude-powered agents were given an open problem in AI safety (roughly, can a weaker model reliably supervise a stronger one?) and were left to solve it. This involved proposing hypotheses, testing them, sharing findings with parallel agents, and iterating. The task has a clear performance “floor” and “ceiling”: the floor is how well the weak supervisor would do on its own; the ceiling is how the strong model does when trained on correct answers. Two human researchers, over about a week, recovered roughly 23% of that gap; the agents recovered 97% over 800 cumulative hours and used roughly $18,000 in compute. There are some caveats to this work: the result didn’t transfer cleanly to production-scale models, and humans still chose the problem and created the scoring rubric. But within those bounds, the agents designed every experiment themselves. Direction-setting was the only meaningful role a human played.
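The "fraction of the gap recovered" figures above can be read as a simple normalization between the floor and the ceiling. The sketch below shows that reading; the article does not state the exact formula, so the function `gap_recovered` and the accuracy values are assumptions chosen only to illustrate the arithmetic, not data from the experiment.

```python
def gap_recovered(score: float, floor: float, ceiling: float) -> float:
    """Fraction of the floor-to-ceiling gap closed by `score` (0.0 = floor, 1.0 = ceiling)."""
    if ceiling <= floor:
        raise ValueError("ceiling must be strictly greater than floor")
    return (score - floor) / (ceiling - floor)


# Hypothetical accuracies, NOT taken from the experiment described in the text.
floor = 0.60     # weak supervisor on its own
ceiling = 0.90   # strong model trained on correct answers

# A score of 0.669 sits about 23% of the way up the gap; 0.891 sits about 97% up.
for label, score in (("humans (illustrative)", 0.669), ("agents (illustrative)", 0.891)):
    print(f"{label}: {gap_recovered(score, floor, ceiling):.0%} of the gap recovered")
```

Normalizing this way makes results comparable across tasks with different floors and ceilings, but it says nothing about cost, which is why the text reports hours and compute separately.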

Claude did all of this with pretty minimal help from me over the course of 1-2 days. I think if [a junior colleague] came back to me with results like this in the same span of time, I would be mildly impressed. The future is now.

Claude is getting better at steering research sessions towards research findings. We examined real Claude Code sessions (between January and March 2026) where Anthropic researchers were working with Claude on an open-ended investigative problem, like figuring out why a training run kept crashing, or why a model scored poorly on a benchmark. In each case, we found a moment where the researcher took a detour: they pursued a direction that sent the session sideways before it eventually got back on track. We then showed various Claude models only the work from before the session went off-course and asked each what it would do next. A separate Claude that was able to see how the session eventually turned out then judged whether the AI or the human suggested the better next step.8

Because we deliberately picked moments (n=129) where we know the human’s choice had room for improvement, this isn’t a like-for-like comparison between model and human judgement. What these moments give us is a set of realistic, challenging situations where the right next step is not obvious, and where the human’s choice serves as a useful yardstick to compare model performance over time. On this measure, our best model in November 2025 (Opus 4.5) beat the human choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%. The day-to-day work of research is largely a chain of these next-step decisions, making this a relevant measure of the model’s ability to eventually run an investigation of its own. We view this result as an early signal that AI systems are getting better at making the kinds of judgement calls that AI research depends on.

Image 3: Bar graph with the header "Can the model pick a better next step than the human?" The bar graph shows the performance of nine different models: Claude 3 Haiku, Claude Sonnet 4, Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.5, Claude Sonnet 4.6, Claude Opus 4.6, Claude Opus 4.7, and Claude Mythos Preview.

How to read this: The practical ceiling line measures an "ideal" answer written by a model that could see the whole session (including how it ended).

The comparative advantage of humans as of right now is still in seeing the bigger picture and thinking beyond the confines of the immediate task.

What might the future of work at Anthropic look like?

The evidence suggests that the human role is narrowing at each step in the AI development process. Once human- and AI-authored code quality reach parity, humans will stop writing code entirely, and shift to only reviewing it. But if they can’t review code as quickly as Claude can generate it, human review will become the bottleneck to AI development. Similarly, once Claude can run experiments, the question shifts towards “Which of these experiments is worth running?” Put simply: the doing (i.e., writing the code, running the experiment, producing the result) now costs almost nothing in human time, even if it still has costs in compute.

An area of human comparative advantage, for now, is research taste and judgment, including choosing which problems matter, which results to trust, and when an approach is a dead end.

Work (and life) ran on a gift economy of small favors between humans. ‘Can you help me get this script running?’ [...] each one created a little debt, a little mutual awareness. [Claude is] faster, it creates zero debt, but each of these is a lost bid for human collaboration.

On days where everything works well, I can’t help but think nothing I do matters, everything is automated and better and faster than I ever will be. But then there are days where everything breaks and I don't understand why and I realize I have no idea what I’ve been up to anymore.

What if we’re wrong?

A natural objection to the evidence presented above is that the work that is still in human hands—choosing which problems to work on—is what matters most. Without that judgment, Claude is a capable assistant, but not a system that could drive AI progress on its own.

It is genuinely unclear whether today’s training methods and architectures could unlock that capacity. But AI is rarely advanced by “eureka!” moments. There have been a few of these in AI’s recent history, like the Transformer architecture, or mixture-of-experts models, but paradigm-shifting ideas arrive years apart. In between, most progress is incremental: we scale something up, see what breaks, fix it, and try again. That is exactly the kind of workflow Claude now excels at. Edison said that genius is 1% inspiration and 99% perspiration. But we see perspiration becoming increasingly automated. It’s becoming clear that much of what advances the frontier is automatable; large-scale research progress is mostly a function of tools and resources, which dictate how fast you can run experiments, how many you can run at once, and how quickly you can get results.

Even if we suppose that Claude never achieves good research taste, a conservative reading of our evidence still implies compounding acceleration. If humans spend most of their time on the single-digit fraction of work that is direction-setting, while Claude handles the rest, that means each engineer or researcher is steering far more work than before. The evidence we see suggests that people at Anthropic are both moving faster and covering a broader surface. In practice, this means that AI already makes Anthropic move much faster than it did before the advent of effective AI tools.

The less conservative reading is that the early evidence on Claude’s improving research judgment—narrow as it is today—is an indicator that this capability is improving as well. “Research taste” might be just another AI capability that AI systems fail at for a time, then get good at. We’ve seen a similar pattern with other qualitative skills, like AI systems being able to explain why a joke is funny, demonstrate theory of mind, and solve linguistic riddles.

Possible futures

What happens next depends on two things: whether the trend continues, and what we choose to do if it does. We can imagine at least three future scenarios:

  1. The trend stalls, but today’s AI capabilities are widely diffused. This article features many exponential trajectories. But these trajectories may actually turn out to be S-curves. We may be approaching the bend in the curve, where returns to scale diminish and the line straightens, then flattens. The judgment that separates a competent researcher from a great one might be a capability that cannot come from scaling up training inputs like compute and data. If so, getting past this bottleneck would require a new idea, like an architectural approach that supplants the Transformer architecture that all current frontier models use. Alternately, the binding constraint to AI progress could be in the supply chain, not the model: advancing and diffusing the frontier may require more energy and compute than presently exists. The pace of chip fabrication, grid expansion, or interconnect bandwidth may be the constraint, rather than intelligence itself. We also cannot rule out an exogenous shock to the AI ecosystem that dramatically slows things, like a sudden diminishment in the supply of compute or electricity, either of which would slow progress and make forward investment by labs more expensive. Or we may not be anticipating some other barrier to progress.

Even if model capabilities were frozen at today’s level, we would expect major changes to occur in the world. Project Glasswing is one early sign: in its first weeks, Mythos Preview found more than ten thousand high- and critical-severity software vulnerabilities across the world’s most important systems—enough that the bottleneck in cyber defense has already shifted from finding vulnerabilities to patching them fast enough. And we are still early in the diffusion of today’s models into the wider economy, where a 100-person company can increasingly do the work of a 1,000-person one, because each employee will sit atop a pyramid of agents.

We include this scenario for completeness, but we don’t believe it’s likely. Every capability we can measure, including those that feel “squishier,” like quality of code and success on open-ended tasks, has so far followed the same curve. We have not yet seen that curve bend. Of the three futures we consider, this one would give governments and societies the most time to adapt. We are more worried about the next two, which would move faster and leave far less room for preparation.

  2. AI labs continue to see compounding efficiency gains. In this scenario, AI development becomes substantially automated, but humans continue to set research directions and judge results. Organizations that use AI systems would become much more efficient as time goes on, so we could expect to see significant productivity multipliers on each person in these organizations. 100-person companies could do the work of 10,000- or 100,000-person organizations. This would revolutionize knowledge work and government services, but could also be turned to harmful ends, from authoritarian surveillance of whole populations to influence operations that tailor manipulation to each individual and run at a scale no human team could match. The role of humans at companies like Anthropic would shift. People would partner with AI systems to scale up research and generate new insights, and together they would build the systems needed to verify that AI outputs can be trusted. The evidence we’ve laid out here suggests that we’re likely heading into this scenario. But speeding up one part of a process often just shifts the bottleneck elsewhere: overall pace is capped by the parts that haven’t sped up. In computing, this is known as Amdahl’s law, and the same logic can apply to organizations. Anthropic has already encountered one signature of Amdahl’s law: as we’ve begun to push more code around the organization, human code review has become a new bottleneck.

We’ve also encountered this friction outside engineering. There has been an explosion of new ideas, initiatives, tools, and simulations, as a result of Anthropic employees working with highly capable models—far more than we have the capacity to pursue. The rate at which organizations can spot and fix these bottlenecks may be a skill that improves over time, and it may become the most important skill for any organization.
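Amdahl's law, as invoked in this scenario, has a compact form: if a fraction p of the work is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below applies it to a hypothetical split between accelerated work (writing code) and unaccelerated work (human review). The 90/10 split and the function name `amdahl_speedup` are assumptions for illustration, not Anthropic measurements.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work gets `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Hypothetical split: 90% of the work (writing code) is accelerated,
# while the remaining 10% (human review) runs at its old speed.
for factor in (2, 10, 100, 1_000_000):
    overall = amdahl_speedup(0.9, factor)
    print(f"writing {factor:>9,}x faster -> overall {overall:.2f}x")
```

With a 10% unaccelerated share, the overall speedup approaches but never reaches 10x no matter how fast the accelerated part becomes, which is why review capacity turns into the binding constraint once code generation is cheap.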

  3. AI systems themselves become capable of full recursive self-improvement, and begin building their successors. If technical trends in advancing capabilities continue, and AI systems are able to develop the capabilities inherent to transformative human ingenuity, then it is plausible that AI systems could design and refine themselves. In this world, the pace of progress in AI development becomes determined entirely by the availability of compute (or the speed of discovering various efficiencies in algorithmic training or inference) for AI systems. Humans play a substantially diminished role in their development, likely moving most of our effort towards oversight, validation, and verification of an expanding “virtual lab” run by AI systems. We expect that systems capable of automated AI research and development would have skills that would transfer to the rest of science, allowing them to begin to revolutionize other fields.

How the alignment problem gets solved—or not—in this future is something we are least certain about. Models could prove to be sufficiently aligned and capable enough of research taste that they discover and implement novel solutions that we have not yet reached. They could also be sufficiently wise to halt development if not. Alternatively, the rare occurrences of misalignment present in today’s models could compound as the models build their successors, growing more frequent but less understood until we lose control of them. It’s possible that we can’t build, integrate, and verify the tools that we’d need to understand which trendline we are actually on.

We do not have good intuitions for what this world would look like, because our economy is currently driven by humans and human-built tools. By its nature, a world driven by fast recursive self-improvement could become dominated by the self-improving model as its capabilities fully eclipse those of humans and the model proliferates across the broader economy. It is difficult to predict what the economy looks like if human labor stops being competitive.

Even if model development became fully automated and recursive, we can’t predict what that would mean for most humans’ daily lives. Amdahl’s law applies here as well. Recursive intelligence could lead to achieving many of the benefits outlined in Machines of Loving Grace, quickly in some domains. We expect that embodied intelligence (i.e., robotics) might quickly follow recursive intelligence, and follow a similar path of increasing returns at decreasing cost. More powerful intelligence might help us build things in the physical world more quickly, run more productive clinical trials of lifesaving drugs, and develop novel forms of coordination.

But achieving recursive improvement alone does not suggest an immediate change in how industrial production occurs, societies organize, or markets function. More intelligence can’t learn what a drug does over decades of use, can’t hold elections sooner than a constitution dictates, and can’t turn a stranger into an old friend in a weekend. For most people, the felt pace of this future will still be set by the bottlenecks, even if the laboratory upstream runs at the speed of compute. That collision, where recursive intelligence building itself ever faster meets the world of humans, relationships, and governance, is another part of this future we can’t predict.

What should we do?

If it were possible to effectively slow the development of this technology to give ourselves more time to deal with its immense implications, we think that would likely be a good thing. But if a slowdown simply lets the least cautious actors catch up technologically, it could leave everyone less safe. Without a global coordination mechanism, companies and governments will have to make difficult decisions about safety while under competitive and geopolitical pressures.

We believe it would be good for the world to have the option to slow or temporarily pause frontier AI development to enable societal structures and alignment research to keep up with the advance of the technology. The Anthropic Institute will conduct research—in collaboration with many others—and take actions to help build the systems that a credible slowdown or pause would require. These systems would enable frontier AI developers to verify that others globally have actually stopped or slowed, and that a bad actor could not use the auspices of a coordinated slowdown to jump ahead in secret. If such systems existed, we expect that we would slow down or temporarily pause, if other developers at or near the frontier also did so in a verifiable manner.

A meaningful slowdown or pause would require multiple well-resourced labs at or near the frontier, in multiple countries, agreeing to stop under the same conditions. It would also require that each can verify that the others have actually stopped. Due to the unique characteristics of AI systems, the detectability (a lower standard than verifiability) element of this arms control problem is much more challenging than with other technologies. Training runs are far easier to conceal than missile silos, their inputs are general-purpose, and the incentive to defect quietly is enormous, because whoever continues while others pause could inherit the lead. A credible pause also has to specify what triggers it, what lifts it, and who adjudicates.

None of this is necessarily impossible in principle—the world has built verification regimes for other complex technologies (e.g., the Intermediate-Range Nuclear Forces Treaty)—but those regimes took decades to build both the infrastructure and the trust. We don’t have that long. A unilateral pause by one lab, by contrast, is achievable immediately, but accomplishes much less: it would change who the front-runner is, but it would not create the wider deliberative process that is currently missing.

In the coming months, we will organize conversations where policymakers, researchers, civil society, and other AI companies can help answer some of the questions this piece raises, especially around full recursive self-improvement and how to create better options for coordination and deliberation. We’ll publish what comes out of it. The window to investigate the questions together is here, and people outside AI companies should be involved in this deliberation.

Marina Favaro and Jack Clark co-authored this piece, with editorial support from Santi Ruiz. Shan Carter, Romello Goodman, and Nikki Makagiansar created the visuals from data collected by Brian Calvert and Jun Shern Chan. Daniel Freeman, Jim Baker, Max Young, Sarah Pollack, Francesco Mosconi, Holden Karnofsky, Andy Jones, Kevin Troy, Anton Korinek, Meg Tong, Andrew Ho, Dan Altman, Drake Thomas, Jack Shen, Sasha de Marigny, and Avital Balwit provided feedback.


  1. METR’s key measure tells you the time horizon over which AI systems can be 50% reliable at a basket of tasks, though the trendline looks the same at 80% reliability.
  2. Especially as they shift toward more open-ended formats and more difficult tasks (e.g., Olympiad-level mathematics), benchmarks often saturate below 100% due to errors in the question and answer sets like ambiguous problem statements and unsolvable questions.
  3. Anthropic leadership have publicly estimated that 90% or more of our code is written by Claude, including scripts and experimental code. Our >80% figure measures the share of lines merged to production that can be attributed to Claude. This is a more conservative measurement in two ways: our attribution pipeline has gaps, and the lines not attributed to Claude include auto-generated code and other artifacts that were not hand-written by humans either.
  4. This surge in code production is straining the infrastructure everyone shares. GitHub—the platform most of the world’s software is built on—saw roughly one billion code commits in all of 2025; by mid-2026 it saw 275 million a week, on pace for roughly 14 billion over the year. The company’s COO has said that it is “pushing incredibly hard” on capacity just to keep up.
  5. Additional details on the methodology of this survey are discussed in section 2.3.5 of the Claude Opus 4.7 System Card.
  6. Many respondents may not have thought carefully about how to account for various biases or subtleties in the question definition, and recent research by METR shows that developer estimates of AI productivity uplift can be overestimated.
  7. How large the speedup gets depends heavily on how much room for improvement the starting code leaves, and it should not be read as a real-world training speedup. So the absolute multiple is not the figure to anchor on here. What is more informative is the like-for-like comparison that this experimental setup makes possible, both across models (~3x to ~52x over the past year) and against a skilled human (~4x in four to eight hours on the same task).
  8. As a check on judge bias, we ran the same test on a separate set of 127 moments where the human’s next move was already strong (as opposed to the original set, where the human’s direction had room for improvement). There, the models’ suggestions were judged better only about 20% of the time.

  9. Quotes from Anthropic employees throughout this article are drawn from internal discussions and used with permission. They reflect individual views as of May 2026, not official company positions.
