💬 Discussion topic · 🧠 阿头学

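The AI Application Layer Isn't Dead, but Generic Thin Wrappers Will Die

The article's core judgment largely holds: there is still a big opportunity in the AI application layer, provided you build systems that are embedded in vertical workflows and directly accountable for the customer's P&L rather than generic wrapper tools. That said, it clearly carries a16z's investment-pitch stance, and it underestimates the threat from traditional software giants and model vendors that keep moving down into this territory.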

2026-05-28

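Key Points

  • The Yellow Brick Road is bound to become a race to the bottom. The author defines the "Yellow Brick Road" as horizontal, low-step-count, shallow-connector applications that produce output directly from model capability. These products are the easiest for OpenAI, Anthropic, and future stronger models with native distribution to swallow, so a startup that goes head-to-head with the big model labs has essentially no chance of winning.
  • What is truly defensible is the system of work. The author's most valuable judgment is that long-term value lies not in "can the AI write" but in "who owns the execution system for real work." Once a product takes over multi-step processes, approvals, audits, exception handling, compliance, and human-AI collaboration, it is selling a business outcome rather than a capability, and system-type products are more resilient than tool-type products.
  • The moat comes from operating memory built in production. The emphasis on data flywheels, evals, guardrails, and workflow iteration holds up, because industry tacit knowledge really is absent from public training sets. But the precondition is demanding: the product has to actually run in production and actually accumulate exceptions and feedback, not stay at the demo level.
  • Enterprise customers buy continuity, not the single strongest model. Multi-model routing, model migration, cost optimization, rollout stability, and compliance backstops are the unglamorous work enterprises are willing to pay for, so vertical application companies have a chance to become the "intelligence control plane." This is closer to the reality of enterprise procurement than an emphasis on model quality alone.
  • The argument is forceful but incomplete. The author lays out the conflict between the labs and the application layer clearly but sidesteps a more dangerous opponent: incumbents such as Salesforce, ServiceNow, and Epic that already occupy the workflow and data entry points. If the core of the moat is systems, governance, and deployment, the traditional giants often hold better ground than AI startups.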

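What This Means for Us

  • What it means for ATou: Stop treating "build an agent and hook up a few connectors" as a startup direction; that is most likely a dead end. The next step is to use the tools and steps test, the system test, and the P&L test to screen out every thin-wrapper idea and keep only system-type opportunities that can take over real business outcomes.
  • What it means for Neta: When evaluating AI projects, stop looking only at model quality or how polished the demo is. Look at whether the product is embedded in production workflows and whether it accumulates reusable exception handling and governance capabilities. The next step is to treat "operating memory" as a core metric and require teams to spell out which feedback will compound into a long-term barrier.
  • What it means for Uota: The article suggests product messaging should shift from "how strong a model we use" to "what outcome we reliably deliver for the customer." The next step is to rewrite the product narrative around revenue, cost, compliance, and accountability boundaries rather than piling on more AI magic.
  • What it means for all three: If a direction's value comes mainly from improvements in model capability, it will sooner or later be absorbed by the foundation model platforms. The next step is to prioritize scenarios with high complexity, high exception rates, heavy regulation, and high coordination costs, because those are the places where a truly defensible AI system can form.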

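Discussion Starters

1. If the real moat is "workflow system + governance + deployment capability," why would the eventual winners be AI startups rather than existing SaaS giants?
2. How strong is the data flywheel in enterprise settings, or will data isolation, private deployment, and contract restrictions severely weaken it?
3. With model costs continuing to fall and the labs pushing further into enterprise services, how long can multi-model routing and cost optimization remain a differentiator?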

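Why the App Layer Isn't Dead

The question I keep getting from founders and prospective employees: is there any AI application layer left to build, or are OpenAI and Anthropic going to kill everything?

There's a particular flavor of AI psychosis behind the question. Some people have concluded that the only durable places to avoid the permanent underclass are inside a big lab or out on the frontier building in robotics, hardtech, or similar: theoretically, anything the labs can't touch. If every piece of software is about to be eaten, either by Codex or Claude absorbing the work directly or by a future model that makes whatever you've built unnecessary, then run!

Listen, I'm as much of an AI maximalist as almost anyone, and I think they're half right. The labs really are coming for a huge swath of the application surface. But "the application layer" isn't one homogeneous opportunity. The right framing is whether you're on the Yellow Brick Road or somewhere else in Oz.

The Yellow Brick Road is our shorthand for the path the labs are walking, where they're committing extraordinary resources. The labs are best suited to problems like code generation, writing, and image creation because these problems improve with raw model capability: every dollar spent on pre-training and post-training improves product quality. Meanwhile, the rest of Oz is inhabited by more complex, often vertical problems that aren't as simple as giving a business user a horizontal tool with access to standard tools and computer use. There, the value comes less from the underlying model's raw capability (though that still matters) than from the scaffolding around it that makes the output trustworthy, compliant, and operational inside a specific industry.

We're seeing this play out in real time. OpenAI and Anthropic are effectively telling the market they can't solve every problem with a generic AI coworker. They've announced massive forward-deployed joint ventures to build whole companies around configuring and customizing their models for the enterprise. You don't pour billions into those programs if you think the next model release is going to take care of it.

So if you want to get rich building AI apps, avoid the Yellow Brick Road and build somewhere else in Oz. Here's what we've learned, and what some of our portfolio founders have learned, about what works.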

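The Yellow Brick Road

If you're starting a company, the Yellow Brick Road is the most obvious path to go down, but it's also the most dangerous. Take a high-performing model, plug in some off-the-shelf connectors (like G Drive, Slack, Salesforce, Notion, GitHub), and ship some sort of agentic orchestration layer on top of that. Magic!

The problem is that this is exactly what the labs are doing with Cowork and Codex. Obviously, they own the model, which gives them better margins, more control, and the ability to exert pricing power on anyone downstream from them. But maybe most importantly, they also own the architectural choices that define what their products are built to solve well. They've been deliberate so far about the model-plus-tool-calls pattern, and this is exactly what horizontal, low-step-count work on the road requires. Even if a startup could somehow outperform Codex or Claude Code, the labs have massive distribution arms and the biggest brand halo in AI.

If you're an AI app company running that playbook with the same connectors, no sub-agents or deeper configuration underneath, and no distribution, you're likely walking down the road to nowhere.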

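The Rest of Oz

It's not all doom and gloom for startups. There's an enormous opportunity outside the Yellow Brick Road, where startups have a clear path to own their customer and solve complex problems.

These businesses are building agentic experiences where the model is woven through a complex web of tools, automations, and integrations (read: software), which leads most of these startups to be vertical by default. They can focus on multi-step, multi-player work, with sub-agents for role-specific and vertical-specific tasks, that Anthropic and OpenAI can't reach with horizontal platforms: gathering context across systems, then routing through multiple humans who have to approve at different stages. The work often involves one or more legacy systems, tends to need deterministic outcomes where ambiguity isn't acceptable, and is at times tied directly to a valuable business outcome. The labs understand how valuable these problems are: that's why they're building their own outsourced configuration shops, and why an entire upmarket class of reinforcement learning businesses exists.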

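Why the rest of Oz won't be owned by the Wizard

A common response to the above is that, to date, it's been a pretty bad trade to bet against the models and labs improving. They'll likely just keep getting better and eventually eat into the market served by these application-layer businesses.

The labs will certainly improve, but I'd argue there are a few ways the rest of Oz can defend itself over time.

Data and learning flywheels

A lot of what you internalize isn't in any training set: unwritten industry norms, undocumented standards, the tribal knowledge that lives in practitioners' heads. None of it is on the public web. No amount of training compute substitutes for being inside the workflows where this knowledge actually lives. There are two flywheels stacked on top of each other here. The across-customer flywheel is the set of patterns that compound as you see more variants of the same problem. The within-customer flywheel is the why behind specific decisions, the unsaid exceptions, and the firm's own rules of thumb, which only surface through real interaction with the system.

Even if customer data can't be used across customers, application companies can still leverage pattern recognition across customer problem types and use it to inform the right architecture for future problems. A company that has run its agents through a hundred legal redlines, a thousand insurance underwriting cycles, or ten thousand SDR campaigns has internalized the shape of the problem in a way the next entrant cannot replicate by spinning up a fresh agent for the first time.

A horizontal agent could in principle build the same learning infrastructure. The reason it doesn't, beyond pure focus, is UX: capturing this kind of knowledge depends entirely on the workflow surfaces you give the user, and vertical players can shape those surfaces around exactly what their workflow needs to surface. Horizontal tools can't. Eval sets, labeled outputs, and edge-case taxonomies compound into a vertical-specific data flywheel that fuels fine-tuning, and a new entrant can't generate that without comparable production exposure. Whether this is possible depends on data rights, the volume of production exposure accumulated, and the structure of customer contracts, but pattern recognition accrues regardless.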

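Managing model variability and complexity: The labs are already routing internally, with different model classes for different requests and ensembles under the hood. What they can't do is route across vendors, evaluate a competitor's model for a specific sub-task, or use an open-source fine-tune for the narrow piece where it's actually best. The Rest of Oz company picks the right model for each sub-task across the entire model market, not just what its parent lab ships. It also does the work nobody wants to do every time a new model lands: re-running evals on upgrades, recalibrating prompts for the customer's edge cases, and rolling out without breaking production. The labs aren't doing this on the customer's behalf; they sell you their next model and tell you to migrate. The Rest of Oz company absorbs the migration. What the customer gets is the best intelligence available across the whole market, plus continuity through every upgrade.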
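As a rough illustration of "re-running evals on upgrades," here is a minimal Python sketch of an upgrade gate. Everything in it is an assumption made for illustration: `EvalCase`, `pass_rate`, and `choose_model` are hypothetical names, a "model" is just a callable from prompt to output, and exact-match scoring stands in for whatever grading a real deployment would use.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical: a "model" is any callable mapping a prompt to an output string.
Model = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # label assumed to come from reviewed production outputs


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of eval cases where the model output equals the expected label."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for c in cases if model(c.prompt).strip() == c.expected)
    return passed / len(cases)


def choose_model(current: Model, candidate: Model, cases: List[EvalCase],
                 max_regression: float = 0.0) -> str:
    """Promote the candidate only if it does not regress on the customer's evals."""
    cur = pass_rate(current, cases)
    cand = pass_rate(candidate, cases)
    return "candidate" if cand >= cur - max_regression else "current"


# Stub models standing in for an old and a new vendor release.
cases = [EvalCase("a", "A"), EvalCase("b", "B"), EvalCase("c", "C"), EvalCase("d", "D")]
old_model = lambda prompt: prompt.upper()   # passes every case
new_model = lambda prompt: "A"              # passes only the first case
decision = choose_model(old_model, new_model, cases)
```

With these stubs the new release regresses on the customer's edge cases, so the gate keeps the current model in production; the point is that the eval set, not the vendor's announcement, decides the rollout.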

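Cost optimization: Running every query through Opus 4.7 is the fastest path to negative gross margins. The best Rest of Oz companies route across tiers of models: frontier models for the hardest tasks, mid-tier models for the bulk, and smaller custom or fine-tuned models where they've earned the right to use them. Some are now post-training their own models on top of that, optimizing them for the narrow slice of work their customer cares about and serving them at a fraction of the cost of a frontier API call. The labs price the floor: the least intelligence available at $X. The Rest of Oz company sells the inverse: the lowest dollar cost for the specific level of intelligence the workflow actually requires. That's only possible if you know exactly what level each sub-task needs, which the labs structurally can't know across every vertical. It translates directly into lower, controlled prices for outcomes.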
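The tiered routing idea can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the tier names, capability levels, and per-call costs below are invented for illustration and do not describe any real vendor's pricing, and `route` and `workload_cost` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical label, not a real product name
    capability: int       # coarse capability level, higher is stronger
    cost_per_call: float  # illustrative dollars per call


TIERS: List[ModelTier] = [
    ModelTier("small-finetuned", capability=1, cost_per_call=0.001),
    ModelTier("mid-tier", capability=2, cost_per_call=0.01),
    ModelTier("frontier", capability=3, cost_per_call=0.10),
]


def route(required_capability: int, tiers: List[ModelTier] = TIERS) -> ModelTier:
    """Return the cheapest tier whose capability meets the sub-task's requirement."""
    eligible = [t for t in tiers if t.capability >= required_capability]
    if not eligible:
        raise ValueError(f"no tier can handle capability {required_capability}")
    return min(eligible, key=lambda t: t.cost_per_call)


def workload_cost(requirements: List[int]) -> float:
    """Total cost when each sub-task is routed to its cheapest adequate tier."""
    return sum(route(r).cost_per_call for r in requirements)


# Four sub-tasks: two easy, one medium, one hard.
routed = workload_cost([1, 1, 2, 3])
all_frontier = 4 * 0.10
```

The router is trivial; the hard part, as the paragraph above argues, is knowing the `required_capability` of each sub-task, which only comes from running that workflow in production.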

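Governance: There is considerable value in becoming the control plane for how customers run AI in a given vertical: the place where permissions, auditing, what the agent is allowed to do, and what the agent actually did all converge. That control plane is built out of use-case-specific guardrails that look completely different across industries and job types. Because these companies own the tools, the workflows, and the data the agent touches end-to-end, they can provide deterministic outcomes in ways horizontal tools will struggle to. They are also the entity that absorbs the regulatory complexity for the end buyer: FRCP and bar rules in legal, HIPAA in healthcare, SEC and FINRA in finance, state insurance regulations, and so on. A horizontal player can't credibly do that without becoming a hundred different verticals at once. CIOs want a partner that contractually commits to handling compliance for the agents it provides.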
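To make the control-plane idea concrete, here is a minimal Python sketch in which every agent action passes through one authorization point that also writes the audit trail. The `Policy` and `ControlPlane` classes, the use-case keys, and the action names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Policy:
    allowed_actions: FrozenSet[str]
    needs_human_approval: FrozenSet[str] = frozenset()


@dataclass
class ControlPlane:
    policies: Dict[str, Policy]  # keyed by use case (hypothetical keys)
    audit_log: List[dict] = field(default_factory=list)

    def authorize(self, use_case: str, action: str) -> str:
        """Decide allow / escalate / deny and record the decision for audit."""
        policy = self.policies.get(use_case)
        if policy is None or action not in policy.allowed_actions:
            decision = "deny"  # default-deny for unknown use cases or actions
        elif action in policy.needs_human_approval:
            decision = "escalate"
        else:
            decision = "allow"
        self.audit_log.append(
            {"use_case": use_case, "action": action, "decision": decision}
        )
        return decision


plane = ControlPlane(policies={
    "regulated_finance": Policy(
        allowed_actions=frozenset({"send_email", "place_call"}),
        needs_human_approval=frozenset({"place_call"}),
    ),
    "mid_market_saas": Policy(
        allowed_actions=frozenset({"send_email", "place_call"}),
    ),
})
```

The same action (`place_call`) is allowed outright for one use case and escalated to a human for another, and denied requests are logged as well, so "what the agent is allowed to do" and "what the agent actually did" live in one place.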

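All of these come back to the same thing: focus. That could be a vertical (insurance, legal, accounting) or a function done deeply (sales, customer support, finance). Either way, the work needs a team that's heads-down on one customer set: its workflows, its edge cases, its regulations. The labs aren't built for that. They have to be everywhere, for everyone, which is how they built the Yellow Brick Road in the first place. The same trade-off keeps them out of the rest of Oz. You can be everywhere at once, or you can be great at one thing. Not both.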

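Sales as an example: practical tips from 11x's technical CEO

How should you think about this in practice? Here are some practical tips from Prabhav Jain, the CEO of 11x.

Focus on outcomes

A tactical path to building a company that is resilient to the labs is to start from a specific outcome your customers really care about. For us, that was helping companies generate more pipeline. From there the questions get tactical. Which activities do we want to own end-to-end that actually drive pipeline? Decompose each activity into tasks. Which tasks are agentic and which aren't? Which require intricate domain insight and which don't? The labs will ship workflows too, but when the workflow has many steps, messy inputs, hard-to-interpret state, or real-world constraints, a better model alone won't get you there. The work falls to good old-fashioned software engineering, and the labs hold no edge over a focused application company on that surface. For example, here are some of the tasks we handle, some agentic and some not: lead prospecting based on custom signals, lead enrichment, deep account research, a context fetcher from the CRM, a channel-specific message writer, a lead qualification agent, and an email deliverability system. These aren't tasks you can just one-shot; they require deep engineering.

The critical insight in the Oz analogy is that roughly half of any real workflow is non-agentic, and on that half the labs have no advantage. They are no better than you are at writing the deterministic software underneath the model layer. And the half that is agentic still requires you to tune, train, and constrain the models against the result you actually want. Domain knowledge often doesn't sit in general training data. Those skills get built from the ground up for the vertical or function and fed into the model at the right moment in the workflow. When our agents are qualifying an inbound lead on the phone, they have to be trained on what a good sales conversation is for that specific industry and that persona. That is application company work, and it compounds.

More importantly, those skills become outdated all the time because businesses evolve, so your ability to evolve those workflows and context becomes a competitive advantage. As an example, when we started our scaled email outreach product, AI-written emails were just starting to come into play. Fast forward to today, and people have a finely tuned sense of whether an email was written by AI or by a human. Crucially, that sense changes every few months. Our agents have to constantly adapt to this market dynamic, and this is where the moat is built. In fact, despite this dynamic, our positive reply rates have gone up 4x in the last few months, and we've generated hundreds of millions of dollars in pipeline for our customers.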

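Work on problems where complexity is high

Complex problems are where real business value gets unlocked. Otherwise you'll find yourself building a thin wrapper.

Decompose any sufficiently complex business problem and messiness shows up quickly. Here's an example from the GTM world that sounds trivial: you shouldn't reach out to a contact at a company if that company is already a customer. It's anything but. Maybe you have the domain associated with the company in your CRM. What about companies with dozens of subsidiaries? What if the CRM record has the parent's domain? What if a stale matching field in Salesforce sends a cold pitch to a current customer's CRO? Real-world data is messy. Humans struggle with it, and models don't magically clear that bar. Driving order out of that mess requires purpose-built agents engineered for the specific shape of the problem, not a general-purpose copilot pointed at a CRM. In fact, based on the data we have, we've realized that the quality and freshness of our own data is much higher than our customers', so by default we anchor on our own.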
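A small Python sketch shows why the "don't email existing customers" rule needs more than a domain lookup. It assumes a hypothetical `parent_of` map from subsidiary domain to parent domain (with normalized keys) and made-up `.example` domains; it also makes one policy choice explicit, namely that a contact at any company sharing a corporate root with a customer is suppressed. None of this is presented as 11x's actual implementation.

```python
from typing import Dict, Set


def normalize(domain: str) -> str:
    """Lowercase a domain and drop a leading 'www.' (deliberately simple)."""
    d = domain.strip().lower()
    return d[4:] if d.startswith("www.") else d


def root_parent(domain: str, parent_of: Dict[str, str]) -> str:
    """Follow subsidiary -> parent links to the top of the corporate tree."""
    seen: Set[str] = set()
    d = normalize(domain)
    while d in parent_of and d not in seen:  # `seen` guards against cyclic data
        seen.add(d)
        d = normalize(parent_of[d])
    return d


def should_suppress(contact_email: str, customer_domains: Set[str],
                    parent_of: Dict[str, str]) -> bool:
    """True if the contact's company shares a corporate root with any customer."""
    contact_domain = contact_email.rsplit("@", 1)[-1]
    customer_roots = {root_parent(c, parent_of) for c in customer_domains}
    return root_parent(contact_domain, parent_of) in customer_roots


# Hypothetical corporate hierarchy: two subsidiaries under one parent.
parent_of = {
    "sub-a.example": "parent.example",
    "sub-b.example": "parent.example",
}
customers = {"parent.example"}  # the CRM only stores the parent's domain
```

A naive exact-domain match would happily email `cro@sub-a.example` here, because the CRM only knows `parent.example`. Even this sketch ignores stale fields, acquisitions, and personal email addresses, which is the messiness the paragraph above describes.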

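Guardrails aren't just there to prevent bad stuff from happening. They're what your customers are paying you for.

Guardrails are severely underestimated. Even inside the same product, every use case needs its own. For us, a regulated financial services prospect demands different guarantees than a mid-market SaaS customer, and those guarantees roll down into how the agent is allowed to write, who it can contact, what data it can touch, what it can say on a call, and how every decision gets logged.

A one-size-fits-all system collapses under that variance. Guardrails have to be built per use case, configured per customer, and audited continuously, and that work sits squarely with the application company. This is why we have FDEs and technical deployment strategists who tune for each customer's requirements. As an example, we worked with an F1000 institution to do consented outbound via voice to their large SMB customer base. The first few iterations had low pickup rates, so we had to iterate quickly and learn how to get this specific type of audience to engage in the first 10 seconds of the call. SMB business owners behave very differently from larger B2B buyers or consumers. We now generate more sales opportunities for them in a day than their entire sales team generates for that segment in a month.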

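Insurance as an example: practical tips from FurtherAI's CEO

Sales is one example. Insurance is another, and it makes the same point from a different angle. Here's how Aman Gour, CEO of FurtherAI, thinks about building off the road.

When we started deploying AI inside real insurance operations, we kept hearing a particular assumption: the model is the intelligence, and the workflow is just scaffolding around it.

The more carriers we worked with, the more convinced we became that this is backwards.

In insurance, a lot of the intelligence lives inside the workflow itself. Two carriers can run a submission through what looks like the same path: submission, review, quote, bind. But the path is the easy part. What separates the two carriers is everything inside it: which risks get escalated, which loss signals matter, which appetite rule wins when two of them conflict, when a human has to sign off, which external data gets pulled in, and how the final decision gets documented.

That logic does not live in one clean rules engine. It is spread across SOPs, manager reviews, underwriting philosophy, carrier-specific appetite, and years of operational experience. A lot of it is not written down in a form a model can simply read.

This is why we do not believe in a pure agent that reasons from scratch every time, and we do not believe in a rigid workflow that breaks the moment reality gets messy. Instead, we have been building agentic workflows. The workflow gives you repeatability, auditability, and cost control. The agent handles variability and recovers when the happy path breaks. The human stays in the loop for the judgment calls where accountability matters.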
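The division of labor between workflow, agent, and human can be sketched as follows. This is a toy Python example under loud assumptions: `Submission`, `run_workflow`, the `auto_limit` threshold, and the regex-based `recover_from_notes` (standing in for an agent call) are all hypothetical and are not FurtherAI's actual design.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Submission:
    insured: str
    tiv: Optional[float]  # total insured value; None models a messy input
    notes: str = ""


@dataclass
class Outcome:
    decision: str  # "quote" or "refer_to_human"
    path: List[str] = field(default_factory=list)  # audit trail of steps taken


def recover_from_notes(sub: Submission) -> Optional[float]:
    """Stand-in for an agent: try to recover the missing value from free text."""
    match = re.search(r"TIV:\s*(\d+)", sub.notes)
    return float(match.group(1)) if match else None


def run_workflow(sub: Submission,
                 recover: Callable[[Submission], Optional[float]],
                 auto_limit: float = 1_000_000.0) -> Outcome:
    out = Outcome(decision="refer_to_human")
    # Workflow step: the repeatable, auditable happy path.
    tiv = sub.tiv
    out.path.append("extract")
    if tiv is None:
        # Agent step: handle variability when the happy path breaks.
        tiv = recover(sub)
        out.path.append("agent_recover")
    if tiv is None or tiv > auto_limit:
        # Human step: judgment calls where accountability matters.
        out.path.append("human_review")
        return out
    out.decision = "quote"
    out.path.append("auto_quote")
    return out


clean = run_workflow(Submission("Acme", 500_000.0), recover_from_notes)
messy = run_workflow(Submission("Bolt", None, "Broker says TIV: 250000"), recover_from_notes)
large = run_workflow(Submission("Crux", 5_000_000.0), recover_from_notes)
```

Each outcome carries its `path`, so the same structure that gives repeatability also gives the audit trail, and the agent is only invoked where the deterministic step fails.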

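On day one, this automates manual work. But over time, every escalation becomes a signal, every exception is feedback, and every human correction shows where the runbook was incomplete. Gradually, the workflow stops being a script and starts becoming the carrier's operating memory. This is the part the labs will find hard to reach. They will keep shipping better models and better general agents, and they should. But they do not sit inside a carrier's production workflows long enough to learn why one account was escalated, why one risk was declined, or why an underwriter overrode the appetite guide and was right to do so.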
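One very simple way to picture "every human correction shows where the runbook was incomplete" is to count human overrides per rule. The Python sketch below assumes a hypothetical `Correction` record and rule identifiers invented for illustration; a real system would capture far richer context than a rule ID.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Correction:
    rule_id: str          # hypothetical identifier of the runbook rule involved
    agent_decision: str
    human_decision: str


def runbook_gaps(corrections: List[Correction],
                 min_overrides: int = 2) -> List[Tuple[str, int]]:
    """Rules humans overrode at least `min_overrides` times, most frequent first."""
    overrides = Counter(
        c.rule_id for c in corrections if c.agent_decision != c.human_decision
    )
    return [(rule, n) for rule, n in overrides.most_common() if n >= min_overrides]


history = [
    Correction("appetite-coastal", "decline", "quote"),
    Correction("appetite-coastal", "decline", "quote"),
    Correction("loss-run-age", "quote", "quote"),      # human agreed, not a gap
    Correction("fleet-size", "quote", "refer"),        # single override
]
gaps = runbook_gaps(history)
```

Here the repeated overrides on one appetite rule flag it as the place where the written runbook and the underwriters' actual judgment diverge, which is the kind of signal that only accumulates through production use.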

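That understanding only comes from running the workflow, in production, many thousands of times. The workflow you ship on day one is not the moat. The loop that production usage creates over time is.

For us, that is what it means to build off the road.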

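How do you decide if you are in the rest of Oz or not?

Tools and steps test: How many steps does the work take, and how complex is the tooling you need to build to support it? Compare horizontal AI search over Google Drive: one step, one tool, and a forgiving outcome, where the user reads the summary and simply asks again if it's wrong. Then compare multi-step legal redlining against three years of a law firm's precedent: many tools, dozens of steps, and an output that has to survive partner review and may even have to be defended in court. Both look like an agent doing work, but only the latter requires the kind of deep software that takes a highly focused team years to build.

System test: Are you building a system the customer uses to run the work itself, or a tool layered on top of a system they already have? A system owns the whole workflow, from data capture and governance through to the record of everything completed; it is what customers point to when they describe how the real work happens. A tool just adds a bit of intelligence to a workflow the customer is already running. A tool can certainly generate real revenue, but the labs can also take it away, because the customer doesn't depend on you as the orchestration layer. High ACV is often a signal of a system, because systems replace real labor and are priced accordingly, but it is not a guarantee. Ask yourself: if a lab shipped something billed as a direct competitor, would the customer still need your product? If the answer is yes, you've built a system. If the answer is no, you're a tool, even if your ACV is high.

Hedge fund / P&L test: The labs are judged on benchmarks, while the rest of Oz is judged on the customer's P&L. Customers don't care how your model scores on SWE-Bench or MMLU. They care whether your agent closed the deal, redlined the contract correctly, or bound the right policy. If they are watching outcomes specific to their own workflow rather than a general capability score, you're in the rest of Oz. If they are buying general capability, then what you're selling them is something they could get by buying a Claude or Codex seat. The best agent companies will ultimately have to execute like hedge funds, winning on the alpha that shows up in the customer's P&L rather than on benchmark scores.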

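Both sides can win, and both will

We will see enormous winners both on and off the Yellow Brick Road. The model companies will keep winning because they own the models and the distribution for the horizontal tools they design for themselves.

The rest of Oz can win too, provided these companies own the system of work: the surface where a company's real work happens and where the related data is captured. They control data capture, the system of action that executes the workflow, and governance. As more complex workflows in a vertical mature, they stack on top of one another and eventually become the core experience the customer depends on. As incumbents and new entrants keep releasing new generations of models, this company becomes the layer that integrates those models and delivers them to the customer. The models underneath are replaceable. The system of work is not.

The next generation of enterprise software will be built off the Yellow Brick Road.

If you're building this, reach out: jschmidt@a16z.com.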

Why The App Layer Isn't Dead


The question I keep getting from founders and prospective employees: is there any AI application layer left to build, or are OpenAI and Anthropic going to kill everything?


There's a particular flavor of AI psychosis behind the question. Some people have concluded the only durable places to avoid the permanent underclass are inside a big lab or out on the frontier building in robotics, hardtech, or similar – theoretically anything “the labs can't touch.” If every piece of software is about to be eaten, either by Codex or Claude absorbing the work directly, or by a future model that will make whatever you’ve built unnecessary, then run!


Listen, I'm as much of an AI maximalist as almost anyone, and I think they're half right. The labs really are coming for a huge swath of the application surface. But "the application layer" isn't just one homogeneous opportunity. The right framing is whether you're on the Yellow Brick Road or somewhere else in Oz.


The Yellow Brick Road is our shorthand for the path the labs are walking, where they’re committing extraordinary resources. The reason the labs are best-suited for problems like code generation, writing, or image-creation is because these problems improve with raw model capability: every dollar spent on pre-training and post-training improves product quality. Meanwhile, the rest of Oz is inhabited by more complex, often vertical problems, that aren’t as simple as giving a business user a horizontal tool with access to standard tools and computer use. The value comes less from the underlying model's raw capability (though that’s still important!) than from the scaffolding around it that makes the output trustworthy, compliant, and operational inside a specific industry.


We’re seeing this play out in real time as OpenAI and Anthropic are effectively telling the market they can't solve every problem with a generic AI coworker. They've announced massive forward-deployed joint ventures to build whole companies around configuring and customizing their models for the enterprise. You don't pour billions into those programs if you think the next model release is going to take care of it.


So if you want to get rich building AI apps – avoid the yellow brick road and build somewhere else in Oz. Here’s what we’ve learned, and what some of our portfolio founders have learned, about what works.


The Yellow Brick Road


If you're starting a company, The Yellow Brick Road is the most obvious path to go down, but it's the most dangerous. Take a high performing model, plug in some off-the-shelf connectors (like G Drive, Slack, Salesforce, Notion, GitHub), and ship some sort of agentic orchestration layer on top of that. Magic!


The problem with this is that this is what the labs are doing with Cowork and Codex. Obviously, they own the model, which gives them better margins, control, and the ability to exert pricing power on anyone who's downstream from them. But maybe most importantly, they also own the architectural choices that define what their products are built to solve well. They've been deliberate so far about the model-plus-tool-calls pattern, and this is exactly what horizontal, low-step-count work on the road requires. Even if a startup could somehow outperform Codex or Claude Code, the labs have massive distribution arms and the biggest brand halo in AI.


If you're an AI app company running that playbook with the same connectors, no sub-agents or configuration below it, and no distribution, you're likely walking down the road to nowhere.


The Rest of Oz


It’s not all doom and gloom for startups. There's an enormous opportunity outside the Yellow Brick Road, where startups have a clear path to own their customer and solve complex problems.


These businesses are building agentic experiences where the model is woven through a complex web of tools, automations, and integrations (read: software), leading most of these startups to be vertical by default. They can focus on multi-step and multi-player work, with sub-agents for role- and vertical-specific tasks, that Anthropic and OpenAI can't reach with horizontal platforms: gathering context across systems, then routing through multiple humans who have to approve at different stages. It often involves one or more legacy systems, tends toward needing deterministic outcomes where ambiguity isn't acceptable, and is at times tied to some valuable business outcome. The labs understand how valuable these problems are: that's why they're building their own outsourced configuration shops, and why an entire upmarket class of reinforcement learning businesses exists.


Why the rest of Oz won't be owned by the Wizard


The response to the above would be that to date, it’s been a pretty bad trade to bet against the models/labs improving. They’ll likely just keep getting better and eventually eat into the market served by these application layer businesses.


The labs will certainly improve, but I'd argue there are a few ways the rest of Oz can defend themselves over time:


Data and learning flywheels:


A lot of what you internalize isn't in any training set — unwritten industry norms, undocumented standards, the tribal knowledge that lives in practitioners' heads. None of it is on the public web. No amount of training compute substitutes for being inside the workflows where this knowledge actually lives. There are two flywheels stacked on top of each other here: an across-customer one — patterns that compound as you see more variants of the same problem — and a within-customer one — the why behind specific decisions, the unsaid exceptions, the firm's own rules of thumb that only surface through real interaction with the system.


Even if customer data can't be used across customers, application companies will be able to leverage pattern recognition across customer problem types, and use that to inform the right architecture for future problems. A company that has run its agents through a hundred legal redlines, a thousand insurance underwriting cycles, or ten thousand SDR campaigns has internalized the shape of the problem in a way the next entrant cannot replicate by spinning up a fresh agent for the first time.


A horizontal agent could in principle build the same learning infrastructure. The reason it doesn't, beyond pure focus, is UX: capturing this kind of knowledge depends entirely on the workflow surfaces you give the user, and vertical players can shape those surfaces around exactly what their workflow needs to surface. Horizontal tools can't. Eval sets, labeled outputs, and edge-case taxonomies can compound into a vertical-specific data flywheel which can fuel fine-tuning the next entrant can't generate without comparable production exposure. Whether this is possible depends on data rights, the volume of production exposure accumulated, and the structure of customer contracts, but pattern recognition accrues regardless.


Managing model variability and complexity: The labs are already routing internally — different model classes for different requests, ensembles under the hood. What they can't do is route across vendors, or evaluate a competitor's model for a specific sub-task, or use an open-source fine-tune for the narrow piece where it's actually best. The Rest of Oz company picks the right model for each sub-task across the entire model market, not just what its parent lab ships. It also does the work nobody wants to do — re-running evals on upgrades, recalibrating prompts for the customer's edge cases, rolling out without breaking production — every time a new model lands. The labs aren't doing this on the customer's behalf; they sell you their next model and tell you to migrate. The Rest of Oz company absorbs the migration. What the customer gets is the best intelligence available across the whole market, plus continuity through every upgrade.


Cost optimization: Running every query through Opus 4.7 is the fastest path to negative gross margins. The best Rest of Oz companies route across tiers of models — frontier models for the hardest tasks, mid-tier for the bulk, smaller custom or fine-tuned models where they've earned the right to use them. Some are now post-training their own models on top of that, optimizing them for the narrow slice of work their customer cares about and serving them at a fraction of the cost of a frontier API call. The labs price the floor: the least intelligence available at $X. The Rest of Oz company sells the inverse — the lowest dollar cost for the specific level of intelligence the workflow actually requires. That's only possible if you know exactly what level each sub-task needs, which the labs structurally can't know across every vertical. It translates directly into lower, controlled prices for outcomes.


Governance: There is considerable value in becoming the control plane for how their customers run AI in that vertical – the place where permissions, auditing, what-the-agent-is-allowed-to-do, and what-the-agent-actually-did all converge. That control plane is built out of use case specific guardrails that look completely different across industries and job types. Because they own the tools, the workflows, and the data the agent touches end-to-end, they can provide deterministic outcomes in ways horizontal tools will struggle to. They are also the entity that absorbs the regulatory complexity for the end buyer — FRCP and bar rules in legal, HIPAA in healthcare, SEC and FINRA in finance, state insurance regulations, and so on. A horizontal player can't credibly do that without becoming a hundred different verticals at once. CIOs want to have a partner that contractually states they are handling compliance for the agents they are providing.


All of these come back to the same thing: focus. That could be a vertical (insurance, legal, accounting) or a function done deeply (sales, customer support, finance). Either way, the work needs a team that's heads-down on one customer set — its workflows, its edge cases, its regulations. The labs aren't built for that. They have to be everywhere, for everyone, which is how they built the Yellow Brick Road in the first place. The same trade-off keeps them out of the rest of Oz — you can be everywhere at once, or you can be great at one thing. Not both.


Sales as an example – practical tips from 11x’s technical CEO


How should you think about this in practice? Here are some practical tips from Prabhav Jain, the CEO of 11x.


Focus on outcomes


A tactical path to building a company that is resilient to the labs is to just start from a specific outcome that your customers really care about. For us that was helping companies generate more pipeline. From there the questions get tactical. Which activities do we want to own end-to-end that actually drive pipeline? Decompose each activity into tasks. Which tasks are agentic and which aren't? Which require intricate domain insight and which don't? The labs will ship workflows too, but when the workflow has many steps, messy inputs, hard-to-interpret state, or real-world constraints, a better model alone won't get you there. The work falls to good old-fashioned software engineering, and the labs hold no edge over a focused application company on that surface. For example, here are some of the tasks that we handle, some agentic and some not: lead prospecting based on custom signals, lead enrichment, deep account research, context fetcher from CRM, channel-specific message writer, lead qualification agent, and email deliverability system. These aren't tasks you can just one-shot; they require deep engineering.


The critical insight in the Oz analogy is that roughly half of any real workflow is non-agentic, and that half carries no lab advantage. The labs are no better than you are at writing the deterministic software underneath the model layer. And the half that is agentic still requires you to tune, train, and constrain the models against the result you actually want. Domain knowledge often doesn't sit in general training data. Those skills get built from the ground up for the vertical or function, and fed into the model at the right moment in the workflow. When our agents are qualifying an inbound lead on the phone, they have to be trained on what a good sales conversation is for that specific industry and that persona. That is application company work, and it compounds.


More importantly, those skills become outdated all the time because businesses evolve, so your ability to evolve those workflows and context becomes a competitive advantage. As an example, when we started our scaled email outreach product, AI-written emails were just starting to come into play. Fast forward to today, and folks have a tuned sense of whether an email was written by AI or by a human. Crucially, this changes every few months. Our agents have to constantly adapt given the market dynamic, but this is where the moat is built. In fact, despite this dynamic, our positive reply rates have gone up 4x in the last few months and we've generated hundreds of millions in pipeline for our customers.


Work on problems where complexity is high


Complex problems are where real business value gets unlocked. Otherwise you’ll find yourself building a thin wrapper.


Decompose any sufficiently complex business problem and messiness shows up quickly. Here's an example from the GTM world that sounds trivial: you shouldn't reach out to a contact at a company if that company is already a customer. It's anything but. Maybe you have the domain associated with the company in your CRM. What about companies with dozens of subsidiaries? What if the CRM record has the parent's domain? What if a stale matching field in Salesforce sends a cold pitch to a current customer's CRO? Real-world data is messy. Humans struggle with it. Models don't magically clear that bar. Driving order out of that mess requires purpose-built agents engineered for the specific shape of the problem, not a general-purpose copilot pointed at a CRM. In fact, based on the data that we have, we have realized that the quality and freshness of our data is much higher than our customers', so by default, we anchor on our own.


Guardrails aren’t just to prevent bad stuff from happening. That’s what your customers are paying you for.


Guardrails are severely underestimated. Even inside the same product, every use case needs its own. For us, a regulated financial services prospect demands different guarantees than a mid-market SaaS customer, and those guarantees roll down into how the agent is allowed to write, who it can contact, what data it can touch, what it can say on a call and how every decision gets logged.


A one-size-fits-all system collapses under that variance. Guardrails have to be built per use case, configured per customer, and audited continuously, and that work sits squarely with the application company. This is why we have FDEs and technical deployment strategists who tune for each customer's requirements. As an example, we worked with an F1000 institution to do consented outbound via voice to their large SMB customer base. The initial few iterations had low pickup rates, so we had to quickly iterate and learn how to get this specific type of audience to engage in the first 10 seconds of the call. SMB business owners behave very differently from larger B2B buyers or consumers. We now generate more sales opportunities for them in a day than their entire sales team generates for that segment in a month.


Insurance as an example – practical tips from FurtherAI’s CEO


Sales is one example. Insurance is another, and it makes the same point from a different angle. Here’s how Aman Gour, CEO of FurtherAI, thinks about building off the road:


When we started deploying AI inside real insurance operations, we kept hearing a particular assumption: the model is the intelligence, and the workflow is just scaffolding around it.


The more carriers we worked with, the more convinced we became that this is backwards.


In insurance, a lot of the intelligence lives inside the workflow itself. Two carriers can run a submission through what looks like the same path: submission, review, quote, bind. But the path is the easy part. What separates the two carriers is everything inside it: which risks get escalated, which loss signals matter, which appetite rule wins when two of them conflict, when a human has to sign off, which external data gets pulled in, and how the final decision gets documented.


That logic does not live in one clean rules engine. It is spread across SOPs, manager reviews, underwriting philosophy, carrier-specific appetite, and years of operational experience. A lot of it is not written down in a form a model can simply read.

这些逻辑并不住在某个干净统一的规则引擎里。它分散在 SOP、经理审核、核保理念、承保机构专属的风险偏好,以及多年运营经验之中。很多内容甚至根本没有被写成模型可以直接读取的形式。

This is why we do not believe in a pure agent that reasons from scratch every time, and we do not believe in a rigid workflow that breaks the moment reality gets messy. Instead, we have been building agentic workflows. The workflow gives you repeatability, auditability, and cost control. The agent handles variability and recovers when the happy path breaks. The human stays in the loop for the judgment calls where accountability matters.

这就是为什么我们不相信一种每次都从零推理的纯 agent,也不相信一种现实稍微变脏一点就会崩掉的僵硬工作流。我们一直在构建的是 agentic workflows。工作流带来可重复性、可审计性和成本控制。agent 负责处理波动性,并在理想路径断掉时恢复。人类则留在回路里,处理那些责任必须明确归属的判断性决策。

On day one, this automates manual work. But over time, every escalation becomes a signal, every exception becomes feedback, and every human correction shows where the runbook was incomplete. Gradually, the workflow stops being a script and starts becoming the carrier’s operating memory. This is the part the labs will find hard to reach. They will keep shipping better models and better general agents, and they should. But they do not sit inside a carrier’s production workflows long enough to learn why one account was escalated, why one risk was declined, or why an underwriter overrode the appetite guide and was right to do so.

在第一天,这件事是在自动化人工工作。但随着时间推移,每一次升级处理都会变成一个信号,每一个例外都是一次反馈,每一次人工修正都会暴露出 runbook 哪些地方还不完整。慢慢地,工作流不再只是脚本,而开始变成承保机构的运营记忆。这部分正是实验室很难触达的地方。它们会继续推出更好的模型和更好的通用 agents,这本来也应该如此。但它们不会长时间待在某家承保机构的生产工作流里,所以学不到为什么某个账户被升级处理、为什么某个风险被拒保,或者为什么某位核保员覆盖了 appetite guide,而且这么做是对的。

That understanding only comes from running the workflow, in production, many thousands of times. The workflow you ship on day one is not the moat. The loop that production usage creates over time is.

这种理解,只能来自于让工作流在生产环境中运行成千上万次。你第一天发出来的工作流,不是护城河。真正的护城河,是生产使用随着时间推移形成的那个循环。

For us, that is what it means to build off the road.

对我们来说,这就是走出黄砖路去构建的含义。

How do you decide if you are in the rest of Oz or not?

你该怎么判断自己是不是在奥兹国的其他地方

The tools-and-steps test: How many steps does the work take, and how complex are the tools you have to build to support it? Compare a horizontal AI search across Google Drive — one step against one tool with a forgiving outcome, the user reads the summary and re-asks if it's wrong — to a multi-step legal redline against three years of firm precedent: dozens of steps across many tools, output that has to clear partner review and may need to be argued in court. Both look like "an agent doing work," but only one of them requires the kind of deep software a focused team takes years to build.

工具与步骤测试 这项工作一共要走多少步,你又需要构建多复杂的工具来支撑它。拿 Google Drive 上的横向 AI 搜索来对比,一个步骤,一个工具,结果也比较宽容,用户读完摘要,如果错了再问一次就行。再拿基于三年律所先例进行多步骤法律 redline 来比,里面有很多工具、几十个步骤,输出必须通过合伙人审阅,甚至可能还得在法庭上为之辩护。两者看起来都像是某个 agent 在干活,但只有后者需要那种得由一支高度聚焦团队花多年才能搭出来的深层软件。

The system test: Are you building a system the customer runs their work through, or a tool that sits on top of a system they already have? Systems own the workflow end-to-end (the data capture, the governance, the records of what got done), and they're what the customer points to when describing how the actual work happens. Tools, on the other hand, just add intelligence to a workflow the customer already runs. The tool case generates real revenue, but the labs can take it because the customer isn't depending on you as the orchestration layer. High ACV is usually a signal of a system, since systems replace real headcount and get paid accordingly, but it isn't a guarantee. Ask yourself whether the customer would still need your product if a lab shipped something that supposedly competes with you directly. If yes, you're building a system. If no, you're a tool, even if your ACV is high.

系统测试 你是在构建一个客户拿来运行工作本身的系统,还是一个叠在他们已有系统之上的工具。系统拥有整个工作流,从数据采集、治理,到所有完成记录,客户在描述真实工作如何发生时,指的就是它。工具则只是给客户已经在运行的工作流加一点智能。工具这种情况当然也能带来真实收入,而实验室也能把它拿走,因为客户并不依赖你作为编排层。高 ACV 往往是系统的信号,因为系统替代的是真实人力,所以收费也会相应更高,但它不是保证。问问自己,如果某家实验室发了一个据说和你直接竞争的东西,客户是否仍然需要你的工具。如果答案是需要,那你做的是系统。如果答案是不需要,那你就是工具,就算你的 ACV 很高也是一样。

The hedge fund / P&L test: While lab performance is judged against benchmarks, rest of Oz performance is judged against your customer's P&L. Your customer doesn't care that your model scored well on SWE-Bench or MMLU — they care whether your agent closed the deal, redlined the contract correctly, or bound the right policy. If they're fixated on their workflow-specific outcome, not on a generic capability score, you're in the rest of Oz. If they're paying for generic capability, you're selling them something they can get with a Claude or Codex seat. The best agent businesses are going to need to execute like hedge funds — winning on alpha measured in customer P&L, not in benchmark scores.

对冲基金 / P&L 测试 实验室的表现是拿 benchmark 来评判的,而奥兹国其余地方的表现,是拿客户的 P&L 来评判的。客户不在乎你的模型在 SWE-Bench 或 MMLU 上得分多高,他们在乎的是你的 agent 有没有把交易拿下、有没有把合同 redline 对、有没有承保对的保单。如果他们盯着的是自己工作流特有的结果,而不是某个通用能力分数,那你就在奥兹国的其他地方。如果他们买的是通用能力,那你卖给他们的东西,本质上只要买一个 Claude 或 Codex 席位就能拿到。最好的 agent 公司,最终都得像对冲基金一样执行,以客户 P&L 上体现出来的 alpha 取胜,而不是靠 benchmark 分数取胜。

Both can (and will) win

两边都能赢,而且都会赢

We're going to see massive winners on and off the Yellow Brick Road. The labs will continue to win because they own the model and they own the distribution for the horizontal tools they have designed.

我们会看到黄砖路上和黄砖路外都出现巨大赢家。模型公司会继续赢,因为它们拥有模型,也拥有为自己设计出来的横向工具做分发的能力。

The rest of Oz can win if they own the system of work — the surface where the work of the company actually executes and the data that flows from it gets captured. These companies own the data capture, the workflow system of action, and the governance. As more complex workflows mature in a vertical, they compound into one core experience the customer comes to depend on. As new model generations ship from incumbents and new entrants, the company becomes the layer that integrates and delivers them to the customer. The model is fungible underneath; the system of work is not.

奥兹国的其他地方也能赢,前提是它们拥有工作的系统,也就是公司真实工作发生并且相关数据会被采集下来的那块表面。这些公司掌握着数据采集、执行工作流的行动系统,以及治理能力。随着某个垂直领域里更复杂的工作流逐渐成熟,它们会不断叠加,最终变成客户赖以生存的核心体验。随着老玩家和新进入者不断推出新一代模型,这家公司就会变成那一层,把这些模型整合起来并交付给客户。底下的模型可以替换,工作的系统不行。

The next generation of enterprise software is going to be built off the road.

下一代企业软件,会在黄砖路之外建成。

If you're building it, reach out: jschmidt@a16z.com.

如果你正在做这件事,联系我,jschmidt@a16z.com。

**Why The App Layer Isn't Dead**

The question I keep getting from founders and prospective employees: is there any AI application layer left to build, or are OpenAI and Anthropic going to kill everything?

There's a particular flavor of AI psychosis behind the question. Some people have concluded the only durable places to avoid the permanent underclass are inside a big lab or out on the frontier building in robotics, hardtech, or similar – theoretically anything “the labs can't touch.” If every piece of software is about to be eaten, either by Codex or Claude absorbing the work directly, or by a future model that will make whatever you’ve built unnecessary, then run!

Listen, I'm as much of an AI maximalist as almost anyone, and I think they're half right. The labs really are coming for a huge swath of the application surface. But "the application layer" isn't one homogeneous opportunity. The right framing is whether you're on the Yellow Brick Road or somewhere else in Oz.

The Yellow Brick Road is our shorthand for the path the labs are walking, where they’re committing extraordinary resources. The labs are best suited to problems like code generation, writing, or image creation because these problems improve with raw model capability: every dollar spent on pre-training and post-training improves product quality. Meanwhile, the rest of Oz is inhabited by more complex, often vertical problems that aren’t as simple as giving a business user a horizontal tool with access to standard tools and computer use. The value comes less from the underlying model's raw capability (though that’s still important!) than from the scaffolding around it that makes the output trustworthy, compliant, and operational inside a specific industry.

We’re seeing this play out in real time as OpenAI and Anthropic are effectively telling the market they can't solve every problem with a generic AI coworker. They've announced massive forward-deployed joint ventures to build whole companies around configuring and customizing their models for the enterprise. You don't pour billions into those programs if you think the next model release is going to take care of it.

So if you want to get rich building AI apps, avoid the Yellow Brick Road and build somewhere else in Oz. Here’s what we’ve learned, and what some of our portfolio founders have learned, about what works.

The Yellow Brick Road

If you're starting a company, the Yellow Brick Road is the most obvious path to go down, but it's also the most dangerous. Take a high-performing model, plug in some off-the-shelf connectors (like Google Drive, Slack, Salesforce, Notion, GitHub), and ship some sort of agentic orchestration layer on top of that. Magic!

The problem is that this is exactly what the labs are doing with Cowork and Codex. Obviously, they own the model, which gives them better margins, control, and the ability to exert pricing power on anyone downstream from them. But maybe most importantly, they also own the architectural choices that define what their products are built to solve well. They've been deliberate so far about the model plus tool calls pattern, and this is exactly what horizontal low-step-count work on the road requires. Even if a startup could somehow outperform Codex or Claude Code, the labs have massive distribution arms and the biggest brand halo in AI.

If you're an AI app company running that playbook with the same connectors, no sub-agents or configuration below it, and no distribution, you're likely walking down the road to nowhere.

The Rest of Oz

It’s not all doom and gloom for startups. There's an enormous opportunity outside the Yellow Brick Road, where startups have a clear path to own their customer and solve complex problems.

These businesses are building agentic experiences where the model is woven through a complex web of tools, automations, and integrations (read: software), leading most of these startups to be vertical by default. They can focus on multi-step and multi-player work, with sub-agents for role- and vertical-specific tasks, that Anthropic and OpenAI can’t reach with horizontal platforms: gathering context across systems, then routing through multiple humans who have to approve at different stages. This work often involves one or more legacy systems, tends toward needing deterministic outcomes where ambiguity isn't acceptable, and is at times tied to some valuable business outcome. The labs understand how valuable these problems are: that’s why they’re building their own outsourced configuration shops, and why an entire upmarket class of reinforcement learning businesses exists.

Why the rest of Oz won't be owned by the Wizard

The response to the above would be that to date, it’s been a pretty bad trade to bet against the models/labs improving. They’ll likely just keep getting better and eventually eat into the market served by these application layer businesses.

The labs will certainly improve, but I'd argue there are a few ways the rest of Oz can defend themselves over time:

Data and learning flywheels:

A lot of what you internalize isn't in any training set — unwritten industry norms, undocumented standards, the tribal knowledge that lives in practitioners' heads. None of it is on the public web. No amount of training compute substitutes for being inside the workflows where this knowledge actually lives. There are two flywheels stacked on top of each other here: an across-customer one — patterns that compound as you see more variants of the same problem — and a within-customer one — the why behind specific decisions, the unsaid exceptions, the firm's own rules of thumb that only surface through real interaction with the system.

Even if customer data can't be used across customers, application companies will be able to leverage pattern recognition across customer problem types, and use that to inform the right architecture for future problems. A company that has run its agents through a hundred legal redlines, a thousand insurance underwriting cycles, or ten thousand SDR campaigns has internalized the shape of the problem in a way the next entrant cannot replicate by spinning up a fresh agent for the first time.

A horizontal agent could in principle build the same learning infrastructure. The reason it doesn't, beyond pure focus, is UX: capturing this kind of knowledge depends entirely on the workflow surfaces you give the user, and vertical players can shape those surfaces around exactly what their workflow needs to surface. Horizontal tools can't. Eval sets, labeled outputs, and edge-case taxonomies can compound into a vertical-specific data flywheel which can fuel fine-tuning the next entrant can't generate without comparable production exposure. Whether this is possible depends on data rights, the volume of production exposure accumulated, and the structure of customer contracts, but pattern recognition accrues regardless.

Managing model variability and complexity: The labs are already routing internally — different model classes for different requests, ensembles under the hood. What they can't do is route across vendors, or evaluate a competitor's model for a specific sub-task, or use an open-source fine-tune for the narrow piece where it's actually best. The Rest of Oz company picks the right model for each sub-task across the entire model market, not just what its parent lab ships. It also does the work nobody wants to do — re-running evals on upgrades, recalibrating prompts for the customer's edge cases, rolling out without breaking production — every time a new model lands. The labs aren't doing this on the customer's behalf; they sell you their next model and tell you to migrate. The Rest of Oz company absorbs the migration. What the customer gets is the best intelligence available across the whole market, plus continuity through every upgrade.
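The "re-running evals on upgrades" step can be pictured as a promotion gate. The following is only a minimal sketch under stated assumptions: `pass_rate`, `should_promote`, the stub models, and the exact-match scoring are all hypothetical names and choices made for illustration, not anything the article or a vendor SDK prescribes.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration: a "model" is any callable from prompt to output.
Model = Callable[[str], str]
EvalCase = Tuple[str, str]  # (prompt, expected output)


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of eval cases where the output matches the expected output exactly."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return passed / len(cases)


def should_promote(current: Model, candidate: Model, cases: List[EvalCase],
                   max_regression: float = 0.0) -> bool:
    """Promote the candidate only if it does not regress beyond the allowed margin."""
    return pass_rate(candidate, cases) >= pass_rate(current, cases) - max_regression


# Stub models standing in for real model calls (assumed, not a real API).
def current_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


def candidate_model(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris", "bind policy?": "escalate"}
    return answers.get(prompt, "unknown")


# A tiny customer-specific eval set; real ones would hold the customer's edge cases.
CASES = [("2+2", "4"), ("capital of France", "Paris"), ("bind policy?", "escalate")]

if __name__ == "__main__":
    print("promote candidate:", should_promote(current_model, candidate_model, CASES))
```

In practice the scoring function would likely be richer than exact match, but the design point is the same: the upgrade is gated on the customer's own cases rather than on a public benchmark.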

Cost optimization: Running every query through Opus 4.7 is the fastest path to negative gross margins. The best Rest of Oz companies route across tiers of models — frontier models for the hardest tasks, mid-tier for the bulk, smaller custom or fine-tuned models where they've earned the right to use them. Some are now post-training their own models on top of that, optimizing them for the narrow slice of work their customer cares about and serving them at a fraction of the cost of a frontier API call. The labs price the floor: the least intelligence available at $X. The Rest of Oz company sells the inverse — the lowest dollar cost for the specific level of intelligence the workflow actually requires. That's only possible if you know exactly what level each sub-task needs, which the labs structurally can't know across every vertical. It translates directly into lower, controlled prices for outcomes.
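Routing across tiers can be sketched as "pick the cheapest model that clears the bar for this sub-task." This is an illustrative sketch only: the tier names, capability scores, and per-call costs below are invented placeholders, and `cheapest_tier` and `plan_cost` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical label, not a real model ID
    capability: int       # made-up ordinal score; higher means more capable
    cost_per_call: float  # made-up dollar cost per call


TIERS: List[ModelTier] = [
    ModelTier("small-finetuned", 1, 0.001),
    ModelTier("mid-tier", 2, 0.01),
    ModelTier("frontier", 3, 0.10),
]


def cheapest_tier(required_capability: int, tiers: List[ModelTier] = TIERS) -> ModelTier:
    """Return the lowest-cost tier whose capability meets the sub-task's requirement."""
    eligible = [t for t in tiers if t.capability >= required_capability]
    if not eligible:
        raise ValueError(f"no tier meets capability {required_capability}")
    return min(eligible, key=lambda t: t.cost_per_call)


def plan_cost(subtasks: Dict[str, int]) -> float:
    """Total cost when each sub-task is routed to its cheapest adequate tier."""
    return sum(cheapest_tier(level).cost_per_call for level in subtasks.values())


# Assumed required capability per sub-task; knowing these numbers is the hard part.
SUBTASKS = {"lead_enrichment": 1, "message_writing": 2, "deep_account_research": 3}

if __name__ == "__main__":
    routed = plan_cost(SUBTASKS)
    all_frontier = len(SUBTASKS) * TIERS[-1].cost_per_call
    print(f"routed: ${routed:.3f} vs all-frontier: ${all_frontier:.3f}")
```

The routing itself is trivial; the defensible part, per the paragraph above, is the `SUBTASKS` table, which encodes what each step of a specific workflow actually needs.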

Governance: There is considerable value in becoming the control plane for how customers run AI in a given vertical: the place where permissions, auditing, what-the-agent-is-allowed-to-do, and what-the-agent-actually-did all converge. That control plane is built out of use-case-specific guardrails that look completely different across industries and job types. Because Rest of Oz companies own the tools, the workflows, and the data the agent touches end-to-end, they can provide deterministic outcomes in ways horizontal tools will struggle to. They are also the entity that absorbs the regulatory complexity for the end buyer: FRCP and bar rules in legal, HIPAA in healthcare, SEC and FINRA in finance, state insurance regulations, and so on. A horizontal player can't credibly do that without becoming a hundred different verticals at once. CIOs want a partner that contractually commits to handling compliance for the agents it provides.

All of these come back to the same thing: focus. That could be a vertical (insurance, legal, accounting) or a function done deeply (sales, customer support, finance). Either way, the work needs a team that's heads-down on one customer set — its workflows, its edge cases, its regulations. The labs aren't built for that. They have to be everywhere, for everyone, which is how they built the Yellow Brick Road in the first place. The same trade-off keeps them out of the rest of Oz — you can be everywhere at once, or you can be great at one thing. Not both.

Sales as an example – practical tips from 11x’s technical CEO

How should you think about this in practice? Here are some practical tips from Prabhav Jain, the CEO of 11x.

Focus on outcomes

A tactical path to building a company that is resilient to the labs is to start from a specific outcome that your customers really care about. For us, that was helping companies generate more pipeline. From there the questions get tactical. Which activities do we want to own end-to-end that actually drive pipeline? Decompose each activity into tasks. Which tasks are agentic and which aren't? Which require intricate domain insight and which don't? The labs will ship workflows too, but when the workflow has many steps, messy inputs, hard-to-interpret state, or real-world constraints, a better model alone won't get you there. The work falls to good old-fashioned software engineering, and the labs hold no edge over a focused application company on that surface. For example, here are some of the tasks that we handle, some agentic and some not: lead prospecting based on custom signals, lead enrichment, deep account research, context fetcher from CRM, channel-specific message writer, lead qualification agent, and email deliverability system. These aren’t tasks you can just one-shot; they require deep engineering.

The critical insight in the Oz analogy is that the roughly half of any real workflow that is non-agentic carries no lab advantage. The labs are no better than you are at writing the deterministic software underneath the model layer. And the half that is agentic still requires you to tune, train, and constrain the models against the result you actually want. Domain knowledge often doesn't sit in general training data. Those skills get built from the ground up for the vertical or function, and fed into the model at the right moment in the workflow. When our agents are qualifying an inbound lead on the phone, they have to be trained on what a good sales conversation looks like for that specific industry and persona. That is application company work, and it compounds.

More importantly, those skills become outdated all the time because businesses evolve, so your ability to evolve those workflows and context becomes a competitive advantage. As an example, when we started our scaled email outreach product, “AI”-written emails were just starting to come into play. Fast forward to today: folks have a tuned sense of which emails are AI-written versus human-written and, crucially, that sense changes every few months. Our agents have to constantly adapt to the market dynamic, but this is where the moat is built. In fact, despite this dynamic, our positive reply rates have gone up 4x in the last few months and we’ve generated hundreds of millions in pipeline for our customers.

Work on problems where complexity is high

Complex problems are where real business value gets unlocked. Otherwise you’ll find yourself building a thin wrapper.

Decompose any sufficiently complex business problem and messiness shows up quickly. Here’s an example from the GTM world that sounds trivial: you shouldn’t reach out to a contact at a company if that company is already a customer. It’s anything but. Maybe you have the domain associated with the company in your CRM. What about companies with dozens of subsidiaries? What if the CRM record has the parent’s domain? What if a stale matching field in Salesforce sends a cold pitch to a current customer's CRO? Real-world data is messy. Humans struggle with it. Models don't magically clear that bar. Driving order out of that mess requires purpose-built agents engineered for the specific shape of the problem, not a general-purpose copilot pointed at a CRM. In fact, based on the data that we have, we have found that the quality and freshness of our own data are much higher than those of our customers' data, so by default we anchor on our own.
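To make the "is this company already a customer?" problem concrete, here is a minimal sketch of a suppression check that handles email casing, `www.` prefixes, and subsidiary-to-parent mapping. Everything here is an assumption for illustration: the `.example` domains, the `PARENT_OF` table, and the helper names are hypothetical, and a real system would need a far richer entity-resolution layer.

```python
from typing import Dict, Set


def normalize_domain(value: str) -> str:
    """Lower-case, strip whitespace, and reduce an email or URL-ish host to a bare domain."""
    value = value.strip().lower()
    if "@" in value:
        value = value.rsplit("@", 1)[1]
    if value.startswith("www."):
        value = value[4:]
    return value


def root_company(domain: str, parent_of: Dict[str, str]) -> str:
    """Follow subsidiary -> parent links until reaching the top-level domain."""
    current = normalize_domain(domain)
    seen: Set[str] = set()
    while current in parent_of and current not in seen:  # 'seen' guards against cycles
        seen.add(current)
        current = parent_of[current]
    return current


def should_suppress(contact_email: str, customer_domains: Set[str],
                    parent_of: Dict[str, str]) -> bool:
    """True if the contact's company, or its corporate parent, is already a customer."""
    customer_roots = {root_company(d, parent_of) for d in customer_domains}
    return root_company(contact_email, parent_of) in customer_roots


# Hypothetical ownership data: the subsidiary's domain maps to the parent's domain.
PARENT_OF = {"acme-labs.example": "acme.example"}
CUSTOMERS = {"www.acme.example"}  # the CRM stores only the parent's domain

if __name__ == "__main__":
    print(should_suppress("cro@Acme-Labs.example", CUSTOMERS, PARENT_OF))
```

Even this toy version shows why a naive string match on the CRM domain field fails: the contact's domain and the customer record never match literally.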

Guardrails aren’t just to prevent bad stuff from happening. That’s what your customers are paying you for.

Guardrails are severely underestimated. Even inside the same product, every use case needs its own. For us, a regulated financial services prospect demands different guarantees than a mid-market SaaS customer, and those guarantees roll down into how the agent is allowed to write, who it can contact, what data it can touch, what it can say on a call and how every decision gets logged.
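One way to picture per-customer guardrails with logged decisions is a policy object checked before every outbound action. This is a sketch under assumptions, not 11x's implementation: `GuardrailPolicy`, `AuditLog`, `check_outreach`, and the sample policies are hypothetical names and values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class GuardrailPolicy:
    customer: str
    allowed_channels: Set[str]   # e.g. {"email", "voice"}
    do_not_contact: Set[str]     # lower-cased email addresses
    blocked_phrases: Set[str]    # lower-cased phrases the agent may not use


@dataclass
class AuditLog:
    entries: List[Dict] = field(default_factory=list)


def check_outreach(policy: GuardrailPolicy, log: AuditLog, channel: str,
                   recipient: str, message: str) -> bool:
    """Apply the customer's policy to one proposed action and log the decision."""
    reasons: List[str] = []
    if channel not in policy.allowed_channels:
        reasons.append(f"channel '{channel}' not allowed")
    if recipient.lower() in policy.do_not_contact:
        reasons.append("recipient is on the do-not-contact list")
    lowered = message.lower()
    for phrase in sorted(policy.blocked_phrases):
        if phrase in lowered:
            reasons.append(f"blocked phrase: '{phrase}'")
    allowed = not reasons
    # Every decision is recorded, allowed or not, so it can be audited later.
    log.entries.append({"customer": policy.customer, "channel": channel,
                        "recipient": recipient, "allowed": allowed, "reasons": reasons})
    return allowed


# Two made-up policies: a regulated financial services customer and a mid-market SaaS one.
FINSERV = GuardrailPolicy("finserv-co", {"email"}, {"cfo@bank.example"}, {"guaranteed returns"})
SAAS = GuardrailPolicy("saas-co", {"email", "voice"}, set(), set())

if __name__ == "__main__":
    log = AuditLog()
    print(check_outreach(FINSERV, log, "email", "vp@bank.example", "Guaranteed returns await!"))
    print(check_outreach(SAAS, log, "voice", "vp@startup.example", "Quick intro call?"))
    print(log.entries)
```

The same agent code runs for both customers; only the policy differs, which is the sense in which guardrails are "configured per customer" and every decision gets logged.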

A one-size-fits-all system collapses under that variance. Guardrails have to be built per use case, configured per customer, and audited continuously, and that work sits squarely with the application company. This is why we have FDEs and technical deployment strategists who tune for each customer’s requirements. As an example, we worked with an F1000 institution to do consented outbound via voice to their large SMB customer base. The first few iterations had low pickup rates, so we had to iterate quickly and learn how to get this specific type of audience to engage in the first 10 seconds of the call. SMB business owners behave very differently from larger B2B buyers or consumers. We now generate more sales opportunities for them in a day than their entire sales team for that segment generates in a month.

Insurance as an example – practical tips from FurtherAI’s CEO

Sales is one example. Insurance is another, and it makes the same point from a different angle. Here’s how Aman Gour, CEO of FurtherAI, thinks about building off the road:

When we started deploying AI inside real insurance operations, we kept hearing a particular assumption: the model is the intelligence, and the workflow is just scaffolding around it.

The more carriers we worked with, the more convinced we became that this is backwards.

In insurance, a lot of the intelligence lives inside the workflow itself. Two carriers can run a submission through what looks like the same path: submission, review, quote, bind. But the path is the easy part. What separates the two carriers is everything inside it: which risks get escalated, which loss signals matter, which appetite rule wins when two of them conflict, when a human has to sign off, which external data gets pulled in, and how the final decision gets documented.

That logic does not live in one clean rules engine. It is spread across SOPs, manager reviews, underwriting philosophy, carrier-specific appetite, and years of operational experience. A lot of it is not written down in a form a model can simply read.

This is why we do not believe in a pure agent that reasons from scratch every time, and we do not believe in a rigid workflow that breaks the moment reality gets messy. Instead, we have been building agentic workflows. The workflow gives you repeatability, auditability, and cost control. The agent handles variability and recovers when the happy path breaks. The human stays in the loop for the judgment calls where accountability matters.
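The division of labor between workflow, agent, and human can be sketched as a three-stage fallback that records every human decision for later learning. This is an illustrative sketch only, not FurtherAI's system: `run_submission`, `OperatingMemory`, the confidence threshold, and the callable signatures are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Decision:
    outcome: str
    decided_by: str  # "workflow", "agent", or "human"


@dataclass
class OperatingMemory:
    """Hypothetical store of escalations; in production this would feed evals and runbooks."""
    events: List[Dict] = field(default_factory=list)


def run_submission(
    submission: Dict,
    rules: Callable[[Dict], Optional[str]],      # deterministic path; None means "no rule applies"
    agent: Callable[[Dict], Tuple[str, float]],  # returns (proposal, confidence in [0, 1])
    human: Callable[[Dict, str], str],           # reviewer sees the proposal, returns final call
    memory: OperatingMemory,
    confidence_floor: float = 0.8,               # assumed threshold, tuned per customer
) -> Decision:
    # 1. Repeatable, auditable, cheap: try the deterministic workflow first.
    outcome = rules(submission)
    if outcome is not None:
        return Decision(outcome, "workflow")
    # 2. The agent handles cases the happy path does not cover.
    proposal, confidence = agent(submission)
    if confidence >= confidence_floor:
        return Decision(proposal, "agent")
    # 3. Low confidence: a human makes the judgment call, and the case is recorded.
    final = human(submission, proposal)
    memory.events.append({"submission": submission, "agent_proposed": proposal,
                          "human_decided": final, "overridden": final != proposal})
    return Decision(final, "human")
```

A caller would pass in its own `rules`, `agent`, and `human` callables; the `overridden` flag in `memory.events` is the kind of signal that, accumulated over many runs, shows where the runbook was incomplete.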

On day one, this automates manual work. But over time, every escalation becomes a signal, every exception becomes feedback, and every human correction shows where the runbook was incomplete. Gradually, the workflow stops being a script and starts becoming the carrier’s operating memory. This is the part the labs will find hard to reach. They will keep shipping better models and better general agents, and they should. But they do not sit inside a carrier’s production workflows long enough to learn why one account was escalated, why one risk was declined, or why an underwriter overrode the appetite guide and was right to do so.

That understanding only comes from running the workflow, in production, many thousands of times. The workflow you ship on day one is not the moat. The loop that production usage creates over time is.

For us, that is what it means to build off the road.

How do you decide if you are in the rest of Oz or not?

The tools-and-steps test: How many steps does the work take, and how complex are the tools you have to build to support it? Compare a horizontal AI search across Google Drive — one step against one tool with a forgiving outcome, the user reads the summary and re-asks if it's wrong — to a multi-step legal redline against three years of firm precedent: dozens of steps across many tools, output that has to clear partner review and may need to be argued in court. Both look like "an agent doing work," but only one of them requires the kind of deep software a focused team takes years to build.

The system test: Are you building a system the customer runs their work through, or a tool that sits on top of a system they already have? Systems own the workflow end-to-end (the data capture, the governance, the records of what got done), and they're what the customer points to when describing how the actual work happens. Tools, on the other hand, just add intelligence to a workflow the customer already runs. The tool case generates real revenue, but the labs can take it because the customer isn't depending on you as the orchestration layer. High ACV is usually a signal of a system, since systems replace real headcount and get paid accordingly, but it isn't a guarantee. Ask yourself whether the customer would still need your product if a lab shipped something that supposedly competes with you directly. If yes, you're building a system. If no, you're a tool, even if your ACV is high.

The hedge fund / P&L test: While lab performance is judged against benchmarks, rest of Oz performance is judged against your customer's P&L. Your customer doesn't care that your model scored well on SWE-Bench or MMLU — they care whether your agent closed the deal, redlined the contract correctly, or bound the right policy. If they're fixated on their workflow-specific outcome, not on a generic capability score, you're in the rest of Oz. If they're paying for generic capability, you're selling them something they can get with a Claude or Codex seat. The best agent businesses are going to need to execute like hedge funds — winning on alpha measured in customer P&L, not in benchmark scores.

Both can (and will) win

We're going to see massive winners on and off the Yellow Brick Road. The labs will continue to win because they own the model and they own the distribution for the horizontal tools they have designed.

The rest of Oz can win if they own the system of work — the surface where the work of the company actually executes and the data that flows from it gets captured. These companies own the data capture, the workflow system of action, and the governance. As more complex workflows mature in a vertical, they compound into one core experience the customer comes to depend on. As new model generations ship from incumbents and new entrants, the company becomes the layer that integrates and delivers them to the customer. The model is fungible underneath; the system of work is not.

The next generation of enterprise software is going to be built off the road.

If you're building it, reach out: jschmidt@a16z.com.
