🧠 ATou Learn · 💬 Discussion Topic

Why Most "AI-First" Strategies Are Really Just AI Patches on Old Processes

The most valuable judgment in this piece: AI does not automatically produce an organizational leap. Only when planning, development, testing, deployment, and the feedback loop are rebuilt together can AI go from a productivity tool to a production system. But the author packages a single team's experience as a universal law, and the extrapolation clearly goes too far.

2026-04-14

Key Takeaways

  • "AI-assisted" is not "AI-first": The author's core judgment is right. Putting Copilot in the IDE and having PMs draft docs with ChatGPT yields local efficiency gains without changing the organization's throughput ceiling; genuinely AI-first means defaulting to "AI is the primary builder" and redesigning processes and role boundaries from that premise.
  • The bottleneck has moved from writing code to validation and closing the loop: The most defensible part of the piece is not "AI writes 99% of the code" but "fast generation demands fast validation". If development compresses to hours while testing, release, triage, and rollback stay at days or weeks, the organization is just producing garbage and technical debt faster.
  • Harness engineering is a harder skill than prompt engineering: The author's emphasis is not prompt tricks but making context visible to agents and tasks executable, constrainable, and reversible. That is far more mature than "vibe coding" and much closer to the engineering discipline production systems actually require.
  • Organizational roles will change, but the author's split is too aggressive: The "architect + operator" dichotomy is instructive, but claims like "PMs exit the build cycle", "one or two architects suffice", and "senior engineers adapt worst" are under-evidenced; they read more like a founder's stance than a universal conclusion.
  • This is a field report with a clear marketing agenda: CREAO is itself an agent platform, and the piece leans heavily on the "we rebuilt ourselves with agents" narrative. It is not neutral observation but a bid to own the narrative for one technical route and one product company; borrow the methodology, do not swallow the conclusions whole.

Relevance to Us

  • What it means for ATou, and what to do next: If ATou still treats AI mainly as a "faster writing assistant", its judgment is already half a step behind. The next move is to audit the slowest manual links in its own pipeline and fix testing, release, monitoring, and rollback first, rather than chasing stronger models.
  • What it means for Neta, and what to do next: Neta can treat the piece as an organizational template for "AI-native operations" without being carried away by its engineering rhetoric. Applied to content, growth, and operations, the priority is an experiment-validate-kill loop, not raw daily output volume.
  • What it means for Uota, and what to do next: For agent products and workflow design, the article's value is naming the real foundation for agents in production: observability, structured inputs, automated triage, and deterministic gates. Use that list to check whether an existing agent solution is still stuck at the demo stage.
  • What it means for investment judgment, and what to do next: The real moat for companies like this is not model access but whether generation, validation, deployment, and feedback close into one loop. When evaluating projects, ask whether they truly shorten the learning cycle or merely relocate the manual labor.

Discussion Prompts

1. In an AI-first transition, what should go first: cutting the PM/QA process, or building out test and observability infrastructure?
2. Does "AI writes the code + AI reviews the code + AI tests the code" create a same-source blind spot? Does closing the loop this way make the system safer or more fragile?
3. For heavily regulated, low-fault-tolerance industries, is this approach the direction of the future, or just a local tactic for small and mid-sized internet teams?


99% of our production code is written by AI. Last Tuesday, we shipped a new feature at 10 AM, A/B tested it by noon, and killed it by 3 PM because the data said no. We shipped a better version at 5 PM. Three months ago, a cycle like that would have taken six weeks.

We didn't get here by adding Copilot to our IDE. We dismantled our engineering process and rebuilt it around AI. We changed how we plan, build, test, deploy, and organize the team. We changed the role of everyone in the company.

CREAO is an agent platform. Twenty-five employees, 10 engineers. We started building agents in November 2025, and two months ago I restructured the entire product architecture and engineering workflow from the ground up.

OpenAI published a concept in February 2026 that captured what we'd been doing. They called it harness engineering: the primary job of an engineering team is no longer writing code. It is enabling agents to do useful work. When something fails, the fix is never "try harder." The fix is: what capability is missing, and how do we make it legible and enforceable for the agent?

We arrived at that conclusion on our own. We didn't have a name for it.

AI-First Is Not the Same as Using AI

Most companies bolt AI onto their existing process. An engineer opens Cursor. A PM drafts specs with ChatGPT. QA experiments with AI test generation. The workflow stays the same. Efficiency goes up 10 to 20 percent. Nothing structurally changes.

That is AI-assisted.

AI-first means you redesign your process, your architecture, and your organization around the assumption that AI is the primary builder. You stop asking "how can AI help our engineers?" and start asking "how do we restructure everything so AI does the building, and engineers provide direction and judgment?"

The difference is multiplicative.

I see teams claim AI-first while running the same sprint cycles, the same Jira boards, the same weekly standups, the same QA sign-offs. They added AI to the loop. They didn't redesign the loop.

A common version of this is what people call vibe coding. Open Cursor, prompt until something works, commit, repeat. That produces prototypes. A production system needs to be stable, reliable, and secure. You need a system that can guarantee those properties when AI writes the code. You build the system. The prompts are disposable.

Why We Had to Change

Last year, I watched how our team worked and saw three bottlenecks that would kill us.

The Product Management Bottleneck

Our PMs spent weeks researching, designing, specifying features. Product management has worked this way for decades. But agents can implement a feature in two hours. When build time collapses from months to hours, a weeks-long planning cycle becomes the constraint.

It doesn't make sense to think about something for months and then build it in two hours.

PMs needed to evolve into product-minded architects who work at the speed of iteration, or step out of the build cycle. Design needed to happen through rapid prototype-ship-test-iterate loops, not specification documents reviewed in committee.

The QA Bottleneck

Same dynamic. After an agent shipped a feature, our QA team spent days testing corner cases. Build time: two hours. Test time: three days.

We replaced manual QA with AI-built testing platforms that test AI-written code. Validation has to move at the same speed as implementation. Otherwise you've built a new bottleneck ten feet downstream from the old one.

The Headcount Bottleneck

Our competitors had 100x or more people doing comparable work. We have 25. We couldn't hire our way to parity. We had to redesign our way there.

Three systems needed AI running through them: how we design product, how we implement product, and how we test product. If any single one stays manual, it constrains the whole pipeline.

The Bold Decision: Unifying the Architecture

I had to fix the codebase first.

Our old architecture was scattered across multiple independent systems. A single change might require touching three or four repositories. From a human engineer's perspective, it is manageable. From an AI agent's perspective, opaque. The agent can't see the full picture. It can't reason about cross-service implications. It can't run integration tests locally.

I had to unify all the code into a single monorepo. One reason: so AI could see everything.

This is a harness engineering principle in practice. The more of your system you pull into a form the agent can inspect, validate, and modify, the more leverage you get. A fragmented codebase is invisible to agents. A unified one is legible.

I spent one week designing the new system: planning stage, implementation stage, testing stage, integration testing stage. Then another week re-architecting the entire codebase using agents.

CREAO is an agent platform. We used our own agents to rebuild the platform that runs agents. If the product can build itself, it works.

The Stack

Here is our stack and what each piece does.

Infrastructure: AWS

We run on AWS with auto-scaling container services and circuit-breaker rollback. If metrics degrade after a deployment, the system reverts on its own.

CloudWatch is the central nervous system. Structured logging across all services, over 25 alarms, custom metrics queried daily by automated workflows. Every piece of infrastructure exposes structured, queryable signals. If AI can't read the logs, it can't diagnose the problem.
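
The "structured, queryable signals" requirement is concrete: a log line must be machine-parseable before any agent can aggregate or diagnose it. A minimal sketch of JSON-structured logging in Python; the field names are illustrative, not CREAO's actual schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so queries can filter on fields, not regexes."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
            "logger": record.name,
        }
        return json.dumps(payload)

def make_logger(name, service):
    """Build a logger whose every record carries the emitting service's name."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    # Stamp the service name onto each record via a filter callable.
    logger.addFilter(lambda record: setattr(record, "service", service) or True)
    return logger
```

Once every service logs in this shape, a CloudWatch Logs Insights query can group by `service` and `level` directly instead of pattern-matching free text.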

CI/CD: GitHub Actions

Every code change passes through a six-phase pipeline:

Verify CI → Build and Deploy Dev → Test Dev → Deploy Prod → Test Prod → Release

The CI gate on every pull request enforces typechecking, linting, unit and integration tests, Docker builds, end-to-end tests via Playwright, and environment parity checks. No phase is optional. No manual overrides. The pipeline is deterministic, so agents can predict outcomes and reason about failures.
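
The gate's value is its determinism: phases run in a fixed order, nothing is skippable, and the first failure names itself. A toy fail-fast sketch of that contract; the commands are placeholders, not the real pipeline steps:

```python
import subprocess

# Placeholder commands. The real gate runs typechecking, linting, unit and
# integration tests, Docker builds, Playwright e2e, and parity checks.
PHASES = [
    ("typecheck", ["true"]),
    ("lint", ["true"]),
    ("unit-tests", ["true"]),
]

def run_gate(phases):
    """Run phases in order; stop at the first failure and report which phase."""
    for name, cmd in phases:
        if subprocess.run(cmd).returncode != 0:
            return False, name
    return True, None
```

Because the outcome is a pure function of the phase results, an agent can predict what the gate will say and reason backward from the first failing phase.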

AI Code Review: Claude

Every pull request triggers three parallel AI review passes using Claude Opus 4.6:

Pass 1: Code quality. Logic errors, performance issues, maintainability.

Pass 2: Security. Vulnerability scanning, authentication boundary checks, injection risks.

Pass 3: Dependency scan. Supply chain risks, version conflicts, license issues.

These are review gates, not suggestions. They run alongside human review, catching what humans miss at volume. When you deploy eight times a day, no human reviewer can sustain attention across every PR.
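
Structurally, the three passes are independent gates fanned out in parallel, and the merge is blocked unless every pass approves. A sketch of that shape with the model call injected as a stub; in the real system each pass would be its own Claude request:

```python
from concurrent.futures import ThreadPoolExecutor

REVIEW_PASSES = ("code_quality", "security", "dependency_scan")

def review_gate(diff, review_fn):
    """Run all passes concurrently; block the merge if any pass fails.

    review_fn(pass_name, diff) -> (ok, findings) -- stubbed here; one model
    call per pass in a real deployment.
    """
    with ThreadPoolExecutor(max_workers=len(REVIEW_PASSES)) as pool:
        results = dict(zip(
            REVIEW_PASSES,
            pool.map(lambda p: review_fn(p, diff), REVIEW_PASSES),
        ))
    blocked_by = [p for p, (ok, _) in results.items() if not ok]
    return len(blocked_by) == 0, blocked_by, results
```

The gate semantics (all must pass) are what distinguish this from advisory review comments: a single failing pass stops the merge.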

Engineers also tag @claude in any GitHub issue or PR for implementation plans, debugging sessions, or code analysis. The agent sees the whole monorepo. Context carries across conversations.

The Self-Healing Feedback Loop

This is the centerpiece.

Every morning at 9:00 AM UTC, an automated health workflow runs. Claude Sonnet 4.6 queries CloudWatch, analyzes error patterns across all services, and generates an executive health summary delivered to the team via Microsoft Teams. Nobody has to ask for it.
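
Once each dependency is injectable, the scheduled run is just glue: pull signals, flag what's over budget, summarize, deliver. A minimal sketch; the real pieces are CloudWatch queries, a Claude Sonnet call, and a Teams webhook, and `error_budget` is an invented illustrative threshold:

```python
def daily_health_check(fetch_error_counts, summarize, post, error_budget=50):
    """One scheduled run of the morning health workflow.

    fetch_error_counts() -> {service: errors_in_window}  # e.g. CloudWatch queries
    summarize(report) -> str                             # e.g. a model call
    post(text) -> None                                   # e.g. a Teams webhook
    """
    counts = fetch_error_counts()
    over_budget = {svc: n for svc, n in counts.items() if n > error_budget}
    summary = summarize({"counts": counts, "over_budget": over_budget})
    post(summary)
    return over_budget
```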

One hour later, the triage engine runs. It clusters production errors from CloudWatch and Sentry, scores each cluster across nine severity dimensions, and auto-generates investigation tickets in Linear. Each ticket includes sample logs, affected users, affected endpoints, and suggested investigation paths.

The system deduplicates. If an open issue covers the same error pattern, it updates that issue. If a previously closed issue recurs, it detects the regression and reopens.
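
Deduplication hinges on a stable fingerprint per error cluster, checked against open and previously closed issues. A simplified sketch; fingerprinting by error type plus innermost stack frame is an assumption, and the nine-dimension severity score is collapsed here to an event count:

```python
from collections import defaultdict

def fingerprint(error):
    # Cluster key: error type plus innermost stack frame (a simplification).
    return (error["type"], error["frames"][0])

def triage(errors, open_issues, closed_issues):
    """Cluster raw errors, then decide per cluster: create, update, or reopen."""
    clusters = defaultdict(list)
    for err in errors:
        clusters[fingerprint(err)].append(err)

    actions = []
    for key, events in clusters.items():
        if key in open_issues:
            actions.append(("update", key, len(events)))
        elif key in closed_issues:
            # A previously resolved pattern is back: a regression, so reopen.
            actions.append(("reopen", key, len(events)))
        else:
            actions.append(("create", key, len(events)))
    return sorted(actions)
```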

When an engineer pushes a fix, the same pipeline handles it. Three Claude review passes evaluate the PR. CI validates. The six-phase deploy pipeline promotes through dev and prod with testing at each stage. After deployment, the triage engine re-checks CloudWatch. If the original errors are resolved, the Linear ticket auto-closes.

Each tool handles one phase. No tool tries to do everything. The daily cycle creates a self-healing loop where errors are detected, triaged, fixed, and verified with minimal manual intervention.

I told a reporter from Business Insider: "AI will make the PR and the human just needs to review whether there's any risk."

Feature Flags and the Supporting Stack

Statsig handles feature flags. Every feature ships behind a gate. The rollout pattern: enable for the team, then gradual percentage rollout, then full release or kill. The kill switch toggles a feature off instantly, no deploy needed. If a feature degrades metrics, we pull it within hours. Bad features die the same day they ship. A/B testing runs through the same system.
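
The team-then-percentage-then-kill sequence reduces to a small decision function, provided the percentage bucket is stable per user so exposure doesn't flicker between checks. A sketch under that assumption; Statsig's real evaluation is richer and the field names here are illustrative:

```python
import hashlib

def rollout_bucket(flag, user_id):
    """Stable bucket in [0, 100): hashing keeps a user's exposure consistent."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def is_enabled(flag_state, user_id):
    if flag_state["killed"]:
        return False  # kill switch wins instantly, no redeploy needed
    if user_id in flag_state["team"]:
        return True   # stage 1: team-only dogfooding
    return rollout_bucket(flag_state["name"], user_id) < flag_state["percent"]
```

Because the bucket is derived from a hash rather than a random draw, raising `percent` only ever adds users to the exposed set; nobody flips in and out between requests.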

Graphite manages PR branching: merge queues rebase onto main, re-run CI, merge only if green. Stacked PRs allow incremental review at high throughput.

Sentry reports structured exceptions across all services, merged with CloudWatch by the triage engine for cross-tool context. Linear is the human-facing layer: auto-created tickets with severity scores, sample logs, and suggested investigation. Deduplication prevents noise. Follow-up verification auto-closes resolved issues.

How a Feature Moves from Idea to Production

New Feature Path

  1. The architect defines the task as a structured prompt with codebase context, goals, and constraints.
  2. An agent decomposes the task, plans implementation, writes code, and generates its own tests.
  3. A PR opens. Three Claude review passes evaluate it. A human reviewer checks for strategic risk, not line-by-line correctness.
  4. CI validates: typecheck, lint, unit tests, integration tests, end-to-end tests.
  5. Graphite's merge queue rebases, re-runs CI, merges if green.
  6. Six-phase deploy pipeline promotes through dev and prod with testing at each stage.
  7. Feature gate turns on for the team. Gradual percentage rollout. Metrics monitored.
  8. Kill switch available if anything degrades. Circuit-breaker auto-rollback for severe issues.

Bug Fix Path

  1. CloudWatch and Sentry detect errors.
  2. Claude triage engine scores severity, creates a Linear issue with full investigation context.
  3. An engineer investigates. AI has already done the diagnosis. The engineer validates and pushes a fix.
  4. Same review, CI, deploy, and monitoring pipeline.
  5. Triage engine re-verifies. If resolved, ticket auto-closes.

Both paths use the same pipeline. One system. One standard.

The Results

Over 14 days, we averaged three to eight production deployments per day. Under our old model, that entire two-week period would not have produced even a single production release.

Bad features get pulled the same day they ship. New features go live the same day they're conceived. A/B tests validate impact in real time.

People assume we're trading quality for speed. User engagement went up. Payment conversion went up. We produce better results than before, because the feedback loops are tighter. You learn more when you ship daily than when you ship monthly.

The New Engineering Org

Two types of engineers will exist.

The Architect

One or two people. They design the standard operating procedures that teach AI how to work. They build the testing infrastructure, the integration systems, the triage systems. They decide architecture and system boundaries. They define what "good" looks like for the agents.

This role requires deep critical thinking. You criticize AI. You don't follow it. When the agent proposes a plan, the architect finds the holes. What failure modes did it miss? What security boundaries did it cross? What technical debt is it accumulating?

I have a PhD in physics. The most useful thing my PhD taught me was how to question assumptions, stress-test arguments, and look for what's missing. The ability to criticize AI will be more valuable than the ability to produce code.

This is also the hardest role to fill.

The Operator

Everyone else. The work matters. The structure is different.

AI assigns tasks to humans. The triage system finds a bug, creates a ticket, surfaces the diagnosis, and assigns it to the right person. The person investigates, validates, and approves the fix. AI makes the PR. The human reviews whether there's risk.

The tasks are bug investigation, UI refinement, CSS improvements, PR review, verification. They require skill and attention. They don't require the architectural reasoning the old model demanded.

Who Adapts Fastest

I noticed a pattern I didn't expect. Junior engineers adapted faster than senior engineers.

Junior engineers with less traditional practice felt empowered. They had access to tools that amplified their impact. They didn't carry a decade of habits to unlearn.

Senior engineers with strong traditional practice had the hardest time. Two months of their work could be completed in one hour by AI. That is a hard thing to accept after years of building a rare skill set.

I'm not making a judgment. I'm describing what I observed. In this transition, adaptability matters more than accumulated skill.

The Human Side

Management Collapsed

Two months ago, I spent 60% of my time managing people. Aligning priorities. Running meetings. Giving feedback. Coaching engineers.

Today: below 10%.

The traditional CTO model says to empower your team to do architecture work, train them, delegate. But if the system only needs one or two architects, I need to do it myself first. I went from managing to building. I code from 9 AM to 3 AM most days. I design the SOPs and architecture of the system. I maintain the harness.

More stressful. But I'm enjoying building, not aligning.

Less Arguing, Better Relationships

My relationships with co-founders and engineers are better than before.

Before the transition, most of my interaction with the team was alignment meetings. Discussing trade-offs. Debating priorities. Disagreeing about technical decisions. Those conversations are necessary in a traditional model. They're also draining.

Now I still talk to my team. We talk about other things. Non-work topics. Casual conversations. Offsite trips. We get along better because we stopped arguing about work that can be easily done by our system.

Uncertainty Is Real

I won't pretend everyone is happy.

When I stopped talking to people every day, some team members felt uncertain. What does it mean that the CTO isn't talking to me? What is my value in this new world? Reasonable concerns.

Some people spend more time debating whether AI can do their work than doing the work. The transition period creates anxiety. I don't have a clean answer for it.

I do have a principle: we don't fire an engineer because they introduced a production bug. We improve the review process. We strengthen testing. We add guardrails. The same applies to AI. If AI makes a mistake, we build better validation, clearer constraints, stronger observability.

Beyond Engineering

I see other companies adopt AI-first engineering and leave everything else manual.

If engineering ships features in hours but marketing takes a week to announce them, marketing is the bottleneck. If the product team still runs a monthly planning cycle, planning is the bottleneck.

At CREAO, we pushed AI-native operations into every function:

  • Product release notes: AI-generated from changelogs and feature descriptions.
  • Feature intro videos: AI-generated motion graphics.
  • Daily posts on socials: AI-orchestrated and auto-published.
  • Health reports and analytics summaries: AI-generated from CloudWatch and production databases.

Engineering, product, marketing, and growth run in one AI-native workflow. If one function operates at agent speed and another at human speed, the human-speed function constrains everything.

What This Means

For Engineers

Your value is moving from code output to decision quality. The ability to write code fast is worth less every month. The ability to evaluate, criticize, and direct AI is worth more.

Product sense or taste matters. Can you look at a generated UI and know it's wrong before the user tells you? Can you look at an architecture proposal and see the failure mode the agent missed?

I tell our 19-year-old interns: train critical thinking. Learn to evaluate arguments, find gaps, question assumptions. Learn what good design looks like. Those skills compound.

For CTOs and Founders

If your PM process takes longer than your build time, start there.

Build the testing harness before you scale agents. Fast AI without fast validation is fast-moving technical debt.

Start with one architect. One person who builds the system and proves it works. Onboard others into operator roles after the system runs.

Push AI-native into every function.

Expect resistance. Some people will push back.

For the Industry

OpenAI, Anthropic, and multiple independent teams converged on the same principles: structured context, specialized agents, persistent memory, and execution loops. Harness engineering is becoming a standard.

OpenAI、Anthropic,以及多个独立团队,最后都收敛到了同一组原则。结构化上下文、专业化 agents、持久记忆、执行闭环。harness engineering 正在变成一个标准。

Model capability is the clock driving this. I attribute the entire shift at CREAO to the last two months. Opus 4.5 couldn't do what Opus 4.6 does. Next-gen models will accelerate it further.

推动这一切的时钟,是模型能力。我把 CREAO 的整次转变,都归因于最近这两个月。Opus 4.5 做不到 Opus 4.6 能做到的事。下一代模型还会把这个过程继续加速。

I believe one-person companies will become common. If one architect with agents can do the work of 100 people, many companies won't need a second employee.

我相信,一人公司会变得常见。如果一个架构师加上一群 agents,就能做完 100 个人的工作,那很多公司根本不需要第二个员工。

We're Early

我们还在很早期

Most founders and engineers I talk to still operate the traditional way. Some think about making the shift. Very few have done it.

我聊过的大多数创始人和工程师,还是按传统方式在运作。有些人在考虑转过来。真正做了的人,非常少。

A reporter friend told me she'd talked to about five people on this topic. She said we were further along than anyone: "I don't think anyone's just totally rebuilt their entire workflow the way you have."

一个做记者的朋友跟我说,她为这个话题大概采访了五个人。她说我们走得比所有人都远:"我不觉得有人像你们这样,真的把整个工作流彻底重建了一遍。"

The tools exist for any team to do this. Nothing in our stack is proprietary.

任何团队都已经有工具去这么做了。我们这套栈里没有什么是专有的。

The competitive advantage is the decision to redesign everything around these tools, and the willingness to absorb the cost. The cost is real: uncertainty among employees, the CTO working 18-hour days, senior engineers questioning their value, a two-week period where the old system is gone and the new one isn't proven.

真正的竞争优势,在于你是否决定围绕这些工具重设计一切,以及你愿不愿意吞下这件事的成本。成本是真实存在的。员工会不安。CTO 每天要工作 18 个小时。资深工程师会怀疑自己的价值。还有那两周,旧系统已经没了,新系统还没被证明。

We absorbed that cost. Two months later, the numbers speak.

这些成本我们吞下来了。两个月后,数字已经说明一切。

We build an agent platform. We built it with agents.

我们做的是 agent 平台。我们用 agents 把它做了出来。

99% of our production code is written by AI. Last Tuesday, we shipped a new feature at 10 AM, A/B tested it by noon, and killed it by 3 PM because the data said no. We shipped a better version at 5 PM. Three months ago, a cycle like that would have taken six weeks.

We didn't get here by adding Copilot to our IDE. We dismantled our engineering process and rebuilt it around AI. We changed how we plan, build, test, deploy, and organize the team. We changed the role of everyone in the company.

CREAO is an agent platform. Twenty-five employees, 10 engineers. We started building agents in November 2025, and two months ago I restructured the entire product architecture and engineering workflow from the ground up.

OpenAI published a concept in February 2026 that captured what we'd been doing. They called it harness engineering: the primary job of an engineering team is no longer writing code. It is enabling agents to do useful work. When something fails, the fix is never "try harder." The fix is: what capability is missing, and how do we make it legible and enforceable for the agent?

We arrived at that conclusion on our own. We didn't have a name for it.

AI-First Is Not the Same as Using AI

Most companies bolt AI onto their existing process. An engineer opens Cursor. A PM drafts specs with ChatGPT. QA experiments with AI test generation. The workflow stays the same. Efficiency goes up 10 to 20 percent. Nothing structurally changes.

That is AI-assisted.

AI-first means you redesign your process, your architecture, and your organization around the assumption that AI is the primary builder. You stop asking "how can AI help our engineers?" and start asking "how do we restructure everything so AI does the building, and engineers provide direction and judgment?"

The difference is multiplicative.

I see teams claim AI-first while running the same sprint cycles, the same Jira boards, the same weekly standups, the same QA sign-offs. They added AI to the loop. They didn't redesign the loop.

A common version of this is what people call vibe coding. Open Cursor, prompt until something works, commit, repeat. That produces prototypes. A production system needs to be stable, reliable, and secure. You need a system that can guarantee those properties when AI writes the code. You build the system. The prompts are disposable.

Why We Had to Change

Last year, I watched how our team worked and saw three bottlenecks that would kill us.

The Product Management Bottleneck

Our PMs spent weeks researching, designing, specifying features. Product management has worked this way for decades. But agents can implement a feature in two hours. When build time collapses from months to hours, a weeks-long planning cycle becomes the constraint.

It doesn't make sense to think about something for months and then build it in two hours.

PMs needed to evolve into product-minded architects who work at the speed of iteration, or step out of the build cycle. Design needed to happen through rapid prototype-ship-test-iterate loops, not specification documents reviewed in committee.

The QA Bottleneck

Same dynamic. After an agent shipped a feature, our QA team spent days testing corner cases. Build time: two hours. Test time: three days.

We replaced manual QA with AI-built testing platforms that test AI-written code. Validation has to move at the same speed as implementation. Otherwise you've built a new bottleneck ten feet downstream from the old one.

The Headcount Bottleneck

Our competitors had 100 times our headcount, or more, doing comparable work. We have 25. We couldn't hire our way to parity. We had to redesign our way there.

Three systems needed AI running through them: how we design product, how we implement product, and how we test product. If any single one stays manual, it constrains the whole pipeline.

The Bold Decision: Unifying the Architecture

I had to fix the codebase first.

Our old architecture was scattered across multiple independent systems. A single change might require touching three or four repositories. From a human engineer's perspective, that was manageable. From an AI agent's perspective, it was opaque. The agent couldn't see the full picture. It couldn't reason about cross-service implications. It couldn't run integration tests locally.

I had to unify all the code into a single monorepo. One reason: so AI could see everything.

This is a harness engineering principle in practice. The more of your system you pull into a form the agent can inspect, validate, and modify, the more leverage you get. A fragmented codebase is invisible to agents. A unified one is legible.

I spent one week designing the new system: planning stage, implementation stage, testing stage, integration testing stage. Then another week re-architecting the entire codebase using agents.

CREAO is an agent platform. We used our own agents to rebuild the platform that runs agents. If the product can build itself, it works.

The Stack

Here is our stack and what each piece does.

Infrastructure: AWS

We run on AWS with auto-scaling container services and circuit-breaker rollback. If metrics degrade after a deployment, the system reverts on its own.

CloudWatch is the central nervous system. Structured logging across all services, over 25 alarms, custom metrics queried daily by automated workflows. Every piece of infrastructure exposes structured, queryable signals. If AI can't read the logs, it can't diagnose the problem.
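A minimal sketch of what "structured, queryable signals" can look like in practice. The field names and the `JsonFormatter` class are illustrative assumptions, not CREAO's actual logging schema; the point is that each log line becomes a JSON object an agent (or CloudWatch Logs Insights) can filter on fields instead of grepping free text:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line, so automated
    tooling can query on fields rather than parse free text."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            # `service` and `request_id` are supplied via logging's
            # `extra=` mechanism; fall back to defaults if absent.
            "service": getattr(record, "service", "unknown"),
            "event": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "INFO", "service": "checkout", "event": "payment_failed", ...}
logger.info("payment_failed", extra={"service": "checkout", "request_id": "req-123"})
```

The same discipline applies to metrics and alarms: anything a diagnosing agent might need has to be machine-readable at the point of emission, not reconstructed later.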

CI/CD: GitHub Actions

Every code change passes through a six-phase pipeline:

Verify CI → Build and Deploy Dev → Test Dev → Deploy Prod → Test Prod → Release

The CI gate on every pull request enforces typechecking, linting, unit and integration tests, Docker builds, end-to-end tests via Playwright, and environment parity checks. No phase is optional. No manual overrides. The pipeline is deterministic, so agents can predict outcomes and reason about failures.
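The six phases above can be sketched as an ordered, fail-fast gate runner. The phase names come from the pipeline description; the runner itself is an illustrative sketch, not CREAO's actual CI configuration:

```python
from typing import Callable, List, Tuple

# Each phase is a (name, check) pair; the check returns True on success.
Phase = Tuple[str, Callable[[], bool]]

def run_pipeline(phases: List[Phase]) -> Tuple[bool, List[str]]:
    """Run phases in order with no manual overrides. Fail fast and
    return (passed, log) so an agent can see exactly which gate
    stopped a change."""
    log = []
    for name, check in phases:
        ok = check()
        log.append(f"{name}: {'pass' if ok else 'fail'}")
        if not ok:
            return False, log  # deterministic: later phases never run
    return True, log

# The six phases from the article, with stubbed-out checks.
phases: List[Phase] = [
    ("verify-ci", lambda: True),
    ("build-and-deploy-dev", lambda: True),
    ("test-dev", lambda: True),
    ("deploy-prod", lambda: True),
    ("test-prod", lambda: True),
    ("release", lambda: True),
]
```

Determinism is the property that matters: given the same change, the pipeline always stops at the same gate for the same reason, which is what lets an agent reason about a failure instead of retrying blindly.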

AI Code Review: Claude

Every pull request triggers three parallel AI review passes using Claude Opus 4.6:

Pass 1: Code quality. Logic errors, performance issues, maintainability.

Pass 2: Security. Vulnerability scanning, authentication boundary checks, injection risks.

Pass 3: Dependency scan. Supply chain risks, version conflicts, license issues.

These are review gates, not suggestions. They run alongside human review, catching what humans miss at volume. When you deploy eight times a day, no human reviewer can sustain attention across every PR.

Engineers also tag @claude in any GitHub issue or PR for implementation plans, debugging sessions, or code analysis. The agent sees the whole monorepo. Context carries across conversations.
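The three-pass review can be sketched as a parallel gate. In production each pass would send the diff plus its focus area to the reviewing model; here `review` is a stub so the orchestration pattern (parallel passes, blocking verdicts) stands on its own:

```python
from concurrent.futures import ThreadPoolExecutor

# The three passes from the article and what each looks for.
PASSES = {
    "code-quality": "Logic errors, performance, maintainability",
    "security": "Vulnerabilities, auth boundaries, injection risks",
    "dependencies": "Supply chain, version conflicts, licenses",
}

def review(pass_name: str, diff: str) -> dict:
    # Stub: a real implementation would call the model API with the
    # diff and this pass's focus, then parse its findings.
    return {"pass": pass_name, "findings": [], "blocking": False}

def review_pr(diff: str) -> bool:
    """Run all passes concurrently. These are gates, not suggestions:
    any blocking finding fails the PR."""
    with ThreadPoolExecutor(max_workers=len(PASSES)) as pool:
        results = list(pool.map(lambda p: review(p, diff), PASSES))
    return not any(r["blocking"] for r in results)
```

Running the passes in parallel keeps review latency flat as pass count grows, which matters at eight deploys a day.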

The Self-Healing Feedback Loop

This is the centerpiece.

Every morning at 9:00 AM UTC, an automated health workflow runs. Claude Sonnet 4.6 queries CloudWatch, analyzes error patterns across all services, and generates an executive health summary delivered to the team via Microsoft Teams. Nobody had to ask for it.

One hour later, the triage engine runs. It clusters production errors from CloudWatch and Sentry, scores each cluster across nine severity dimensions, and auto-generates investigation tickets in Linear. Each ticket includes sample logs, affected users, affected endpoints, and suggested investigation paths.

The system deduplicates. If an open issue covers the same error pattern, it updates that issue. If a previously closed issue recurs, it detects the regression and reopens.

When an engineer pushes a fix, the same pipeline handles it. Three Claude review passes evaluate the PR. CI validates. The six-phase deploy pipeline promotes through dev and prod with testing at each stage. After deployment, the triage engine re-checks CloudWatch. If the original errors are resolved, the Linear ticket auto-closes.

Each tool handles one phase. No tool tries to do everything. The daily cycle creates a self-healing loop where errors are detected, triaged, fixed, and verified with minimal manual intervention.
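The dedup-and-reopen behavior described above can be sketched as follows. The fingerprint function, ticket fields, and error shape are assumptions for illustration, not the actual CloudWatch/Sentry/Linear schema:

```python
import hashlib
from dataclasses import dataclass

def fingerprint(error_type: str, endpoint: str) -> str:
    """Stable cluster key: the same error pattern on the same endpoint
    maps to one ticket, however many times it fires."""
    return hashlib.sha256(f"{error_type}|{endpoint}".encode()).hexdigest()[:12]

@dataclass
class Ticket:
    key: str
    status: str = "open"   # open | closed
    occurrences: int = 1
    regression: bool = False

def triage(errors, tickets: dict) -> dict:
    """Cluster incoming errors, update open tickets instead of
    duplicating them, and reopen closed tickets when a previously
    fixed pattern recurs (a regression)."""
    for err in errors:
        key = fingerprint(err["type"], err["endpoint"])
        t = tickets.get(key)
        if t is None:
            tickets[key] = Ticket(key)              # new investigation ticket
        elif t.status == "open":
            t.occurrences += 1                      # dedupe: update, don't duplicate
        else:
            t.status, t.regression = "open", True   # recurrence -> reopen
    return tickets
```

The verification half of the loop is the same function run after a deploy: if the fingerprint stops appearing in fresh errors, the ticket is safe to auto-close.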

I told a reporter from Business Insider: "AI will make the PR and the human just needs to review whether there's any risk."

Feature Flags and the Supporting Stack

Statsig handles feature flags. Every feature ships behind a gate. The rollout pattern: enable for the team, then gradual percentage rollout, then full release or kill. The kill switch toggles a feature off instantly, no deploy needed. If a feature degrades metrics, we pull it within hours. Bad features die the same day they ship. A/B testing runs through the same system.
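The rollout pattern above can be sketched with deterministic bucketing. Statsig implements this as a managed service; the hashing logic here is a generic illustration, not Statsig's actual algorithm:

```python
import hashlib

def bucket(user_id: str, feature: str) -> int:
    """Deterministic 0-99 bucket per (user, feature), so a user's
    exposure is stable across sessions and rollout steps."""
    h = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(h, 16) % 100

def is_enabled(user_id: str, feature: str, rollout_pct: int,
               team: set, killed: bool) -> bool:
    """Gate check: kill switch wins instantly (no deploy needed),
    the team always sees the feature first, everyone else is in
    or out by percentage."""
    if killed:
        return False
    if user_id in team:
        return True
    return bucket(user_id, feature) < rollout_pct
```

Because bucketing is deterministic, raising `rollout_pct` from 10 to 50 only adds users; nobody who already had the feature loses it mid-experiment, which keeps A/B data clean.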

Graphite manages PR branching: merge queues rebase onto main, re-run CI, merge only if green. Stacked PRs allow incremental review at high throughput.

Sentry reports structured exceptions across all services, merged with CloudWatch by the triage engine for cross-tool context. Linear is the human-facing layer: auto-created tickets with severity scores, sample logs, and suggested investigation. Deduplication prevents noise. Follow-up verification auto-closes resolved issues.

How a Feature Moves from Idea to Production

New Feature Path

  1. The architect defines the task as a structured prompt with codebase context, goals, and constraints.

  2. An agent decomposes the task, plans implementation, writes code, and generates its own tests.

  3. A PR opens. Three Claude review passes evaluate it. A human reviewer checks for strategic risk, not line-by-line correctness.

  4. CI validates: typecheck, lint, unit tests, integration tests, end-to-end tests.

  5. Graphite's merge queue rebases, re-runs CI, merges if green.

  6. Six-phase deploy pipeline promotes through dev and prod with testing at each stage.

  7. Feature gate turns on for the team. Gradual percentage rollout. Metrics monitored.

  8. Kill switch available if anything degrades. Circuit-breaker auto-rollback for severe issues.

Bug Fix Path

  1. CloudWatch and Sentry detect errors.

  2. Claude triage engine scores severity, creates a Linear issue with full investigation context.

  3. An engineer investigates. AI has already done the diagnosis. The engineer validates and pushes a fix.

  4. Same review, CI, deploy, and monitoring pipeline.

  5. Triage engine re-verifies. If resolved, ticket auto-closes.

Both paths use the same pipeline. One system. One standard.

The Results

Over 14 days, we averaged three to eight production deployments per day. Under our old model, that entire two-week period would not have produced a single release to production.

Bad features get pulled the same day they ship. New features go live the same day they're conceived. A/B tests validate impact in real time.

People assume we're trading quality for speed. User engagement went up. Payment conversion went up. We produce better results than before, because the feedback loops are tighter. You learn more when you ship daily than when you ship monthly.

The New Engineering Org

Two types of engineers will exist.

The Architect

One or two people. They design the standard operating procedures that teach AI how to work. They build the testing infrastructure, the integration systems, the triage systems. They decide architecture and system boundaries. They define what "good" looks like for the agents.

This role requires deep critical thinking. You criticize AI. You don't follow it. When the agent proposes a plan, the architect finds the holes. What failure modes did it miss? What security boundaries did it cross? What technical debt is it accumulating?

I have a PhD in physics. The most useful thing my PhD taught me was how to question assumptions, stress-test arguments, and look for what's missing. The ability to criticize AI will be more valuable than the ability to produce code.

This is also the hardest role to fill.

The Operator

Everyone else. The work matters. The structure is different.

AI assigns tasks to humans. The triage system finds a bug, creates a ticket, surfaces the diagnosis, and assigns it to the right person. The person investigates, validates, and approves the fix. AI makes the PR. The human reviews whether there's risk.

The tasks are bug investigation, UI refinement, CSS improvements, PR review, verification. They require skill and attention. They don't require the architectural reasoning the old model demanded.

Who Adapts Fastest

I noticed a pattern I didn't expect. Junior engineers adapted faster than senior engineers.

Junior engineers with less traditional practice felt empowered. They had access to tools that amplified their impact. They didn't carry a decade of habits to unlearn.

Senior engineers with strong traditional practice had the hardest time. Two months of their work could be completed in one hour by AI. That is a hard thing to accept after years of building a rare skill set.

I'm not making a judgment. I'm describing what I observed. In this transition, adaptability matters more than accumulated skill.

The Human Side

Management Collapsed

Two months ago, I spent 60% of my time managing people. Aligning priorities. Running meetings. Giving feedback. Coaching engineers.

Today: below 10%.

The traditional CTO model says to empower your team to do architecture work, train them, delegate. But if the system only needs one or two architects, I need to do it myself first. I went from managing to building. I code from 9 AM to 3 AM most days. I design the SOPs and architecture of the system. I maintain the harness.

More stressful. But I'm enjoying building, not aligning.

Less Arguing, Better Relationships

My relationships with co-founders and engineers are better than before.

Before the transition, most of my interaction with the team was alignment meetings. Discussing trade-offs. Debating priorities. Disagreeing about technical decisions. Those conversations are necessary in a traditional model. They're also draining.

Now I still talk to my team. We talk about other things. Non-work topics. Casual conversations. Offsite trips. We get along better because we stopped arguing about work that can be easily done by our system.

Uncertainty Is Real

I won't pretend everyone is happy.

When I stopped talking to people every day, some team members felt uncertain. What does the CTO not talking to me mean? What is my value in this new world? Reasonable concerns.

Some people spend more time debating whether AI can do their work than doing the work. The transition period creates anxiety. I don't have a clean answer for it.

I do have a principle: we don't fire an engineer because they introduced a production bug. We improve the review process. We strengthen testing. We add guardrails. The same applies to AI. If AI makes a mistake, we build better validation, clearer constraints, stronger observability.

Beyond Engineering

I see other companies adopt AI-first engineering and leave everything else manual.

If engineering ships features in hours but marketing takes a week to announce them, marketing is the bottleneck. If the product team still runs a monthly planning cycle, planning is the bottleneck.

At CREAO, we pushed AI-native operations into every function:

  • Product release notes: AI-generated from changelogs and feature descriptions.

  • Feature intro videos: AI-generated motion graphics.

  • Daily posts on socials: AI-orchestrated and auto-published.

  • Health reports and analytics summaries: AI-generated from CloudWatch and production databases.

Engineering, product, marketing, and growth run in one AI-native workflow. If one function operates at agent speed and another at human speed, the human-speed function constrains everything.

What This Means

For Engineers

Your value is moving from code output to decision quality. The ability to write code fast is worth less every month. The ability to evaluate, criticize, and direct AI is worth more.

Product sense or taste matters. Can you look at a generated UI and know it's wrong before the user tells you? Can you look at an architecture proposal and see the failure mode the agent missed?

I tell our 19-year-old interns: train critical thinking. Learn to evaluate arguments, find gaps, question assumptions. Learn what good design looks like. Those skills compound.

For CTOs and Founders

If your PM process takes longer than your build time, start there.

Build the testing harness before you scale agents. Fast AI without fast validation is fast-moving technical debt.

Start with one architect. One person who builds the system and proves it works. Onboard others into operator roles after the system runs.

Push AI-native into every function.

Expect resistance. Some people will push back.

For the Industry

OpenAI, Anthropic, and multiple independent teams converged on the same principles: structured context, specialized agents, persistent memory, and execution loops. Harness engineering is becoming a standard.

Model capability is the clock driving this. I attribute the entire shift at CREAO to the last two months. Opus 4.5 couldn't do what Opus 4.6 does. Next-gen models will accelerate it further.

I believe one-person companies will become common. If one architect with agents can do the work of 100 people, many companies won't need a second employee.

We're Early

Most founders and engineers I talk to still operate the traditional way. Some think about making the shift. Very few have done it.

A reporter friend told me she'd talked to about five people on this topic. She said we were further along than anyone: "I don't think anyone's just totally rebuilt their entire workflow the way you have."

The tools exist for any team to do this. Nothing in our stack is proprietary.

The competitive advantage is the decision to redesign everything around these tools, and the willingness to absorb the cost. The cost is real: uncertainty among employees, the CTO working 18-hour days, senior engineers questioning their value, a two-week period where the old system is gone and the new one isn't proven.

We absorbed that cost. Two months later, the numbers speak.

We build an agent platform. We built it with agents.
