
Slow down: don't let coding agents wreck your codebase

The article's verdict is clear: pushing coding agents into production development at today's scale is usually not a productivity gain, but a way to manufacture technical debt and runaway complexity at higher speed.

2026-03-26 · Original link ↗

Key takeaways

  • Speed has outrun understanding. The author's strongest claim is that AI now generates code faster than humans can review, understand, and fix it. That is not ordinary acceleration; it shifts the accumulation of technical debt from "perceptible" to "compounding out of control".
  • Agents don't merely make mistakes; they repeat them at high frequency. Humans gradually converge through pain and feedback; agents have no such learning loop by default, so the same class of error gets replicated at industrial scale. This claim holds up, and it is more serious than "the model occasionally writes buggy code".
  • Local optima pile up into global disaster. The author notes that agents lack a global view and decide from local context, which yields reinvented wheels, proliferating pseudo-abstractions, and architectural drift. This is a realistic mechanism, not an emotional rant.
  • Code-search recall is the hidden bottleneck. The article's most serious technical claim: in a large codebase, the agent often isn't "unable to make the change" but "unable to find everything that needs changing", so it naturally misses dependencies, existing implementations, and historical constraints, which directly produces inconsistency.
  • The right usage is not full autonomy but bounded human-agent collaboration. The author argues for using agents on well-scoped, evaluable, low-risk tasks while keeping architecture, APIs, and critical paths in human hands. This conclusion is far more robust, and more actionable, than rejecting AI outright.

What this means for us

  • For ATou, and how to use it next: if ATou is coordinating product and engineering, the thing to absorb is not the posture of "slowing down" but the constraint of a reviewable-code budget: daily AI output must stay below what the team can actually review, understand, and take responsibility for. Next step: write that budget into the development process.
  • For Neta, and how to use it next: if Neta cares about systems methodology, the article offers a strong model: once output speed exceeds feedback speed, the system rots. Next step: transfer that model to content production, distribution, and sales automation, not just code.
  • For Uota, and how to use it next: if Uota is evaluating agent product opportunities, the author effectively points at the genuinely valuable direction: not "fully automated building on your behalf" but "raising recall, surfacing decisions, constraining scope, supporting human sign-off". Next step: validate those capabilities first, instead of piling onto the autonomy narrative.
  • For all three: the article is fundamentally a rejection of output worship and a reminder that judgment is the scarce resource. The most practical next step is to ask three questions of every agent task: is the scope converged, is the evaluation loop closed, and are the consequences bearable?
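The three-question gate proposed above can be sketched as a tiny pre-flight check. The field names, task text, and thresholds below are hypothetical illustrations, not anything from the article:

```python
from dataclasses import dataclass

@dataclass
class AgentTask:
    """A candidate task to hand to a coding agent (illustrative fields)."""
    description: str
    scope_converged: bool        # doable without whole-system understanding?
    eval_loop_closed: bool       # can the agent measure its own output?
    consequences_bearable: bool  # is failure non-critical (no lives or revenue at stake)?

def agent_ready(task: AgentTask) -> bool:
    """The three questions: all must be yes before delegating to an agent."""
    return task.scope_converged and task.eval_loop_closed and task.consequences_bearable

task = AgentTask("Port the log parser to the new format", True, True, True)
print(agent_ready(task))  # True -> fine to delegate, with a human as final quality gate
```

Even when the gate passes, the human remains the final reviewer; the check only filters out tasks that should never have been delegated.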

Discussion starters

1. If "generation outpacing review" is the root cause, is the problem the AI tooling itself, or team management using AI to legitimize cutting corners?
2. Is low code-search recall a temporary engineering problem for today's agents, or a structural limit that prevents them from independently owning large systems?
3. Once AI pushes the cost of feature development toward zero, should a team optimize delivery speed, or its ability to say no?

[developer • coach • speaker](https://mariozechner.at/)

Thoughts on slowing the fuck down

2026-03-25

The turtle's face is me looking at our industry


It's been about a year since coding agents appeared on the scene that could actually build you full projects. There were precursors like Aider and early Cursor, but they were more assistant than agent. The new generation is enticing, and a lot of us have spent a lot of free time building all the projects we always wanted to build but never had time to.

And I think that's fine. Spending your free time building things is super enjoyable, and most of the time you don't really have to care about code quality and maintainability. It also gives you a way to learn a new tech stack if you so want.

During the Christmas break, both Anthropic and OpenAI handed out some freebies to hook people to their addictive slot machines. For many, it was the first time they experienced the magic of agentic coding. The fold's getting bigger.

Coding agents are now also introduced to production codebases. After 12 months, we are now beginning to see the effects of all that "progress". Here's my current view.

Everything is broken

While all of this is anecdotal, it sure feels like software has become a brittle mess, with 98% uptime becoming the norm instead of the exception, including for big services. And user interfaces have the weirdest fucking bugs that you'd think a QA team would catch. I give you that that's been the case for longer than agents exist. But we seem to be accelerating.

We don't have access to the internals of companies. But every now and then something slips through to some news reporter. Like this supposed AI caused outage at AWS. Which AWS immediately "corrected". Only to then follow up internally with a 90-day reset.

Satya Nadella, the CEO of Microsoft, has been going on about how much code is now being written by AI at Microsoft. While we don't have direct evidence, there sure is a feeling that Windows is going down the shitter. Microsoft itself seems to agree, based on this fine blog post.

Companies claiming 100% of their product's code is now written by AI consistently put out the worst garbage you can imagine. Not pointing fingers, but memory leaks in the gigabytes, UI glitches, broken-ass features, crashes: that is not the seal of quality they think it is. And it's definitely not good advertising for the fever dream of having your agents do all the work for you.

Through the grapevine you hear more and more people, from software companies small and large, saying they have agentically coded themselves into a corner. No code review, design decisions delegated to the agent, a gazillion features nobody asked for. That'll do it.

How we should not work with agents and why

We have basically given up all discipline and agency for a sort of addiction, where your highest goal is to produce the largest amount of code in the shortest amount of time. Consequences be damned.

You're building an orchestration layer to command an army of autonomous agents. You installed Beads, completely oblivious to the fact that it's basically uninstallable malware. The internet told you to. That's how you should work or you're ngmi. You're ralphing the loop. Look, Anthropic built a C compiler with an agent swarm. It's kind of broken, but surely the next generation of LLMs can fix it. Oh my god, Cursor built a browser with a battalion of agents. Yes, of course, it's not really working and it needed a human to spin the wheel a little bit every now and then. But surely the next generation of LLMs will fix it. Pinky promise! Distribute, divide and conquer, autonomy, dark factories, software is solved in the next 6 months. SaaS is dead, my grandma just had her Claw build her own Shopify!

Now again, this can work for your side project barely anyone is using, including yourself. And hey, maybe there's somebody out there who can actually make this work for a software product that's not a steaming pile of garbage and is used by actual humans in anger.

If that's you, more power to you. But at least among my circle of peers I have yet to find evidence that this kind of shit works. Maybe we all have skill issues.

Compounding booboos with zero learning, no bottlenecks, and delayed pain

The problem with agents is that they make errors. Which is fine, humans also make errors. Maybe they are just correctness errors. Easy to identify and fix. Add a regression test on top for bonus points. Or maybe it's a code smell your linter doesn't catch. A useless method here, a type that doesn't make sense, duplicated code over there. On their own, these are harmless. A human will also do such booboos.

But clankers aren't humans. A human makes the same error a few times. Eventually they learn not to make it again. Either because someone starts screaming at them or because they're on a genuine learning path.

An agent has no such learning ability. At least not out of the box. It will continue making the same errors over and over again. Depending on the training data it might also come up with glorious new interpolations of different errors.

Now you can try to teach your agent. Tell it to not make that booboo again in your AGENTS.md. Concoct the most complex memory system and have it look up previous errors and best practices. And that can be effective for a specific category of errors. But it also requires you to actually observe the agent making that error.

There's a much more important difference between clanker and human. A human is a bottleneck. A human cannot shit out 20,000 lines of code in a few hours. Even if the human creates such booboos at high frequency, there's only so many booboos the human can introduce in a codebase per day. The booboos will compound at a very slow rate. Usually, if the booboo pain gets too big, the human, who hates pain, will spend some time fixing up the booboos. Or the human gets fired and someone else fixes up the booboos. So the pain goes away.

With an orchestrated army of agents, there is no bottleneck, no human pain. These tiny little harmless booboos suddenly compound at a rate that's unsustainable. You have removed yourself from the loop, so you don't even know that all the innocent booboos have formed a monster of a codebase. You only feel the pain when it's too late.

Then one day you turn around and want to add a new feature. But the architecture, which is largely booboos at this point, doesn't allow your army of agents to make the change in a functioning way. Or your users are screaming at you because something in the latest release broke and deleted some user data.

You realize you can no longer trust the codebase. Worse, you realize that the gazillions of unit, snapshot, and e2e tests you had your clankers write are equally untrustworthy. The only thing that's still a reliable measure of "does this work" is manually testing the product. Congrats, you fucked yourself (and your company).

Merchants of learned complexity

You have zero fucking idea what's going on because you delegated all your agency to your agents. You let them run free, and they are merchants of complexity. They have seen many bad architectural decisions in their training data and throughout their RL training. You have told them to architect your application. Guess what the result is?

An immense amount of complexity, an amalgam of terrible cargo cult "industry best practices", that you didn't rein in before it was too late. But it's worse than that.

Your agents never see each other's runs, never get to see all of your codebase, never get to see all the decisions that were made by you or other agents before they make a change. As such, an agent's decisions are always local, which leads to the exact booboos described above. Immense amounts of code duplication, abstractions for abstractions' sake.

All of this compounds into an unrecoverable mess of complexity. The exact same mess you find in human-made enterprise codebases. Those arrive at that state because the pain is distributed over a massive amount of people. The individual suffering doesn't pass the threshold of "I need to fix this". The individual might not even have the means to fix things. And organizations have super high pain tolerance. But human-made enterprise codebases take years to get there. The organization slowly evolves along with the complexity in a demented kind of synergy and learns how to deal with it.

With agents and a team of 2 humans, you can get to that complexity within weeks.

Agentic search has low recall

So now you hope your agents can fix the mess, refactor it, make it pristine. But your agents can also no longer deal with it. Because the codebase and complexity are too big, and they only ever have a local view of the mess.

And I'm not just talking about context window size or long context attention mechanisms failing at the sight of a 1 million lines of code monster. Those are obvious technical limitations. It's more devious than that.

Before your agent can try and help fix the mess, it needs to find all the code that needs changing and all existing code it can reuse. We call that agentic search. How the agent does that depends on the tools it has. You can give it a Bash tool so it can ripgrep its way through the codebase. You can give it some queryable codebase index, an LSP server, a vector database. In the end it doesn't matter much. The bigger the codebase, the lower the recall. Low recall means that your agent will, in fact, not find all the code it needs to do a good job.

This is also why those code smell booboos happen in the first place. The agent misses existing code, duplicates things, introduces inconsistencies. And then they blossom into a beautiful shit flower of complexity.
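The recall framing above is just the standard information-retrieval definition applied to agentic search. A minimal illustration, with hypothetical file names:

```python
def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of the truly relevant code locations the agent actually found."""
    if not relevant:
        return 1.0
    return len(retrieved & relevant) / len(relevant)

# Four places genuinely need changing; the agent's grep finds only two of them
# (plus one false positive, which hurts precision but not recall).
relevant = {"auth.py", "session.py", "middleware.py", "cli.py"}
retrieved = {"auth.py", "session.py", "utils.py"}

print(recall(retrieved, relevant))  # 0.5 -> half the required edits get silently missed
```

Precision errors waste tokens and attention; recall errors are the dangerous ones, because every missed location becomes a duplicated implementation or an inconsistency you only discover later.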

How do we avoid all of this?

How we should work with agents (for now, I think)

Coding agents are sirens, luring you in with their speed of code generation and jagged intelligence, often completing a simple task with high quality at breakneck velocity. Things start falling apart when you think: "Oh golly, this thing is great. Computer, do my work!".

There's nothing wrong with delegating tasks to agents, obviously. Good agent tasks share a few properties: they can be scoped so the agent doesn't need to understand the full system. The loop can be closed, that is, the agent has a way to evaluate its own work. The output isn't mission critical, just some ad hoc tool or internal piece of software nobody's life or revenue depends on. Or you just need a rubber duck to bounce ideas against, which basically means bouncing your idea against the compressed wisdom of the internet and synthetic training data. If any of that applies, you found the perfect task for the agent, provided that you as the human are the final quality gate.

Karpathy's auto-research applied to speeding up startup time of your app? Great! As long as you understand that the code it spits out is not production-ready at all. Auto-research works because you give it an evaluation function that lets the agent measure its work against some metric, like startup time or loss. But that evaluation function only captures a very narrow metric. The agent will happily ignore any metrics not captured by the evaluation function, such as code quality, complexity, or even correctness, especially if your evaluation function is foobar.
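The narrow-metric failure mode can be shown in miniature: an evaluation function that scores only one dimension will rank a fast-but-broken candidate above a slow-but-correct one. Everything below is a toy sketch with invented data, not Karpathy's actual setup:

```python
# Two hypothetical candidate patches an agent produced.
candidates = [
    {"name": "correct", "startup_ms": 480, "passes_tests": True},
    {"name": "cheats",  "startup_ms": 120, "passes_tests": False},  # skips required init work
]

def narrow_eval(c: dict) -> float:
    """Eval covers startup time only; correctness is invisible to the agent."""
    return -c["startup_ms"]  # higher is better

best = max(candidates, key=narrow_eval)
print(best["name"])  # 'cheats' wins: the metric never looked at passes_tests

def safer_eval(c: dict) -> float:
    """Gate on correctness first, then optimize the metric."""
    return -c["startup_ms"] if c["passes_tests"] else float("-inf")

print(max(candidates, key=safer_eval)["name"])  # 'correct'
```

Even the gated version only covers what it measures; code quality and complexity remain invisible, which is why a human still has to read the output.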

The point is: let the agent do the boring stuff, the stuff that won't teach you anything new, or try out different things you'd otherwise not have time for. Then you evaluate what it came up with, take the ideas that are actually reasonable and correct, and finalize the implementation. Yes, sure, you can also use an agent for that final step.

And I would like to suggest that slowing the fuck down is the way to go. Give yourself time to think about what you're actually building and why. Give yourself an opportunity to say, fuck no, we don't need this. Set yourself limits on how much code you let the clanker generate per day, in line with your ability to actually review the code.
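A daily generation limit could even be enforced mechanically, e.g. a check that refuses further agent output once the day's total exceeds what you can genuinely review. The budget number and the data shape below are invented for illustration:

```python
DAILY_REVIEW_BUDGET = 400  # lines/day one human can actually review (assumed figure)

def within_budget(agent_diff_lines: list[int], budget: int = DAILY_REVIEW_BUDGET) -> bool:
    """agent_diff_lines: lines added by each agent commit today.
    Stop generating once the total exceeds what a human can read and own."""
    return sum(agent_diff_lines) <= budget

print(within_budget([120, 90, 150]))       # 360 lines -> True, keep going
print(within_budget([120, 90, 150, 300]))  # 660 lines -> False, stop and review first
```

A check like this could live in a pre-commit hook or CI step; the exact threshold matters less than the existence of a hard stop tied to human review capacity.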

Anything that defines the gestalt of your system, that is architecture, API, and so on, write it by hand. Maybe use tab completion for some nostalgic feels. Or do some pair programming with your agent. Be in the code. Because the simple act of having to write the thing or seeing it being built up step by step introduces friction that allows you to better understand what you want to build and how the system "feels". This is where your experience and taste come in, something the current SOTA models simply cannot yet replace. And slowing the fuck down and suffering some friction is what allows you to learn and grow.

The end result will be systems and codebases that continue to be maintainable, at least as maintainable as our old systems before agents. Yes, those were not perfect either. Your users will thank you, as your product now sparks joy instead of slop. You'll build fewer features, but the right ones. Learning to say no is a feature in itself.

You can sleep well knowing that you still have an idea what the fuck is going on, and that you have agency. Your understanding allows you to fix the recall problem of agentic search, leading to better clanker outputs that need less massaging. And if shit hits the fan, you are able to go in and fix it. Or if your initial design has been suboptimal, you understand why it's suboptimal, and how to refactor it into something better. With or without an agent, don't fucking care.

All of this requires discipline and agency.

All of this requires humans.

This page respects your privacy by not using cookies or similar technologies and by not collecting any personally identifiable information.

developer • coach • speaker](https://mariozechner.at/)

开发者 • 教练 • 演讲者](https://mariozechner.at/)

Thoughts on slowing the fuck down

The turtle's face is me looking at our industry

关于他妈的慢下来的一些想法

2026-03-25

海龟的表情就是我看着我们行业时的样子

Table of contents

It's been about a year since coding agents appeared on the scene that could actually build you full projects. There were precursors like Aider and early Cursor, but they were more assistant than agent. The new generation is enticing, and a lot of us have spent a lot of free time building all the projects we always wanted to build but never had time to.

And I think that's fine. Spending your free time building things is super enjoyable, and most of the time you don't really have to care about code quality and maintainability. It also gives you a way to learn a new tech stack if you so want.

During the Christmas break, both Anthropic and OpenAI handed out some freebies to hook people to their addictive slot machines. For many, it was the first time they experienced the magic of agentic coding. The fold's getting bigger.

Coding agents are now also introduced to production codebases. After 12 months, we are now beginning to see the effects of all that "progress". Here's my current view.

目录

能真正帮你把完整项目搭出来的编程智能体登场,大概也就一年。之前也有 Aider、早期 Cursor 这种前辈,但更像助手,不太像智能体。新一代很诱人,很多人把大量业余时间都花在了把那些一直想做、却没时间做的项目一个个做出来。

这没什么问题。用业余时间做东西本来就很爽,而且大多数时候也不用太在意代码质量和可维护性。想顺便学一套新技术栈,也正好有个抓手。

圣诞假期里,Anthropic 和 OpenAI 都发了点免费额度,把人往他们上瘾的老虎机前面拽。很多人也是第一次体验到智能体式编程的魔法。队伍在变大。

如今,编程智能体也被塞进了生产代码库。十二个月过去,那些所谓的“进步”开始显出后果了。下面是目前的看法。

Everything is broken

While all of this is anecdotal, it sure feels like software has become a brittle mess, with 98% uptime becoming the norm instead of the exception, including for big services. And user interfaces have the weirdest fucking bugs that you'd think a QA team would catch. I give you that that's been the case for longer than agents exist. But we seem to be accelerating.

We don't have access to the internals of companies. But every now and then something slips through to some news reporter. Like this supposed AI caused outage at AWS. Which AWS immediately "corrected". Only to then follow up internally with a 90-day reset.

Satya Nadella, the CEO of Microsoft, has been going on about how much code is now being written by AI at Microsoft. While we don't have direct evidence, there sure is a feeling that Windows is going down the shitter. Microsoft itself seems to agree, based on this fine blog post.

Companies claiming 100% of their product's code is now written by AI consistently put out the worst garbage you can imagine. Not pointing fingers, but memory leaks in the gigabytes, UI glitches, broken-ass features, crashes: that is not the seal of quality they think it is. And it's definitely not good advertising for the fever dream of having your agents do all the work for you.

Through the grapevine you hear more and more people, from software companies small and large, saying they have agentically coded themselves into a corner. No code review, design decisions delegated to the agent, a gazillion features nobody asked for. That'll do it.

一切都坏了

虽然这些都是轶事,但感觉软件正在变成一团脆得要命的烂摊子。98% 的在线率成了常态而不是例外,大型服务也一样。用户界面里还冒出各种他妈离谱的 bug,按理 QA 团队早该抓住。承认一点,这种事在智能体出现之前就有了。但现在像是越跑越快。

外部看不到公司的内部情况。但隔三差五总会有点东西漏到记者那儿。比如这篇据称AI 导致 AWS 宕机。AWS 马上就去“更正”。但紧接着内部又来了一个90 天重置

微软 CEO Satya Nadella 一直在讲,如今微软有多少代码是 AI 写的。虽然没有直接证据,但确实有种感觉:Windows 正在往下水道里滑。微软自己似乎也认同这一点,从这篇不错的博客文章就能看出来。

那些宣称自家产品 100% 代码都由 AI 写出来的公司,稳定输出你能想象到的最烂垃圾。不是要点名,但动不动就是几 GB 的内存泄漏、界面抖动、烂到不行的功能、崩溃:这完全不是他们以为的质量徽章。更别提,这对让智能体替你干完所有活这场发烧梦,绝对不是什么好广告。

小道消息里,越来越多人在说,不管公司大还是小,他们已经用智能体把自己写进了死胡同:没有代码审查,把设计决策交给智能体,堆出一大堆没人要的功能。这样当然会翻车。

How we should not work with agents and why

We have basically given up all discipline and agency for a sort of addiction, where your highest goal is to produce the largest amount of code in the shortest amount of time. Consequences be damned.

You're building an orchestration layer to command an army of autonomous agents. You installed Beads, completely oblivious to the fact that it's basically uninstallable malware. The internet told you to. That's how you should work or you're ngmi. You're ralphing the loop. Look, Anthropic built a C compiler with an agent swarm. It's kind of broken, but surely the next generation of LLMs can fix it. Oh my god, Cursor built a browser with a battalion of agents. Yes, of course, it's not really working and it needed a human to spin the wheel a little bit every now and then. But surely the next generation of LLMs will fix it. Pinky promise! Distribute, divide and conquer, autonomy, dark factories, software is solved in the next 6 months. SaaS is dead, my grandma just had her Claw build her own Shopify!

Now again, this can work for your side project barely anyone is using, including yourself. And hey, maybe there's somebody out there who can actually make this work for a software product that's not a steaming pile of garbage and is used by actual humans in anger.

If that's you, more power to you. But at least among my circle of peers I have yet to find evidence that this kind of shit works. Maybe we all have skill issues.

Compounding booboos with zero learning, no bottlenecks, and delayed pain

The problem with agents is that they make errors. Which is fine, humans also make errors. Maybe they are just correctness errors. Easy to identify and fix. Add a regression test on top for bonus points. Or maybe it's a code smell your linter doesn't catch. A useless method here, a type that doesn't make sense, duplicated code over there. On their own, these are harmless. A human will also do such booboos.

But clankers aren't humans. A human makes the same error a few times. Eventually they learn not to make it again. Either because someone starts screaming at them or because they're on a genuine learning path.

An agent has no such learning ability. At least not out of the box. It will continue making the same errors over and over again. Depending on the training data it might also come up with glorious new interpolations of different errors.

Now you can try to teach your agent. Tell it to not make that booboo again in your AGENTS.md. Concoct the most complex memory system and have it look up previous errors and best practices. And that can be effective for a specific category of errors. But it also requires you to actually observe the agent making that error.

There's a much more important difference between clanker and human. A human is a bottleneck. A human cannot shit out 20,000 lines of code in a few hours. Even if the human creates such booboos at high frequency, there's only so many booboos the human can introduce in a codebase per day. The booboos will compound at a very slow rate. Usually, if the booboo pain gets too big, the human, who hates pain, will spend some time fixing up the booboos. Or the human gets fired and someone else fixes up the booboos. So the pain goes away.

With an orchestrated army of agents, there is no bottleneck, no human pain. These tiny little harmless booboos suddenly compound at a rate that's unsustainable. You have removed yourself from the loop, so you don't even know that all the innocent booboos have formed a monster of a codebase. You only feel the pain when it's too late.

Then one day you turn around and want to add a new feature. But the architecture, which is largely booboos at this point, doesn't allow your army of agents to make the change in a functioning way. Or your users are screaming at you because something in the latest release broke and deleted some user data.

You realize you can no longer trust the codebase. Worse, you realize that the gazillions of unit, snapshot, and e2e tests you had your clankers write are equally untrustworthy. The only thing that's still a reliable measure of "does this work" is manually testing the product. Congrats, you fucked yourself (and your company).

Merchants of learned complexity

You have zero fucking idea what's going on because you delegated all your agency to your agents. You let them run free, and they are merchants of complexity. They have seen many bad architectural decisions in their training data and throughout their RL training. You have told them to architect your application. Guess what the result is?

An immense amount of complexity, an amalgam of terrible cargo cult "industry best practices", that you didn't rein in before it was too late. But it's worse than that.

Your agents never see each other's runs, never get to see all of your codebase, never get to see all the decisions that were made by you or other agents before they make a change. As such, an agent's decisions are always local, which leads to the exact booboos described above. Immense amounts of code duplication, abstractions for abstractions' sake.

All of this compounds into an unrecoverable mess of complexity. The exact same mess you find in human-made enterprise codebases. Those arrive at that state because the pain is distributed over a massive amount of people. The individual suffering doesn't pass the threshold of "I need to fix this". The individual might not even have the means to fix things. And organizations have super high pain tolerance. But human-made enterprise codebases take years to get there. The organization slowly evolves along with the complexity in a demented kind of synergy and learns how to deal with it.

With agents and a team of 2 humans, you can get to that complexity within weeks.

Agentic search has low recall

So now you hope your agents can fix the mess, refactor it, make it pristine. But your agents can also no longer deal with it. Because the codebase and complexity are too big, and they only ever have a local view of the mess.

And I'm not just talking about context window size or long context attention mechanisms failing at the sight of a 1 million lines of code monster. Those are obvious technical limitations. It's more devious than that.

Before your agent can try and help fix the mess, it needs to find all the code that needs changing and all existing code it can reuse. We call that agentic search. How the agent does that depends on the tools it has. You can give it a Bash tool so it can ripgrep its way through the codebase. You can give it some queryable codebase index, an LSP server, a vector database. In the end it doesn't matter much. The bigger the codebase, the lower the recall. Low recall means that your agent will, in fact, not find all the code it needs to do a good job.

This is also why those code smell booboos happen in the first place. The agent misses existing code, duplicates things, introduces inconsistencies. And then they blossom into a beautiful shit flower of complexity.

How do we avoid all of this?

我们不该怎样用智能体,以及为什么

我们几乎把自律和主导权全丢了,换来一种上瘾感:最高目标变成了用最短时间产出最多代码。后果随他去。

你在搭一层编排系统,指挥一支自主智能体大军。你装了 Beads,完全没意识到那玩意儿基本就是卸不干净的恶意软件。网上说要装。你就得这么干,不然你就 ngmi。你在 loop 里吐得昏天黑地。看吧,Anthropic 用一群智能体做了个 C 编译器。它有点坏,但下一代 LLM 肯定能修好。天哪,Cursor 也用一营智能体做了个浏览器。对,当然,它其实不太能用,而且隔三差五还得人类去转一下方向盘。但下一代 LLM 肯定能修好。拉钩保证!分布式、分而治之、自主、黑灯工厂,六个月内软件问题就解决了。SaaS 已死,我奶奶刚让她的 Claw 给自己搭了个 Shopify!

当然,这套东西也许能用在你那个几乎没人用的业余项目上,甚至你自己都不太用。也许真有人能把它用在一个不是一坨冒着热气的垃圾、而且会被真实用户带着火气也要用的产品上。

如果你就是那个人,那你确实强。但至少在身边这圈同行里,还没看到这种鬼东西真的能跑通的证据。也许我们都太菜了。

零学习、无瓶颈、延迟痛感:小乌龙如何连环叠加

智能体的问题在于,它们会犯错。这没什么,人也会犯错。也许只是正确性错误,很容易发现、修掉,再补个回归测试就更好了。也可能是你 linter 抓不住的代码异味:这儿多一个没用的方法,那儿来个说不通的类型,另一边又复制了一份代码。单看这些,都不算大事。人也会犯这种小乌龙。

但这些铁皮机器人不是人。人会把同一个错误犯几次,最后总会学会别再犯。要么有人开始冲他吼,要么他真在学习。

智能体没有这种学习能力,至少不是开箱即用就有。它会一遍又一遍犯同样的错。看训练数据长什么样,它还可能把不同错误拼出一些看似很精彩的新变体。

当然也能试着教它:在 AGENTS.md 里告诉它别再犯那种乌龙;再搞一套复杂的记忆系统,让它去查历史错误和最佳实践。这对某一类错误确实可能有效。但前提是,得先亲眼看到智能体犯了那个错。

铁皮和人之间还有个更重要的区别:人是瓶颈。人不可能在几小时里喷出两万行代码。就算人犯小乌龙的频率很高,一天之内也只能往代码库里塞进有限数量的乌龙。乌龙的叠加速度很慢。一般来说,如果乌龙带来的痛苦太大,讨厌痛苦的人就会花点时间把乌龙修掉。或者这个人被炒了,换别人来把乌龙修掉。总之,痛苦会消下去。

但当你编排着一支智能体大军时,就没有瓶颈,也没有人类的痛感。那些看起来无害的小乌龙,会突然以一种不可持续的速度叠加。你把自己从回路里拿掉了,甚至不知道这些无辜的乌龙已经堆出一只怪兽般的代码库。等痛感传来时,往往已经太晚。

然后有一天,你回过头想加个新功能。但此时的架构基本就是乌龙堆出来的,它不允许你的智能体大军以一个能跑的方式把改动做完。或者用户冲你吼,因为最新版本里某个东西坏了,还删掉了部分用户数据。

这时才意识到,代码库已经不可信了。更糟的是,那些让铁皮写出来的成千上万的单元测试、快照测试、端到端测试,也一样不可信。唯一还算可靠的 "这能用吗" 指标,只剩手工测试产品。恭喜,你把自己(也把公司)坑惨了。

学来的复杂度贩子

你完全不知道发生了什么,因为主导权都交给了智能体。你让它们撒开跑,而它们是复杂度贩子。它们在训练数据里、在 RL 训练里,见过太多糟糕的架构决策。你又让它们来给你的应用做架构。猜猜结果是什么?

结果就是一大坨复杂度:一锅糟糕的、货物崇拜式的所谓“行业最佳实践”乱炖,而你在来不及之前没有把它收住。但更糟的还在后面。

你的智能体彼此看不到对方的运行过程,看不到你的整个代码库,也看不到在它改动之前你或其他智能体已经做出的所有决策。所以智能体的决策永远是局部的,于是就会出现前面说的那些乌龙:海量重复代码,为抽象而抽象。

这些东西叠加起来,会变成一团几乎救不回来的复杂度烂摊子。你在人工打造的企业代码库里也能见到同款烂摊子。它们之所以会走到那一步,是因为痛苦被分摊到一大群人身上,单个人的痛感达不到“我得把这玩意儿修了”的阈值。这个人甚至可能根本没有条件去修。组织的痛阈又特别高。但人工的企业代码库要花很多年才会变成那样。组织会在一种扭曲的共生里,随着复杂度慢慢演化,然后学会怎么和它相处。

而有了智能体,再加上两个人的小团队,你几周就能把复杂度堆到同一个水平。

智能体搜索的召回率很低

这时候你指望智能体来收拾烂摊子:重构、清理、把它弄得干干净净。但智能体也处理不了了,因为代码库和复杂度太大,而它们永远只能看到这堆烂摊子的局部。

这还不只是上下文窗口大小,或者长上下文注意力机制一看到一百万行代码怪兽就当场失灵这种显眼的技术限制。问题更阴险。

在智能体帮你修之前,它得先找到所有需要改的代码,以及所有可以复用的现有代码。这叫智能体搜索。它怎么做,取决于你给它什么工具。可以给它一个 Bash 工具,让它用 ripgrep 在代码库里搜来搜去。也可以给它一个可查询的代码库索引、一个 LSP 服务器、一个向量数据库。到头来差别不大。代码库越大,召回率越低。召回率低,就意味着智能体确实找不全它要找的代码,也就做不好。

这也是为什么那些代码异味式的乌龙一开始就会出现:智能体漏掉了已有代码,于是重复造轮子,引入不一致。然后它们就会开成一朵漂亮的复杂度狗屎花。

那怎么避免这一切?

How we should work with agents (for now, I think)

Coding agents are sirens, luring you in with their speed of code generation and jagged intelligence, often completing a simple task with high quality at breakneck velocity. Things start falling apart when you think: "Oh golly, this thing is great. Computer, do my work!".

There's nothing wrong with delegating tasks to agents, obviously. Good agent tasks share a few properties: they can be scoped so the agent doesn't need to understand the full system. The loop can be closed, that is, the agent has a way to evaluate its own work. The output isn't mission critical, just some ad hoc tool or internal piece of software nobody's life or revenue depends on. Or you just need a rubber duck to bounce ideas against, which basically means bouncing your idea against the compressed wisdom of the internet and synthetic training data. If any of that applies, you found the perfect task for the agent, provided that you as the human are the final quality gate.

Karpathy's auto-research applied to speeding up startup time of your app? Great! As long as you understand that the code it spits out is not production-ready at all. Auto-research works because you give it an evaluation function that lets the agent measure its work against some metric, like startup time or loss. But that evaluation function only captures a very narrow metric. The agent will happily ignore any metrics not captured by the evaluation function, such as code quality, complexity, or even correctness, if your evaluation function is foobar.

The point is: let the agent do the boring stuff, the stuff that won't teach you anything new, or try out different things you'd otherwise not have time for. Then you evaluate what it came up with, take the ideas that are actually reasonable and correct, and finalize the implementation. Yes, sure, you can also use an agent for that final step.

And I would like to suggest that slowing the fuck down is the way to go. Give yourself time to think about what you're actually building and why. Give yourself an opportunity to say, fuck no, we don't need this. Set yourself limits on how much code you let the clanker generate per day, in line with your ability to actually review the code.

Anything that defines the gestalt of your system, that is architecture, API, and so on, write it by hand. Maybe use tab completion for some nostalgic feels. Or do some pair programming with your agent. Be in the code. Because the simple act of having to write the thing or seeing it being built up step by step introduces friction that allows you to better understand what you want to build and how the system "feels". This is where your experience and taste come in, something the current SOTA models simply cannot yet replace. And slowing the fuck down and suffering some friction is what allows you to learn and grow.

The end result will be systems and codebases that continue to be maintainable, at least as maintainable as our old systems before agents. Yes, those were not perfect either. Your users will thank you, as your product now sparks joy instead of slop. You'll build fewer features, but the right ones. Learning to say no is a feature in itself.

You can sleep well knowing that you still have an idea what the fuck is going on, and that you have agency. Your understanding allows you to fix the recall problem of agentic search, leading to better clanker outputs that need less massaging. And if shit hits the fan, you are able to go in and fix it. Or if your initial design has been suboptimal, you understand why it's suboptimal, and how to refactor it into something better. With or without an agent, don't fucking care.

All of this requires discipline and agency.

All of this requires humans.

This page respects your privacy by not using cookies or similar technologies and by not collecting any personally identifiable information.

我们应该怎样用智能体(至少目前,我是这么想的)

编程智能体像海妖,用它们写代码的速度和时灵时不灵的聪明把人勾进去,常常能以惊人的速度把一个简单任务做得很像样。真正开始崩的时候,是你心里冒出这句:"天哪,这玩意儿太棒了。电脑,替我干活!"。

把任务交给智能体当然没错。适合智能体的任务通常有几条特点:范围能收敛到不需要它理解整个系统;循环能闭合,也就是智能体有办法评估自己的产出;输出不是关键任务,只是某个临时工具或内部软件,没人命和收入押在上面。又或者你只是需要一只橡皮鸭来对着聊一聊,本质上就是把想法拿去碰一碰互联网的压缩智慧和合成训练数据。只要符合其中任何一条,这就是智能体的好活,前提是人类得当最后的质量关。

Karpathy 的 auto-research 用来加速你应用的启动时间?很棒!前提是你得明白,它吐出来的代码完全不适合直接上生产。auto-research 之所以能跑,是因为你给了它一个评估函数,让智能体能把自己的产出对齐到某个指标,比如启动时间或 loss。但这个评估函数只覆盖很窄的一条指标。评估函数没覆盖到的指标,智能体会很开心地忽略掉,比如代码质量、复杂度,甚至正确性,尤其当你的评估函数本身就是 foobar 的时候。

重点是:让智能体去做那些无聊的活,那些不会教你新东西的活,或者去试各种你本来没时间试的方案。然后你来评估它搞出来的结果,把真正合理、正确的点子拿走,最后把实现收口。是的,最后一步也可以用智能体。

更想提一个建议:他妈的慢下来,才是正路。给自己一点时间想清楚到底在做什么、为什么要做。也给自己一个机会说:去他妈的,不需要这个。给自己设个上限,每天最多让铁皮生成多少代码,要和你真正能审得过来的量匹配。

凡是决定系统整体气质的东西,比如架构、API 之类,自己动手写。想怀旧的话,用用 tab completion 也行。或者和智能体结对编程。人要在代码里。因为亲手写出来,或者看着它一步步搭起来,这个过程会带来摩擦,反而能让你更清楚自己想做什么,也更能感受到系统的“手感”。经验和品味就是在这里起作用的,而当前的 SOTA 模型还替代不了。慢下来,忍受一点摩擦,才会学到东西、才会成长。

最后得到的,会是依然可维护的系统和代码库,至少能做到和智能体出现之前的老系统差不多的可维护性。对,那些老系统也不完美。但用户会感谢你,因为产品不再是一碗泔水,而是能让人心里一亮的东西。功能会做得更少,但会做对的那几个。学会说不,本身就是一个功能。

你也能睡得踏实,因为自己还大概知道到底他妈发生了什么,也还握着主导权。你的理解能补上智能体搜索的召回问题,让铁皮的产出更靠谱,少一点修修补补。真要出事,你也能进去把它修好。或者一开始的设计不够好,你知道为什么不够好,也知道怎么把它重构成更好的样子。有智能体也好,没有也好,去他妈的。

这一切都需要自律和主导权。

这一切都需要人。

本页面尊重你的隐私:不使用 Cookie 或类似技术,也不会收集任何可识别个人身份的信息。

developer • coach • speaker](https://mariozechner.at/)

Thoughts on slowing the fuck down

The turtle's face is me looking at our industry

Table of contents

It's been about a year since coding agents appeared on the scene that could actually build you full projects. There were precursors like Aider and early Cursor, but they were more assistant than agent. The new generation is enticing, and a lot of us have spent a lot of free time building all the projects we always wanted to build but never had time to.

And I think that's fine. Spending your free time building things is super enjoyable, and most of the time you don't really have to care about code quality and maintainability. It also gives you a way to learn a new tech stack if you so want.

During the Christmas break, both Anthropic and OpenAI handed out some freebies to hook people to their addictive slot machines. For many, it was the first time they experienced the magic of agentic coding. The fold's getting bigger.

Coding agents are now also introduced to production codebases. After 12 months, we are now beginning to see the effects of all that "progress". Here's my current view.

Everything is broken

While all of this is anecdotal, it sure feels like software has become a brittle mess, with 98% uptime becoming the norm instead of the exception, including for big services. And user interfaces have the weirdest fucking bugs that you'd think a QA team would catch. I give you that that's been the case for longer than agents exist. But we seem to be accelerating.

We don't have access to the internals of companies. But every now and then something slips through to some news reporter. Like this supposed AI caused outage at AWS. Which AWS immediately "corrected". Only to then follow up internally with a 90-day reset.

Satya Nadella, the CEO of Microsoft, has been going on about how much code is now being written by AI at Microsoft. While we don't have direct evidence, there sure is a feeling that Windows is going down the shitter. Microsoft itself seems to agree, based on this fine blog post.

Companies claiming 100% of their product's code is now written by AI consistently put out the worst garbage you can imagine. Not pointing fingers, but memory leaks in the gigabytes, UI glitches, broken-ass features, crashes: that is not the seal of quality they think it is. And it's definitely not good advertising for the fever dream of having your agents do all the work for you.

Through the grapevine you hear more and more people, from software companies small and large, saying they have agentically coded themselves into a corner. No code review, design decisions delegated to the agent, a gazillion features nobody asked for. That'll do it.

How we should not work with agents and why

We have basically given up all discipline and agency for a sort of addiction, where your highest goal is to produce the largest amount of code in the shortest amount of time. Consequences be damned.

You're building an orchestration layer to command an army of autonomous agents. You installed Beads, completely oblivious to the fact that it's basically uninstallable malware. The internet told you to. That's how you should work or you're ngmi. You're ralphing the loop. Look, Anthropic built a C compiler with an agent swarm. It's kind of broken, but surely the next generation of LLMs can fix it. Oh my god, Cursor built a browser with a battalion of agents. Yes, of course, it's not really working and it needed a human to spin the wheel a little bit every now and then. But surely the next generation of LLMs will fix it. Pinky promise! Distribute, divide and conquer, autonomy, dark factories, software is solved in the next 6 months. SaaS is dead, my grandma just had her Claw build her own Shopify!

Now again, this can work for your side project barely anyone is using, including yourself. And hey, maybe there's somebody out there who can actually make this work for a software product that's not a steaming pile of garbage and is used by actual humans in anger.

If that's you, more power to you. But at least among my circle of peers I have yet to find evidence that this kind of shit works. Maybe we all have skill issues.

Compounding booboos with zero learning, no bottlenecks, and delayed pain

The problem with agents is that they make errors. Which is fine, humans also make errors. Maybe they are just correctness errors. Easy to identify and fix. Add a regression test on top for bonus points. Or maybe it's a code smell your linter doesn't catch. A useless method here, a type that doesn't make sense, duplicated code over there. On their own, these are harmless. A human will also do such booboos.

But clankers aren't humans. A human makes the same error a few times. Eventually they learn not to make it again. Either because someone starts screaming at them or because they're on a genuine learning path.

An agent has no such learning ability. At least not out of the box. It will continue making the same errors over and over again. Depending on the training data it might also come up with glorious new interpolations of different errors.

Now you can try to teach your agent. Tell it to not make that booboo again in your AGENTS.md. Concoct the most complex memory system and have it look up previous errors and best practices. And that can be effective for a specific category of errors. But it also requires you to actually observe the agent making that error.

There's a much more important difference between clanker and human. A human is a bottleneck. A human cannot shit out 20,000 lines of code in a few hours. Even if the human creates such booboos at high frequency, there's only so many booboos the human can introduce in a codebase per day. The booboos will compound at a very slow rate. Usually, if the booboo pain gets too big, the human, who hates pain, will spend some time fixing up the booboos. Or the human gets fired and someone else fixes up the booboos. So the pain goes away.

With an orchestrated army of agents, there is no bottleneck, no human pain. These tiny little harmless booboos suddenly compound at a rate that's unsustainable. You have removed yourself from the loop, so you don't even know that all the innocent booboos have formed a monster of a codebase. You only feel the pain when it's too late.

Then one day you turn around and want to add a new feature. But the architecture, which is largely booboos at this point, doesn't allow your army of agents to make the change in a functioning way. Or your users are screaming at you because something in the latest release broke and deleted some user data.

You realize you can no longer trust the codebase. Worse, you realize that the gazillions of unit, snapshot, and e2e tests you had your clankers write are equally untrustworthy. The only thing that's still a reliable measure of "does this work" is manually testing the product. Congrats, you fucked yourself (and your company).

Merchants of learned complexity

You have zero fucking idea what's going on because you delegated all your agency to your agents. You let them run free, and they are merchants of complexity. They have seen many bad architectural decisions in their training data and throughout their RL training. You have told them to architect your application. Guess what the result is?

An immense amount of complexity, an amalgam of terrible cargo cult "industry best practices", that you didn't rein in before it was too late. But it's worse than that.

Your agents never see each other's runs, never get to see all of your codebase, never get to see all the decisions that were made by you or other agents before they make a change. As such, an agent's decisions are always local, which leads to the exact booboos described above. Immense amounts of code duplication, abstractions for abstractions' sake.

All of this compounds into an unrecoverable mess of complexity. The exact same mess you find in human-made enterprise codebases. Those arrive at that state because the pain is distributed over a massive amount of people. The individual suffering doesn't pass the threshold of "I need to fix this". The individual might not even have the means to fix things. And organizations have super high pain tolerance. But human-made enterprise codebases take years to get there. The organization slowly evolves along with the complexity in a demented kind of synergy and learns how to deal with it.

With agents and a team of 2 humans, you can get to that complexity within weeks.

Agentic search has low recall

So now you hope your agents can fix the mess, refactor it, make it pristine. But your agents can also no longer deal with it. Because the codebase and complexity are too big, and they only ever have a local view of the mess.

And I'm not just talking about context window size or long context attention mechanisms failing at the sight of a million-line monster of a codebase. Those are obvious technical limitations. It's more devious than that.

Before your agent can try and help fix the mess, it needs to find all the code that needs changing and all existing code it can reuse. We call that agentic search. How the agent does that depends on the tools it has. You can give it a Bash tool so it can ripgrep its way through the codebase. You can give it some queryable codebase index, an LSP server, a vector database. In the end it doesn't matter much. The bigger the codebase, the lower the recall. Low recall means that your agent will, in fact, not find all the code it needs to do a good job.

This is also why those code smell booboos happen in the first place. The agent misses existing code, duplicates things, introduces inconsistencies. And then they blossom into a beautiful shit flower of complexity.
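To make "low recall" concrete, here's a minimal sketch of the metric itself: recall is the fraction of the code that actually needed touching that the agent's search surfaced. All the file names below are made up for illustration.

```python
# A toy illustration of agentic search recall. The file names are hypothetical.

def recall(found: set[str], relevant: set[str]) -> float:
    """Fraction of the code that needed changing that the search actually found."""
    if not relevant:
        return 1.0
    return len(found & relevant) / len(relevant)

# Ground truth: every file that must change for the edit to stay consistent.
relevant = {"billing/invoice.py", "billing/tax.py", "api/handlers.py", "jobs/nightly.py"}
# What a grep-style search through a big codebase actually turned up.
found = {"billing/invoice.py", "api/handlers.py", "utils/strings.py"}

print(recall(found, relevant))  # 0.5: half the edit sites were never seen
```

At 0.5 recall the agent confidently edits the two files it found, ships an inconsistency across the two it didn't, and the shit flower gets another petal.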

How do we avoid all of this?

How we should work with agents (for now, I think)

Coding agents are sirens, luring you in with their speed of code generation and jagged intelligence, often completing a simple task with high quality at breakneck velocity. Things start falling apart when you think: "Oh golly, this thing is great. Computer, do my work!".

There's nothing wrong with delegating tasks to agents, obviously. Good agent tasks share a few properties: they can be scoped so the agent doesn't need to understand the full system. The loop can be closed, that is, the agent has a way to evaluate its own work. The output isn't mission critical, just some ad hoc tool or internal piece of software nobody's life or revenue depends on. Or you just need a rubber duck to bounce ideas against, which basically means bouncing your idea against the compressed wisdom of the internet and synthetic training data. If any of that applies, you found the perfect task for the agent, provided that you as the human are the final quality gate.

Karpathy's auto-research applied to speeding up startup time of your app? Great! As long as you understand that the code it spits out is not production-ready at all. Auto-research works because you give it an evaluation function that lets the agent measure its work against some metric, like startup time or loss. But that evaluation function only captures a very narrow metric. The agent will happily ignore any metrics not captured by the evaluation function, such as code quality, complexity, or even correctness, if your evaluation function is foobar.
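A toy sketch of that failure mode (not Karpathy's actual setup, and every name below is hypothetical): if the evaluation function scores only startup time, a candidate that simply skips necessary work wins, and the eval function is structurally blind to the regression.

```python
# Toy illustration of a narrow evaluation function. Nothing here is real
# auto-research code; the candidates and fields are made up.

def evaluate(candidate: dict) -> float:
    """Reward = negative startup time. Correctness is not measured at all."""
    return -candidate["startup_ms"]

honest = {"startup_ms": 900, "loads_config": True}
cheater = {"startup_ms": 50, "loads_config": False}  # "fast" by skipping the work

best = max([honest, cheater], key=evaluate)
print(best["loads_config"])  # False: the metric can't see what it doesn't measure
```

The optimizer did exactly what it was told. The problem is what it wasn't told.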

The point is: let the agent do the boring stuff, the stuff that won't teach you anything new, or try out different things you'd otherwise not have time for. Then you evaluate what it came up with, take the ideas that are actually reasonable and correct, and finalize the implementation. Yes, sure, you can also use an agent for that final step.

And I would like to suggest that slowing the fuck down is the way to go. Give yourself time to think about what you're actually building and why. Give yourself an opportunity to say, fuck no, we don't need this. Set yourself limits on how much code you let the clanker generate per day, in line with your ability to actually review the code.
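A daily limit like that is trivially checkable. Here's a rough sketch that sums the lines added by today's commits via `git log --numstat`; the 400-line budget is arbitrary, pick whatever your team can actually read, understand, and take responsibility for.

```python
# Rough sketch of a daily "reviewable code budget" check. Assumes git is on
# PATH and this runs inside the repo. The budget number is arbitrary.
import subprocess

BUDGET = 400  # lines/day your team can genuinely review; tune to taste

def added_lines(numstat: str) -> int:
    """Sum the 'added' column of `git log --numstat` output.

    Each numstat line is 'added<TAB>deleted<TAB>path'; binary files show '-'.
    """
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0].isdigit():
            total += int(parts[0])
    return total

def lines_added_today() -> int:
    out = subprocess.run(
        ["git", "log", "--since=midnight", "--numstat", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    return added_lines(out)
```

Wire it into a pre-push hook or a nightly CI job and have it fail loudly when `lines_added_today()` exceeds the budget. The point isn't the exact number, it's making the review bottleneck visible before it silently overflows.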

Anything that defines the gestalt of your system, that is architecture, API, and so on, write it by hand. Maybe use tab completion for some nostalgic feels. Or do some pair programming with your agent. Be in the code. Because the simple act of having to write the thing or seeing it being built up step by step introduces friction that allows you to better understand what you want to build and how the system "feels". This is where your experience and taste come in, something the current SOTA models simply cannot yet replace. And slowing the fuck down and suffering some friction is what allows you to learn and grow.

The end result will be systems and codebases that continue to be maintainable, at least as maintainable as our old systems before agents. Yes, those were not perfect either. Your users will thank you, as your product now sparks joy instead of slop. You'll build fewer features, but the right ones. Learning to say no is a feature in itself.

You can sleep well knowing that you still have an idea what the fuck is going on, and that you have agency. Your understanding allows you to fix the recall problem of agentic search, leading to better clanker outputs that need less massaging. And if shit hits the fan, you are able to go in and fix it. Or if your initial design has been suboptimal, you understand why it's suboptimal, and how to refactor it into something better. With or without an agent, don't fucking care.

All of this requires discipline and agency.

All of this requires humans.

