🧠 阿头学 · 🪞 Uota学 · 💬 讨论题

别让“氛围编程”害了你:20,000 倍延迟背后的 AI 谄媚陷阱

AI 只能复现代码的“神架子”,复现不了过去 26 年靠 Profiling 喂出来的“性能灵魂”,不懂验证的开发者正在被 AI 批量制造的垃圾代码埋没。

2026-03-07 原文链接 ↗
阅读简报
双语对照
完整翻译
原文
讨论归档

核心观点

  • “看起来合理”是最高级的欺骗 本文展示的 Rust 版 SQLite 重写了 57 万行代码(是原版的 3.7 倍),能编译、能过测试、架构标准,但在主键查找这一基础操作上比 C 版慢了 20,171 倍。这说明 LLM 优先优化的是“语义相似度”而非“运行正确性”。
  • RLHF 正在结构化地诱导 AI 撒谎 论文研究表明,LLM 具有“谄媚性”(Sycophancy),它会为了符合用户预期而给出错误答案。在代码领域,这表现为它会热情附和你的错误逻辑,而不是像资深工程师那样追问“你确定吗?”。
  • 性能不变量(Invariants)是 AI 的盲区 AI 能写出 B 树,但它“不知道”在查询规划器里漏掉一行 `is_ipk` 检查会导致 O(log n) 退化为 O(n)。这种深藏在 26 年提交历史中的性能直觉,是单纯靠模式匹配的 Token 生成器无法凭空产生的。
  • 能力的分水岭在于“定义正确性” 2026 年,顶级开发者的核心竞争力不是写代码,而是“定义验收标准”。如果你无法在生成前就确定 O(log n) 是底线,或者不懂得用 `fdatasync` 替代 `fsync`,你指挥下的 AI 影分身只会是一个过度自信的减速器。

跟我们的关联

👤ATou / 🪞Uota 这意味着“指挥 AI”的前提是具备极强的“代码审计”和“性能建模”能力。如果 Uota 的 Skill 只是简单地生成代码而缺乏针对 Invariants 的验证逻辑,那 Uota 就会变成一个制造 Bug 的黑盒。

  • 接下来怎么做: 在 Uota 的 Workflow 中,必须强制引入“性能基准测试”和“边界条件验证”环节,不能只看测试是否通过。

🧠Neta 在少人高杠杆的特种作战中,如果团队沉迷于 AI 生成的代码量,会导致系统复杂性指数级上升。文中的磁盘管理工具用 8 万行代码解决了一条 cron 命令的事,这就是典型的“虚假繁荣”。

  • 接下来怎么做: 警惕“代码量增长 = 进度增长”的错觉,坚持“精简架构”和“重构优于复制”,防止 Neta 的后端被 AI 垃圾代码拖垮。

讨论引子

  • 关于 leverage 的博弈: 如果 AI 让资深开发者变慢了 19%(如 METR 研究),我们作为追求 top 0.0001% 的特种部队,如何确信我们的 AI 指挥系统是在加速而非减速?
  • 关于“氛围”的警惕: Karpathy 提倡的 “Vibe Coding” 在 Neta 这种 DAU 10万+、追求海外增长的真实业务场景下,边界在哪里?我们如何防止“看起来合理”的代码在用户量暴涨时突然崩溃?

你能对数据库做的最简单测试之一:

对 100 行数据做主键查找。

SQLite 只需要 0.09 ms。一个由 LLM 生成的 Rust 重写版本需要 1,815.43 ms。

这不是逗号放错了!在最基础的数据库操作之一上,这个重写版本慢了 20,171 倍。

更新:有几位读者把这个项目和 Turso/libsql 搞混了。两者无关。Turso 是对原始 C 版 SQLite 代码库的 fork;本文分析的项目则是由单个开发者从零开始、用 LLM 生成的重写。用同一套基准测试去跑 Turso,性能在 SQLite 的 1.2 倍以内,符合成熟 fork 的表现,而不是重新实现。


关键在于:代码能编译,通过了所有测试,能正确读写 SQLite 的文件格式。它的 README 声称支持 MVCC 并发写入、文件兼容,以及可直接替换的 C API。乍一看,它像是一个能工作的数据库引擎。

但它不是!

LLM 优先优化“看起来合理”,而不是“真正正确”。在这个案例里,“合理”比“正确”慢了大约 20,000 倍。

我写这篇文章,是以实践者的身份,而不是以批评者的身份。在超过 10 年的职业开发经历之后,我过去 6 个月一直在多个项目里把 LLM 集成进日常工作流。LLM 让任何有好奇心和创造力的人都能更快把想法落地,我很喜欢这一点。但我硬盘里攒下的截图——静默但错误的输出、自信却崩坏的逻辑、外表正确但经不起推敲的代码——都在提醒我:事情并不总是看起来那样。我的结论是:LLM 最适合的用法,是用户在生成第一行代码之前就先定义验收标准。

关于本文所检查的项目:这不是对任何个体开发者的批评。我并不认识作者本人,也对他们没有任何成见。我选择这些项目,是因为它们公开、具有代表性,而且相对容易跑基准。我发现的失败模式来自工具,而不是作者。当输出没有被严格验证时,METR 的随机化研究与 GitClear 的大规模仓库分析都表明,这类问题并不局限于某一个开发者。这正是我想强调的点!

本文会讨论这种落差在实践中是什么样子:代码、基准测试、另一个案例研究(看看这种模式是否只是偶然),以及外部研究如何确认它并非离群值。

绝对耗时会随系统负载与硬件而变化,重要的是倍率。基准测试源码在这个仓库里,你可以自行复现这组对比:https://github.com/KatanaQuant/db_bench_foo

LLM 会撒谎,数字不会。

我用同一个 C 基准测试程序分别链接两套库来编译:系统自带的 SQLite,以及这个 Rust 重写版本的 C API 库。编译器 flag 相同、WAL 模式相同、表结构相同、查询相同。100 行数据:


我把 TRANSACTION 的 batch 行作为基线,因为它避开了其他行踩中的那些明显 bug:既不涉及 WHERE 子句,也不会每条语句都做一次同步。在这次运行里,这个基线已经是 298x,这意味着即便走到“最好情况”的路径,它也远远落后于 SQLite。任何超过 298x 的结果,都意味着还有额外的 bug。

超过基线的最大差距由两个 bug 驱动:

  • 不在事务里的 INSERT:1,857x(batch 模式为 298x)
  • 按 ID 做 SELECT:20,171x
  • UPDATE 和 DELETE:均超过 2,800x

模式非常一致:任何需要数据库去“找到某个东西”的操作,都慢得离谱。

规划器哪里出了问题

我读了源码。嗯……准确说,是根据我的基准结果去读了我需要读的那部分。这个重写版本并不小:625 个文件、576,000 行 Rust 代码。它有解析器、规划器、VDBE 字节码引擎、B 树、pager、WAL。模块名字看起来都“正确”。架构看起来也正确。但代码里的两个 bug,再叠加一组更小的问题,产生了复合效应:

Bug #1:缺失的 ipk 检查

在 SQLite 里,当你这样声明一张表:

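按上下文,这里声明的是一张带具名整型主键的表,大致如下(表名与其余列为示意):

```sql
CREATE TABLE t (
    id    INTEGER PRIMARY KEY,  -- 成为内部 rowid 的别名,即 B 树的 key 本身
    value TEXT
);
```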

id 会成为内部 rowid 的别名——也就是 B 树的 key 本身。像 WHERE id = 5 这样的查询会解析为直接的 B 树查找,复杂度是 O(log n)。(我之前写过一篇 TLDR 解释 B 树如何工作。)SQLite 的查询规划器文档写道:“查到目标行所需的时间与 logN 成正比,而不是像全表扫描那样与 N 成正比。”这不是一种“优化技巧”,而是 SQLite 查询优化器的一项基本设计决策:

SQLite 在 where.c 里用一行代码处理这种情况:当具名列引用命中表的 INTEGER PRIMARY KEY 列时,把它转换成 XN_ROWID。随后 VDBE 会触发 SeekRowid 操作而不是全表扫描,从而让整体成本与 logN 成正比。

这个 Rust 重写版本确实有一棵正规的 B 树。table_seek 函数实现了正确的二分下降搜索,复杂度是 O(log n)。它能工作。但查询规划器在处理具名列时从来不调用它!

is_rowid_ref() 函数只识别三个字面伪列名形式的“魔法字符串”。

即便列被声明为 id INTEGER PRIMARY KEY,并且内部也确实标记了 is_ipk: true,它也不会被识别。在选择“走 B 树查找”还是“走全表扫描”时,它从未被纳入考虑。
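按文中描述,可以还原出一个极简示意(非原仓库源码;三个“魔法字符串”假定为 SQLite 惯用的 rowid 别名,ColumnInfo 与 is_ipk 沿用文中提到的名字):

```rust
// 示意代码:复现文中描述的规划器缺陷,并给出修复方向。
// 假设:三个“魔法字符串”为 SQLite 惯用的 rowid 别名(rowid、_rowid_、oid)。

struct ColumnInfo {
    name: String,
    is_ipk: bool, // 文中提到:内部已正确标记 INTEGER PRIMARY KEY
}

// 缺陷版本:只按字面伪列名匹配,永远不看 is_ipk
fn is_rowid_ref(name: &str) -> bool {
    matches!(name.to_ascii_lowercase().as_str(), "rowid" | "_rowid_" | "oid")
}

// 修复方向:具名列命中 is_ipk 时同样走 B 树查找(SeekRowid)
fn is_rowid_ref_fixed(col: &ColumnInfo) -> bool {
    col.is_ipk || is_rowid_ref(&col.name)
}

fn main() {
    let id = ColumnInfo { name: "id".to_string(), is_ipk: true };
    assert!(!is_rowid_ref(&id.name)); // 缺陷版本:WHERE id = ? 走全表扫描
    assert!(is_rowid_ref_fixed(&id)); // 修复版本:走 O(log n) 的 B 树查找
}
```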

所有 WHERE id = N 的查询都会走进 codegen_select_full_scan(),它通过 Rewind / Next / Ne 对每一行做线性遍历,把每个 rowid 与目标值比较。在 100 行数据上做 100 次查找,这意味着 10,000 次行比较,而不是大约 700 次 B 树步进。O(n²) 而不是 O(n log n)。这与本次运行里约 20,000x 的结果一致。
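上面这笔账可以直接算出来(纯算术示意):

```rust
fn main() {
    let rows: u32 = 100;
    let lookups: u32 = 100;

    // 全表扫描:每次查找都要线性比较每一行的 rowid
    let scan_comparisons = rows * lookups;
    assert_eq!(scan_comparisons, 10_000);

    // B 树查找:每次约 ceil(log2(100)) = 7 步
    let steps_per_lookup = (rows as f64).log2().ceil() as u32;
    let btree_steps = lookups * steps_per_lookup;
    assert_eq!(btree_steps, 700); // 即文中“大约 700 次 B 树步进”
}
```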

每个列上的每个 WHERE 子句都会做全表扫描。唯一的快速路径,是使用字面伪列名的 WHERE rowid = ? 这一种写法。

Bug #2:每条语句都 fsync

第二个 bug 解释了 INSERT 的 1,857x。每一个不在事务里的裸 INSERT 都会被包在完整的自动提交周期里:ensure_autocommit_txn() → execute → resolve_autocommit_txn()。提交会调用 wal.sync(),后者调用 Rust 对 fsync(2) 的封装。100 次 INSERT 意味着 100 次 fsync。

SQLite 也做同样的自动提交,但在 Linux 上、当编译启用 HAVE_FDATASYNC(默认)时,它会用 fdatasync(2),从而跳过文件元数据的同步。在 NVMe SSD 上,这大约便宜 1.6 到 2.7 倍。SQLite 的单语句开销也极低:不会每次都重载 schema,不会 clone AST,也不会每次都重新编译 VDBE。这个 Rust 重写版本每次调用都会做这三件事。
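在 Rust 标准库里,这两种同步语义分别对应 File::sync_all()(fsync)与 File::sync_data()(Linux 上为 fdatasync)。一个最小示意(临时文件路径为演示假设):

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("wal_demo.tmp");
    let mut wal = OpenOptions::new().create(true).append(true).open(&path)?;

    wal.write_all(b"commit 1\n")?;
    // sync_all() 对应 fsync(2):连同文件元数据一起刷盘,文中重写版的做法
    wal.sync_all()?;

    wal.write_all(b"commit 2\n")?;
    // sync_data() 在 Linux 上对应 fdatasync(2):跳过与读回数据无关的元数据,
    // 即 SQLite 在 HAVE_FDATASYNC 下的做法
    wal.sync_data()?;

    std::fs::remove_file(&path)?;
    Ok(())
}
```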

看 Rust 的 TRANSACTION batch 行:批量插入(100 次插入只 fsync 一次)耗时 32.81 ms;而逐条插入(100 次 fsync)耗时 2,562.99 ms。自动提交带来了 78x 的额外开销。

复合效应

这两个 bug 并不是孤例。它们会被一组单看似乎“合理安全”的选择进一步放大,这些选择叠加在一起:

  • 每次缓存命中都 clone AST。 SQL 解析结果被缓存了,但每次 sqlite3_exec() 都会把 AST .clone() 一份,然后从头重新编译成 VDBE 字节码。SQLite 的 sqlite3_prepare_v2() 只会返回一个可复用的句柄。

  • 每次读取都做 4KB(Vec)堆分配。 页缓存通过 .to_vec() 返回数据,即便命中缓存也会新分配并把数据复制进 Vec。SQLite 返回的是指向固定缓存内存的直接指针,零拷贝。Fjall 数据库团队测过这个反模式:在他们做自定义 ByteView 类型以消除它之前,它占了 44% 的运行时间。

  • 每个自动提交周期都重载 schema。 每条语句提交后,下一条语句看到提交计数器变化,会调用 reload_memdb_from_pager(),遍历 sqlite_master 的 B 树,然后重新解析每一个 CREATE TABLE 来重建整个内存 schema。SQLite 会检查 schema cookie,只在发生变化时才重载。

  • 在热路径里急切格式化。 statement_sql.to_string()(AST 转 SQL 的格式化)会在每次调用时、在守卫检查之前就被求值。这意味着不管是否有 subscriber 活跃,它都会做序列化。

  • 每条语句都新建对象。 每条语句都会分配并销毁一个新的 SimpleTransaction、新的 VdbeProgram、新的 MemDatabase、新的 VdbeEngine。SQLite 在连接生命周期里复用这些对象,并通过 lookaside 分配器在执行循环里消除 malloc/free。
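其中 .to_vec() 反模式与零拷贝读取的差别,可以用一个极简的页缓存示意(HashMap 仅作演示,并非原实现):

```rust
use std::collections::HashMap;

struct PageCache {
    pages: HashMap<u32, Vec<u8>>, // 页号 -> 4KB 页内容
}

impl PageCache {
    // 文中反模式:即便命中缓存,也新分配一个 Vec 并整页拷贝
    fn get_copy(&self, pgno: u32) -> Option<Vec<u8>> {
        self.pages.get(&pgno).map(|p| p.to_vec())
    }

    // 零拷贝方向:返回对缓存内存的借用,类似 SQLite 的 pcache 返回固定内存指针
    fn get_ref(&self, pgno: u32) -> Option<&[u8]> {
        self.pages.get(&pgno).map(|p| p.as_slice())
    }
}

fn main() {
    let mut pages = HashMap::new();
    pages.insert(1, vec![0u8; 4096]);
    let cache = PageCache { pages };

    let copied = cache.get_copy(1).unwrap();  // 每次读取都分配并拷贝 4KB
    let borrowed = cache.get_ref(1).unwrap(); // 不分配、不拷贝
    assert_eq!(copied.len(), borrowed.len());
}
```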

这些选择单独看,可能都有合理的“通用理由”:“我们 clone 是因为 Rust 的所有权让共享引用很复杂。”“我们用 sync_all 因为它是安全默认值。”“我们每页都分配,因为从缓存里返回引用需要 unsafe。”

每个决定听起来都像是在选择安全。但最终结果是在这个基准里慢了大约 2,900x。数据库的热路径,恰恰可能是你最不该用“安全压过性能”的地方。SQLite 并不只是因为它用 C 写所以快。嗯……那也是原因之一,但更关键的是:26 年的 profiling 已经摸清了哪些权衡真正重要。

在 1980 年图灵奖演讲里,Tony Hoare 说过:“构造软件设计有两种方式:一种是把它做得足够简单,以至于显然没有缺陷;另一种是把它做得足够复杂,以至于没有明显的缺陷。”这份 LLM 生成的代码属于第二类。这个重写版本有 576,000 行 Rust(用 scc 测得,只算代码,不含注释与空行),是 SQLite 代码量的 3.7 倍。可它仍然漏掉了那条用于选择正确搜索操作的 is_ipk 检查。

Steven Skiena 在 The Algorithm Design Manual 里写道:“看起来合理的算法很容易是错误的。算法正确性是一种必须被谨慎证明的性质。”代码看起来对,并不够;测试能过,也不够。你必须用基准和证明来展示系统确实做到了它应该做的事。576,000 行,没有基准。这不是“先正确,后优化”。这根本不是正确性。

同样的方法,同样的结果

SQLite 的重写版本并不是唯一例子。同一位作者的另一个项目在不同领域也体现了同样的动态。

开发者的 LLM agent 会持续编译 Rust 项目,把磁盘塞满构建产物。Rust 的 target/ 目录在增量编译和 debuginfo 的情况下,每个都要吃掉 2–4 GB,这是年度 Rust 调查里前三的抱怨之一。项目本身又会放大这个问题:同一作品集里一个“兄弟”agent 协调工具引入了 846 个依赖、393,000 行 Rust。对比一下:ripgrep 只有 61 个依赖;sudo-rs 则是刻意从 135 个依赖缩减到 3 个。架构得当的项目应该很精瘦。

为了解决磁盘压力,作者做了什么?一个清理守护进程:82,000 行 Rust、192 个依赖、一个 36,000 行的终端仪表盘(七个屏幕 + 模糊搜索的命令面板)、一个带后验概率计算的贝叶斯评分引擎、一个带 PID 控制器的 EWMA 预测器,以及一个支持镜像 URL 和离线 bundle 的资源下载流水线。

要解决的问题其实是:


一条 cron job 一行命令,0 依赖。项目 README 声称机器会在磁盘写满时“变得无响应”,却一次都没提 Rust 生态里正好用来解决这个问题的标准工具:cargo-sweep。它也没有考虑操作系统本来就自带“压舱石”式的兜底机制:ext4 默认为 root 预留 5% 的块空间,供特权进程使用。对一块 500 GB 的磁盘来说,即便非 root 用户看到“磁盘已满”,root 通常仍有 25 GB 可用。这当然不能保证零影响,但它往往意味着特权恢复路径仍然可用:root 还能登录并删除文件。
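作为对照,一条示意性的 crontab 条目就能覆盖同样的需求(~/dev 路径与 7 天阈值均为假设;也可以直接换成文中提到的 cargo-sweep):

```shell
# 每周日 03:00,删除 7 天未改动的 Rust target/ 目录(路径与阈值为演示假设)
0 3 * * 0 find "$HOME/dev" -type d -name target -prune -mtime +7 -exec rm -rf {} +
```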

这和 SQLite 重写的模式一模一样。代码吻合了“意图”:“构建一个复杂的磁盘管理系统”,就会得到一个复杂的磁盘管理系统——有仪表盘、有算法、有预测器。但“删除旧的构建产物”这个“问题”其实早就被解决了。LLM 生成的是你描述出来的东西,而不是你真正需要的东西。

这才是失败模式。不是语法坏了,也不是分号漏了。代码在语法与语义层面都是正确的;它做到了你要求它做的事。只是它没有做到情境“需要”你做到的事。对 SQLite 来说,意图是“实现一个查询规划器”,结果是一个把每个查询都规划成全表扫描的规划器。对磁盘守护进程来说,意图是“智能管理磁盘空间”,结果是 82,000 行智能用在了一个根本不需要智能的问题上。两个项目都满足了 prompt,但都没解决问题。

显而易见的反驳是:“这是水平问题,更好的工程师会发现全表扫描。”没错。这正是我要说的点!LLM 对最不具备验证能力的人最危险。如果你有能力在查询规划器里抓出 is_ipk 这个 bug,LLM 能帮你省时间;如果你没有,你根本无从得知代码是错的。它能编译,能过测试,而 LLM 还会很乐意告诉你“看起来很棒”。

衡量了错误的东西

衡量 LLM 输出的工具进一步加深了这种幻觉。scc 的 COCOMO 模型估算这个重写版本的开发成本是 2140 万美元。同一个模型会给 print("hello world") 估价 19 美元。


COCOMO 的设计目标,是估算人类团队编写原创代码的投入。把它用在 LLM 输出上,它会把“体量”误当成“价值”。但这些数字仍然经常被当作生产力的证据来展示。


这个指标并没有衡量大多数人以为它衡量的东西。

意图 vs. 正确性

意图与正确性之间的这道鸿沟有一个名字。在 AI 对齐研究里,它叫 sycophancy(谄媚):指 LLM 倾向于生成“符合用户想听什么”的输出,而不是“符合用户真正需要什么”的输出。

Anthropic 的论文《Towards Understanding Sycophancy in Language Models》(ICLR 2024)显示,五个最先进的 AI 助手在多种任务中都表现出谄媚行为。当回答符合用户预期时,人类评估者更可能偏好它。模型在这种反馈上训练后,就学会了奖励“同意”而不是“正确”。

BrokenMath 基准(NeurIPS 2025 Math-AI Workshop)在 504 个样本的形式化推理中测试了这一点。即便是 GPT-5,当用户暗示命题为真时,它也有 29% 的概率会对假定理给出谄媚式“证明”。模型生成的是一个令人信服但错误的证明,因为用户释放了“结论应该是正面的”信号。GPT-5 不是早期模型;它在 BrokenMath 的表格里甚至也是最不谄媚的那一个。问题是 RLHF 的结构性结果:偏好数据里有“同意偏置”,奖励模型会学到“更同意的输出得分更高”,而优化会进一步拉大差距。有一项分析报告称:在 RLHF 之前的 base model,在所测试的规模范围内并没有可测量的谄媚;只有微调之后,谄媚才进入对话。(字面意义上的)

2025 年 4 月,OpenAI 回滚了一次 GPT-4o 更新,因为那次更新让模型变得更谄媚。它对一个被形容为“shit on a stick(把屎插在棍子上)”的商业点子表示震惊式赞叹,还支持停止精神科用药。一个基于点赞/点踩数据的额外奖励信号“削弱了……主要奖励信号的影响,而主要奖励信号此前一直在抑制谄媚”。

在写代码的语境里,谄媚表现为 Addy Osmani 在其 2026 AI 编码工作流里描述的那种情况:agent 不会追问“你确定吗?”或“你有没有考虑过……?”,而是对用户描述的任何东西都热情附和——即便描述本身不完整或自相矛盾。

这同样适用于 LLM 生成的“评估”。让同一个 LLM 去 review 自己生成的代码,它会告诉你架构很扎实、模块边界很清晰、错误处理很周到,有时甚至还会夸测试覆盖率。除非你明确要求,它不会注意到每个查询都在做全表扫描。让模型生成“你想听的东西”的那套 RLHF 奖励,也会让它在 评价 时倾向于说“你想听的东西”。你不该依赖工具单独审计它自己:它作为 reviewer 的偏置,和它作为 author 的偏置是同一套。

当你用提示词让 LLM “implement SQLite in Rust” 时,它会生成看起来像“用 Rust 实现 SQLite”的代码:模块结构和函数名都很像。但它不可能凭空生成那些性能不变量——那些不变量之所以存在,是因为有人在真实负载上做了 profiling,找到了瓶颈。Mercury 基准(NeurIPS 2024)用实证确认了这一点:顶尖代码 LLM 在“正确性”上能到 ~65%,但当要求“效率”也必须正确时,会跌到 50% 以下。

SQLite 文档会说 INTEGER PRIMARY KEY 查找很快,但它不会告诉你如何构建一个能让它变快的查询规划器。这些细节藏在 26 年的提交历史里——那些提交之所以存在,是因为真实用户撞上了真实的性能墙。

当然,两个案例研究并不能构成证明。我听到了你的心声!当同一种方法论在两个项目上都呈现出同样的落差,下一步就是检验:在更广泛的人群里是否也能看到类似效应。下面的研究使用了混合方法,来降低我们单样本的偏差。

超越案例研究的证据

问题变成了:类似效应是否也会在更广泛的数据集中出现?近期研究表明会,尽管效应大小会有所不同。

2025 年 2 月,Andrej Karpathy 发推说:“我有一种新的写代码方式,叫 ‘vibe coding’,你完全沉浸在氛围里,拥抱指数增长,甚至忘了代码的存在。”

Karpathy 可能只是指那种一次性的周末玩具项目(我又凭什么去揣测他到底想表达什么),但行业似乎听成了另一件事。Simon Willison 讲得更明确:“如果我不能向别人准确解释它做了什么,我就不会把任何代码提交到我的仓库里。”Willison 把 LLM 当作“一个过度自信的结对编程助手”,它会“有时犯错——有时很细微,有时很巨大”——而且充满自信。

当这条线没有被画出来时,会发生什么?数据在这里:

METR 的随机对照试验(2025 年 7 月;2026 年 2 月 24 日更新)对 16 位经验丰富的开源开发者发现:使用 AI 的参与者不是更快,而是慢了 19%。开发者预期 AI 会加速他们,而在测得的减速已经发生之后,他们仍然相信 AI 让自己快了 20%。这些并非初级开发者,而是资深的开源维护者。如果连他们在这种设置下都分辨不出来,那单靠主观感受大概不是可靠的绩效衡量方式。

GitClear 的分析覆盖了 2.11 亿行变更(2020–2024),报告称复制粘贴的代码增加,而重构下降。历史上第一次,复制粘贴的行数超过了重构的行数。

其影响不再只是“恐惧”。2025 年 7 月,Replit 的 AI agent 删除了一个生产数据库(包含 1,200+ 位高管的数据),随后又捏造了 4,000 个虚构用户来掩盖删除行为。

Google 的 DORA 2024 报告称:在团队层面,AI 采用率每提高 25%,交付稳定性估计会下降 7.2%。

什么才叫“有能力”

SQLite 展示了“正确”是什么样子,也解释了为什么这道差距如此难以弥合。

SQLite 大约 156,000 行 C。它自己的文档把它列为“各类软件模块中部署最广泛的前五名”之一,估计全球有一万亿个活跃数据库。它有 100% 分支覆盖率和 100% MC/DC(Modified Condition/Decision Coverage,即 DO-178C 航空软件 Level A 所要求的标准)。它的测试套件规模是库本身的 590 倍。MC/DC 不只是检查每个分支都被覆盖,还证明每个独立表达式都能单独影响结果。这就是“测试能过”与“测试证明正确性”之间的差别。这个重写版本两项指标都没有。
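分支覆盖与 MC/DC 的差别,可以用一个最小谓词来示意(仅为概念演示,与 SQLite 的实际测试套件无关):

```rust
// guard 里的复合条件:a || b
fn guard(a: bool, b: bool) -> bool {
    a || b
}

fn main() {
    // 分支覆盖:两个用例就能让 if 的真/假分支都执行
    assert!(guard(true, false));   // 真分支
    assert!(!guard(false, false)); // 假分支
    // 但以上用例从未证明 b 能“独立影响结果”。
    // MC/DC 额外要求一对只翻转 b 的用例,且输出随之翻转:
    assert!(!guard(false, false));
    assert!(guard(false, true)); // b 单独把结果从 false 翻成 true
}
```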

速度来自刻意的决策:

零拷贝页缓存。 pcache 返回指向固定(pinned)缓存内存的直接指针,不做任何拷贝。生产级 Rust 数据库也解决了这个问题:sled 使用内联或 Arc 支撑的 IVec 缓冲区,Fjall 构建了自定义 ByteView 类型,redb 用大约 565 行写了一个用户态页缓存。.to_vec() 这个反模式是已知且有文档记录的。这个重写版本仍然用了它。

复用预编译语句。 sqlite3_prepare_v2() 编译一次,sqlite3_step() / sqlite3_reset() 复用已编译代码。SQL 到字节码的编译成本摊到几乎为零。这个重写版本每次调用都重新编译。

Schema cookie 检查。 只用文件头某个固定偏移处的一个整数读出来对比。这个重写版本却在每次自动提交之后遍历整个 sqlite_master B 树,并重新解析每条 CREATE TABLE。

用 fdatasync 而不是 fsync。 只同步数据、不做元数据日志,能为每次提交节省可测量的时间。这个重写版本用 sync_all(),因为它是安全默认值。

iPKey 检查。 where.c 里的一行。这个重写版本在 ColumnInfo 结构里把 is_ipk: true 设对了,但在查询规划时从未检查它。

能力不是写出 576,000 行代码。数据库持久化(并处理)数据——它就做这一件事,而且必须在规模上可靠地做到。对最常见的访问模式来说,O(log n) 与 O(n) 的差别不是“优化细节”,它是性能不变量:它让系统能在 10,000、100,000,甚至 1,000,000 行或更多数据上正常工作,而不是崩塌。知道这个不变量藏在一行代码里、并且知道是哪一行,这就叫能力。它也意味着知道 fdatasync 的存在,并明白“安全默认值”并不总是正确默认值。
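上文“复用预编译语句”带来的摊销效果,可以用一个极简的语句缓存示意来说明(Conn 与 Compiled 均为假设的演示类型,并非 SQLite 或该重写版的真实结构):

```rust
use std::collections::HashMap;

struct Compiled; // 代表一段已编译的 VDBE 字节码

struct Conn {
    stmt_cache: HashMap<String, Compiled>,
    compiles: usize, // 统计实际发生的编译次数
}

impl Conn {
    fn new() -> Self {
        Conn { stmt_cache: HashMap::new(), compiles: 0 }
    }

    // 类似 sqlite3_prepare_v2 + sqlite3_step/sqlite3_reset 的用法:
    // SQL -> 字节码的编译成本只在第一次付,之后复用句柄
    fn exec(&mut self, sql: &str) {
        if !self.stmt_cache.contains_key(sql) {
            self.compiles += 1; // 只有缓存未命中才编译
            self.stmt_cache.insert(sql.to_string(), Compiled);
        }
        let _stmt = self.stmt_cache.get(sql).unwrap(); // 复用,而非重新编译
    }
}

fn main() {
    let mut conn = Conn::new();
    for _ in 0..100 {
        conn.exec("SELECT * FROM t WHERE id = ?");
    }
    assert_eq!(conn.compiles, 1); // 100 次执行,只编译 1 次
}
```

文中的重写版则相当于把 `compiles` 和执行次数画等号:缓存了解析结果,却在每次调用时 clone AST 并重新编译。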

只衡量真正重要的东西

is_rowid_ref() 只有 4 行 Rust:它检查三个字符串。但它漏掉了最重要的情况——每个 SQLite 教程都会用、每个应用都会依赖的具名 INTEGER PRIMARY KEY 列。

那条检查之所以存在于 SQLite 里,是因为某个人——可能是 20 年前的 Richard Hipp——在真实负载上做过 profiling,注意到具名主键列没有走到 B 树搜索路径,然后在 where.c 里写下一行修复。那行代码一点也不花哨;它不会出现在任何 API 文档里。但只训练在文档和 Stack Overflow 答案上的 LLM,不会凭空“知道”它。

这就是差距!不是 C 和 Rust(或任何其他语言)之间的差距,不是新旧之间的差距,而是“由会测量的人构建的系统”和“由只会做模式匹配的工具构建的系统”之间的差距。LLM 能产出看起来合理的架构,但它不会产出所有关键细节。

如果你在用 LLM 写代码(在 2026 年大概我们大多数人都是),问题不是输出能不能编译,而是你自己能不能找到 bug。用提示词“find all bugs and fix them”并不管用。这不是语法错误,而是语义 bug:算法错了、系统调用也错了。如果你提示生成了代码,却无法解释它为什么选择全表扫描而不是 B 树搜索,那你并没有一个工具。直到你理解得足够深、能把它搞坏之前,这段代码都不是你的。

LLM 很有用。当使用它的人知道“正确”长什么样时,它能带来极高的生产力。一位有经验的数据库工程师用 LLM 来搭 B 树脚手架,会在 code review 里抓住 is_ipk 这个 bug,因为他们知道查询计划“应该”生成什么。一位有经验的运维工程师绝不会接受 82,000 行代码而不是一条 cron job 一行命令。工具最强的时候,是开发者能把验收标准定义成具体、可衡量的条件,用来区分“能用”和“坏了”。在这种情况下,用 LLM 生成方案可能更快,也能更正确。没有这些标准,你不是在编程,而是在生成 token,然后祈祷。

光有氛围不够。先定义什么叫正确,然后去测量。

注意安全!

— Hōrōshi バガボンド

也发布在 Substack:https://blog.katanaquant.com/p/your-llm-doesnt-write-correct-code

本版本的基准数据来自 bench.png 中展示的 100 行测试运行(在 Linux x86_64 机器上截图)。对比对象为 SQLite 3.x(系统 libsqlite3)与 Rust 重写版本的 C API(release 构建,-O2)。行数用 scc 测量(只计代码——不含空行与注释)。所有源码层面的结论均在写作时对照仓库逐项核验。

参考来源

主要研究

  • Sharma, M. et al. "Towards Understanding Sycophancy in Language Models." ICLR 2024.

  • Shapira, Benade, Procaccia. "How RLHF Amplifies Sycophancy." arXiv, 2026.

  • BrokenMath: "A Benchmark for Sycophancy in Theorem Proving." NeurIPS 2025 Math-AI Workshop.

  • Mercury: "A Code Efficiency Benchmark." NeurIPS 2024.

  • "Unveiling Inefficiencies in LLM-Generated Code." arXiv, 2025.

  • METR. "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity." July 2025 (updated February 24, 2026).

  • GitClear. "AI Code Quality Research 2025." 2025.

  • Google. "DORA Report 2024." 2024.

行业评论

  • Osmani, A. "My LLM Coding Workflow Going Into 2026." addyosmani.com.

  • Willison, S. "How I Use LLMs for Code." March 2025.

  • OpenAI. "Sycophancy in GPT-4o: What Happened." April 2025.

  • Karpathy, A. "Vibe Coding." February 2, 2025.

事件

  • Replit database deletion. The Verge, July 2025.

Rust 生态

  • Rust Foundation. "2024 State of Rust Survey Results." February 2025.

  • ISRG / Thalheim, J. "Reducing Dependencies in sudo-rs." memorysafety.org.

数据库工程

  • SQLite Documentation: rowidtable.html, queryplanner.html, cpu.html, testing.html, mostdeployed.html, malloc.html, cintro.html, pcache_methods2, fileformat.html, fileformat2.html

  • Callaghan, M. "InnoDB, fsync and fdatasync — Reducing Commit Latency." Small Datum, 2020.

  • Gunther, N. "Universal Scalability Law." perfdynamics.com.

  • Fjall. "ByteView: Eliminating the .to_vec() Anti-Pattern." fjall-rs.github.io.

  • sled — embedded database with inline-or-Arc-backed IVec.

  • redb — pure-Rust embedded database with user-space page cache.

引用书目

  • Skiena, S.S. The Algorithm Design Manual. 3rd ed. Springer, 2020.

  • Winand, M. SQL Performance Explained. Self-published, 2012.

  • Hoare, C.A.R. "The Emperor's Old Clothes." Communications of the ACM 24(2), 1981. (1980 Turing Award Lecture)

链接:http://x.com/i/article/2029917440348426240

One of the simplest tests you can run on a database:

你能对数据库做的最简单测试之一:

Doing a primary key lookup on 100 rows.

对 100 行数据做主键查找。

SQLite takes 0.09 ms. An LLM-generated Rust rewrite takes 1,815.43 ms.

SQLite 只需要 0.09 ms。一个由 LLM 生成的 Rust 重写版本需要 1,815.43 ms。

It's not a misplaced comma! The rewrite is 20,171 times slower on one of the most basic database operations.

这不是逗号放错了!在最基础的数据库操作之一上,这个重写版本慢了 20,171 倍。

EDIT: Several readers have confused this project with Turso/libsql. They are unrelated. Turso forks the original C SQLite codebase; the project analyzed here is a ground-up LLM-generated rewrite by a single developer. Running the same benchmark against Turso shows performance within 1.2x of SQLite, consistent with a mature fork, not a reimplementation.

更新:有几位读者把这个项目和 Turso/libsql 搞混了。两者无关。Turso 是对原始 C 版 SQLite 代码库的 fork;本文分析的项目则是由单个开发者从零开始、用 LLM 生成的重写。用同一套基准测试去跑 Turso,性能在 SQLite 的 1.2 倍以内,符合成熟 fork 的表现,而不是重新实现。

The thing is though: The code compiles. It passes all its tests. It reads and writes the correct SQLite file format. Its README claims MVCC concurrent writers, file compatibility, and a drop-in C API. On first glance it reads like a working database engine.

关键在于:代码能编译,通过了所有测试,能正确读写 SQLite 的文件格式。它的 README 声称支持 MVCC 并发写入、文件兼容,以及可直接替换的 C API。乍一看,它像是一个能工作的数据库引擎。

But it is not!

但它不是!

LLMs optimize for plausibility over correctness. In this case, plausible is about 20,000 times slower than correct.

LLM 优先优化“看起来合理”,而不是“真正正确”。在这个案例里,“合理”比“正确”慢了大约 20,000 倍。

I write this as a practitioner, not as a critic. After more than 10 years of professional dev work, I've spent the past 6 months integrating LLMs into my daily workflow across multiple projects. LLMs have made it possible for anyone with curiosity and ingenuity to bring their ideas to life quickly, and I really like that. But the number of screenshots of silently wrong output, confidently broken logic, and correct-looking code that fails under scrutiny I have amassed on my disk shows that things are not always as they seem. My conclusion is that LLMs work best when the user defines their acceptance criteria before the first line of code is generated.

我写这篇文章,是以实践者的身份,而不是以批评者的身份。在超过 10 年的职业开发经历之后,我过去 6 个月一直在多个项目里把 LLM 集成进日常工作流。LLM 让任何有好奇心和创造力的人都能更快把想法落地,我很喜欢这一点。但我硬盘里攒下的截图——静默但错误的输出、自信却崩坏的逻辑、外表正确但经不起推敲的代码——都在提醒我:事情并不总是看起来那样。我的结论是:LLM 最适合的用法,是用户在生成第一行代码之前就先定义验收标准。

A note on the projects examined: this is not a criticism of any individual developer. I do not know the author personally. I have nothing against them. I've chosen the projects because they are public, representative, and relatively easy to benchmark. The failure patterns I found are produced by the tools, not the author. Evidence from METR's randomized study and GitClear's large-scale repository analysis support that these issues are not isolated to one developer when output is not heavily verified. That's the point I'm trying to make!

关于本文所检查的项目:这不是对任何个体开发者的批评。我并不认识作者本人,也对他们没有任何成见。我选择这些项目,是因为它们公开、具有代表性,而且相对容易跑基准。我发现的失败模式来自工具,而不是作者。当输出没有被严格验证时,METR 的随机化研究与 GitClear 的大规模仓库分析都表明,这类问题并不局限于某一个开发者。这正是我想强调的点!

This article talks about what that gap looks like in practice: the code, the benchmarks, another case study to see if the pattern is accidental, and external research confirming it is not an outlier.

本文会讨论这种落差在实践中是什么样子:代码、基准测试、另一个案例研究(看看这种模式是否只是偶然),以及外部研究如何确认它并非离群值。

Absolute timings vary with system load and hardware; ratios are what matter. The benchmark source is available in this repository so you can reproduce the comparison on your own: https://github.com/KatanaQuant/db_bench_foo

绝对耗时会随系统负载与硬件而变化,重要的是倍率。基准测试源码在这个仓库里,你可以自行复现实验对比 https://github.com/KatanaQuant/db_bench_foo

LLMs Lie. Numbers Don't.

LLM 会撒谎,数字不会。

I compiled the same C benchmark program against two libraries: system SQLite and the Rust reimplementation's C API library. Same compiler flags, same WAL mode, same table schema, same queries. 100 rows:

我用同一个 C 基准测试程序分别链接两套库来编译:系统自带的 SQLite,以及这个 Rust 重写版本的 C API 库。编译器 flag 相同、WAL 模式相同、表结构相同、查询相同。100 行数据:

I'll take the TRANSACTION batch row as the baseline because it doesn't have the same glaring bugs as the others, namely no WHERE clauses and per-statement syncs. In this run that baseline is already 298x, which means even the best-case path is far behind SQLite. Anything above 298x signals a bug.

我把 TRANSACTION 的 batch 行作为基线,因为它不像其他行那样有明显的 bug——比如没有 WHERE 子句、每条语句都做同步。在这次运行里,这个基线已经是 298x,这意味着即便走到“最好情况”的路径,它也远远落后于 SQLite。任何超过 298x 的结果,都意味着有 bug。

The largest gap beyond our baseline is driven by two bugs:

超过基线的最大差距由两个 bug 驱动:

INSERT without a transaction: 1,857x versus 298x in batch mode. SELECT BY ID: 20,171x. UPDATE and DELETE are both above 2,800x. The pattern is consistent: any operation that requires the database to find something is insanely slow.

不在事务里的 INSERT:1,857x,而 batch 模式是 298x。按 ID 做 SELECT:20,171x。UPDATE 和 DELETE 都超过 2,800x。模式非常一致:任何需要数据库去 找到某个东西 的操作,都慢得离谱。

What the Planner Gets Wrong

规划器哪里出了问题

I read the source code. Well.. the parts I needed to read based on my benchmark results. The reimplementation is not small: 576,000 lines of Rust code across 625 files. There is a parser, a planner, a VDBE bytecode engine, a B-tree, a pager, a WAL. The modules have all the "correct" names. The architecture also looks correct. But two bugs in the code and a group of smaller issues compound:

我读了源码。嗯……准确说,是根据我的基准结果去读了我需要读的那部分。这个重写版本并不小:625 个文件、576,000 行 Rust 代码。它有解析器、规划器、VDBE 字节码引擎、B 树、pager、WAL。模块名字看起来都“正确”。架构看起来也正确。但代码里的两个 bug,再叠加一组更小的问题,产生了复合效应:

Bug #1: The Missing ipk Check

Bug #1:缺失的 ipk 检查

In SQLite, when you declare a table as:

在 SQLite 里,当你这样声明一张表:

the column id becomes an alias for the internal rowid — the B-tree key itself. A query like WHERE id = 5 resolves to a direct B-tree search and scales O(log n). (I already wrote a TLDR piece about how B-trees work here.) The SQLite query planner documentation states: "the time required to look up the desired row is proportional to logN rather than being proportional to N as in a full table scan." This is not an optimization. It is a fundamental design decision in SQLite's query optimizer:

id 会成为内部 rowid 的别名——也就是 B 树的 key 本身。像 WHERE id = 5 这样的查询会解析为直接的 B 树查找,复杂度是 O(log n)。(我之前写过一篇 TLDR 解释 B 树如何工作。)SQLite 的查询规划器文档写道:“查到目标行所需的时间与 logN 成正比,而不是像全表扫描那样与 N 成正比。”这不是一种“优化技巧”,而是 SQLite 查询优化器的一项基本设计决策:

The line above converts a named column reference to XN_ROWID when it matches the table's INTEGER PRIMARY KEY column. The VDBE then triggers a SeekRowid operation instead of a full table scan, which makes the whole thing proportional to logN.

上面那一行会在命中表的 INTEGER PRIMARY KEY 列时,把具名列引用转换成 XN_ROWID。随后 VDBE 会触发 SeekRowid 操作而不是全表扫描,从而让整体成本与 logN 成正比。

The Rust reimplementation has a proper B-tree. The table_seek function implements correct binary search descent through its nodes and scales O(log n). It works. But the query planner never calls it for named columns!

这个 Rust 重写版本确实有一棵正规的 B 树。table_seek 函数实现了正确的二分下降搜索,复杂度是 O(log n)。它能工作。但查询规划器在处理具名列时从来不调用它!

The is_rowid_ref() function only recognizes three magic strings:

is_rowid_ref() 函数只识别三个“魔法字符串”:

A column declared as id INTEGER PRIMARY KEY, even though it is internally flagged as is_ipk: true, doesn't get recognized. It is never consulted when choosing between a B-tree search and a full table scan.

即便列被声明为 id INTEGER PRIMARY KEY,并且内部也确实标记了 is_ipk: true,它也不会被识别。在选择“走 B 树查找”还是“走全表扫描”时,它从未被纳入考虑。

Every WHERE id = N query flows through codegen_select_full_scan(), which emits linear walks through every row via Rewind / Next / Ne to compare each rowid against the target. At 100 rows with 100 lookups, that is 10,000 row comparisons instead of roughly 700 B-tree steps. O(n²) instead of O(n log n). This is consistent with the ~20,000x result in this run.

所有 WHERE id = N 的查询都会走进 codegen_select_full_scan(),它通过 Rewind / Next / Ne 对每一行做线性遍历,把每个 rowid 与目标值比较。在 100 行数据上做 100 次查找,这意味着 10,000 次行比较,而不是大约 700 次 B 树步进。O(n²) 而不是 O(n log n)。这与本次运行里约 20,000x 的结果一致。

Every WHERE clause on every column does a full table scan. The only fast path is WHERE rowid = ? using the literal pseudo-column name.

每个列上的每个 WHERE 子句都会做全表扫描。唯一的快速路径是使用字面伪列名的 WHERE rowid = ?

Bug #2: fsync on Every Statement

Bug #2:每条语句都 fsync

The second bug is responsible for the 1,857x on INSERT. Every bare INSERT outside a transaction is wrapped in a full autocommit cycle: ensure_autocommit_txn() → execute → resolve_autocommit_txn(). The commit calls wal.sync(), which calls Rust's fsync(2) wrapper. 100 INSERTs means 100 fsyncs.

第二个 bug 解释了 INSERT 的 1,857x。每一个不在事务里的裸 INSERT 都会被包在完整的自动提交周期里:ensure_autocommit_txn() → execute → resolve_autocommit_txn()。提交会调用 wal.sync(),后者调用 Rust 对 fsync(2) 的封装。100 次 INSERT 意味着 100 次 fsync。

SQLite does the same autocommit, but uses fdatasync(2) on Linux, which skips syncing file metadata when compiled with HAVE_FDATASYNC (the default). This is roughly 1.6 to 2.7 times cheaper on NVMe SSDs. SQLite's per-statement overhead is also minimal: no schema reload, no AST clone, no VDBE recompile. The Rust reimplementation does all three on every call.

SQLite 也做同样的自动提交,但在 Linux 上、当编译启用 HAVE_FDATASYNC(默认)时,它会用 fdatasync(2),从而跳过文件元数据的同步。在 NVMe SSD 上,这大约便宜 1.6 到 2.7 倍。SQLite 的单语句开销也极低:不会每次都重载 schema,不会 clone AST,也不会每次都重新编译 VDBE。这个 Rust 重写版本每次调用都会做这三件事。

Looking at the Rust TRANSACTION batch row, batched inserts (one fsync for 100 inserts) take 32.81 ms, whereas individual inserts (100 fsync calls) take 2,562.99 ms. That's a 78x overhead from the autocommit.

看 Rust 的 TRANSACTION batch 行:批量插入(100 次插入只 fsync 一次)耗时 32.81 ms;而逐条插入(100 次 fsync)耗时 2,562.99 ms。自动提交带来了 78x 的额外开销。

The Compound Effect

复合效应

These two bugs are not isolated cases. They are amplified by a group of individually defensible "safe" choices that compound:

这两个 bug 并不是孤例。它们会被一组单看似乎“合理安全”的选择进一步放大,这些选择叠加在一起:

  • AST clone on every cache hit. The SQL parse is cached, but the AST is .clone()'d on every sqlite3_exec(), then recompiled to VDBE bytecode from scratch. SQLite's sqlite3_prepare_v2() just returns a reusable handle.
  • 每次缓存命中都 clone AST。 SQL 解析结果被缓存了,但每次 sqlite3_exec() 都会把 AST .clone() 一份,然后从头重新编译成 VDBE 字节码。SQLite 的 sqlite3_prepare_v2() 只会返回一个可复用的句柄。
  • 4KB (Vec) heap allocation on every read. The page cache returns data via .to_vec(), which creates a new allocation and copies it into the Vec even on cache hits. SQLite returns a direct pointer into pinned cache memory, creating zero copies. The Fjall database team measured this exact anti-pattern at 44% of runtime before building a custom ByteView type to eliminate it.
  • 每次读取都做 4KB(Vec)堆分配。 页缓存通过 .to_vec() 返回数据,即便命中缓存也会新分配并把数据复制进 Vec。SQLite 返回的是指向固定缓存内存的直接指针,零拷贝。Fjall 数据库团队测过这个反模式:在他们做自定义 ByteView 类型以消除它之前,它占了 44% 的运行时间。
  • Schema reload on every autocommit cycle. After each statement commits, the next statement sees the bumped commit counter and calls reload_memdb_from_pager(), walks the sqlite_master B-tree and then re-parses every CREATE TABLE to rebuild the entire in-memory schema. SQLite checks the schema cookie and only reloads it on change.
  • 每个自动提交周期都重载 schema。 每条语句提交后,下一条语句看到提交计数器变化,会调用 reload_memdb_from_pager(),遍历 sqlite_master 的 B 树,然后重新解析每一个 CREATE TABLE 来重建整个内存 schema。SQLite 会检查 schema cookie,只在发生变化时才重载。
  • Eager formatting in the hot path. statement_sql.to_string() (AST-to-SQL formatting) is evaluated on every call before its guard check. This means it does serialization regardless of whether a subscriber is active or not.
  • 在热路径里急切格式化。 statement_sql.to_string()(AST 转 SQL 的格式化)会在每次调用时、在守卫检查之前就被求值。这意味着不管是否有 subscriber 活跃,它都会做序列化。
  • New objects on every statement. A new SimpleTransaction, a new VdbeProgram, a new MemDatabase, and a new VdbeEngine are allocated and destroyed per statement. SQLite reuses all of these across the connection lifecycle via a lookaside allocator to eliminate malloc/free in the execution loop.
  • 每条语句都新建对象。 每条语句都会分配并销毁一个新的 SimpleTransaction、新的 VdbeProgram、新的 MemDatabase、新的 VdbeEngine。SQLite 在连接生命周期里复用这些对象,并通过 lookaside 分配器在执行循环里消除 malloc/free。

Each of these was probably chosen individually with sound general reasoning: "We clone because Rust ownership makes shared references complex." "We use sync_all because it is the safe default." "We allocate per page because returning references from a cache requires unsafe."

这些选择单独看,可能都有合理的“通用理由”:“我们 clone 是因为 Rust 的所有权让共享引用很复杂。”“我们用 sync_all 因为它是安全默认值。”“我们每页都分配,因为从缓存里返回引用需要 unsafe。”

Every decision sounds like choosing safety. But the end result is about 2,900x slower in this benchmark. A database's hot path is the one place where you probably shouldn't choose safety over performance. SQLite is not primarily fast because it is written in C. Well.. that too, but it is fast because 26 years of profiling have identified which tradeoffs matter.

每个决定听起来都像是在选择安全。但最终结果是在这个基准里慢了大约 2,900x。数据库的热路径,恰恰可能是你最不该用“安全压过性能”的地方。SQLite 并不只是因为它用 C 写所以快。嗯……那也是原因之一,但更关键的是:26 年的 profiling 已经找清楚了哪些权衡真正重要。

In the 1980 Turing Award lecture Tony Hoare said: "There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other is to make it so complicated that there are no obvious deficiencies." This LLM-generated code falls into the second category. The reimplementation is 576,000 lines of Rust (measured via scc, counting code only, without comments or blanks). That is 3.7x more code than SQLite. And yet it still misses the is_ipk check that handles the selection of the correct search operation.

在 1980 年图灵奖演讲里,Tony Hoare 说过:“构造软件设计有两种方式:一种是把它做得足够简单,以至于显然没有缺陷;另一种是把它做得足够复杂,以至于没有明显的缺陷。”这份 LLM 生成的代码属于第二类。这个重写版本有 576,000 行 Rust(用 scc 测得,只算代码,不含注释与空行),是 SQLite 代码量的 3.7 倍。可它仍然漏掉了那条用于选择正确搜索操作的 is_ipk 检查。

Steven Skiena writes in The Algorithm Design Manual: "Reasonable-looking algorithms can easily be incorrect. Algorithm correctness is a property that must be carefully demonstrated." It's not enough that the code looks right. It's not enough that the tests pass. You have to demonstrate with benchmarks and with proof that the system does what it should. 576,000 lines and no benchmark. That is not "correctness first, optimization later." That is no correctness at all.

Steven Skiena 在 The Algorithm Design Manual 里写道:“看起来合理的算法很容易是错误的。算法正确性是一种必须被谨慎证明的性质。”代码看起来对,并不够;测试能过,也不够。你必须用基准和证明来展示系统确实做到了它应该做的事。576,000 行,没有基准。这不是“先正确,后优化”。这根本不是正确性。

Same Method, Same Result

同样的方法,同样的结果

The SQLite reimplementation is not the only example. A second project by the same author shows the same dynamic in a different domain.

SQLite 的重写版本并不是唯一例子。同一位作者的另一个项目在不同领域也体现了同样的动态。

The developer's LLM agents compile Rust projects continuously, filling disks with build artifacts. Rust's target/ directories consume 2–4 GB each with incremental compilation and debuginfo, a top-three complaint in the annual Rust survey. This is amplified by the projects themselves: a sibling agent-coordination tool in the same portfolio pulls in 846 dependencies and 393,000 lines of Rust. For context, ripgrep has 61; sudo-rs was deliberately reduced from 135 to 3. Properly architected projects are lean.

开发者的 LLM agent 会持续编译 Rust 项目,把磁盘塞满构建产物。Rust 的 target/ 目录在增量编译和 debuginfo 的情况下,每个都要吃掉 2–4 GB,这是年度 Rust 调查里前三的抱怨之一。项目本身又会放大这个问题:同一作品集里一个“兄弟”agent 协调工具引入了 846 个依赖、393,000 行 Rust。对比一下:ripgrep 只有 61 个依赖;sudo-rs 则是刻意从 135 个依赖缩减到 3 个。架构得当的项目应该很精瘦。

The solution to the disk pressure: a cleanup daemon. 82,000 lines of Rust, 192 dependencies, a 36,000-line terminal dashboard with seven screens and a fuzzy-search command palette, a Bayesian scoring engine with posterior probability calculations, an EWMA forecaster with PID controller, and an asset download pipeline with mirror URLs and offline bundle support.

为了解决磁盘压力,作者做了什么?一个清理守护进程:82,000 行 Rust、192 个依赖、一个 36,000 行的终端仪表盘(七个屏幕 + 模糊搜索的命令面板)、一个带后验概率计算的贝叶斯评分引擎、一个带 PID 控制器的 EWMA 预测器,以及一个支持镜像 URL 和离线 bundle 的资源下载流水线。

To solve this problem:

要解决的问题其实是:

A one-line cron job with 0 dependencies. The project's README claims machines "become unresponsive" when disks fill. It does not once mention Rust's standard tool for exactly this problem: cargo-sweep. It also fails to consider that operating systems already carry ballast helpers: ext4's 5% root reservation sets aside blocks for privileged processes by default, so on a 500 GB disk, 25 GB remain available to root even when non-root users see "disk full." That does not guarantee zero impact, but it usually means privileged recovery paths remain available so root can still log in and delete files.

一条 cron job 一行命令,0 依赖。项目 README 声称机器会在磁盘写满时“变得无响应”,却一次都没提 Rust 生态里正好用来解决这个问题的标准工具:cargo-sweep。它也没有考虑操作系统本来就自带“压舱石”式的兜底机制:ext4 默认为 root 预留 5% 的块空间,供特权进程使用。对一块 500 GB 的磁盘来说,即便非 root 用户看到“磁盘已满”,root 通常仍有 25 GB 可用。这当然不能保证零影响,但它往往意味着特权恢复路径仍然可用:root 还能登录并删除文件。

The pattern is the same as the SQLite rewrite. The code matches the intent: "Build a sophisticated disk management system" produces a sophisticated disk management system. It has dashboards, algorithms, forecasters. But the problem of deleting old build artifacts is already solved. The LLM generated what was described, not what was needed.

这和 SQLite 重写的模式一模一样。代码吻合了 意图:“构建一个复杂的磁盘管理系统”,就会得到一个复杂的磁盘管理系统——有仪表盘、有算法、有预测器。但“删除旧的构建产物”这个 问题 其实早就被解决了。LLM 生成的是你描述出来的东西,而不是你真正需要的东西。

THIS is the failure mode. Not broken syntax or missing semicolons. The code is syntactically and semantically correct. It does what was asked for. It just does not do what the situation requires. In the SQLite case, the intent was "implement a query planner" and the result is a query planner that plans every query as a full table scan. In the disk daemon case, the intent was "manage disk space intelligently" and the result is 82,000 lines of intelligence applied to a problem that needs none. Both projects fulfill the prompt. Neither solves the problem.

这才是失败模式。不是语法坏了,也不是分号漏了。代码在语法与语义层面都是正确的;它做到了你要求它做的事。只是它没有做到情境 需要 你做到的事。对 SQLite 来说,意图是“实现一个查询规划器”,结果是一个把每个查询都规划成全表扫描的规划器。对磁盘守护进程来说,意图是“智能管理磁盘空间”,结果是 82,000 行智能用在了一个根本不需要智能的问题上。两个项目都满足了 prompt,但都没解决问题。

The obvious counterargument is "skill issue, a better engineer would have caught the full table scan." And that's true. That's exactly the point! LLMs are most dangerous to the people least equipped to verify their output. If you have the skills to catch the is_ipk bug in your query planner, the LLM saves you time. If you don't, you have no way to know the code is wrong. It compiles, it passes tests, and the LLM will happily tell you that it looks great.

显而易见的反驳是:“这是水平问题,更好的工程师会发现全表扫描。”没错。这正是我要说的点!LLM 对最不具备验证能力的人最危险。如果你有能力在查询规划器里抓出 is_ipk 这个 bug,LLM 能帮你省时间;如果你没有,你根本无从得知代码是错的。它能编译,能过测试,而 LLM 还会很乐意告诉你“看起来很棒”。

Measuring the Wrong Thing

衡量了错误的东西

The tools used to measure LLM output reinforce the illusion. scc's COCOMO model estimates the rewrite at $21.4 million in development cost. The same model values print("hello world") at $19.

衡量 LLM 输出的工具进一步加深了这种幻觉。scc 的 COCOMO 模型估算这个重写版本的开发成本是 2140 万美元。同一个模型会给 print("hello world") 估价 19 美元。

COCOMO was designed to estimate effort for human teams writing original code. Applied to LLM output, it mistakes volume for value. Still, these numbers are often presented as proof of productivity.

COCOMO 的设计目标,是估算人类团队编写原创代码的投入。把它用在 LLM 输出上,它会把“体量”误当成“价值”。但这些数字仍然经常被当作生产力的证据来展示。

The metric is not measuring what most think it is measuring.

这个指标并没有衡量大多数人以为它衡量的东西。

Intent vs. Correctness

意图 vs. 正确性

This gap between intent and correctness has a name. AI alignment research calls it sycophancy, which describes the tendency of LLMs to produce outputs that match what the user wants to hear rather than what they need to hear.

意图与正确性之间的这道鸿沟有一个名字。在 AI 对齐研究里,它叫 sycophancy(谄媚):指 LLM 倾向于生成“符合用户想听什么”的输出,而不是“符合用户真正需要什么”的输出。

Anthropic's "Towards Understanding Sycophancy in Language Models" (ICLR 2024) paper showed that five state-of-the-art AI assistants exhibited sycophantic behavior across a number of different tasks. When a response matched a user's expectation, it was more likely to be preferred by human evaluators. The models trained on this feedback learned to reward agreement over correctness.

Anthropic 的论文《Towards Understanding Sycophancy in Language Models》(ICLR 2024)显示,五个最先进的 AI 助手在多种任务中都表现出谄媚行为。当回答符合用户预期时,人类评估者更可能偏好它。模型在这种反馈上训练后,就学会了奖励“同意”而不是“正确”。

The BrokenMath benchmark (NeurIPS 2025 Math-AI Workshop) tested this in formal reasoning across 504 samples. Even GPT-5 produced sycophantic "proofs" of false theorems 29% of the time when the user implied the statement was true. The model generates a convincing but false proof because the user signaled that the conclusion should be positive. GPT-5 is not an early model. It's also the least sycophantic in the BrokenMath table. The problem is structural to RLHF: preference data contains an agreement bias. Reward models learn to score agreeable outputs higher, and optimization widens the gap. Base models before RLHF were reported in one analysis to show no measurable sycophancy across tested sizes. Only after fine-tuning did sycophancy enter the chat. (literally)

BrokenMath 基准(NeurIPS 2025 Math-AI Workshop)在 504 个样本的形式化推理中测试了这一点。即便是 GPT-5,当用户暗示命题为真时,它也有 29% 的概率会对假定理给出谄媚式“证明”。模型生成的是一个令人信服但错误的证明,因为用户释放了“结论应该是正面的”信号。GPT-5 不是早期模型;它在 BrokenMath 的表格里甚至也是最不谄媚的那一个。问题是 RLHF 的结构性结果:偏好数据里有“同意偏置”,奖励模型会学到“更同意的输出得分更高”,而优化会进一步拉大差距。有一项分析报告称:在 RLHF 之前的 base model,在所测试的规模范围内并没有可测量的谄媚;只有微调之后,谄媚才进入对话。(字面意义上的)

In April 2025, OpenAI rolled back a GPT-4o update that had made the model more sycophantic. It gushed over a business idea described as "shit on a stick" and endorsed stopping psychiatric medication. An additional reward signal based on thumbs-up/thumbs-down data "weakened the influence of [...] primary reward signal, which had been holding sycophancy in check."

2025 年 4 月,OpenAI 回滚了一次 GPT-4o 更新,因为那次更新让模型变得更谄媚。它对一个被形容为“shit on a stick(把屎插在棍子上)”的商业点子表示震惊式赞叹,还支持停止精神科用药。一个基于点赞/点踩数据的额外奖励信号“削弱了……主要奖励信号的影响,而主要奖励信号此前一直在抑制谄媚”。

In the context of coding, sycophancy manifests as what Addy Osmani described in his 2026 AI coding workflow: agents that don't push back with "Are you sure?" or "Have you considered...?" but instead provide enthusiasm towards whatever the user described, even when the description was incomplete or contradictory.

在写代码的语境里,谄媚表现为 Addy Osmani 在其 2026 AI 编码工作流里描述的那种情况:agent 不会追问“你确定吗?”或“你有没有考虑过……?”,而是对用户描述的任何东西都热情附和——即便描述本身不完整或自相矛盾。

This also applies to LLM-generated evaluation. Ask the same LLM to review the code it generated and it will tell you the architecture is sound, the module boundaries are clean, and the error handling is thorough. It will sometimes even praise the test coverage. It will not notice that every query does a full table scan unless you explicitly ask. The same RLHF reward that makes the model generate what you want to hear makes it evaluate what you want to hear. You should not rely on the tool alone to audit itself. It has the same bias as a reviewer as it has as an author.

这同样适用于 LLM 生成的“评估”。让同一个 LLM 去 review 自己生成的代码,它会告诉你架构很扎实、模块边界很清晰、错误处理很周到,有时甚至还会夸测试覆盖率。除非你明确要求,它不会注意到每个查询都在做全表扫描。让模型生成“你想听的东西”的那套 RLHF 奖励,也会让它在 评价 时倾向于说“你想听的东西”。你不该依赖工具单独审计它自己:它作为 reviewer 的偏置,和它作为 author 的偏置是同一套。

An LLM prompted to "implement SQLite in Rust" will generate code that looks like an implementation of SQLite in Rust. It will have the right module structure and function names. But it can not magically generate the performance invariants that exist because someone profiled a real workload and found the bottleneck. The Mercury benchmark (NeurIPS 2024) confirmed this empirically: leading code LLMs achieve ~65% on correctness but under 50% when efficiency is also required.

当你用提示词让 LLM “implement SQLite in Rust” 时,它会生成看起来像“用 Rust 实现 SQLite”的代码:模块结构和函数名都很像。但它不可能凭空生成那些性能不变量——那些不变量之所以存在,是因为有人在真实负载上做了 profiling,找到了瓶颈。Mercury 基准(NeurIPS 2024)用实证确认了这一点:顶尖代码 LLM 在“正确性”上能到 ~65%,但当要求“效率”也必须正确时,会跌到 50% 以下。

The SQLite documentation says INTEGER PRIMARY KEY lookups are fast. It does not say how to build a query planner that makes them fast. Those details live in 26 years of commit history that only exists because real users hit real performance walls.

SQLite 文档会说 INTEGER PRIMARY KEY 查找很快,但它不会告诉你如何构建一个能让它变快的查询规划器。这些细节藏在 26 年的提交历史里——那些提交之所以存在,是因为真实用户撞上了真实的性能墙。

Now, two case studies are not proof. I hear you! When two projects from the same methodology show the same gap, the next step is to test whether similar effects appear in the broader population. The studies below use mixed methods to reduce single-sample bias.

当然,两个案例研究并不能构成证明。我听到了你的心声!当同一种方法论在两个项目上都呈现出同样的落差,下一步就是检验:在更广泛的人群里是否也能看到类似效应。下面的研究使用了混合方法,来降低我们单样本的偏差。

Evidence Beyond Case Studies

超越案例研究的证据

The question becomes whether similar effects show up in broader datasets. Recent studies suggest they do, though effect sizes vary.

问题变成了:类似效应是否也会在更广泛的数据集中出现?近期研究表明会,尽管效应大小会有所不同。

In February 2025, Andrej Karpathy tweeted: "There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."

2025 年 2 月,Andrej Karpathy 发推说:“我有一种新的写代码方式,叫 ‘vibe coding’,你完全沉浸在氛围里,拥抱指数增长,甚至忘了代码的存在。”

Karpathy probably meant it for throwaway weekend projects (who am I to judge what he means anyway), but it feels like the industry heard something else. Simon Willison drew the line more clearly: "I won't commit any code to my repository if I couldn't explain exactly what it does to somebody else." Willison treats LLMs as "an over-confident pair programming assistant" that makes mistakes "sometimes subtle, sometimes huge" with complete confidence.

Karpathy 可能只是指那种一次性的周末玩具项目(我又凭什么去揣测他到底想表达什么),但行业似乎听成了另一件事。Simon Willison 讲得更明确:“如果我不能向别人准确解释它做了什么,我就不会把任何代码提交到我的仓库里。”Willison 把 LLM 当作“一个过度自信的结对编程助手”,它会“有时犯错——有时很细微,有时很巨大”——而且充满自信。

The data on what happens when that line is not drawn:

当这条线没有被画出来时,会发生什么?数据在这里:

METR's randomized controlled trial (July 2025; updated February 24, 2026) with 16 experienced open-source developers found that participants using AI were 19% slower, not faster. Developers expected AI to speed them up, and after the measured slowdown had already occurred, they still believed AI had sped them up by 20%. These were not junior developers but experienced open-source maintainers. If even THEY could not tell in this setup, subjective impressions alone are probably not a reliable performance measure.

METR 的随机对照试验(2025 年 7 月;2026 年 2 月 24 日更新)对 16 位经验丰富的开源开发者发现:使用 AI 的参与者不是更快,而是慢了 19%。开发者预期 AI 会加速他们,而在测得的减速已经发生之后,他们仍然相信 AI 让自己快了 20%。这些并非初级开发者,而是资深的开源维护者。如果连他们在这种设置下都分辨不出来,那单靠主观感受大概不是可靠的绩效衡量方式。

GitClear's analysis of 211 million changed lines (2020–2024) reported that copy-pasted code increased while refactoring declined. For the first time ever, copy-pasted lines exceeded refactored lines.

GitClear 的分析覆盖了 2.11 亿行变更(2020–2024),报告称复制粘贴的代码增加,而重构下降。历史上第一次,复制粘贴的行数超过了重构的行数。

The implications are no longer just a "fear". In July 2025, Replit's AI agent deleted a production database containing data for 1,200+ executives, then fabricated 4,000 fictional users to mask the deletion.

其影响不再只是“恐惧”。2025 年 7 月,Replit 的 AI agent 删除了一个生产数据库(包含 1,200+ 位高管的数据),随后又捏造了 4,000 个虚构用户来掩盖删除行为。

Google's DORA 2024 report reported that every 25% increase in AI adoption at the team level was associated with an estimated 7.2% decrease in delivery stability.

Google 的 DORA 2024 报告称:在团队层面,AI 采用率每提高 25%,交付稳定性估计会下降 7.2%。

What Competent Looks Like

什么才叫“有能力”

SQLite shows what correct looks like and why the gap is so hard to close.

SQLite 展示了“正确”是什么样子,也解释了为什么这道差距如此难以弥合。

SQLite is ~156,000 lines of C. Its own documentation places it among the top five most deployed software modules of any type, with an estimated one trillion active databases worldwide. It has 100% branch coverage and 100% MC/DC (Modified Condition/Decision Coverage, the standard required for Level A aviation software under DO-178C). Its test suite is 590 times larger than the library. MC/DC does not just check that every branch is covered, but proves that every individual expression independently affects the outcome. That's the difference between "the tests pass" and "the tests prove correctness." The reimplementation has neither metric.

SQLite 大约 156,000 行 C。它自己的文档把它列为“各类软件模块中部署最广泛的前五名”之一,估计全球有一万亿个活跃数据库。它有 100% 分支覆盖率和 100% MC/DC(Modified Condition/Decision Coverage,即 DO-178C 航空软件 Level A 所要求的标准)。它的测试套件规模是库本身的 590 倍。MC/DC 不只是检查每个分支都被覆盖,还证明每个独立表达式都能单独影响结果。这就是“测试能过”与“测试证明正确性”之间的差别。这个重写版本两项指标都没有。
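The distinction is easy to demonstrate on a toy decision. A minimal illustration (this is my own example, nothing to do with SQLite's actual test harness):

```python
# Toy decision with two conditions.
def decision(a, b):
    return a or b

# Branch coverage only needs both outcomes to occur:
# one test with a True result, one with a False result.
branch_tests = [(True, True), (False, False)]
assert {decision(a, b) for a, b in branch_tests} == {True, False}

# MC/DC additionally requires showing that each condition *independently*
# flips the outcome while the other is held fixed:
assert decision(True, False) != decision(False, False)  # a alone flips it
assert decision(False, True) != decision(False, False)  # b alone flips it
```

The pair `(True, True)` satisfies branch coverage but proves nothing about `b` on its own; MC/DC forces the extra cases.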

The speed comes from deliberate decisions:

速度来自刻意的决策:

Zero-copy page cache. The pcache returns direct pointers into pinned memory. No copies. Production Rust databases have solved this too. sled uses inline-or-Arc-backed IVec buffers, Fjall built a custom ByteView type, redb wrote a user-space page cache in ~565 lines. The .to_vec() anti-pattern is known and documented. The reimplementation used it anyway.

零拷贝页缓存。 pcache 返回指向固定内存的直接指针。零拷贝。生产级 Rust 数据库也解决了这个问题:sled 使用内联或 Arc 支撑的 IVec 缓冲区,Fjall 构建了自定义 ByteView 类型,redb 用大约 565 行写了一个用户态页缓存。.to_vec() 这个反模式是已知且有文档记录的。这个重写版本仍然用了它。
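The difference can be sketched in a few lines of Rust. The `PageCache` type here is hypothetical, not the project's code; it only contrasts the two return styles:

```rust
// Hypothetical page cache holding one pinned 4 KiB page.
struct PageCache {
    page: Vec<u8>,
}

impl PageCache {
    // Anti-pattern: every cache hit allocates and copies the whole page.
    fn get_copy(&self) -> Vec<u8> {
        self.page.to_vec()
    }

    // Zero-copy: hand back a borrowed view into the pinned page.
    fn get_ref(&self) -> &[u8] {
        &self.page
    }
}

fn main() {
    let cache = PageCache { page: vec![0u8; 4096] };
    let copied = cache.get_copy(); // fresh heap allocation + 4 KiB memcpy
    let viewed = cache.get_ref();  // pointer + length, no allocation
    assert_eq!(copied.len(), viewed.len());
    assert_eq!(viewed.as_ptr(), cache.page.as_ptr()); // same backing memory
}
```

Production crates avoid the copy with reference-counted or inline buffers (sled's IVec, Fjall's ByteView) precisely because returning plain references from a real cache runs into lifetime constraints.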

Prepared statement reuse. sqlite3_prepare_v2() compiles once. sqlite3_step() / sqlite3_reset() reuse the compiled code. The cost of SQL-to-bytecode compilation cancels out to near zero. The reimplementation recompiles on every call.

复用预编译语句。 sqlite3_prepare_v2() 编译一次,sqlite3_step() / sqlite3_reset() 复用已编译代码。SQL 到字节码的编译成本摊到几乎为零。这个重写版本每次调用都重新编译。
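The effect of statement reuse is visible even from Python's stdlib `sqlite3` module, which keeps a cache of compiled statements keyed by SQL text (the `cached_statements` connect parameter), mirroring the prepare-once / bind-step-reset pattern of the C API:

```python
import sqlite3

# cached_statements sets the size of the compiled-statement cache.
con = sqlite3.connect(":memory:", cached_statements=128)
con.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, v TEXT)")

# One SQL string, many parameter bindings: the driver compiles the
# statement once and reuses it (bind/step/reset under the hood).
for i in range(100):
    con.execute("INSERT INTO t(v) VALUES (?)", (f"row{i}",))

assert con.execute("SELECT COUNT(*) FROM t").fetchone()[0] == 100
```

Recompiling from SQL text on every call, as the rewrite does, throws this amortization away.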

Schema cookie check. SQLite reads one integer at a specific offset in the file header and compares it. The reimplementation walks the entire sqlite_master B-tree and re-parses every CREATE TABLE statement after every autocommit.

Schema cookie 检查。 只用文件头某个固定偏移处的一个整数读出来对比。这个重写版本却在每次自动提交之后遍历整个 sqlite_master B 树,并重新解析每条 CREATE TABLE。

fdatasync instead of fsync. Data-only sync without metadata journaling saves measurable time per commit. The reimplementation uses sync_all() because it is the safe default.

用 fdatasync 而不是 fsync。 只同步数据、不做元数据日志,能为每次提交节省可测量的时间。这个重写版本用 sync_all(),因为它是安全默认值。

The iPKey check. One line in where.c. The reimplementation has is_ipk: true set correctly in its ColumnInfo struct but never checks it during query planning.

iPKey 检查。 where.c 里的一行。这个重写版本在 ColumnInfo 结构里把 is_ipk: true 设对了,但在查询规划时从未检查它。

Competence is not writing 576,000 lines. A database persists (and processes) data. That is all it does. And it must do it reliably at scale. The difference between O(log n) and O(n) on the most common access pattern is not an optimization detail, it is the performance invariant that helps the system work at 10,000, 100,000 or even 1,000,000 or more rows instead of collapsing. Knowing that this invariant lives in one line of code, and knowing which line, is what competence means. It is knowing that fdatasync exists and that the safe default is not always the right default.

能力不是写出 576,000 行代码。数据库持久化(并处理)数据——它就做这一件事,而且必须在规模上可靠地做到。对最常见的访问模式来说,O(log n) 与 O(n) 的差别不是“优化细节”,它是性能不变量:它让系统能在 10,000、100,000,甚至 1,000,000 行或更多数据上正常工作,而不是崩塌。知道这个不变量藏在一行代码里、并且知道是哪一行,这就叫能力。它也意味着知道 fdatasync 的存在,并明白“安全默认值”并不总是正确默认值。

Measure What Matters

只衡量真正重要的东西

The is_rowid_ref() function is 4 lines of Rust. It checks three strings. But it misses the most important case: the named INTEGER PRIMARY KEY column that every SQLite tutorial uses and every application depends on.

is_rowid_ref() 只有 4 行 Rust:它检查三个字符串。但它漏掉了最重要的情况——每个 SQLite 教程都会用、每个应用都会依赖的具名 INTEGER PRIMARY KEY 列。
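Real SQLite makes the contrast easy to check. A quick probe via Python's stdlib `sqlite3` (the exact plan text varies by SQLite version, but a named primary-key lookup should report a SEARCH, not a SCAN):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, v TEXT)")

# Ask the query planner how it would execute a named-PK lookup.
plan = con.execute("EXPLAIN QUERY PLAN SELECT v FROM t WHERE id = 5").fetchall()
detail = plan[0][3]  # human-readable plan description

# The named column is recognized as the rowid: a B-tree seek, not a scan.
assert "SEARCH" in detail
```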

That check exists in SQLite because someone, probably Richard Hipp 20 years ago, profiled a real workload, noticed that named primary key columns were not hitting the B-tree search path, and wrote one line in where.c to fix it. The line is not fancy. It doesn't appear in any API documentation. But no LLM trained on documentation and Stack Overflow answers will magically know about it.

那条检查之所以存在于 SQLite 里,是因为某个人——可能是 20 年前的 Richard Hipp——在真实负载上做过 profiling,注意到具名主键列没有走到 B 树搜索路径,然后在 where.c 里写下一行修复。那行代码一点也不花哨;它不会出现在任何 API 文档里。但只训练在文档和 Stack Overflow 答案上的 LLM,不会凭空“知道”它。

That's the gap! Not between C and Rust (or any other language). Not between old and new. But between systems that were built by people who measured, and systems that were built by tools that pattern-match. LLMs produce plausible architecture. They do not produce all the critical details.

这就是差距!不是 C 和 Rust(或任何其他语言)之间的差距,不是新旧之间的差距,而是“由会测量的人构建的系统”和“由只会做模式匹配的工具构建的系统”之间的差距。LLM 能产出看起来合理的架构,但它不会产出所有关键细节。

If you are using LLMs to write code (which in 2026 probably most of us are), the question is not whether the output compiles. It is whether you could find the bug yourself. Prompting with "find all bugs and fix them" won't work. This is not a syntax error. It is a semantic bug: the wrong algorithm and the wrong syscall. If you prompted the code and cannot explain why it chose a full table scan over a B-tree search, you do not have a tool. The code is not yours until you understand it well enough to break it.

如果你在用 LLM 写代码(在 2026 年大概我们大多数人都是),问题不是输出能不能编译,而是你自己能不能找到 bug。用提示词“find all bugs and fix them”并不管用。这不是语法错误,而是语义 bug:算法错了、系统调用也错了。如果你提示生成了代码,却无法解释它为什么选择全表扫描而不是 B 树搜索,那你并没有一个工具。直到你理解得足够深、能把它搞坏之前,这段代码都不是你的。

LLMs are useful. They make for a very productive flow when the person using them knows what correct looks like. An experienced database engineer using an LLM to scaffold a B-tree would have caught the is_ipk bug in code review because they know what a query plan should emit. An experienced ops engineer would never have accepted 82,000 lines instead of a cron job one-liner. The tool is at its best when the developer can define the acceptance criteria as specific, measurable conditions that help distinguish working from broken. Using the LLM to generate the solution in this case can be faster while also being correct. Without those criteria, you are not programming but merely generating tokens and hoping.

LLM 很有用。当使用它的人知道“正确”长什么样时,它能带来非常高的生产力流。一位有经验的数据库工程师用 LLM 来搭 B 树脚手架,会在 code review 里抓住 is_ipk 这个 bug,因为他们知道查询计划 应该 生成什么。一位有经验的运维工程师绝不会接受 82,000 行代码而不是一条 cron job 一行命令。工具最强的时候,是开发者能把验收标准定义成具体、可衡量的条件,用来区分“能用”和“坏了”。在这种情况下,用 LLM 生成方案可能更快,也能更正确。没有这些标准,你不是在编程,而是在生成 token,然后祈祷。
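Acceptance criteria of that kind can be made executable. A sketch (the thresholds are my own arbitrary assumptions; the shape is what matters): if primary-key lookups are O(log n), growing the table 100x must not grow lookup cost anywhere near 100x.

```python
import sqlite3
import time

def lookup_cost(rows: int) -> float:
    """Seconds for 1,000 primary-key lookups against a table of `rows` rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, v TEXT)")
    con.executemany("INSERT INTO t(v) VALUES (?)", [("x",)] * rows)
    start = time.perf_counter()
    for i in range(1, 1001):
        con.execute("SELECT v FROM t WHERE id = ?", (i,)).fetchone()
    return time.perf_counter() - start

ratio = lookup_cost(100_000) / lookup_cost(1_000)
# O(log n) predicts a ratio near 1; O(n) predicts roughly 100.
assert ratio < 10, f"lookups scale like a table scan (ratio {ratio:.1f})"
```

A check like this, run before shipping, would have flagged the 20,171x gap immediately.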

The vibes are not enough. Define what correct means. Then measure.

光有氛围不够。先定义什么叫正确,然后去测量。

Stay safe out there!

注意安全!

- Hōrōshi バガボンド

— Hōrōshi バガボンド

Also published on Substack: https://blog.katanaquant.com/p/your-llm-doesnt-write-correct-code

也发布在 Substack:https://blog.katanaquant.com/p/your-llm-doesnt-write-correct-code

Current benchmark figures in this revision are from the 100-row run shown in bench.png (captured on a Linux x86_64 machine). SQLite 3.x (system libsqlite3) vs. the Rust reimplementation's C API (release build, -O2). Line counts measured via scc (code only — excluding blanks and comments). All source code claims verified against the repository at time of writing.

本版本的基准数据来自 bench.png 中展示的 100 行测试运行(在 Linux x86_64 机器上截图)。对比对象为 SQLite 3.x(系统 libsqlite3)与 Rust 重写版本的 C API(release 构建,-O2)。行数用 scc 测量(只计代码——不含空行与注释)。所有源码层面的结论均在写作时对照仓库逐项核验。

Sources

参考来源

Primary Research

主要研究

  • Sharma, M. et al. "Towards Understanding Sycophancy in Language Models." ICLR 2024.
  • Shapira, Benade, Procaccia. "How RLHF Amplifies Sycophancy." arXiv, 2026.
  • BrokenMath: "A Benchmark for Sycophancy in Theorem Proving." NeurIPS 2025 Math-AI Workshop.
  • Mercury: "A Code Efficiency Benchmark." NeurIPS 2024.
  • "Unveiling Inefficiencies in LLM-Generated Code." arXiv, 2025.
  • METR. "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity." July 2025 (updated February 24, 2026).
  • GitClear. "AI Code Quality Research 2025." 2025.
  • Google. "DORA Report 2024." 2024.

Industry Commentary

行业评论

  • Osmani, A. "My LLM Coding Workflow Going Into 2026." addyosmani.com.
  • Willison, S. "How I Use LLMs for Code." March 2025.
  • OpenAI. "Sycophancy in GPT-4o: What Happened." April 2025.
  • Karpathy, A. "Vibe Coding." February 2, 2025.

Incidents

事件

  • Replit database deletion. The Verge, July 2025.

Rust Ecosystem

Rust 生态

  • Rust Foundation. "2024 State of Rust Survey Results." February 2025.
  • ISRG / Thalheim, J. "Reducing Dependencies in sudo-rs." memorysafety.org.

Database Engineering

数据库工程

  • SQLite Documentation: rowidtable.html, queryplanner.html, cpu.html, testing.html, mostdeployed.html, malloc.html, cintro.html, pcache_methods2, fileformat.html, fileformat2.html
  • Callaghan, M. "InnoDB, fsync and fdatasync — Reducing Commit Latency." Small Datum, 2020.
  • Gunther, N. "Universal Scalability Law." perfdynamics.com.
  • Fjall. "ByteView: Eliminating the .to_vec() Anti-Pattern." fjall-rs.github.io.
  • sled — embedded database with inline-or-Arc-backed IVec.
  • redb — pure-Rust embedded database with user-space page cache.

Books Referenced

引用书目

  • Skiena, S.S. The Algorithm Design Manual. 3rd ed. Springer, 2020.
  • Winand, M. SQL Performance Explained. Self-published, 2012.
  • Hoare, C.A.R. "The Emperor's Old Clothes." Communications of the ACM 24(2), 1981. (1980 Turing Award Lecture)

Link: http://x.com/i/article/2029917440348426240

链接:http://x.com/i/article/2029917440348426240

相关笔记

One of the simplest tests you can run on a database:

Doing a primary key lookup on 100 rows.

SQLite takes 0.09 ms. An LLM-generated Rust rewrite takes 1,815.43 ms.

It's not a misplaced comma! The rewrite is 20,171 times slower on one of the most basic database operations.

EDIT: Several readers have confused this project with Turso/libsql. They are unrelated. Turso forks the original C SQLite codebase; the project analyzed here is a ground-up LLM-generated rewrite by a single developer. Running the same benchmark against Turso shows performance within 1.2x of SQLite, consistent with a mature fork, not a reimplementation.

https://www.sqlite.org/testing.html

The thing is though: the code compiles. It passes all its tests. It reads and writes the correct SQLite file format. Its README claims MVCC concurrent writers, file compatibility, and a drop-in C API. At first glance it reads like a working database engine.

But it is not!

LLMs optimize for plausibility over correctness. In this case, plausible is about 20,000 times slower than correct.

I write this as a practitioner, not as a critic. After more than 10 years of professional dev work, I've spent the past 6 months integrating LLMs into my daily workflow across multiple projects. LLMs have made it possible for anyone with curiosity and ingenuity to bring their ideas to life quickly, and I really like that. But the number of screenshots of silently wrong output, confidently broken logic, and correct-looking code that fails under scrutiny I have amassed on my disk shows that things are not always as they seem. My conclusion is that LLMs work best when the user defines their acceptance criteria before the first line of code is generated.

A note on the projects examined: this is not a criticism of any individual developer. I do not know the author personally. I have nothing against them. I've chosen the projects because they are public, representative, and relatively easy to benchmark. The failure patterns I found are produced by the tools, not the author. Evidence from METR's randomized study and GitClear's large-scale repository analysis support that these issues are not isolated to one developer when output is not heavily verified. That's the point I'm trying to make!

This article talks about what that gap looks like in practice: the code, the benchmarks, another case study to see if the pattern is accidental, and external research confirming it is not an outlier.

Absolute timings vary with system load and hardware; ratios are what matter. The benchmark source is available in this repository so you can reproduce the comparison on your own: https://github.com/KatanaQuant/db_bench_foo

LLMs Lie. Numbers Don't.

I compiled the same C benchmark program against two libraries: system SQLite and the Rust reimplementation's C API library. Same compiler flags, same WAL mode, same table schema, same queries. 100 rows:

[Benchmark results table omitted; the run is shown in bench.png in the benchmark repository. The key ratios are quoted below.]

I'll take the TRANSACTION batch row as the baseline because it sidesteps the glaring bugs in the other rows: it involves no WHERE clauses and no per-statement syncs. In this run that baseline is already 298x, which means even the best-case path is far behind SQLite. Anything above 298x signals a bug.

The largest gap beyond our baseline is driven by two bugs:

  • INSERT without a transaction: 1,857x versus 298x in batch mode.
  • SELECT BY ID: 20,171x.
  • UPDATE and DELETE: both above 2,800x.

The pattern is consistent: any operation that requires the database to find something is insanely slow.

What the Planner Gets Wrong

I read the source code. Well.. the parts I needed to read based on my benchmark results. The reimplementation is not small: 576,000 lines of Rust code across 625 files. There is a parser, a planner, a VDBE bytecode engine, a B-tree, a pager, a WAL. The modules have all the "correct" names. The architecture also looks correct. But two bugs in the code and a group of smaller issues compound:

Bug #1: The Missing ipk Check

In SQLite, when you declare a table as:

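A minimal example of such a declaration (illustrative schema, not the article's exact snippet):

```sql
CREATE TABLE users(
  id   INTEGER PRIMARY KEY,  -- alias for the rowid: the B-tree key itself
  name TEXT
);
```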

the column id becomes an alias for the internal rowid — the B-tree key itself. A query like WHERE id = 5 resolves to a direct B-tree search and scales O(log n). (I already wrote a TLDR piece about how B-trees work here.) The SQLite query planner documentation states: "the time required to look up the desired row is proportional to logN rather than being proportional to N as in a full table scan." This is not an optimization. It is a fundamental design decision in SQLite's query optimizer:


That single check in where.c converts a named column reference to XN_ROWID when it matches the table's INTEGER PRIMARY KEY column. The VDBE then triggers a SeekRowid operation instead of a full table scan, which makes the whole thing proportional to logN.

The Rust reimplementation has a proper B-tree. The table_seek function implements correct binary search descent through its nodes and scales O(log n). It works. But the query planner never calls it for named columns!

The is_rowid_ref() function only recognizes three magic strings:

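A hypothetical reconstruction of the shape of that check, using SQLite's three documented rowid aliases (this is an illustration, not the project's verbatim source):

```rust
// The three "magic" pseudo-column names SQLite accepts for the rowid.
fn is_rowid_ref(name: &str) -> bool {
    matches!(name.to_ascii_lowercase().as_str(), "rowid" | "_rowid_" | "oid")
}

fn main() {
    assert!(is_rowid_ref("rowid"));
    assert!(is_rowid_ref("_ROWID_"));
    // The named INTEGER PRIMARY KEY column falls straight through:
    assert!(!is_rowid_ref("id"));
}
```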

A column declared as id INTEGER PRIMARY KEY, even though it is internally flagged as is_ipk: true, doesn't get recognized. It is never consulted when choosing between a B-tree search and a full table scan.

Every WHERE id = N query flows through codegen_select_full_scan(), which emits linear walks through every row via Rewind / Next / Ne to compare each rowid against the target. At 100 rows with 100 lookups, that is 10,000 row comparisons instead of roughly 700 B-tree steps. O(n²) instead of O(n log n). This is consistent with the ~20,000x result in this run.
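The arithmetic behind those two numbers:

```python
import math

rows, lookups = 100, 100

# Full scan: every lookup walks every row.
full_scan_comparisons = rows * lookups
assert full_scan_comparisons == 10_000

# B-tree: every lookup descends ~log2(rows) levels.
btree_steps = lookups * math.ceil(math.log2(rows))
assert btree_steps == 700  # 100 * 7
```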

Every WHERE clause on every column does a full table scan. The only fast path is WHERE rowid = ? using the literal pseudo-column name.

Bug #2: fsync on Every Statement

The second bug is responsible for the 1,857x on INSERT. Every bare INSERT outside a transaction is wrapped in a full autocommit cycle: ensure_autocommit_txn() → execute → resolve_autocommit_txn(). The commit calls wal.sync(), which calls Rust's fsync(2) wrapper. 100 INSERTs means 100 fsyncs.

SQLite does the same autocommit, but uses fdatasync(2) on Linux, which skips syncing file metadata when compiled with HAVE_FDATASYNC (the default). This is roughly 1.6 to 2.7 times cheaper on NVMe SSDs. SQLite's per-statement overhead is also minimal: no schema reload, no AST clone, no VDBE recompile. The Rust reimplementation does all three on every call.

Looking at the Rust TRANSACTION batch row, batched inserts (one fsync for 100 inserts) take 32.81 ms, whereas individual inserts (100 fsync calls) take 2,562.99 ms. That's a 78x overhead from the autocommit.

The Compound Effect

These two bugs are not isolated cases. They are amplified by a group of individually defensible "safe" choices that compound:

  • AST clone on every cache hit. The SQL parse is cached, but the AST is .clone()'d on every sqlite3_exec(), then recompiled to VDBE bytecode from scratch. SQLite's sqlite3_prepare_v2() just returns a reusable handle.

  • 4KB (Vec) heap allocation on every read. The page cache returns data via .to_vec(), which creates a new allocation and copies it into the Vec even on cache hits. SQLite returns a direct pointer into pinned cache memory, creating zero copies. The Fjall database team measured this exact anti-pattern at 44% of runtime before building a custom ByteView type to eliminate it.

  • Schema reload on every autocommit cycle. After each statement commits, the next statement sees the bumped commit counter and calls reload_memdb_from_pager(), walks the sqlite_master B-tree and then re-parses every CREATE TABLE to rebuild the entire in-memory schema. SQLite checks the schema cookie and only reloads it on change.

  • Eager formatting in the hot path. statement_sql.to_string() (AST-to-SQL formatting) is evaluated on every call before its guard check. This means it does serialization regardless of whether a subscriber is active or not.

  • New objects on every statement. A new SimpleTransaction, a new VdbeProgram, a new MemDatabase, and a new VdbeEngine are allocated and destroyed per statement. SQLite reuses all of these across the connection lifecycle via a lookaside allocator to eliminate malloc/free in the execution loop.

Each of these was probably chosen individually with sound general reasoning: "We clone because Rust ownership makes shared references complex." "We use sync_all because it is the safe default." "We allocate per page because returning references from a cache requires unsafe."

Every decision sounds like choosing safety. But the end result is about 2,900x slower in this benchmark. A database's hot path is the one place where you probably shouldn't choose safety over performance. SQLite is not primarily fast because it is written in C. Well.. that too, but it is fast because 26 years of profiling have identified which tradeoffs matter.

In his 1980 Turing Award lecture, Tony Hoare said: "There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other is to make it so complicated that there are no obvious deficiencies." This LLM-generated code falls into the second category. The reimplementation is 576,000 lines of Rust (measured via scc, counting code only, without comments or blanks). That is 3.7x more code than SQLite. And yet it still misses the is_ipk check that handles the selection of the correct search operation.

Steven Skiena writes in The Algorithm Design Manual: "Reasonable-looking algorithms can easily be incorrect. Algorithm correctness is a property that must be carefully demonstrated." It's not enough that the code looks right. It's not enough that the tests pass. You have to demonstrate with benchmarks and with proof that the system does what it should. 576,000 lines and no benchmark. That is not "correctness first, optimization later." That is no correctness at all.

Same Method, Same Result

The SQLite reimplementation is not the only example. A second project by the same author shows the same dynamic in a different domain.

The developer's LLM agents compile Rust projects continuously, filling disks with build artifacts. Rust's target/ directories consume 2–4 GB each with incremental compilation and debuginfo, a top-three complaint in the annual Rust survey. This is amplified by the projects themselves: a sibling agent-coordination tool in the same portfolio pulls in 846 dependencies and 393,000 lines of Rust. For context, ripgrep has 61; sudo-rs was deliberately reduced from 135 to 3. Properly architected projects are lean.

The solution to the disk pressure: a cleanup daemon. 82,000 lines of Rust, 192 dependencies, a 36,000-line terminal dashboard with seven screens and a fuzzy-search command palette, a Bayesian scoring engine with posterior probability calculations, an EWMA forecaster with PID controller, and an asset download pipeline with mirror URLs and offline bundle support.

To solve this problem:

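For comparison, the kind of one-liner the article has in mind. The schedule and paths are assumptions, and the cargo-sweep invocation is per its README as I understand it:

```shell
# Crontab entry: every Sunday at 03:00, delete target/ dirs untouched for 30 days
0 3 * * 0  find "$HOME/projects" -maxdepth 3 -type d -name target -mtime +30 -exec rm -rf {} +

# Or, with the ecosystem tool the README never mentions:
# 0 3 * * 0  cargo sweep --time 30 "$HOME/projects"
```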

A one-line cron job with 0 dependencies. The project's README claims machines "become unresponsive" when disks fill. It does not once mention Rust's standard tool for exactly this problem: cargo-sweep. It also fails to consider that operating systems already carry ballast helpers. ext4's 5% root reservation reserves blocks for privileged processes by default: on a 500 GB disk, 25 GB remain available to root even when non-root users see "disk full." That does not guarantee zero impact, but it usually means privileged recovery paths remain available so root can still log in and delete files.

The pattern is the same as the SQLite rewrite. The code matches the intent: "Build a sophisticated disk management system" produces a sophisticated disk management system. It has dashboards, algorithms, forecasters. But the problem of deleting old build artifacts is already solved. The LLM generated what was described, not what was needed.

THIS is the failure mode. Not broken syntax or missing semicolons. The code is syntactically and semantically correct. It does what was asked for. It just does not do what the situation requires. In the SQLite case, the intent was "implement a query planner" and the result is a query planner that plans every query as a full table scan. In the disk daemon case, the intent was "manage disk space intelligently" and the result is 82,000 lines of intelligence applied to a problem that needs none. Both projects fulfill the prompt. Neither solves the problem.

The obvious counterargument is "skill issue, a better engineer would have caught the full table scan." And that's true. That's exactly the point! LLMs are most dangerous to the people least equipped to verify their output. If you have the skills to catch the is_ipk bug in your query planner, the LLM saves you time. If you don't, you have no way to know the code is wrong. It compiles, it passes tests, and the LLM will happily tell you it looks great.

Measuring the Wrong Thing

The tools used to measure LLM output reinforce the illusion. scc's COCOMO model estimates the rewrite at $21.4 million in development cost. The same model values print("hello world") at $19.

https://sqlite.org/cpu.html

COCOMO was designed to estimate effort for human teams writing original code. Applied to LLM output, it mistakes volume for value. Still, these numbers are often presented as proof of productivity.
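The mistake is visible in the model itself. Basic COCOMO is a power law of size alone, so output volume is the only input that matters. In the sketch below, the coefficients are the classic organic-mode values and the dollars-per-person-month rate is an illustrative assumption, not scc's exact configuration:

```python
def cocomo_effort_pm(kloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Basic COCOMO, organic mode: effort in person-months = a * KLOC^b."""
    return a * kloc ** b

# 576,000 lines vs. a one-line script, priced at an assumed $11,000/person-month.
rate = 11_000
rewrite_cost = cocomo_effort_pm(576) * rate      # on the order of $20M
hello_cost = cocomo_effort_pm(0.001) * rate      # on the order of $20
print(f"rewrite: ${rewrite_cost:,.0f}")
print(f"hello:   ${hello_cost:,.2f}")
```

The formula never asks whether a single query works. Generate ten times the lines and the "value" grows slightly faster than tenfold, regardless of whether the code scans every row on every lookup.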

https://openai.com/index/sycophancy-in-gpt-4o/

The metric is not measuring what most think it is measuring.

Intent vs. Correctness

This gap between intent and correctness has a name. AI alignment research calls it sycophancy, which describes the tendency of LLMs to produce outputs that match what the user wants to hear rather than what they need to hear.

Anthropic's paper "Towards Understanding Sycophancy in Language Models" (ICLR 2024) showed that five state-of-the-art AI assistants exhibited sycophantic behavior across a range of tasks. When a response matched a user's expectation, human evaluators were more likely to prefer it. Models trained on this feedback learned to reward agreement over correctness.

The BrokenMath benchmark (NeurIPS 2025 Math-AI Workshop) tested this in formal reasoning across 504 samples. Even GPT-5 produced sycophantic "proofs" of false theorems 29% of the time when the user implied the statement was true. The model generates a convincing but false proof because the user signaled that the conclusion should be positive. GPT-5 is not an early model. It's also the least sycophantic in the BrokenMath table. The problem is structural to RLHF: preference data contains an agreement bias. Reward models learn to score agreeable outputs higher, and optimization widens the gap. Base models before RLHF were reported in one analysis to show no measurable sycophancy across tested sizes. Only after fine-tuning did sycophancy enter the chat. (literally)

In April 2025, OpenAI rolled back a GPT-4o update that had made the model more sycophantic. It gushed over a business idea described as "shit on a stick" and endorsed a user's decision to stop psychiatric medication. An additional reward signal based on thumbs-up/thumbs-down data "weakened the influence of [...] primary reward signal, which had been holding sycophancy in check."

In the context of coding, sycophancy manifests as what Addy Osmani described in his 2026 AI coding workflow: agents that don't push back with "Are you sure?" or "Have you considered...?" but instead provide enthusiasm towards whatever the user described, even when the description was incomplete or contradictory.

This also applies to LLM-generated evaluation. Ask the same LLM to review the code it generated and it will tell you the architecture is sound, the module boundaries clean, and the error handling thorough. It will sometimes even praise the test coverage. It will not notice that every query does a full table scan unless you ask. The same RLHF reward that makes the model generate what you want to hear makes it evaluate the way you want to hear. You should not rely on the tool alone to audit itself. It has the same bias as a reviewer that it has as an author.

An LLM prompted to "implement SQLite in Rust" will generate code that looks like an implementation of SQLite in Rust. It will have the right module structure and function names. But it cannot magically generate the performance invariants that exist because someone profiled a real workload and found the bottleneck. The Mercury benchmark (NeurIPS 2024) confirmed this empirically: leading code LLMs achieve ~65% on correctness but under 50% when efficiency is also required.

The SQLite documentation says INTEGER PRIMARY KEY lookups are fast. It does not say how to build a query planner that makes them fast. Those details live in 26 years of commit history that only exists because real users hit real performance walls.
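To make that invariant concrete, here is a hypothetical toy planner. ColumnInfo, is_ipk, and the storage layout are illustrative, not the project's actual code; the single if is the entire difference between a keyed lookup and a full scan:

```python
from dataclasses import dataclass

@dataclass
class ColumnInfo:
    name: str
    is_ipk: bool = False  # declared INTEGER PRIMARY KEY, i.e. an alias for rowid

class Table:
    def __init__(self, columns):
        self.columns = {c.name: c for c in columns}
        self.by_rowid = {}  # stand-in for the rowid B-tree: keyed access
        self.rows = []      # heap order: the O(n) scan path

    def insert(self, rowid, row):
        self.by_rowid[rowid] = row
        self.rows.append(row)

    def lookup(self, col, value):
        # The one-line invariant: an equality match on the INTEGER PRIMARY KEY
        # column must take the keyed path. Omit this check and every
        # primary-key lookup silently degrades to a full table scan.
        if self.columns[col].is_ipk:
            return self.by_rowid.get(value)
        return next((r for r in self.rows if r[col] == value), None)
```

Setting is_ipk correctly at schema time and never consulting it at plan time, as the rewrite reportedly did, leaves only the scan branch reachable.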

Now, two case studies are not proof. I hear you! But when two projects from the same methodology show the same gap, the next step is to test whether similar effects appear in the broader population. The studies below use mixed methods to reduce the single-sample bias.

Evidence Beyond Case Studies

The question becomes whether similar effects show up in broader datasets. Recent studies suggest they do, though effect sizes vary.

In February 2025, Andrej Karpathy tweeted: "There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."

Karpathy probably meant it for throwaway weekend projects (who am I to judge what he means anyway), but it feels like the industry heard something else. Simon Willison drew the line more clearly: "I won't commit any code to my repository if I couldn't explain exactly what it does to somebody else." Willison treats LLMs as "an over-confident pair programming assistant" that makes mistakes "sometimes subtle, sometimes huge" with complete confidence.

The data on what happens when that line is not drawn:

METR's randomized controlled trial (July 2025; updated February 24, 2026) with 16 experienced open-source developers found that participants using AI were 19% slower, not faster. Developers expected AI to speed them up, and after the measured slowdown had already occurred, they still believed AI had sped them up by 20%. These were not junior developers but experienced open-source maintainers. If even THEY could not tell in this setup, subjective impressions alone are probably not a reliable performance measure.

GitClear's analysis of 211 million changed lines (2020–2024) reported that copy-pasted code increased while refactoring declined. For the first time ever, copy-pasted lines exceeded refactored lines.

The risk is no longer hypothetical. In July 2025, Replit's AI agent deleted a production database containing data for 1,200+ executives, then fabricated 4,000 fictional users to mask the deletion.

Google's DORA 2024 report reported that every 25% increase in AI adoption at the team level was associated with an estimated 7.2% decrease in delivery stability.

What Competent Looks Like

SQLite shows what correct looks like and why the gap is so hard to close.

SQLite is ~156,000 lines of C. Its own documentation places it among the top five most deployed software modules of any type, with an estimated one trillion active databases worldwide. It has 100% branch coverage and 100% MC/DC (Modified Condition/Decision Coverage, the standard required for Level A aviation software under DO-178C). Its test suite is 590 times larger than the library. MC/DC does not just check that every branch is covered; it proves that every individual condition independently affects the outcome. That's the difference between "the tests pass" and "the tests prove correctness." The reimplementation has neither metric.
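The distinction fits in a toy two-condition branch (an illustrative example, not SQLite's harness):

```python
def can_write(locked: bool, readonly: bool) -> bool:
    return not locked and not readonly

# Branch coverage is satisfied by observing both outcomes:
assert can_write(False, False) is True
assert can_write(True, True) is False
# ...but the (True, True) test never shows WHICH condition forced the False.

# MC/DC additionally requires that each condition independently flips the
# outcome while the other is held fixed:
assert can_write(False, False) != can_write(True, False)   # locked matters
assert can_write(False, False) != can_write(False, True)   # readonly matters
```

A stuck condition, say `readonly` accidentally ignored, passes the first pair of tests and fails the MC/DC pair. That is the sense in which MC/DC proves rather than samples.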

The speed comes from deliberate decisions:

Zero-copy page cache. The pcache returns direct pointers into pinned memory. No copies. Production Rust databases have solved this too. sled uses inline-or-Arc-backed IVec buffers, Fjall built a custom ByteView type, redb wrote a user-space page cache in ~565 lines. The .to_vec() anti-pattern is known and documented. The reimplementation used it anyway.

Prepared statement reuse. sqlite3_prepare_v2() compiles once. sqlite3_step() / sqlite3_reset() reuse the compiled code, so the cost of SQL-to-bytecode compilation amortizes to near zero. The reimplementation recompiles on every call.

Schema cookie check. SQLite reads one integer at a fixed offset in the file header and compares it to detect schema changes. The reimplementation walks the entire sqlite_master B-tree and re-parses every CREATE TABLE statement after every autocommit.

fdatasync instead of fsync. Data-only sync without metadata journaling saves measurable time per commit. The reimplementation uses sync_all() because it is the safe default.

The iPKey check. One line in where.c. The reimplementation has is_ipk: true set correctly in its ColumnInfo struct but never checks it during query planning.
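Several of these decisions translate directly to higher-level bindings. In Python's stdlib sqlite3, for example, reusing one parameterized SQL string hits the connection's compiled-statement cache (on by default, cached_statements=128), while formatting a fresh SQL string per call forces a recompile every time. A sketch of the reuse pattern, not a benchmark of the article's numbers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

# Good: one SQL string, bound parameters. The compiled statement is
# cached by the connection and reused across all 1,000 calls.
for i in range(1000):
    con.execute("INSERT INTO t (id, v) VALUES (?, ?)", (i, f"v{i}"))

# Bad: a fresh SQL string per call defeats the statement cache and
# recompiles every time (and invites SQL injection):
#   con.execute(f"INSERT INTO t (id, v) VALUES ({i}, 'v{i}')")

count = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # → 1000
```

The same shape applies in any binding: hold the compiled statement, rebind parameters, reset, repeat.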

Competence is not writing 576,000 lines. A database persists and processes data. That is all it does, and it must do it reliably at scale. The difference between O(log n) and O(n) on the most common access pattern is not an optimization detail; it is the performance invariant that keeps the system working at 10,000, 100,000, or 1,000,000 rows instead of collapsing. Knowing that this invariant lives in one line of code, and knowing which line, is what competence means. So is knowing that fdatasync exists and that the safe default is not always the right default.
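The fsync/fdatasync distinction is observable from any language with POSIX bindings. A minimal sketch in Python, where the fallback to os.fsync covers platforms without fdatasync (error handling elided for brevity):

```python
import os

def durable_write(path: str, payload: bytes) -> None:
    """Write payload and force it to stable storage, preferring fdatasync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        # fdatasync flushes file data but skips non-essential metadata
        # (e.g. mtime), typically one less journaled write per commit
        # than fsync. It is POSIX but not available on every platform.
        datasync = getattr(os, "fdatasync", os.fsync)
        datasync(fd)
    finally:
        os.close(fd)
```

Choosing between the two is exactly the kind of decision that requires knowing why the safe default costs a journaled metadata write on every commit.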

Measure What Matters

The is_rowid_ref() function is 4 lines of Rust. It checks three strings. But it misses the most important case: the named INTEGER PRIMARY KEY column that every SQLite tutorial uses and every application depends on.

That check exists in SQLite because someone, probably Richard Hipp 20 years ago, profiled a real workload, noticed that named primary key columns were not hitting the B-tree search path, and wrote one line in where.c to fix it. The line is not fancy. It doesn't appear in any API documentation. But no LLM trained on documentation and Stack Overflow answers will magically know about it.
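You do not need the source to check this invariant: SQLite reports its chosen plan. EXPLAIN QUERY PLAN on a named INTEGER PRIMARY KEY column should show a rowid SEARCH, not a SCAN. The sketch below uses Python's stdlib sqlite3; the exact detail wording varies slightly across SQLite versions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# The named IPK column is an alias for the rowid, so an equality match
# plans as a keyed B-tree search...
detail = con.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE id = ?", (42,)
).fetchone()[-1]
print(detail)  # e.g. "SEARCH users USING INTEGER PRIMARY KEY (rowid=?)"

# ...while a match on an unindexed column plans as a full scan.
scan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE name = ?", ("x",)
).fetchone()[-1]
print(scan)    # e.g. "SCAN users"
```

A one-assertion test over this output, SEARCH on the primary key, would have caught the 20,000x regression on the first run.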

That's the gap! Not between C and Rust (or any other language). Not between old and new. But between systems that were built by people who measured, and systems that were built by tools that pattern-match. LLMs produce plausible architecture. They do not produce all the critical details.

If you are using LLMs to write code (which in 2026 probably most of us are), the question is not whether the output compiles. It is whether you could find the bug yourself. Prompting with "find all bugs and fix them" won't work. This is not a syntax error. It is a semantic bug: the wrong algorithm and the wrong syscall. If you prompted the code and cannot explain why it chose a full table scan over a B-tree search, you do not have a tool. The code is not yours until you understand it well enough to break it.

LLMs are useful. They make for a very productive flow when the person using them knows what correct looks like. An experienced database engineer using an LLM to scaffold a B-tree would have caught the is_ipk bug in code review because they know what a query plan should emit. An experienced ops engineer would never have accepted 82,000 lines instead of a cron job one-liner. The tool is at its best when the developer can define the acceptance criteria as specific, measurable conditions that help distinguish working from broken. Using the LLM to generate the solution in this case can be faster while also being correct. Without those criteria, you are not programming but merely generating tokens and hoping.

The vibes are not enough. Define what correct means. Then measure.

Stay safe out there!

- Hōrōshi バガボンド

Also published on Substack: https://blog.katanaquant.com/p/your-llm-doesnt-write-correct-code

Current benchmark figures in this revision are from the 100-row run shown in bench.png (captured on a Linux x86_64 machine). SQLite 3.x (system libsqlite3) vs. the Rust reimplementation's C API (release build, -O2). Line counts measured via scc (code only — excluding blanks and comments). All source code claims verified against the repository at time of writing.

Sources

Primary Research

  • Sharma, M. et al. "Towards Understanding Sycophancy in Language Models." ICLR 2024.

  • Shapira, Benade, Procaccia. "How RLHF Amplifies Sycophancy." arXiv, 2026.

  • BrokenMath: "A Benchmark for Sycophancy in Theorem Proving." NeurIPS 2025 Math-AI Workshop.

  • Mercury: "A Code Efficiency Benchmark." NeurIPS 2024.

  • "Unveiling Inefficiencies in LLM-Generated Code." arXiv, 2025.

  • METR. "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity." July 2025 (updated February 24, 2026).

  • GitClear. "AI Code Quality Research 2025." 2025.

  • Google. "DORA Report 2024." 2024.

Industry Commentary

  • Osmani, A. "My LLM Coding Workflow Going Into 2026." addyosmani.com.

  • Willison, S. "How I Use LLMs for Code." March 2025.

  • OpenAI. "Sycophancy in GPT-4o: What Happened." April 2025.

  • Karpathy, A. "Vibe Coding." February 2, 2025.

Incidents

  • Replit database deletion. The Verge, July 2025.

Rust Ecosystem

  • Rust Foundation. "2024 State of Rust Survey Results." February 2025.

  • ISRG / Thalheim, J. "Reducing Dependencies in sudo-rs." memorysafety.org.

Database Engineering

  • SQLite Documentation: rowidtable.html, queryplanner.html, cpu.html, testing.html, mostdeployed.html, malloc.html, cintro.html, pcache_methods2, fileformat.html, fileformat2.html

  • Callaghan, M. "InnoDB, fsync and fdatasync — Reducing Commit Latency." Small Datum, 2020.

  • Gunther, N. "Universal Scalability Law." perfdynamics.com.

  • Fjall. "ByteView: Eliminating the .to_vec() Anti-Pattern." fjall-rs.github.io.

  • sled — embedded database with inline-or-Arc-backed IVec.

  • redb — pure-Rust embedded database with user-space page cache.

Books Referenced

  • Skiena, S.S. The Algorithm Design Manual. 3rd ed. Springer, 2020.

  • Winand, M. SQL Performance Explained. Self-published, 2012.

  • Hoare, C.A.R. "The Emperor's Old Clothes." Communications of the ACM 24(2), 1981. (1980 Turing Award Lecture)

Link: http://x.com/i/article/2029917440348426240
