DeepSeek Overtakes Opus: The Key Is the Harness, Not the Model

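The most valuable judgment in this post is that when open-source models are bad at tool calling, the cause is often not a lack of "intelligence" but a harness contract that is designed too rigidly. The author, however, stretches "engineering compensation works" into "the model's capability is fine," and that conclusion clearly overreaches.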

2026-05-04

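Key Points

Four error patterns cover about 90% of open-source model tool-call failures. From observing billions of tokens, the author reduces the tool-call failures of DeepSeek, GLM, Qwen, and similar models to four composable shape errors (passing null, JSON-stringified arrays, a single argument wrongly wrapped in an object, and a bare string not boxed into an array). Each fix takes only 30 to 100 lines and directly improves usability.
RLHF leakage is an "alignment tax" that can be physically isolated at the schema layer. When a model emits a path as a Markdown auto-link, that is not a hallucination; it is the post-training reward from the chat distribution leaking across the tool boundary. Replacing `z.string()` with `pathString()` at the schema layer plugs this kind of degenerate case without touching the model.
"Validate-then-repair" is a better defensive architecture than "preprocess first." Letting the validator fail first and locate the issue path, then applying targeted repairs, completely avoids silently tampering with valid input (such as writeFile content being rewritten by mistake), and comes with repair-rate telemetry per model × tool built in.
Relational invariants should be handled through semantic extension and transparent feedback. For field-dependency errors such as offset/limit, forcibly repairing the input is not feasible. Filling in a default and returning the system's decision ("limit defaulted to 2000") to the model as a Note lets the model use its in-context learning (ICL) ability to self-correct, which works better than blocking with a red Error.
The friction cost of strict schemas wrongly dismisses open-source models. Closed-source giants rely on pretraining to "quietly absorb" the cost of understanding the contract, while open-source models get judged as unable to do tool calling because the harness lacks fault tolerance. Lenient design at the harness layer is a necessary condition for unlocking the agent potential of open-source models, not a sufficient one.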

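Relevance to Us

For Neta, the reliability bottleneck of an agent usually sits at the harness boundary layer rather than the LLM inference layer. The next step should be to enforce "validate-then-repair" middleware in the tool-call path and to build fine-grained monitoring by model × tool × error type, so that warnings fire before a model regression reaches users.
For Uota, "transparency over silent magic" transfers directly to human-computer interaction design. When the system makes a default decision on the user's behalf, it must return the decision logic explicitly (for example, "defaulted to offset=0"), which avoids an Error that blocks the conversation flow while preserving the user's ability to correct it.
For ATou, when evaluating the tool-calling ability of open-source versus closed-source models, evaluators must be required to disclose the harness layer's leniency and error-repair strategy. Otherwise, "DeepSeek with training wheels versus Opus running bare" will seriously distort procurement and model-selection decisions.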

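Discussion Prompts

1. If Opus 4.7 were given the same harness repair layer, would its win rate also rise substantially, dissolving the "open source overtakes" conclusion entirely?
2. Once silent fault tolerance at the harness layer becomes standard, what mechanism should an organization set up to tell "engineering scaffolding is masking real model defects" apart from "the model is already capable enough"?
3. In tool calls for high-risk operations (such as bulk deletion or fund transfers), the lenient "validate-then-repair" strategy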

how did we make deepseek outperform opus 4.7?

i've been thinking about why "open model bad at tool calling" is almost always a harness problem, not a model problem.

context: spent two days looking at billions of tokens in @CommandCodeAI (tb open source ai cli) using deepseek. i ended up writing a tool-input repair layer. the trigger was watching deepseek-flash fail on the simplest /review run: every shellCommand and readFile call bounced back with a raw zod issues blob, and the model couldn't recover because the error wasn't in a form it could read. by the end, deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals.

a few things i learned that feel general:

1/ the failure modes aren't random; they're a small, finite, compositional set.

across deepseek-flash, deepseek v4 pro, glm, qwen, the same four mistakes repeat almost exactly:
- sending null for an optional field instead of omitting it
- emitting ["a","b"] as a json string instead of an actual array
- wrapping a single arg in {} where the schema expected an array (an "empty placeholder")
- passing a bare string where an array was expected ("foo" instead of ["foo"])

four repairs, ~30-100 lines each, ordered carefully (json-array-parse must run before bare-string-wrap or '["a","b"]' becomes ['["a","b"]']). that is the whole catalogue. when i hear "this open source model can't do tool calls" i now assume one of those four, and so far that's been right ~90% of the time.
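a minimal sketch of the four repairs and why their order matters. the post doesn't show its implementation, so every name here (`repairValue`, `SKIP`, the `Expected` hint) is hypothetical, and the "empty placeholder" case is assumed to mean a literal empty `{}` standing in for an empty array.

```typescript
// Hypothetical sketch; names and the Expected hint are assumptions, not the post's code.
type Expected = "array" | "optional";

// Sentinel returned when a repair does not apply (compared by reference).
const SKIP: object = Object.freeze({});

type Repair = { name: string; apply: (value: unknown, expected: Expected) => unknown };

function isEmptyObject(value: unknown): boolean {
  return (
    typeof value === "object" &&
    value !== null &&
    !Array.isArray(value) &&
    Object.keys(value).length === 0
  );
}

// Order matters: json-array-parse must come before bare-string-wrap,
// otherwise '["a","b"]' would be wrapped into ['["a","b"]'].
const repairs: Repair[] = [
  {
    name: "null-to-omitted",
    apply: (v, e) => (e === "optional" && v === null ? undefined : SKIP),
  },
  {
    name: "json-array-parse",
    apply: (v, e) => {
      if (e !== "array" || typeof v !== "string") return SKIP;
      try {
        const parsed: unknown = JSON.parse(v);
        return Array.isArray(parsed) ? parsed : SKIP;
      } catch {
        return SKIP; // not valid JSON, let a later repair try
      }
    },
  },
  {
    // Assumed reading of the "empty placeholder" case: {} where [] was meant.
    name: "empty-object-to-array",
    apply: (v, e) => (e === "array" && isEmptyObject(v) ? [] : SKIP),
  },
  {
    name: "bare-string-wrap",
    apply: (v, e) => (e === "array" && typeof v === "string" ? [v] : SKIP),
  },
];

// Try each repair in order and return the first one that applies, or null.
function repairValue(
  value: unknown,
  expected: Expected
): { value: unknown; repair: string } | null {
  for (const repair of repairs) {
    const out = repair.apply(value, expected);
    if (out !== SKIP) return { value: out, repair: repair.name };
  }
  return null;
}

console.log(repairValue('["a","b"]', "array")); // parsed by json-array-parse into a real array
console.log(repairValue("foo", "array")); // wrapped by bare-string-wrap into ["foo"]
```

swapping the second and fourth entries is enough to reproduce the `['["a","b"]']` bug the ordering is meant to avoid.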

2/ the funniest failure mode is also the most revealing.

deepseek-flash, when asked to edit or write a file, sometimes emits the path as a markdown auto-link:

filePath: "/Users/x/proj/[notes.md](http://notes.md)"

our writeFile tool obediently tried creating files literally named [notes.md](http://notes.md) until we caught it. this is not a hallucination. it's the post-training chat distribution leaking through the tool boundary: the model has been rewarded for auto-linking in conversational output, and is applying that prior in a context where it makes no sense. the fix is two regex lines that unwrap only the degenerate case where link text equals url-without-protocol; real markdown like [click](https://x.com) passes through untouched.

this also comes from conditioning on their own tools during RL, which were different from the tools the rest of us write and which, ofc, can't be predicted in advance.

"tool confusion" is a more useful frame than "capability gap." the model knows how to format a path. it just hasn't been told clearly enough that this path is going to fopen, not into a chat bubble. so we encode that hint at the schema level, pathString() instead of z.string(), and the leak is plugged for every path field at once.
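a rough sketch of the auto-link unwrap and of hanging it on a path-specific field type. this is dependency-free on purpose: `FieldType`, `plainString()` and this `pathString()` are hypothetical stand-ins, not zod's API and not the post's actual code, and the exact regexes are a guess at what "two regex lines" could look like (the trailing-slash handling in particular is an assumption).

```typescript
// Hypothetical stand-in for a schema field type; parse throws on invalid input.
interface FieldType<T> {
  parse(input: unknown): T;
}

// Regex line 1: a markdown link [text](http(s)://rest), capturing text and rest.
const AUTO_LINK = /\[([^\]]+)\]\(https?:\/\/([^)\s]+)\)/g;

// Unwrap only the degenerate case where the link text equals the url without
// its protocol. Anything else is returned exactly as it came in.
function unwrapDegenerateAutoLinks(path: string): string {
  return path.replace(AUTO_LINK, (whole: string, text: string, rest: string) => {
    // Regex line 2 (assumption): ignore a single trailing slash on the url.
    const bare = rest.replace(/\/$/, "");
    return text === bare ? text : whole;
  });
}

// Plain string field: accepts any string unchanged.
function plainString(): FieldType<string> {
  return {
    parse(input: unknown): string {
      if (typeof input !== "string") throw new TypeError("expected a string");
      return input;
    },
  };
}

// Path field: a string that is headed for the filesystem, so chat-style
// auto-links are unwrapped once here instead of in every tool.
function pathString(): FieldType<string> {
  const base = plainString();
  return {
    parse(input: unknown): string {
      return unwrapDegenerateAutoLinks(base.parse(input));
    },
  };
}

console.log(pathString().parse("/Users/x/proj/[notes.md](http://notes.md)"));
// prints "/Users/x/proj/notes.md"
```

the point of putting it in the field type is that a tool author declares "this is a path" once, and content fields that legitimately contain markdown keep using the plain string type.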

3/ the design choice that mattered was inverting preprocess-then-validate to validate-then-repair.

my first attempt was the obvious one: a preprocessing pass that normalized inputs (strip nulls, parse stringified arrays, etc.) before zod ever saw them. it broke immediately: writeFile content that happened to be json-shaped got rewritten before it hit disk. silent corruption, easy to miss in a smoke test.

then i made it less greedy:
- parse the input as-is. if it succeeds, ship it. valid inputs are never touched.
- on failure, walk the validator's own issue list. for each issue path, try the four repairs in order until one applies.
- parse again. on success, log tool_input_repaired:${toolName}. on failure, log tool_input_invalid:${toolName} and return a model-readable retry message.
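the three steps above can be sketched roughly like this. the real layer sits on zod; here a tiny hand-rolled `validate` stands in for it so the example runs on its own, and all names (`parseToolInput`, `tryRepair`, `ToolSchema`) plus the wording of the retry message are hypothetical. the repair set is condensed to three cases to keep it short.

```typescript
// Hypothetical stand-in validator and types; not zod and not the post's code.
type FieldKind = "string" | "string[]";
type ToolSchema = Record<string, { kind: FieldKind; optional?: boolean }>;
type ToolInput = Record<string, unknown>;
type ValidationIssue = { path: string; expected: FieldKind; optional: boolean };

function validate(schema: ToolSchema, input: ToolInput): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (const path of Object.keys(schema)) {
    const spec = schema[path];
    if (spec === undefined) continue;
    const optional = spec.optional === true;
    const value = input[path];
    if (value === undefined && optional) continue;
    const ok =
      spec.kind === "string"
        ? typeof value === "string"
        : Array.isArray(value) && value.every((item) => typeof item === "string");
    if (!ok) issues.push({ path, expected: spec.kind, optional });
  }
  return issues;
}

// Condensed repairs, tried in order; only ever called for a path that failed.
function tryRepair(
  value: unknown,
  issue: ValidationIssue
): { applied: true; value: unknown } | { applied: false } {
  if (value === null && issue.optional) return { applied: true, value: undefined };
  if (issue.expected === "string[]" && typeof value === "string") {
    try {
      const parsed: unknown = JSON.parse(value);
      if (Array.isArray(parsed)) return { applied: true, value: parsed };
    } catch {
      // not JSON; fall through to wrapping the bare string
    }
    return { applied: true, value: [value] };
  }
  return { applied: false };
}

type Outcome =
  | { status: "ok" | "repaired"; input: ToolInput }
  | { status: "invalid"; message: string };

function parseToolInput(
  toolName: string,
  schema: ToolSchema,
  raw: ToolInput,
  log: (event: string) => void
): Outcome {
  // Step 1: parse as-is. Valid inputs are returned untouched.
  const issues = validate(schema, raw);
  if (issues.length === 0) return { status: "ok", input: raw };

  // Step 2: repair only at the paths the validator complained about.
  const candidate: ToolInput = { ...raw };
  for (const issue of issues) {
    const result = tryRepair(candidate[issue.path], issue);
    if (!result.applied) continue;
    if (result.value === undefined) delete candidate[issue.path];
    else candidate[issue.path] = result.value;
  }

  // Step 3: parse again and log the outcome.
  const remaining = validate(schema, candidate);
  if (remaining.length === 0) {
    log(`tool_input_repaired:${toolName}`);
    return { status: "repaired", input: candidate };
  }
  log(`tool_input_invalid:${toolName}`);
  const details = remaining.map((i) => `"${i.path}" must be ${i.expected}`).join("; ");
  return {
    status: "invalid",
    message: `Invalid input for ${toolName}: ${details}. Please retry with corrected arguments.`,
  };
}
```

because repairs are keyed on issue paths, a writeFile call whose `content` is the string `'["a","b"]'` validates on the first pass and is never rewritten, which is exactly the corruption the preprocessing version caused.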

the structural insight here is: when you preprocess, you encode a prior about what's broken. when you let the validator complain first, the schema is the prior, and you only spend repair budget at the exact paths where the schema actually disagreed. the validator is doing the work of localizing the bug for you. it's the same shape as cheap-then-careful everywhere else: try the fast path, fall back on evidence.

(this also gives you per-tool telemetry for free. you can watch repair rates per (model, tool) and notice when a model regresses on a specific contract before users do.)
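one way the per-(model, tool) repair rate could be tracked, as a small in-memory sketch. `RepairTelemetry`, its method names, and the threshold-based `flagged` check are all hypothetical; the post only says the log events make this kind of view possible.

```typescript
// Hypothetical in-memory counter keyed by (model, tool).
type CallStatus = "ok" | "repaired" | "invalid";
type CallCounts = { total: number; repaired: number; invalid: number };

class RepairTelemetry {
  private counts: Record<string, CallCounts> = {};

  // Record the outcome of one tool call for a (model, tool) pair.
  record(model: string, tool: string, status: CallStatus): void {
    const key = model + "|" + tool;
    const entry = this.counts[key] || { total: 0, repaired: 0, invalid: 0 };
    entry.total += 1;
    if (status === "repaired") entry.repaired += 1;
    if (status === "invalid") entry.invalid += 1;
    this.counts[key] = entry;
  }

  // Fraction of calls that needed a repair; 0 when nothing was recorded.
  repairRate(model: string, tool: string): number {
    const entry = this.counts[model + "|" + tool];
    return entry ? entry.repaired / entry.total : 0;
  }

  // Pairs whose repair rate exceeds the threshold after at least minCalls calls.
  flagged(threshold: number, minCalls: number): string[] {
    return Object.keys(this.counts).filter((key) => {
      const entry = this.counts[key];
      return (
        entry !== undefined &&
        entry.total >= minCalls &&
        entry.repaired / entry.total > threshold
      );
    });
  }
}
```

in practice the counts would go to whatever metrics backend is already in place; the shape that matters is that every call is tagged with model, tool, and outcome.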

4/ shape invariants and relational invariants need different fixes.

the four repairs above all handle shape problems: wrong type, missing key, wrong container. but read_file had a relational invariant: "if you provide offset, you must also provide limit, and vice versa." deepseek kept calling readFile({ absolutePath, limit: 30 }) and getting an ERROR: back. you can't fix this with input repair, because each field is independently valid; the bug is in the relationship between them.

so i taught the function the model's intent instead. limit alone → offset = 0. offset alone → limit = 2000 (matches common read tool ops default). then surfaced the decision back to the model in the result:

"Note: limit was not provided; defaulted to 2000 lines. To read more or fewer lines, retry with both offset and limit."

no Error: prefix, so the tui doesn't paint it red. the model sees what we picked and can self-correct on the next turn if our guess was wrong. transparency over silent magic wins big.
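a small sketch of the relational default plus the surfaced note. `resolveReadArgs` and `withNote` are hypothetical names; the limit note is the one quoted above, while the wording of the offset note is an assumption since the post doesn't quote it.

```typescript
// Hypothetical argument shape; field names follow the readFile call in the post.
type ReadArgs = { absolutePath: string; offset?: number; limit?: number };
type ResolvedRead = ReadArgs & { note?: string };

const DEFAULT_LIMIT = 2000; // default used when only offset is given

// Fill in the missing half of the offset/limit pair instead of rejecting the call.
function resolveReadArgs(args: ReadArgs): ResolvedRead {
  const hasOffset = args.offset !== undefined;
  const hasLimit = args.limit !== undefined;
  if (hasLimit && !hasOffset) {
    return {
      ...args,
      offset: 0,
      // Assumed wording, mirroring the limit note.
      note: "Note: offset was not provided; defaulted to 0. To read a different range, retry with both offset and limit.",
    };
  }
  if (hasOffset && !hasLimit) {
    return {
      ...args,
      limit: DEFAULT_LIMIT,
      note:
        "Note: limit was not provided; defaulted to " +
        DEFAULT_LIMIT +
        " lines. To read more or fewer lines, retry with both offset and limit.",
    };
  }
  return { ...args }; // both or neither given: nothing to decide
}

// Prepend the note to the tool result so the model sees the decision.
// The prefix is "Note:", never "Error:", so a tui won't paint it as a failure.
function withNote(resolved: ResolvedRead, fileText: string): string {
  return resolved.note ? resolved.note + "\n" + fileText : fileText;
}

const resolved = resolveReadArgs({ absolutePath: "/tmp/a.txt", limit: 30 });
console.log(resolved.offset); // prints 0
```

the model gets the file content it asked for and a one-line account of what was assumed, so a wrong guess costs one extra turn instead of a dead end.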

repair where you can. extend semantics where you can't. surface the choice either way.

zoom out:

a lot of what looks like model capability is actually contract design. a strict schema is a choice with a cost: it filters out noise, but it also filters out recoverable noise from any model that hasn't memorized the exact json contract you happened to pick. the largest commercial models eat that cost invisibly and are lenient on tool calling because they've seen enough of every contract during pretraining; open models pay it loudly and get dismissed for it.

the harness is where you mediate between distributions. four small repairs (i'm sure more to follow as we have three more merging today), two regex lines for auto-links, one relational default, one prefix change. the model didn't change. the contract got more forgiving in exactly the places it needed to be.

deepseek v4 pro now beats opus 4.7 6/10 times on our internal evals.

imo "skill issue" applies to the harness more often than the model.