DeepSeek overtakes Opus: the key is the harness, not the model

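The most valuable judgment in this article is that when open-source models do badly at tool calling, the cause is often not a lack of "intelligence" but a harness whose contract is designed too rigidly. The author, however, stretches "engineering compensation works" into "there is nothing wrong with the model's capability", and that conclusion clearly overreaches.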

2026-05-04

Key points

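  • Most failures are contract mismatches, not random breakdowns: The author's most defensible point is that tool-call failures concentrate in a handful of shape errors: stray `null` values, stringified arrays, objects and arrays swapped as containers, and single values not wrapped in an array. This is a strong claim because it reduces a "black-box model problem" to an "interface tolerance problem", which means engineering can raise scores substantially instead of the only option being a more expensive model.
  • "Validate, then repair" is more mature than "preprocess, then validate": This is the hardest engineering insight in the piece. Global preprocessing up front can damage valid input and even cause silent corruption; letting the schema report errors first and then repairing at the specific issue paths is safer and more observable. The approach lowers the false-repair rate and also yields telemetry as a by-product, showing which model is regressing on which tool contract.
  • Models leak their chat distribution into the tool boundary: The example of a path emitted as a Markdown auto-link is absurd, but the judgment is right: formatting habits that were rewarded in chat contaminate execution contexts. Calling the model "dumb" does not solve this. The schema and tool layer must explicitly separate "display strings" from "execution parameters", or the error will keep recurring.
  • Shape repair and semantic completion are different things: The author's distinction between shape invariants and relational invariants is a very practical framework. Shape errors such as type, container, and nullable mismatches suit automatic repair; dependencies between fields should not be force-repaired but handled with transparent defaults whose decision is reported back to the model. The judgment is sound, but the risk is real: when a default is guessed wrong, the system is making a wrong decision on the model's behalf.
  • "DeepSeek overtakes Opus" is a strong narrative, not strong evidence: The author's headline rests on "winning 6 out of 10 times on internal evals", which spreads well but is clearly insufficient as an argument. The sample is tiny, the benchmark is opaque, and nothing is said about whether the comparison was fair. In particular, the repair layer was tuned to DeepSeek's common errors, so this looks more like a "local reversal under a specific harness" than grounds for a general capability ranking.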

How this relates to us

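  • What it means for ATou and how to use it next: When building AI products, do not pour most of the effort into prompts and model comparisons; what separates experiences is often the harness. A next step is to adopt "validate, then repair" as a default design principle across all tool calls, form parsing, and structured-output pipelines.
  • What it means for Neta and how to use it next: This material directly supports a more mature agent evaluation framework: when evaluating a model, look beyond raw output and separate "the model's own capability" from "the system's capacity to absorb friction". A next step is to tally failures in three layers (shape / relation / intent); otherwise the team will systematically overestimate model differences and underestimate the value of system design.
  • What it means for Uota and how to use it next: The article is a reminder that many "the user/model keeps getting it wrong" problems are really poorly designed boundaries. In the next discussion, "chat habits leaking into the execution environment" can be treated as an analogue of user error, to work back to which product interactions should also offer tolerance and transparent feedback instead of only showing red errors.
  • What it means for investment judgment and how to use it next: This kind of capability shows that the agent infrastructure layer still has real value, especially the "dirty work" of schemas, fault tolerance, observability, and recovery. When assessing a project, do not only ask "which model does it use"; look hard at whether it has transferable harness know-how. Also be wary of marketing that packages a local engineering optimization as a "generational model upset".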

Discussion prompts

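1. If a system performs well only because of extensive harness repair, should the advantage be credited to the "model" or to "product engineering"?
2. Does automatically repairing tool input improve robustness, or does it create semantic errors that are harder to see?
3. Will future agent moats come more from stronger models, or from thicker tolerance and contract layers?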

how did we make deepseek outperform opus 4.7?

i've been thinking about why "open model bad at tool calling" is almost always a harness problem, not a model problem.

context: spent two days looking at billions of tokens in @CommandCodeAI (tb open source ai cli) using deepseek. i ended up writing a tool-input repair layer. the trigger was watching deepseek-flash fail on the simplest /review run: every shellCommand and readFile call bounced back with a raw zod issues blob, and the model was unable to recover because the error wasn't in a form it could read. by the end deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals.

a few things i learned that feel general:

1/ the failure modes aren't random; they're a small, finite, compositional set.

across deepseek-flash, deepseek v4 pro, glm, qwen, the same four mistakes repeat almost exactly:
- sending null for an optional field instead of omitting it
- emitting ["a","b"] as a json string instead of an actual array
- wrapping a single arg in {} where the schema expected an array (an "empty placeholder")
- passing a bare string where an array was expected ("foo" instead of ["foo"])

four repairs, ~30-100 lines each, ordered carefully (json-array-parse must run before bare-string-wrap or '["a","b"]' becomes ['["a","b"]']). that is the whole catalogue. when i hear "this open source model can't do tool calls" i now assume one of those four, and so far that's been right ~90% of the time.
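the ordering constraint above can be sketched as a small ordered list of repairs. this is a minimal illustration, not the author's code: the names json-array-parse and bare-string-wrap come from the text, while null-to-omitted, empty-object-to-array, the `Kind` argument, and the `repairValue` helper are assumptions made up for the example. in particular, treating `{}` as an empty array is only one plausible reading of the "empty placeholder" case.

```typescript
// Sketch only: signatures and the "expected kind" argument are assumptions.
type Kind = "array" | "optional";
type Fix = { fixed: true; value: unknown } | { fixed: false };
type Repair = { name: string; apply: (value: unknown, kind: Kind) => Fix };

const NONE: Fix = { fixed: false };

// Order matters: json-array-parse sits before bare-string-wrap.
const repairs: Repair[] = [
  {
    // null sent for an optional field: treat the field as omitted.
    name: "null-to-omitted",
    apply: (v, kind) =>
      kind === "optional" && v === null ? { fixed: true, value: undefined } : NONE,
  },
  {
    // '["a","b"]' sent as a string: parse it, accept only a real array.
    name: "json-array-parse",
    apply: (v, kind) => {
      if (kind !== "array" || typeof v !== "string") return NONE;
      try {
        const parsed: unknown = JSON.parse(v);
        return Array.isArray(parsed) ? { fixed: true, value: parsed } : NONE;
      } catch {
        return NONE; // not JSON at all; let a later repair try
      }
    },
  },
  {
    // {} sent where an array was expected (assumed to mean "no items").
    name: "empty-object-to-array",
    apply: (v, kind) =>
      kind === "array" && typeof v === "object" && v !== null &&
      !Array.isArray(v) && Object.keys(v).length === 0
        ? { fixed: true, value: [] }
        : NONE,
  },
  {
    // "foo" sent where ["foo"] was expected.
    name: "bare-string-wrap",
    apply: (v, kind) =>
      kind === "array" && typeof v === "string" ? { fixed: true, value: [v] } : NONE,
  },
];

// Try each repair in order and stop at the first one that applies.
function repairValue(value: unknown, kind: Kind): { value: unknown; applied: string | null } {
  for (const repair of repairs) {
    const result = repair.apply(value, kind);
    if (result.fixed) return { value: result.value, applied: repair.name };
  }
  return { value, applied: null };
}
```

with this order, `repairValue('["a","b"]', "array")` is handled by json-array-parse and yields a two-element array; if bare-string-wrap ran first, the same input would be wrapped as a one-element array containing the raw string.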

2/ the funniest failure mode is also the most revealing.

deepseek-flash, when asked to edit or write a file, sometimes emits the path as a markdown auto-link:

filePath: "/Users/x/proj/[notes.md](http://notes.md)"

our writeFile tool obediently tried creating files literally named [notes.md](http://notes.md) until we caught it. this is not a hallucination. it's the post-training chat distribution leaking through the tool boundary: the model has been rewarded for auto-linking in conversational output, and is applying that prior in a context where it makes no sense. the fix is two regex lines that unwrap only the degenerate case where link text equals url-without-protocol; real markdown like [click](https://x.com) passes through untouched.
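a rough sketch of what such an unwrap could look like follows. the source does not show the actual two regex lines, so the pattern and the `unwrapAutoLinks` name are assumptions; the only behaviour taken from the text is the rule itself: unwrap when the link text equals the url without its protocol, and leave every other link alone.

```typescript
// Assumed pattern: [text](http://url) or [text](https://url).
const AUTO_LINK = /\[([^\]]+)\]\(https?:\/\/([^)\s]+)\)/g;

// Replace a degenerate auto-link with its plain text; keep real links as-is.
function unwrapAutoLinks(path: string): string {
  return path.replace(AUTO_LINK, (whole: string, text: string, url: string) =>
    text === url ? text : whole
  );
}
```

here `unwrapAutoLinks("/Users/x/proj/[notes.md](http://notes.md)")` gives back `/Users/x/proj/notes.md`, while a string containing `[click](https://x.com)` is returned unchanged because the link text differs from the url.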

this is also a result of conditioning on their own tools during RL, which were different from the tools we write, and those ofc can't be predicted in advance.

"tool confusion" is a more useful frame than "capability gap." the model knows how to format a path. it just hasn't been told clearly enough that this path is going to fopen, not into a chat bubble. so we encode that hint at the schema level, with pathString() instead of z.string(), and the leak is plugged for every path field at once.

3/ the design choice that mattered was inverting preprocess-then-validate to validate-then-repair.

my first attempt was the obvious one: a preprocessing pass that normalized inputs (strip nulls, parse stringified arrays, etc.) before zod ever saw them. it broke immediately: writeFile content that happened to be json-shaped got rewritten before it hit disk. silent corruption, easy to miss in a smoke test.

then i made it less greedy:
- parse the input as-is. if it succeeds, ship it. valid inputs are never touched.
- on failure, walk the validator's own issue list. for each issue path, try the four repairs in order until one applies.
- parse again. on success, log tool_input_repaired:${toolName}. on failure, log tool_input_invalid:${toolName} and return a model-readable retry message.

the structural insight here is: when you preprocess, you encode a prior about what's broken. when you let the validator complain first, the schema is the prior, and you only spend repair budget at the exact paths the schema actually disagreed at. the validator is doing the work of localizing the bug for you. it's the same shape as cheap-then-careful everywhere else: try the fast path, fall back on evidence.

(this also gives you per-tool telemetry for free. you can watch repair rates per (model, tool) and notice when a model regresses on a specific contract before users do.)
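the validate-then-repair flow can be sketched end to end as below. to keep the example self-contained, a tiny hand-written `validate` function stands in for zod, and it only knows two field kinds; `Schema`, `repairAt`, `parseToolInput`, and the wording of the retry message are all hypothetical. only the control flow and the two log event names are taken from the text.

```typescript
// Minimal stand-in for a schema validator; every name here is an assumption.
type FieldKind = "string" | "string[]";
type Schema = Record<string, { kind: FieldKind; optional?: boolean }>;
type Issue = { path: string; message: string };
type Input = Record<string, unknown>;
type Outcome =
  | { status: "ok"; input: Input }
  | { status: "repaired"; input: Input }
  | { status: "invalid"; retryMessage: string };

function validate(schema: Schema, input: Input): Issue[] {
  const issues: Issue[] = [];
  for (const key of Object.keys(schema)) {
    const spec = schema[key];
    const v = input[key];
    if (v === undefined) {
      if (!spec.optional) issues.push({ path: key, message: "required" });
    } else if (spec.kind === "string") {
      if (typeof v !== "string") issues.push({ path: key, message: "expected string" });
    } else if (!Array.isArray(v) || !v.every((x) => typeof x === "string")) {
      issues.push({ path: key, message: "expected string[]" });
    }
  }
  return issues;
}

// Repair only the field the validator complained about.
function repairAt(schema: Schema, input: Input, path: string): Input {
  const spec = schema[path];
  const v = input[path];
  const out: Input = { ...input };
  if (v === null && spec.optional) {
    delete out[path]; // null on an optional field: omit it
    return out;
  }
  if (spec.kind === "string[]" && typeof v === "string") {
    try {
      const parsed: unknown = JSON.parse(v); // json-array-parse runs first
      if (Array.isArray(parsed)) {
        out[path] = parsed;
        return out;
      }
    } catch {
      // not JSON; fall through to wrapping
    }
    out[path] = [v]; // bare-string-wrap
  }
  return out;
}

function parseToolInput(toolName: string, schema: Schema, raw: Input, log: (event: string) => void): Outcome {
  const first = validate(schema, raw);
  if (first.length === 0) return { status: "ok", input: raw }; // valid input is never touched
  let candidate = raw;
  for (const issue of first) candidate = repairAt(schema, candidate, issue.path);
  const second = validate(schema, candidate);
  if (second.length === 0) {
    log(`tool_input_repaired:${toolName}`);
    return { status: "repaired", input: candidate };
  }
  log(`tool_input_invalid:${toolName}`);
  const details = second.map((i) => `${i.path}: ${i.message}`).join("; ");
  return { status: "invalid", retryMessage: `Invalid input for ${toolName}. Fix these fields and retry: ${details}` };
}
```

because repairs run only at paths the validator flagged, a writeFile call whose content is the string `'["a","b"]'` passes the first validation and is written as-is, which is the corruption case the preprocessing version got wrong. counting the log events per (model, tool) pair is one simple way to get the repair-rate telemetry mentioned above.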

4/ shape invariants and relational invariants need different fixes.

the four repairs above all handle shape problems: wrong type, missing key, wrong container. but read_file had a relational invariant: "if you provide offset, you must also provide limit, and vice versa." deepseek kept calling readFile({ absolutePath, limit: 30 }) and getting an ERROR: back. you can't fix this with input repair, because each field is independently valid; the bug is in the relationship between them.

so i taught the function the model's intent instead. limit alone → offset = 0. offset alone → limit = 2000 (matches common read tool ops default). then surfaced the decision back to the model in the result:

"Note: limit was not provided; defaulted to 2000 lines. To read more or fewer lines, retry with both offset and limit."

no Error: prefix, so the tui doesn't paint it red. the model sees what we picked and can self-correct on the next turn if our guess was wrong. transparency over silent magic wins big.
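the relational default can be sketched like this. the defaults (offset 0, limit 2000) and the note for a missing limit are from the text; the `resolveRange` and `readLines` helpers, the wording of the note for a missing offset, and the behaviour when neither field is given are assumptions for illustration.

```typescript
type ReadArgs = { absolutePath: string; offset?: number; limit?: number };

const DEFAULT_LIMIT = 2000; // default stated in the text

// Fill in the missing half of the (offset, limit) pair and say so.
function resolveRange(args: ReadArgs): { offset: number; limit: number; note: string | null } {
  const { offset, limit } = args;
  if (offset !== undefined && limit === undefined) {
    return {
      offset,
      limit: DEFAULT_LIMIT,
      note: `Note: limit was not provided; defaulted to ${DEFAULT_LIMIT} lines. To read more or fewer lines, retry with both offset and limit.`,
    };
  }
  if (limit !== undefined && offset === undefined) {
    return {
      offset: 0,
      limit,
      // Assumed wording; the source only quotes the missing-limit note.
      note: "Note: offset was not provided; defaulted to 0. To start elsewhere, retry with both offset and limit.",
    };
  }
  // Both given, or neither (assumed: read from the start with the default limit).
  return { offset: offset ?? 0, limit: limit ?? DEFAULT_LIMIT, note: null };
}

// Takes the file text directly so the sketch needs no file system access.
function readLines(fileText: string, args: ReadArgs): string {
  const { offset, limit, note } = resolveRange(args);
  const body = fileText.split("\n").slice(offset, offset + limit).join("\n");
  return note === null ? body : `${body}\n${note}`; // no "Error:" prefix on the note
}
```

the note travels in the normal result, so the model sees which default was chosen and can retry with both fields if the guess was wrong.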

repair where you can. extend semantics where you can't. surface the choice either way.

zoom out:

a lot of what looks like model capability is actually contract design. a strict schema is a choice with a cost: it filters out noise, but it also filters out recoverable noise from any model that hasn't memorized the exact json contract you happened to pick. the largest commercial models eat that cost invisibly and are lenient on tool calling because they've seen enough of every contract during pretraining; open models pay it loudly and get dismissed for it.

the harness is where you mediate between distributions. four small repairs (i'm sure more to follow as we have three more merging today), two regex lines for auto-links, one relational default, one prefix change. the model didn't change. the contract got more forgiving in exactly the places it needed to be.

deepseek v4 pro now beats opus 4.7 6/10 times on our internal evals.

imo "skill issue" applies to the harness more often than the model.
