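🧠 阿头学 · 💬 Discussion topic

The right way to evaluate an agentic harness is to run experiments, not to read logs

The most valuable claim in this article is that agent evaluation is fundamentally repeatable experiment design rather than observation of production metrics. But the author quietly substitutes "agreement with a large model's output" for "real-world correctness", which is a clear methodological flaw.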
2026-05-10

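Key points

Split into modules first, and don't evaluate the whole chain in one go: The author argues that end-to-end evaluation is often not actionable, because there are too many variables and attribution is poor. Evaluating the sub-module that sits earliest, carries the heaviest assumptions, and most affects the goal is what turns "found a problem" into "can act on it".
Set the weights of your goals before evaluating: The author compresses the goals into four categories: accuracy, latency, cost, and tech debt. The framework is pragmatic, but the judgment that really matters is not to maximize all four at once. Otherwise the experiment yields pretty charts and no decisions.
Turn the agent into a black-box function, then expose a few knobs: The author stresses isolating the module first and keeping only 2-3 independent variables. This is sound engineering advice and does reduce experimental complexity, but it suits single-module optimization better than the study of real agent systems with strong multi-variable interactions.
Deterministic metrics first, LLM judges only as a supplement: This is the most defensible part of the article, because metrics such as regex checks, JSON validation, precision/recall, and IOU are cheaper, more stable, and more reproducible. The industry reaches for LLM-as-a-judge at the drop of a hat, and often that is a way of covering lazy evaluation design with high cost.
The author's case shows the process is useful, not that the conclusion generalizes: Using real logs, a public dataset, and a joint evaluation of cost, latency, and error rate shows that the process has real value for his own retrieval subagent. At most it proves that "switching models worked in this case". It falls far short of proving that "this is the standard answer for agent evaluation".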

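How this relates to us

What it means for ATou: If ATou is building a complex workflow, the first priority should not be an "end-to-end feeling of intelligence" but pulling out the earliest, most fragile module and building a harness around it. A next step is to give the existing workflow its first check-up using the six steps: module, goal, knobs, data, metrics, plots.
What it means for Neta: If Neta wants to judge whether an AI product has a real engineering moat, looking at the demo is not enough. The question to press is whether it has stable eval infrastructure. A next step is to treat "are there real logs, is there a regression test set, are cost and error rate tracked together" as a due-diligence checklist.
What it means for Uota: If Uota cares about how a system develops a sustainable capacity for self-optimization, the idea most worth absorbing is that an eval harness is already half an RL environment. A next step is to discuss how to spell out the reward clearly, so that prompt optimization, policy search, or even lightweight RL can plug into the existing evaluation framework.
What it means for all three: The article is a shared reminder that without repeatable evaluation there is no trustworthy optimization. The most important next step is not switching to yet another new model but first confirming that your current "improvement" is actually being measured reliably.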

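Discussion prompts

1. If a small model disagrees with a large model's output but users get better results, should we trust the reference answer or the user value?
2. Agent systems turn off caching and memory to run clean experiments: does this improve causal inference, or does it cripple the agent's core capabilities?
3. For a real product, when cost drops by 50% and accuracy drops slightly, under what conditions is that the right decision, and under what conditions is it the wrong one?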

Our forefathers and foremothers laid down the script for improving any system: (1) Build, (2) Evaluate, (3) Refine. Step 1 is trivial now thanks to coding agents. Step 3 gets simple too once you finish Step 2.

So this article is about the most open-ended part of the process: Evaluation. But it may not be about exactly what you think when you hear that word.

Imo, people underestimate and miscalculate the breadth of evaluation.

Evaluation is not only about collecting logs and success metrics of the current system you have in place.

It is also about validating your hypothesis and comparing your approach against alternative methods through frequent experiments.

For the past week, I have been running multiple experiments to find out how good all my deployed agentic systems actually are. Throughout this article, I will give a couple of real world examples to get your brain juices flowing and help you transfer some of this stuff into your domain.

How can you design experiments? How do you pick eval metrics? How do you interpret your results?

Enough talk, let's jazz.

Step 1:

Decide what you want to evaluate

Most agentic pipelines work with multiple different agents collaborating together towards giving your user some value. A good rule of thumb I follow is to treat each agent as a separate harness. For most of my use-cases, following this rule of thumb has served me well.

Depending on your use-case, you may want to set up a system-level evaluation harness, or a module/unit-level harness.

Your first task is to make a decision - what are you evaluating?

For example, I have a website to study research papers with AI. It has multiple different agentic modules: one to process papers, another to generate diagrams, another to find relevant sections from the paper given a user query, another to generate answers, etc.

Generally you can't evaluate an entire pipeline at once and expect actionable insights. There may be way too many moving variables!

So how do you pick the correct module? Ask yourself; you probably already know which module in your project needs evaluation right away. I am sure 99% of you reading already know this! If you are in the 1%, I suggest asking yourself these questions:

Where in the pipeline have I made the most egregious assumptions? *(The whole point of an experiment is to validate hypotheses and assumptions, so may as well start here.)*
What part of my pipeline happens EARLIEST in the chain? *(The earlier things are in the pipeline, the more impactful they tend to be, coz errors propagate downstream! So prioritize these first!)*
What is my goal here, and what vector am I trying to optimize? *(I have an entire section for this one coming up.)*

For Paper Breakdown, I decided to work on my retrieval subagent.

Q. What does my retrieval subagent do? Given a user query, my subagent searches one or more research papers and fetches passages that are relevant to the user's question. All types of response generation (through text, diagrams, code, etc.) are contextualized by these retrieved passages.

Step 2:

Decide your end goal

When we run experiments, we always have a goal in mind. A hypothesis we want to test, or a suspicion we want to put to rest.

What does success look like? What will you do with the information once you have it? It is best to have a clear hypothesis and a threshold for action.

Thankfully, there is a universally applicable goal we can all strive for:

We are all optimizing to provide better value to our users...

as fast as possible...

with the least cost...

and minimal tech debt.

The four goals we generally strive for as project owners are:

Improve accuracy: Can we improve the overall quality of the outputs of our target module?
Reduce latency: Can we make our target module run FASTER?
Reduce cost: Can we make our target module cheaper to run?
Improve code quality: Can we kill dependencies, make our codebase simpler and leaner?

Depending on your business, you may weigh these four things differently. Ideally, pick ONE of these while ensuring the other ones stay acceptable!
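One way to read "pick ONE while the others stay acceptable" is as a constrained choice: optimize a single goal and treat the rest as pass/fail guardrails. The sketch below assumes cost is the goal being optimized; the `Result` fields, candidate names, numbers, and thresholds are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str         # hypothetical candidate label
    accuracy: float   # higher is better, in [0, 1]
    latency_s: float  # seconds per request
    cost_usd: float   # dollars per 1k requests


def pick_cheapest_acceptable(results, min_accuracy, max_latency_s):
    """Optimize one goal (cost) while accuracy and latency act as guardrails."""
    acceptable = [
        r for r in results
        if r.accuracy >= min_accuracy and r.latency_s <= max_latency_s
    ]
    if not acceptable:
        return None  # nothing clears the bar, so keep the current system
    return min(acceptable, key=lambda r: r.cost_usd)


# Illustrative numbers only, not measurements from the article.
candidates = [
    Result("candidate-a", accuracy=0.82, latency_s=2.0, cost_usd=4.0),
    Result("candidate-b", accuracy=0.80, latency_s=1.5, cost_usd=1.0),
    Result("candidate-c", accuracy=0.60, latency_s=0.9, cost_usd=0.5),
]
best = pick_cheapest_acceptable(candidates, min_accuracy=0.75, max_latency_s=3.0)
print(best.name if best else "no candidate clears the guardrails")  # candidate-b
```

Writing the thresholds down before running anything is what gives you the "threshold for action" from earlier: the cheapest candidate loses here because it fails the accuracy guardrail.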

For Paper Breakdown, I had a simple goal. Since PBD is completely self-funded by me, I wanted to cut my cost as much as I could so I could support more free users, extend limits, and give users a chance to try out the platform for free. This is where the motivation began.

Step 3:

Isolate the black box and your knobs

You picked a module in Step 1 to test. Now it's time to isolate it from the rest of the system. You need a clean function where you shove your inputs in, and it spits the outputs out, without worrying about any internal plumbing.

Before you happily say "oh this is already done, I have my functions already, duh", there is one more thing you need before you start running experiments.

Independent variables.

These are basically the "knobs" of your experiment: the hyperparameters (in a machine learning sense). They are the specific configuration parameters you control from the outside to see how they affect the performance of the black box.

Isolate your independent variables from the rest of your program. I suggest keeping the number of independent variables to a maximum of 2 or 3, so you can study their individual effects before adding more!

Examples:

Suppose you just want to test the quality of different LLMs. Wrap your module in a function and make it accept a parameter (the model name) as input.
Suppose you want to test different system prompts. Pass the prompts as input.
If you want to test new tools, hide them behind a feature flag that you can toggle on or off.
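The three examples above can be sketched as one function whose knobs are passed in from the outside. Everything named here (`Knobs`, `retrieve`, `fake_llm`, the tool names, the default prompt) is hypothetical; the model call is injected as a plain callable so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Knobs:
    """The independent variables of the experiment, kept to three."""
    model_name: str
    system_prompt: str = "Return the passages most relevant to the query."
    enable_new_tool: bool = False  # feature flag for a tool under test


# Hypothetical signature: (model_name, system_prompt, query, tool_names) -> chunk ids
LLMCall = Callable[[str, str, str, list[str]], list[str]]


def retrieve(query: str, knobs: Knobs, call_llm: LLMCall) -> list[str]:
    """Black box: query in, chunk ids out, no internal plumbing exposed."""
    tools = ["search"]
    if knobs.enable_new_tool:
        tools.append("new_tool")
    return call_llm(knobs.model_name, knobs.system_prompt, query, tools)


def fake_llm(model_name, system_prompt, query, tools):
    # Stand-in so the sketch runs offline; a real call would hit a model API.
    return [f"{model_name}:{tool}" for tool in tools]


print(retrieve("what is attention?", Knobs("small-model"), fake_llm))
print(retrieve("what is attention?", Knobs("small-model", enable_new_tool=True), fake_llm))
```

Keeping the knobs in one frozen object makes each run's configuration easy to log next to its results.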

Note on persistence

Most real world applications use a caching layer. Maybe a Redis layer that caches past responses inside the API so future computation can use it. This is awesome for production, but for an experiment, you need to reconsider if it's required.

If you are testing a multi-turn feature and the module's ability to read from and write to the cache, by all means keep it on. In most other cases, you will want to turn these features off. No db writes, no cache writes.

You want experiments to be repeatable and, above all, test cases must never interfere with each other. Test case 10 should have zero advantage (or disadvantage) over Test case 4. Every test case needs to be transactional: your black box should be an ephemeral function with zero persistence.
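A minimal sketch of zero persistence, assuming the module receives its cache as an argument: in the experiment you hand it a cache that never stores anything. `NullCache`, `DictCache`, and `answer` are illustrative names, and `DictCache` only stands in for a production cache such as Redis.

```python
class NullCache:
    """Cache that never stores anything, so no test case can help or hurt another."""

    def get(self, key):
        return None

    def set(self, key, value):
        pass


class DictCache:
    """In-memory stand-in for a production cache layer."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def answer(query, cache, compute):
    """Return (value, was_cache_hit) for a query."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    value = compute(query)
    cache.set(query, value)
    return value, False


def compute(query):
    return query.upper()  # placeholder for the expensive agent call


prod_cache, eval_cache = DictCache(), NullCache()
prod_hits = [answer("same query", prod_cache, compute)[1] for _ in range(2)]
eval_hits = [answer("same query", eval_cache, compute)[1] for _ in range(2)]
print(prod_hits, eval_hits)  # [False, True] [False, False]
```

With the null cache, the second run of the same query does the full work again, which is exactly what keeps walltime and cost numbers comparable across test cases.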

Q. What did my experiments look like?

In one of my experiments, I wanted to test new (smaller) models. Smaller models run faster and cost less. So if I could replace my current subagent model with a smaller model, I could save a few dollars.

Having set up my subagent code (almost exactly like how it runs in prod), I was ready for the next stage: preparing the test-cases for my experiment.

Step 4:

Design your test-cases

Your evaluation is only as good as your dataset. If you test on garbage, you'll optimize for garbage.

Every test-case has 2 components: the input and the expected output. The simplest experiments can just be a CSV file containing literally those two columns.
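As a rough sketch of that two-column file, here is one way to load it with the standard library. The column names, the `|` separator for multiple expected chunks, and the sample rows are assumptions made for the example, not the layout of the author's dataset.

```python
import csv
import io

# Inline stand-in for a file on disk; column names and rows are made up.
CSV_TEXT = """input,expected_output
What loss does the paper use?,chunk_12|chunk_13
How large is the dataset?,chunk_04
"""


def load_test_cases(handle):
    """Read (input, expected_output) rows into a list of test-case dicts."""
    cases = []
    for row in csv.DictReader(handle):
        cases.append({
            "input": row["input"],
            # Several expected chunk ids are packed into one cell with "|".
            "expected": set(row["expected_output"].split("|")),
        })
    return cases


test_cases = load_test_cases(io.StringIO(CSV_TEXT))
print(len(test_cases), "test cases loaded")
```

For a real file, `open(path, newline="")` would replace the `io.StringIO` wrapper.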

Soooo where do you get these inputs and expected outputs from? Here are some examples:

If you have production logs, that is the best place to begin. Ideally, you should be saving all interactions through your service in a database, or structured logs. You should be able to find the exact inputs where your target module was invoked.
If you don't have production logs, I encourage you to quit experimenting and set up your analytics first. Look into posthog or something. Add a middleware and let your users use your product for a couple of weeks. Come back to the experiment later.
If none of the above apply to you, your best shot is to either write test-cases yourself, or generate synthetic cases with an LLM. I encourage you to decide which route (or a combination) is best for your use-case.

What does quality test-case data look like?

For best results, it is important that your test-cases have the following properties:

**Deduplicated and diverse:** Since we will be aggregating the performance across these test cases, it is super important for your test cases to be diverse. If you oversample a single subdomain, you risk biasing your entire experiment.
**Ground truth responses are preferred:** If you are logging the inputs of your production system, you are probably logging the outputs as well. Having access to these is handy coz, at the minimum, it tells you how much your new solution is expected to drift from your current solution.

Q. What did I do?

My base input-output dataset is publicly available here:

https://huggingface.co/datasets/paperbd/paper-cited-chunks-v1

I pulled a couple hundred real production messages from our database where the assistant referred to a passage in the paper PDF.
I reverse-engineered the query that was sent by the main agent to the subagent (I save this info in the database, so this wasn't difficult).
I made sure to cap the samples per paper.
Within each paper, I made sure the test cases had minimal overlap with each other... this ensures coverage of each paper.

In the end, I had a simple dataset of the query that was asked by the main agent, and the chunks that were referred to in its final response.
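The capping and overlap steps could look roughly like the sketch below. This is not the author's actual selection code: the greedy strategy, the Jaccard overlap measure, the thresholds, and the field names (`paper_id`, `query`, `chunks`) are all assumptions.

```python
from collections import defaultdict


def jaccard(a, b):
    """Overlap between two chunk lists: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def select_cases(cases, max_per_paper=3, max_overlap=0.5):
    """Greedily keep cases, capping each paper and skipping near-duplicates."""
    kept_by_paper = defaultdict(list)
    for case in cases:
        kept = kept_by_paper[case["paper_id"]]
        if len(kept) >= max_per_paper:
            continue  # this paper already has enough samples
        if any(jaccard(case["chunks"], k["chunks"]) > max_overlap for k in kept):
            continue  # cites mostly the same chunks as a kept case
        kept.append(case)
    return [c for kept in kept_by_paper.values() for c in kept]


raw = [
    {"paper_id": "p1", "query": "q1", "chunks": ["c1", "c2"]},
    {"paper_id": "p1", "query": "q2", "chunks": ["c1", "c2"]},  # duplicate coverage
    {"paper_id": "p1", "query": "q3", "chunks": ["c7"]},
    {"paper_id": "p2", "query": "q4", "chunks": ["c3"]},
]
selected = select_cases(raw, max_per_paper=2)
print([c["query"] for c in selected])  # ['q1', 'q3', 'q4']
```

A greedy pass depends on input order, so shuffling with a fixed seed beforehand would be a reasonable refinement.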

Step 5:

Design one or more evaluation metrics

Next step: you need a way to score the system's output automatically.

There are deterministic metrics and probabilistic metrics. Deterministic metrics are exact: did the output contain a specific string? Was the output valid JSON? Was the length under 500 characters?

Probabilistic metrics usually involve "LLM-as-a-judge", where you prompt a smarter model to grade the output on a scale of 1-5 for helpfulness, tone, or accuracy.

You should use deterministic metrics wherever possible because they are cheaper, faster, and 100% reliable.
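The three deterministic checks mentioned above are a few lines each with the standard library. The function names and the sample output are illustrative.

```python
import json


def contains(output: str, needle: str) -> bool:
    """Did the output contain a specific string?"""
    return needle in output


def is_valid_json(output: str) -> bool:
    """Was the output valid JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def under_length(output: str, limit: int = 500) -> bool:
    """Was the length under `limit` characters?"""
    return len(output) < limit


sample = '{"chunks": ["chunk_12"]}'
checks = {
    "contains": contains(sample, "chunk_12"),
    "valid_json": is_valid_json(sample),
    "under_500": under_length(sample),
}
print(checks)
```

Each check returns the same answer every time for the same output, which is what makes these metrics cheap to re-run across models.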

**Examples of eval metrics:**

If you are evaluating a retrieval agent, use precision, recall, or IOU scores.
If you are evaluating a full response agent, use LLM-as-judge.
If you are evaluating a task agent, evaluate tool call statistics.
Most task agents also work in verifiable environments, meaning there is a clear way to distinguish if the work succeeded or not. Use this design!
For citation agents, you would often have system prompts that ask them to respond in a specific format. Use regex to find malformed formatting. This is also true for coding agents, or structured output generators.
Whatever you do, always record at least 3 additional things: total walltime, completion token usage, and total cost.
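For the retrieval and citation cases in the list above, a minimal sketch: precision, recall, and IOU computed over sets of chunk ids, plus a regex check for malformed citations. The `[chunk_<number>]` citation format is a made-up example, not the format Paper Breakdown uses.

```python
import re


def retrieval_scores(predicted, expected):
    """Set-based precision, recall, and IOU over chunk ids."""
    predicted, expected = set(predicted), set(expected)
    hits = len(predicted & expected)
    union = len(predicted | expected)
    return {
        "precision": hits / len(predicted) if predicted else 0.0,
        "recall": hits / len(expected) if expected else 0.0,
        "iou": hits / union if union else 0.0,
    }


CITATION = re.compile(r"\[chunk_\d+\]")  # hypothetical citation format


def malformed_citations(text):
    """Return bracketed spans that do not match the expected citation format."""
    return [m for m in re.findall(r"\[[^\]]*\]", text) if not CITATION.fullmatch(m)]


scores = retrieval_scores(["c1", "c2", "c3"], ["c2", "c3", "c4", "c5"])
print(scores)  # precision 2/3, recall 0.5, iou 0.4
print(malformed_citations("See [chunk_12] and [chunk 7]."))  # ['[chunk 7]']
```

In the example, 2 of the 3 returned chunks are expected, 2 of the 4 expected chunks are found, and the union holds 5 distinct chunks.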

Q. What did I do?

Since my goal was to reduce cost while maintaining latency and accuracy, I used almost all the above:

I compared models with the current outputs. I mainly used IOU scores, but I also recorded more metrics like total chunks returned per model (too many means the main agent gets overwhelmed), precision, and recall.
I also compared models with Sonnet-4.6 output. This was the largest model I tried, so it is as close to an LLM-as-a-judge as my harness gets.
I recorded cost, completion tokens, number of tool calls, number of cache write attempts (this is a tool that LLMs have access to), and of course, total walltime to run all experiments.
Error rates too, of course. Models get eliminated if they fail.

With my black box ready (retrieval subagent), my knobs ready (model names), my input cases (actual queries from prod), my expected outputs (passages cited by models in prod + Sonnet-4.6 outputs), and my metrics all set up, it was time to run experiments! Once I got the results, I just began drawing plots to interpret them.
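Putting those pieces together, an experiment loop can be sketched as below. It records walltime and errors per run and drops any model that fails; token usage and cost are left out because they would come from the provider's response. The model names, `fake_black_box`, and the exact-match scorer are placeholders, not the author's setup.

```python
import time


def run_experiment(models, test_cases, black_box, score):
    """Run every model on every test case, recording score, walltime, and errors."""
    rows = []
    for model in models:
        for case in test_cases:
            start = time.perf_counter()
            try:
                output = black_box(case["input"], model)
                row = {"error": False, "score": score(output, case["expected"])}
            except Exception:
                row = {"error": True, "score": None}
            row.update(model=model, walltime_s=time.perf_counter() - start)
            rows.append(row)
    return rows


def error_rate(rows, model):
    mine = [r for r in rows if r["model"] == model]
    return sum(r["error"] for r in mine) / len(mine)


def fake_black_box(query, model):
    # Placeholder for the real subagent call.
    if model == "flaky-model":
        raise RuntimeError("simulated failure")
    return {"c1"}


def exact_match(output, expected):
    return float(output == expected)


models = ["steady-model", "flaky-model"]
cases = [{"input": "q1", "expected": {"c1"}}, {"input": "q2", "expected": {"c2"}}]
rows = run_experiment(models, cases, fake_black_box, exact_match)
survivors = [m for m in models if error_rate(rows, m) == 0]
print(survivors)  # ['steady-model']
```

Storing one row per (model, test case) pair keeps the raw data around, so any aggregate or plot can be recomputed later without re-running the models.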

Note on RL Environments vs Evaluation Harnesses

Evaluation harnesses and RL environments are actually very similar. You can procedurally generate prompts (observations), or use your existing input test cases. You already have your reward function (for me: sonnet overlap + latency + cost). In theory, you can already run prompt optimization methods (GEPA), or end-to-end RL training, to train small agents on your harnesses. More on that in a separate article.
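A reward of the form "overlap + latency + cost" needs signs and weights before it can be optimized. The sketch below is one possible reading, with overlap rewarded and latency and cost penalized; the weights are arbitrary placeholders, not values from the article.

```python
def reward(overlap, latency_s, cost_usd, w_overlap=1.0, w_latency=0.05, w_cost=10.0):
    """Scalar reward: higher overlap is good, latency and cost are penalized.

    The weights are illustrative and would need tuning to your own priorities.
    """
    return w_overlap * overlap - w_latency * latency_s - w_cost * cost_usd


slow_and_pricey = reward(overlap=0.80, latency_s=6.0, cost_usd=0.03)
fast_and_cheap = reward(overlap=0.78, latency_s=2.0, cost_usd=0.01)
print(fast_and_cheap > slow_and_pricey)  # True
```

The weights encode the same priorities as the end goal chosen in Step 2, so changing the goal means changing the weights, not the harness.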

Step 6:

Draw graphs and plots

Visualizations are a window to your experiment's soul. Or something like that.

https://x.com/@neural_avb

My next step was to plot graphs. Bar graphs to measure response times. Scatter plots to show Cost vs IOU or Walltime vs IOU. I plotted IOU/Recall scores against both the baseline and Sonnet-4.6. I also made box-plots to find the distribution of response times, returned chunk lengths, etc. I want consistent models that have a narrow distribution and a high mean.
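The "narrow distribution, high mean" criterion can also be checked numerically before (or alongside) the box-plots. This sketch uses only the standard library; the model labels and IOU values are invented for illustration.

```python
import statistics


def summarize(values):
    """Mean plus two spread measures: standard deviation and interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "iqr": q3 - q1,  # the height of the box in a box-plot
    }


# Invented per-test-case IOU scores for two hypothetical models.
iou_by_model = {
    "steady-model": [0.80, 0.82, 0.81, 0.79, 0.80],
    "erratic-model": [0.95, 0.40, 0.90, 0.55, 0.99],
}
summary = {name: summarize(values) for name, values in iou_by_model.items()}
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Here the erratic model reaches higher peaks, but the steady one has both the higher mean and the tighter spread, which is the profile the text is looking for.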

You can just ask your coding agent to do this step for you. Tell it what plots you need and you should be good to go!

Takeaways

So yeah, that helped a lot. I was able to replace my previous system (gpt-5-mini) with a much faster and cheaper alternative (gemini-3-flash-lite). It is faster, it is cheaper, and it also calls one of my tools that lets it share information with later subagents (something gpt-5-mini didn't do regularly).

https://paperbreakdown.com

The other great part about all this is that I can easily replicate this experiment whenever a new model drops. I just re-run the existing test-cases on the new model, and compare its performance against the established benchmark.

Hope I managed to get the cogs in your head turning. Experiment design is super fun and valuable - I encourage you to try it. Share your questions or experiences, so we can learn from each other.

If you are into links, here is where to find me:

Follow me for more: @neural_avb

My YouTube channel: https://www.youtube.com/@avb_fj

Patreon: https://www.patreon.com/NeuralBreakdownwithAVB

Paper Breakdown: https://paperbreakdown.com/

A bit of kindness goes a long way, RT or comment to support! 💙
