
面向长时应用开发的 Harness 设计

Anthropic 用"规划者-生成者-评估者"三智能体架构让 Claude 跑出了远超单智能体基线的长时编码成果,但 20 倍成本换来的仍是"勉强可用的 Demo",核心价值不在架构本身,而在"脚手架必须随模型进化做减法"这一工程判断。

2026-03-25 · 原文链接

核心观点

  • 生成与评估必须分离 作者最站得住的结论是,LLM 自评天然偏宽松,尤其在设计和复杂开发里会“自我感觉良好”;把 Generator 和 Evaluator 分开,哪怕评估器仍是 LLM,也比让模型自己挑自己毛病更可靠。
  • 长时任务的真问题是“跑偏”不是“不会” 文中把失败归因到两类:一是长上下文导致连贯性衰减甚至提前收尾,二是自评放水导致错误混过去;这个拆法是有效的,因为它把优化重点从“堆提示词”转向“重构执行流程”。
  • 评分标准不只是验收工具,而是生成方向盘 作者反复证明,rubric 一旦同时进入 generator 和 evaluator,模型在收到反馈前就会被牵引;这说明“评价函数塑造输出”是硬规律,不是后处理细节。
  • Harness 不是越复杂越好,而是要随模型升级做减法 这篇文章最成熟的工程观,不是鼓吹多智能体,而是承认很多脚手架只是对模型缺陷的补丁;当 Opus 4.6 变强后,原先必要的 sprint 拆解就可能变成冗余成本。
  • 效果提升是真实的,但经济性并未被证明 从 $9/20 分钟到 $200/6 小时,再到 $124/近 4 小时,文章证明了“更强输出可以买到”,却没有证明“这对真实团队是划算的”,这一点不能被 demo 的惊艳感掩盖。

跟我们的关联

  • 对 ATou 意味着什么 这篇文章提醒 ATou,不要把 agent 当全能个人,而要当成需要分工的组织;下一步可以把自己的复杂工作流拆成“规划-执行-验收”三层,先在内容生产、研究整理或产品原型上做小型实验。
  • 对 Neta 意味着什么 Neta 如果在做长链条任务,最该学的不是 Anthropic 的品牌话术,而是“评分标准先行”;下一步可以先写 3-5 条明确 rubric,再让模型按 rubric 生成和复审,验证是否比单轮直出稳定。
  • 对 Uota 意味着什么 Uota 如果关心 UI、品牌或审美输出,这篇文章最可迁移的是“把主观偏好编码成评估标准”;下一步可以把“设计质量/原创性/工艺/功能性”改写成自己的版本,用来约束设计生成。
  • 对投资判断意味着什么 文章显示 agent 基础设施的价值正在从“模型调用”转向“流程编排+验证工具+上下文管理”;下一步看项目时应重点问:它的 moat 是模型红利,还是能随着模型升级持续删改 harness 的工程能力。

讨论引子

1. 如果 full harness 比 solo 贵 20 倍以上,它在什么业务场景下才真正成立,而不是 demo 工程?
2. 评估器经过大量人工调教后才变"严格",这到底算自主 agent,还是把人的偏好隐藏进了 prompt?
3. 随着底座模型变强,未来最有价值的是更复杂的多 agent 系统,还是更少但更精准的编排组件?

面向长时应用开发的 Harness 设计

在智能体编程的前沿,harness 设计是性能的关键。下面分享我们如何在前端设计与长时自主软件工程上,把 Claude 推得更远。

本文由 Prithvi Rajasekaran 撰写,他是我们 Labs 团队的成员。

过去几个月里,我一直在解决两个相互牵连的问题:让 Claude 产出高质量的前端设计,以及让它在无需人工干预的情况下构建完整应用。这项工作源自我们早期在 frontend design skill 与 long-running coding agent harness 上的探索,当时我和同事通过提示词工程与 harness 设计,把 Claude 的表现提升到了远高于基线的水平,但两条路线最终都撞上了天花板。

为了突破,我开始寻找能同时适用于两个差异极大的领域的新型 AI 工程方法:一个由主观审美定义,另一个由可验证的正确性与可用性定义。受到生成对抗网络(GAN)启发,我设计了一个由生成智能体与评估智能体构成的多智能体结构。要让评估器稳定、可靠,而且有品味地打分,首先需要制定一套标准,把"这个设计好吗"这类主观判断,转化为具体、可评分的条目。

随后,我把这些方法应用到长时自主编码上,并把我们之前 harness 工作中的两条经验带了过来:把构建过程拆解成可处理的小块,以及用结构化工件在不同会话之间交接上下文。最终得到的是一个由三类智能体组成的架构:规划者、生成者和评估者,它能在持续数小时的自主编码会话里产出内容丰富的全栈应用。

为什么朴素实现不够用

我们之前已经展示过,harness 设计会显著影响长时智能体编程的效果。在更早的一次 实验 中,我们用一个初始化智能体把产品规格拆成任务清单,再由编码智能体按功能逐个实现任务,并在会话之间交接工件来携带上下文。更广泛的开发者社区也收敛到了类似的洞见,比如 Ralph Wiggum 方法会用 hooks 或脚本让智能体保持连续的迭代循环。

但有些问题一直存在。面对更复杂的任务,智能体仍然会随着时间推移逐渐跑偏。我们拆解这个问题时,观察到这类任务中常见的两种失败模式。

第一种是:随着上下文窗口被不断填满,模型在长任务上往往会逐渐失去连贯性(见我们关于 上下文工程 的文章)。一些模型还会出现上下文焦虑:当它们接近自认为的上下文上限时,会提前开始收尾。上下文重置——把上下文窗口完全清空并启动一个新的智能体,同时用结构化交接把上一个智能体的状态与下一步计划传递过去——可以同时解决这两个问题。

这与压缩(compaction)不同。压缩是在原地把更早的对话内容总结掉,让同一个智能体能在更短的历史上继续推进。压缩能保留连续性,但它不会给智能体一张干净的白纸,因此上下文焦虑仍可能持续。重置能给出一张干净的白纸,但代价是交接工件必须包含足够的状态,才能让下一个智能体无缝接手。在我们早期测试中,我们发现 Claude Sonnet 4.5 的上下文焦虑表现得足够强烈,以至于仅靠压缩不足以支撑强的长任务表现,于是上下文重置成了 harness 设计的必需品。这解决了核心问题,但也为每次 harness 运行增加了编排复杂度、token 开销与延迟。
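原文没有给出交接工件的具体格式。下面用 Python 给出一个最小示意,仅用于说明"清空上下文 + 结构化交接"的形态;其中的字段名(completed、next_steps 等)与文件名均为假设,并非 Anthropic harness 的真实实现:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Handoff:
    """上下文重置时传给下一个智能体的结构化交接工件(字段为示意)。"""
    completed: list[str]                      # 已完成的功能
    in_progress: str                          # 进行到一半的工作与当前状态
    next_steps: list[str]                     # 下一个智能体应接手的任务
    known_issues: list[str] = field(default_factory=list)

def reset_with_handoff(handoff: Handoff, path: str = "handoff.json") -> str:
    """写出交接文件;新智能体从零上下文启动,首条提示只指向该文件。"""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(handoff), f, ensure_ascii=False, indent=2)
    # 返回给新智能体的启动提示:上下文是空的,状态全部来自文件
    return f"请读取 {path},从 next_steps 的第一项继续工作。"
```

交接工件要足够完整,新智能体才能"无缝接手";这正是正文所说重置相对压缩的代价。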

第二个问题(我们此前还没系统解决过)是自我评估。当被要求评估自己产出的工作时,智能体往往会自信地夸赞——即使在人类看来质量明显一般。这个问题在设计这类主观任务上尤其突出,因为它没有类似可验证软件测试那样的二元检查。布局是精致还是泛化,本来就是判断题,而智能体在给自己的作品打分时会稳定地偏向正面。

不过,即便在有可验证结果的任务里,智能体也仍会时不时表现出糟糕的判断,从而拖慢任务完成。把做事的智能体与判定的智能体分开,是解决这个问题的一根强杠杆。这种分离不会立刻消除宽松倾向,因为评估器仍然是一个倾向于对 LLM 生成内容更慷慨的 LLM。但把一个独立评估器调到更怀疑、更挑剔,往往比让生成器对自己的工作变得苛刻要可控得多;而一旦有了外部反馈,生成器也就有了具体可迭代的目标。

前端设计:让主观质量可评分

我从前端设计开始做实验,因为自我评估的问题在这里最明显。在没有任何干预时,Claude 通常会倾向于安全、可预期的布局:技术上能用,但视觉上不出彩。

我为前端设计构建的 harness 主要由两点洞见塑形。第一,审美不可能被完全还原成一个分数,个人口味也永远不同,但如果用能编码设计原则与偏好的评分标准,审美是可以被提升的。"这个设计美吗"很难稳定回答,但"这个是否遵循我们的好设计原则"就能让 Claude 有具体的参照。第二,把前端生成与前端评分分离,可以建立一个反馈回路,推动生成器朝更强的输出前进。

基于此,我写了四条评分标准,并把它们同时放进生成器与评估器的提示里:

  • 设计质量:整体是否像一个连贯的整体,而不是零散部件的拼凑?这里做得好意味着颜色、字体排印、布局、图像等细节能共同塑造出独特的情绪与身份。

  • 原创性:能看到定制化决策吗,还是只是模板布局、库默认值和 AI 生成的套路?人类设计师应当能识别出刻意的创意选择。未改动的素材组件——或白底卡片上叠紫色渐变这类明显的 AI 生成痕迹——都会在这里失败。

  • 工艺:技术实现:字阶层次、间距一致性、色彩和谐、对比度比例。它检查的是基本功而非创意。大多数合理实现默认能过关;不及格则意味着基础被破坏。

  • 功能性:与审美无关的可用性。用户能否理解界面在做什么,找到主要操作,并在不靠猜的情况下完成任务?

我把设计质量与原创性的权重放在工艺与功能性之上。Claude 在工艺与功能性上默认就能拿到不错分数,因为所需的技术熟练度往往会自然地从模型里冒出来。但在设计与原创性上,Claude 常常只能做到不难看而已。评分标准会明确惩罚高度泛化的 AI 糊弄套路,并通过更重的设计与原创性权重,推动模型在审美上更敢冒险。
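"设计质量与原创性压过工艺与功能性"的加权方式,原文并未给出具体数值。写成代码大致是下面的样子(权重数值与英文键名均为假设,仅作示意):

```python
# 四条评分标准的权重:设计质量与原创性更重(数值为假设,原文未公开)
WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}

def weighted_score(scores: dict[str, float]) -> float:
    """scores 为评估器对每条标准给出的 0-10 分;返回加权总分。"""
    assert set(scores) == set(WEIGHTS), "每条标准都必须打分"
    return sum(WEIGHTS[k] * v for k, v in scores.items())
```

在这种权重下,一个工艺、功能满分但设计平庸的页面,总分仍会被显著压低,这正是"惩罚 AI 糊弄套路"的机制所在。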

我用带有细致分数拆解的少样本示例来校准评估器。这能让评估器的判断与我的偏好对齐,也能降低跨迭代的分数漂移。

我用 Claude Agent SDK 搭了这个循环,让编排保持简单。生成智能体先基于用户提示生成一个 HTML/CSS/JS 前端。我给评估器配了 Playwright MCP,让它在给每条标准打分并写出细致点评前,能直接与线上页面交互。实际运行中,评估器会自行在页面里导航、截图,并仔细研究实现后再给出评估。反馈再回流到生成器,作为下一轮迭代的输入。每次生成我会跑 5 到 15 轮迭代,每一轮通常都会在回应评估器批评时,把生成器推向更鲜明的方向。因为评估器是在真实页面里导航,而不是对静态截图打分,每一轮都会消耗真实时间;完整跑下来最长会到四小时。我还要求生成器在每次评估后做一个策略决策:如果分数趋势不错就继续精修当前方向;如果路线不对,就直接切换到完全不同的审美方向。
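原文没有贴出编排代码。抛开 Claude Agent SDK 的具体 API,这个生成-评估反馈回路的骨架可以示意如下;generate、evaluate 是假设的接口,由调用方接入真实的生成与评估智能体:

```python
from typing import Callable

def run_design_loop(
    prompt: str,
    generate: Callable[[str, str], str],          # (用户提示, 上轮反馈) -> 前端代码,假设的接口
    evaluate: Callable[[str], tuple[float, str]],  # 代码 -> (加权分, 文字点评),假设的接口
    iterations: int = 10,                          # 原文为每次生成跑 5~15 轮
) -> tuple[str, float]:
    """生成-评估回路骨架:评估器的点评回流给生成器,作为下一轮输入。"""
    feedback, best, best_score = "", "", float("-inf")
    for _ in range(iterations):
        artifact = generate(prompt, feedback)
        score, feedback = evaluate(artifact)
        if score > best_score:   # 留存历史最优:正文提到最后一版未必是最好的一版
            best, best_score = artifact, score
    return best, best_score
```

留存历史最优而不是只取末轮,对应正文后面的观察:分数总体上升,但作者常更喜欢某个中间迭代。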

跨多次运行,评估器的评分会随着迭代提升,随后进入平台期,仍留有提升空间。有些生成是渐进式的精修,有些则会在迭代间发生明显的审美急转。

评分标准的措辞会以一些我没完全预料到的方式牵引生成器。比如加入"最好的设计是博物馆级"这样的语句,会把设计推向某种特定的视觉收敛,这暗示评分标准相关的提示本身,直接塑造了输出的性格。

分数总体会随迭代上升,但这种模式并不总是干净的线性。后期实现往往整体更好,但我也经常更喜欢某个中间迭代,而不是最后一版。实现复杂度也会在轮次里上升,生成器会为了回应评估器反馈而尝试更有野心的方案。甚至在第一轮,输出就明显好于完全不加提示的基线,这说明评分标准与其关联语言本身,就能在还没有评估器反馈之前,把模型从泛化默认值里拉出来,并为后续精修铺路。

一个很典型的例子是,我让模型为一家荷兰艺术博物馆做网站。到第九轮时,它已经做出一个干净、深色主题的虚构博物馆落地页,页面精致,基本也符合我的预期。但在第十轮,它把思路完全推翻,把网站重构为一种空间体验:用 CSS 透视渲染出棋盘格地面、把作品以自由布局挂在墙上,并用门洞在不同展厅之间导航,而不是滚动或点击列表。这种创意跳跃,是我以前没在单次生成里见过的。

扩展到全栈开发

有了这些发现,我把这套 GAN 启发的模式应用到全栈开发上。生成器-评估器循环与软件开发生命周期天然对应:代码审查与 QA 在结构上承担的角色,与设计评估器是一致的。

架构

在我们更早的 long-running harness 中,我们用初始化智能体、按功能逐个推进的编码智能体,以及会话之间的上下文重置,解决了跨多会话编码的连贯性。上下文重置是关键解法:那套 harness 用的是 Sonnet 4.5,而它正如前文所述,会表现出 上下文焦虑。要让模型保持在任务上,就必须构建一个能很好跨重置运行的 harness。Opus 4.5 在很大程度上自行消除了这种行为,因此我在这套 harness 里完全去掉了上下文重置。所有智能体在整个构建期间以一次连续会话运行,并由 Claude Agent SDK 的自动压缩在过程中处理上下文增长。

这次工作是在原始 harness 的基础上,构建了一个三智能体系统,每个智能体对应我在此前运行里观察到的一个具体缺口。系统包含以下人格设定:

Planner:之前的 long-running harness 需要用户一开始就提供很详细的规格。我想把这步自动化,于是做了一个规划智能体:给它 1 到 4 句的简单提示,它会扩写成完整的产品规格。我提示它在范围上要有野心,同时聚焦产品语境与高层技术设计,而不是具体的技术实现细节。这样强调,是因为担心规划者如果一开始就写得过细、还写错了,规格里的错误会在后续实现里层层放大。更聪明的做法似乎是约束智能体最终要交付的东西,让它们在工作过程中自己找到路径。我还要求规划者主动寻找机会,把 AI 功能织进产品规格里。(示例见底部附录。)

Generator:早期 harness 的按功能逐个推进方法,对范围管理很有效。我在这里沿用了类似思路,让生成器以 sprint 形式工作,每次从规格里拾取一个功能实现。每个 sprint 都用 React、Vite、FastAPI 和 SQLite(后期换成 PostgreSQL)的技术栈来实现应用,并要求生成器在每个 sprint 结束时先做自评,再交给 QA。它也可以使用 git 做版本控制。

Evaluator:早期 harness 生成的应用往往看起来很惊艳,但你真用起来仍会遇到实打实的 bug。为捕捉这些问题,评估器用 Playwright MCP 像真实用户一样点击运行中的应用,测试 UI 功能、API 端点和数据库状态。随后它会把每个 sprint 的结果,按它发现的 bug 以及一套从前端实验迁移并改写的标准来打分,这套标准覆盖产品深度、功能性、视觉设计与代码质量。每条标准都有硬阈值,只要有一条低于阈值,这个 sprint 就算失败,生成器会收到对问题原因的细致反馈。
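"任一标准低于硬阈值即判 sprint 失败"可以概括成一个很小的门控函数(阈值数值与英文键名均为假设,原文未公开):

```python
# 每条标准的硬阈值(0-10 分制,数值为假设)
THRESHOLDS = {"product_depth": 7, "functionality": 8, "visual_design": 6, "code_quality": 7}

def sprint_passes(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """任一标准低于硬阈值即判失败,并返回未达标项,供生成器针对性整改。"""
    failing = [k for k, t in THRESHOLDS.items() if scores.get(k, 0) < t]
    return (not failing, failing)
```

与加权总分不同,硬阈值保证单项短板无法被其他高分项摊平,这正是"只要有一条低于阈值,这个 sprint 就算失败"的含义。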

每个 sprint 开始前,生成器与评估器会协商一个 sprint 合同:在写任何代码之前,先就这一块工作"做到什么算完成"达成一致。之所以需要这一步,是因为产品规格刻意保持高层,我希望用一个步骤把用户故事与可测试的实现连接起来。生成器会提出它要构建什么、如何验证成功;评估器会审阅提案,确保生成器在做对的事。双方会迭代直到达成一致。

沟通通过文件完成:一个智能体写文件,另一个智能体读文件,并在同一个文件里回应,或另写一个新文件让前者再读取。生成器会以达成一致的合同为约束来开发,然后把工作交给 QA。这让实现能忠实对齐规格,又避免过早把实现细节写死。
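"写文件-读文件-回应"的协商回合可以示意如下;文件名、字段名与轮数上限均为假设,propose 与 review 代表两个真实智能体的调用:

```python
import json
import pathlib

def negotiate_contract(propose, review, workdir=".", max_rounds=5):
    """生成器写 proposal.json,评估器读后写 review.json;循环直到 approved。
    propose(feedback) -> dict 与 review(proposal) -> dict 均为假设的智能体接口。"""
    wd = pathlib.Path(workdir)
    feedback = None
    for _ in range(max_rounds):
        proposal = propose(feedback)                     # 生成器提出"做什么、怎么验收"
        (wd / "proposal.json").write_text(json.dumps(proposal, ensure_ascii=False))
        verdict = review(json.loads((wd / "proposal.json").read_text()))
        (wd / "review.json").write_text(json.dumps(verdict, ensure_ascii=False))
        if verdict.get("approved"):
            return proposal                              # 达成一致的 sprint 合同
        feedback = verdict.get("feedback")               # 不通过则带着意见再提一版
    raise RuntimeError("协商轮数超限,未能达成合同")
```

文件既是通信信道也是持久记录:事后读日志复盘时,每一轮提案与否决意见都还在磁盘上。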

运行 harness

这套 harness 的第一版,我使用 Claude Opus 4.5,并把用户提示分别送进完整 harness 与单智能体系统做对比。我之所以选 Opus 4.5,是因为开始实验时它是我们最强的编码模型。

我用下面的提示生成一个复古视频游戏制作器:

创建一个 2D 复古游戏制作器,功能包括关卡编辑器、Sprite 编辑器、实体行为,以及可游玩的测试模式。

下表展示了 harness 类型、运行时长与总成本。

Harness 类型     时长      成本
Solo            20 分钟   $9
Full harness    6 小时    $200

完整 harness 的成本超过 20 倍,但输出质量的差异立刻就能看出来。

我期待的是一个界面:可以搭建关卡及其组成部分(sprites、entities、图块布局),然后点 play 直接玩这个关卡。我先打开了 solo 运行的结果,最初这个应用看起来符合预期。

但在我一路点下去时,问题开始出现。布局浪费空间,固定高度的面板让大部分视口空着。工作流很僵硬。尝试往关卡里填东西时,它会提示我先创建 sprites 和 entities,但 UI 里没有任何引导告诉我该按这个顺序走。更关键的是,实际游戏坏了:实体能出现在屏幕上,但输入没有任何响应。深入代码后发现,实体定义与游戏运行时之间的连线断了,而且界面上完全看不出哪里出了问题。


打开 solo harness 生成的应用时的初始界面。

在 solo harness 生成的 Sprite 编辑器里创建一个 sprite

尝试游玩我创建的关卡,但没有成功

评估完 solo 运行后,我把注意力转向完整 harness 运行。它同样从一句话提示开始,但规划步骤把提示扩写成一个包含 16 个功能、分布在 10 个 sprint 里的规格,远超 solo 运行所尝试的范围。除了核心编辑器与游玩模式,规格还要求 sprite 动画系统、行为模板、音效与音乐、AI 辅助的 sprite 生成器与关卡设计器,以及可分享链接的游戏导出。我让规划者可以访问我们的 frontend design skill,它会阅读并把其中内容用于为应用制定一套视觉设计语言,作为规格的一部分。每个 sprint 中,生成器与评估器都会协商一份合同,定义该 sprint 的具体实现细节,以及用于验证完成度的可测试行为。

这个应用立刻比 solo 运行显得更精致、更顺滑。画布占满了整个视口,面板尺寸合理,界面也有一致的视觉识别,并跟随规格里的设计方向。solo 运行里那些笨拙感仍有一部分存在——工作流依然不清楚提示你应该先做 sprites 和 entities 再去填关卡,我只能靠自己到处试出来。这更像是底座模型的产品直觉缺口,而不是这套 harness 专门要解决的东西,不过它也提示了一个方向:在 harness 内做更有针对性的迭代,可能还能继续提升输出质量。

沿着编辑器一路用下去,新运行相对 solo 的优势更明显了。Sprite 编辑器更丰富、更完整:工具栏更清爽、取色器更好、缩放控制也更顺手。

因为我要求规划者把 AI 功能织进规格里,应用还内置了 Claude 集成,让我可以通过提示生成游戏的不同部分,工作流因此快了很多。


初始界面:在完整 harness 构建的应用里创建一个新游戏

Sprite 编辑器更干净,也更好用

使用内置 AI 功能生成关卡

使用内置 AI 功能生成关卡

游玩我生成的游戏

最大的差异出现在游玩模式里。我真的能控制实体移动并玩起来。物理系统还有些粗糙——角色跳到平台上后会和平台重叠在一起,这在直觉上不对——但核心功能能用,这是 solo 运行没做到的。我走动了一会儿后,也碰到 AI 在关卡构建上的限制:有一堵大墙怎么都跳不过去,直接卡死。这说明 harness 可能还能处理一些常识性改进与边界情况,让应用更完善。

读日志时可以清楚看到,评估器把实现严格拉回规格。每个 sprint,它都会按 sprint 合同里的测试标准逐项走一遍,并通过 Playwright 操作运行中的应用;凡是和预期行为不一致的地方都会提 bug。合同非常细——仅 Sprint 3 就有 27 条标准覆盖关卡编辑器——而评估器的发现也足够具体,无需额外排查就能直接动手修。下表是评估器识别问题的几个例子:

• 合同条款:矩形填充工具允许点击拖拽,用所选图块填充矩形区域
  评估发现:失败——工具只在拖拽起点/终点放置图块,没有填满区域。fillRectangle 函数存在,但在 mouseUp 时没有正确触发。
• 合同条款:用户可以选择并删除已放置的实体生成点
  评估发现:失败——LevelEditor.tsx:892 的 Delete 键处理要求同时设置 selection 和 selectedEntityId,但点击实体只会设置 selectedEntityId。条件应为 selection || (selectedEntityId && activeLayer === entity)。
• 合同条款:用户可通过 API 重新排序动画帧
  评估发现:失败——PUT /frames/reorder 路由定义在 /{frame_id} 路由之后。FastAPI 会把 reorder 当作 frame_id 整数匹配,并返回 422:无法将字符串解析为整数。

要把评估器拉到这种表现,需要下不少功夫。开箱状态下,Claude 做 QA 并不强。早期运行里,我看着它识别出真实问题,然后又自己把自己说服,觉得问题不大,照样放行。它也倾向于浅尝辄止,而不是去探边界情况,所以更隐蔽的 bug 常会漏掉。调参循环是读评估器日志,找出它的判断与我分歧的例子,再更新 QA 的提示来修正这些问题。跑了几轮这样的开发循环后,评估器才开始以我认为合理的方式打分。即便如此,harness 输出仍暴露了模型 QA 能力的边界:一些小的布局问题、某些交互在直觉上不够顺,以及更深层功能里仍有未被评估器充分覆盖的隐藏 bug。显然还有更多验证空间可以通过进一步调优来挖掘。但与 solo 运行相比(应用的核心功能干脆不能用),提升非常明显。

迭代优化 harness

第一批 harness 结果令人鼓舞,但也臃肿、缓慢且昂贵。下一步合乎逻辑的动作,是在不削弱效果的前提下简化 harness。这既是常识,也来自一个更一般的原则:harness 里的每个组件都编码了一个假设,假设模型自己做不到某件事;这些假设值得做压力测试,一方面因为它们可能本来就是错的,另一方面因为随着模型变强,它们也会很快过时。我们的博文 Building Effective Agents 把核心思想概括为:尽量找最简单的解法,只有在需要时才增加复杂度;这也是任何维护 agent harness 的人都会反复遇到的一种模式。

我第一次尝试简化时,大幅裁剪了 harness,并试了几个新的创意点子,但没能复现原先的表现。与此同时,也越来越难判断 harness 设计里哪些部分真正是承重的,以及它们具体通过什么方式承重。基于这段经历,我改用更方法化的方式:一次移除一个组件,然后回看它对最终结果的影响。

在这些迭代周期里,我们还发布了 Opus 4.6,这进一步推动我降低 harness 复杂度。很有理由相信 4.6 相比 4.5 需要更少的脚手架。我们的 发布博客 中写道:"[Opus 4.6] 规划更谨慎,能把智能体任务维持得更久,在大型代码库中运行更可靠,并且具备更强的代码审查与调试能力以捕捉自身错误;它在长上下文检索上也有显著提升。"这些正是 harness 原本用来补齐的能力。

移除 sprint 结构

我先把 sprint 结构整体移除了。sprint 的好处是把工作拆成块,让模型能更连贯地推进。考虑到 Opus 4.6 的提升,很有理由相信模型可以不靠这种拆解就原生完成任务。

我保留了规划者与评估者,因为它们各自仍然带来明显价值。没有规划者时,生成器会低估范围:面对原始提示,它会直接开建而不是先做规格,最后做出来的应用功能会比规划者扩写后的规格少得多。

移除 sprint 结构后,我把评估器从按 sprint 打分,改为在整次运行结束后做一次统一评估。由于模型更强,评估器在不同任务中的承重程度发生了变化,它是否有价值取决于任务相对模型单独可靠能力的边界位置。在 4.5 上,这条边界很近:我们的构建处在生成器独立完成能力的边缘,评估器能在全程捕捉到有意义的问题。在 4.6 上,模型原生能力提升,边界向外移动。过去需要评估器把关才能连贯实现的任务,现在常常已经落在生成器自己就能做好的范围内;对于落在边界内的任务,评估器就成了不必要的额外开销。但对于仍处在生成器能力边缘的部分,评估器依然能带来真实增益。

现实含义是:评估器不是一个固定的要么开要么关的选择。当任务超出当前模型单独可靠完成范围时,它就值得付出成本。

在结构简化的同时,我也增加了提示来改进 harness 把 AI 功能嵌进每个应用的方式,尤其是让生成器构建一个真正的 agent,并能通过工具驱动应用自身的功能。这需要实打实的迭代,因为相关知识足够新,Claude 的训练数据覆盖并不充分。但经过足够的调优后,生成器确实能把 agents 正确地搭起来。

更新版 harness 的结果

为了检验更新后的 harness,我用下面的提示生成一个数字音频工作站(DAW),也就是用于作曲、录音和混音的音乐制作软件:

用 Web Audio API 在浏览器中构建一个功能完备的 DAW。

这次运行仍然漫长且昂贵,大约 4 小时,token 成本约 $124。

大部分时间花在构建阶段:在不需要 Opus 4.5 那种 sprint 拆解的前提下,构建器仍能连贯地跑两个多小时。

Agent 阶段           时长            成本
Planner             4.7 分钟        $0.46
Build (Round 1)     2 小时 7 分钟   $71.08
QA (Round 1)        8.8 分钟        $3.24
Build (Round 2)     1 小时 2 分钟   $36.89
QA (Round 2)        6.8 分钟        $3.09
Build (Round 3)     10.9 分钟       $5.88
QA (Round 3)        9.6 分钟        $4.06
Total (V2 Harness)  3 小时 50 分钟  $124.70

和之前的 harness 一样,规划者把一行提示扩写成完整规格。从日志里能看到,生成器模型在规划应用与 agent 设计、把 agent 接上线、并在交给 QA 前进行测试等方面都做得不错。

尽管如此,QA 智能体仍然抓到了真实缺口。在第一轮反馈里,它指出:

这是一个很强的应用,设计还原度很高,AI agent 扎实,后端也不错。主要失败点在于功能完整性——虽然应用看起来很惊艳,AI 集成也跑得很好,但多个 DAW 核心功能只是展示,没有交互深度:音频片段无法在时间线上拖拽/移动,没有乐器 UI 面板(合成器旋钮、鼓垫),也没有可视化效果器编辑器(EQ 曲线、压缩器电平表)。这些不是边角问题——它们是让 DAW 真正可用的核心交互,而规格里明确要求了它们。

在第二轮反馈里,它又抓到了几处功能缺口:

仍存在的缺口:
- 音频录制仍是占位实现(按钮会切换,但没有麦克风采集)
- 未实现通过拖拽边缘缩放片段,以及片段切割
- 效果器可视化是数字滑杆而不是图形(没有 EQ 曲线)

当生成器独立工作时,仍然容易漏掉细节或把功能做成占位,而 QA 依然能在最后一公里抓到这些问题,让生成器去补齐。

基于这个提示,我期待的是一个程序:能写旋律、和声与鼓点,把它们编排成一首歌,并且一路有一个集成的 agent 提供帮助。下面的视频展示了结果。

这个应用离专业音乐制作软件还差得很远,agent 的作曲能力显然也还有很大提升空间。此外,Claude 实际上听不见,这让 QA 反馈回路在音乐审美方面的效果打了折扣。

但最终应用具备了一个可用音乐制作程序的所有核心部件:一个能在浏览器里工作的编排视图、调音台与传输控制。更进一步,我还能完全通过提示拼出一段短小的音乐片段:agent 设定速度与调式,铺出旋律,做出鼓轨,调整调音台电平,并加上混响。作曲所需的核心原语都在,agent 也能用工具自主驱动它们,从头到尾做出一个简单的制作成品。也许还谈不上分毫不差,但已经越来越接近了。

接下来会怎样

随着模型持续进步,我们大致可以预期它们能工作更久,也能应对更复杂的任务。有些情况下,这意味着包在模型外的脚手架会随着时间变得不那么重要,开发者只需等下一代模型,某些问题就会自行消失。另一方面,模型越强,可开发的空间越大:我们可以构建 harness,让它完成远超模型基线能力的复杂任务。

基于此,这项工作有几条值得带走的经验。对目标模型做实验、在真实问题上读它的 traces、并把它调到你想要的输出,是一直都值得的实践。面对更复杂的任务,有时把任务拆解,并为不同方面配置专门的智能体,仍有可挖的提升空间。当新模型上线时,通常也值得重新审视 harness:剥离那些不再承重的部分,同时加入新部件,以达成此前可能做不到的更强能力。

我从这项工作里得到的确信是:随着模型变强,有趣的 harness 组合空间不会缩小。它会移动,而 AI 工程师的有趣工作,就是不断找到下一种新的组合。

致谢

特别感谢 Mike Krieger、Michael Agaby、Justin Young、Jeremy Hadfield、David Hershey、Julius Tarng、Xiaoyi Zhang、Barry Zhang、Orowa Sidker、Michael Tingley、Ibrahim Madha、Martina Long 和 Canyon Robbins 对这项工作的贡献。

也感谢 Jake Eaton、Alyssa Leonard 和 Stef Sequeira 在塑造本文过程中提供的帮助。

附录

规划智能体生成的示例计划。

RetroForge - 2D Retro Game Maker

Overview
RetroForge is a web-based creative studio for designing and building 2D retro-style video games. It combines the nostalgic charm of classic 8-bit and 16-bit game aesthetics with modern, intuitive editing tools—enabling anyone from hobbyist creators to indie developers to bring their game ideas to life without writing traditional code.

The platform provides four integrated creative modules: a tile-based Level Editor for designing game worlds, a pixel-art Sprite Editor for crafting visual assets, a visual Entity Behavior system for defining game logic, and an instant Playable Test Mode for real-time gameplay testing. By weaving AI assistance throughout (powered by Claude), RetroForge accelerates the creative process—helping users generate sprites, design levels, and configure behaviors through natural language interaction.

RetroForge targets creators who love retro gaming aesthetics but want modern conveniences. Whether recreating the platformers, RPGs, or action games of their childhood, or inventing entirely new experiences within retro constraints, users can prototype rapidly, iterate visually, and share their creations with others.

Features
1. Project Dashboard & Management
The Project Dashboard is the home base for all creative work in RetroForge. Users need a clear, organized way to manage their game projects—creating new ones, returning to works-in-progress, and understanding what each project contains at a glance.

User Stories: As a user, I want to:

- Create a new game project with a name and description, so that I can begin designing my game
- See all my existing projects displayed as visual cards showing the project name, last modified date, and a thumbnail preview, so that I can quickly find and continue my work
- Open any project to enter the full game editor workspace, so that I can work on my game
- Delete projects I no longer need, with a confirmation dialog to prevent accidents, so that I can keep my workspace organized
- Duplicate an existing project as a starting point for a new game, so that I can reuse my previous work

Project Data Model: Each project contains:

Project metadata (name, description, created/modified timestamps)
Canvas settings (resolution: e.g., 256x224, 320x240, or 160x144)
Tile size configuration (8x8, 16x16, or 32x32 pixels)
Color palette selection
All associated sprites, tilesets, levels, and entity definitions

...



Harness design for long-running application development

Published Mar 24, 2026

Harness design is key to performance at the frontier of agentic coding. Here's how we pushed Claude further in frontend design and long-running autonomous software engineering.

Written by Prithvi Rajasekaran, a member of our Labs team.

Over the past several months I’ve been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude’s performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.

To break through, I sought out novel AI engineering approaches that held across two quite different domains, one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent. Building an evaluator that graded outputs reliably—and with taste—meant first developing a set of criteria that could turn subjective judgments like “is this design good?” into concrete, gradable terms.

I then applied these techniques to long-running autonomous coding, carrying over two lessons from our earlier harness work: decomposing the build into tractable chunks, and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.

Why naive implementations fall short

We've previously shown that harness design has a substantial impact on the effectiveness of long-running agentic coding. In an earlier experiment, we used an initializer agent to decompose a product spec into a task list, and a coding agent that implemented the tasks one feature at a time before handing off artifacts to carry context across sessions. The broader developer community has converged on similar insights, with approaches like the Ralph Wiggum method using hooks or scripts to keep agents in continuous iteration cycles.

But some problems remained persistent. For more complex tasks, the agent still tends to go off the rails over time. While decomposing this issue, we observed two common failure modes with agents executing these sorts of tasks.

First is that models tend to lose coherence on lengthy tasks as the context window fills (see our post on context engineering). Some models also exhibit context anxiety, in which they begin wrapping up work prematurely as they approach what they believe is their context limit. Context resets—clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent's state and the next steps—address both these issues.

This differs from compaction, where earlier parts of the conversation are summarized in place so the same agent can keep going on a shortened history. While compaction preserves continuity, it doesn't give the agent a clean slate, which means context anxiety can still persist. A reset provides a clean slate, at the cost of requiring a handoff artifact with enough state for the next agent to pick up the work cleanly. In our earlier testing, we found Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn't sufficient to enable strong long-task performance, so context resets became essential to the harness design. This solves the core issue, but adds orchestration complexity, token overhead, and latency to each harness run.

A second issue, which we haven't previously addressed, is self-evaluation. When asked to evaluate work they've produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre. This problem is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Whether a layout feels polished or generic is a judgment call, and agents reliably skew positive when grading their own work.

However, even on tasks that do have verifiable outcomes, agents still sometimes exhibit poor judgment that impedes their performance while completing the task. Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue. The separation doesn't immediately eliminate that leniency on its own; the evaluator is still an LLM that is inclined to be generous towards LLM-generated outputs. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work, and once that external feedback exists, the generator has something concrete to iterate against.

Frontend design: making subjective quality gradable

I started by experimenting on frontend design, where the self-evaluation issue was most visible. Absent any intervention, Claude normally gravitates toward safe, predictable layouts that are technically functional but visually unremarkable.

Two insights shaped the harness I built for frontend design. First, while aesthetics can't be fully reduced to a score—and individual tastes will always vary—they can be improved with grading criteria that encode design principles and preferences. "Is this design beautiful?" is hard to answer consistently, but "does this follow our principles for good design?" gives Claude something concrete to grade against. Second, by separating frontend generation from frontend grading, we can create a feedback loop that drives the generator toward stronger outputs.

With this in mind, I wrote four grading criteria that I gave to both the generator and evaluator agents in their prompts:

  • Design quality: Does the design feel like a coherent whole rather than a collection of parts? Strong work here means the colors, typography, layout, imagery, and other details combine to create a distinct mood and identity.

  • Originality: Is there evidence of custom decisions, or is this template layouts, library defaults, and AI-generated patterns? A human designer should recognize deliberate creative choices. Unmodified stock components—or telltale signs of AI generation like purple gradients over white cards—fail here.

  • Craft: Technical execution: typography hierarchy, spacing consistency, color harmony, contrast ratios. This is a competence check rather than a creativity check. Most reasonable implementations do fine here by default; failing means broken fundamentals.

  • Functionality: Usability independent of aesthetics. Can users understand what the interface does, find primary actions, and complete tasks without guessing?

I emphasized design quality and originality over craft and functionality. Claude already scored well on craft and functionality by default, as the required technical competence tended to come naturally to the model. But on design and originality, Claude often produced outputs that were bland at best. The criteria explicitly penalized highly generic “AI slop” patterns, and by weighting design and originality more heavily it pushed the model toward more aesthetic risk-taking.

I calibrated the evaluator using few-shot examples with detailed score breakdowns. This ensured the evaluator’s judgment aligned with my preferences, and reduced score drift across iterations.

I built the loop on the Claude Agent SDK, which kept the orchestration straightforward. A generator agent first created an HTML/CSS/JS frontend based on a user prompt. I gave the evaluator the Playwright MCP, which let it interact with the live page directly before scoring each criterion and writing a detailed critique. In practice, the evaluator would navigate the page on its own, screenshotting and carefully studying the implementation before producing its assessment. That feedback flowed back to the generator as input for the next iteration. I ran 5 to 15 iterations per generation, with each iteration typically pushing the generator in a more distinctive direction as it responded to the evaluator's critique. Because the evaluator was actively navigating the page rather than scoring a static screenshot, each cycle took real wall-clock time. Full runs stretched up to four hours. I also instructed the generator to make a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn't working.

Across runs, the evaluator's assessments improved over iterations before plateauing, with headroom still remaining. Some generations refined incrementally. Others took sharp aesthetic turns between iterations.

The wording of the criteria steered the generator in ways I didn't fully anticipate. Including phrases like "the best designs are museum quality" pushed designs toward a particular visual convergence, suggesting that the prompting associated with the criteria directly shaped the character of the output.

While scores generally improved over iterations, the pattern was not always cleanly linear. Later implementations tended to be better as a whole, but I regularly saw cases where I preferred a middle iteration over the last one. Implementation complexity also tended to increase across rounds, with the generator reaching for more ambitious solutions in response to the evaluator’s feedback. Even on the first iteration, outputs were noticeably better than a baseline with no prompting at all, suggesting the criteria and associated language themselves steered the model away from generic defaults before any evaluator feedback led to further refinement.

In one notable example, I prompted the model to create a website for a Dutch art museum. By the ninth iteration, it had produced a clean, dark-themed landing page for a fictional museum. The page was visually polished but largely in line with my expectations. Then, on the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls in free-form positions, and doorway-based navigation between gallery rooms instead of scroll or click. It was the kind of creative leap that I hadn't seen before from a single-pass generation.

Scaling to full-stack coding

With these findings in hand, I applied this GAN-inspired pattern to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle, where code review and QA serve the same structural role as the design evaluator.

The architecture

In our earlier long-running harness, we had solved for coherent multi-session coding with an initializer agent, a coding agent that worked one feature at a time, and context resets between sessions. Context resets were a key unlock: the harness used Sonnet 4.5, which exhibited the "context anxiety" tendency mentioned earlier. Creating a harness that worked well across context resets was key to keeping the model on task. Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK's automatic compaction handling context growth along the way.

For this work I built on the foundation from the original harness with a three-agent system, with each agent addressing a specific gap I'd observed in prior runs. The system contained the following agent personas:

Planner: Our previous long-running harness required the user to provide a detailed spec upfront. I wanted to automate that step, so I created a planner agent that took a simple 1-4 sentence prompt and expanded it into a full product spec. I prompted it to be ambitious about scope and to stay focused on product context and high level technical design rather than detailed technical implementation. This emphasis was due to the concern that if the planner tried to specify granular technical details upfront and got something wrong, the errors in the spec would cascade into the downstream implementation. It seemed smarter to constrain the agents on the deliverables to be produced and let them figure out the path as they worked. I also asked the planner to find opportunities to weave AI features into the product specs. (See example in the Appendix at the bottom.)

Generator: The one-feature-at-a-time approach from the earlier harness worked well for scope management. I applied a similar model here, instructing the generator to work in sprints, picking up one feature at a time from the spec. Each sprint implemented the app with a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack, and the generator was instructed to self-evaluate its work at the end of each sprint before handing off to QA. It also had git for version control.

Evaluator: Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states. It then graded each sprint against both the bugs it had found and a set of criteria modeled on the frontend experiment, adapted here to cover product depth, functionality, visual design, and code quality. Each criterion had a hard threshold, and if any one fell below it, the sprint failed and the generator got detailed feedback on what went wrong.

Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what "done" looked like for that chunk of work before any code was written. This existed because the product spec was intentionally high-level, and I wanted a step to bridge the gap between user stories and testable implementation. The generator proposed what it would build and how success would be verified, and the evaluator reviewed that proposal to make sure the generator was building the right thing. The two iterated until they agreed.

Communication was handled via files: one agent would write a file, another agent would read it and respond either within that file or with a new file that the previous agent would read in turn. The generator then built against the agreed-upon contract before handing the work off to QA. This kept the work faithful to the spec without over-specifying implementation too early.
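A minimal sketch of that file-based handoff, with an assumed JSON message shape and file naming (the post only says one agent writes a file and the other reads and responds; everything else here is illustrative):

```python
# Hypothetical contract file protocol: generator proposes, evaluator
# responds inside the same file, looping until status is "approved".
import json
from pathlib import Path

def write_proposal(workdir: Path, sprint: int, proposal: dict) -> Path:
    """Generator writes its sprint proposal for the evaluator to read."""
    path = workdir / f"sprint_{sprint}_contract.json"
    path.write_text(json.dumps({"proposal": proposal, "status": "pending"}))
    return path

def review_contract(path: Path, approved: bool, notes: str) -> dict:
    """Evaluator responds within the same file, as the harness describes."""
    contract = json.loads(path.read_text())
    contract["status"] = "approved" if approved else "changes_requested"
    contract["review_notes"] = notes
    path.write_text(json.dumps(contract))
    return contract
```

Keeping the negotiation on disk means each agent's turn is inspectable after the run, which is what made the logs quoted later in this post possible to audit.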

Running the harness

For the first version of this harness, I used Claude Opus 4.5, running user prompts against both the full harness and a single-agent system for comparison. Opus 4.5 was our best coding model when I began these experiments.

I wrote the following prompt to generate a retro video game maker:

Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode.

The table below shows the harness type, length it ran for, and the total cost.

Harness         Duration   Cost
Solo            20 min     $9
Full harness    6 hr       $200

The harness was over 20x more expensive, but the difference in output quality was immediately apparent.

I was expecting an interface where I could construct a level and its component parts (sprites, entities, tile layout) then hit play to actually play the level. I started by opening the solo run’s output, and the initial application seemed in line with those expectations.

As I clicked through, however, issues started to emerge. The layout wasted space, with fixed-height panels leaving most of the viewport empty. The workflow was rigid. Trying to populate a level prompted me to create sprites and entities first, but nothing in the UI guided me toward that sequence. More to the point, the actual game was broken. My entities appeared on screen but nothing responded to input. Digging into the code revealed that the wiring between entity definitions and the game runtime was broken, with no surface indication of where.


Initial screen when opening the app created by the solo harness.

Creating a sprite in the sprite editor made by the solo harness

Trying unsuccessfully to play the level I created

After evaluating the solo run, I turned my attention to the harness run. This run started from the same one-sentence prompt, but the planner step expanded that prompt into a 16-feature spec spread across ten sprints. It went well beyond what the solo run attempted. In addition to the core editors and play mode, the spec called for a sprite animation system, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and game export with shareable links. I gave the planner access to our frontend design skill, which it read and used to create a visual design language for the app as part of the spec. For each sprint, the generator and evaluator negotiated a contract defining the specific implementation details for the sprint, and the testable behaviors that would verify completion.

The app immediately showed more polish and smoothness than the solo run. The canvas used the full viewport, the panels were sized sensibly, and the interface had a consistent visual identity that tracked the design direction from the spec. Some of the clunkiness I'd seen in the solo run did remain—the workflow still didn't make it clear that you should build sprites and entities before trying to populate a level, and I had to figure that out by poking around. This read as a gap in the base model's product intuition rather than something the harness was designed to address, though it did suggest a place where targeted iteration inside the harness could help to further improve output quality.

Working through the editors, the new run's advantages over solo became more apparent. The sprite editor was richer and more fully featured, with cleaner tool palettes, a better color picker, and more usable zoom controls.

Because I'd asked the planner to weave AI features into its specs, the app also came with a built-in Claude integration that let me generate different parts of the game through prompting. This significantly sped up the workflow.


Initial screen: Creating a new game, in the app built with the full harness

The sprite editor felt cleaner and easier to use

Using the built-in AI feature to generate the level

Using the built-in AI feature to generate the level

Playing the game I generated

The biggest difference was in play mode. I was actually able to move my entity and play the game. The physics had some rough edges—my character jumped onto a platform but ended up overlapping with it, which felt intuitively wrong—but the core thing worked, which the solo run did not manage. After moving around a bit, I did hit some limitations with the AI’s game level construction. There was a large wall that I wasn’t able to jump past, so I was stuck. This suggested there were some common sense improvements and edge cases that the harness could handle to further refine the app.

Reading through the logs, it was clear that the evaluator kept the implementation in line with the spec. Each sprint, it walked through the sprint contract's test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. The contracts were granular—Sprint 3 alone had 27 criteria covering the level editor—and the evaluator's findings were specific enough to act on without extra investigation. The examples below show several issues our evaluator identified:

- Criterion: Rectangle fill tool allows click-drag to fill a rectangular area with selected tile
  Finding: FAIL — Tool only places tiles at drag start/end points instead of filling the region. fillRectangle function exists but isn't triggered properly on mouseUp.
- Criterion: User can select and delete placed entity spawn points
  Finding: FAIL — Delete key handler at LevelEditor.tsx:892 requires both selection and selectedEntityId to be set, but clicking an entity only sets selectedEntityId. Condition should be selection || (selectedEntityId && activeLayer === 'entity').
- Criterion: User can reorder animation frames via API
  Finding: FAIL — PUT /frames/reorder route defined after /{frame_id} routes. FastAPI matches "reorder" as a frame_id integer and returns 422: unable to parse string as an integer.
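The frames/reorder finding above is a classic route-ordering bug: FastAPI matches path operations in declaration order, so with an integer-typed `frame_id`, a later static `/frames/reorder` route is shadowed and the request 422s. The mini first-match router below is a simplified stand-in (not FastAPI's actual API) that shows why declaring static routes before parameterized ones fixes it:

```python
# Toy first-match router illustrating declaration-order matching.
def match(routes: list[str], path: str) -> str:
    """Return the first route pattern whose shape matches `path`."""
    for pattern in routes:
        p_parts, parts = pattern.split("/"), path.split("/")
        if len(p_parts) == len(parts) and all(
            pp.startswith("{") or pp == pt for pp, pt in zip(p_parts, parts)
        ):
            return pattern
    return ""

# Buggy order: the parameterized route swallows "reorder".
buggy = ["/frames/{frame_id}", "/frames/reorder"]
# Fixed order: static routes declared first.
fixed = ["/frames/reorder", "/frames/{frame_id}"]
```

In real FastAPI the same principle applies: register `@app.put("/frames/reorder")` before `@app.put("/frames/{frame_id}")`.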

Getting the evaluator to perform at this level took work. Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren't a big deal and approve the work anyway. It also tended to test superficially, rather than probing edge cases, so more subtle bugs often slipped through. The tuning loop was to read the evaluator's logs, find examples where its judgment diverged from mine, and update the QA prompt to address those issues. It took several rounds of this development loop before the evaluator was grading in a way that I found reasonable. Even then, the harness output showed the limits of the model's QA capabilities: small layout issues, interactions that felt unintuitive in places, and undiscovered bugs in more deeply nested features that the evaluator hadn't exercised thoroughly. There was clearly more verification headroom to capture with further tuning. But compared to the solo run, where the central feature of the application simply didn't work, the lift was obvious.

Iterating on the harness

The first set of harness results was encouraging, but it was also bulky, slow, and expensive. The logical next step was to find ways to simplify the harness without degrading its performance. This was partly common sense and partly a function of a more general principle: every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress-testing, both because they may be incorrect, and because they can quickly go stale as models improve. Our blog post Building Effective Agents frames the underlying idea as "find the simplest solution possible, and only increase complexity when needed," and it's a pattern that shows up consistently for anyone maintaining an agent harness.

In my first attempt to simplify, I cut the harness back radically and tried a few creative new ideas, but I wasn't able to replicate the performance of the original. It also became difficult to tell which pieces of the harness design were actually load-bearing, and in what ways. Based on that experience, I moved to a more methodical approach, removing one component at a time and reviewing what impact it had on the final result.

As I was going through these iteration cycles, we also released Opus 4.6, which provided further motivation to reduce harness complexity. There was good reason to expect 4.6 would need less scaffolding than 4.5 did. From our launch blog: "[Opus 4.6] plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes." It also improved substantially on long-context retrieval. These were all capabilities the harness had been built to supplement.

Removing the sprint construct

I started by removing the sprint construct entirely. The sprint structure had helped decompose the work into chunks the model could handle coherently. Given the improvements in Opus 4.6, there was good reason to believe that the model could natively handle the job without this sort of decomposition.

I kept both the planner and evaluator, as each continued to add obvious value. Without the planner, the generator under-scoped: given the raw prompt, it would start building without first speccing its work, and end up creating a less feature-rich application than the planner did.

With the sprint construct removed, I moved the evaluator to a single pass at the end of the run rather than grading per sprint. Since the model was much more capable, the evaluator became less uniformly load-bearing, with its usefulness depending on where the task sat relative to what the model could do reliably on its own. On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. On 4.6, the model's raw capability increased, so the boundary moved outward. Tasks that used to need the evaluator's check to be implemented coherently were now often within what the generator handled well on its own, and for tasks within that boundary, the evaluator became unnecessary overhead. But for the parts of the build that were still at the edge of the generator's capabilities, the evaluator continued to give real lift.

The practical implication is that the evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo.

Alongside the structural simplification, I also added prompting to improve how the harness built AI features into each app, specifically getting the generator to build a proper agent that could drive the app's own functionality through tools. That took real iteration, since the relevant knowledge is recent enough that Claude's training data covers it thinly. But with enough tuning, the generator was building agents correctly.

Results from the updated harness

To put the updated harness to the test, I used the following prompt to generate a Digital Audio Workstation (DAW), a music production program for composing, recording, and mixing songs:

Build a fully featured DAW in the browser using the Web Audio API.

The run was still lengthy and expensive, at about 4 hours and $124 in token costs.

Most of the time went to the builder, which ran coherently for over two hours without the sprint decomposition that Opus 4.5 had needed.

Agent phase          Duration       Cost
Planner              4.7 min        $0.46
Build (Round 1)      2 hr 7 min     $71.08
QA (Round 1)         8.8 min        $3.24
Build (Round 2)      1 hr 2 min     $36.89
QA (Round 2)         6.8 min        $3.09
Build (Round 3)      10.9 min       $5.88
QA (Round 3)         9.6 min        $4.06
Total (V2 harness)   3 hr 50 min    $124.70
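As a sanity check, the per-phase numbers above do sum to the reported totals (durations converted to minutes):

```python
# Per-phase (minutes, dollars) from the cost breakdown above.
phases = {
    "Planner": (4.7, 0.46),
    "Build (Round 1)": (127.0, 71.08),  # 2 hr 7 min
    "QA (Round 1)": (8.8, 3.24),
    "Build (Round 2)": (62.0, 36.89),   # 1 hr 2 min
    "QA (Round 2)": (6.8, 3.09),
    "Build (Round 3)": (10.9, 5.88),
    "QA (Round 3)": (9.6, 4.06),
}

total_minutes = sum(m for m, _ in phases.values())          # 229.8 min ≈ 3 hr 50 min
total_cost = round(sum(c for _, c in phases.values()), 2)   # $124.70
```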

As with the previous harness, the planner expanded the one-line prompt into a full spec. From the logs, I could see the generator model did a good job planning the app and the agent design, wiring the agent up, and testing it before handing off to QA.

That being said, the QA agent still caught real gaps. In its first-round feedback, it noted:

This is a strong app with excellent design fidelity, solid AI agent, and good backend. The main failure point is Feature Completeness — while the app looks impressive and the AI integration works well, several core DAW features are display-only without interactive depth: clips can't be dragged/moved on the timeline, there are no instrument UI panels (synth knobs, drum pads), and no visual effect editors (EQ curves, compressor meters). These aren't edge cases — they're the core interactions that make a DAW usable, and the spec explicitly calls for them.

In its second round feedback, it again caught several functionality gaps:

Remaining gaps: - Audio recording is still stub-only (button toggles but no mic capture) - Clip resize by edge drag and clip split not implemented - Effect visualizations are numeric sliders, not graphical (no EQ curve)

The generator was still liable to miss details or stub features when left to its own devices, and the QA still added value in catching those last mile issues for the generator to fix.

Based on the prompt, I was expecting a program where I could create melodies, harmonies, and drum patterns, arrange them into a song, and get help from an integrated agent along the way. The video below shows the result.

The app is far from a professional music production program, and the agents song composition skills could clearly use a lot of work. Additionally, Claude can’t actually hear, which made the QA feedback loop less effective with respect to musical taste.

But the final app had all the core pieces of a functional music production program: a working arrangement view, mixer, and transport running in the browser. Beyond that, I was able to put together a short song snippet entirely through prompting: the agent set the tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives for song composition were present, and the agent could drive them autonomously, using tools to create a simple production from end to end. You might say it’s not pitch-perfect yet—but it’s getting there.

扩展到全栈开发

有了这些发现,我把这套 GAN 启发的模式应用到全栈开发上。生成器-评估器循环与软件开发生命周期天然对应:代码审查与 QA 在结构上承担的角色,与设计评估器是一致的。

架构

在我们更早的 long-running harness 中,我们用初始化智能体、按功能逐个推进的编码智能体,以及会话之间的上下文重置,解决了跨多会话编码的连贯性。上下文重置是关键解法:那套 harness 用的是 Sonnet 4.5,而它正如前文所述,会表现出 上下文焦虑。要让模型保持在任务上,就必须构建一个能很好跨重置运行的 harness。Opus 4.5 在很大程度上自行消除了这种行为,因此我在这套 harness 里完全去掉了上下文重置。所有智能体在整个构建期间以一次连续会话运行,并由 Claude Agent SDK 的自动压缩在过程中处理上下文增长。

这次工作是在原始 harness 的基础上,构建了一个三智能体系统,每个智能体对应我在此前运行里观察到的一个具体缺口。系统包含以下人格设定:

Planner:之前的 long-running harness 需要用户一开始就提供很详细的规格。我想把这步自动化,于是做了一个规划智能体:给它 1 到 4 句的简单提示,它会扩写成完整的产品规格。我提示它在范围上要有野心,同时聚焦产品语境与高层技术设计,而不是具体的技术实现细节。这样强调,是因为担心规划者如果一开始就写得过细、还写错了,规格里的错误会在后续实现里层层放大。更聪明的做法似乎是约束智能体最终要交付的东西,让它们在工作过程中自己找到路径。我还要求规划者主动寻找机会,把 AI 功能织进产品规格里。(示例见底部附录。)

Generator:早期 harness 的按功能逐个推进方法,对范围管理很有效。我在这里沿用了类似思路,让生成器以 sprint 形式工作,每次从规格里拾取一个功能实现。每个 sprint 都用 React、Vite、FastAPI 和 SQLite(后期换成 PostgreSQL)的技术栈来实现应用,并要求生成器在每个 sprint 结束时先做自评,再交给 QA。它也可以使用 git 做版本控制。

Evaluator:早期 harness 生成的应用往往看起来很惊艳,但你真用起来仍会遇到实打实的 bug。为捕捉这些问题,评估器用 Playwright MCP 像真实用户一样点击运行中的应用,测试 UI 功能、API 端点和数据库状态。随后它会把每个 sprint 的结果,按它发现的 bug 以及一套从前端实验迁移并改写的标准来打分,这套标准覆盖产品深度、功能性、视觉设计与代码质量。每条标准都有硬阈值,只要有一条低于阈值,这个 sprint 就算失败,生成器会收到对问题原因的细致反馈。

每个 sprint 开始前,生成器与评估器会协商一个 sprint 合同:在写任何代码之前先对这一块工作做到什么算完成达成一致。之所以需要这一步,是因为产品规格刻意保持高层,我希望用一个步骤把用户故事与可测试的实现连接起来。生成器会提出它要构建什么、如何验证成功;评估器会审阅提案,确保生成器在做对的事。双方会迭代直到达成一致。

沟通通过文件完成:一个智能体写文件,另一个智能体读文件,并在同一个文件里回应,或另写一个新文件让前者再读取。生成器会以达成一致的合同为约束来开发,然后把工作交给 QA。这让实现能忠实对齐规格,又避免过早把实现细节写死。

运行 harness

这套 harness 的第一版,我使用 Claude Opus 4.5,并把用户提示分别送进完整 harness 与单智能体系统做对比。我之所以选 Opus 4.5,是因为开始实验时它是我们最强的编码模型。

我用下面的提示生成一个复古视频游戏制作器:

创建一个 2D 复古游戏制作器,功能包括关卡编辑器、Sprite 编辑器、实体行为,以及可游玩的测试模式。

下表展示了 harness 类型、运行时长与总成本。

Harness 类型     时长       成本
Solo            20 分钟    $9
Full harness    6 小时     $200

完整 harness 的成本超过 20 倍,但输出质量的差异立刻就能看出来。

我期待的是一个界面:可以搭建关卡及其组成部分(sprites、entities、图块布局),然后点 play 直接玩这个关卡。我先打开了 solo 运行的结果,最初这个应用看起来符合预期。

但在我一路点下去时,问题开始出现。布局浪费空间,固定高度的面板让大部分视口空着。工作流很僵硬。尝试往关卡里填东西时,它会提示我先创建 sprites 和 entities,但 UI 里没有任何引导告诉我该按这个顺序走。更关键的是,实际游戏坏了:实体能出现在屏幕上,但输入没有任何响应。深入代码后发现,实体定义与游戏运行时之间的连线断了,而且界面上完全看不出哪里出了问题。


打开 solo harness 生成的应用时的初始界面。

在 solo harness 生成的 Sprite 编辑器里创建一个 sprite

尝试游玩我创建的关卡,但没有成功

评估完 solo 运行后,我把注意力转向完整 harness 运行。它同样从一句话提示开始,但规划步骤把提示扩写成一个包含 16 个功能、分布在 10 个 sprint 里的规格,远超 solo 运行所尝试的范围。除了核心编辑器与游玩模式,规格还要求 sprite 动画系统、行为模板、音效与音乐、AI 辅助的 sprite 生成器与关卡设计器,以及可分享链接的游戏导出。我让规划者可以访问我们的 frontend design skill,它会阅读并把其中内容用于为应用制定一套视觉设计语言,作为规格的一部分。每个 sprint 中,生成器与评估器都会协商一份合同,定义该 sprint 的具体实现细节,以及用于验证完成度的可测试行为。

这个应用立刻比 solo 运行显得更精致、更顺滑。画布占满了整个视口,面板尺寸合理,界面也有一致的视觉识别,并跟随规格里的设计方向。solo 运行里那些笨拙感仍有一部分存在——工作流依然没有清楚提示你应该先做 sprites 和 entities 再去填关卡,我只能靠自己到处试出来。这更像是底座模型的产品直觉缺口,而不是这套 harness 专门要解决的东西,不过它也提示了一个方向:在 harness 内做更有针对性的迭代,可能还能继续提升输出质量。

沿着编辑器一路用下去,新运行相对 solo 的优势更明显了。Sprite 编辑器更丰富、更完整:工具栏更清爽、取色器更好、缩放控制也更顺手。

因为我要求规划者把 AI 功能织进规格里,应用还内置了 Claude 集成,让我可以通过提示生成游戏的不同部分,工作流因此快了很多。


初始界面:在完整 harness 构建的应用里创建一个新游戏

Sprite 编辑器更干净,也更好用

使用内置 AI 功能生成关卡

使用内置 AI 功能生成关卡

游玩我生成的游戏

最大的差异出现在游玩模式里。我真的能控制实体移动并玩起来。物理系统还有些粗糙——角色跳到平台上后会和平台重叠在一起,这在直觉上不对——但核心功能能用,这是 solo 运行没做到的。我走动了一会儿后,也碰到 AI 在关卡构建上的限制:有一堵大墙怎么都跳不过去,直接卡死。这说明 harness 可能还能处理一些常识性改进与边界情况,让应用更完善。

读日志时可以清楚看到,评估器把实现严格拉回规格。每个 sprint,它都会按 sprint 合同里的测试标准逐项走一遍,并通过 Playwright 操作运行中的应用;凡是和预期行为不一致的地方都会提 bug。合同非常细——仅 Sprint 3 就有 27 条标准覆盖关卡编辑器——而评估器的发现也足够具体,无需额外排查就能直接动手修。下面是评估器识别问题的几个例子:

- 合同条款:矩形填充工具允许点击拖拽,用所选图块填充矩形区域
  评估发现:失败 — 工具只在拖拽起点/终点放置图块,没有填满区域。fillRectangle 函数存在,但在 mouseUp 时没有正确触发。
- 合同条款:用户可以选择并删除已放置的实体生成点
  评估发现:失败 — LevelEditor.tsx:892 的 Delete 键处理要求同时设置 selection 和 selectedEntityId,但点击实体只会设置 selectedEntityId。条件应为 selection || (selectedEntityId && activeLayer === 'entity')。
- 合同条款:用户可通过 API 重新排序动画帧
  评估发现:失败 — PUT /frames/reorder 路由定义在 /{frame_id} 路由之后。FastAPI 会把 "reorder" 当作 frame_id 整数匹配,并返回 422:无法将字符串解析为整数。

要把评估器拉到这种表现,需要下不少功夫。开箱状态下,Claude 做 QA 并不强。早期运行里,我看着它识别出真实问题,然后又自己把自己说服,觉得问题不大,照样放行。它也倾向于浅尝辄止,而不是去探边界情况,所以更隐蔽的 bug 常会漏掉。调参循环是读评估器日志,找出它的判断与我分歧的例子,再更新 QA 的提示来修正这些问题。跑了几轮这样的开发循环后,评估器才开始以我认为合理的方式打分。即便如此,harness 输出仍暴露了模型 QA 能力的边界:一些小的布局问题、某些交互在直觉上不够顺,以及更深层功能里仍有未被评估器充分覆盖的隐藏 bug。显然还有更多验证空间可以通过进一步调优来挖掘。但与 solo 运行相比(应用的核心功能干脆不能用),提升非常明显。

迭代优化 harness

第一批 harness 结果令人鼓舞,但也臃肿、缓慢且昂贵。下一步合乎逻辑的动作,是在不削弱效果的前提下简化 harness。这既是常识,也来自一个更一般的原则:harness 里的每个组件都编码了一个假设,假设模型自己做不到某件事;这些假设值得做压力测试,一方面因为它们可能本来就是错的,另一方面因为随着模型变强,它们也会很快过时。我们的博文 Building Effective Agents 把核心思想概括为:尽量找最简单的解法,只有在需要时才增加复杂度;这也是任何维护 agent harness 的人都会反复遇到的一种模式。

我第一次尝试简化时,大幅裁剪了 harness,并试了几个新的创意点子,但没能复现原先的表现。与此同时,也越来越难判断 harness 设计里哪些部分真正是承重的,以及它们具体通过什么方式承重。基于这段经历,我改用更方法化的方式:一次移除一个组件,然后回看它对最终结果的影响。

在这些迭代周期里,我们还发布了 Opus 4.6,这进一步推动我降低 harness 复杂度。很有理由相信 4.6 相比 4.5 需要更少的脚手架。我们的 发布博客 中写到:[Opus 4.6] 规划更谨慎,能把智能体任务维持得更久,在大型代码库中运行更可靠,并且具备更强的代码审查与调试能力以捕捉自身错误;它在长上下文检索上也有显著提升。这些正是 harness 原本用来补齐的能力。

移除 sprint 结构

我先把 sprint 结构整体移除了。sprint 的好处是把工作拆成块,让模型能更连贯地推进。考虑到 Opus 4.6 的提升,很有理由相信模型可以不靠这种拆解就原生完成任务。

我保留了规划者与评估者,因为它们各自仍然带来明显价值。没有规划者时,生成器会低估范围:面对原始提示,它会直接开建而不是先做规格,最后做出来的应用功能会比规划者扩写后的规格少得多。

移除 sprint 结构后,我把评估器从按 sprint 打分,改为在整次运行结束后做一次统一评估。由于模型更强,评估器在不同任务中的承重程度发生了变化,它是否有价值取决于任务相对模型单独可靠能力的边界位置。在 4.5 上,这条边界很近:我们的构建处在生成器独立完成能力的边缘,评估器能在全程捕捉到有意义的问题。在 4.6 上,模型原生能力提升,边界向外移动。过去需要评估器把关才能连贯实现的任务,现在常常已经落在生成器自己就能做好的范围内;对于落在边界内的任务,评估器就成了不必要的额外开销。但对于仍处在生成器能力边缘的部分,评估器依然能带来真实增益。

现实含义是:评估器不是一个固定的要么开要么关的选择。当任务超出当前模型单独可靠完成范围时,它就值得付出成本。

在结构简化的同时,我也增加了提示来改进 harness 把 AI 功能嵌进每个应用的方式,尤其是让生成器构建一个真正的 agent,并能通过工具驱动应用自身的功能。这需要实打实的迭代,因为相关知识足够新,Claude 的训练数据覆盖并不充分。但经过足够的调优后,生成器确实能把 agents 正确地搭起来。

更新版 harness 的结果

为了检验更新后的 harness,我用下面的提示生成一个数字音频工作站(DAW),也就是用于作曲、录音和混音的音乐制作软件:

用 Web Audio API 在浏览器中构建一个功能完备的 DAW。

这次运行仍然漫长且昂贵,大约 4 小时,token 成本约 $124。

大部分时间花在构建阶段:在不需要 Opus 4.5 那种 sprint 拆解的前提下,构建器仍能连贯地跑两个多小时。

Agent 阶段            时长             成本
Planner              4.7 分钟         $0.46
Build (Round 1)      2 小时 7 分钟    $71.08
QA (Round 1)         8.8 分钟         $3.24
Build (Round 2)      1 小时 2 分钟    $36.89
QA (Round 2)         6.8 分钟         $3.09
Build (Round 3)      10.9 分钟        $5.88
QA (Round 3)         9.6 分钟         $4.06
Total (V2 harness)   3 小时 50 分钟   $124.70

和之前的 harness 一样,规划者把一行提示扩写成完整规格。从日志里能看到,生成器模型在规划应用与 agent 设计、把 agent 接上线、并在交给 QA 前进行测试等方面都做得不错。

尽管如此,QA 智能体仍然抓到了真实缺口。在第一轮反馈里,它指出:

这是一个很强的应用,设计还原度很高,AI agent 扎实,后端也不错。主要失败点在于功能完整性——虽然应用看起来很惊艳,AI 集成也跑得很好,但多个 DAW 核心功能只是展示,没有交互深度:音频片段无法在时间线上拖拽/移动,没有乐器 UI 面板(合成器旋钮、鼓垫),也没有可视化效果器编辑器(EQ 曲线、压缩器电平表)。这些不是边角问题——它们是让 DAW 真正可用的核心交互,而规格里明确要求了它们。

在第二轮反馈里,它又抓到了几处功能缺口:

仍存在的缺口: - 音频录制仍是占位实现(按钮会切换,但没有麦克风采集) - 未实现通过拖拽边缘缩放片段,以及片段切割 - 效果器可视化是数字滑杆而不是图形(没有 EQ 曲线)

当生成器独立工作时,仍然容易漏掉细节或把功能做成占位,而 QA 依然能在最后一公里抓到这些问题,让生成器去补齐。

基于这个提示,我期待的是一个程序:能写旋律、和声与鼓点,把它们编排成一首歌,并且一路有一个集成的 agent 提供帮助。下面的视频展示了结果。

这个应用离专业音乐制作软件还差得很远,agent 的作曲能力显然也还有很大提升空间。此外,Claude 实际上听不见,这让 QA 反馈回路在音乐审美方面的效果打了折扣。

但最终应用具备了一个可用音乐制作程序的所有核心部件:一个能在浏览器里工作的编排视图、调音台与传输控制。更进一步,我还能完全通过提示拼出一段短小的音乐片段:agent 设定速度与调式,铺出旋律,做出鼓轨,调整调音台电平,并加上混响。作曲所需的核心原语都在,agent 也能用工具自主驱动它们,从头到尾做出一个简单的制作成品。也许还谈不上分毫不差,但已经越来越接近了。

What comes next

As models continue to improve, we can roughly expect them to be capable of working for longer, and on more complex tasks. In some cases, that will mean the scaffold surrounding the model matters less over time, and developers can wait for the next model and see certain problems solve themselves. On the other hand, the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.

With this in mind, there are a few lessons from this work worth carrying forward. It is always good practice to experiment with the model you're building against, read its traces on realistic problems, and tune its performance to achieve your desired outcomes. When working on more complex tasks, there is sometimes headroom from decomposing the task and applying specialized agents to each aspect of the problem. And when a new model lands, it is generally good practice to re-examine a harness, stripping away pieces that are no longer load-bearing to performance and adding new pieces to achieve greater capability that may not have been possible before.

From this work, my conviction is that the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.

接下来会怎样

随着模型持续进步,我们大致可以预期它们能工作更久,也能应对更复杂的任务。有些情况下,这意味着包在模型外的脚手架会随着时间变得不那么重要,开发者只需等下一代模型,某些问题就会自行消失。另一方面,模型越强,可开发的空间越大:我们可以构建 harness,让它完成远超模型基线能力的复杂任务。

基于此,这项工作有几条值得带走的经验。对目标模型做实验、在真实问题上读它的 traces、并把它调到你想要的输出,是一直都值得的实践。面对更复杂的任务,有时把任务拆解,并为不同方面配置专门的智能体,仍有可挖的提升空间。当新模型上线时,通常也值得重新审视 harness:剥离那些不再承重的部分,同时加入新部件,以达成此前可能做不到的更强能力。

我从这项工作里得到的确信是:随着模型变强,有趣的 harness 组合空间不会缩小。它会移动,而 AI 工程师的有趣工作,就是不断找到下一种新的组合。

Acknowledgements

Special thanks to Mike Krieger, Michael Agaby, Justin Young, Jeremy Hadfield, David Hershey, Julius Tarng, Xiaoyi Zhang, Barry Zhang, Orowa Sidker, Michael Tingley, Ibrahim Madha, Martina Long, and Canyon Robbins for their contributions to this work.

Thanks also to Jake Eaton, Alyssa Leonard, and Stef Sequeira for their help shaping the post.

致谢

特别感谢 Mike Krieger、Michael Agaby、Justin Young、Jeremy Hadfield、David Hershey、Julius Tarng、Xiaoyi Zhang、Barry Zhang、Orowa Sidker、Michael Tingley、Ibrahim Madha、Martina Long 和 Canyon Robbins 对这项工作的贡献。

也感谢 Jake Eaton、Alyssa Leonard 和 Stef Sequeira 在塑造本文过程中提供的帮助。

Appendix

Example plan generated by planner agent.

RetroForge - 2D Retro Game Maker

Overview
RetroForge is a web-based creative studio for designing and building 2D retro-style video games. It combines the nostalgic charm of classic 8-bit and 16-bit game aesthetics with modern, intuitive editing tools—enabling anyone from hobbyist creators to indie developers to bring their game ideas to life without writing traditional code.

The platform provides four integrated creative modules: a tile-based Level Editor for designing game worlds, a pixel-art Sprite Editor for crafting visual assets, a visual Entity Behavior system for defining game logic, and an instant Playable Test Mode for real-time gameplay testing. By weaving AI assistance throughout (powered by Claude), RetroForge accelerates the creative process—helping users generate sprites, design levels, and configure behaviors through natural language interaction.

RetroForge targets creators who love retro gaming aesthetics but want modern conveniences. Whether recreating the platformers, RPGs, or action games of their childhood, or inventing entirely new experiences within retro constraints, users can prototype rapidly, iterate visually, and share their creations with others.

Features
1. Project Dashboard & Management
The Project Dashboard is the home base for all creative work in RetroForge. Users need a clear, organized way to manage their game projects—creating new ones, returning to works-in-progress, and understanding what each project contains at a glance.

User Stories: As a user, I want to:

- Create a new game project with a name and description, so that I can begin designing my game
- See all my existing projects displayed as visual cards showing the project name, last modified date, and a thumbnail preview, so that I can quickly find and continue my work
- Open any project to enter the full game editor workspace, so that I can work on my game
- Delete projects I no longer need, with a confirmation dialog to prevent accidents, so that I can keep my workspace organized
- Duplicate an existing project as a starting point for a new game, so that I can reuse my previous work

Project Data Model: Each project contains:

Project metadata (name, description, created/modified timestamps)
Canvas settings (resolution: e.g., 256x224, 320x240, or 160x144)
Tile size configuration (8x8, 16x16, or 32x32 pixels)
Color palette selection 
All associated sprites, tilesets, levels, and entity definitions

...




Harness design for long-running application development

Published Mar 24, 2026

Harness design is key to performance at the frontier of agentic coding. Here's how we pushed Claude further in frontend design and long-running autonomous software engineering.

Written by Prithvi Rajasekaran, a member of our Labs team.

Over the past several months I’ve been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude’s performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.

To break through, I sought out novel AI engineering approaches that held across two quite different domains, one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent. Building an evaluator that graded outputs reliably—and with taste—meant first developing a set of criteria that could turn subjective judgments like “is this design good?” into concrete, gradable terms.

I then applied these techniques to long-running autonomous coding, carrying over two lessons from our earlier harness work: decomposing the build into tractable chunks, and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.

Why naive implementations fall short

We've previously shown that harness design has a substantial impact on the effectiveness of long-running agentic coding. In an earlier experiment, we used an initializer agent to decompose a product spec into a task list, and a coding agent that implemented the tasks one feature at a time before handing off artifacts to carry context across sessions. The broader developer community has converged on similar insights, with approaches like the Ralph Wiggum method using hooks or scripts to keep agents in continuous iteration cycles.

But some problems remained persistent. For more complex tasks, the agent still tends to go off the rails over time. Digging into this issue, we observed two common failure modes in agents executing these sorts of tasks.

First is that models tend to lose coherence on lengthy tasks as the context window fills (see our post on context engineering). Some models also exhibit context anxiety, in which they begin wrapping up work prematurely as they approach what they believe is their context limit. Context resets—clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent's state and the next steps—address both of these issues.

This differs from compaction, where earlier parts of the conversation are summarized in place so the same agent can keep going on a shortened history. While compaction preserves continuity, it doesn't give the agent a clean slate, which means context anxiety can still persist. A reset provides a clean slate, at the cost of requiring a handoff artifact with enough state for the next agent to pick up the work cleanly. In our earlier testing, we found Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn't sufficient to enable strong performance on long tasks, so context resets became essential to the harness design. This solves the core issue, but adds orchestration complexity, token overhead, and latency to each harness run.
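The reset loop can be sketched as follows, assuming a `run_session` hook that wraps one fresh agent session and returns a handoff artifact plus a done flag (the real harness orchestrates this with the Claude Agent SDK; these names and the artifact shape are illustrative):

```python
# Illustrative reset-with-handoff loop: each iteration is a brand-new
# session (clean context), seeded only by the previous handoff artifact.
def run_with_resets(task, run_session, max_sessions=10):
    """run_session(task, handoff) -> (new_handoff, done). Returns final handoff."""
    handoff = None  # structured state carried across context resets
    for _ in range(max_sessions):
        handoff, done = run_session(task, handoff)
        if done:
            break
    return handoff
```

The whole burden falls on the handoff artifact: if it captures current state and next steps well, a fresh session can resume cleanly; if not, the reset loses work.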

A second issue, which we haven't previously addressed, is self-evaluation. When asked to evaluate work they've produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre. This problem is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Whether a layout feels polished or generic is a judgment call, and agents reliably skew positive when grading their own work.

However, even on tasks that do have verifiable outcomes, agents still sometimes exhibit poor judgment that impedes their performance while completing the task. Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue. The separation doesn't immediately eliminate that leniency on its own; the evaluator is still an LLM that is inclined to be generous towards LLM-generated outputs. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work, and once that external feedback exists, the generator has something concrete to iterate against.

Frontend design: making subjective quality gradable

I started by experimenting on frontend design, where the self-evaluation issue was most visible. Absent any intervention, Claude normally gravitates toward safe, predictable layouts that are technically functional but visually unremarkable.

Two insights shaped the harness I built for frontend design. First, while aesthetics can’t be fully reduced to a score—and individual tastes will always vary—they can be improved with grading criteria that encode design principles and preferences. “Is this design beautiful?” is hard to answer consistently, but “does this follow our principles for good design?” gives Claude something concrete to grade against. Second, by separating frontend generation from frontend grading, we can create a feedback loop that drives the generator toward stronger outputs.

With this in mind, I wrote four grading criteria that I gave to both the generator and evaluator agents in their prompts:

  • Design quality: Does the design feel like a coherent whole rather than a collection of parts? Strong work here means the colors, typography, layout, imagery, and other details combine to create a distinct mood and identity.

  • Originality: Is there evidence of custom decisions, or is this template layouts, library defaults, and AI-generated patterns? A human designer should recognize deliberate creative choices. Unmodified stock components—or telltale signs of AI generation like purple gradients over white cards—fail here.

  • Craft: Technical execution, covering typography hierarchy, spacing consistency, color harmony, and contrast ratios. This is a competence check rather than a creativity check. Most reasonable implementations do fine here by default; failing means broken fundamentals.

  • Functionality: Usability independent of aesthetics. Can users understand what the interface does, find primary actions, and complete tasks without guessing?

I emphasized design quality and originality over craft and functionality. Claude already scored well on craft and functionality by default, as the required technical competence tended to come naturally to the model. But on design and originality, Claude often produced outputs that were bland at best. The criteria explicitly penalized highly generic “AI slop” patterns, and weighting design and originality more heavily pushed the model toward more aesthetic risk-taking.
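One way to picture that emphasis is a weighted rubric, where the heavy criteria dominate the total. The criterion names follow the list above, but the weights, the 0–10 scale, and the scoring function are assumptions for this sketch, not the exact values used in the harness.

```python
# Illustrative rubric: design quality and originality weighted above
# craft and functionality, as described above. Weights are invented.
RUBRIC = {
    "design_quality": {"weight": 0.35, "question": "Coherent whole with a distinct mood?"},
    "originality":    {"weight": 0.35, "question": "Deliberate creative choices, not template defaults?"},
    "craft":          {"weight": 0.15, "question": "Typography, spacing, color, contrast executed well?"},
    "functionality":  {"weight": 0.15, "question": "Can users find actions and complete tasks?"},
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into one weighted total."""
    return sum(RUBRIC[name]["weight"] * scores[name] for name in RUBRIC)

# A generic-but-competent design scores high on craft and functionality
# yet still lands low overall, because the heavy criteria punish blandness.
generic = weighted_score({"design_quality": 4, "originality": 3, "craft": 9, "functionality": 9})
daring  = weighted_score({"design_quality": 8, "originality": 8, "craft": 7, "functionality": 8})
```

Under these toy weights, the technically clean but bland design totals about 5.2 while the riskier one totals about 7.9, which is the pressure toward aesthetic risk-taking the weighting is meant to create.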

I calibrated the evaluator using few-shot examples with detailed score breakdowns. This ensured the evaluator’s judgment aligned with my preferences, and reduced score drift across iterations.

I built the loop on the Claude Agent SDK, which kept the orchestration straightforward. A generator agent first created an HTML/CSS/JS frontend based on a user prompt. I gave the evaluator the Playwright MCP, which let it interact with the live page directly before scoring each criterion and writing a detailed critique. In practice, the evaluator would navigate the page on its own, screenshotting and carefully studying the implementation before producing its assessment. That feedback flowed back to the generator as input for the next iteration. I ran 5 to 15 iterations per generation, with each iteration typically pushing the generator in a more distinctive direction as it responded to the evaluator's critique. Because the evaluator was actively navigating the page rather than scoring a static screenshot, each cycle took real wall-clock time. Full runs stretched up to four hours. I also instructed the generator to make a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn't working.
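The control flow of that loop can be sketched schematically. In the sketch below, `generate` and `evaluate` are stand-ins for the Agent SDK calls (the generator writing HTML/CSS/JS, the evaluator driving the live page via Playwright MCP); the stubs and their return shapes are invented so the loop structure is runnable on its own.

```python
def run_loop(prompt, generate, evaluate, max_iters=15):
    """Generator-evaluator iteration: critique flows back into generation."""
    feedback = None
    best = None
    for _ in range(max_iters):
        page = generate(prompt, feedback)       # build or revise the frontend
        scores, critique = evaluate(page)       # navigate, screenshot, grade
        total = sum(scores.values())
        if best is None or total > best[0]:
            best = (total, page)                # keep the strongest iteration
        # The generator's next move depends on this feedback: refine the
        # current direction if trending well, or pivot to a new aesthetic.
        feedback = {"scores": scores, "critique": critique}
    return best[1]

# Toy stubs: each iteration nudges the "page" along the critique.
def fake_generate(prompt, feedback):
    n = 0 if feedback is None else feedback["scores"]["design_quality"]
    return {"iteration": n + 1}

def fake_evaluate(page):
    return {"design_quality": page["iteration"]}, "push further"

result = run_loop("museum site", fake_generate, fake_evaluate, max_iters=5)
```

Keeping the best iteration rather than the last one reflects a pattern noted later in the post: scores generally improve, but the final iteration is not always the preferred one.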

Across runs, the evaluators assessments improved over iterations before plateauing, with headroom still remaining. Some generations refined incrementally. Others took sharp aesthetic turns between iterations.

The wording of the criteria steered the generator in ways I didn't fully anticipate. Including phrases like “the best designs are museum quality” pushed designs toward a particular visual convergence, suggesting that the prompting associated with the criteria directly shaped the character of the output.

While scores generally improved over iterations, the pattern was not always cleanly linear. Later implementations tended to be better as a whole, but I regularly saw cases where I preferred a middle iteration over the last one. Implementation complexity also tended to increase across rounds, with the generator reaching for more ambitious solutions in response to the evaluator’s feedback. Even on the first iteration, outputs were noticeably better than a baseline with no prompting at all, suggesting the criteria and associated language themselves steered the model away from generic defaults before any evaluator feedback led to further refinement.

In one notable example, I prompted the model to create a website for a Dutch art museum. By the ninth iteration, it had produced a clean, dark-themed landing page for a fictional museum. The page was visually polished but largely in line with my expectations. Then, on the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls in free-form positions, and doorway-based navigation between gallery rooms instead of scroll or click. It was the kind of creative leap that I hadn't seen before from a single-pass generation.

Scaling to full-stack coding

With these findings in hand, I applied this GAN-inspired pattern to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle, where code review and QA serve the same structural role as the design evaluator.

The architecture

In our earlier long-running harness, we had solved for coherent multi-session coding with an initializer agent, a coding agent that worked one feature at a time, and context resets between sessions. Context resets were a key unlock: the harness used Sonnet 4.5, which exhibited the “context anxiety” tendency mentioned earlier. Creating a harness that worked well across context resets was key to keeping the model on task. Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK's automatic compaction handling context growth along the way.

For this work I built on the foundation from the original harness with a three-agent system, with each agent addressing a specific gap I'd observed in prior runs. The system contained the following agent personas:

Planner: Our previous long-running harness required the user to provide a detailed spec upfront. I wanted to automate that step, so I created a planner agent that took a simple 1-4 sentence prompt and expanded it into a full product spec. I prompted it to be ambitious about scope and to stay focused on product context and high-level technical design rather than detailed technical implementation. This emphasis was due to the concern that if the planner tried to specify granular technical details upfront and got something wrong, the errors in the spec would cascade into the downstream implementation. It seemed smarter to constrain the agents on the deliverables to be produced and let them figure out the path as they worked. I also asked the planner to find opportunities to weave AI features into the product specs. (See example in the Appendix at the bottom.)

Generator: The one-feature-at-a-time approach from the earlier harness worked well for scope management. I applied a similar model here, instructing the generator to work in sprints, picking up one feature at a time from the spec. Each sprint implemented the app with a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack, and the generator was instructed to self-evaluate its work at the end of each sprint before handing off to QA. It also had git for version control.

Evaluator: Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states. It then graded each sprint against both the bugs it had found and a set of criteria modeled on the frontend experiment, adapted here to cover product depth, functionality, visual design, and code quality. Each criterion had a hard threshold, and if any one fell below it, the sprint failed and the generator got detailed feedback on what went wrong.
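The hard-threshold rule described above can be sketched as a simple gate: a single criterion below its bar, or a single open bug, fails the whole sprint. The criterion names follow the text; the threshold values and scores here are illustrative assumptions.

```python
# Per-sprint quality gate (sketch). Thresholds are invented for illustration.
THRESHOLDS = {"product_depth": 7, "functionality": 8, "visual_design": 7, "code_quality": 7}

def sprint_passes(scores: dict[str, int], bugs: list[str]) -> tuple[bool, list[str]]:
    """Fail the sprint if any criterion misses its threshold or any bug is open."""
    failures = [f"{c}: scored {scores[c]}, needs {t}"
                for c, t in THRESHOLDS.items() if scores[c] < t]
    failures += [f"bug: {b}" for b in bugs]
    return (not failures, failures)

ok, report = sprint_passes(
    {"product_depth": 8, "functionality": 6, "visual_design": 8, "code_quality": 7},
    bugs=["play mode ignores input"],
)
# One criterion below threshold plus an open bug: the sprint fails, and the
# report goes back to the generator as concrete, actionable feedback.
```

The point of the all-or-nothing gate is that the generator never gets partial credit for a sprint that is impressive in most dimensions but broken in one.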

Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what “done” looked like for that chunk of work before any code was written. This existed because the product spec was intentionally high-level, and I wanted a step to bridge the gap between user stories and testable implementation. The generator proposed what it would build and how success would be verified, and the evaluator reviewed that proposal to make sure the generator was building the right thing. The two iterated until they agreed.

Communication was handled via files: one agent would write a file, another agent would read it and respond either within that file or with a new file that the previous agent would read in turn. The generator then built against the agreed-upon contract before handing the work off to QA. This kept the work faithful to the spec without over-specifying implementation too early.
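A minimal version of that file-based protocol looks like the sketch below. The filenames and message fields are invented for illustration; the real harness's artifact format is not specified in this post.

```python
import json
import pathlib
import tempfile

# Shared working directory both agents can see.
workdir = pathlib.Path(tempfile.mkdtemp())

def write_msg(name: str, payload: dict) -> None:
    """One agent leaves a message for the other as a JSON file."""
    (workdir / name).write_text(json.dumps(payload, indent=2))

def read_msg(name: str) -> dict:
    """The other agent picks the message up by reading the file."""
    return json.loads((workdir / name).read_text())

# Generator proposes a sprint contract; evaluator reads it and replies
# in a new file that the generator will read in turn.
write_msg("contract_proposal.json",
          {"sprint": 3, "done_when": ["fill tool drags a rectangular region"]})
proposal = read_msg("contract_proposal.json")
write_msg("contract_review.json",
          {"sprint": proposal["sprint"], "approved": True, "notes": []})
```

Files make the exchange durable and inspectable: reading the proposal and review afterward is exactly how one audits what the agents agreed to.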

Running the harness

For the first version of this harness, I used Claude Opus 4.5, running user prompts against both the full harness and a single-agent system for comparison. I used Opus 4.5 since this was our best coding model when I began these experiments.

I wrote the following prompt to generate a retro video game maker:

Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode.

The table below shows the harness type, length it ran for, and the total cost.

Harness        Duration   Cost
Solo           20 min     $9
Full harness   6 hr       $200

The harness was over 20x more expensive, but the difference in output quality was immediately apparent.

I was expecting an interface where I could construct a level and its component parts (sprites, entities, tile layout) then hit play to actually play the level. I started by opening the solo run’s output, and the initial application seemed in line with those expectations.

As I clicked through, however, issues started to emerge. The layout wasted space, with fixed-height panels leaving most of the viewport empty. The workflow was rigid. Trying to populate a level prompted me to create sprites and entities first, but nothing in the UI guided me toward that sequence. More to the point, the actual game was broken. My entities appeared on screen but nothing responded to input. Digging into the code revealed that the wiring between entity definitions and the game runtime was broken, with no surface indication of where.


Initial screen when opening the app created by the solo harness.

Creating a sprite in the sprite editor made by the solo harness

Trying unsuccessfully to play the level I created

After evaluating the solo run, I turned my attention to the harness run. This run started from the same one-sentence prompt, but the planner step expanded that prompt into a 16-feature spec spread across ten sprints. It went well beyond what the solo run attempted. In addition to the core editors and play mode, the spec called for a sprite animation system, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and game export with shareable links. I gave the planner access to our frontend design skill, which it read and used to create a visual design language for the app as part of the spec. For each sprint, the generator and evaluator negotiated a contract defining the specific implementation details for the sprint, and the testable behaviors that would verify completion.

The app immediately showed more polish and smoothness than the solo run. The canvas used the full viewport, the panels were sized sensibly, and the interface had a consistent visual identity that tracked the design direction from the spec. Some of the clunkiness I'd seen in the solo run did remain—the workflow still didn't make it clear that you should build sprites and entities before trying to populate a level, and I had to figure that out by poking around. This read as a gap in the base model's product intuition rather than something the harness was designed to address, though it did suggest a place where targeted iteration inside the harness could help to further improve output quality.

Working through the editors, the new run's advantages over solo became more apparent. The sprite editor was richer and more fully featured, with cleaner tool palettes, a better color picker, and more usable zoom controls.

Because I'd asked the planner to weave AI features into its specs, the app also came with a built-in Claude integration that let me generate different parts of the game through prompting. This significantly sped up the workflow.


Initial screen: Creating a new game, in the app built with the full harness

The sprite editor felt cleaner and easier to use

Using the built-in AI feature to generate the level

Playing the game I generated

The biggest difference was in play mode. I was actually able to move my entity and play the game. The physics had some rough edges—my character jumped onto a platform but ended up overlapping with it, which felt intuitively wrong—but the core thing worked, which the solo run did not manage. After moving around a bit, I did hit some limitations with the AI’s game level construction. There was a large wall that I wasn’t able to jump past, so I was stuck. This suggested there were some common sense improvements and edge cases that the harness could handle to further refine the app.

Reading through the logs, it was clear that the evaluator kept the implementation in line with the spec. Each sprint, it walked through the sprint contract's test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. The contracts were granular—Sprint 3 alone had 27 criteria covering the level editor—and the evaluator's findings were specific enough to act on without extra investigation. The list below shows several examples of issues our evaluator identified:

  • Criterion: Rectangle fill tool allows click-drag to fill a rectangular area with selected tile. Finding: FAIL — Tool only places tiles at drag start/end points instead of filling the region. fillRectangle function exists but isn't triggered properly on mouseUp.

  • Criterion: User can select and delete placed entity spawn points. Finding: FAIL — Delete key handler at LevelEditor.tsx:892 requires both selection and selectedEntityId to be set, but clicking an entity only sets selectedEntityId. Condition should be selection || (selectedEntityId && activeLayer === entity).

  • Criterion: User can reorder animation frames via API. Finding: FAIL — PUT /frames/reorder route defined after /{frame_id} routes. FastAPI matches reorder as a frame_id integer and returns 422: unable to parse string as an integer.

Getting the evaluator to perform at this level took work. Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren't a big deal and approve the work anyway. It also tended to test superficially, rather than probing edge cases, so more subtle bugs often slipped through. The tuning loop was to read the evaluator's logs, find examples where its judgment diverged from mine, and update the QA agent's prompt to solve for those issues. It took several rounds of this development loop before the evaluator was grading in a way that I found reasonable. Even then, the harness output showed the limits of the model's QAing capabilities: small layout issues, interactions that felt unintuitive in places, and undiscovered bugs in more deeply nested features that the evaluator hadn't exercised thoroughly. There was clearly more verification headroom to capture with further tuning. But compared to the solo run, where the central feature of the application simply didn't work, the lift was obvious.

Iterating on the harness

The first set of harness results was encouraging, but it was also bulky, slow, and expensive. The logical next step was to find ways to simplify the harness without degrading its performance. This was partly common sense and partly a function of a more general principle: every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve. Our blog post Building Effective Agents frames the underlying idea as “find the simplest solution possible, and only increase complexity when needed,” and it's a pattern that shows up consistently for anyone maintaining an agent harness.

In my first attempt to simplify, I cut the harness back radically and tried a few creative new ideas, but I wasn't able to replicate the performance of the original. It also became difficult to tell which pieces of the harness design were actually load-bearing, and in what ways. Based on that experience, I moved to a more methodical approach, removing one component at a time and reviewing what impact it had on the final result.

As I was going through these iteration cycles, we also released Opus 4.6, which provided further motivation to reduce harness complexity. There was good reason to expect 4.6 would need less scaffolding than 4.5 did. From our launch blog: [Opus 4.6] plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes. It also improved substantially on long-context retrieval. These were all capabilities the harness had been built to supplement.

Removing the sprint construct

I started by removing the sprint construct entirely. The sprint structure had helped to decompose work into chunks for the model to work coherently. Given the improvements in Opus 4.6, there was good reason to believe that the model could natively handle the job without this sort of decomposition.

I kept both the planner and evaluator, as each continued to add obvious value. Without the planner, the generator under-scoped: given the raw prompt, it would start building without first speccing its work, and end up creating a less feature-rich application than the planner did.

With the sprint construct removed, I moved the evaluator to a single pass at the end of the run rather than grading per sprint. Since the model was much more capable, it changed how load-bearing the evaluator was for certain runs, with its usefulness depending on where the task sat relative to what the model could do reliably on its own. On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. On 4.6, the model's raw capability increased, so the boundary moved outward. Tasks that used to need the evaluator's check to be implemented coherently were now often within what the generator handled well on its own, and for tasks within that boundary, the evaluator became unnecessary overhead. But for the parts of the build that were still at the edge of the generator's capabilities, the evaluator continued to give real lift.

The practical implication is that the evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo.

Alongside the structural simplification, I also added prompting to improve how the harness built AI features into each app, specifically getting the generator to build a proper agent that could drive the app's own functionality through tools. That took real iteration, since the relevant knowledge is recent enough that Claude's training data covers it thinly. But with enough tuning, the generator was building agents correctly.
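An in-app agent that "drives the app through tools" boils down to exposing app actions as tool definitions and dispatching the model's tool calls against app state. The sketch below uses the general shape of an Anthropic tool-use schema, but the tool name, its fields, and the dispatcher are invented for illustration, not taken from the harness.

```python
# Hypothetical tool an in-app agent for the game maker might expose.
PLACE_TILE_TOOL = {
    "name": "place_tile",
    "description": "Place a tile from the active tileset onto the level grid.",
    "input_schema": {
        "type": "object",
        "properties": {
            "x": {"type": "integer", "description": "Grid column"},
            "y": {"type": "integer", "description": "Grid row"},
            "tile_id": {"type": "string", "description": "ID of the tile to place"},
        },
        "required": ["x", "y", "tile_id"],
    },
}

def handle_tool_call(name: str, args: dict, level: dict) -> dict:
    """Dispatch a model-issued tool call against in-app state (sketch)."""
    if name == "place_tile":
        level[(args["x"], args["y"])] = args["tile_id"]
    return level

# The model would emit a tool_use block; the app applies it to its state.
level = handle_tool_call("place_tile", {"x": 2, "y": 1, "tile_id": "grass"}, {})
```

Getting the generator to produce this pattern correctly, rather than hard-coding prompt-to-output features, is the part the post says required real prompt iteration.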

Results from the updated harness

To put the updated harness to the test, I used the following prompt to generate a Digital Audio Workstation (DAW), a music production program for composing, recording, and mixing songs:

Build a fully featured DAW in the browser using the Web Audio API.

The run was still lengthy and expensive, at about 4 hours and $124 in token costs.

Most of the time went to the builder, which ran coherently for over two hours without the sprint decomposition that Opus 4.5 had needed.

Agent Phase        Duration      Cost
Planner            4.7 min       $0.46
Build (Round 1)    2 hr 7 min    $71.08
QA (Round 1)       8.8 min       $3.24
Build (Round 2)    1 hr 2 min    $36.89
QA (Round 2)       6.8 min       $3.09
Build (Round 3)    10.9 min      $5.88
QA (Round 3)       9.6 min       $4.06
Total V2 Harness   3 hr 50 min   $124.70

As with the previous harness, the planner expanded the one-line prompt into a full spec. From the logs, I could see the generator model did a good job planning the app and the agent design, wiring the agent up, and testing it before handing off to QA.

That being said, the QA agent still caught real gaps. In its first-round feedback, it noted:

This is a strong app with excellent design fidelity, solid AI agent, and good backend. The main failure point is Feature Completeness — while the app looks impressive and the AI integration works well, several core DAW features are display-only without interactive depth: clips can't be dragged/moved on the timeline, there are no instrument UI panels (synth knobs, drum pads), and no visual effect editors (EQ curves, compressor meters). These aren't edge cases — they're the core interactions that make a DAW usable, and the spec explicitly calls for them.

In its second round feedback, it again caught several functionality gaps:

Remaining gaps: - Audio recording is still stub-only (button toggles but no mic capture) - Clip resize by edge drag and clip split not implemented - Effect visualizations are numeric sliders, not graphical (no EQ curve)

The generator was still liable to miss details or stub features when left to its own devices, and the QA still added value in catching those last mile issues for the generator to fix.

Based on the prompt, I was expecting a program where I could create melodies, harmonies, and drum patterns, arrange them into a song, and get help from an integrated agent along the way. The video below shows the result.

The app is far from a professional music production program, and the agent's song composition skills could clearly use a lot of work. Additionally, Claude can't actually hear, which made the QA feedback loop less effective with respect to musical taste.

But the final app had all the core pieces of a functional music production program: a working arrangement view, mixer, and transport running in the browser. Beyond that, I was able to put together a short song snippet entirely through prompting: the agent set the tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives for song composition were present, and the agent could drive them autonomously, using tools to create a simple production from end to end. You might say it’s not pitch-perfect yet—but it’s getting there.

What comes next

As models continue to improve, we can roughly expect them to be capable of working for longer, and on more complex tasks. In some cases, that will mean the scaffold surrounding the model matters less over time, and developers can wait for the next model and see certain problems solve themselves. On the other hand, the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.

With this in mind, there are a few lessons from this work worth carrying forward. It is always good practice to experiment with the model you're building against, read its traces on realistic problems, and tune its performance to achieve your desired outcomes. When working on more complex tasks, there is sometimes headroom from decomposing the task and applying specialized agents to each aspect of the problem. And when a new model lands, it is generally good practice to re-examine a harness, stripping away pieces that are no longer load-bearing to performance and adding new pieces to achieve greater capability that may not have been possible before.

From this work, my conviction is that the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.

Acknowledgements

Special thanks to Mike Krieger, Michael Agaby, Justin Young, Jeremy Hadfield, David Hershey, Julius Tarng, Xiaoyi Zhang, Barry Zhang, Orowa Sidker, Michael Tingley, Ibrahim Madha, Martina Long, and Canyon Robbins for their contributions to this work.

Thanks also to Jake Eaton, Alyssa Leonard, and Stef Sequeira for their help shaping the post.

Appendix

Example plan generated by planner agent.

RetroForge - 2D Retro Game Maker

Overview
RetroForge is a web-based creative studio for designing and building 2D retro-style video games. It combines the nostalgic charm of classic 8-bit and 16-bit game aesthetics with modern, intuitive editing tools—enabling anyone from hobbyist creators to indie developers to bring their game ideas to life without writing traditional code.

The platform provides four integrated creative modules: a tile-based Level Editor for designing game worlds, a pixel-art Sprite Editor for crafting visual assets, a visual Entity Behavior system for defining game logic, and an instant Playable Test Mode for real-time gameplay testing. By weaving AI assistance throughout (powered by Claude), RetroForge accelerates the creative process—helping users generate sprites, design levels, and configure behaviors through natural language interaction.

RetroForge targets creators who love retro gaming aesthetics but want modern conveniences. Whether recreating the platformers, RPGs, or action games of their childhood, or inventing entirely new experiences within retro constraints, users can prototype rapidly, iterate visually, and share their creations with others.

Features
1. Project Dashboard & Management
The Project Dashboard is the home base for all creative work in RetroForge. Users need a clear, organized way to manage their game projects—creating new ones, returning to works-in-progress, and understanding what each project contains at a glance.

User Stories: As a user, I want to:

- Create a new game project with a name and description, so that I can begin designing my game
- See all my existing projects displayed as visual cards showing the project name, last modified date, and a thumbnail preview, so that I can quickly find and continue my work
- Open any project to enter the full game editor workspace, so that I can work on my game
- Delete projects I no longer need, with a confirmation dialog to prevent accidents, so that I can keep my workspace organized
- Duplicate an existing project as a starting point for a new game, so that I can reuse my previous work

Project Data Model: Each project contains:

Project metadata (name, description, created/modified timestamps)
Canvas settings (resolution: e.g., 256x224, 320x240, or 160x144)
Tile size configuration (8x8, 16x16, or 32x32 pixels)
Color palette selection 
All associated sprites, tilesets, levels, and entity definitions

...


Get the developer newsletter

Product updates, how-tos, community spotlights, and more. Delivered monthly to your inbox.

Please provide your email address if you'd like to receive our monthly developer newsletter. You can unsubscribe at any time.
