🧠 阿头学 · 💬 Discussion Topic

Using an automated iteration loop to raise Claude Skills reliability from 70% to 90%+

By letting an AI agent autonomously loop through testing and prompt tweaks (autoresearch), an unstable skill can be optimized from a 56% success rate to 92%. The method's real value, though, lies in converting a fuzzy notion of "good" into a machine-checkable binary checklist, not in any genuine gain in intelligence.
2026-03-17

Key Takeaways

  • From subjective taste to decidable criteria. The article's real insight is not the automation itself but the decomposition of "good output" into 3–6 explicit yes/no questions. That turns optimization from guesswork into a search problem, and lets a team share judgment criteria instead of relying on individual taste.
  • Small changes with automatic rollback beat one-shot rewrites. Change one element per round, test, then keep or revert; this kind of local search naturally suits high-noise systems. Compared with "just write a better version," it is easier to trace, easier to roll back, and easier to understand why it works.
  • The changelog is worth more than the final prompt. The article stresses recording why each change was made, what changed, and whether it helped. That record is a cross-generation asset: when stronger models arrive, they can inherit this map of known pitfalls instead of starting from zero.
  • Overfitting risk: the fewer the criteria, the easier they are to game. The article mentions this but does not warn loudly enough: optimizing copy against 3–6 binary questions may simply teach the model to cram in numbers and mechanically follow templates to pass the checks, rather than to write genuinely good copy. This is a textbook instance of Goodhart's Law.
  • The generalization is too hasty, and cost analysis is missing. The article jumps from "optimizing one copywriting skill" straight to "anything you can score," without discussing token consumption, time cost, local-optimum traps, or long-term maintenance complexity. An automated loop sounds like "just walk away," but in practice it can be a hidden API bill.

Relevance to Us

  • What it means for ATou. If you have a skill or workflow that is great half the time and garbage the other half, this methodology can turn it into something reliably usable. The precondition is that you can define "good" as 3–6 objective criteria, which is itself the hardest part. Next step: before rushing to run autoresearch, spend the time writing your checklist; the checklist alone will expose your skill's real failure modes.
  • Implications for team management. The binary-checklist idea transfers directly to performance reviews, contractor management, and process standards. Replace fuzzy "1–10" ratings with objective criteria such as "does the headline include a specific number" or "does it avoid industry jargon," and you sharply reduce communication cost and subjective disputes.
  • Implications for product growth. Cold outreach, landing-page copy, and newsletter openers, all traditionally "craft" domains, can be turned into checklists, handed to agents, and batch-tested. The key is to first decompose "what counts as conversion" into detectable features, rather than deferring to whoever has the best gut feel.
  • Implications for AI products. An agent's real moat is not its ability to act but its self-evaluation loop. Without an evaluator, an agent is just an auto-executor; with a precise evaluator, it starts to approach a self-optimizing system. Whoever defines "good" most clearly builds the strongest agent.

Discussion Prompts

  • Among the Claude skills or prompts you use today, which is most prone to being great half the time and garbage the other half? If you wrote a checklist for it, how would you decompose the criteria for "good output"?
  • The article claims 3–6 questions is the sweet spot, but in your business scenario, can "good" really be compressed into that few dimensions? If not, does autoresearch break down?
  • Flip the idea of AI auto-optimizing prompts around and apply it to team management: replace subjective ratings with binary checklists. Would that work in your organization, and what resistance would you expect?



Most of your Claude skills work about 70% of the time.

The other 30% you get garbage.

You probably know this already. You built a skill, tested it a couple times, the output looked decent, and you moved on.

But that 30% failure rate is still there. And every time you use the skill and get a bad output, you're either manually fixing it or starting over.

I found a way to fix this that runs completely on autopilot, and I'm going to show you exactly how to run it yourself.

You kick it off, and the agent tests and refines the skill over and over without you touching anything.

My landing page copy skill went from passing its quality checks 56% of the time to 92%. Zero manual rewriting.

The agent just kept testing and tightening the prompt on its own.

Here's the method and the exact skill I built so you can run it on your own stuff:

P.S. If you want more AI workflows like this one delivered to your inbox every week, join 34k readers getting them free: aisolo.beehiiv.com/subscribe

Where this comes from

Andrej Karpathy (co-founder of OpenAI, former head of AI at Tesla, guy who coined “vibe coding”) released a method called autoresearch.

The idea is simple: instead of you manually improving something, you let an AI agent do it for you in a loop.

It tries a small change. Checks if the result got better. Keeps it if it did, throws it out if it didn't.

Then it does it again. And again.

He used it for machine learning code. But the method works on anything you can measure and improve.

Including the skills you've built in Claude.

I took his method and turned it into a skill that works in both Claude Code and Cowork. I just run it on any other skill in my setup.

I say "run autoresearch on my landing page skill" and it handles the whole thing.

How one loop auto-improves your skills

Think of it like this.

You have a recipe that turns out great 7 out of 10 times. The other 3 times, something's off. Maybe the sauce is bland, maybe the seasoning is wrong.

Instead of rewriting the whole recipe from scratch, you change one ingredient. You cook it 10 times with that change.

  • Did it get better? Keep the change.

  • Did it get worse? Put the old ingredient back.

Then you change the next thing. Cook 10 more times. Better or worse? Keep or revert.

After 50 rounds of this, your recipe works 9.5 out of 10 times.

That's exactly what autoresearch does to your skills.

  • The "recipe" is your skill prompt.

  • The "cooking" is running the skill.

  • The "tasting" is scoring the output.

The only thing you need to provide is the scoring criteria.
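The keep-or-revert loop above is essentially greedy hill-climbing over prompt variants. Here is a minimal sketch in Python; `mutate` and `score` are hypothetical stand-ins for the agent's edit step and the checklist grader, and the toy demo at the bottom just optimizes a number toward a target:

```python
import random

def autoresearch_loop(prompt, score, mutate, rounds=50):
    """Greedy keep-or-revert loop: try one small change per round,
    keep it only if the measured score improves."""
    best = score(prompt)
    changelog = []                            # (round, kept, best_score)
    for i in range(rounds):
        candidate = mutate(prompt)            # one small change
        cand_score = score(candidate)         # re-run the graders
        kept = cand_score > best
        if kept:                              # better: keep the change
            prompt, best = candidate, cand_score
        changelog.append((i, kept, best))     # worse: candidate is discarded
    return prompt, best, changelog

# Toy stand-ins: the "prompt" is a number, and the score peaks at 1.0.
random.seed(0)
final, best, log = autoresearch_loop(
    prompt=0.0,
    score=lambda p: -abs(p - 1.0),            # closer to 1.0 is better
    mutate=lambda p: p + random.uniform(-0.2, 0.2),
    rounds=200,
)
```

Because a change is kept only when the score improves, the best score is monotone: the loop can stall, but it can never regress, which is what makes it safe to leave running unattended.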

The checklist that tells the agent exactly what 'good' means

You give the agent a simple checklist of what "good" looks like. That's your only job in this whole process.

It takes the form of a short list of yes/no questions.

Each question checks one specific thing about the output. Pass or fail. That's it.

The agent uses this checklist to score every output, and those scores tell it whether its changes are helping or hurting.

Think of it like a teacher grading a paper with a checklist.

But instead of "rate the writing quality 1-10" (which is vague and different every time), each item on the checklist is a clear yes or no:

  • Did the student include a thesis statement? Yes or no.

  • Is every source cited? Yes or no.

  • Is it under 5 pages? Yes or no.

You can grade 100 papers with that checklist and get consistent results every time.

Same idea here. For a landing page copy skill, your checklist might look like:

  • "Does the headline include a specific number or result?" (catches vague headlines like "Grow Your Business")

  • "Is the copy free of buzzwords like 'revolutionary,' 'synergy,' 'cutting-edge,' 'next-level'?"

  • "Does the CTA use a specific verb phrase?" (catches weak CTAs like "Learn More" or "Click Here")

  • "Does the first line call out a specific pain point?" (catches generic openers like "In today's fast-paced world...")

  • "Is the total copy under 150 words?" (catches bloated pages that lose the reader)

You don't need to figure these out on your own. When you start the autoresearch, the agent walks you through it.

It asks what good looks like, helps you turn your vibes into specific yes/no questions, and even offers to pull from existing style guides if you have them.

3-6 questions is the sweet spot. More than that and the skill starts gaming the checklist (like a student who memorizes the answers without understanding the material).
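Since every rubric item is a plain yes/no predicate over the output, the whole checklist can be sketched as a dict of named checks. The regexes below are illustrative guesses at how the landing-page questions might be automated, not the skill's actual evaluator:

```python
import re

# Hypothetical binary checks mirroring the landing-page rubric above.
CHECKS = {
    # Treat the first line as the headline; require a digit in it.
    "headline_has_number": lambda copy: bool(re.search(r"\d", copy.splitlines()[0])),
    "no_buzzwords": lambda copy: not re.search(
        r"\b(revolutionary|synergy|cutting-edge|next-level)\b", copy, re.I),
    "under_150_words": lambda copy: len(copy.split()) < 150,
}

def score(copy):
    """Return the pass rate (0.0 to 1.0) plus per-check results."""
    results = {name: check(copy) for name, check in CHECKS.items()}
    return sum(results.values()) / len(results), results

rate, detail = score("Cut churn 32% in 90 days\nStart your free trial today.")
```

Each check returns a hard pass or fail, so grading the same output twice always gives the same score, which is exactly what the loop needs to judge its own changes.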

Here's how to run it

Step 1: Download the skill. Grab it here. Drop it into your skills folder in Claude Code or Cowork.

Step 2: Pick a skill to improve. Say "run autoresearch on my [skill name] skill." Pick the one that annoys you most. The one where you get a great output half the time and garbage the other half.

Step 3: The agent asks you 3 things. Which skill to optimize. What test inputs to use (like "write landing page copy for an AI productivity tool"). And what your checklist questions are.

Step 4: It runs your skill and shows you your starting score. This is the baseline. My landing page skill started at 56%. Vague headlines, buzzword soup, weak CTAs. More than half the checks were failing.

Step 5: It opens a live dashboard in your browser. Score chart going up over time. Pass/fail breakdown for each checklist question. A log of every change it tried. Auto-refreshes every 10 seconds.

Step 6: Walk away. The agent enters the loop. Analyzes what's failing. Makes one small change to the skill prompt. Tests again. Keeps the change if the score goes up, undoes it if it goes down.

Then does it again. And again. It keeps going autonomously until you stop it or it hits 95%+ three times in a row.

You can watch the dashboard or walk away entirely. It runs without you. And it saves the improved version as a separate file, so your original skill stays untouched.
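The stopping rule in step 6 ("95%+ three times in a row") reduces to a tiny predicate over the score history. A sketch, with `target` and `streak` as parameter names of my own choosing:

```python
def should_stop(scores, target=0.95, streak=3):
    """Stop once the last `streak` rounds all meet the `target` score."""
    return len(scores) >= streak and all(s >= target for s in scores[-streak:])
```

Requiring a streak rather than a single good round guards against a lucky output clearing the bar once on a noisy skill.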

What happened to my landing page skill

I ran it on my landing page copy skill. Here's what came back:

56% → 92%. 4 rounds of changes. 3 kept, 1 undone.

Here's what the agent actually changed in my skill prompt:

  • Added a specific rule for the most common failure: "Your headline must include a specific number or result. Never use vague promises like 'Transform Your Business.'"

  • Added a banned buzzwords list: "NEVER use: revolutionary, cutting-edge, synergy, next-level, game-changing, leverage, unlock, transform."

  • Added a worked example of a strong landing page section with the pain point opener and CTA highlighted, so the skill could see what good looks like instead of guessing.

  • Tried a tighter word count, undid it because the copy got too thin and the CTA suffered. (The system catches changes that seem like improvements in isolation but hurt the overall output.)

When it was done, I got:

  • The improved skill, saved separately (the original stays untouched in case you want to revert)

  • A results log showing every round's score

  • A changelog explaining every change that was tried, why the agent tried it, and whether it helped

  • A backup of my original skill in case I ever want to go back

That changelog is probably the most valuable piece. It's a complete record of what works and what doesn't for that specific skill.

When smarter models come out down the road, you hand them that changelog and they pick up right where the last agent left off.
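For a changelog that a future model can resume from, each round only needs to record what changed, why, and whether it helped. One possible record shape, with illustrative field names (the skill's actual log format may differ):

```python
import json

# One JSON line per round: easy to diff, append to, and replay later.
entry = {
    "round": 2,
    "change": "Added a banned-buzzwords list to the prompt",
    "why": "'revolutionary' and 'synergy' caused most buzzword-check failures",
    "score_before": 0.72,
    "score_after": 0.84,
    "kept": True,
}
line = json.dumps(entry)  # append this line to the changelog file
```

Keeping the rationale alongside the score delta is what turns the log from a history into a pitfall map: a later agent can see not just what was tried, but why it was tried and whether it paid off.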

This works on way more than skills

The method works on anything you can score.

  • Website speed: One person ran this on page load time. Changed one thing, measured the speed, kept or reverted. Went from 1100ms to 67ms in 67 rounds.

  • Cold outreach: Define your checklist: "Does it mention the prospect's company? Is it under 75 words? Does it end with a specific question?" Let the agent run 50 variations.

  • Newsletter intros: "Does the opener include a personal detail?" and "Is it free of cliche phrases?" Let the agent tighten your writing on autopilot.

  • Any prompt you use repeatedly

If you can score it, you can autoresearch it.

Go run it

Pick your worst-performing skill. Start the autoresearch. Come back to something that actually works.

Download the skill here.

P.S. If you want more AI workflows that help you get more customers, more attention, and more done (without working more hours), I send them to 34k readers every week for free.

Plus you get a free Claude Cowork masterclass when you join: aisolo.beehiiv.com/subscribe

📋 Discussion Archive

Discussion in progress…