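🧠 阿头学 · 💬 Discussion topics

Perplexity publishes its Agent Skill methodology, but its "best practices" are not industry axioms

The article's most valuable claim is that a Skill is essentially context engineering rather than traditional software engineering, but it also clearly packages experience from Perplexity's own architecture as a more universal methodology.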

2026-05-10

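Key points

  • A Skill is not a code module but a layered context system. The article's strongest move is its redefinition of a Skill: it is neither documentation for humans nor merely executable code, but a mechanism for feeding context to the model by cost, by timing, and by tier. This holds for most agent systems today, because what actually limits results is usually not whether the agent can call a tool but what information gets loaded, when, and how much of it.
  • The description is a routing trigger, not a feature summary. The authors stress that `Load when...` matters more than "This skill does...". This is very practical, because the success of an on-demand loading system depends first on routing precision. The controversy sits here too: if a single word can cause a routing regression, the system itself is overly sensitive to short text triggers, and no amount of careful wording by developers can hide that architectural fragility.
  • The high-value information is not the procedure but the gotchas, negative examples, and boundaries. The article repeatedly argues that the model usually already knows the routine operations and does not know where things go wrong. This is closer to real practice than most prompt-engineering advice. In long-tail tasks especially, negative lists, prohibitions, and common failure points do more for stability than hand-holding steps.
  • Every Skill is a context tax. Treating a Skill as a cost center rather than a feature list is a hard-nosed and useful stance. The tiered budgets for the index, load, and runtime layers force a team to deal seriously with token ROI, routing pollution, and global overhead instead of piling manuals into the system without limit.
  • Writing evals before the Skill is the key to engineering discipline. The article requires writing positive and negative examples, neighbor-confusion cases, and must-not-load cases first. This is right, because the biggest risk of a Skill is not incomplete content but unclear boundaries that degrade the whole system. It also underestimates the practical difficulty: building and maintaining high-quality evals, especially for open-ended tasks, is itself an expensive capability, and writing them does not by itself solve the problem.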

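How this relates to us

  • What it means for ATou and what to do next: if ATou is building any agent, knowledge base, or automated SOP, prompts should no longer be written as documentation. The next step is to adopt a "three-tier context budget" check directly: which information should stay resident globally, which should load only on demand, and which must be split out into runtime files.
  • What it means for Neta and what to do next: when Neta abstracts methodologies or systems, this article can be distilled into "modularizing capabilities is not about splitting features but about governing context cost." The next step is to review existing workflows with that lens, separate explanatory content from corrective content, and keep the latter first.
  • What it means for Uota and what to do next: if Uota cares about experience or brand style, the article shows that aesthetic judgment and style constraints can be turned into Skills, provided those judgments are genuinely stable and reusable. The next step is not to write a large style manual but to collect 5-10 real failure cases and distill the most critical negative examples and prohibitions.
  • What it means for all three and what to do next: adding a Skill is not just adding a capability; it adds another source of conflict to the global system. The most important thing to establish next is not more Skills but an admission process for Skills: anything without evals, without a clear trigger boundary, or without a stated token cost does not ship.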

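Discussion starters

  1. If changing one word in a Skill's description can break global routing, is the problem mainly in how the Skill is written or in the routing architecture itself?
  2. How reliable is the principle "delete what the model already knows" when multiple models run in parallel and models are switched frequently?
  3. As context windows keep growing and inference costs keep falling, will today's Skill design, with its extreme emphasis on saving tokens, soon be obsolete?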

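Perplexity's frontier agent products are built on know-how and domain expertise packaged as modular Agent Skills. We maintain a carefully curated library of Skills across our technical environments. These Skills include many of the general-purpose utilities that power Perplexity Computer, specialized capabilities in verticals such as finance, law, and health, and a long tail of modules that address user needs. Some Skills are invoked rarely but are critical when they are. To ensure a consistently excellent user experience, Perplexity's Agents team treats Skill quality as seriously as code quality. The intuitions and best practices needed to develop a high-quality Skill differ considerably from those needed to build traditional software. The Agents team reviews many pull requests from excellent engineers who write Skills in the course of their work, and the result is almost always a large number of comments and suggested revisions. The reason is that many patterns that are useful when writing code become antipatterns when writing Skills.

For example, take a few aphorisms from PEP 20 – The Zen of Python, and it quickly becomes clear that writing good Python code and writing good Skills are not the same thing. Of those 20 lines of wisdom, at least half are either flatly wrong or actively misleading when writing Skills. Here are five of them:

Zen of Python | Zen of Skills
Simple is better than complex | A Skill is a folder, not a file. Complexity is the feature.
Explicit is better than implicit | Activation is implicit pattern matching. Progressive disclosure.
Sparse is better than dense | Context is expensive. Every token should carry as much signal as possible.
Special cases aren't special enough to break the rules | Gotchas are the real special cases, and they are the highest-value content.
If the implementation is easy to explain, it may be a good idea | If it is easy to explain, the model already knows it. Delete it.

This guide is the document that engineers across Perplexity use when developing and reviewing Skills. We are now releasing it publicly in the hope that what we have discovered and learned along the way can benefit the broader community. Whether you are an engineer who designs production Skills in your daily work, a Computer user who wants to develop a Skill in the area you know best, or both, this guide is for you.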

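What is a Skill

When you write a Skill, you are not writing software in the traditional sense, even though Skills are now part of the main logic engine of agent systems. What you are actually building is the context that the model and its environment need. A Skill has different constraints and different design principles. If you write a Skill the way you write code, you will fail.

A Skill is at least four things at once, especially in the way we build them at Perplexity.

A Skill is a directory

A Skill is not just a single SKILL.md file. In many cases, a Skill contains several files. Under the directory named after your Skill, you might have:

  • SKILL.md: frontmatter and instructions

  • scripts/: code the agent runs instead of reinventing it on the spot

  • references/: heavy documents, loaded conditionally

  • assets/: templates, schemas, and data

  • config.json: first-run user setup

This hub-and-spoke pattern keeps a Skill focused and tight, and it lets you use the folder structure itself very creatively. Sometimes, particularly intricate Skills benefit noticeably from multiple levels of hierarchy, because that helps the model navigate. Suppose a Skill has to cover 300 topics that can be grouped into 20 subject areas. Reliably picking the right one out of 300 topics is an unsolved problem even for today's best frontier models. By comparison, having the model first narrow down to one of 20 areas and then locate one of the 15 topics within that area is a much easier choice.

Here is an example of how multilevel hierarchy provides value. In the most recent U.S. tax season, the Skill our team built for Computer's U.S. individual income tax capabilities used three levels of topical nesting. Given the complexity of tax law, this hierarchy was indispensable. In our early tests, showing the model a single folder containing all 1,945 sections of the U.S. Internal Revenue Code performed even worse than not loading the Skill at all. Organizing the information into logical subdivisions was necessary to ensure high-precision read operations.

But this hierarchy is not free. The deeper the hierarchy, the higher the cost of curating the information architecture, because you have to manage the indirection that comes with it. We designed quick reference guides, custom search utilities, and other tools that help the model locate information with as few intermediate hops as possible. In this case, doing the hard work of curation paid off: a Skill that makes the model far more capable on tax-related tasks than general tools alone.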

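A Skill is a format

A Skill is also a format. The core SKILL.md file in the root directory must contain both a name and a description. The Skill's name must exactly match the name of the directory it lives in. The name must be all lower case, must not contain spaces, and may use hyphens. The description is the routing trigger. This is a very common failure point: the description is not internal documentation that explains to people what the Skill does. It tells the model when to load the Skill. That is why you will often see "Load when" rather than "This Skill does". This matters because most implementations inject the description directly into the model context.

The frontmatter also has depends:, which lets you establish hierarchical Skill dependencies, and metadata:, which is used for reviews and evaluations. Different agent systems can even define their own frontmatter fields to suit their particular usage. An alternative is to put Skill-specific metadata in an auxiliary JSON or YAML configuration file. This is preferable when the agent system you are building needs different runtime behavior per Skill and you do not want those details to pollute the model context. Finally, a similar effect can be achieved by stripping the Skill frontmatter on read. Computer uses this approach, which allows configuration to stay in the root SKILL.md file. The parsing logic requires careful attention to detail, and if some fields are genuinely useful in the model context, you may want to strip conditionally.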

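A Skill is invocable

A Skill is invocable. The agent loads a Skill at runtime. Importantly, Skills are not always bundled into the context by default. The vast majority of agent systems unfold Skills progressively, on demand.

In the way Computer implements Skills, there are at least three tiers of context cost. The process is as follows:

  1. Computer calls load_skill(name="...")

  2. Computer copies the Skill directory into the isolated execution sandbox

  3. Computer recursively auto-loads the dependencies declared in the depends: tag

  4. Computer then strips the frontmatter, so the agent sees only the body and the additional files

Different agent systems can choose to expose Skill content to the model in different ways. For example, some systems might not expose the file hierarchy at all and instead let the model discover the structure through filesystem operations. Others might give the model a map of the entire file tree, subject to some truncation or depth limit. To keep the context clean, Computer omits the full file hierarchy from the invocation context by default, although this can be overridden per Skill.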

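A Skill is progressive

A Skill is progressive. In Computer, context cost is split into three tiers, and we incur all three at different stages:

Tier | What loads | Budget | When you pay
Index | name: description for every non-hidden Skill | ~100 tokens per Skill | Every session, every user, always
Load | Full SKILL.md body | ~5,000 tokens | When the Skill is loaded
Runtime | Files in scripts/, references/, assets/, subskills, FORMATTING.md, SPECIAL_CASES.md | Unbounded | Only when the agent reads them

Computer builds a Skill index that contains the name and description of every available Skill. The budget for this tier is roughly 100 tokens per Skill, and shorter is better. It is this tight because the cost is paid again in every session, for every user. The index is injected into the system prompt at the very start of the conversation. The model sees a set of named Skills with descriptions and uses them to decide whether to call load_skill(). The bar for getting into this index is extremely high. Your Skill must be useful enough, and the description must be extremely dense and terse, because everyone pays for it all the time.

After the agent system loads a Skill, the full SKILL.md body follows. Ideally, the body does not exceed 5,000 tokens. Even then, you want every sentence to matter, because once the Skill is loaded, the rest of the conversation carries that cost until it hits the compaction boundary. Many threads load three to five different Skills at once, which multiplies the cost. A Skill full of fluff will almost certainly degrade other Skills as well as the agent's overall capability. In short, if your Skill is loaded and does not move things in the right direction, the context it occupies is wasted.

The final tier of progressive disclosure is scripts or special cases, such as subskills or formatting requirements. This tier is the best place for unbounded conditional branching logic. The agent uses it only when needed, so the bar for what goes in here is much lower.

In the index tier, every token matters. The loaded Skill body is more relaxed, and the runtime tier is the most relaxed: it could be 20,000 tokens or zero tokens. This is the tier at which you should think about expanding the model's context progressively.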

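When do you need a Skill?

The Agents team is often asked whether a given domain or use case really needs a Skill. Most of the time, it is hard to give a definitive answer from first principles alone. The reliable approach is to take the agent without the Skill, run several hero queries, and then judge whether the agent does well enough.

When you need a Skill

Many tasks are already in distribution for trained models. You need a Skill only when you want to change the model's behavior in some specific way that cannot be achieved with one sentence in the prompt. In other words, you need a Skill when the agent would get it wrong without special context, or when you need extremely consistent behavior across runs and have to eliminate some instability.

It could also be that your knowledge is durable but not in the training data. It could be a matter of training cutoffs, of enterprise-specific workflows, or simply of taste. For example, Computer has several design-related Skills written by our head of design, Henry Modisett. Every token in those Skills exists because Henry has excellent taste in website and PDF design. Henry specifies which fonts to use and which not to use, how those fonts feel, and other matters of judgment that cannot be learned from training data alone.

When you don't need a Skill

We have seen many Skills in which an engineer simply wrote down a series of git commands to be run in order. That is unnecessary, because the model already knows how to do it. Such content makes good documentation but a poor Skill.

We have also seen Skills that merely repeat instructions already in the system prompt. That does not need a Skill either. Knowledge relevant to the vast majority of requests belongs in the global context, not in a conditionally loaded Skill.

If something changes faster than you can maintain it, it does not need a Skill either. For example, if you depend on a remote MCP endpoint whose tools or tool versions change frequently, you should not inject that content into a Skill. Otherwise it will keep drifting, and the model will make mistakes as a result.

Every Skill is a tax

Here is a useful test that you can apply to every sentence in a Skill: would the agent get this wrong without this instruction? If the sentence does not need to be there, it cannot afford to be there, because every run and every user pays for it. When you decide whether to add a new Skill, remember this tax: it consumes tokens in every session, for every user.

The following famous quote, which sounds better in French, roughly translates to "I have only made this letter longer because I have not had the time to make it shorter."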

« Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. » — Blaise Pascal, Lettres Provinciales, 1657

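Like Pascal, you must invest time in every Skill. Writing a short Skill is hard. If your Skill was easy to write, it is probably too long or should not exist at all. A good Skill is as short as it can be.

If you find yourself trying to generate an entire Skill in one shot and opening a PR in five minutes, the result will almost certainly be poor. In fact, early research has shown that if you use an LLM to write Skills, the LLM itself will probably not benefit: "Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming."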

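How to build a Skill

Put another way, when you write a Skill, you must inject your own judgment into it. Follow these steps.

Step 0: Write the evals first

Write some of the evaluation cases first. Evaluation cases can come from:

  • Real user queries: sampled from production or from an internal group you trust

  • Known failures: the agent failed because the Skill did not exist

  • Neighbor confusion: close to your domain boundary but routed to another Skill

At the very least, make sure you are testing that the Skill loads when it is needed. Ideally, you should also sample some real cases, preferably from production. You can also consider known error scenarios: perhaps you decided to write the Skill precisely because you noticed a specific failure, or perhaps you are refactoring and the boundary between two neighboring domains is not covered clearly enough by a single Skill.

Start with closely matched negative and positive examples. Negative examples are very powerful and can sometimes matter more than positive ones.

Step 1: Write the description

This is the hardest line in the whole Skill. It is a routing trigger, not documentation. To get the name and description right, you should not care about the Skill's content. You should care only about whether the Skill is loaded and injected at the right moments without off-target side effects. This is the most common failure mode. Every Skill you add risks making every other Skill slightly worse, so you must keep the risk of regression to a minimum.

Again, a bad description explains what the Skill does or why it is useful. A good description says when the agent should load the Skill. For example, say you have something for monitoring pull requests. Do not write what it does. Write what engineers say when they are anxious and want you to make sure their PR goes through, using words such as "babysit", "watch CI", or "make sure this lands".

Here is a short checklist:

  • Starts with "Load when..."

  • Stays within 50 words

  • Describes the user's intent, ideally drawn from real queries

  • Does not summarize the workflow

Real queries are what you can cover on an 80-20 basis. Usually, two or three examples are enough. Writing exactly as much as you need, and no more, is not easy.

Step 2: Write the body

Next, write the content of the Skill itself. Note that this is neither Step 0 nor Step 1.

Communicating a workflow to an LLM is completely different from communicating it to a colleague, and even from communicating it to a runtime system. When learning a new software tool, an engineer may need to read the documentation, get a walkthrough from someone experienced, and gradually learn how to use it. For an LLM, with almost any software tool that has existed for at least a year, you usually only need to mention its name and the model already has the information it needs.

When you write the body, skip the obvious. Many engineers are good at writing readme.md files that list every command in full. It is easy to slip back into that habit when writing a Skill, because it feels like writing documentation. If you do, your Skill will be bad. So do not write out a series of commands.

For example, you do not need to write, "git log # find the commit; git checkout main; git checkout -b <clean-branch>; git cherry-pick <commit>;"

You should write, "Cherry-pick the commit onto a clean branch. Resolve conflicts preserving intent. If it can't land cleanly, explain why."

Given the latter, the model will usually do better than with the former's rigid command sequence, especially when something goes wrong. Do not nail down the route or over-specify the steps, because that is fragile. Where several approaches can work, leave room for flexibility. Once more: good documentation for humans is often bad documentation for models.

Next, focus on gotchas and negative examples. This content carries a very strong signal, because it often tells the model explicitly what not to do. If you add a line every time the agent trips up, you will keep learning from repeated runs and the gotchas will grow naturally.

Finally, if some content is conditional or especially large, move it out of the hub SKILL.md and into spoke files, that is, auxiliary files that can be loaded progressively. The next section covers this in detail.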

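Step 3: Use the hierarchy

When you have scripts, reference material, or a specific tool to use, it is time to use the Skill's hierarchy:

scripts/

Deterministic logic that the agent would otherwise reinvent on every run. Give it composable code instead of making it rebuild that logic.

references/

Heavy documents loaded only when a condition is met. For example: if the API returns a non-200 status, read api-errors.md.

assets/

Output templates that the agent copies and fills in, such as report-template.md or an output schema.

config.json

First-run user setup. Ask for the Slack channel, save it, and reuse it next time.

Anything conditional that branches off from the main Skill should be split out into the folder. Remember that multilevel hierarchy also applies to especially complex Skills. For those, you need to judge carefully whether to build one monolithic Skill or to split the functionality into a set of Skills, perhaps with loading relationships established through depends:.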
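
As a rough illustration of two of the spokes above, the sketch below shows a first-run config.json that is asked for once and then reused, and a conditional rule that points at a reference file only for non-200 responses. The function names, the `slack_channel` key, the question text, and the `references/api-errors.md` path are assumptions made for this example, not part of Computer's implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional


def load_or_create_config(path: Path, ask: Callable[[str], str]) -> Dict[str, str]:
    """Reuse saved settings, or ask once on first run and save the answer."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    # Hypothetical setting: the only thing asked on first run is a Slack channel.
    config = {"slack_channel": ask("Which Slack channel should updates go to?")}
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


def reference_to_read(status_code: int) -> Optional[str]:
    """Return the reference file to load, or None when no extra context is needed."""
    return None if status_code == 200 else "references/api-errors.md"


# Usage: the second call reuses the saved answer instead of asking again.
questions_asked = []


def fake_ask(question: str) -> str:
    questions_asked.append(question)
    return "#release-updates"


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    first = load_or_create_config(config_path, fake_ask)
    second = load_or_create_config(config_path, fake_ask)
```

The point of the design is that the heavy error reference costs nothing on the common path, and the user is interrupted only on the first run.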

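Step 4: Iterate

Next, iterate heavily on a branch. Start from the main branch without the Skill, make several attempts, build your set of hero queries, and then run a large batch of evals. Anyone reviewing your Skill will appreciate a complete changeset that comes with its evaluation set. Reviewing a string of small incremental changes is hard, unless the change only adds a gotcha, so keep those to a minimum.

You will probably make many small wording changes. Changing even one small word in a description can have a very large effect on routing and can even affect other Skills. So finish this work before Step 5.

Step 5: Ship

Ship it.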

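How to maintain a Skill

After you have written a Skill, you still have to maintain it.

The gotcha flywheel

From this point on, your list of gotchas will usually keep growing or change frequently. We often see engineers submit PRs that have not been through evals, for example PRs that directly change the description. If your Skill has already been merged and you are still changing the description, you have most likely gone off track. If you change the part that decides whether this Skill is routed to, you must write evals that support the change.

Skills are usually append-mostly. The gotchas section accumulates the most value over time:

  • The agent failed at something → add a gotcha

  • The agent loaded the Skill when it should not have → tighten the description and add a negative eval

  • The agent did not load the Skill when it should have → add keywords and add a positive eval

  • The system prompt changed → check for conflicts or duplication

Finding a single failure case in internal testing or production and adding a gotcha for it is usually easy. It is essentially a negative example rather than a change to the explicit instructions, but it tells the model that a known failure mode exists here.

As you push from 80-20 coverage toward a 99.9% or even 99.99% success rate, this list of gotchas can easily keep growing. As these negative examples come in, you should mostly append to the gotchas section rather than keep lengthening the instructions or repeatedly changing the description.

Eval suites

At Perplexity, we run many different types of eval suites to check for different problems. For example, evals for Skill loading and Skill file reading check the precision and recall of Skill loading itself, as well as must-not-load cases: does the agent route to your Skill when it should? They ensure that a new Skill does not break existing boundaries.

Other evals check whether progressive loading works correctly. The agent may indeed have loaded the Skill, but did it read the auxiliary files? For example, if you have a finance Skill for financial queries, does it read the special FORMATTING.md file?

Still other Skill evals test whether in-domain tasks are actually completed end to end. We run the full agent loop and then use an LLM judge to score the result against clearly defined criteria.

Finally, it is important to run these evals across different models. Computer supports at least three families of orchestration models: GPT, Claude Opus, and Claude Sonnet. You need to run the Skill-loading and domain Skill evals with each of these orchestrators to make sure behavior does not diverge. Sonnet and GPT differ considerably in how they handle Skills.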
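
One way to look for the divergence described above is to record, per orchestrator, whether each eval query loaded the Skill and then list the queries on which the orchestrators disagree. This is a minimal sketch under assumptions: the orchestrator labels, the queries, and the recorded outcomes are made up for illustration, and a real suite would populate them from actual agent runs.

```python
from typing import Dict, List

# Hypothetical results: orchestrator label -> {query: did the Skill load?}
RoutingResults = Dict[str, Dict[str, bool]]


def find_splits(results: RoutingResults) -> List[str]:
    """Return the queries on which orchestrators disagree about loading the Skill."""
    queries = set().union(*(per_model.keys() for per_model in results.values()))
    splits = []
    for query in sorted(queries):
        # A query missing for one orchestrator shows up as None and counts as a split.
        outcomes = {per_model.get(query) for per_model in results.values()}
        if len(outcomes) > 1:
            splits.append(query)
    return splits


example_results: RoutingResults = {
    "orchestrator-a": {"format this 10-K summary": True, "what is a bond?": False},
    "orchestrator-b": {"format this 10-K summary": True, "what is a bond?": True},
    "orchestrator-c": {"format this 10-K summary": True, "what is a bond?": False},
}
splits = find_splits(example_results)
```

Each query in `splits` is a candidate for a new positive or negative eval, or for a tighter description.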

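Final thoughts and conclusions

The more Skills you build, the better you get at building Skills. If you have not yet started automating the repeatable tasks in your daily work, or making them more reproducible, you should start now.

Building Skills makes you better at building more Skills. Skills are also very strong at automating business processes. If there is something you do before every weekly standup, at the end of every sprint, or every day, week, or even quarter as an engineer, you should write a Skill and buy that time back.

Can retrospectives be automated? Can pull request reviews be automated? For any task you can do, an Agent Skill can at least take the first pass. That will save you a great deal of time.

But do not forget that Skills are not easy to write and are not always necessary. Less is more. A few more takeaways:

  1. Write evals before you write the Skill. Include negative examples and must-not-load cases for adjacent but different Skills.

  2. The description is the hardest part. Every word after "Load when..." costs attention.

  3. Gotchas are extremely high-value content. Start thin and let them grow as the agent makes mistakes.

Also remember that simply adding a new Skill, without touching an old one at all, can easily break existing Skills. Watch out for this action at a distance.

Use every tool at your disposal whenever you write and maintain Skills. If you want to go deeper, the Agent Skills website has many excellent examples, and both our internal repository and the public ecosystem contain plenty of well-designed Skills.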

Perplexity’s frontier agent products rest on a foundation of know-how and domain expertise packaged in modular Agent Skills. We maintain a carefully curated library of Skills across our technical environments. These Skills include many of the general-purpose utilities powering Perplexity Computer; vertical-specific capabilities in areas such as finance, law, and health; and a very long tail of modules for addressing user needs. Some Skills are infrequently invoked but critical when invoked. To ensure a consistently excellent user experience, Perplexity’s Agents team prioritizes Skill quality just as much as code quality. The intuitions and best practices required to develop a high-quality Skill differ significantly from those required to build traditional software. The Agents team reviews many pull requests from excellent engineers who develop Skills in the course of their work. The result is almost always numerous comments and suggestions for revision. This is because many useful patterns for writing code become antipatterns in Skill creation.

For example, if you take some of the aphorisms from PEP20 – The Zen of Python, it quickly becomes clear that writing good Python code is unlike writing good Skills. Of the 20 lines of wisdom, at least half are fully wrong or actively misleading when writing Skills. Here are five of them:

Zen of Python | Zen of Skills
Simple is better than complex | A Skill is a folder, not a file. Complexity is the feature.
Explicit is better than implicit | Activation is implicit pattern matching. Progressive disclosure.
Sparse is better than dense | Context is expensive. Maximum signal per token.
Special cases aren’t special enough to break the rules | Gotchas ARE the special cases (they're the highest-value content).
If the implementation is easy to explain, it may be a good idea | If it's easy to explain, the model already knows it. Delete it.

This guide is the document that engineers across Perplexity use when developing and reviewing Skills. We’re also releasing this guide to the public so that our discoveries and learnings can benefit the broader community. Whether you’re an engineer designing production Skills in your day-to-day work, a Computer user looking to develop your own Skill in an area you know best, or both, this guide is for you.

What is a Skill?

When you write a Skill, you aren’t writing plain old software (even though Skills are now part of the main logical engines for agent systems). Rather, you're building context for models and their environments. A Skill has different constraints and different design principles. If you write a Skill like you do code, you will fail.

A Skill is at least four things, especially in the context of how we build them at Perplexity.

A Skill is a Directory

A Skill is not just a single SKILL.md file. In many cases, a Skill includes several files. Under the directory named after your Skill, you might have:

  • SKILL.md: frontmatter and instructions

  • scripts/: code the agent runs, not reinvents

  • references/: heavy docs, loaded conditionally

  • assets/: templates, schemas, and data

  • config.json: first-run user setup

This hub-and-spoke pattern allows you to keep Skills very focused and tight, and one can use the folder structure in a very creative way. Sometimes, particularly intricate Skills benefit from multiple levels of hierarchy to help the model navigate better. Suppose a Skill requires knowledge across 300 topics, groupable into 20 subject matter areas. Reliably choosing the right topic among 300 is an unsolved challenge even for today’s best frontier models. It’s a much easier choice problem for a model to home in on one of 20 areas, and then on one of the 15 topics within that area.

As one example of how multilevel hierarchy provides value, our team employed three levels of topical nesting within the Skills powering Computer’s U.S. income tax capabilities this past tax season. This hierarchy was absolutely indispensable given the complexity of tax law: in our early tests, presenting the model with a single folder containing all 1,945 sections of the U.S. Internal Revenue Code resulted in worse performance than not loading the Skill at all. Organizing the information into logical subdivisions was essential for ensuring high-precision read operations.

Yet this hierarchy did not come free. Increasing levels of hierarchy require increasing levels of curation across the information architecture to manage the resulting indirection. We devised quick reference guides, custom search utilities, and other tools to support the model in locating information with a minimum of indirection. In this case, doing the hard work of curation ultimately produced a positive end result: a Skill that allowed models to perform tax-related tasks much more capably than using general tools alone.
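
The arithmetic behind the 300-topic example can be sketched as follows. The synthetic index, the naming scheme, and the `choose` callable (a stand-in for whatever selection the model performs) are all hypothetical; the sketch only shows that a two-level layout never asks for a choice among more than 20 options at once.

```python
from typing import Callable, Dict, List

# Hypothetical two-level index: area name -> topic names within that area.
Index = Dict[str, List[str]]


def build_index(num_areas: int = 20, topics_per_area: int = 15) -> Index:
    """Build a synthetic index with num_areas * topics_per_area topics."""
    return {
        f"area-{a:02d}": [f"area-{a:02d}/topic-{t:02d}" for t in range(topics_per_area)]
        for a in range(num_areas)
    }


def navigate(index: Index, choose: Callable[[List[str]], str]) -> str:
    """Pick an area first, then a topic inside it: two small choices."""
    area = choose(sorted(index))
    return choose(index[area])


def largest_choice_set(index: Index) -> int:
    """Largest number of options the chooser ever sees at a single step."""
    return max(len(index), max(len(topics) for topics in index.values()))


index = build_index()
total_topics = sum(len(topics) for topics in index.values())
```

With the defaults, `total_topics` is 300 while `largest_choice_set(index)` is 20, which is the trade the hierarchy buys at the cost of extra curation.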

A Skill is a Format

A Skill is a format. The core root SKILL.md file must have both a name and a description. Furthermore, the Skill's name must exactly match the name of the directory in which the Skill is located. The name must be all lower-case characters, have no spaces, and can use hyphens. The description is the routing trigger. This is a common failure point: the description is not internal documentation for what the Skill does. It amounts to instructions for the model for when to load the Skill. So, you will frequently see “Load when,” not “This Skill does.” This is important because of the way that most implementations inject the description into the model context.

Within the frontmatter, there is also “depends:”, which allows you to create hierarchical Skill dependencies, and “metadata:”, which is used for reviews and evaluations. Different agent systems can even define their own frontmatter fields, to be used in a manner specific to those systems. As an alternative, Skill-specific metadata can be packaged in an auxiliary JSON or YAML configuration file. This is desirable when building agent systems that need to facilitate different types of runtime behavior per Skill without polluting the model’s context with minutiae. Finally, similar behavior is obtainable through stripping Skill frontmatter on read. Computer employs this methodology, which allows configuration to be preserved in the root SKILL.md file. Careful attention to detail is required in the parsing logic, and one might wish to implement conditional stripping if there are certain fields that are useful to have within the model context.
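
The naming rules and the strip-on-read approach can be sketched in a few lines. This is a simplified illustration, not Computer's parser: it assumes the frontmatter sits between two `---` delimiter lines and contains only flat `key: value` pairs, whereas a real implementation would use a YAML parser. The function names and the sample Skill are hypothetical.

```python
from typing import Dict, List, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split SKILL.md text into (frontmatter fields, body without frontmatter)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return fields, body
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat everything as body


def validate_frontmatter(fields: Dict[str, str], directory_name: str) -> List[str]:
    """Check the rules stated above: name and description present, name format, name matches directory."""
    problems: List[str] = []
    name = fields.get("name", "")
    if not name:
        problems.append("missing name")
    else:
        if name != name.lower() or any(ch.isspace() for ch in name):
            problems.append("name must be lower-case with no spaces")
        if name != directory_name:
            problems.append("name must match the directory name")
    if not fields.get("description"):
        problems.append("missing description")
    return problems


SAMPLE = """---
name: pr-babysitter
description: Load when the user asks to babysit a PR or watch CI.
---
Cherry-pick the commit onto a clean branch.
"""
fields, body = split_frontmatter(SAMPLE)
```

Here `body` is what the model would see after stripping, while `fields` stays available to the agent system for routing and configuration.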

A Skill is Invocable

A Skill is invocable. The agent loads a Skill at runtime. Importantly, Skills aren’t always bundled into the context. By default, most agent systems unfold Skills progressively upon specific need.

There are at least three tiers of context costs in the way that we've implemented Skills in Computer. Here is the process:

  1. Computer calls load_skill(name="...")

  2. Computer copies the Skill directory into the isolated execution sandbox

  3. Computer recursively auto-loads dependencies in the “depends:” tag

  4. Computer then strips the frontmatter and the agent thus only sees the body and the additional files

Different agent systems can choose to expose Skill content in different ways. As an example, some systems might choose not to expose the file hierarchy at all, leaving it to the model to discover the hierarchy through filesystem operations. Other systems may choose to give the model a mapping of the entire filetree up to a certain truncation and/or depth limit. To keep context clean, Computer omits full file hierarchies from the invocation context; however, this is overridable on a per-Skill basis.
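
The recursive dependency step in the process above can be sketched with an in-memory registry. This is an assumption-laden illustration: the `Skill` dataclass, the registry dictionary, and the dependencies-first ordering are choices made for the example, and the sandbox copy and frontmatter stripping are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Skill:
    name: str
    body: str  # SKILL.md text with the frontmatter already stripped
    depends: List[str] = field(default_factory=list)


def load_skill(name: str, registry: Dict[str, Skill],
               loaded: Optional[Set[str]] = None) -> List[Skill]:
    """Return the named Skill plus its transitive dependencies, dependencies first."""
    if loaded is None:
        loaded = set()
    if name in loaded:
        return []  # already loaded; this also guards against dependency cycles
    loaded.add(name)
    skill = registry[name]
    result: List[Skill] = []
    for dep in skill.depends:
        result.extend(load_skill(dep, registry, loaded))
    result.append(skill)
    return result


registry = {
    "formatting": Skill("formatting", "Formatting rules."),
    "charts": Skill("charts", "Chart rules.", depends=["formatting"]),
    "finance": Skill("finance", "Finance rules.", depends=["formatting", "charts"]),
}
load_order = [skill.name for skill in load_skill("finance", registry)]
```

Tracking the `loaded` set means a shared dependency such as `formatting` is placed in context once, not once per Skill that declares it.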

A Skill is Progressive

Skills are progressive. In Computer, there are three different tiers of context costs, and we incur all three at various stages:

Tier | What loads | Budget | When you pay
Index | name: description for every non-hidden Skill | ~100 tokens per Skill | Every session, every user, always paid
Load | Full SKILL.md body | ~5,000 tokens | When the Skill is loaded
Runtime | Files in scripts/, references/, assets/, subskills, FORMATTING.md, SPECIAL_CASES.md | Unbounded | Only when the agent reads them

Computer builds a Skill index that has the name and the description for every available Skill. The budget for this is around 100 tokens per Skill (shorter is even better). It’s so tight because you're paying this cost in every session, for every user. This is injected into the system prompt at the very beginning of the conversation. The model has access to a bunch of named Skills and descriptions so that it can decide whether to call “load_skill()”. The bar to getting into this index is extremely high. Your Skill needs to be very useful, and the description needs to be extremely dense and terse because everyone is paying the cost all the time.

After the agent system loads the Skill, there’s the full SKILL.md body. Ideally, the body text does not exceed 5,000 tokens. Even then, you want every sentence to matter because once you load a Skill, the rest of the conversation has to pay that until you hit the compaction boundary. Many threads load anywhere between three and five different Skills, multiplying this cost. Skills with a lot of fluff will almost certainly degrade other Skills as well as overall agentic capabilities. In short, if your Skill loads and it doesn't do the right thing, that's wasted context.

The final level of progression is scripts or special cases, like subskills or formatting. This is where you want to put unbounded conditional branched logic. The agent will only use it when it needs to, meaning there's a much lower bar for what you want to put in here.

In the index, every token is important. The loaded Skill body is more relaxed, and the runtime is the most relaxed. This could be 20,000 tokens or zero tokens. This is the level at which you might think about expanding the context of the model in a progressive fashion.
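
To make the three budgets concrete, here is a back-of-the-envelope calculation. The per-Skill figures come from the table above; the library size of 200 Skills and the four loaded Skills are hypothetical numbers chosen only for the arithmetic.

```python
INDEX_TOKENS_PER_SKILL = 100   # approximate index budget per Skill
BODY_TOKENS_PER_SKILL = 5_000  # approximate upper target for one SKILL.md body


def index_cost(num_skills: int) -> int:
    """Tokens paid in every session just for listing the Skills."""
    return num_skills * INDEX_TOKENS_PER_SKILL


def loaded_cost(num_loaded: int) -> int:
    """Tokens paid once Skills are loaded, assuming each body is at its budget."""
    return num_loaded * BODY_TOKENS_PER_SKILL


def session_cost(num_skills: int, num_loaded: int, runtime_tokens: int = 0) -> int:
    """Total context spent on Skills in one session across all three tiers."""
    return index_cost(num_skills) + loaded_cost(num_loaded) + runtime_tokens


always_paid = index_cost(200)        # hypothetical library of 200 Skills
typical_thread = session_cost(200, 4)  # four Skills loaded, no runtime files read
```

Under these assumptions the index alone costs 20,000 tokens in every session, the same as four fully loaded Skill bodies, which is why the bar for entering the index is so high.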

When do you need a Skill?

The Agents team is often asked to opine on whether a Skill is truly needed for a given domain or use case. Very rarely do we have a definitive answer from first principles alone. The only way to really figure this out is to start with your agent without the Skill, run several hero queries, and then figure out whether the agent is doing a good job.

When you need a Skill

There are many tasks that are in distribution for trained models. You only need to apply a Skill if you want to change that behavior in some specific way that you can't achieve with, say, one sentence in your prompt. So, you need a Skill when the agent will get it wrong without special context, or when you need to eliminate some inconsistency or non-determinism and get extremely consistent behavior across runs.

It could be that your knowledge is durable but not in the training data. There could be cutoffs or enterprise-specific workflows, or it could be a matter of taste. For example, we have several design-related Skills in Computer written by Henry Modisett (our head of design). Every token in those Skills exists because Henry has very good taste when it comes to designing websites and PDFs. Henry specifies which fonts to use and which fonts not to use, how those fonts feel, and other matters of judgment that the model can't learn from training data alone.

When you don’t need a Skill

We see many Skills in which engineers have written a series of git commands that need to be executed in order. That’s unnecessary because the model already knows how to do that, meaning it makes for great documentation but a poor Skill.

We see examples where Skills recapitulate instructions from the system prompt. You don't need a Skill for that. Knowledge relevant for the majority of requests should be included in global context, not in a conditionally loaded Skill.

If there's something that's changing faster than you can maintain it, you don't need a Skill. For example, if you're hitting some remote MCP endpoint and its tools or the versions of those tools are changing frequently, you shouldn't inject those into a Skill. If you do, you’ll just end up with drift and the model will make mistakes.

Every Skill is a tax

Here’s a useful test you can apply to every sentence in your Skill: “Would the agent get this wrong without this instruction?” If the sentence does not need to be there, it cannot afford to be there because everyone is paying this cost every single time. When you are deciding whether to add a Skill or not, remember this tax: it costs tokens in every session, for every user.

The following famous quote, which sounds much better in French, roughly translates to “I have only made this letter longer because I have not had the time to make it shorter.”

« Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. » — Blaise Pascal, Lettres Provinciales, 1657

Just like Pascal, you need to invest time in every Skill. It is hard to write a short Skill. If your Skill is easy to write, it is probably too long or shouldn’t exist. A good Skill is as short as it can be.

If you find yourself trying to one-shot Skill generation and putting up PRs in five minutes, the results will almost certainly be subpar. In fact, early research has shown that if you're using LLMs to write Skills, the LLM will probably not benefit from it: “Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming.”

How to build a Skill

Put another way, you need to inject your opinion into any Skill that you write. Follow these steps.

Step 0: Write the Evals

Write some of the evals first. You can source evaluation cases from:

  • Real user queries: sample from production or your brain trust

  • Known failures: The agent failed because the Skill didn't exist

  • Neighbor confusion: Close to your domain boundary but routes to another Skill

At the very least, you should be making sure that you're testing that the Skill loads when needed. Ideally, you sample some of these, maybe from a production environment. You might also consider known error cases: maybe the whole reason that you set out to write the Skill is because of a specific failure you noticed or maybe you're refactoring and there's some confusion in two close domains that are covered by one Skill.

Start with similar negative and positive examples. Negative examples are extremely powerful and can matter more than positive examples.
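
A routing eval built from these three sources can be as small as a list of labeled queries plus a precision and recall computation. The sketch below is illustrative only: the `RoutingCase` fields, the Skill name `pr-babysitter`, the example queries, and the keyword-based `toy_route` function are hypothetical stand-ins for a real agent's routing decision.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class RoutingCase:
    query: str
    should_load: bool  # False for negative and neighbor-confusion cases
    source: str        # "real query", "known failure", or "neighbor confusion"


def score_routing(skill: str, cases: List[RoutingCase],
                  route: Callable[[str], Set[str]]) -> Dict[str, float]:
    """Compute precision and recall of loading `skill` over the eval cases."""
    tp = fp = fn = 0
    for case in cases:
        loaded = skill in route(case.query)
        if loaded and case.should_load:
            tp += 1
        elif loaded and not case.should_load:
            fp += 1
        elif not loaded and case.should_load:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return {"precision": precision, "recall": recall, "false_positives": fp}


def toy_route(query: str) -> Set[str]:
    """Deliberately naive keyword router, used only to exercise the scorer."""
    text = query.lower()
    hit = any(keyword in text for keyword in ("babysit", "watch ci", "pr"))
    return {"pr-babysitter"} if hit else set()


cases = [
    RoutingCase("babysit my PR until it merges", True, "real query"),
    RoutingCase("watch CI and make sure this lands", True, "real query"),
    RoutingCase("review this PR for style issues", False, "neighbor confusion"),
    RoutingCase("write release notes", False, "real query"),
]
scores = score_routing("pr-babysitter", cases, toy_route)
```

The neighbor-confusion case is the one that catches the over-eager trigger here, which illustrates why negative examples can matter more than positive ones.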

Step 1: The Description

This is the hardest line in the Skill. It’s a routing trigger, not documentation. To get the name and the description right, you don't care about the content of the Skill. You only care about whether the Skill is loaded and injected at the right points and is free of off-target side effects, which is the number one failure mode. Every time you add an additional Skill, you risk making every other Skill slightly worse, so you need to make sure that you're minimizing regression.

Again, a bad description describes what the Skill does or why it is useful. A good description says when the agent should load the Skill. For example, say you have something for monitoring pull requests. Don't write what the Skill does. Write what engineers say when they're frustrated and they want you to make sure that their PR works, like “babysit” or “watch CI” or “make sure this lands.”

Here’s a quick checklist:

  • Starts with "Load when..."

  • Target 50 words or fewer

  • Describes the user’s intent, ideally from real queries

  • Does not summarize the workflow

Real queries are what get you 80-20 coverage. Usually, two or three examples work well. It's not easy to add exactly and only as much as you need.
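The two mechanical items on the checklist can be checked automatically. The sketch below is one possible linter, assuming a helper named `lint_description` (hypothetical); the other two items, user intent and no workflow summary, still need a human reviewer and evals.

```python
def lint_description(description: str, max_words: int = 50) -> list[str]:
    """Return a list of checklist violations; an empty list means it passes."""
    problems = []
    if not description.lower().startswith("load when"):
        problems.append('does not start with "Load when"')
    word_count = len(description.split())
    if word_count > max_words:
        problems.append(f"{word_count} words, target is {max_words} or fewer")
    return problems
```

For instance, a description such as "Load when the user asks to babysit a PR, watch CI, or make sure this lands." passes both checks, while "This Skill monitors pull requests." fails the first.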

Step 2: Write the Body

Next, write the content of the Skill itself. Notice this is not Step 0 or Step 1.

Communicating workflows to an LLM is completely different to communicating workflows to a colleague, or even to your runtime system. When learning a new software tool, an engineer might need to read the documentation, get a walkthrough from someone with experience, and learn how to use the tool. Meanwhile, for almost any software tool that has been around at least a year, you just need to mention its name and the LLM has all the information it needs.

When you are writing the body, skip the obvious things. Many engineers have plenty of experience writing readme.md files that list out every command someone needs to run. It's easy to fall back into that when you're writing a Skill because it feels like you're writing documentation, but if you do that, your Skill will be garbage. So, don't write out a series of commands.

For example, you don’t need to write, “git log # find the commit; git checkout main; git checkout -b <clean-branch>; git cherry-pick <commit>;”

Instead, write, “Cherry-pick the commit onto a clean branch. Resolve conflicts preserving intent. If it can't land cleanly, explain why.”

The model will do a much better job with the latter than with the former’s overly prescriptive series of commands, especially when things go wrong. Don't railroad the model: overly prescriptive instructions are fragile. Instead, stay flexible wherever multiple approaches can work. Again, good documentation for humans is most often bad documentation for models.

Next, focus on the gotchas or negative examples. These are extremely high-signal content because they often guide the model in terms of what not to do. If you add a line every time the agent trips up, you’ll learn by running it and the gotchas will grow organically.

Lastly, if there’s any portion that's conditional or extremely heavy in content, take it out of the SKILL.md, which is the hub, and put it into one of the spokes. Put it into an accessory file that can be progressively loaded, which we’ll dive into next.

Step 3: Use the Hierarchy

Make use of the Skill hierarchy when you've got a script, references, or you’re using some specific tool:

  • scripts/: Deterministic logic the agent would reinvent every run. Give it code to compose, not reconstruct.

  • references/: Heavy docs loaded only when a condition is met, e.g. "Read api-errors.md if API returns non-200".

  • assets/: Output templates the agent copies and fills, e.g. report-template.md, output schemas.

  • config.json: First-run user setup. Ask for the Slack channel, save, and reuse next time.

For anything that's conditional or branching from the main Skill, break it out into a folder. Remember, also, that multilevel hierarchy can be used for particularly intricate Skills. For these, you’ll want to give careful thought to whether the functionality should be implemented monolithically or as a collection of Skills (perhaps with depends: based loading relationships).
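As a rough illustration of the layout, the following Python sketch scaffolds an empty Skill directory. The function name `scaffold_skill`, the `---` frontmatter delimiters, and the placeholder contents are assumptions made for the example; only the file and folder names come from the guide.

```python
from pathlib import Path


def scaffold_skill(root: Path, name: str) -> Path:
    """Create an empty hub-and-spoke Skill directory under `root`."""
    skill_dir = root / name
    # Spokes: loaded only when the agent decides to read them.
    for sub in ("scripts", "references", "assets"):
        (skill_dir / sub).mkdir(parents=True, exist_ok=True)
    # Hub: frontmatter plus a body that starts thin (assumed '---' delimiters).
    (skill_dir / "SKILL.md").write_text(
        f"---\nname: {name}\ndescription: Load when ...\n---\n\n## Gotchas\n",
        encoding="utf-8",
    )
    # First-run user setup, filled in later.
    (skill_dir / "config.json").write_text("{}\n", encoding="utf-8")
    return skill_dir
```

Starting from an almost empty SKILL.md is deliberate: the body and the gotchas list are meant to grow from observed failures rather than be written up front.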

Step 4: Iterate

Next, do a bunch of iterations on a branch. Start on the main branch with no Skill, do some iterations, build your hero query set, and run a slew of evals. Anyone reviewing your Skill code will thank you for submitting a single changeset complete with an evaluation set. Reviewing consecutive incremental changes (except a new gotcha) is very hard, so try to minimize it.

You’ll likely do many small word changes. Small word changes in descriptions can have an outsized impact on routing (including spillover effects on other Skills), so do all that work before Step 5.

Step 5: Ship

Ship it.

How to Maintain a Skill

Now that you’ve written a Skill, you have to maintain it.

The Gotchas Flywheel

From this point on, your list of gotchas tends to grow or change a lot. We often see engineers open PRs without evals that, for example, change the description. If you're changing the description after your Skill has been merged, you are off track. If you're making changes to the thing that decides whether to route to your Skill, you need to write some evals that support the changes.

Skills are append-mostly. The gotchas section accrues the most value over time:

  • Agent fails at something → Add a gotcha

  • Agent loads the Skill off target → Tighten description and add negative evals

  • Agent doesn't load the Skill when it should → Add keywords and positive evals

  • System prompt changes → Check for contention or duplication

It's easy to notice a single failure case in internal testing or in production and add a gotcha. It’s a negative example so it’s not really changing explicit guidance, but it lets the model know, “Hey, there's this known failure.”

As you move from the 80-20 to getting to a 99.9% or 99.99% success rate, it's easy to grow this gotcha list. As you see these negative examples, you should be appending mostly to the gotcha section. You shouldn't be adding longer instructions or changing the description.

Eval Suites

At Perplexity, we run many eval suites to check for different things. Skill loading and Skill file read evals check the precision and recall of Skill loading itself, along with forbidden loads. Will the agent route to your Skill when it's supposed to? These ensure new Skills don’t break existing boundaries.

There are also evals that can check for proper progressive loading. The agent might load the Skill, but does it read the accessory file or files? For example, if you have a finance Skill for finance queries, does it read the special FORMATTING.md file?

There are also evals for Skills that test for end-to-end task completion within domains. We run the full agent loop and use an LLM judge to grade the results based on a rubric of well-defined criteria.

Finally, it’s important to run these evals against different models. Computer supports at least three different orchestration model families: GPT, Claude Opus, and Claude Sonnet. You want to run your Skill loading and the domain Skills against these different agent orchestrators to ensure that you don't get different behavior. Sonnet and GPT behave quite differently when it comes to Skills.
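To make the loading metrics concrete, here is a small Python sketch that scores routing results per orchestrator model. It assumes each result is a `(should_load, did_load)` pair, treats a "forbidden load" as a false positive, and reports 1.0 for an undefined ratio; these are conventions chosen for the example, not a description of Perplexity's eval harness.

```python
def routing_metrics(results):
    """Score a list of (should_load, did_load) boolean pairs."""
    tp = sum(1 for should, did in results if should and did)
    fp = sum(1 for should, did in results if not should and did)
    fn = sum(1 for should, did in results if should and not did)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention: no loads at all
    recall = tp / (tp + fn) if tp + fn else 1.0     # convention: no positives at all
    return {"precision": precision, "recall": recall, "forbidden_loads": fp}


def metrics_by_model(results_by_model):
    """Run the same scoring for every orchestrator model under test."""
    return {model: routing_metrics(results) for model, results in results_by_model.items()}
```

Comparing the per-model dictionaries side by side is a quick way to see whether a description routes well on one orchestrator but leaks or misses on another.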

Final thoughts and takeaways

The more Skills you build, the better you will get at building them. If you're not yet using Skills to automate your day-to-day tasks or make them more reproducible, start immediately.

The act of building Skills makes you better at building more Skills, but also, they're extremely good at automating business processes. If you can describe something you do every week before your standup, at the end of every sprint, or anything that you do as an engineer on a daily, weekly, or even quarterly basis, you should be writing a Skill to buy back your time.

Can you automate postmortems? Can you review pull requests? Any task that you can do, you can at least have the first pass be an Agent Skill. It will save you significant time.

That said, remember that a Skill isn’t easy or even always necessary. Less is more. A few other takeaways:

  1. Write evals before the Skill. Include negative examples and forbidden loads for adjacent but distinct skills.

  2. The description is the hard part. "Load when..." (every word costs attention).

  3. Gotchas are extremely high-value content. Start thin, grow as the agent fails.

Remember that it is easy to break pre-existing Skills by adding a new Skill, even though you didn’t touch them (beware of action at a distance).

Use all the available tools every time you’re writing and maintaining a Skill. If you want to learn more, the Agent Skills website has plenty of good examples, and both our internal repository and the public ecosystem contain many examples of well-designed Skills.

Perplexity’s frontier agent products rest on a foundation of know-how and domain expertise packaged in modular Agent Skills. We maintain a carefully curated library of Skills across our technical environments. These Skills include many of the general-purpose utilities powering Perplexity Computer; vertical-specific capabilities in areas such as finance, law, and health; and a very long tail of modules for addressing user needs. Some Skills are infrequently invoked but critical when invoked. To ensure a consistently excellent user experience, Perplexity’s Agents team prioritizes Skill quality just as much as code quality. The intuitions and best practices required to develop a high-quality Skill differ significantly from those required to build traditional software. The Agents team reviews many pull requests from excellent engineers who develop Skills in the course of their work. The result is almost always numerous comments and suggestions for revision. This is because many useful patterns for writing code become antipatterns in Skill creation.

For example, if you take some of the aphorisms from PEP 20 – The Zen of Python, it quickly becomes clear that writing good Python code is unlike writing good Skills. Of the 20 lines of wisdom, at least half are fully wrong or actively misleading when writing Skills. Here are five of them:

Zen of Python | Zen of Skills
Simple is better than complex | A Skill is a folder, not a file. Complexity is the feature.
Explicit is better than implicit | Activation is implicit pattern matching. Progressive disclosure.
Sparse is better than dense | Context is expensive. Maximum signal per token.
Special cases aren’t special enough to break the rules | Gotchas ARE the special cases (they're the highest-value content).
If the implementation is easy to explain, it may be a good idea | If it's easy to explain, the model already knows it. Delete it.

This guide is the document that engineers across Perplexity use when developing and reviewing Skills. We’re also releasing this guide to the public so that our discoveries and learnings can benefit the broader community. Whether you’re an engineer designing production Skills in your day-to-day work, a Computer user looking to develop your own Skill in an area you know best, or both, this guide is for you.

What is a Skill?

When you write a Skill, you aren’t writing plain old software (even though Skills are now part of the main logical engines for agent systems). Rather, you're building context for models and their environments. A Skill has different constraints and different design principles. If you write a Skill like you do code, you will fail.

A Skill is at least four things, especially in the context of how we build them at Perplexity.

A Skill is a Directory

A Skill is not just a single SKILL.md file. In many cases, a Skill includes several files. Under the directory named after your Skill, you might have:

  • SKILL.md: frontmatter and instructions

  • scripts/: code the agent runs, not reinvents

  • references/: heavy docs, loaded conditionally

  • assets/: templates, schemas, and data

  • config.json: first-run user setup

This hub-and-spoke pattern allows you to keep Skills very focused and tight, and you can use the folder structure in very creative ways. Sometimes, particularly intricate Skills benefit from multiple levels of hierarchy to help the model navigate better. Suppose a Skill requires knowledge across 300 topics, groupable into 20 subject matter areas. Reliably choosing the right topic among 300 is an unsolved challenge even for today’s best frontier models. It’s a much easier choice problem for a model to home in on one of 20 areas, and then on one of the 15 topics within that area.

As one example of how multilevel hierarchy provides value, our team employed three levels of topical nesting within the Skills powering Computer’s U.S. income tax capabilities this past tax season. This hierarchy was absolutely indispensable given the complexity of tax law: in our early tests, presenting the model with a single folder containing all 1,945 sections of the U.S. Internal Revenue Code resulted in worse performance than not loading the Skill at all. Organizing the information into logical subdivisions was essential for ensuring high-precision read operations.

Yet this hierarchy did not come free. Increasing levels of hierarchy require increasing levels of curation across the information architecture to manage the resulting indirection. We devised quick reference guides, custom search utilities, and other tools to support the model in locating information with a minimum of indirection. In this case, doing the hard work of curation ultimately produced a positive end result: a Skill that allowed models to perform tax-related tasks much more capably than using general tools alone.

A Skill is a Format

A Skill is a format. The root SKILL.md file must have both a name and a description. Furthermore, the name needs to exactly match the name of the directory in which the Skill is located. The name must be all lower-case, have no spaces, and can use hyphens. The description is the routing trigger. This is a common failure point: the description is not internal documentation for what the Skill does. It amounts to instructions to the model for when to load the Skill. So, you will frequently see “Load when,” not “This Skill does.” This is important because of the way that most implementations inject the description into the model context.
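The naming rules are simple enough to check in code. This Python sketch, with a hypothetical helper `check_name`, encodes only the three constraints stated above (lower-case, no spaces, matches the directory name); a real agent system may enforce more.

```python
from pathlib import Path


def check_name(name: str, directory: str) -> list[str]:
    """Return violations of the naming rules; an empty list means the name is fine."""
    problems = []
    if name != name.lower():
        problems.append("name must be lower-case")
    if any(ch.isspace() for ch in name):
        problems.append("name must not contain spaces")
    if name != Path(directory).name:
        problems.append("name must match the directory name")
    return problems
```

A name like `pr-babysitter` living in `skills/pr-babysitter` passes, whereas `PR Babysitter` in the same directory trips all three checks.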

Within the frontmatter, there is also “depends:”, which allows you to create hierarchical Skill dependencies, and “metadata:”, which is used for reviews and evaluations. Different agent systems can even define their own frontmatter fields, to be used in a manner specific to those systems. As an alternative, Skill-specific metadata can be packaged in an auxiliary JSON or YAML configuration file. This is desirable when building agent systems that need to facilitate different types of runtime behavior per Skill without polluting the model’s context with minutiae. Finally, similar behavior is obtainable through stripping Skill frontmatter on read. Computer employs this methodology, which allows configuration to be preserved in the root SKILL.md file. Careful attention to detail is required in the parsing logic, and one might wish to implement conditional stripping if there are certain fields that are useful to have within the model context.

A Skill is Invocable

A Skill is invocable. The agent loads a Skill at runtime. Importantly, Skills aren’t always bundled into the context. By default, most agent systems unfold Skills progressively upon specific need.

There are at least three tiers of context costs in the way that we've implemented Skills in Computer (described in the next section). Here is what happens when a Skill is loaded:

  1. Computer calls load_skill(name="...")

  2. Computer copies the Skill directory into the isolated execution sandbox

  3. Computer recursively auto-loads dependencies in the “depends:” tag

  4. Computer then strips the frontmatter and the agent thus only sees the body and the additional files

Different agent systems can choose to expose Skill content in different ways. As an example, some systems might choose not to expose the file hierarchy at all, leaving it to the model to discover the hierarchy through filesystem operations. Other systems may choose to give the model a mapping of the entire filetree up to a certain truncation and/or depth limit. To keep context clean, Computer omits full file hierarchies from the invocation context; however, this is overridable on a per-Skill basis.

A Skill is Progressive

Skills are progressive. In Computer, there are three different tiers of context costs, and we incur all three at various stages:

Tier | What loads | Budget | When you pay
Index | name: description for every non-hidden Skill | ~100 tokens per Skill | Every session, every user, always paid
Load | Full SKILL.md body | ~5,000 tokens | From the moment the Skill loads until the compaction boundary
Runtime | Files in scripts/, references/, assets/, subskills, FORMATTING.md, SPECIAL_CASES.md | Unbounded | Only when the agent reads them

Computer builds a Skill index that has the name and the description for every available Skill. The budget for this is around 100 tokens per Skill (shorter is even better). It’s so tight because you're paying this cost in every session, for every user. This is injected into the system prompt at the very beginning of the conversation. The model has access to a bunch of named Skills and descriptions so that it can decide whether to call “load_skill()”. The bar to getting into this index is extremely high. Your Skill needs to be very useful, and the description needs to be extremely dense and terse because everyone is paying the cost all the time.

After the agent system loads the Skill, there’s the full SKILL.md body. Ideally, the body text does not exceed 5,000 tokens. Even then, you want every sentence to matter because once you load a Skill, the rest of the conversation has to pay that until you hit the compaction boundary. Many threads load anywhere between three and five different Skills, multiplying this cost. Skills with a lot of fluff will almost certainly degrade other Skills as well as overall agentic capabilities. In short, if your Skill loads and it doesn't do the right thing, that's wasted context.

The final level of progression is scripts or special cases, like subskills or formatting. This is where you want to put unbounded conditional branched logic. The agent will only use it when it needs to, meaning there's a much lower bar for what you want to put in here.

In the index, every token is important. The loaded Skill body is more relaxed, and the runtime tier is the most relaxed: it could be 20,000 tokens or zero tokens. This is the level at which you might think about expanding the context of the model in a progressive fashion.
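A back-of-the-envelope calculation shows why the first two tiers are budgeted so tightly. The per-Skill figures below are the approximate budgets quoted in this section; the library size of 200 Skills is a made-up number for illustration.

```python
INDEX_TOKENS_PER_SKILL = 100   # approximate index budget per Skill
BODY_TOKENS_PER_SKILL = 5_000  # approximate ceiling for a SKILL.md body


def index_tokens(num_skills: int) -> int:
    """Tokens paid in every session just for listing the available Skills."""
    return num_skills * INDEX_TOKENS_PER_SKILL


def loaded_tokens(num_loaded: int) -> int:
    """Worst-case tokens carried until compaction once Skills are loaded."""
    return num_loaded * BODY_TOKENS_PER_SKILL


if __name__ == "__main__":
    print(index_tokens(200))                    # hypothetical 200-Skill library
    print(loaded_tokens(3), loaded_tokens(5))   # threads that load three to five Skills
```

Under these assumptions the index alone costs 20,000 tokens per session, and a thread that loads three to five full-size Skills carries another 15,000 to 25,000 tokens, which is why fluff in either tier is so expensive.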

When do you need a Skill?

The Agents team is often asked to opine on whether a Skill is truly needed for a given domain or use case. Very rarely do we have a definitive answer from first principles alone. The only way to really figure this out is to start with your agent without the Skill, run several hero queries, and then figure out whether the agent is doing a good job.

When you need a Skill

There are many tasks that are in distribution for trained models. You only need a Skill if you want to change the model's behavior in some specific way that you can't with, say, one sentence in your prompt. So, you need a Skill when the agent will get it wrong without special context, or when there's some inconsistency or non-determinism in behavior that needs to be extremely consistent across runs.

It could be that your knowledge is durable but not in the training data. There could be cutoffs or enterprise specific workflows, or it could be a matter of taste. For example, we have several design-related Skills in Computer written by Henry Modisett (our head of design). The reason that every token exists in those Skills is because Henry has very good taste when it comes to designing websites and PDFs. Henry specifies which fonts to use and which fonts not to use, how those fonts feel, and other matters of judgment that the model can't learn from training data alone.

When you don’t need a Skill

We see many Skills in which engineers have written a series of git commands that need to be executed in order. That’s unnecessary because the model already knows how to do that, meaning it makes for great documentation but a poor Skill.

We see examples where Skills recapitulate instructions from the system prompt. You don't need a Skill for that. Knowledge relevant for the majority of requests should be included in global context, not in a conditionally loaded Skill.

If there's something that's changing faster than you can maintain it, you don't need a Skill. For example, if you're hitting some remote MCP endpoint and its tools or the versions of those tools are changing frequently, you shouldn't inject those into a Skill. If you do, you’ll just end up with drift and the model will make mistakes.

Every Skill is a tax

Here’s a useful test you can apply to every sentence in your Skill: “Would the agent get this wrong without this instruction?” If the sentence does not need to be there, it cannot afford to be there because everyone is paying this cost every single time. When you are deciding whether to add a Skill, remember this tax: every session, for every user, costs tokens.

The following famous quote, which sounds much better in French, roughly translates to “I have only made this letter longer because I have not had the time to make it shorter.”

« Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. » — Blaise Pascal, Lettres Provinciales, 1657

Just like Pascal, you need to invest time in every Skill. It is hard to write a short Skill. If your Skill is easy to write, it is probably too long or shouldn’t exist. A good Skill is as short as it can be.

If you find yourself trying to one-shot Skill generation and putting up PRs in five minutes, the results will almost certainly be subpar. In fact, early research has shown that if you're using LLMs to write Skills, the LLM will probably not benefit from it: “Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming.”

How to build a Skill

Put another way, you need to inject your opinion into any Skill that you write. Follow these steps.

Step 0: Write the Evals

Write some of the evals first. You can source evaluation cases from:

  • Real user queries: sample from production or your brain trust

  • Known failures: The agent failed because the Skill didn't exist

  • Neighbor confusion: Close to your domain boundary but routes to another Skill

At the very least, you should be making sure that you're testing that the Skill loads when needed. Ideally, you sample some of these, maybe from a production environment. You might also consider known error cases: maybe the whole reason that you set out to write the Skill is because of a specific failure you noticed or maybe you're refactoring and there's some confusion in two close domains that are covered by one Skill.

Start with similar negative and positive examples. Negative examples are extremely powerful and can matter more than positive examples.

Step 1: The Description

This is the hardest line in the Skill. It’s a routing trigger, not documentation. To get the name and the description right, you don't care about the content of the Skill. You only care about whether the Skill is loaded and injected at the right points and is free of off-target side effects, which is the number one failure mode. Every time you add an additional Skill, you risk making every other Skill slightly worse, so you need to make sure that you're minimizing regression.

Again, a bad description describes what the Skill does or why it is useful. A good description says when the agent should load the Skill. For example, say you have something for monitoring pull requests. Don't write what the Skill does. Write what engineers say when they're frustrated and they want you to make sure that their PR works, like “babysit” or “watch CI” or “make sure this lands.”

Here’s a quick checklist:

  • Starts with "Load when..."

  • Target 50 words or fewer

  • Describes the user’s intent, ideally from real queries

  • Does not summarize the workflow

Real queries are what you can cover in an 80-20. Usually, two or three examples work well. It's not easy to add exactly and only as much as you need.

Step 2: Write the Body

Next, write the content of the Skill itself. Notice this is not Step 0 or Step 1.

Communicating workflows to an LLM is completely different to communicating workflows to a colleague, or even to your runtime system. When learning a new software tool, an engineer might need to read the documentation, get a walkthrough from someone with experience, and learn how to use the tool. Meanwhile, for almost any software tool that has been around at least a year, you just need to mention its name and the LLM has all the information it needs.

When you are writing the body, skip the obvious things. Many engineers have plenty of experience writing readme.md files that list out every command someone needs to run. It's easy to fall back into that when you're writing a Skill because it feels like you're writing documentation, but if you do that, your Skill will be garbage. So, don't write out a series of commands.

For example, you don’t need to write, “git log # find the commit; git checkout main; git checkout -b <clean-branch>; git cherry-pick <commit>;

Instead, write, “Cherry-pick the commit onto a clean branch. Resolve conflicts preserving intent. If it can't land cleanly, explain why.”

The model will do a much better job with the latter than with the former’s overly prescriptive series of commands, especially when things go wrong. Don't railroad, or be overly prescriptive, which is fragile, and instead be flexible where multiple approaches can work. Again, good documentation for humans is most often bad documentation for models.

Next, focus on the gotchas or negative examples. These are extremely high-signal content because they often guide the model in terms of what not to do. If you add a line every time the agent trips up, you’ll learn by running it and the gotchas will grow organically.

Lastly, if there’s any portion that's conditional or extremely heavy in content, take it out of the SKILL.md, which is the hub, and put it into one of the spokes. Put it into an accessory file that can be progressively loaded, which we’ll dive into next.

Step 3: Use the Hierarchy

Make use of the Skill hierarchy when you've got a script, references, or you’re using some specific tool:

scripts/

Deterministic logic the agent would reinvent every run Give it code to compose, not reconstruct references/

Heavy docs loaded only when a condition is met"Read api-errors.md if API returns non-200" assets/

Output templates the agent copies and fillsreport-template.md, output schemas config.json

First-run user setup Ask for the Slack channel, save, and reuse next time

For anything that's conditional or branching from the main Skill, break it out into a folder. Remember, also, that multilevel hierarchy can be used for particularly intricate Skills. For these, you’ll want to give careful thought to whether the functionality should be implemented monolithically or as a collection of Skills (perhaps with depends: based loading relationships).

Step 4: Iterate

Next, do a bunch of iterations on a branch. Start on the main branch with no Skill, do some iterations, build your hero query set, and run a slew of evals. Anyone reviewing your Skill code will thank you for submitting a single changeset complete with an evaluation set. Reviewing consecutive incremental changes (except a new gotcha) is very hard, so try to minimize it.

You’ll likely do many small word changes. Small word changes in descriptions can have an outsized impact on routing (including spillover effects on other Skills), so do all that work before Step 5.

Step 5: Ship

Ship it.

How to Maintain a Skill

Now that you’ve written a Skill, you have to maintain it.

The Gotchas Flywheel

From this point on, your list of gotchas tends to grow or change a lot. We often see engineers who make PRs that are un-evaled, for example, change the description. If you're changing the description after your Skill has been merged, you are off track. If you're making changes to the thing that decides whether to route your Skill, you need to write some evals that support the changes.

Skills are append-mostly. The gotchas section accrues the most value over time:

  • Agent fails at something → Add a gotcha

  • Agent loads the Skill off target → Tighten description and add negative evals

  • Agent doesn't load the Skill when it should → Add keywords and positive evals

  • System prompt changes → Check for contention or duplication

It's easy to notice a single failure case in internal testing or in production and add a gotcha. It’s a negative example so it’s not really changing explicit guidance, but it lets the model know, “Hey, there's this known failure.”

As you move from the 80-20 stage toward a 99.9% or 99.99% success rate, it's easy to grow this gotcha list. As you see these negative examples, append mostly to the gotcha section rather than adding longer instructions or changing the description.

Eval Suites

At Perplexity, we run many eval suites to check for different things. Some cover Skill loading and Skill file reads: they measure the precision and recall of the Skill loading itself and check for forbidden loads. Will the agent load your Skill when it's supposed to? These ensure new Skills don’t break existing boundaries.
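As a rough illustration of what a loading eval computes, the sketch below scores one Skill against a set of labeled queries. It assumes a simple data shape: `LoadCase`, `score_loading`, and the field names are hypothetical, and returning 1.0 when a denominator is empty is a convention chosen here, not something the article specifies.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass(frozen=True)
class LoadCase:
    query: str
    should_load: bool        # True for positive examples of this Skill
    forbidden: bool = False  # True if loading on this query is a hard failure


def score_loading(
    cases: List[LoadCase], loaded: Set[str]
) -> Dict[str, Union[float, List[str]]]:
    """Score routing for one Skill; `loaded` holds queries where it was loaded."""
    tp = sum(1 for c in cases if c.should_load and c.query in loaded)
    fp = sum(1 for c in cases if not c.should_load and c.query in loaded)
    fn = sum(1 for c in cases if c.should_load and c.query not in loaded)
    # Assumed convention: an empty denominator scores 1.0 (nothing to get wrong).
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    violations = [c.query for c in cases if c.forbidden and c.query in loaded]
    return {
        "precision": precision,
        "recall": recall,
        "forbidden_violations": violations,
    }
```

Tracking forbidden violations separately from precision keeps a single unacceptable load from being averaged away in an otherwise healthy score.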

There are also evals that can check for proper progressive loading. The agent might load the Skill, but does it read the accessory file or files? For example, if you have a finance Skill for finance queries, does it read the special FORMATTING.md file?
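A progressive-loading check can be as simple as scanning the agent's trace for a file read that follows the Skill load. The sketch below assumes the trace is available as a list of `(event_kind, target)` pairs. The event names `load_skill` and `read_file` are made up for illustration and will differ in a real agent harness.

```python
from typing import List, Tuple

Event = Tuple[str, str]  # (event kind, target), e.g. ("read_file", "FORMATTING.md")


def read_after_load(trace: List[Event], skill: str, accessory: str) -> bool:
    """True if `accessory` was read at some point after `skill` was loaded."""
    loaded = False
    for kind, target in trace:
        if kind == "load_skill" and target == skill:
            loaded = True
        elif kind == "read_file" and target == accessory and loaded:
            return True
    return False
```

For the finance example, the eval would assert `read_after_load(trace, "finance", "FORMATTING.md")` on queries that need the special formatting.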

There are also evals for Skills that test for end-to-end task completion within domains. We run the full agent loop and use an LLM judge to grade the results based on a rubric of well-defined criteria.

Finally, it’s important to run these evals against different models. Computer supports at least three different orchestration model families: GPT, Claude Opus, and Claude Sonnet. You want to run your Skill loading and the domain Skills against these different agent orchestrators to ensure that you don't get different behavior. Sonnet and GPT behave quite differently when it comes to Skills.
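To spot model-specific behavior, it helps to line up the same eval cases across orchestrators and list only the cases where outcomes differ. This is a minimal sketch assuming each run yields a pass/fail per case. The function name `disagreements` and the model labels in the usage note are placeholders.

```python
from typing import Dict, Optional


def disagreements(
    results: Dict[str, Dict[str, bool]]
) -> Dict[str, Dict[str, Optional[bool]]]:
    """Return cases whose outcome is not identical across all models.

    `results` maps model label -> {case id -> passed}. A case missing from a
    model's run shows up as None, which also counts as a disagreement.
    """
    all_cases = set().union(*(run.keys() for run in results.values()))
    out: Dict[str, Dict[str, Optional[bool]]] = {}
    for case in sorted(all_cases):
        outcomes = {model: run.get(case) for model, run in results.items()}
        if len(set(outcomes.values())) > 1:
            out[case] = outcomes
    return out
```

For instance, `disagreements({"gpt": {"c1": True, "c2": False}, "sonnet": {"c1": True, "c2": True}})` returns only the `c2` entry, pointing at the case to investigate first.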

Final thoughts and takeaways

The more Skills you build, the better you will get at building them. If you aren't yet using Skills to automate your day-to-day tasks or make them more reproducible, start immediately.

The act of building Skills makes you better at building more Skills, but also, they're extremely good at automating business processes. If you can describe something you do every week before your standup, at the end of every sprint, or anything that you do as an engineer on a daily, weekly, or even quarterly basis, you should be writing a Skill to buy back your time.

Can you automate postmortems? Can you review pull requests? Any task that you can do, you can at least have the first pass be an Agent Skill. It will save you significant time.

That said, remember that a Skill isn’t easy to write, or even always necessary. Less is more. A few other takeaways:

  1. Write evals before the Skill. Include negative examples and forbidden loads for adjacent but distinct skills.

  2. The description is the hard part. Write it as "Load when...", and remember that every word costs attention.

  3. Gotchas are extremely high-value content. Start thin, grow as the agent fails.

Remember that adding a new Skill can easily break pre-existing Skills, even though you never touched them (beware of action at a distance).

Use all the available tools every time you’re writing and maintaining a Skill. If you want to learn more, the Agent Skills website has plenty of good examples, and both our internal repository and the public ecosystem contain many examples of well-designed Skills.
