🧠 阿头学 · 💬 Discussion

Skills are not prompts; they should be orchestrable units of action

The most valuable claim in this piece is that skills should be upgraded from static manuals into execution units invoked by context. But it packages the execution model of its own product, Slate, as an industry watershed, and the evidence clearly falls short.

2026-04-04

Key takeaways

  • Skills are execution units, not manuals. The author rightly notes that treating skills as global prompts or manual slash commands wastes context and resists automation, and instead defines a skill as a unit of action that is triggered in a specific situation, executed in an isolated context, and cleaned up afterwards. This direction is sounder than piling more rules into the system prompt.
  • It names a real agent pain point: knowing without doing. The article uses "knowledge overhang" for the gap between what a model knows how to do and what it actually chooses to do. The judgment holds up: many agent failures stem not from raw capability but from broken trigger conditions, process constraints, and default strategies.
  • Interactive skills need new runtime primitives. The analysis of why asynchronous threads struggle with user interaction is the strongest part of the piece: popups from subthreads wreck the UX, and forced escalation to the orchestrator muddles control. Elevating "interaction" from a prompt problem to a runtime problem is the right call.
  • Slate's fork design is suggestive but not proven to be the best answer. A synchronous fork, as a blocking, UI-capturing, isolated execution environment, does offer an elegant human-agent collaboration model, but the article gives no data on stability, latency, recovery, or controlled comparisons, so it remains a promising architectural hypothesis rather than a validated industry standard.
  • So-called orchestration skills are still rules, just moved up a layer. The author criticizes static skills yet proposes orchestration skills that define when to fork, when to plan, and when to QA. That is still hard-coded process, relocated from the main prompt to an orchestration layer: it fixes how skills are organized without fully solving autonomous planning by the model.

What this means for us

  • For ATou: if ATou is building agent products or workflow designs, stop treating the "skill library" as a prompt repository. Next step: redefine skills along trigger condition, permission boundary, termination condition, and cleanup, starting by packaging a few high-frequency skills as actions.
  • For Neta: this article works as a diagnostic frame for knowledge overhang. The problem may not be that the model can't, but that the system never lets it invoke the right strategy at the right moment. Next step: audit which failures in current pipelines are trigger failures rather than model failures.
  • For Uota: anyone who cares about interaction should beware subthread Q&A designs that are "buildable but unusable." Next step: design human collaboration points as foreground takeover flows instead of pushing all interaction onto background agents.
  • For investment judgment: the real moat for such companies is probably not skill count but the runtime, context isolation, permission control, and human-agent collaboration mechanisms. Due diligence should press on failure recovery, cost structure, task completion rates, and real production deployments rather than concept packaging.

Discussion prompts

1. Is "skills are actions, not prompts" a substantive breakthrough, or a renaming of existing tool-use and workflow ideas?
2. If an orchestration skill is still rule-based orchestration, what is its real advantage over a traditional state-machine workflow?
3. For tasks that need frequent human involvement, is a blocking fork the better answer, or a sign that agents are still far from genuine autonomy?


This article was mostly crossposted from our blog at https://randomlabs[.]ai/blog/skill-chaining


Introduction

Most agent systems treat skills as manually invoked static prompts. We @0xrandomlabs believe agent skills should more closely mirror the thing they are named after in humans: contextual behaviors that are dynamically used to solve tasks. A skill should be something the agent does, not something the agent reads.

In this blog, we discuss the current limitations of agent-skill implementations, our reframing, and how we decided to implement them in Slate. We also cover the constraints that led to our current design, and how skills can be set up for automation specifically in Slate.

First, we'll start with an aside on how Slate works.

A brief refresher on Slate's execution model

Knowing how Slate works will be very useful later on, so we are going to start with it here.

Slate takes incremental actions using threads.

An action is a low-scope, singular step towards a goal: "turn on the dev server", "review changes in file X for reason Y", "click through the target path on the agent".

A thread is an isolated worker with its own context window, scoped capabilities, and scoped task execution. It executes a permissioned, scoped task and returns a compressed representation of its actions. (Pssst. For the PLT people: these map relatively closely to continuations in Lisp, in that a thread can be executed partially, returns a state that represents that portion of the execution, and can be resumed later.)

To further understand how Slate works and how we tackle episodic memory, check out our technical blog at https://randomlabs.ai/blog/slate

Ok, onto the actual post.

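The thread-as-continuation idea can be sketched in a few lines. This is a toy model, not Slate's real API; every name here (`ThreadState`, `run_thread`, the action strings) is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ThreadState:
    """Compressed, resumable representation of a partial execution."""
    completed: list = field(default_factory=list)
    remaining: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        return not self.remaining

def run_thread(state: ThreadState, budget: int) -> ThreadState:
    """Execute up to `budget` actions, then hand back the resumable state."""
    for _ in range(min(budget, len(state.remaining))):
        action = state.remaining.pop(0)
        # In Slate this would be a summarized trace, not the raw action log.
        state.completed.append(action)
    return state

state = ThreadState(remaining=["turn on dev server", "review file X", "click target path"])
state = run_thread(state, budget=2)  # partial execution...
assert not state.done
state = run_thread(state, budget=2)  # ...resumed later, continuation-style
assert state.done
```

The continuation analogy is just this: the value returned by a partial run is itself enough to resume the rest of the work.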
Background

In humans, skills are learned. We usually see someone else do the task we want to perform, then we attempt to replicate the behavior enough times to internalize it as a context-conditioned behavior. We then continue to refine it and reuse it over time.

Skill learning in humans maps onto LLMs in the following two ways:

  1. Behavior cloning (of some demonstrated token sequence) during pretraining

  2. Agentic RL during post-training (teaching useful tactics and strategies to the model)

These behaviors are then elicited at test time.

However, without continual learning in LLMs, there isn't much of a pathway to elicit behaviors that the model has knowledge of but doesn't actively choose to employ.

For the purposes of this blog, we can say that all of the strategies and tactics that the LLM doesn't naturally employ are within the knowledge overhang.

Knowledge overhang is the gap between what the model chooses to do and what the model knows how to do. Skills offer a domain-specific solution to overcoming this (similar to rules) by directly injecting text that conditions the model to take actions it would not otherwise take on its own, despite knowing how to.

This is where agent skills in particular (alongside good instruction tuning) should be able to meaningfully improve performance, since they allow you to introduce instructions that pull out-of-distribution knowledge into in-distribution behavior. This phenomenon relies on in-context learning.[1]

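The mechanism being described is plain prompt assembly: latent know-how only becomes behavior when the conditioning text actually reaches the context. A minimal illustration, with a hypothetical `build_context` helper and made-up skill text:

```python
BASE_PROMPT = "You are a coding agent."
SKILL_TEXT = "When editing SQL, always wrap schema changes in a transaction."

def build_context(task: str, skills: list) -> str:
    """Assemble the prompt; skill text is injected only when provided."""
    parts = [BASE_PROMPT, *skills, f"Task: {task}"]
    return "\n\n".join(parts)

# Without the skill, the instruction never reaches the model, so the behavior
# stays latent; with it, the model is conditioned to act on what it knows.
plain = build_context("add a column to users", skills=[])
conditioned = build_context("add a column to users", skills=[SKILL_TEXT])
assert "transaction" not in plain
assert "transaction" in conditioned
```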
Rules vs. Skills

There's a pretty obvious question: why not just do this with rules?

So that everyone reading this is on the same page:

Rules are files that your agent harness forcefully loads into the context, whereas skills follow the principle of progressive disclosure (the agent is given progressively more information about the skill).

The core benefit is that, since skills are supposed to be context-specific, the main agent does not have its context immediately flooded with tens of thousands of tokens of conditional instructions that aren't always useful. The prefixed context is a precious resource when effective context lengths are still well below 200k tokens.

The expectation is that the model activates and deactivates skills as necessary, dynamically choosing the out-of-distribution context it needs in different situations. (Codex seems to be the one model that actually does this, btw.)

In practice, what happens is that the user tends to be the one activating the skills.

You'll activate skills manually for the model without any clear lifecycle management apart from activation. More importantly, people treat skills, which should arguably be a potential form of hacky continual learning, as slash commands in their terminal agent.

This is not the context-scoped behavioral modification that we actually want from skills.

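The rules-vs-skills distinction comes down to what sits in the context before anything happens. A sketch under assumed data shapes (the dict layout and token counts are illustrative, not any framework's real format):

```python
# Rules: force-loaded into every context. Skills: one-line description up
# front, full body loaded only on activation (progressive disclosure).
RULES = {"style-guide": "...2,000 tokens of always-loaded style rules..."}
SKILLS = {
    "frontend-engineering": {
        "description": "Best practices for building UI features.",
        "body": "...5,000 tokens of detailed frontend instructions...",
    },
}

def initial_context() -> list:
    """What the agent sees before any skill is activated."""
    ctx = list(RULES.values())                          # rules: always paid for
    ctx += [s["description"] for s in SKILLS.values()]  # skills: metadata only
    return ctx

def activate(skill_name: str) -> str:
    """Second disclosure step: pull in the full body on demand."""
    return SKILLS[skill_name]["body"]

ctx = initial_context()
assert all("5,000 tokens" not in c for c in ctx)  # full body not pre-loaded
assert "frontend" in activate("frontend-engineering")
```

The prefix cost of a skill is one description line instead of the whole body, which is the entire argument for skills over rules in a tight context budget.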
Modelling skills as contextual actions

The definition of a skill aligns more naturally with situationally useful behaviors that are context-scoped and don't pollute the main context after use. This is very similar to how you do less cognitive work to perform skills that you've already mastered.

Skills also seem to map relatively well to the idea of episodic memory. For example, while you are changing a tire on a car, you likely have a clear notion of which step in the sequence you are at. But that's the catch: you've learned a sequence of actions, built on other learned subroutines, which compose into this new skill called "changing a tire", which is now a new isolated context you can work in and remember as an "episode".

Previously, all those subroutines were things you were taught in isolation (or maybe in the context of changing a tire), and then you decided to apply them all in a higher-order sequence, which made you successful at the task.

Learning them entirely on your own would take a lot of work, right? But someone likely taught you either the pieces or the full sequence.

In this way, being taught a skill is similar to giving a model a skill. And being taught how to sequence skills is similar to giving the model a skill for how to sequence them.

But to reiterate, there's still a problem here.

The youtuber who made the video you learned from isn't benevolently watching over you and running a slash command to make you remember that you should follow the sequence or look that video back up. You are motivated on your own to recall what you need in order to perform the skill and apply it to the target action sequence.

So, thus emerges a better definition of what a skill should be.

Skills should be situational behaviors that are composed into larger sequences, where the goal and environment state guide the usage of said skill.

If this is an apt definition, then given what we know about Slate, Slate would be the perfect architecture for more natural skill use. Slate's threads provide the ability to dynamically provision permissions, prompts, and isolated contexts for taking scoped actions. That is essentially everything you'd want in a permission and execution model for dynamically running skills, and adding skills as a way to parameterize threads should be relatively simple.

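"Skills as a way to parameterize threads" can be sketched concretely. This is our own toy model, not Slate's API; `ThreadSpec`, `provision_thread`, and the permission names are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ThreadSpec:
    task: str
    permissions: set
    context: list

def provision_thread(task: str, skill=None) -> ThreadSpec:
    """Provision an isolated worker; an attached skill scopes one episode."""
    perms = {"read_files"}
    ctx = [f"Task: {task}"]
    if skill is not None:
        perms |= set(skill["permissions"])  # skill widens the thread's scope...
        ctx.insert(0, skill["body"])        # ...and conditions this episode only
    return ThreadSpec(task, perms, ctx)

frontend = {"body": "Frontend skill instructions.", "permissions": {"run_dev_server"}}
t = provision_thread("build the settings page", skill=frontend)
assert "run_dev_server" in t.permissions
assert t.context[0] == "Frontend skill instructions."
# When the episode ends, the ThreadSpec is simply dropped: the skill text never
# touches the main context, which only receives a compressed summary.
```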
Building dynamic skills in Slate

The first thing we tested with Slate was providing the main thread (the "orchestrator") a list of available skills, a way to view and search them, and a parameter for instantiating threads with skills directly.

Slate's architecture, in theory, enables it to activate instances of skills while taking actions. An action in this case is simply the sequence of steps that apply a skill!

Wonderful, skills are solved, everything is great, the world is forever changed, etc. etc.

Ha. No.

Our initial solution looked something like this:

We gave the main thread a skills tool with view and list subcommands on top of its orchestration abilities.

It can then directly apply something like a frontend-engineering skill to a thread that it spins up, and that thread, for that episode, will have the skill active. The skill deactivates at the termination of the episode, and the thread's context is cleaned up!

Great, so we have context-scoped skill use, and the main agent can activate a skill based on what it knows about the situations in which it should be used.

It actually worked great for things like the default Anthropic frontend skill.

However, there's a core problem: what do you do about user interaction?

There are skills that require user interaction for things like planning, decisions, etc.

Here were the options we came up with:

  1. Provide threads a tool to get a blocking popup dialog where the user can directly respond to the agent

  2. Force threads to escalate to the main agent if they need to talk to the user, and use threads to drive the main agent's user interaction

  3. Secret third thing.

Option 1: Direct single-use dialogs

We added this by providing the threads a tool similar to an ask_question tool. The model gets to send the user a message and get a response. This is somewhat similar to a permission-request dialog.

The main agent, when providing a skill that required user interaction, could be set up to provision a subagent to chat with the user.

There are two issues:

  1. The UX is awful and would be terrible for the user

  2. The subagent cannot actually have a reasonable conversation with the user

Both of these make interactive/conversational skills nearly impossible.

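The shape of the Option 1 tool is easy to pin down, and pinning it down shows the problem. A sketch with hypothetical names (the real tool schema is not published); the queue stands in for the UI dialog:

```python
import queue

user_replies = queue.Queue()

def ask_question(message: str) -> str:
    """Block the thread until the user answers exactly one message."""
    print(f"[agent] {message}")  # surfaces as a modal popup in the UI
    return user_replies.get()    # blocks: one reply per call, no dialogue state

# A one-shot round trip works fine:
user_replies.put("Use the dark theme")
answer = ask_question("Which theme should the settings page use?")
assert answer == "Use the dark theme"
```

But any real back-and-forth becomes a chain of disconnected modal popups, which is exactly the UX failure described above.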
Option 2: Forced thread escalation

This follows a more deterministic, failure-based approach. Basically, we could tell a subagent: "Hey, if you are given a task where you need to talk to the user, you actually can't, which means you should report the problem to the orchestrator and terminate the episode."

This is obviously bad, because the orchestration agent would be spawning a subagent simply for it to run into a wall and then drive the behavior of the main agent.

Not only does it burn cycles and confuse the model, it also provides an incentive for threads to drive the behavior of the orchestrator.

We have previously gone into the issues with multi-agents. This particular solution, where a thread drives the main agent, puts them at the same priority level, which means you need a consensus mechanism; it is better just avoided.

Option 3: Secret third thing

We thought really hard. Our smartest friends all got in a room and discussed exactly what we needed.

And then we realized: what if we changed the execution model?

What if not all threads were driven asynchronously and in parallel?

This line of questioning happened to lead to a solution that covers all possible interaction cases, yields an optimal UX, and maintains all the scoped-execution benefits of threads.

We updated Slate's execution model to support a new primitive: forking!

Context forking and interactive skill use

NOTE: As of April 1, 2026, this feature is in alpha and has been pushed back to ensure reliability for production.

Slate previously forced all threads to run in the background, which meant that a thread couldn't reasonably interact with the user for a skill.

We added synchronous forking so that the user can continue talking to an agent that is basically the same orchestrator agent, except with the added benefit that once the skill use is done, the isolation and episodic memory still kick in.

The way this works is that Slate can choose to spawn a fork, which then blocks the entire system. This is similar to running a synchronous function.

Forks cannot be continued the way that threads can be, and they immediately block all other actions taken by the orchestrator.

In the UI, a fork forcefully takes over the interaction under the hood, which means the UX is just a continuous chat with the agent.

Not much changes for you as a user, but it unlocks interactive skill use.

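The fork semantics above can be condensed into a toy model. None of these names come from Slate; this just illustrates "synchronous, blocking, UI-capturing, not resumable, isolated afterwards":

```python
class Orchestrator:
    def __init__(self):
        self.blocked = False
        self.log = []

    def fork(self, skill: str, interact) -> str:
        """Run a skill synchronously; `interact` stands in for the user chat."""
        self.blocked = True            # all other orchestrator actions stop
        try:
            result = interact(skill)   # user talks directly to the fork's agent
        finally:
            self.blocked = False       # fork ends; its context is discarded
        self.log.append(f"fork({skill}) -> {result}")  # episodic summary only
        return result                  # like a synchronous function's return value

orch = Orchestrator()
result = orch.fork("plan", lambda s: f"{s} approved by user")
assert result == "plan approved by user"
assert not orch.blocked
assert orch.log == ["fork(plan) -> plan approved by user"]
```

Unlike the thread sketch earlier, nothing resumable survives the call: only the returned result and the compressed log entry.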
What does all this buy you?

This buys you skill automation. By extending Slate's execution model to include forking, Slate's delegation abilities cover all the possible feature requirements for using skills as actions in threads. Since we now generically cover the space of requirements for supporting skills as subthreads, we can start to orchestrate them.

So, we're also introducing another idea: orchestration skills.

An orchestration skill is a skill that you, the user, activate as needed on the main orchestrator agent. Rather than having references to scripts and resources like a normal skill, an orchestration skill should reference... other skills!

The whole point is that you can now selectively enable the model to understand the conditional activation sequences for how to use threads, forks, and skills.

You can define something like this:

This skill allows you to correctly implement features in the codebase. Suggest this skill when the user wants to implement a new feature.

When this skill is active, you should start by forking and running the plan skill. ... Once the fork has completed the planning process, Slate begins implementing the plan. ... Once you have reviewed the code, verify the output by running the /qa skill. ... Once the /qa skill has successfully completed, you should return to the user with the results of the skill use.

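The same definition could be written down as structured data. The format below is ours, not Slate's (the post only shows the skill as prose); it makes explicit which execution mode each referenced skill would use:

```python
# Hypothetical encoding of an orchestration skill: it references other skills
# plus the mode (fork = blocking and user-interactive, thread = background).
ORCHESTRATION_SKILL = {
    "name": "implement-feature",
    "suggest_when": "the user wants to implement a new feature",
    "steps": [
        {"skill": "plan", "mode": "fork"},        # user collaborates on the plan
        {"skill": "implement", "mode": "thread"}, # background, isolated context
        {"skill": "qa", "mode": "thread"},        # verify before returning
    ],
}

def expand(orch: dict) -> list:
    """The conditional sequence the orchestrator would execute, in order."""
    return [f"{step['mode']}:{step['skill']}" for step in orch["steps"]]

assert expand(ORCHESTRATION_SKILL) == ["fork:plan", "thread:implement", "thread:qa"]
```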
Where this skill will teach the model the sequence of actions it should execute using all of your other skills.

Plus, because it is again just a skill, you can have multiple different modes of operation, with different packages of subroutines defined in each orchestration skill.

The main model can then conditionally execute these subroutines in a fairly programmatic way compared to existing harnesses.

No more manually running /review at the end of your sessions!

We're happy to announce that skill chaining is finally released.

Let us know how you end up using these new capabilities, or if you get stuck, at team@randomlabs.ai

Appendix

The benefits of forking as a primitive in Slate's execution model

Forking as a primitive also offers another benefit: as a user, you can now directly work with the agent performing the actual work. This means you can do all the same high-touch work you would do with any other agent, but only when you need it.

Although this somewhat changes Slate's execution model, we think the usability benefits are very large, especially while models still seem to have issues with full autonomy on brownfield projects.

Comparisons to an operating system

We find ourselves continuously moving towards Unix and OS terminology to describe what we are building. Slate's threads, forking, the thread permission model, viewing context as process memory, etc. all seem to follow similar patterns.

It is unclear how far this will continue, since we keep arriving at these conclusions naturally.

The only reason this section is here is that the parallel has been brought up as worth pointing out by early readers of this blog and people we run into, and it was suggested that we note it explicitly as an interesting aside.

References

  1. Kojima et al.: Large Language Models are Zero-Shot Reasoners (NeurIPS 2022). https://arxiv.org/abs/2205.11916

This article was mostly crossposted from our blog at https://randomlabs[.]ai/blog/skill-chaining

Introduction

Most agent systems treat skills as manually invoked static prompts. We @0xrandomlabs believe agent skills should more closely mirror the thing they are named after in humans: contextual behaviors that are dynamically used to solve tasks. A skill should be something the agent does not something the agent reads.

In this blog, we discuss the current limitations of agent skills implementations, our reframing, and how we instead decided to implement them in Slate. We also cover the constraints that led to our current design, and how skills can be set up for automation specifically in Slate.

First we'll start with an aside on how slate works.

A brief refresher on Slate's execution model

Knowing how Slate works will be very useful later on, so we are going to start with it here.

Slate takes incremental actions using threads.

An action is a low scope, singular step towards a goal: "Turn on the dev server", "review changes in file X for Y reason", "click through the target path on the agent"

A thread is an isolated worker with its own context window, scoped capabilities, and scoped task execution. It executes a permissioned, scoped task, and returns a compressed representation of its actions. Pssst. For the PLT people, these map relatively closely to continuations in lisp where a thread can be executed partially, returns a state that represents that portion of the execution, and can be resumed later.

To further understand how slate works and how we tackle episodic memory, check out our technical blog here

Ok, onto the actual post.

Background

In humans, skills are learned. We usually see someone else do the task we want to perform, then we attempt to replicate the behavior enough times to internalize it as a context conditioned behavior. We then continue to refine it and reuse it over time.

Skill learning in humans can be seen as mapping to LLM's in the following two ways:

  1. Behavior cloning (of some demonstrated token sequence) during pretraining

  2. Agentic RL during post-training (teaching useful tactics and strategies to the model)

These behaviors are then elicited at test time.

However, without continual learning in LLM's, there isn't much of a pathway to elicit behaviors that the model has knowledge of but doesn't actively choose to employ.

For the purposes of this blog, we can say that all of the strategies and tactics that the LLM doesn't naturally employ are within the knowledge overhang.

Knowledge overhang is the gap between what the model chooses to do and what the model knows how to do. Skills offer a domain specific solution to overcoming this (similar to rules) by directly injecting text that conditions the model to take actions it would not otherwise take on its own despite knowing how to do.

This is where agent skills in particular (alongside good instruction tuning) should be able to meaningfully improve performance since they allow you to introduce instructions that pull that out of distribution knowledge into in distribution behavior. This phenomena is reliant on in context learning.[1]

Rules v.s. Skills

There's a pretty obvious question: why not just do this with rules?

So that everyone reading this is on the same page:

Rules are files that your agent harness forcefully loads into the context, whereas skills follow the principle of progressive disclosure (the agent is given progressively more information about the skill).

The core benefit of this is that since skills are supposed to be context specific, the main agent does not have it's context immediately flooded with a tens of thousands of tokens of conditional instructions that aren't always useful. The prefixed context is a precious resource when effective context lengths are still well below 200k tokens.

The expectation is that the model activates and deactivates skills as necessary, dynamically choosing the out-of-distribution context it needs in different situations. (Codex seems to be the one model that actually does this btw.)

In practice what happens is the user tends to be the one activating the skills.

You'll activate skills manually for the model without any clear lifecycle management apart from activation. More importantly people treat skills, which should arguably be a potential form of hacky continual learning, as slash commands in their terminal agent.

This seems to not be the context scoped behavioral modification that we actually want from skills.

Modelling skills as contextual actions

The definition of a skill aligns more naturally with situationally useful behaviors that are context scoped and don't pollute the main context after use. This is very very similar to how you do less cognitive work to perform skills that you've already mastered.

Skills also seem to map relatively well to the idea of episodic memory. For example, while you are changing a tire on a car, you likely have a clear notion of which step in the sequence you are at. But that's the catch: you've learned a sequence of actions that have built on other learned subroutines which compose into this new skill called "changing a tire" which is now a new isolated context you can work in and remember as an "episode".

Previously all those subroutines were things you were taught in isolation (or maybe in the context of changing a tire) and then you decided to apply them all in a higher order sequence which made you successful at the task.

Learning them entirely on your own would take a lot of work, right? But someone likely taught you either the pieces or the full sequence.

In this way, being taught a skill is similar to giving a model a skill. And being taught how to sequence them is similar to giving the model a skill for how to sequence them.

But to reiterate, there's still a problem here.

The youtuber who made the video you learned from isn't benevolently watching over you and running a slash command to make you remember that you should follow the sequence or look that video back up. You are actually motivated on your own to recall what you need in order to perform the skill and apply it to this target action sequence.

So, thus emerges a better definition of what a skill should be.

Skills should be situational behaviors that are composed into larger sequences where the goal and environment state guide the usage of said skill.

If this is an apt definition, then given what we know about Slate, Slate would be the perfect architecture for more natural skill use. Slate's threads can dynamically provision permissions, prompts, and isolated contexts for taking scoped actions. That is essentially everything you'd want in a permission and execution model for dynamically running skills, so adding skills as a way to parameterize threads should be relatively simple.

Building dynamic skills in Slate

The first thing we tested with Slate was providing the main thread (the "orchestrator") a list of available skills, a way to view and search them, and a parameter for instantiating threads with skills directly.

Slate's architecture, in theory, enables it to activate instances of skills while taking actions. An action in this case is simply the sequence of steps that applies a skill!

Wonderful, skills are solved, everything is great, the world is forever changed etc. etc.

Ha. No.

Our initial solution looked something like this:

We gave the main thread a skills tool with the view and list subcommands on top of its orchestration abilities.

It can then directly apply something like a frontend-engineering skill to a thread that it spins up, and that thread, for that episode, will have the skill active. The skill deactivates at the termination of the episode, and the thread's context is cleaned up!

Great, so we have context scoped skill use, and the main agent can activate a skill based on what it knows about the situations in which it should be used.

It actually worked great for things like the default Anthropic frontend skill.

However, there's a core problem: what do you do about user interaction?

There are skills that require user interaction for things like planning, decisions etc.

Here were the options we came up with:

  1. Provide threads a tool to get a blocking popup dialog where the user can directly respond to the agent

  2. Force threads to escalate to the main agent if they need to talk to the user, and use threads to drive the main agent's user interaction

  3. Secret third thing.

Option 1: Direct single use dialogs

We added this by providing the threads a tool similar to an ask_question tool: the model sends the user a message and gets a response back. This is somewhat similar to a permission request dialog.
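A minimal sketch of such a blocking tool, with the dialog function injected so it can be exercised without a real UI (the name `ask_question` comes from the text; the signature is our assumption):

```python
from typing import Callable

def ask_question(question: str,
                 get_user_input: Callable[[str], str] = input) -> str:
    """Block the thread until the user answers a single popup dialog.

    `get_user_input` stands in for a real UI dialog; by default it is
    stdin, but any callable taking a prompt and returning a string works.
    """
    return get_user_input(f"[agent asks] {question}\n> ")

# Inject a canned responder instead of a real dialog:
answer = ask_question("Which auth provider should I use?",
                      get_user_input=lambda prompt: "OAuth via Google")
```

Note that each call is a one-shot exchange: the thread gets one answer per dialog, which is exactly why a real conversation is so painful through this interface.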

The main agent, when providing a skill that required user interaction, could be set up to provision a subagent to chat with the user.

There are two issues:

  1. The UX is awful for the user

  2. The subagent cannot actually reasonably have a conversation with the user

Both of these make interactive/conversational skills nearly impossible.

Option 2: Forced thread escalation

This follows a more deterministic, failure-based approach. Basically, we could tell a subagent: "Hey, if you are given a task where you need to talk to the user, you actually can't, which means you should report the problem to the orchestrator and terminate the episode."
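Sketched as a protocol (all names illustrative), the escalation path looks like this:

```python
from dataclasses import dataclass

@dataclass
class ThreadResult:
    status: str   # "ok" or "needs_user"
    payload: str

def run_subagent(task: str, needs_user: bool) -> ThreadResult:
    """Option 2: a thread that needs the user can't talk to them, so it
    burns its episode just to hand the problem back up."""
    if needs_user:
        return ThreadResult("needs_user", f"cannot ask user about: {task}")
    return ThreadResult("ok", f"done: {task}")

result = run_subagent("choose a color palette with the user", needs_user=True)
```

The orchestrator then has to interpret `needs_user` results and conduct the conversation itself, which is where the control inversion below comes from.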

This is obviously bad because the orchestration agent would be spawning a subagent simply for it to run into a wall and then drive the behavior of the main agent.

Not only does it burn cycles and confuse the model, but it also provides an incentive for threads to drive the behavior of the orchestrator.

We have previously gone into the issues with multi-agents, and this particular solution, where a thread drives the main agent, puts them at the same priority level. That means you need a consensus mechanism, and it is better just avoided.

Option 3: Secret third thing

We thought really hard. Our smartest friends all got in a room, and discussed exactly what we needed.

And then we realized: what if we changed the execution model?

What if not all threads were driven asynchronously and in parallel?

This line of questioning happened to lead to a solution that covers all possible interaction cases, yields an optimal UX, and maintains all the scoped execution benefits of threads.

We updated Slate's execution model to support a new primitive: forking!

https://randomlabs.ai/blog/slate

Context forking and interactive skill use

NOTE: AS OF APRIL 1, 2026, THIS FEATURE IS IN ALPHA AND HAS BEEN PUSHED BACK TO ENSURE RELIABILITY FOR PROD

Slate previously forced all threads to run in the background which meant that a thread couldn't reasonably interact with the user for a skill.

We added synchronous forking so that the user could continue talking to an agent that was basically the same orchestrator agent, except with the added benefit that once the skill use was done, the isolation and episodic memory would still kick in.

The way this works is that Slate can choose to spawn a fork which then blocks the entire system. This is similar to running a synchronous function.

Forks cannot be continued the way that threads can be, and they immediately block all other actions taken by the orchestrator.

In the UI, forks take over the interaction under the hood, which means the UX is just a continuous chat with the agent.

Not much changes for you as a user, but it unlocks interactive skill use.
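The fork primitive as described can be sketched as follows. The class, the copy-on-fork behavior, and the summary format are our illustration of the prose above, not Slate's actual code:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Orchestrator:
    context: list[str] = field(default_factory=list)

    def fork(self, skill: str,
             interact: Callable[[list[str]], list[str]]) -> str:
        """Run a skill in a synchronous fork: blocks the orchestrator,
        hands the UI to the fork, and returns a compressed summary."""
        # The fork starts from a copy of the parent's context...
        fork_context = list(self.context) + [f"[skill:{skill}]"]
        transcript = interact(fork_context)  # blocking interactive episode
        # ...but only a compressed summary flows back to the parent.
        summary = f"{skill} finished ({len(transcript)} turns)"
        self.context.append(summary)
        return summary

orch = Orchestrator(context=["goal: add billing page"])
summary = orch.fork("plan", interact=lambda ctx: ["q1", "a1", "q2", "a2"])
```

Unlike a background thread, nothing else runs while the fork is alive, and the fork cannot be resumed later: it is a synchronous function call, not a continuation.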

What does all this buy you?

This buys you skill automation. By extending Slate's execution model to include forking, Slate's delegation abilities cover all the feature requirements for using skills as actions in threads. Since we now generically cover the space of requirements for supporting skills as subthreads, we can start to orchestrate them.

So, we're also introducing another idea: orchestration skills

An orchestration skill is a skill that you, the user, activate as needed on the main orchestrator agent. Rather than having references to scripts and resources like a normal skill, an orchestration skill should reference... other skills!

The whole point is now you can selectively enable the model to understand the conditional activation sequences for how to use threads, forks, and skills.

You can define something like this (the skill's description, followed by its body):

    This skill allows you to correctly implement features in the codebase. Suggest this skill when the user wants to implement a new feature.

    When this skill is active, you should start by forking and running the plan skill. ... Once the fork has completed the planning process Slate begins implementing the plan. ... Once you have reviewed the code, verify the output by running the /qa skill. ... Once the /qa skill has successfully completed, you should return to the user with the results of the skill use.

https://arxiv.org/abs/2205.11916

This orchestration skill teaches the model the sequence of actions it should execute using all of your other skills.

Plus, because it is again just a skill, you can have multiple different modes of operation with different packages of subroutines defined in the orchestration skill.

The main model can then go ahead and conditionally execute these subroutines in a fairly programmatic way compared to existing harnesses.
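One way to picture an orchestration skill is as a declarative sequence the orchestrator interprets. The step kinds mirror the example above (fork the plan skill, implement, then run qa); the schema and the `execute` interpreter are entirely our invention:

```python
from typing import Callable

# Hypothetical orchestration skill: it references other skills, not scripts.
ORCHESTRATION_SKILL = {
    "name": "implement-feature",
    "trigger": "user wants to implement a new feature",
    "steps": [
        {"kind": "fork",   "skill": "plan"},       # interactive, blocking
        {"kind": "thread", "skill": "implement"},  # background episode
        {"kind": "thread", "skill": "qa"},         # verify before reporting
    ],
}

def execute(skill: dict,
            run_fork: Callable[[str], str],
            run_thread: Callable[[str], str]) -> list[str]:
    """Walk the step list, dispatching each skill to a fork or a thread."""
    results = []
    for step in skill["steps"]:
        runner = run_fork if step["kind"] == "fork" else run_thread
        results.append(runner(step["skill"]))
    return results

log = execute(ORCHESTRATION_SKILL,
              run_fork=lambda s: f"fork:{s}",
              run_thread=lambda s: f"thread:{s}")
```

In practice the model conditions on the skill text rather than running a literal interpreter like this, but the sketch shows why the result is more programmatic than ad hoc slash commands.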

No more manually running /review at the end of your sessions!

We're happy to announce that skill chaining is finally released.

Let us know how you end up using these new capabilities or if you get stuck at team@randomlabs.ai

Appendix

The benefits of forking as a primitive in slate's execution model

Forking as a primitive also offers another benefit: as a user you can now directly work with the agent performing the actual work. What this means is you can do all the same high-touch work you would do with any other agent, but only when you need it.

Although this somewhat changes Slate's execution model, we think that the usability benefits here are very large especially while models seem to still have issues with full autonomy on brownfield projects.

Comparisons to an operating system

We find ourselves continuously moving toward Unix and OS terminology to describe what we are building. Slate's threads, forking, the thread permission model, viewing context as process memory, etc. all seem to follow similar patterns.

It is unclear how far this will go, since we are arriving at these conclusions naturally rather than by design.

This section exists only because early readers of this blog, and people we run into, kept pointing out the parallel, and suggested we take note of it explicitly as an interesting aside.

References

  1. Kojima et al.: Large Language Models are Zero-Shot Reasoners (NeurIPS 2022)
