
Agent skills must graduate from "written well once" to "continuous self-iteration"

The article's thesis: static SKILL.md files will sooner or later rot, so skills must be managed with an observable, auditable self-improvement loop. The author, however, is plainly optimistic about, and even glosses over, evaluation difficulty and system cost.
2026-03-15

Core takeaways

  • Static skills inevitably degrade. As models update, codebases change, and the distribution of user tasks drifts, a fixed prompt/skill gradually stops working. This "prompt half-life" problem is unavoidable in any large-scale agent system.
  • Observability is the precondition for self-improvement. Only when each skill invocation's task type, success/failure, errors, and feedback are continuously recorded do failures turn from invisible bugs and vague "it feels worse" impressions into objects the system can reason about and attribute.
  • Skills must become evolvable components. By storing skills and run logs in a graph and closing an observe → inspect → amend → evaluate/rollback loop, a skill is no longer a static file but a living component with version history and documented rationale.
  • Self-improvement must be auditable. An amendment cannot go "edit the prompt, ship to production"; it needs evaluation, metric comparison, and a rollback mechanism, turning skill evolution into a structured, traceable engineering process.
  • This is unmistakably productized PR. The article steers a real problem toward cognee-skills as the answer, leans on proprietary concepts, and attaches PyPI/GitHub/Discord links, which signals that the author is promoting a "graph plus self-evolving framework" technical agenda.

Relevance to us

1. For ATou: upgrade from "writing good prompts" to "operating the skill lifecycle". The article's implication: merely collecting SKILL.md files into a library is not enough; you need to design the full process of how skills age, how their decay gets detected, and how they get fixed. Next steps:

  • In your own agent system, add a minimal Observe first: for each skill invocation, record the input, the selected skill, the result, and a manual rating;
  • Then, based on that data, run a manual Inspect & Amend with simple rules to go through the loop once before considering automation.

2. For Neta: when evaluating a product, check whether it has a "skill evolution layer". The article's reminder: to tell whether an agent/automation platform is a toy, do not just count tool integrations; look at:

  • whether it has run data, failure attribution, version history, and a rollback mechanism;
  • whether it has an automatic or semi-automatic improvement process for skills.

Next, when reviewing projects or targets, proactively ask: how do the skills/strategies get better over time, and are there concrete metrics or A/B data?

3. For Uota: port the "self-improvement loop" to team and SOP management. The skill loop in the article is really a general "process self-iteration model": record execution → review failures → amend the SOP → evaluate on a limited rollout → keep the versions that work. For your organization, process, and operations work, it is a framework you can lift directly. Next steps:

  • Pick one key SOP (say, the customer-response process) and add structured logging first (input scenario, outcome, failure cause);
  • Hold a monthly Inspect session to analyze the failure samples together, then turn them into explicit Amend actions and an experiment plan.

4. For Neta/ATou: recognize the PR nature and weigh adoption costs carefully. This is standard technical PR: the direction is right, but the implementation details and costs are heavily compressed. Before adopting cognee or a similar framework, work out for yourself:

  • whether the token cost, storage, and latency are acceptable;
  • whether automatic evaluation is trustworthy, and whether a human must stay in the loop;
  • whether to pilot a handful of key skills first, rather than auto-amending everything from day one.

Discussion starters

1. In our own systems or business, which "skills/SOPs" are already silently degrading, with no observation or alerting in place?

2. For "automatically amending skills", what is the minimum governance we can accept: which human reviews, rollbacks, and metric validations must be present before it counts as safe?

3. With limited resources, which small subset of skills/processes should get the "self-improvement loop" first, so that we capture the most value without being dragged under by system complexity?


Not just agents with skills, but agents with skills that can improve over time

It seems that “SKILL.md” is here to stay; however, we haven’t really solved the most fundamental problem around these skill files:

Skills are usually static, while the environment around them is not!

A skill that worked a few weeks ago can quietly start failing when the codebase changes, when the model behaves differently, or when the kinds of tasks users ask for shift over time. In most systems, those failures stay invisible until someone notices the output has gotten worse, or until the skill fails completely.

The missing piece for making the skills folder actually useful is to start treating skills as living system components, not fixed prompt files.

And this is exactly the idea behind cognee-skills.

Not just how to store skills better or route them better, but how to make them improve when they fail or underperform!

Until now, skills have mostly come down to three steps:

  1. writing a prompt

  2. saving it in a folder

  3. calling it whenever needed

This works surprisingly well, but unfortunately only for demos… After a certain point, we start hitting the same wall:

  • One skill gets selected too often

  • Another looks good but fails in practice

  • One individual instruction keeps failing

  • A tool call breaks because the environment has changed

And the worst part of all is that no one knows whether the issue lies in the routing, the instructions, or the tool call itself, which forces manual maintenance and inspection. What this implementation achieves is to close the whole loop, leading to skills that can self-improve over time.

But let’s also give a brief overview of what is happening under the hood.

1. Skill ingestion

Right now your skills folder looks something like this:

my_skills/
  summarize/
    SKILL.md
  bug-triage/
    SKILL.md
  code-review/
    SKILL.md

Last week, we showed that with cognee we can give everything a clearer structure: not just because it looks nicer, but because it makes searching much more effective. We can also enrich the different fields with semantic meaning, task patterns, summaries, and relationships, which helps the system understand and route information more intelligently. All of this is stored using cognee’s “Custom DataPoint”.
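As a rough sketch of what ingestion could produce (plain Python; `SkillRecord` and its fields are illustrative stand-ins, not cognee's actual `DataPoint` API):

```python
from dataclasses import dataclass, field

@dataclass
class SkillRecord:
    """Illustrative stand-in for a cognee-style DataPoint: one node per skill."""
    name: str
    instructions: str
    task_patterns: list = field(default_factory=list)  # enriched metadata
    summary: str = ""

def ingest_skills(skill_files: dict) -> dict:
    """Turn a {folder_name: SKILL.md text} mapping into structured records."""
    records = {}
    for name, text in skill_files.items():
        records[name] = SkillRecord(
            name=name,
            instructions=text,
            # Naive summary heuristic for the sketch: take the first line.
            summary=text.splitlines()[0] if text else "",
        )
    return records

skills = ingest_skills({
    "summarize": "Summarize the given document in five bullet points.",
    "bug-triage": "Classify the bug report by severity and component.",
})
```

In a real setup these records would also carry the relationships that make graph-based routing possible; here they only show the shape of the structured data.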

Here is a small visualization of what your skills could look like:

Link to the dynamic graph view: https://cognee-graph-skills.vercel.app/

2. Observe

A skill cannot improve if the system has no memory of what happened when it ran. For that reason, after each skill execution we store enough data to know:

  • What task was attempted

  • Which skill was selected

  • Whether it succeeded

  • What error occurred

  • User feedback, if any

With observation, failure becomes something the system can reason about. Since we operate on a structured graph, this information can be attached as an additional node holding all the collected observations. All of this is manageable through cognee’s “Custom DataPoint”, where you can specify whichever fields you want to populate.
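A minimal version of such an observation record can be sketched in plain Python. The fields mirror the bullet list above; the class and function names are illustrative, not cognee's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunObservation:
    """One record per skill execution; in a graph store, one linked node."""
    task: str                       # what task was attempted
    skill: str                      # which skill was selected
    succeeded: bool                 # whether it succeeded
    error: Optional[str] = None     # what error occurred, if any
    feedback: Optional[str] = None  # user feedback, if any

log = []  # stand-in for the graph store

def record_run(task, skill, succeeded, error=None, feedback=None):
    obs = RunObservation(task, skill, succeeded, error, feedback)
    log.append(obs)  # a real system would attach this node to the skill's graph
    return obs

record_run("triage an incoming bug report", "bug-triage", False,
           error="missing 'severity' field in output")
record_run("triage an incoming bug report", "bug-triage", True)
```

The point of the sketch is only that every run leaves a structured trace; everything downstream (inspection, amendment, evaluation) reads from records like these.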

3. Inspect

Once enough failed runs accumulate (or even after a single important failure), one can inspect the connected history around that skill: past runs, feedback, tool failures, and related task patterns. Because all of this is stored as a graph, the system can trace the recurring factors behind bad outcomes and use that evidence to propose a better version of the skill.

runs → repeated weak outcomes → inspection
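At its simplest, inspection is aggregation over the stored run records. A hedged sketch, with the record shape and threshold chosen only for illustration:

```python
from collections import Counter

def inspect_skill(runs, failure_threshold=3):
    """runs: list of (succeeded, error) tuples for one skill.
    Flags the skill and surfaces its most recurrent error once enough
    failures accumulate; the threshold is arbitrary for this sketch."""
    failures = [error for succeeded, error in runs if not succeeded]
    if len(failures) < failure_threshold:
        return None  # not enough evidence yet
    top_error, count = Counter(failures).most_common(1)[0]
    return {"needs_amendment": True,
            "recurring_error": top_error,
            "occurrences": count}

runs = [(False, "timeout"), (True, None), (False, "timeout"), (False, "bad format")]
report = inspect_skill(runs)
```

A graph store lets the same question also pull in feedback and tool failures along edges; the counting logic stays the same in spirit.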

4. Amend skill → .amendify()

Once the system has enough evidence that a skill is underperforming, it can propose an amendment to the instructions. That proposal can be reviewed by a human, or applied automatically. The goal is simple:

  • Reduce the friction of maintaining skills as systems grow.

Instead of manually searching through your codebase for broken prompts, the system can look at the execution history of a skill, including past runs, failures, feedback, and tool errors, and suggest a targeted change.

The amendment might:

  • tighten the trigger

  • add a missing condition

  • reorder steps

  • change the output format

This is the moment where skills stop behaving like static prompt files and start behaving more like evolving components. Instead of opening a SKILL.md file and guessing what to change, the system can propose a patch grounded in evidence from how the skill actually behaved.
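The review-or-auto-apply choice described above can be expressed as a small gate. This sketch only illustrates the control flow; it is not the actual `.amendify()` implementation:

```python
def apply_amendment(skill_text, proposed_text, auto_apply=False, reviewer=None):
    """Apply a proposed amendment either automatically or after human review.
    reviewer: optional callable(old, new) -> bool for human-in-the-loop approval.
    Returns the (possibly unchanged) text plus an audit entry that always
    preserves the previous version, so nothing is lost."""
    approved = auto_apply or (reviewer is not None and reviewer(skill_text, proposed_text))
    if approved:
        return proposed_text, {"applied": True, "previous": skill_text}
    return skill_text, {"applied": False, "previous": skill_text}

old = "Summarize the document."
new = "Summarize the document in at most five bullet points."

# Human-in-the-loop: an approving reviewer lets the change through.
text, audit = apply_amendment(old, new, reviewer=lambda a, b: True)
```

Keeping the previous version in the audit entry is what makes the rollback in the next step cheap.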

5. Evaluate & Update skill

A self-improving system, though, should never be trusted simply because it can modify itself. Any amendment must be evaluated. Did the new version actually improve outcomes? Did it reduce failures? Did it introduce errors elsewhere?

For that reason, the loop cannot be just:

  • observe → inspect → amend

Instead, it must follow a more disciplined cycle:

  • observe → inspect → amend → evaluate

If an amendment does not produce a measurable improvement, the system should be able to roll it back. Because every change is tracked with its rationale and results, the original instructions are never lost, and self-improvement becomes a structured, auditable process rather than uncontrolled modification. When the evaluation confirms improvement, the amendment becomes the next version of the skill.
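The evaluate-then-promote-or-rollback rule boils down to comparing a metric before and after the change. A sketch using success rate as an illustrative metric (a real evaluation would also check for regressions elsewhere):

```python
def evaluate_amendment(before_runs, after_runs, min_improvement=0.05):
    """Each runs list holds booleans (True = success).
    Promote the amendment only if the success rate improves by at least
    min_improvement (an arbitrary bar for this sketch); otherwise signal
    a rollback. The original version is retained either way."""
    def success_rate(runs):
        return sum(runs) / len(runs) if runs else 0.0
    improved = success_rate(after_runs) - success_rate(before_runs) >= min_improvement
    return "promote" if improved else "rollback"

decision = evaluate_amendment(
    before_runs=[True, False, False, True],  # 50% success
    after_runs=[True, True, False, True],    # 75% success
)
```

Because the decision is computed from logged runs rather than intuition, each promoted version carries its own evidence, which is what makes the process auditable.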


Conclusion

Skills cannot stay static while the systems around them constantly change. As models, codebases, and tasks evolve, fixed prompt files inevitably degrade. We introduced a straightforward way to keep skills evolving automatically, without giving up control and oversight over the skills themselves.

Check out the PyPI build: https://pypi.org/project/cognee/0.5.4.dev2/

Check out Cognee: https://github.com/topoteretes/cognee

Join the Discord community: https://discord.gg/pMFAz242

📋 Discussion archive

Discussion in progress…