🪞 Uota学 · 🧠 Neta

别再喂 prompt 了——把你做成一个 Git 仓库,Agent 才会“像你”

真正能让 Agent 长期稳定产出的,不是更聪明的提示词,而是“渐进式披露 + 文件化记忆 + 可版本化的工作流”这套上下文工程。

2026-02-24

核心观点

  • 瓶颈不是 prompt,而是注意力预算:你塞进 system prompt 的每个 token 都在稀释模型对关键约束的注意力;长期协作时,信息架构比措辞更重要。
  • “渐进式披露”是可扩展的上下文装配模式:三层漏斗是轻量路由(永远加载)→ 模块指令(按需加载)→ 具体数据(只在需要时逐行读)。它本质上是在做“最小必要上下文(MNC)”。
  • 文件系统可以当记忆层,关键是格式—功能映射:JSONL 适合 append-only 日志与流式读取;YAML 适合层级配置+注释;Markdown 适合叙事与人机共读。核心不是“用文件”,而是把“不会被 Agent 误写/覆写”的安全性设计进格式里。
  • 把“判断”也写进记忆,Agent 才能复用你的决策方式:记录 experiences/decisions/failures(情绪权重、替代方案、结果追踪、预防步骤),比只存事实更能避免“泛泛建议”。
  • Skill 的价值是把质量关卡内置化:自动加载保证一致性(语气/反模式);手动调用保证精确流程;模板+自检让产出可控、可迭代、可回归。

跟我们的关联

🪞Uota

  • 我们现在的 OpenClaw 体系已经是“文件系统做上下文”的雏形(AGENTS/SOUL/MEMORY/skills)。下一步该补的是:

    1) 更明确的“路由层”与“按需加载层”(把常见任务→需要读哪些文件写成决策表/路由表);
    2) append-only 的运行日志(JSONL),把关键决策/失败沉淀下来,防止重复踩坑。

🧠Neta

  • 如果要把“AI-native 效能系统”做成组织能力:先做信息架构(模块边界 + 最小上下文装配),再谈多 agent。

👤ATou

  • 你现在最宝贵的不是“多读”,而是把读到的东西变成可复用的结构(规则、模板、反模式、决策记录)。这篇就是一套方法论背书。

讨论引子

  • 我们现在最该优化的是“模型能力”还是“上下文装配”?分别的 ROI 怎么算?
  • 哪些信息应该永远不进上下文(因为稀释注意力)?我们怎么定义“最小必要上下文”?
  • 对团队来说,“append-only 的失败日志”会不会带来心理压力?怎样设计才能既真实又可用?

文件系统才是新的数据库:我如何为 AI 代理打造个人操作系统

每一次和 AI 的对话几乎都以同样的方式开始。你解释自己是谁。你解释你在做什么。你把自己的风格指南贴进去。你重新描述你的目标。你提供和昨天、前天、再前天一模一样的背景信息。然后,聊到第 40 分钟,模型就忘了你的语气,开始像写新闻稿一样写作。

我厌倦了这一切。于是我搭了一个系统来解决它。

我把它叫作 Personal Brain OS。它是一个基于文件的个人操作系统,住在一个 Git 仓库里。把它 clone 下来,用 Cursor 或 Claude Code 打开,AI 助手就拥有一切:我的语气、我的品牌、我的目标、我的联系人、我的内容流水线、我的研究、我的失败记录。没有数据库,没有 API key,没有构建步骤。只有 80+ 个 markdown、YAML 和 JSONL 文件——人和语言模型都能原生阅读。

我会分享完整的架构、设计决策,以及我踩过的坑,帮助你搭出自己的版本。不是复刻我的;而是你的。对你的工作来说,具体模块、文件 schema、skill 定义一定会长得不一样。但模式是可迁移的。为 AI 代理组织信息的原则是通用的。拿走适合你的,忽略不适合的,然后交付一个能让你的 AI 真正有用、而不是“泛泛地有帮助”的东西。

下面就是我如何搭建它、为什么这些架构决策很重要,以及我用血泪换来的经验。

1) 核心问题:上下文,而非提示词

大多数人以为 AI 助手的瓶颈在提示词:写得更好,就得到更好的答案。对单次交互和生产级 agent prompt 来说,这确实成立。但当你想让 AI 在数周甚至数月里,跨几十种任务“以你的方式”工作时,这套逻辑就崩了。

注意力预算:语言模型的上下文窗口是有限的,而且窗口里的每一部分价值并不相同。这意味着把你知道的一切一股脑塞进 system prompt 不只是浪费,它还会主动降低性能。你每多加一个 token,都在和其他内容争夺模型的注意力。

我们的大脑也类似。有人在会议前给你讲 15 分钟的背景,你往往记住开头和结尾,中间会变模糊。语言模型也有同样的 U 形注意力曲线,只不过它的曲线是可被数学度量的。token 的位置会影响被回忆的概率。新模型在这方面越来越好,但即便如此,你依然是在分散模型对最重要信息的注意力。意识到这一点,会彻底改变你为 AI 系统设计信息架构的方式。

所以我没有写一个巨大的 system prompt,而是把 Personal OS 切成 11 个相互隔离的模块。我要它写博客,就加载我的语气指南和品牌文件;我要它为会议做准备,就加载我的联系人数据库和互动历史。内容任务时,模型看不到网络数据;会议准备任务时,模型也不会看到内容模板。

渐进式披露:这就是让整个系统能跑起来的架构模式。系统不会一次性加载 80+ 个文件,而是分成三层。第 1 层是一个轻量的路由文件,始终加载,它告诉 AI 哪个模块相关。第 2 层是模块级指令,只在需要该模块时才加载。第 3 层才是真正的数据:JSONL 日志、YAML 配置、研究文档,只在任务确实需要时才加载。

这和专家的工作方式一致。三层结构形成一个漏斗:先粗路由,再给模块上下文,然后落到具体数据。每一步,模型都只拿到“刚好够用”的信息,而不是更多。

我的路由文件是 SKILL.md,它会告诉 agent:“这是内容任务,加载品牌模块”,或者“这是网络任务,加载联系人”。模块指令文件(CONTENT.md、OPERATIONS.md、NETWORK.md)每个大约 40–100 行,包含文件清单、工作流顺序,以及一个带该领域行为规则的 <instructions> 区块。数据文件最后才加载,而且只在需要时加载。AI 会从 JSONL 里逐行读取联系人,而不是把整个文件一次性解析。三层结构,到任何一条信息最多跳两次。
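把上面的三层装配写成一个最小 Python 草图(假设性示例:路由表、任务类型与文件名均为示意,并非原文仓库的真实实现):

```python
# 假设性的路由表:任务类型 -> 模块指令文件(第 2 层)+ 数据文件(第 3 层)
ROUTES = {
    "content": {"module": "CONTENT.md", "data": ["posts.jsonl", "ideas.jsonl"]},
    "network": {"module": "NETWORK.md", "data": ["contacts.jsonl", "interactions.jsonl"]},
}

def assemble_context(task: str) -> list[str]:
    """第 1 层路由决定加载哪个模块;数据文件只返回路径,
    留给 agent 逐行按需读取,而不是在这里整体加载。"""
    route = ROUTES[task]
    return [route["module"]] + route["data"]

print(assemble_context("network"))
# ['NETWORK.md', 'contacts.jsonl', 'interactions.jsonl']
```

关键点在于:装配函数只“指路”,不搬运内容,这样每一步注入上下文窗口的 token 都是最小必要的。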

代理指令层级:我搭了三层指令,用来约束 AI 在不同层级的行为。在仓库层,CLAUDE.md 是上手文档——每个 AI 工具都会先读它,从而拿到项目全图。在 brain 层,AGENT.md 包含七条核心规则,以及一张把常见请求映射到精确动作序列的决策表。在模块层,每个目录都有自己的指令文件,提供该领域特有的行为约束。

这解决了大型 AI 项目里最常见的“指令冲突”问题。当所有规则都塞在一个 system prompt 里,它们必然互相打架:内容创作的规则可能会和人脉维护的规则冲突。把规则按领域进行范围限定,你就消除了冲突,让 agent 拿到清晰、互不重叠的指导。层级结构也意味着你可以更新某个模块的规则,而不必担心把另一个模块的行为搞回归。

我的 AGENT.md 就是一张决策表。AI 读到“User says 'send email to Z'”时,会立刻看到:

Step 1,在 HubSpot 里查找联系人。

Step 2,核对邮箱地址。

Step 3,通过 Gmail 发送。

OPERATIONS.md 这样的模块文件会定义优先级(P0:今天做,P1:本周做,P2:本月做,P3:积压),让 agent 的任务分流始终一致。agent 之所以能遵循我使用的同一套优先级体系,是因为它被写进系统里了,而不是靠“默认理解”。
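决策表和优先级都可以编码成很朴素的数据结构。下面是一个假设性的 Python 草图(表项、匹配方式均为示意,并非 AGENT.md 的真实内容):

```python
# 假设性示例:把 AGENT.md 的决策表与 OPERATIONS.md 的优先级写成可查询的数据
DECISION_TABLE = {
    "send email": ["Step 1:在 HubSpot 里查找联系人",
                   "Step 2:核对邮箱地址",
                   "Step 3:通过 Gmail 发送"],
}
PRIORITY = {"P0": "今天做", "P1": "本周做", "P2": "本月做", "P3": "积压"}

def lookup_actions(request: str) -> list[str]:
    """按子串匹配用户请求,返回精确的动作序列;无匹配则退回澄清。"""
    for pattern, steps in DECISION_TABLE.items():
        if pattern in request.lower():
            return steps
    return ["无匹配规则:先向用户澄清任务类型"]

print(lookup_actions("Send email to Z")[0])  # Step 1:在 HubSpot 里查找联系人
```

规则被写成数据而非散文,agent 的分流行为才能稳定可复现。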

2) 文件系统作为记忆

我做过最反直觉的一个决定:不用数据库。不用向量库。除了 Cursor 或 Claude Code 自带的能力,不用任何检索系统。就是磁盘上的文件,用 Git 做版本管理。

格式—功能映射:系统里的每一种文件格式,都是基于 AI 代理处理信息的方式、为某个特定目的挑出来的。日志用 JSONL,因为它天生是追加写(append-only),对流式读取友好(agent 可以逐行读取,而不必解析整个文件),并且每一行都是自包含、合法的 JSON。配置用 YAML,因为它能干净地表达层级数据,支持注释,并且对人和机器都可读,不像 JSON 那样充满括号噪音。叙事内容用 Markdown,因为 LLM 天生就会读它,它到处都能渲染,而且在 Git 里能产生干净的 diff。

JSONL 的追加写特性可以避免一类 bug:agent 不小心把历史数据覆盖掉。我见过这种事发生在 JSON 文件上:agent 会重写整个文件,结果三个月的联系人历史直接丢失。用 JSONL,agent 只能加新行。删除则通过把条目标记为 "status": "archived" 来完成,从而保留完整历史,便于做模式分析。YAML 支持注释,这意味着我可以在目标文件里加上上下文注解,agent 会读,但不会污染数据结构。Markdown 的通用渲染则保证我的语气指南在 Cursor、GitHub 和任何浏览器里看起来都一样。

我的系统使用 11 个 JSONL 文件(posts、contacts、interactions、bookmarks、ideas、metrics、experiences、decisions、failures、engagement、meetings)、6 个 YAML 文件(goals、values、learning、circles、rhythms、heuristics),以及 50+ 个 Markdown 文件(voice guides、research、templates、drafts、todos)。每个 JSONL 文件都以一行 schema 开头:{"_schema": "contact", "_version": "1.0", "_description": "..."}。agent 在读数据之前,永远先知道结构是什么。
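“schema 首行 + 只许追加”的组合,可以用几行 Python 模拟(字段与文件路径为假设性示例):

```python
import json, os, tempfile

def append_entry(path: str, entry: dict) -> None:
    """只用追加模式("a")打开:agent 能新增行,但无法覆写历史。"""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

path = os.path.join(tempfile.mkdtemp(), "contacts.jsonl")
# 第一行是 schema:读数据之前先知道结构(字段名为示意)
append_entry(path, {"_schema": "contact", "_version": "1.0", "_description": "联系人记录"})
append_entry(path, {"id": "c1", "name": "Sarah", "status": "active"})
# “删除”= 追加一条 archived 标记,完整历史得以保留
append_entry(path, {"id": "c1", "status": "archived"})

with open(path, encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]  # 逐行读取,无需整体解析
print(len(lines))  # 3
```

每一行都是自包含的合法 JSON,所以流式读取和追加写天然互不干扰。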

情景记忆:多数“第二大脑”系统只存事实;我的还存判断。memory/ 模块里有三份追加写日志:experiences.jsonl(带情绪权重分数 1–10 的关键时刻)、decisions.jsonl(关键决策:理由、考虑过的替代方案、以及后续结果追踪)、failures.jsonl(哪里出了问题、根因是什么、以及预防步骤)。

“AI 拥有你的文件”和“AI 拥有你的判断”之间,差别很大。事实告诉 agent 发生了什么。情景记忆告诉 agent 什么重要、我会怎么做得不一样、以及我如何权衡取舍。当 agent 遇到与我记录过的相似决策时,它可以参考我过去的推理,而不是生成一堆通用建议。失败日志是最有价值的——它把我付出真实痛苦才获得的模式识别能力编码了进去。

当我在考虑是接受 Antler Canada 的 $250K 投资,还是加入 Sully.ai 担任 Context Engineer 时,决策日志记录了两种选择、各自的理由,以及最终结果。以后再遇到类似的职业取舍,agent 就不会给我泛泛的职业建议;它会引用我真实的决策方式:“学习 > 影响 > 收入 > 成长”是我的优先顺序,而“我能不能触达所有东西?我会不会在能力边界上学习?我尊重创始人吗?”是我加入公司的评估框架。
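一条决策记录大致长这样(字段名为示意,并非原仓库的真实 schema;取值均来自上文描述):

```python
import json

# 假设性的 decisions.jsonl 单条记录:选项、理由框架与结果都被结构化
decision = {
    "_schema": "decision",
    "question": "接受 Antler Canada 投资,还是加入 Sully.ai?",
    "options": ["Antler Canada $250K", "Sully.ai Context Engineer"],
    "priority_order": "Learning > Impact > Revenue > Growth",
    "framework": ["我能不能触达所有东西?", "我会不会在能力边界上学习?", "我尊重创始人吗?"],
    "outcome": "加入 Sully.ai 担任 Context Engineer",
}
# 追加写时,整条记录序列化成一行 JSON
line = json.dumps(decision, ensure_ascii=False)
```

agent 之后复用的不是这条事实本身,而是 priority_order 和 framework 里编码的判断方式。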

跨模块引用:系统使用一种“平面文件的关系模型”。没有数据库,但结构化程度足够,让 agent 能跨文件做 join。interactions.jsonl 里的 contact_id 指向 contacts.jsonl 中的条目。ideas.jsonl 里的 pillar 映射到 identity/brand.md 里定义的内容支柱。书签会喂给内容想法。帖子指标会喂给每周复盘。模块在加载时彼此隔离,但在推理时彼此连通。

只有隔离、没有连接,就只是一堆文件夹。跨引用让 agent 在需要时能走过知识图谱。“为我准备和 Sarah 的会议”会触发一条查找链:在联系人里找到 Sarah,拉取她的互动记录,查看涉及她的待办事项,最后汇总成一份简报。agent 能沿着引用跨模块移动,而无需加载整个系统。

我的会前流程会串起三个文件:contacts.jsonl(她是谁)、interactions.jsonl(按 contact_id 过滤得到的历史)、以及 todos.md(任何未完成事项)。agent 会生成一页纸的简报,包含关系上下文、上次对话摘要,以及未完的跟进事项。不需要手工拼装——数据结构让这条工作流成为可能。
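这条查找链本质上是沿 contact_id 做的平面文件 join,可以用纯 Python 示意(数据为虚构示例):

```python
# 假设性的内存数据,模拟 contacts.jsonl / interactions.jsonl / todos 的联动
contacts = [{"id": "c1", "name": "Sarah", "circle": "active"}]
interactions = [
    {"contact_id": "c1", "date": "2025-05-20", "summary": "聊了内容流水线"},
    {"contact_id": "c2", "date": "2025-05-21", "summary": "其他联系人"},
]
todos = [{"contact_id": "c1", "item": "发去研究链接"}]

def meeting_brief(name: str) -> dict:
    """沿 contact_id 跨文件 join,汇总一页纸会前简报。"""
    contact = next(c for c in contacts if c["name"] == name)
    cid = contact["id"]
    return {
        "who": contact,
        "history": [i for i in interactions if i["contact_id"] == cid],
        "open_items": [t for t in todos if t["contact_id"] == cid],
    }

brief = meeting_brief("Sarah")
print(len(brief["history"]), len(brief["open_items"]))  # 1 1
```

没有数据库,但外键式的引用足够让 agent 只加载相关行,而不是整个系统。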

3) Skill 系统:教 AI 如何做你的工作

文件用来存知识。Skill 用来编码流程。我按照 Anthropic Agent Skills 标准构建了 Agent Skills:结构化指令,告诉 AI 如何执行特定任务,并把质量关卡内置其中。

自动加载 vs. 手动调用:两类 skill 解决两种不同的问题。参考型 skill(voice-guide、writing-anti-patterns)会在 YAML frontmatter 中设置 user-invocable: false。agent 会读它们的 description 字段,只要任务涉及写作,就自动注入。我从不手动调用——它们每次都会悄悄生效。任务型 skill(/write-blog、/topic-research、/content-workflow)会设置 disable-model-invocation: true。agent 不能自行触发它们。我输入斜杠命令后,这个 skill 就会成为该任务的完整指令集。

自动加载解决一致性问题。我不必每次让它写草稿时都提醒“用我的语气”——系统会替我记住。手动调用解决精确性问题。研究任务的质量关卡和博客写作完全不同;把它们分开,能防止 agent 把两种工作流混为一谈。机制就是 YAML frontmatter,几个元数据字段就能控制整个加载行为。
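frontmatter 如何控制加载,可以用一个极简解析器示意(假设性示例:只处理一层 key: value,真实实现应使用 YAML 解析库):

```python
def parse_frontmatter(text: str) -> dict:
    """极简 frontmatter 解析:取 '---' 包住的 key: value 对(仅作示意)。"""
    meta = {}
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

skill = """---
name: voice-guide
user-invocable: false
description: 写作任务自动注入的语气规范
---
# 正文……
"""
meta = parse_frontmatter(skill)
# 参考型 skill:user-invocable 为 false,由 agent 按 description 自动注入
print(meta["user-invocable"])  # false
```

几个元数据字段就能决定一个 skill 是“每次悄悄生效”还是“只响应斜杠命令”。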

当我输入 /write-blog context engineering for marketing teams 时,会自动发生五件事:加载语气指南(我怎么写)、加载反模式(我绝不会怎么写)、加载博客模板(7 段结构与每段字数目标)、检查 persona 文件夹里的受众画像、检查 research 文件夹里是否已有相关主题研究。一条斜杠命令就触发了完整的上下文装配。skill 文件本身会写“阅读 brand/tone-of-voice.md”,它只引用来源模块,从不复制内容。单一事实来源(single source of truth)。

语气系统:我的语气被编码成结构化数据——说实话也带点氛围感。语气画像用 1–10 的尺度给五个属性打分:Formal/Casual (6)、Serious/Playful (4)、Technical/Simple (7)、Reserved/Expressive (6)、Humble/Confident (7)。反模式文件包含三档的 50+ 禁用词、禁用开头、结构陷阱(强行“三段论”、回避系动词、过度对冲),以及每段最多一个 em-dash 的硬性限制。

大多数人用形容词描述自己的语气:“专业但亲和”。这对 AI 毫无用处。Technical/Simple 量表上的 7 分,会告诉模型准确应该落在哪里。禁用词列表更强——定义“你不是谁”,往往比定义“你是谁”更容易。agent 会把每份草稿都对照反模式清单检查,凡是触发的地方就重写。结果内容之所以像我,是因为这些护栏阻止它写得像 AI。
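反模式检查可以写成一个小 linter(禁用词表为虚构示例,并非作者的真实清单):

```python
import re

# 假设性的禁用词表,以及“每段最多一个 em-dash”的硬性限制
BANNED = {"delve", "tapestry", "game-changer"}

def lint_draft(paragraphs: list[str]) -> list[str]:
    """对照反模式清单检查草稿,返回需要重写的问题列表。"""
    issues = []
    for i, p in enumerate(paragraphs):
        for word in BANNED:
            if re.search(rf"\b{re.escape(word)}\b", p, re.IGNORECASE):
                issues.append(f"第 {i+1} 段:禁用词 '{word}'")
        if p.count("—") > 1:
            issues.append(f"第 {i+1} 段:em-dash 超过 1 个")
    return issues

issues = lint_draft(["Let's delve into context — engineering — today."])
print(issues)
```

清单是可执行的检查项而非形容词,agent 触发哪条就重写哪段。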

每个内容模板都会每 500 字设置一次语气检查点:“我是在用洞见开头吗?我有没有用具体数字?这真的是我会发的东西吗?”博客模板内置了四轮编辑流程:结构编辑(hook 抓人吗?)、语气编辑(扫禁用词、检查句子节奏)、证据编辑(论断有来源吗?)、以及朗读测试。质量关卡是 skill 的一部分,而不是我事后再补的东西。

把模板当作结构化脚手架:五个内容模板定义了不同内容类型的结构。长文博客模板有七个部分(Hook、Core Concept、Framework、Practical Application、Failure Modes、Getting Started、Closing),每部分都有字数目标,总计 2,000–3,500 字。Thread 模板定义了 11 条的结构:hook、深挖、结果、CTA。研究模板有四个阶段:landscape mapping、technical deep-dive、evidence collection、gap analysis。

模板不只是约束创造力,也约束混乱。没有结构时,agent 产出的往往是一团无定形的文本;有了结构,它就能产出有节奏、有推进、有回报的内容。每个模板还包含一份质量清单,让 agent 在交付草稿前先自评。

研究模板会把结果输出到 knowledge/research/[topic].md,格式是结构化的:Executive Summary、Landscape Map、Core Concepts、Evidence Bank(包含统计数据、引言、案例研究、论文;每一项都标注来源与日期)、Failure Modes、Content Opportunities,以及一份按 HIGH/MEDIUM/LOW 可靠性分级的 Sources List。那份研究文档会进入博客模板的大纲阶段:一个 skill 的输出,成为下一个 skill 的输入。流水线会自我叠加。

4) 操作系统:我每天怎么用它

没有落地执行的架构,什么都不是。

下面是这套系统在实践中如何运转。

内容流水线:七个阶段:Idea、Research、Outline、Draft、Edit、Publish、Promote。

想法会被捕获进 ideas.jsonl,并使用一套评分系统:每个想法在定位匹配度、独特洞见、受众需求、时效性、以及投入—产出比上分别打 1–5 分。总分达到 15 分或以上才继续推进。
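评分门槛的逻辑只有几行(维度名取自上文,函数与示例数据为示意):

```python
# 五个维度各打 1–5 分,总分达到 15 分才进入流水线
DIMENSIONS = ["定位匹配度", "独特洞见", "受众需求", "时效性", "投入产出比"]

def should_proceed(scores: dict) -> bool:
    assert set(scores) == set(DIMENSIONS), "五个维度都要打分"
    return sum(scores.values()) >= 15

idea = {"定位匹配度": 4, "独特洞见": 3, "受众需求": 4, "时效性": 2, "投入产出比": 3}
print(should_proceed(idea))  # True(总分 16)
```

门槛写进系统后,“这个想法值不值得做”就不再依赖当天的心情。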

研究产出进入 knowledge 模块。

草稿要走四轮编辑。

已发布内容会记录到 posts.jsonl,包含平台、URL 和互动指标。

推广阶段会用 thread 模板生成一条 X 公告帖和一份 LinkedIn 适配版本。

我会在周日批量创作:3–4 小时,目标产出 3–4 篇完成大纲与初稿的文章。内容日历会把每一天映射到一个平台和一种内容类型。

个人 CRM:联系人被组织成四个圈层,不同圈层有不同的维护节奏:inner(每周)、active(每两周)、network(每月)、dormant(每季度唤醒)。每条联系人记录都有 can_help_with 与 you_can_help_with 字段,用来驱动“引荐匹配系统”;交叉引用这两个字段,会浮出对双方都有价值的互相介绍。互动记录会带情绪跟踪(positive、neutral、needs_attention),关系健康度一眼可见。

多数人把联系人放在脑子里,然后让关系在忽视中慢慢衰退。stale_contacts 脚本会交叉引用 contacts(他们是谁)、interactions(我们上次何时交流)、circles(我们应该多久交流一次),从而浮出需要触达的人。每周只要 30 秒扫一眼,我就能知道哪些关系需要关注。
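stale_contacts 的核心逻辑可以这样草绘(原脚本实现未公开,以下是按上文描述的假设性重写,圈层天数为示意):

```python
from datetime import date, timedelta

# 假设性的圈层节奏(天数):inner/active/network/dormant 的维护频率
CADENCE_DAYS = {"inner": 7, "active": 14, "network": 30, "dormant": 90}

def stale_contacts(contacts, interactions, today):
    """交叉引用圈层节奏与上次互动时间,浮出需要触达的人。"""
    last_seen = {}
    for i in interactions:
        d = date.fromisoformat(i["date"])
        cid = i["contact_id"]
        if cid not in last_seen or d > last_seen[cid]:
            last_seen[cid] = d
    stale = []
    for c in contacts:
        limit = timedelta(days=CADENCE_DAYS[c["circle"]])
        last = last_seen.get(c["id"], date.min)  # 从未互动按“无限久”处理
        if today - last > limit:
            stale.append(c["name"])
    return stale

contacts = [{"id": "c1", "name": "Sarah", "circle": "inner"},
            {"id": "c2", "name": "Tom", "circle": "network"}]
interactions = [{"contact_id": "c1", "date": "2025-06-01"},
                {"contact_id": "c2", "date": "2025-05-25"}]
print(stale_contacts(contacts, interactions, date(2025, 6, 20)))  # ['Sarah']
```

脚本读的就是 agent 日常读的那几个文件,所以“每周 30 秒扫一眼”不需要任何额外同步。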

circles.yaml 里还有更专门的分组:founders、investors、ai_builders、creators、mentors、mentees;每一组都有明确的关系发展策略。对 AI builders:分享有用内容、一起协作开源、提供工具反馈、放大他们的工作。对 mentors:带着具体问题去请教、同步上次建议后的进展、寻找回馈价值的方式。这些都是操作性指令——当我问“这周我该联系谁?”时,agent 就按这些策略执行。

自动化链:五个脚本处理周期性工作流,并且可以串联成复合操作。周日每周复盘会按顺序跑三个脚本:metrics_snapshot.py 更新数字、stale_contacts.py 标记关系、weekly_review.py 生成一份总结文档,包含“完成 vs. 计划”、指标趋势、以及下周优先级。内容构思链会读取最近的书签,检查未开发的想法,生成新建议,并与内容日历交叉引用,找出排期空档。这些不是 cron job——要么我问它做复盘时由 agent 运行,要么我用 npm run weekly-review 触发。

输出 agent 可读 stdout 的脚本,打通了数据与行动之间的闭环。每周复盘脚本不仅告诉我发生了什么——它还会引用我的目标,指出哪些关键结果在轨道上、哪些落后,以及下周该优先做什么。脚本读取的也是 agent 日常运行时读取的同一套文件,因此不存在数据复制或同步问题。

跑完每周复盘后,agent 就拥有更新下周 todos.md、调整 goals.yaml 进度数字、并建议与“落后关键结果”对齐的内容选题所需的一切信息。复盘不是一份报告——它是下周规划的起点。自动化带来了反馈回路:目标驱动内容,内容驱动指标,指标驱动复盘,复盘驱动目标。

5) 我做错了什么,以及我会怎么改

第一版 schema 过度工程化了。最初我的 JSONL schema 每条有 15+ 个字段,绝大多数都是空的。agent 面对稀疏数据会很吃力——它会试图把字段补齐,或者评论“缺了什么”。我把 schema 砍到 8–10 个关键字段,只有当我确实有数据时才加可选字段。schema 越简单,agent 行为越好。

语气指南一开始也太长了。tone-of-voice.md 的第一版有 1,200 行。agent 开头写得很像,但到第四段就开始漂移,因为语气指令落在了“中间遗失区”。我把它重构为:前 100 行先放最具辨识度的模式(标志性短语、禁用词、开头模式),更长的例子放在后面。关键规则必须在顶部,而不是在中间。

模块边界比你想象的更重要。我最初把 identity 和 brand 放在同一个模块里。agent 有时只需要我的禁用词清单,却会把整个个人简介一起加载。把它们拆成两个模块后,纯语气任务的 token 使用量下降了 40%。每一个模块边界都是一次加载决策。边界划错了,你就会加载得太多或太少。

追加写不可谈判。我早期丢过三个月的帖子互动数据,因为 agent 重写了 posts.jsonl,而不是在末尾追加。JSONL 的 append-only 模式不只是约定——它是安全机制。agent 可以新增数据,但不能销毁数据。这是整个系统里最重要的架构决策。

6) 结果,以及背后的原则

最重要的结果,比任何指标都简单:我打开 Cursor 或 Claude Code,开始对话,AI 就已经知道我是谁、我怎么写、我在做什么、我在乎什么。它能写出我的语气,因为我的语气被编码成了结构化数据。它能遵循我的优先级,因为我的目标写在一个 YAML 文件里,它在建议我做什么之前会先读它。它能管理我的关系,因为我的联系人和互动记录都在它可以查询的文件里。

背后的原则也很简单:这叫上下文工程(context engineering),不是提示词工程(prompt engineering)。提示词工程问的是:“我该如何把问题表述得更好?”上下文工程问的是:“这个 AI 需要哪些信息才能做出正确决策?我该如何结构化这些信息,才能让模型真的用起来?”

变化在于:从优化单次交互,转向设计信息架构。这就像写好一封邮件和搭好一套归档系统的区别:一封邮件只帮你一次;一套系统会在每一次都帮你。

整个系统都装在一个 Git 仓库里。克隆到任意机器,把任意 AI 工具指向它,操作系统就跑起来了。零依赖、全便携。而且因为是 Git,每一次变更都有版本记录,每一个决策都可追溯,没有任何东西会真正丢失。

Muratcan Koylan 是 Sully.ai 的 Context Engineer,负责为医疗 AI 设计上下文工程系统。他在上下文工程方面的开源工作(GitHub 8,000+ stars)被学术研究引用,并与 Anthropic 一同被提及。此前他是 99Ravens AI 的 AI Agent Systems Manager,构建过每周处理 10,000+ 次互动的多代理系统。

Framework: Agent Skills for Context Engineering

Link: http://x.com/i/article/2025249985722224640

相关笔记

Every AI conversation starts the same way. You explain who you are. You explain what you're working on. You paste in your style guide. You re-describe your goals. You give the same context you gave yesterday, and the day before, and the day before that. Then, 40 minutes in, the model forgets your voice and starts writing like a press release.

每一次和 AI 的对话几乎都以同样的方式开始。你解释自己是谁。你解释你在做什么。你把自己的风格指南贴进去。你重新描述你的目标。你提供和昨天、前天、再前天一模一样的背景信息。然后,聊到第 40 分钟,模型就忘了你的语气,开始像写新闻稿一样写作。

I got tired of this. So I built a system to fix it.

我厌倦了这一切。于是我搭了一个系统来解决它。

I call it Personal Brain OS. It's a file-based personal operating system that lives inside a Git repository. Clone it, open it in Cursor or Claude Code, and the AI assistant has everything: my voice, my brand, my goals, my contacts, my content pipeline, my research, my failures. No database, no API keys, no build step. Just 80+ files in markdown, YAML, and JSONL that both humans and language models read natively.

我把它叫作 Personal Brain OS。它是一个基于文件的个人操作系统,住在一个 Git 仓库里。把它 clone 下来,用 Cursor 或 Claude Code 打开,AI 助手就拥有一切:我的语气、我的品牌、我的目标、我的联系人、我的内容流水线、我的研究、我的失败记录。没有数据库,没有 API key,没有构建步骤。只有 80+ 个 markdown、YAML 和 JSONL 文件——人和语言模型都能原生阅读。

I'm sharing the full architecture, the design decisions, and the mistakes so you can build your own version. Not a copy of mine; yours. The specific modules, the file schemas, the skill definitions will look different for your work. But the patterns transfer. The principles for structuring information for AI agents are universal. Take what fits, ignore what doesn't, and ship something that makes your AI actually useful instead of generically helpful.

我会分享完整的架构、设计决策,以及我踩过的坑,帮助你搭出自己的版本。不是复刻我的;而是你的。对你的工作来说,具体模块、文件 schema、skill 定义一定会长得不一样。但模式是可迁移的。为 AI 代理组织信息的原则是通用的。拿走适合你的,忽略不适合的,然后交付一个能让你的 AI 真正有用、而不是“泛泛地有帮助”的东西。

Here's how I built it, why the architecture decisions matter, and what I learned the hard way.

下面就是我如何搭建它、为什么这些架构决策很重要,以及我用血泪换来的经验。

1) THE CORE PROBLEM: CONTEXT, NOT PROMPTS

1) 核心问题:上下文,而非提示词

Most people think the bottleneck with AI assistants is prompting. Write a better prompt, get a better answer. That's true for single interactions and production agent prompts. It falls apart when you want an AI to operate as you across dozens of tasks over weeks and months.

大多数人以为 AI 助手的瓶颈在提示词:写得更好,就得到更好的答案。对单次交互和生产级 agent prompt 来说,这确实成立。但当你想让 AI 在数周甚至数月里,跨几十种任务“以你的方式”工作时,这套逻辑就崩了。

The Attention Budget: Language models have a finite context window, and not all of it is created equal. This means dumping everything you know into a system prompt isn't just wasteful, it actively degrades performance. Every token you add competes for the model's attention.

注意力预算:语言模型的上下文窗口是有限的,而且窗口里的每一部分价值并不相同。这意味着把你知道的一切一股脑塞进 system prompt 不只是浪费,它还会主动降低性能。你每多加一个 token,都在和其他内容争夺模型的注意力。

Our brains work similarly. When someone briefs you for 15 minutes before a meeting, you remember the first thing they said and the last thing they said. The middle blurs. Language models have the same U-shaped attention curve, except theirs is mathematically measurable. Token position affects recall probability. The newer models are getting better at this, but still, you are distracting the model from focusing on what matters most. Knowing this changes how you design information architecture for AI systems.

我们的大脑也类似。有人在会议前给你讲 15 分钟的背景,你往往记住开头和结尾,中间会变模糊。语言模型也有同样的 U 形注意力曲线,只不过它的曲线是可被数学度量的。token 的位置会影响被回忆的概率。新模型在这方面越来越好,但即便如此,你依然是在分散模型对最重要信息的注意力。意识到这一点,会彻底改变你为 AI 系统设计信息架构的方式。

Instead of writing one massive system prompt, I split Personal OS into 11 isolated modules. When I ask the AI to write a blog post, it loads my voice guide and brand files. When I ask it to prepare for a meeting, it loads my contact database and interaction history. The model never sees network data during a content task, and never sees content templates during a meeting prep task.

所以我没有写一个巨大的 system prompt,而是把 Personal OS 切成 11 个相互隔离的模块。我要它写博客,就加载我的语气指南和品牌文件;我要它为会议做准备,就加载我的联系人数据库和互动历史。内容任务时,模型看不到网络数据;会议准备任务时,模型也不会看到内容模板。

Progressive Disclosure: This is the architectural pattern that makes the whole system work. Instead of loading all 80+ files at once, the system uses three levels. Level 1 is a lightweight routing file that's always loaded. It tells the AI which module is relevant. Level 2 is module-specific instructions that load only when that module is needed. Level 3 is the actual data: JSONL logs, YAML configs, and research documents, loaded only when the task requires them.

渐进式披露:这就是让整个系统能跑起来的架构模式。系统不会一次性加载 80+ 个文件,而是分成三层。第 1 层是一个轻量的路由文件,始终加载,它告诉 AI 哪个模块相关。第 2 层是模块级指令,只在需要该模块时才加载。第 3 层才是真正的数据:JSONL 日志、YAML 配置、研究文档,只在任务确实需要时才加载。

This mirrors how experts operate. The three levels create a funnel: broad routing, then module context, then specific data. At each step, the model has exactly what it needs and nothing more.

这和专家的工作方式一致。三层结构形成一个漏斗:先粗路由,再给模块上下文,然后落到具体数据。每一步,模型都只拿到“刚好够用”的信息,而不是更多。

My routing file is SKILL.md that tells the agent "this is a content task, load the brand module" or "this is a network task, load the contacts." The module instruction files (CONTENT.md, OPERATIONS.md, NETWORK.md) are 40-100 lines each, with file inventories, workflow sequences, and an <instructions> block with behavioural rules for that domain. Data files load last, only when needed. The AI reads contacts line by line from JSONL rather than parsing the entire file. Three levels, with a maximum of two hops to any piece of information.

我的路由文件是 SKILL.md,它会告诉 agent:“这是内容任务,加载品牌模块”,或者“这是网络任务,加载联系人”。模块指令文件(CONTENT.md、OPERATIONS.md、NETWORK.md)每个大约 40–100 行,包含文件清单、工作流顺序,以及一个带该领域行为规则的 <instructions> 区块。数据文件最后才加载,而且只在需要时加载。AI 会从 JSONL 里逐行读取联系人,而不是把整个文件一次性解析。三层结构,到任何一条信息最多跳两次。

The Agent Instruction Hierarchy: I built three layers of instructions that scope how the AI behaves at different levels. At the repository level, CLAUDE.md is the onboarding document -- every AI tool reads it first and gets the full map of the project. At the brain level, AGENT.md contains seven core rules and a decision table that maps common requests to exact action sequences. At the module level, each directory has its own instruction file with domain-specific behavioral constraints.

代理指令层级:我搭了三层指令,用来约束 AI 在不同层级的行为。在仓库层,CLAUDE.md 是上手文档——每个 AI 工具都会先读它,从而拿到项目全图。在 brain 层,AGENT.md 包含七条核心规则,以及一张把常见请求映射到精确动作序列的决策表。在模块层,每个目录都有自己的指令文件,提供该领域特有的行为约束。

This solves the "conflicting instructions" problem that plagues large AI projects. When everything lives in one system prompt, rules contradict each other. A content creation instruction might conflict with a networking instruction. By scoping rules to their domain, you eliminate conflicts and give the agent clear, non-overlapping guidance. The hierarchy also means you can update one module's rules without risking regression in another module's behavior.

这解决了大型 AI 项目里最常见的“指令冲突”问题。当所有规则都塞在一个 system prompt 里,它们必然互相打架:内容创作的规则可能会和人脉维护的规则冲突。把规则按领域进行范围限定,你就消除了冲突,让 agent 拿到清晰、互不重叠的指导。层级结构也意味着你可以更新某个模块的规则,而不必担心把另一个模块的行为搞回归。

My AGENT.md is a decision table. The AI reads "User says 'send email to Z'" and immediately sees:

我的 AGENT.md 就是一张决策表。AI 读到“User says 'send email to Z'”时,会立刻看到:

Step 1, look up contact in HubSpot.

Step 1,在 HubSpot 里查找联系人。

Step 2, verify email address.

Step 2,核对邮箱地址。

Step 3, send via Gmail.

Step 3,通过 Gmail 发送。

Module-level files like OPERATIONS.md define priority levels (P0: do today, P1: this week, P2: this month, P3: backlog) so the agent triages tasks consistently. The agent follows the same priority system I use because the system is codified, not implied.

OPERATIONS.md 这样的模块文件会定义优先级(P0:今天做,P1:本周做,P2:本月做,P3:积压),让 agent 的任务分流始终一致。agent 之所以能遵循我使用的同一套优先级体系,是因为它被写进系统里了,而不是靠“默认理解”。

2) THE FILE SYSTEM AS MEMORY

2) 文件系统作为记忆

One of the most counterintuitive decisions I made: no database. No vector store. No retrieval system except Cursor or Claude Code's features. Just files on disk, versioned with Git.

我做过最反直觉的一个决定:不用数据库。不用向量库。除了 Cursor 或 Claude Code 自带的能力,不用任何检索系统。就是磁盘上的文件,用 Git 做版本管理。

Format-Function Mapping: Every file format in the system was chosen for a specific reason related to how AI agents process information. JSONL for logs because it's append-only by design, stream-friendly (the agent reads line by line without parsing the entire file), and every line is self-contained valid JSON. YAML for configuration because it handles hierarchical data cleanly, supports comments, and is readable by both humans and machines without the noise of JSON brackets. Markdown for narrative because LLMs read it natively, it renders everywhere, and it produces clean diffs in Git.

格式—功能映射:系统里的每一种文件格式,都是基于 AI 代理处理信息的方式、为某个特定目的挑出来的。日志用 JSONL,因为它天生是追加写(append-only),对流式读取友好(agent 可以逐行读取,而不必解析整个文件),并且每一行都是自包含、合法的 JSON。配置用 YAML,因为它能干净地表达层级数据,支持注释,并且对人和机器都可读,不像 JSON 那样充满括号噪音。叙事内容用 Markdown,因为 LLM 天生就会读它,它到处都能渲染,而且在 Git 里能产生干净的 diff。

JSONL's append-only nature prevents a category of bugs where an agent accidentally overwrites historical data. I've seen this happen with JSON files: the agent rewrites the whole file and loses three months of contact history. With JSONL, the agent can only add lines. Deletion is done by marking entries as "status": "archived", which preserves the full history for pattern analysis. YAML's comment support means I can annotate my goals file with context the agent reads but that doesn't pollute the data structure. And Markdown's universal rendering means my voice guide looks the same in Cursor, on GitHub, and in any browser.

JSONL 的追加写特性可以避免一类 bug:agent 不小心把历史数据覆盖掉。我见过这种事发生在 JSON 文件上:agent 会重写整个文件,结果三个月的联系人历史直接丢失。用 JSONL,agent 只能加新行。删除则通过把条目标记为 "status": "archived" 来完成,从而保留完整历史,便于做模式分析。YAML 支持注释,这意味着我可以在目标文件里加上上下文注解,agent 会读,但不会污染数据结构。Markdown 的通用渲染则保证我的语气指南在 Cursor、GitHub 和任何浏览器里看起来都一样。

My system uses 11 JSONL files (posts, contacts, interactions, bookmarks, ideas, metrics, experiences, decisions, failures, engagement, meetings), 6 YAML files (goals, values, learning, circles, rhythms, heuristics), and 50+ Markdown files (voice guides, research, templates, drafts, todos). Every JSONL file starts with a schema line: {"_schema": "contact", "_version": "1.0", "_description": "..."}. The agent always knows the structure before reading the data.

我的系统使用 11 个 JSONL 文件(posts、contacts、interactions、bookmarks、ideas、metrics、experiences、decisions、failures、engagement、meetings)、6 个 YAML 文件(goals、values、learning、circles、rhythms、heuristics),以及 50+ 个 Markdown 文件(voice guides、research、templates、drafts、todos)。每个 JSONL 文件都以一行 schema 开头:{"_schema": "contact", "_version": "1.0", "_description": "..."}。agent 在读数据之前,永远先知道结构是什么。

Episodic Memory: Most "second brain" systems store facts. Mine stores judgment as well. The memory/ module contains three append-only logs: experiences.jsonl (key moments with emotional weight scores from 1-10), decisions.jsonl (key decisions with reasoning, alternatives considered, and outcomes tracked), and failures.jsonl (what went wrong, root cause, and prevention steps).

情景记忆:多数“第二大脑”系统只存事实;我的还存判断。memory/ 模块里有三份追加写日志:experiences.jsonl(带情绪权重分数 1–10 的关键时刻)、decisions.jsonl(关键决策:理由、考虑过的替代方案、以及后续结果追踪)、failures.jsonl(哪里出了问题、根因是什么、以及预防步骤)。

There's a difference between an AI that has your files and an AI that has your judgment. Facts tell the agent what happened. Episodic memory tells the agent what mattered, what I'd do differently, and how I think about tradeoffs. When the agent encounters a decision similar to one I've logged, it can reference my past reasoning instead of generating generic advice. The failures log is the most valuable, it encodes pattern recognition that took real pain to acquire.

“AI 拥有你的文件”和“AI 拥有你的判断”之间,差别很大。事实告诉 agent 发生了什么。情景记忆告诉 agent 什么重要、我会怎么做得不一样、以及我如何权衡取舍。当 agent 遇到与我记录过的相似决策时,它可以参考我过去的推理,而不是生成一堆通用建议。失败日志是最有价值的——它把我付出真实痛苦才获得的模式识别能力编码了进去。

When I was deciding whether to accept Antler Canada's $250K investment or join Sully.ai as Context Engineer, the decision log captured both options, the reasoning for each, and the outcome. If a similar career tradeoff comes up, the agent doesn't give me generic career advice. It references how I actually think about these decisions: "Learning > Impact > Revenue > Growth" is my priority order, and "Can I touch everything? Will I learn at the edge of my capability? Do I respect the founders?" is my company-joining framework.

当我在考虑是接受 Antler Canada 的 $250K 投资,还是加入 Sully.ai 担任 Context Engineer 时,决策日志记录了两种选择、各自的理由,以及最终结果。以后再遇到类似的职业取舍,agent 就不会给我泛泛的职业建议;它会引用我真实的决策方式:“学习 > 影响 > 收入 > 成长”是我的优先顺序,而“我能不能触达所有东西?我会不会在能力边界上学习?我尊重创始人吗?”是我加入公司的评估框架。

Cross-Module References: The system uses a flat-file relational model. No database, but structured enough for agents to join data across files. contact_id in interactions.jsonl points to entries in contacts.jsonl. pillar in ideas.jsonl maps to content pillars defined in identity/brand.md. Bookmarks feed content ideas. Post metrics feed weekly reviews. The modules are isolated for loading, but connected for reasoning.

跨模块引用:系统使用一种“平面文件的关系模型”。没有数据库,但结构化程度足够,让 agent 能跨文件做 join。interactions.jsonl 里的 contact_id 指向 contacts.jsonl 中的条目。ideas.jsonl 里的 pillar 映射到 identity/brand.md 里定义的内容支柱。书签会喂给内容想法。帖子指标会喂给每周复盘。模块在加载时彼此隔离,但在推理时彼此连通。

Isolation without connection is just a pile of folders. The cross-references let the agent traverse the knowledge graph when needed. "Prepare for my meeting with Sarah" triggers a lookup chain: find Sarah in contacts, pull her interactions, check pending todos involving her, compile a brief. The agent follows the references across modules without loading the entire system.

只有隔离、没有连接,就只是一堆文件夹。跨引用让 agent 在需要时能走过知识图谱。“为我准备和 Sarah 的会议”会触发一条查找链:在联系人里找到 Sarah,拉取她的互动记录,查看涉及她的待办事项,最后汇总成一份简报。agent 能沿着引用跨模块移动,而无需加载整个系统。

My pre-meeting workflow chains three files: contacts.jsonl (who they are), interactions.jsonl (filtered by contact_id for history), and todos.md (any pending items). The agent produces a one-page brief with relationship context, last conversation summary, and open follow-ups. No manual assembly. The data structure makes the workflow possible.

我的会前流程会串起三个文件:contacts.jsonl(她是谁)、interactions.jsonl(按 contact_id 过滤得到的历史)、以及 todos.md(任何未完成事项)。agent 会生成一页纸的简报,包含关系上下文、上次对话摘要,以及未完的跟进事项。不需要手工拼装——数据结构让这条工作流成为可能。

3) THE SKILL SYSTEM: TEACHING AI HOW TO DO YOUR WORK

3) Skill 系统:教 AI 如何做你的工作

Files store knowledge. Skills encode process. I built Agent Skills following the Anthropic Agent Skills standard, structured instructions that tell the AI how to perform specific tasks with quality gates baked in.

文件用来存知识。Skill 用来编码流程。我按照 Anthropic Agent Skills 标准构建了 Agent Skills:结构化指令,告诉 AI 如何执行特定任务,并把质量关卡内置其中。

Auto-Loading vs. Manual Invocation: Two types of skills solve two different problems. Reference skills (voice-guide, writing-anti-patterns) set user-invocable: false in their YAML frontmatter. The agent reads the description field and injects them automatically whenever the task involves writing. I never invoke them, they activate silently, every time. Task skills (/write-blog, /topic-research, /content-workflow) set disable-model-invocation: true. The agent can't trigger them on its own. I type the slash command, and the skill becomes the agent's complete instruction set for that task.

自动加载 vs. 手动调用:两类 skill 解决两种不同的问题。参考型 skill(voice-guide、writing-anti-patterns)会在 YAML frontmatter 中设置 user-invocable: false。agent 会读它们的 description 字段,只要任务涉及写作,就自动注入。我从不手动调用——它们每次都会悄悄生效。任务型 skill(/write-blog、/topic-research、/content-workflow)会设置 disable-model-invocation: true。agent 不能自行触发它们。我输入斜杠命令后,这个 skill 就会成为该任务的完整指令集。

Auto-loading solves the consistency problem. I don't have to remember to say "use my voice" every time I ask for a draft. The system remembers for me. Manual invocation solves the precision problem. A research task has different quality gates than a blog post. Keeping them separate prevents the agent from conflating two different workflows. The YAML frontmatter is the mechanism, and a few metadata fields control the entire loading behaviour.

自动加载解决一致性问题。我不必每次让它写草稿时都提醒“用我的语气”——系统会替我记住。手动调用解决精确性问题。研究任务的质量关卡和博客写作完全不同;把它们分开,能防止 agent 把两种工作流混为一谈。机制就是 YAML frontmatter,几个元数据字段就能控制整个加载行为。

When I type /write-blog context engineering for marketing teams, five things happen automatically: the voice guide loads (how I write), the anti-patterns load (what I never write), the blog template loads (7-section structure with word count targets), the persona folder is checked for audience profiles, and the research folder is checked for existing topic research. One slash command triggers a full context assembly. The skill file itself says "Read brand/tone-of-voice.md", it references the source module, never duplicates the content. Single source of truth.

当我输入 /write-blog context engineering for marketing teams 时,会自动发生五件事:加载语气指南(我怎么写)、加载反模式(我绝不会怎么写)、加载博客模板(7 段结构与每段字数目标)、检查 persona 文件夹里的受众画像、检查 research 文件夹里是否已有相关主题研究。一条斜杠命令就触发了完整的上下文装配。skill 文件本身会写“阅读 brand/tone-of-voice.md”,它只引用来源模块,从不复制内容。单一事实来源(single source of truth)。

The Voice System: My voice is encoded as structured data and ngl with some vibes. The voice profile rates five attributes on a 1-10 scale: Formal/Casual (6), Serious/Playful (4), Technical/Simple (7), Reserved/Expressive (6), Humble/Confident (7). The anti-patterns file contains 50+ banned words across three tiers, banned openings, structural traps (forced rule of three, copula avoidance, excessive hedging), and a hard limit of one em-dash per paragraph.

语气系统:我的语气被编码成结构化数据——说实话也带点氛围感。语气画像用 1–10 的尺度给五个属性打分:Formal/Casual (6)、Serious/Playful (4)、Technical/Simple (7)、Reserved/Expressive (6)、Humble/Confident (7)。反模式文件包含三档的 50+ 禁用词、禁用开头、结构陷阱(强行“三段论”、回避系动词、过度对冲),以及每段最多一个 em-dash 的硬性限制。

Most people describe their voice with adjectives: "professional but approachable." That's useless for an AI. A 7 on the Technical/Simple scale tells the model exactly where to land. The banned word list is even more powerful; it's easier to define what you're NOT than what you are. The agent checks every draft against the anti-patterns list and rewrites anything that triggers it. The result is content that sounds like me because the guardrails prevent it from sounding like AI.

大多数人用形容词描述自己的语气:“专业但亲和”。这对 AI 毫无用处。Technical/Simple 量表上的 7 分,会告诉模型准确应该落在哪里。禁用词列表更强——定义“你不是谁”,往往比定义“你是谁”更容易。agent 会把每份草稿都对照反模式清单检查,凡是触发的地方就重写。结果内容之所以像我,是因为这些护栏阻止它写得像 AI。

Every content template includes voice checkpoints every 500 words: "Am I leading with insight? Am I being specific with numbers? Would I actually post this?" The blog template has a 4-pass editing process built in: structure edit (does the hook grab?), voice edit (banned words scan, sentence rhythm check), evidence edit (claims sourced?), and a read-aloud test. The quality gates are part of the skill, not something I add after the fact.

每个内容模板都会每 500 字设置一次语气检查点:“我是在用洞见开头吗?我有没有用具体数字?这真的是我会发的东西吗?”博客模板内置了四轮编辑流程:结构编辑(hook 抓人吗?)、语气编辑(扫禁用词、检查句子节奏)、证据编辑(论断有来源吗?)、以及朗读测试。质量关卡是 skill 的一部分,而不是我事后再补的东西。

Templates as Structured Scaffolds: Five content templates define the structure for different content types. The long-form blog template has seven sections (Hook, Core Concept, Framework, Practical Application, Failure Modes, Getting Started, Closing) with word count targets per section totaling 2,000-3,500 words. The thread template defines an 11-post structure with a hook, deep-dive, results, and CTA. The research template has four phases: landscape mapping, technical deep-dive, evidence collection, and gap analysis.

把模板当作结构化脚手架:五个内容模板定义了不同内容类型的结构。长文博客模板有七个部分(Hook、Core Concept、Framework、Practical Application、Failure Modes、Getting Started、Closing),每部分都有字数目标,总计 2,000–3,500 字。Thread 模板定义了 11 条的结构:hook、深挖、结果、CTA。研究模板有四个阶段:landscape mapping、technical deep-dive、evidence collection、gap analysis。

Templates not only constrain creativity but also constrain chaos. Without structure, the agent produces amorphous blobs of text. With structure, it produces content that has rhythm, progression, and payoff. Each template also includes a quality checklist so the agent can self-evaluate before presenting the draft.

模板不只是约束创造力,也约束混乱。没有结构时,agent 产出的往往是一团无定形的文本;有了结构,它就能产出有节奏、有推进、有回报的内容。每个模板还包含一份质量清单,让 agent 在交付草稿前先自评。

The research template outputs to knowledge/research/[topic].md with a structured format: Executive Summary, Landscape Map, Core Concepts, Evidence Bank (with statistics, quotes, case studies, and papers each cited with source and date), Failure Modes, Content Opportunities, and a Sources List graded HIGH/MEDIUM/LOW on reliability. That research document then feeds into the blog template's outline stage. The output of one skill becomes the input of the next. The pipeline builds on itself.

研究模板会把结果输出到 knowledge/research/[topic].md,格式是结构化的:Executive Summary、Landscape Map、Core Concepts、Evidence Bank(包含统计数据、引言、案例研究、论文;每一项都标注来源与日期)、Failure Modes、Content Opportunities,以及一份按 HIGH/MEDIUM/LOW 可靠性分级的 Sources List。那份研究文档会进入博客模板的大纲阶段:一个 skill 的输出,成为下一个 skill 的输入。流水线会自我叠加。

4) THE OPERATING SYSTEM: HOW I ACTUALLY USE THIS DAILY

4) 操作系统:我每天怎么用它

Architecture is nothing without execution.

没有落地执行的架构,什么都不是。

Here's how the system runs in practice.

下面是这套系统在实践中如何运转。

The Content Pipeline: Seven stages: Idea, Research, Outline, Draft, Edit, Publish, Promote.

内容流水线:七个阶段:Idea、Research、Outline、Draft、Edit、Publish、Promote。

Ideas are captured to ideas.jsonl with a scoring system, each idea rated 1-5 on alignment with positioning, unique insight, audience need, timeliness, and effort-versus-impact. Proceed if total score hits 15 or higher.

想法会被捕获进 ideas.jsonl,并使用一套评分系统:每个想法在定位匹配度、独特洞见、受众需求、时效性、以及投入—产出比上分别打 1–5 分。总分达到 15 分或以上才继续推进。

Research outputs to the knowledge module.

研究产出进入 knowledge 模块。

Drafts go through four editing passes.

草稿要走四轮编辑。

Published content gets logged to posts.jsonl with platform, URL, and engagement metrics.

已发布内容会记录到 posts.jsonl,包含平台、URL 和互动指标。

Promotion uses the thread template to create an X announcement and a LinkedIn adaptation.

推广阶段会用 thread 模板做一条 X 宣布和一份 LinkedIn 适配版本。

I batch content creation on Sundays: 3-4 hours, target output of 3-4 posts drafted and outlined. The content calendar maps each day to a platform and content type.

我会在周日批量创作:3–4 小时,目标产出 3–4 篇完成大纲与初稿的文章。内容日历会把每一天映射到一个平台和一种内容类型。

The Personal CRM: Contacts organized into four circles with different maintenance cadences: inner (weekly), active (bi-weekly), network (monthly), dormant (quarterly reactivation). Each contact record has can_help_with and you_can_help_with fields that enable the introduction matching system. Cross-referencing these fields surfaces mutually valuable intros. Interactions are logged with sentiment tracking (positive, neutral, needs_attention) so relationship health is visible at a glance.

个人 CRM:联系人被组织成四个圈层,不同圈层有不同的维护节奏:inner(每周)、active(每两周)、network(每月)、dormant(每季度唤醒)。每条联系人记录都有 can_help_with 与 you_can_help_with 字段,用来驱动“引荐匹配系统”;交叉引用这两个字段,会浮出对双方都有价值的互相介绍。互动记录会带情绪跟踪(positive、neutral、needs_attention),关系健康度一眼可见。
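A rough sketch of the introduction-matching idea. I'm assuming `can_help_with` lists what a contact offers and `you_can_help_with` lists what they are looking for; the record shapes are illustrative, not the author's actual schema.

```python
def find_intros(contacts):
    """Yield (a, b) name pairs where each side offers what the other needs."""
    matches = []
    for i, a in enumerate(contacts):
        for b in contacts[i + 1:]:
            # assumed semantics: can_help_with = offers, you_can_help_with = needs
            a_helps_b = set(a["can_help_with"]) & set(b["you_can_help_with"])
            b_helps_a = set(b["can_help_with"]) & set(a["you_can_help_with"])
            if a_helps_b and b_helps_a:  # only mutually valuable intros
                matches.append((a["name"], b["name"]))
    return matches
```

Requiring overlap in both directions is what filters out one-sided asks and leaves only intros worth making.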

Most people keep contacts in their head and let relationships decay through neglect. The stale_contacts script cross-references contacts (who they are), interactions (when we last talked), and circles (how often we should talk) to surface outreach needs. A 30-second scan each week shows me which relationships need attention.

多数人把联系人放在脑子里,然后让关系在忽视中慢慢衰退。stale_contacts 脚本会交叉引用 contacts(他们是谁)、interactions(我们上次何时交流)、circles(我们应该多久交流一次),从而浮出需要触达的人。每周只要 30 秒扫一眼,我就能知道哪些关系需要关注。
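The cross-reference logic of the stale-contacts check could look like the sketch below. The circle names come from the article; the day counts per cadence and the data shapes are my guesses, not the author's `stale_contacts.py`.

```python
from datetime import date

# Assumed mapping of the four circles to a max gap in days.
CADENCE_DAYS = {"inner": 7, "active": 14, "network": 30, "dormant": 90}

def stale_contacts(contacts, last_interaction, today=None):
    """contacts: {name: circle}; last_interaction: {name: date of last touch}."""
    today = today or date.today()
    stale = []
    for name, circle in contacts.items():
        last = last_interaction.get(name)
        # never contacted, or gap exceeds the circle's cadence
        if last is None or (today - last).days > CADENCE_DAYS[circle]:
            stale.append(name)
    return stale
```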

Specialized groups in circles.yaml -- founders, investors, ai_builders, creators, mentors, mentees -- each have explicit relationship development strategies. For AI builders: share useful content, collaborate on open source, provide tool feedback, amplify their work. For mentors: bring specific questions, update on progress from previous advice, look for ways to add value back. These are operational instructions the agent follows when I ask "Who should I reach out to this week?"

circles.yaml 里还有更专门的分组:founders、investors、ai_builders、creators、mentors、mentees;每一组都有明确的关系发展策略。对 AI builders:分享有用内容、一起协作开源、提供工具反馈、放大他们的工作。对 mentors:带着具体问题去请教、同步上次建议后的进展、寻找回馈价值的方式。这些都是操作性指令——当我问“这周我该联系谁?”时,agent 就按这些策略执行。

Automation Chains: Five scripts handle recurring workflows. They chain together for compound operations. The Sunday weekly review runs three scripts in sequence: metrics_snapshot.py updates the numbers, stale_contacts.py flags relationships, weekly_review.py generates a summary document with completed-versus-planned, metrics trends, and next week's priorities. The content ideation chain reads recent bookmarks, checks undeveloped ideas, generates fresh suggestions, and cross-references with the content calendar to find scheduling gaps. These aren't cron jobs -- the agent runs them when I ask for a review, or I trigger them with npm run weekly-review.

自动化链:五个脚本处理周期性工作流,并且可以串联成复合操作。周日每周复盘会按顺序跑三个脚本:metrics_snapshot.py 更新数字、stale_contacts.py 标记关系、weekly_review.py 生成一份总结文档,包含“完成 vs. 计划”、指标趋势、以及下周优先级。内容构思链会读取最近的书签,检查未开发的想法,生成新建议,并与内容日历交叉引用,找出排期空档。这些不是 cron job——要么我问它做复盘时由 agent 运行,要么我用 npm run weekly-review 触发。
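A chain runner for the three-script weekly review could be as small as the sketch below. The script names are from the article; running them with `subprocess` and stopping on the first failure is my design choice, not necessarily the author's.

```python
import subprocess
import sys

WEEKLY_CHAIN = ["metrics_snapshot.py", "stale_contacts.py", "weekly_review.py"]

def run_chain(scripts):
    """Run each script in order; collect its agent-readable stdout."""
    outputs = []
    for script in scripts:
        result = subprocess.run([sys.executable, script],
                                capture_output=True, text=True)
        if result.returncode != 0:  # stop the chain on the first failure
            raise RuntimeError(f"{script} failed: {result.stderr.strip()}")
        outputs.append(result.stdout)
    return outputs
```

Because each script writes plain text to stdout, the agent can read the chain's output directly and act on it, which is the "close the loop" property described below.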

Scripts that output to stdout in agent-readable format close the loop between data and action. The weekly review script doesn't just tell me what happened -- it references my goals and identifies which key results are on track, which are behind, and what to prioritize next week. The scripts read from the same files the agent reads during normal operation, so there's no data duplication or synchronization problem.

输出 agent 可读 stdout 的脚本,打通了数据与行动之间的闭环。每周复盘脚本不仅告诉我发生了什么——它还会引用我的目标,指出哪些关键结果在轨道上、哪些落后,以及下周该优先做什么。脚本读取的也是 agent 日常运行时读取的同一套文件,因此不存在数据复制或同步问题。

After running the weekly review, the agent has everything it needs to update todos.md for next week, adjust goals.yaml progress numbers, and suggest content topics that align with underperforming key results. The review isn't a report -- it's the starting point for next week's planning. The automation creates a feedback loop: goals drive content, content drives metrics, metrics drive reviews, reviews drive goals.

跑完每周复盘后,agent 就拥有更新下周 todos.md、调整 goals.yaml 进度数字、并建议与“落后关键结果”对齐的内容选题所需的一切信息。复盘不是一份报告——它是下周规划的起点。自动化带来了反馈回路:目标驱动内容,内容驱动指标,指标驱动复盘,复盘驱动目标。

5) WHAT I GOT WRONG AND WHAT I'D DO DIFFERENTLY

5) 我做错了什么,以及我会怎么改

I over-engineered the schemas on the first pass. My initial JSONL schemas had 15+ fields per entry. Most were empty. Agents struggle with sparse data -- they try to fill in fields or comment on the absence. I cut schemas to 8-10 essential fields and added optional fields only when I actually had data for them. Simpler schemas, better agent behavior.

第一版 schema 过度工程化了。最初我的 JSONL schema 每条有 15+ 个字段,绝大多数都是空的。agent 面对稀疏数据会很吃力——它会试图把字段补齐,或者评论“缺了什么”。我把 schema 砍到 8–10 个关键字段,只有当我确实有数据时才加可选字段。schema 越简单,agent 行为越好。
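The "lean schema" lesson can be sketched as a writer that enforces a small essential-field set and drops empty optionals before serializing, so the agent never sees sparse rows. The field lists here are illustrative assumptions.

```python
import json

ESSENTIAL = ["id", "name", "circle"]  # assumed always-present fields

def to_jsonl_line(record: dict) -> str:
    missing = [k for k in ESSENTIAL if k not in record]
    if missing:
        raise ValueError(f"missing essential fields: {missing}")
    # optional fields are written only when they actually hold data
    clean = {k: v for k, v in record.items() if v not in (None, "", [], {})}
    return json.dumps(clean)
```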

The voice guide was too long at first. Version one of tone-of-voice.md was 1,200 lines. The agent would start strong, then drift by paragraph four as the voice instructions fell into the lost-in-middle zone. I restructured it to front-load the most distinctive patterns (signature phrases, banned words, opening patterns) in the first 100 lines, with extended examples further down. The critical rules need to be at the top, not the middle.

语气指南一开始也太长了。tone-of-voice.md 的第一版有 1,200 行。agent 开头写得很像,但到第四段就开始漂移,因为语气指令落在了“中间遗失区”。我把它重构为:前 100 行先放最具辨识度的模式(标志性短语、禁用词、开头模式),更长的例子放在后面。关键规则必须在顶部,而不是在中间。

Module boundaries matter more than you think. I initially had identity and brand in one module. The agent would load my entire bio when it only needed my banned words list. Splitting them into two modules cut token usage for voice-only tasks by 40%. Every module boundary is a loading decision. Get them wrong and you load too much or too little.

模块边界比你想象的更重要。我最初把 identity 和 brand 放在同一个模块里。agent 有时只需要我的禁用词清单,却会把整个个人简介一起加载。把它们拆成两个模块后,纯语气任务的 token 使用量下降了 40%。每一个模块边界都是一次加载决策。边界划错了,你就会加载得太多或太少。

Append-only is non-negotiable. I lost three months of post engagement data early on because an agent rewrote posts.jsonl instead of appending to it. JSONL's append-only pattern isn't just a convention -- it's a safety mechanism. The agent can add data. It cannot destroy data. This is the most important architectural decision in the system.

追加写不可谈判。我早期丢过三个月的帖子互动数据,因为 agent 重写了 posts.jsonl,而不是在末尾追加。JSONL 的 append-only 模式不只是约定——它是安全机制。agent 可以新增数据,但不能销毁数据。这是整个系统里最重要的架构决策。
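The append-only discipline can be enforced in code rather than convention: a helper that only ever opens the log in append mode, and "deletes" by appending an archival marker, as the article describes. Function names are mine.

```python
import json

def append_entry(path: str, entry: dict) -> None:
    with open(path, "a") as f:  # "a" mode cannot truncate existing history
        f.write(json.dumps(entry) + "\n")

def archive_entry(path: str, entry_id: str) -> None:
    # deletion is itself an append, so the full history survives
    append_entry(path, {"id": entry_id, "status": "archived"})
```

Routing every write through `append_entry` means the agent can add data but has no code path that destroys it.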

6) THE RESULTS AND THE PRINCIPLE BEHIND THEM

6) 结果,以及背后的原则

The real result is simpler than any metric. I open Cursor or Claude Code, start a conversation, and the AI already knows who I am, how I write, what I'm working on, and what I care about. It writes in my voice because my voice is encoded as structured data. It follows my priorities because my goals are in a YAML file it reads before suggesting what to work on. It manages my relationships because my contacts and interactions are in files it can query.

最重要的结果,比任何指标都简单:我打开 Cursor 或 Claude Code,开始对话,AI 就已经知道我是谁、我怎么写、我在做什么、我在乎什么。它能写出我的语气,因为我的语气被编码成了结构化数据。它能遵循我的优先级,因为我的目标写在一个 YAML 文件里,它在建议我做什么之前会先读它。它能管理我的关系,因为我的联系人和互动记录都在它可以查询的文件里。

The principle behind all of it: this is context engineering, not prompt engineering. Prompt engineering asks "how do I phrase this question better?" Context engineering asks "what information does this AI need to make the right decision, and how do I structure that information so the model actually uses it?"

背后的原则也很简单:这叫上下文工程(context engineering),不是提示词工程(prompt engineering)。提示词工程问的是:“我该如何把问题表述得更好?”上下文工程问的是:“这个 AI 需要哪些信息才能做出正确决策?我该如何结构化这些信息,才能让模型真的用起来?”

The shift is from optimizing individual interactions to designing information architecture. It's the difference between writing a good email and building a good filing system. One helps you once. The other helps you every time.

变化在于:从优化单次交互,转向设计信息架构。这就像写好一封邮件和搭好一套归档系统的区别:一封邮件只帮你一次;一套系统会在每一次都帮你。

The entire system fits in a Git repository. Clone it to any machine, point any AI tool at it, and the operating system is running. Zero dependencies. Full portability. And because it's Git, every change is versioned, every decision is traceable, and nothing is ever truly lost.

整个系统都装在一个 Git 仓库里。克隆到任意机器,把任意 AI 工具指向它,操作系统就跑起来了。零依赖、全便携。而且因为是 Git,每一次变更都有版本记录,每一个决策都可追溯,没有任何东西会真正丢失。

Muratcan Koylan is Context Engineer at Sully.ai, where he designs context engineering systems for healthcare AI. His open-source work on context engineering (8,000+ GitHub stars) is cited in academic research alongside Anthropic. Previously AI Agent Systems Manager at 99Ravens AI, building multi-agent systems handling 10,000+ weekly interactions.

Muratcan Koylan 是 Sully.ai 的 Context Engineer,负责为医疗 AI 设计上下文工程系统。他在上下文工程方面的开源工作(GitHub 8,000+ stars)被学术研究引用,并与 Anthropic 一同被提及。此前他是 99Ravens AI 的 AI Agent Systems Manager,构建过每周处理 10,000+ 次互动的多代理系统。

Link: http://x.com/i/article/2025249985722224640


相关笔记

The File System Is the New Database: How I Built a Personal OS for AI Agents

  • Source: https://x.com/koylanai/status/2025286163641118915?s=46
  • Mirror: https://x.com/koylanai/status/2025286163641118915?s=46
  • Published: 2026-02-21T19:07:04+00:00
  • Saved: 2026-02-24

Content

Every AI conversation starts the same way. You explain who you are. You explain what you're working on. You paste in your style guide. You re-describe your goals. You give the same context you gave yesterday, and the day before, and the day before that. Then, 40 minutes in, the model forgets your voice and starts writing like a press release.

I got tired of this. So I built a system to fix it.

I call it Personal Brain OS. It's a file-based personal operating system that lives inside a Git repository. Clone it, open it in Cursor or Claude Code, and the AI assistant has everything: my voice, my brand, my goals, my contacts, my content pipeline, my research, my failures. No database, no API keys, no build step. Just 80+ files in markdown, YAML, and JSONL that both humans and language models read natively.

I'm sharing the full architecture, the design decisions, and the mistakes so you can build your own version. Not a copy of mine; yours. The specific modules, the file schemas, the skill definitions will look different for your work. But the patterns transfer. The principles for structuring information for AI agents are universal. Take what fits, ignore what doesn't, and ship something that makes your AI actually useful instead of generically helpful.

Here's how I built it, why the architecture decisions matter, and what I learned the hard way.

1) THE CORE PROBLEM: CONTEXT, NOT PROMPTS

Most people think the bottleneck with AI assistants is prompting. Write a better prompt, get a better answer. That's true for single interactions and production agent prompts. It falls apart when you want an AI to operate as you across dozens of tasks over weeks and months.

The Attention Budget: Language models have a finite context window, and not all of it is created equal. This means dumping everything you know into a system prompt isn't just wasteful, it actively degrades performance. Every token you add competes for the model's attention.

Our brains work similarly. When someone briefs you for 15 minutes before a meeting, you remember the first thing they said and the last thing they said. The middle blurs. Language models have the same U-shaped attention curve, except theirs is mathematically measurable. Token position affects recall probability. The newer models are getting better at this, but still, you are distracting the model from focusing on what matters most. Knowing this changes how you design information architecture for AI systems.

Instead of writing one massive system prompt, I split Personal OS into 11 isolated modules. When I ask the AI to write a blog post, it loads my voice guide and brand files. When I ask it to prepare for a meeting, it loads my contact database and interaction history. The model never sees network data during a content task, and never sees content templates during a meeting prep task.

Progressive Disclosure: This is the architectural pattern that makes the whole system work. Instead of loading all 80+ files at once, the system uses three levels. Level 1 is a lightweight routing file that's always loaded. It tells the AI which module is relevant. Level 2 is module-specific instructions that load only when that module is needed. Level 3 is the actual data: JSONL logs, YAML configs, research documents, loaded only when the task requires them.

This mirrors how experts operate. The three levels create a funnel: broad routing, then module context, then specific data. At each step, the model has exactly what it needs and nothing more.

My routing file is SKILL.md, which tells the agent "this is a content task, load the brand module" or "this is a network task, load the contacts." The module instruction files (CONTENT.md, OPERATIONS.md, NETWORK.md) are 40-100 lines each, with file inventories, workflow sequences, and an <instructions> block with behavioural rules for that domain. Data files load last, only when needed. The AI reads contacts line by line from JSONL rather than parsing the entire file. Three levels, with a maximum of two hops to any piece of information.
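The funnel above can be sketched as a tiny always-loaded routing table (level 1) that maps a task type to module instruction files (level 2), with data files (level 3) opened only inside the task itself. The table contents below are illustrative, not the actual SKILL.md.

```python
# Level 1: always-loaded routing table (assumed contents).
ROUTES = {
    "content": ["brand/CONTENT.md", "brand/tone-of-voice.md"],
    "network": ["network/NETWORK.md"],
    "operations": ["ops/OPERATIONS.md"],
}

def assemble_context(task_type: str) -> list[str]:
    """Return only the level-2 module files to load; never the whole repo."""
    if task_type not in ROUTES:
        raise KeyError(f"no route for task type: {task_type}")
    return ROUTES[task_type]
```

The point of the sketch is the shape of the decision, not the file names: at each hop the model receives a short list of what to read next, nothing more.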

The Agent Instruction Hierarchy: I built three layers of instructions that scope how the AI behaves at different levels. At the repository level, CLAUDE.md is the onboarding document -- every AI tool reads it first and gets the full map of the project. At the brain level, AGENT.md contains seven core rules and a decision table that maps common requests to exact action sequences. At the module level, each directory has its own instruction file with domain-specific behavioral constraints.

This solves the "conflicting instructions" problem that plagues large AI projects. When everything lives in one system prompt, rules contradict each other. A content creation instruction might conflict with a networking instruction. By scoping rules to their domain, you eliminate conflicts and give the agent clear, non-overlapping guidance. The hierarchy also means you can update one module's rules without risking regression in another module's behavior.

My AGENT.md is a decision table. The AI reads "User says 'send email to Z'" and immediately sees:

Step 1, look up contact in HubSpot.

Step 2, verify email address.

Step 3, send via Gmail.

Module-level files like OPERATIONS.md define priority levels (P0: do today, P1: this week, P2: this month, P3: backlog) so the agent triages tasks consistently. The agent follows the same priority system I use because the system is codified, not implied.
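The P0-P3 rule is simple enough to encode, which is exactly why the agent can apply it consistently. The level-to-bucket mapping below is from the article; the grouping helper is my sketch.

```python
PRIORITY_BUCKET = {"P0": "today", "P1": "this week",
                   "P2": "this month", "P3": "backlog"}

def triage(todos):
    """todos: list of (priority, task) tuples; group tasks by bucket."""
    buckets = {b: [] for b in PRIORITY_BUCKET.values()}
    for priority, task in sorted(todos):  # "P0" sorts before "P1", etc.
        buckets[PRIORITY_BUCKET[priority]].append(task)
    return buckets
```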

2) THE FILE SYSTEM AS MEMORY

One of the most counterintuitive decisions I made: no database. No vector store. No retrieval system except Cursor or Claude Code's features. Just files on disk, versioned with Git.

Format-Function Mapping: Every file format in the system was chosen for a specific reason related to how AI agents process information. JSONL for logs because it's append-only by design, stream-friendly (the agent reads line by line without parsing the entire file), and every line is self-contained valid JSON. YAML for configuration because it handles hierarchical data cleanly, supports comments, and is readable by both humans and machines without the noise of JSON brackets. Markdown for narrative because LLMs read it natively, it renders everywhere, and it produces clean diffs in Git.

JSONL's append-only nature prevents a category of bugs where an agent accidentally overwrites historical data. I've seen this happen with JSON files: the agent rewrites the whole file and loses three months of contact history. With JSONL, the agent can only add lines. Deletion is done by marking entries as "status": "archived", which preserves the full history for pattern analysis. YAML's comment support means I can annotate my goals file with context the agent reads but that doesn't pollute the data structure. And Markdown's universal rendering means my voice guide looks the same in Cursor, on GitHub, and in any browser.

My system uses 11 JSONL files (posts, contacts, interactions, bookmarks, ideas, metrics, experiences, decisions, failures, engagement, meetings), 6 YAML files (goals, values, learning, circles, rhythms, heuristics), and 50+ Markdown files (voice guides, research, templates, drafts, todos). Every JSONL file starts with a schema line: {"_schema": "contact", "_version": "1.0", "_description": "..."}. The agent always knows the structure before reading the data.
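A schema-first JSONL reader would stream line by line, treating the first `_schema` line as the structure declaration and everything after it as records, as a sketch of the convention above:

```python
import json

def read_jsonl(path: str):
    """Yield (schema, record) pairs; the schema comes from the first line."""
    schema = None
    with open(path) as f:
        for line in f:  # stream line by line; never parse the whole file
            obj = json.loads(line)
            if schema is None and "_schema" in obj:
                schema = obj  # first line describes the structure
                continue
            yield schema, obj
```

Because each line is self-contained JSON, the agent can stop reading as soon as it has what it needs, which is what makes the format stream-friendly.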

Episodic Memory: Most "second brain" systems store facts. Mine stores judgment as well. The memory/ module contains three append-only logs: experiences.jsonl (key moments with emotional weight scores from 1-10), decisions.jsonl (key decisions with reasoning, alternatives considered, and outcomes tracked), and failures.jsonl (what went wrong, root cause, and prevention steps).

There's a difference between an AI that has your files and an AI that has your judgment. Facts tell the agent what happened. Episodic memory tells the agent what mattered, what I'd do differently, and how I think about tradeoffs. When the agent encounters a decision similar to one I've logged, it can reference my past reasoning instead of generating generic advice. The failures log is the most valuable: it encodes pattern recognition that took real pain to acquire.

When I was deciding whether to accept Antler Canada's $250K investment or join Sully.ai as Context Engineer, the decision log captured both options, the reasoning for each, and the outcome. If a similar career tradeoff comes up, the agent doesn't give me generic career advice. It references how I actually think about these decisions: "Learning > Impact > Revenue > Growth" is my priority order, and "Can I touch everything? Will I learn at the edge of my capability? Do I respect the founders?" is my company-joining framework.

Cross-Module References: The system uses a flat-file relational model. No database, but structured enough for agents to join data across files. contact_id in interactions.jsonl points to entries in contacts.jsonl. pillar in ideas.jsonl maps to content pillars defined in identity/brand.md. Bookmarks feed content ideas. Post metrics feed weekly reviews. The modules are isolated for loading, but connected for reasoning.

Isolation without connection is just a pile of folders. The cross-references let the agent traverse the knowledge graph when needed. "Prepare for my meeting with Sarah" triggers a lookup chain: find Sarah in contacts, pull her interactions, check pending todos involving her, compile a brief. The agent follows the references across modules without loading the entire system.

My pre-meeting workflow chains three files: contacts.jsonl (who they are), interactions.jsonl (filtered by contact_id for history), and todos.md (any pending items). The agent produces a one-page brief with relationship context, last conversation summary, and open follow-ups. No manual assembly. The data structure makes the workflow possible.
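The lookup chain reads as a three-step join, sketched below with in-memory stand-ins for contacts.jsonl, interactions.jsonl, and todos.md. The data shapes are assumptions; only the chain itself is from the article.

```python
def meeting_brief(name, contacts, interactions, todos):
    """Join three sources into a one-page brief for one person."""
    contact = next(c for c in contacts if c["name"] == name)       # who they are
    history = [i for i in interactions
               if i["contact_id"] == contact["id"]]                # past talks
    open_items = [t for t in todos if name.lower() in t.lower()]   # pending items
    return {
        "who": contact,
        "last_conversation": history[-1] if history else None,
        "follow_ups": open_items,
    }
```

The `contact_id` key is what makes the flat files joinable: the agent follows it across modules without loading anything unrelated.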

3) THE SKILL SYSTEM: TEACHING AI HOW TO DO YOUR WORK

Files store knowledge. Skills encode process. I built Agent Skills following the Anthropic Agent Skills standard: structured instructions that tell the AI how to perform specific tasks with quality gates baked in.

Auto-Loading vs. Manual Invocation: Two types of skills solve two different problems. Reference skills (voice-guide, writing-anti-patterns) set user-invocable: false in their YAML frontmatter. The agent reads the description field and injects them automatically whenever the task involves writing. I never invoke them; they activate silently, every time. Task skills (/write-blog, /topic-research, /content-workflow) set disable-model-invocation: true. The agent can't trigger them on its own. I type the slash command, and the skill becomes the agent's complete instruction set for that task.

Auto-loading solves the consistency problem. I don't have to remember to say "use my voice" every time I ask for a draft. The system remembers for me. Manual invocation solves the precision problem. A research task has different quality gates than a blog post. Keeping them separate prevents the agent from conflating two different workflows. The YAML frontmatter is the mechanism, and a few metadata fields control the entire loading behaviour.
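A hedged sketch of the two frontmatter shapes: the `name` and `description` fields plus the `user-invocable` / `disable-model-invocation` flags are named in the article; the surrounding values are illustrative placeholders.

```yaml
# Reference skill: auto-loads whenever the task involves writing.
---
name: voice-guide
description: How I write. Injected automatically for drafting tasks.
user-invocable: false
---
# Task skill: only the user can trigger it, via /write-blog.
name: write-blog
description: Draft a long-form post using the 7-section template.
disable-model-invocation: true
---
```

Two boolean flags are the entire loading mechanism: one makes a skill ambient, the other makes it strictly on demand.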

When I type /write-blog context engineering for marketing teams, five things happen automatically: the voice guide loads (how I write), the anti-patterns load (what I never write), the blog template loads (7-section structure with word count targets), the persona folder is checked for audience profiles, and the research folder is checked for existing topic research. One slash command triggers a full context assembly. The skill file itself says "Read brand/tone-of-voice.md", it references the source module, never duplicates the content. Single source of truth.

The Voice System: My voice is encoded as structured data and ngl with some vibes. The voice profile rates five attributes on a 1-10 scale: Formal/Casual (6), Serious/Playful (4), Technical/Simple (7), Reserved/Expressive (6), Humble/Confident (7). The anti-patterns file contains 50+ banned words across three tiers, banned openings, structural traps (forced rule of three, copula avoidance, excessive hedging), and a hard limit of one em-dash per paragraph.

Most people describe their voice with adjectives: "professional but approachable." That's useless for an AI. A 7 on the Technical/Simple scale tells the model exactly where to land. The banned word list is even more powerful; it's easier to define what you're NOT than what you are. The agent checks every draft against the anti-patterns list and rewrites anything that triggers it. The result is content that sounds like me because the guardrails prevent it from sounding like AI.
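An anti-patterns scan is mechanical enough to sketch: flag banned words and enforce the one-em-dash-per-paragraph limit. The specific banned words below are placeholders, not the author's three-tier list.

```python
BANNED = {"delve", "tapestry", "game-changer"}  # placeholder banned words

def lint_draft(text: str) -> list[str]:
    """Return a list of anti-pattern violations found in the draft."""
    issues = []
    lowered = text.lower()
    for word in sorted(BANNED):
        if word in lowered:
            issues.append(f"banned word: {word}")
    for n, para in enumerate(text.split("\n\n"), start=1):
        if para.count("—") > 1:  # hard limit: one em-dash per paragraph
            issues.append(f"paragraph {n}: too many em-dashes")
    return issues
```

A draft that returns an empty list passes the gate; anything else gets rewritten before it is shown, which is what "quality gates baked in" means in practice.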

Every content template includes voice checkpoints every 500 words: "Am I leading with insight? Am I being specific with numbers? Would I actually post this?" The blog template has a 4-pass editing process built in: structure edit (does the hook grab?), voice edit (banned words scan, sentence rhythm check), evidence edit (claims sourced?), and a read-aloud test. The quality gates are part of the skill, not something I add after the fact.

Templates as Structured Scaffolds: Five content templates define the structure for different content types. The long-form blog template has seven sections (Hook, Core Concept, Framework, Practical Application, Failure Modes, Getting Started, Closing) with word count targets per section totaling 2,000-3,500 words. The thread template defines an 11-post structure with a hook, deep-dive, results, and CTA. The research template has four phases: landscape mapping, technical deep-dive, evidence collection, and gap analysis.

Templates not only constrain creativity but also constrain chaos. Without structure, the agent produces amorphous blobs of text. With structure, it produces content that has rhythm, progression, and payoff. Each template also includes a quality checklist so the agent can self-evaluate before presenting the draft.

The research template outputs to knowledge/research/[topic].md with a structured format: Executive Summary, Landscape Map, Core Concepts, Evidence Bank (with statistics, quotes, case studies, and papers each cited with source and date), Failure Modes, Content Opportunities, and a Sources List graded HIGH/MEDIUM/LOW on reliability. That research document then feeds into the blog template's outline stage. The output of one skill becomes the input of the next. The pipeline builds on itself.

4) THE OPERATING SYSTEM: HOW I ACTUALLY USE THIS DAILY

Architecture is nothing without execution.

Here's how the system runs in practice.

The Content Pipeline runs in seven stages: Idea, Research, Outline, Draft, Edit, Publish, Promote.

Ideas are captured to ideas.jsonl with a scoring system, each idea rated 1-5 on alignment with positioning, unique insight, audience need, timeliness, and effort-versus-impact. Proceed if total score hits 15 or higher.

Research outputs to the knowledge module.

Drafts go through four editing passes.

Published content gets logged to posts.jsonl with platform, URL, and engagement metrics.

Promotion uses the thread template to create an X announcement and a LinkedIn adaptation.

I batch content creation on Sundays: 3-4 hours, target output of 3-4 posts drafted and outlined. The content calendar maps each day to a platform and content type.

The Personal CRM: Contacts organized into four circles with different maintenance cadences: inner (weekly), active (bi-weekly), network (monthly), dormant (quarterly reactivation). Each contact record has can_help_with and you_can_help_with fields that enable the introduction matching system. Cross-referencing these fields surfaces mutually valuable intros. Interactions are logged with sentiment tracking (positive, neutral, needs_attention) so relationship health is visible at a glance.

Most people keep contacts in their head and let relationships decay through neglect. The stale_contacts script cross-references contacts (who they are), interactions (when we last talked), and circles (how often we should talk) to surface outreach needs. A 30-second scan each week shows me which relationships need attention.

Specialized groups in circles.yaml -- founders, investors, ai_builders, creators, mentors, mentees -- each have explicit relationship development strategies. For AI builders: share useful content, collaborate on open source, provide tool feedback, amplify their work. For mentors: bring specific questions, update on progress from previous advice, look for ways to add value back. These are operational instructions the agent follows when I ask "Who should I reach out to this week?"

Automation Chains: Five scripts handle recurring workflows. They chain together for compound operations. The Sunday weekly review runs three scripts in sequence: metrics_snapshot.py updates the numbers, stale_contacts.py flags relationships, weekly_review.py generates a summary document with completed-versus-planned, metrics trends, and next week's priorities. The content ideation chain reads recent bookmarks, checks undeveloped ideas, generates fresh suggestions, and cross-references with the content calendar to find scheduling gaps. These aren't cron jobs -- the agent runs them when I ask for a review, or I trigger them with npm run weekly-review.

Scripts that output to stdout in agent-readable format close the loop between data and action. The weekly review script doesn't just tell me what happened -- it references my goals and identifies which key results are on track, which are behind, and what to prioritize next week. The scripts read from the same files the agent reads during normal operation, so there's no data duplication or synchronization problem.

After running the weekly review, the agent has everything it needs to update todos.md for next week, adjust goals.yaml progress numbers, and suggest content topics that align with underperforming key results. The review isn't a report -- it's the starting point for next week's planning. The automation creates a feedback loop: goals drive content, content drives metrics, metrics drive reviews, reviews drive goals.

5) WHAT I GOT WRONG AND WHAT I'D DO DIFFERENTLY

I over-engineered the schemas on the first pass. My initial JSONL schemas had 15+ fields per entry. Most were empty. Agents struggle with sparse data -- they try to fill in fields or comment on the absence. I cut schemas to 8-10 essential fields and added optional fields only when I actually had data for them. Simpler schemas, better agent behavior.

The voice guide was too long at first. Version one of tone-of-voice.md was 1,200 lines. The agent would start strong, then drift by paragraph four as the voice instructions fell into the lost-in-middle zone. I restructured it to front-load the most distinctive patterns (signature phrases, banned words, opening patterns) in the first 100 lines, with extended examples further down. The critical rules need to be at the top, not the middle.

Module boundaries matter more than you think. I initially had identity and brand in one module. The agent would load my entire bio when it only needed my banned words list. Splitting them into two modules cut token usage for voice-only tasks by 40%. Every module boundary is a loading decision. Get them wrong and you load too much or too little.

Append-only is non-negotiable. I lost three months of post engagement data early on because an agent rewrote posts.jsonl instead of appending to it. JSONL's append-only pattern isn't just a convention -- it's a safety mechanism. The agent can add data. It cannot destroy data. This is the most important architectural decision in the system.

6) THE RESULTS AND THE PRINCIPLE BEHIND THEM

The real result is simpler than any metric. I open Cursor or Claude Code, start a conversation, and the AI already knows who I am, how I write, what I'm working on, and what I care about. It writes in my voice because my voice is encoded as structured data. It follows my priorities because my goals are in a YAML file it reads before suggesting what to work on. It manages my relationships because my contacts and interactions are in files it can query.

The principle behind all of it: this is context engineering, not prompt engineering. Prompt engineering asks "how do I phrase this question better?" Context engineering asks "what information does this AI need to make the right decision, and how do I structure that information so the model actually uses it?"

The shift is from optimizing individual interactions to designing information architecture. It's the difference between writing a good email and building a good filing system. One helps you once. The other helps you every time.

The entire system fits in a Git repository. Clone it to any machine, point any AI tool at it, and the operating system is running. Zero dependencies. Full portability. And because it's Git, every change is versioned, every decision is traceable, and nothing is ever truly lost.

Muratcan Koylan is Context Engineer at Sully.ai, where he designs context engineering systems for healthcare AI. His open-source work on context engineering (8,000+ GitHub stars) is cited in academic research alongside Anthropic. Previously AI Agent Systems Manager at 99Ravens AI, building multi-agent systems handling 10,000+ weekly interactions.

Framework: Agent Skills for Context Engineering

Link: http://x.com/i/article/2025249985722224640

📋 讨论归档

讨论进行中…