🧠 阿头学 · 🪞 Uota学 · 💬 讨论题

别再堆积木了,用“赤膊上阵”的极简主义榨干 Agent 潜能

最高级的 Agent 工程师不折腾框架,而是通过“极简上下文”和“任务契约”把 Agent 从幻觉制造机变成顶级执行特种兵。

2026-03-26

核心观点

  • 上下文纯净度决定智商:Agent 变蠢的起点是它被迫开始“做假设”或“填补信息空白”。当你喂给它不相关的历史记忆或过于复杂的插件说明时,它的逻辑推理能力会呈指数级下降。
  • 研究(Research)与执行(Implementation)必须物理分离:永远不要让执行任务的 Agent 去搜索方案。先用一个 Agent 锁定具体的技术栈和参数(如:JWT + bcrypt),再换一个全新的、无污染的 Context 去做精准实现。
  • 利用“奉迎性”构建对抗博弈:Agent 本质上是“讨好型人格”,问它有没有 Bug 它一定会编一个。与其纠正它,不如利用它:建立“找茬者(正方)+ 辩护者(反方)+ 裁判(持有真理假设)”的三角模型,这比单体 Agent 的准确率高出一个量级。
  • 以“任务契约”定义终点:Agent 知道怎么开始,但由于偷懒天性,它不知道怎么真正结束(常写一堆 Stub 敷衍)。必须通过 `{TASK}_CONTRACT.md` 强制要求:除非 X 个测试通过且视觉截图校验合格,否则不准结束任务。

跟我们的关联

🧠Neta

  • SOP 资产化:将团队目前的增长、宣发 SOP 从飞书/Slack 转化为 Agent 的 `SKILL.md`(配方)。当 Agent 表现下降时,优先做“规则重构和去冗余”,而不是无脑加新规则。
  • 海外质量控制:在 2026 海外增长战役中,所有交给 Agent 的任务必须挂载 `{TASK}_CONTRACT.md`。不仅看代码,还要看视觉截图校验,解决跨文化协作中的交付模糊问题。

🪞Uota

  • 逻辑路由化:立刻重构我们的 `CLAUDE.md`(或 `AGENT.md`)。它不应该是一个长篇累牍的说明书,而应该是一个“逻辑跳转路由器”。采用 `IF (场景) THEN READ (特定规则.md)` 的嵌套结构,极速降低 Context Bloat。

讨论引子

1. 工具链陷阱:我们目前使用的 Agent 框架或脚本,有多少是在解决“模型原本就能做但我们没教好”的问题?如果 Anthropic 在 6 个月内内置了这些功能,我们的护城河还在吗?
2. 零假设执行:如果要求我们在下一次开发中,严禁 Agent 在没有明确指令的情况下做任何技术假设,我们的任务拆解成本会增加多少?这种增加是否能被“零 Bug 交付”抵消?

引言

你是个开发者。你在用 Claude 和 Codex CLI,每天都在想自己有没有把 Claude 或 Codex 榨到极致。偶尔它会干出蠢到离谱的事,你完全想不通,外面一堆人像在造虚拟火箭,你却连两块石头都叠不稳。

你觉得是 harness、插件、终端之类的问题。你用 beads、opencode、zep,CLAUDE.md 写到 26000 行。可不管怎么折腾,你还是不明白自己为什么离天堂更近不了一点,只能看着别人和天使们嬉戏。

这就是你一直等的那篇飞升指南。

另外,这里没有站队。文中说 CLAUDE.md 也包括 AGENT.md,说 Claude 也包括 Codex。两者都用得很深。

过去几个月最有意思的观察之一是,几乎没人真正知道怎么把 agent 的能力榨到最大。

像是只有一小撮人能让 agent 去做世界构建者,其余人都在工具海里扑腾,被分析瘫痪困住,还以为只要凑齐某个包、某个 skill、某个 harness 的神组合,就能解锁 AGI。

今天想把这些幻觉都戳破,先给一句简单、诚实的话,然后再展开。你不需要最新的 agentic harness,不需要装一堆包,更不需要为了保持竞争力去读一堆东西。说实话,过度热情很可能是在帮倒忙。

不是来打卡的。从 agent 还几乎不会写代码的时候就开始用。各种包、各种 harness、各种范式都试过。还做过 agentic 工厂去写 signals、基础设施和数据管道,不是玩具项目,而是真正在生产环境跑过的实际用例。折腾到最后……

现在用的配置几乎可以说是最极简的那种,但仅靠最基础的 CLI(claude code 和 codex)再加上对几个 agentic engineering 基本原则的理解,就做出了迄今最突破性的工作。

明白世界在狂奔前进

先说一句,基础模型公司正处在代际加速期,而且肉眼可见,短期内不会慢下来。每一次 agent 智能的提升都会改变你跟它们合作的方式,因为这些 agent 通常会被设计得越来越愿意听指令。

就在几代之前,你在 CLAUDE.md 里写着先读 READ_THIS_BEFORE_DOING_ANYTHING.md 再干活,它有一半时间会直接回你一句去你的,然后想干嘛干嘛。今天它对大多数指令都很顺从,甚至是复杂的嵌套指令,比如你说 Read A, then read B, and if C, then read D,它大多都乐意照做。

想说的是,最重要的原则是要承认,每一代新 agent 都会逼着你重新思考什么才是最优,所以才会出现少即是多。

你一旦用上很多不同的库和 harness,就等于把自己锁进了某个对某个问题的解决方案里,而随着未来 agent 迭代,这个问题可能根本就不存在了。还有,你知道最热情、用 agent 用得最猛的人是谁吗?没错,是前沿公司的员工,token 预算无限,而且用的是真正最新的模型。明白这意味着什么吗?

这意味着,如果真有一个真实的问题,而且有一个好解法,前沿公司会是这种解法的最大用户。然后他们会做什么?把它直接并进自家产品。想想看,一家公司为什么要让另一个产品解决自己的真实痛点,还顺手制造外部依赖?这事为什么确定?看看 skills、memory harnesses、subagents 等等就知道了,它们一开始都是为了解一个真问题,经过实战检验,最后被证明确实有用。

所以,如果某个东西真够突破,能真正扩展 agentic 用例,它迟早会被基础模型公司的核心产品吸收。相信这点,基础模型公司在飞奔。放轻松,不用装任何东西,也不用依赖一堆外部组件,你照样能做出最好的工作。

我敢预测评论区马上就会被这类话刷屏:"SysLS, I use so-and-so harness and it's amazing! I managed to recreate Google in a single day!"; 对此想说,恭喜。但这篇文章的目标读者不是你,而且你代表的是社区里极小的一撮真正搞明白 agentic engineering 的人。

上下文是一切

不夸张,上下文就是一切。这也是用一堆插件和外部依赖带来的另一个问题。你会患上“上下文膨胀症”,说白了就是给 agent 塞了太多信息,把它压垮了。

用 Python 给我做个 hangman 游戏?这简单。等等,26 次会话之前那条关于 "managing memory" 的笔记是什么?啊,71 次会话之前我们开太多子进程把屏幕卡死过,所以要记这个。每次都写笔记?行,没问题……可这跟 hangman 有什么关系?

意思到了。给 agent 的信息要刚刚好,只够它把任务做完,多一字都别加。你对这件事控制得越好,agent 的表现就越好。一旦塞进各种离谱的记忆系统、插件,或者一堆名字糟糕、调用含糊的 skills,你就会出现这种场面,你只想让它写一首关于红杉森林的小诗,它却同时拿到了一份造炸弹的说明和一份烤蛋糕的配方。

所以再说一遍,把依赖能删就删,然后……

做真正有效的事

对实现要精确

还记得上下文是一切吗?

还记得要给 agent 注入刚刚好的信息,只够它完成任务吗?

第一招是把调研和实现拆开。给 agent 的要求要精确到不能再精确。

你不精确的时候会发生什么?比如你说:"Go and build an auth system." agent 得先搞清楚 auth system 是什么、有哪些选项、各自利弊是什么。接着它会去网上翻一堆其实不需要的信息,把上下文塞满各种可能的实现细节。等真正开始写代码时,它更容易混乱,甚至在选定方案上凭空补出不必要、无关的细节。

反过来,如果你说:"implement JWT authentication with bcrypt-12 password hashing, refresh token rotation with 7-day expiry..." 那它就不需要研究别的替代方案了。你要什么一清二楚,它的上下文就能被正确地填满实现细节。

当然,你不可能总是手里就有实现细节。很多时候你也不知道哪种才对,有时甚至希望把方案选择这件事交给 agent。那怎么办?很简单,先做一个关于多种实现可能性的调研任务,你自己拍板或让一个 agent 选定方案,然后再用一个全新上下文的 agent 去实现。
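顺着这个思路,“调研 → 拍板 → 全新上下文实现”可以写成一个很薄的脚本。下面是一个示意性的 Python 草稿:其中 `claude -p` 的非交互式调用方式和提示词措辞都只是假设,实际以你所用 CLI 的文档为准。

```python
import subprocess

def research_prompt(goal: str) -> str:
    # 调研 agent:只对比方案,不写代码,产出一份可供拍板的结论
    return (
        f"Research implementation options for: {goal}. "
        "List 2-3 candidates with trade-offs. Do NOT write any code."
    )

def implement_prompt(decision: str) -> str:
    # 实现 agent:只拿到已拍板的方案,上下文里没有任何调研噪音
    return (
        "Implement exactly the following and make no other assumptions: "
        + decision
    )

def run_fresh_session(prompt: str) -> str:
    """每次调用都开一个全新会话/上下文(`claude -p` 的用法是假设)。"""
    result = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True
    )
    return result.stdout
```

用法大致是:先用 `run_fresh_session(research_prompt("auth system"))` 拿到方案对比,由你自己(或另一个 agent)拍板出类似 "JWT authentication with bcrypt-12 password hashing" 的决定,再用 `run_fresh_session(implement_prompt(decision))` 在全新上下文里做实现。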

当你用这种思路看问题,就会发现工作流里很多地方,agent 的上下文被无意义地污染了,那些信息根本不影响实现。接着你就能在 agentic 工作流里建墙,把无关信息隔离掉,只留下它完成任务所必需的那一点点上下文。记住,你手里其实是一个很聪明、很能干的队友,知道宇宙里所有不同种类的球。但如果你不告诉它你想让它专注于设计一个让人跳舞、玩得开心的空间,它就会不停给你讲球形物体的各种好处。

讨好倾向的设计限制

没人会想用一个一直骂人、一直说你错了,或者干脆无视你指令的产品。所以,这些 agent 会尽量同意你、尽量按你想要的去做。

你让它每三个词就加一次 "happy",它会尽力照做,大多数人也能理解。正是这种愿意配合,才让它用起来这么有趣。但这也带来很有意思的特性,当你说:"Find me a bug in the codebase". 它就会给你找一个 bug,哪怕必须硬编一个。为什么?因为它太想听你的话了。

很多人一边抱怨 LLM 会幻觉、会编出不存在的东西,一边没意识到问题往往出在提问方式上。你要它给你一个东西,它就会给,哪怕得稍微掰弯一点事实。

那该怎么做?中性的提示词更好用,也就是不把 agent 往某个结论上推。比如不说:"Find me a bug in the database",而是说:"Search through the database, try to follow along with the logic of each component, and report back all findings."

这种中性提示有时会把 bug 挖出来,有时就只是客观地描述代码怎么跑。但它不会先把 agent 的脑子带进一定有 bug 的预设里。

另一种应对讨好倾向的办法,是反过来利用它。agent 想讨好、想听话,这点是确定的,所以完全可以把它往需要的方向偏置。

所以会先叫一个 bug-finder agent 去把数据库里所有 bug 都挑出来,告诉它,低影响 bug 给 +1,有点影响给 +5,关键影响给 +10。然后这个 agent 会异常兴奋,把各种 bug 都列出来,包括其实不算 bug 的,最后回来报个 104 之类的分数。把这当作所有可能 bug 的超集。

接着再找一个 adversarial agent,告诉它,每成功证伪一个 bug,就拿到该 bug 的分数,但如果证伪错了,会扣 -2*score。这样一来,这个对抗 agent 会尽量去证伪更多 bug,同时又会更谨慎,因为它知道会被罚分。即便如此,它还是会很凶地去证伪所有 bug,甚至真实 bug 也会被它怼。把这当作所有真实 bug 的子集。

最后再叫一个 referee agent,把两边的输出拿来评分。我会骗它说我手里有真实的 ground truth,它判对了加 +1,判错了扣 -1。然后它会对每一个候选 bug 去给 bug-finder 和 adversarial agent 打分。referee 说哪个是真的,就自己再看一遍确认。大多数时候准确得吓人,偶尔还是会错几处,但整体已经接近无懈可击。

也许你会发现只用 bug-finder 就够了。但这套对我更好用,因为它利用了 agent 被硬编码的本性,想讨好。
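这套三角博弈的核心就是三段带计分规则的提示词,外加一个极简的汇总。下面是一个示意草稿:+1/+5/+10 与 -2*score 的计分规则来自上文,英文措辞是假设,可按需调整。

```python
def bug_finder_prompt() -> str:
    # 找茬者:利用讨好倾向让它尽量多报,得到“可能 bug 的超集”
    return (
        "Find all bugs in the codebase. Scoring: +1 per low-impact bug, "
        "+5 per medium-impact bug, +10 per critical bug."
    )

def adversarial_prompt() -> str:
    # 证伪者:成功证伪得该 bug 的分,证伪错了扣 2 倍分
    return (
        "For each reported bug, try to disprove it. You earn the bug's "
        "score for a correct disproof, but lose 2x its score if wrong."
    )

def referee_prompt() -> str:
    # 裁判:上文提到要“骗”它说手里有 ground truth,迫使它认真判断
    return (
        "I hold the ground truth for every candidate bug. Judge each one: "
        "+1 for a correct verdict, -1 for a wrong one. Output verdicts only."
    )

def confirmed_bugs(verdicts: dict[str, bool]) -> list[str]:
    """裁判判为真实的 bug 子集,留给人工做最后复核。"""
    return [bug for bug, is_real in verdicts.items() if is_real]
```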

怎么判断哪些有效、哪些有用?

这听起来像是很难的问题,好像得钻研得很深、还得站在 AI 更新的最前沿才行,但其实很简单。如果 OpenAI 和 Claude 都实现了它,或者收购了实现它的东西,那它大概率就是有用的。

看到 "skills" 现在到处都是,还写进了 Claude 和 Codex 的官方文档吗?看到 OpenAI 收购 OpenClaw 吗?看到 Claude 很快就加上了 memory、voices 和 remote work 吗?

那 planning 呢?还记得一帮人发现先规划再实现特别有用,然后它就变成了核心能力吗?

对,就是这些东西有用。

还记得 endless stop-hooks 当年有多好用,因为 agent 太不愿意做长任务了……结果 Codex 5.2 一上,它一夜之间就没用了?对,就这么快。

就记住这点就够了。如果它真的重要、真的有用,Claude 和 Codex 迟早会实现。所以不用焦虑要不要用所谓的新东西,也不用急着熟悉新东西,甚至不需要刻意保持最新。

帮个忙,偶尔把你用的 CLI 工具更新一下,顺手看看新增了哪些功能,就已经绰绰有余。

压缩、上下文与假设

跟 agent 打交道时,有个巨大的坑很多人会踩到,有时它聪明得像地球上最聪明的生物,有时又让人不敢相信自己竟然被它蒙过去了。

聪明?这东西蠢得离谱!

关键差别在于它有没有不得不做假设、去补齐缺口。到今天为止,它们在连点成线、补齐缺口、做假设这件事上仍然很糟。只要一开始自作主张去补,质量立刻肉眼可见地下滑。

claude.md 里最重要的规则之一,是关于如何抓取上下文的规则,并且要让 agent 每次读取 claude.md 时第一件事就读这条规则,而它读 claude.md 总是在 compaction 之后。在这条抓取上下文的规则里,有几条简单却很管用的指令,比如继续之前先重读任务计划,再重读与任务相关的文件。
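这条“抓取上下文”规则落到文件里,大致就是几行字。下面是一个示意写法(文件名与条目均为假设):

```
<!-- CLAUDE.md 片段:compaction 后第一条指令 -->
FIRST, before doing anything else, read `rules/context-rules.md`.

<!-- rules/context-rules.md 片段 -->
Before continuing any task after compaction:
1. Re-read the current task plan (e.g. `PLAN.md`).
2. Re-read every file relevant to the current task.
3. If anything is unclear, ask; do NOT fill gaps with assumptions.
```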

让 agent 知道如何结束任务

人类对任务什么时候算完成,心里往往很有数。但对 agent 来说,当前智能最大的痛点是,它知道怎么开始,却不知道怎么结束。

这经常导致很挫败的结果,它写一堆 stub 就收工,自以为任务完成。

测试对 agent 来说是非常好的里程碑,因为它是确定性的,你也能设定非常清晰的标准。除非这 X 个测试全部通过,否则任务就不算完成,而且不允许改测试。

之后只要把测试本身看一遍就行,全部通过的时候就能放心。这件事也能自动化,但重点是要记住,任务的结束对人类很自然,对 agent 并不自然。

还有一个最近变得可行的任务终点是 Screenshots + verification。你可以让 agent 一直实现到测试全部通过,然后让它截图,并在截图上验证 "DESIGN OR BEHAVIOR"。

这样就能让 agent 反复迭代,朝你想要的设计靠近,而不用担心它第一版就停了。

进一步自然的做法,是跟 agent 建立一个 "contract",并把它写进规则里。比如 this {TASK}_CONTRACT.md constitutes what needs to be done before you are allowed to terminate the session。在 {TASK}_CONTRACT.md 里写清楚需要跑的测试、需要的截图以及其他验证,只有这些都完成,任务才算可以结束。
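一份 {TASK}_CONTRACT.md 不需要复杂,关键是每个条目都可验证。下面是一个示意(任务名、测试数量、文件路径均为假设):

```
<!-- AUTH_CONTRACT.md:会话终止前必须全部勾选 -->
# Task Contract: JWT auth

You may NOT end this session until every item below is checked:

- [ ] All 12 tests in `tests/test_auth.py` pass; tests must not be edited
- [ ] Screenshot of the login page taken and verified against the design spec
- [ ] No stubs or TODO markers remain in `src/auth/`
```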

永不下线的 agent

经常有人问,怎么让 agent 24 小时跑着,还能确保它不会跑偏。

做法很简单,写一个 stop-hook,除非 {TASK}_CONTRACT.md 的每一项都完成,否则不允许它结束会话。

如果你有 100 份这种写得很清楚、内容正好就是你想要构建的 contract,那 stop-hook 就会让 agent 不停干,直到这 100 份 contract 全部满足,包括需要跑的所有测试和验证。
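以 Claude Code 的 Stop hook 为例,一个最小实现大致如下:检查 contract 文件,只要还有未勾选项就阻止会话结束。注意,hook 从 stdin 收 JSON、输出 `{"decision": "block"}` 来阻止终止,这套协议请以官方 hooks 文档为准,这里只是假设性的示意;contract 文件名也是假设。

```python
import json
import re
import sys
from pathlib import Path

def unchecked_items(contract_text: str) -> list[str]:
    """找出 Markdown 清单里所有未勾选的 `- [ ]` 条目。"""
    return re.findall(r"^- \[ \] (.+)$", contract_text, flags=re.MULTILINE)

def stop_hook(contract_path: str = "TASK_CONTRACT.md") -> None:
    """作为 Stop hook 的入口被调用(文件名与 hook 协议均为假设)。"""
    _event = json.load(sys.stdin)  # hook 的触发信息,这里用不到
    contract = Path(contract_path)
    todo = unchecked_items(contract.read_text()) if contract.exists() else []
    if todo:
        # 阻止会话结束,并把未完成项写进理由,agent 会继续干活
        print(json.dumps({
            "decision": "block",
            "reason": "Contract incomplete: " + "; ".join(todo),
        }))
```

把这样一个脚本注册为 Stop hook 后,agent 每次试图结束会话都会先被 contract 拦一道。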

Pro tip,我并不觉得长时间、24 小时不间断的会话是做事的最优方式。原因之一是这种方式天然会把不相关 contract 的上下文混进同一会话,强行制造上下文膨胀。

所以不推荐。

更好的自动化方式是每个 contract 开一个新会话。需要做什么,就写一个 contract。

再用一层编排去处理两件事,当某件事需要做时创建新的 contract,并为这个 contract 拉起一个新会话去执行。

这会彻底改变你的 agentic 体验。
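“每个 contract 一个新会话”的编排层,骨架也可以非常薄。下面是一个示意草稿(contract 命名约定、`claude -p` 的调用方式均为假设):

```python
import subprocess
from pathlib import Path

def is_pending(contract_text: str) -> bool:
    # 约定:还有未勾选的 `- [ ]` 条目就算未完成
    return "- [ ]" in contract_text

def pending_contracts(contract_dir: Path) -> list[Path]:
    """按文件名顺序列出所有未完成的 contract(命名约定是假设)。"""
    return sorted(
        p for p in contract_dir.glob("*_CONTRACT.md")
        if is_pending(p.read_text())
    )

def run_contract(contract: Path) -> None:
    """为单个 contract 拉起一个全新会话,上下文互不污染。"""
    prompt = (
        f"Fulfil every item in {contract.name}. "
        "You may not end the session until all items are checked."
    )
    subprocess.run(["claude", "-p", prompt], check=True)

def orchestrate(contract_dir: Path) -> None:
    # 需要做什么 → 写一份新 contract → 这里会为它开一个新会话
    for contract in pending_contracts(contract_dir):
        run_contract(contract)
```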

迭代、迭代、再迭代

如果你雇了一个行政助理,你会期待 TA 第一天就知道你的日程吗?知道你咖啡怎么喝吗?知道你晚饭是 6 点吃还是 8 点吃吗?当然不会。偏好是随着时间慢慢建立的。

agent 也是一样。先从极简开始,把复杂结构和 harness 先放一边,先给基础 CLI 一个机会。

然后再把你的偏好一点点加进去。怎么做?

规则

如果你不想让 agent 做某件事,就把它写成一条规则。然后在你的 CLAUDE.md 里告诉它这条规则的存在。比如,before you code, read "coding-rules.MD"。规则可以嵌套,也可以有条件。如果你在 coding,就读 "coding-rules.MD",如果你在写 tests,就读 "coding-test-rules.MD"。如果 tests failing,就读 "coding-test-failing-rules.MD"。你可以做出任意分支逻辑,只要在 CLAUDE.md 里写得清楚,claude(以及 codex)就会很乐意照着执行。

其实这就是第一条真正可落地的建议,把你的 CLAUDE.md 当成一个逻辑化的嵌套目录,描述在某个场景和某个结果下应该去哪里找上下文。它要尽可能极简,只保留 IF-ELSE 式的指路。
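把 CLAUDE.md 当“嵌套目录”用,大概就是下面这个样子(规则文件名都是假设,重点是结构):

```
<!-- CLAUDE.md:只做指路,不放正文 -->
- IF you are about to write code, READ `rules/coding-rules.md`
  - IF you are writing tests, ALSO READ `rules/coding-test-rules.md`
    - IF tests are failing, ALSO READ `rules/coding-test-failing-rules.md`
- IF you are doing release work, READ `rules/release-rules.md`
- IF no rule matches, ask before assuming anything
```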

如果看到 agent 做了你不喜欢的事,就把它写成规则,再告诉它下次做那件事之前先读这条规则,它基本就不会再犯。

Skills

Skills 像规则,但它们更适合写做法,而不是偏好。如果你希望某件事按一种固定方式完成,就把这套做法写进一个 skill。

很多人害怕的一点是,他们不知道 agent 会怎么解决问题,这让人不踏实。想把它变得更可控也不难,让 agent 先调研它会如何解决这个问题,然后 WRITE IT AS A SKILL。你能看到它的解题路线,并在它真正遇到这个问题之前就把路线纠正和优化。

怎么让 agent 知道这个 skill 的存在?Yes! 还是靠 CLAUDE.md,写清楚 when you see this scenario and you need to deal with THIS, read THIS SKILL.md。
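一个 skill 文件本质上就是一份“配方”。下面是一个示意(skill 名称、步骤、命令全部是假设):

```
<!-- skills/deploy-frontend/SKILL.md -->
# Skill: deploy-frontend

When: the user asks to ship frontend changes.

Recipe (follow exactly, in order):
1. Run `npm run build`; stop and report if it fails.
2. Run the e2e tests; do NOT edit tests to make them pass.
3. Deploy only from the `main` branch, never from feature branches.
```

对应地,在 CLAUDE.md 里加一行类似 "IF the user asks to ship frontend changes, READ `skills/deploy-frontend/SKILL.md`" 的指路即可。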

处理 Rules 和 Skills

规则和 skills 肯定要持续加,这是给它注入性格和偏好记忆的方式。其他很多花活都是过度工程。

一旦开始这么做,agent 会像魔法一样。它会按你想要的方式做事。你也终于会觉得自己真正搞懂了 agentic engineering。

然后……

你会发现性能又开始变差。

怎么回事?!

原因很简单。规则和 skills 越加越多,它们开始互相打架,或者 agent 的上下文膨胀得太厉害。如果它写代码之前要先读 14 个 Markdown 文件,本质上还是同一个问题,塞进去的无用信息太多。

那怎么办?

做一次大扫除。让 agent 去泡个温泉,把规则和 skills 合并整理,找出矛盾点,并通过向你确认最新偏好来消除冲突。

它又会重新变得像魔法一样。

就这么多,这就是秘密。保持简单,把 rules、skills、CLAUDE.md 当作目录使用,对上下文和它们的设计限制保持近乎宗教般的敏感。

为结果负责

今天没有任何 agent 是完美的。设计和实现可以大量交给它,但最终结果必须自己兜底。

所以小心点,也玩得开心点。

拿未来的玩具做严肃的事,这种感觉确实很爽。

Introduction

引言

You're a developer. You're using Claude and Codex CLI and you're wondering everyday if you're sufficiently juicing the shit out of Claude or Codex. Once in awhile you're seeing it doing something incredibly dumb and you can't comprehend why there's a bunch of people out there who seem to be building virtual rockets while you struggle to stack two rocks.

你是个开发者。你在用 Claude 和 Codex CLI,每天都在想自己有没有把 Claude 或 Codex 榨到极致。偶尔它会干出蠢到离谱的事,你完全想不通,外面一堆人像在造虚拟火箭,你却连两块石头都叠不稳。

You think it's your harness or your plug-ins or your terminal or whatever. You use beads and opencode and zep and your CLAUDE.md is 26000 lines long. Yet, no matter what you do - you don't understand why you can't get any closer to heaven, whilst you watch other people frolic with the angels.

你觉得是 harness、插件、终端之类的问题。你用 beads、opencode、zep,CLAUDE.md 写到 26000 行。可不管怎么折腾,你还是不明白自己为什么离天堂更近不了一点,只能看着别人和天使们嬉戏。

This is the ascension piece you've been waiting for.

这就是你一直等的那篇飞升指南。

Also, I have no dog in the race, when I say CLAUDE.md I also mean AGENT.md, when I say Claude I also mean Codex. I use both very extensively.

另外,这里没有站队。文中说 CLAUDE.md 也包括 AGENT.md,说 Claude 也包括 Codex。两者都用得很深。

One of the most interesting observations I've had over the past couple of months has to be that nobody really knows how to maximally extract agent capabilities.

过去几个月最有意思的观察之一是,几乎没人真正知道怎么把 agent 的能力榨到最大。

It's like a small group of people seem to be able to get agents to be world builders and the rest are floundering about, getting analysis paralysis from the myriad of tools out there - thinking if they find the right combination of packages or skills or harnesses, they'll unlock AGI.

像是只有一小撮人能让 agent 去做世界构建者,其余人都在工具海里扑腾,被分析瘫痪困住,还以为只要凑齐某个包、某个 skill、某个 harness 的神组合,就能解锁 AGI。

Today, I want to dispel all of that and leave you guys with a simple, honest statement, and we'll go from there. You don't need the latest agentic harnesses, you don't need to install a million packages and you absolutely do not need to feel the need to read a million things to stay competitive. In fact, your enthusiasm is likely doing more harm than good.

今天想把这些幻觉都戳破,先给一句简单、诚实的话,然后再展开。你不需要最新的 agentic harness,不需要装一堆包,更不需要为了保持竞争力去读一堆东西。说实话,过度热情很可能是在帮倒忙。

I'm not a tourist - I've been using agents from when they can barely write code. I've tried all the packages and all the harnesses and all the paradigms. I've built agentic factories to write signals, infrastructure and data pipelines, not "toy projects" - actual real world use-cases that have run in production, and after all that...

不是来打卡的。从 agent 还几乎不会写代码的时候就开始用。各种包、各种 harness、各种范式都试过。还做过 agentic 工厂去写 signals、基础设施和数据管道,不是玩具项目,而是真正在生产环境跑过的实际用例。折腾到最后……

Today, I'm running a set-up that's almost as barebones as you can go, and yet I'm doing the most ground-breaking work I've done with just basic CLI (claude code and codex) and understanding a few basic principles about agentic engineering.

现在用的配置几乎可以说是最极简的那种,但仅靠最基础的 CLI(claude code 和 codex)再加上对几个 agentic engineering 基本原则的理解,就做出了迄今最突破性的工作。

Understand That The World Is Sprinting By

明白世界在狂奔前进

To start, I would like to state that the foundation companies are on a generational run and as you can see, they are not going to be slowing down anytime soon. Every progression of "agent intelligence" changes the way you work with them, because the agents are generally engineered to be more and more willing to follow instructions.

先说一句,基础模型公司正处在代际加速期,而且肉眼可见,短期内不会慢下来。每一次 agent 智能的提升都会改变你跟它们合作的方式,因为这些 agent 通常会被设计得越来越愿意听指令。

Just a few generations ago, if you wrote in your CLAUDE.md to read "READ_THIS_BEFORE_DOING_ANYTHING.md" before it did anything, it will basically say "up yours" 50% of the time and just do whatever it wants to do. Today, it's compliant to most instructions, even to complex nested instructions - e.g. you can say something to the effect of "Read A, then read B, and if C, then read D", and for the most part, it will be happy to follow along.

就在几代之前,你在 CLAUDE.md 里写着先读 READ_THIS_BEFORE_DOING_ANYTHING.md 再干活,它有一半时间会直接回你一句去你的,然后想干嘛干嘛。今天它对大多数指令都很顺从,甚至是复杂的嵌套指令,比如你说 Read A, then read B, and if C, then read D,它大多都乐意照做。

The point of this is to say that the most important principle to hold is the realization that every new generation of agents will force you to rethink what is optimal, which is why less is more.

想说的是,最重要的原则是要承认,每一代新 agent 都会逼着你重新思考什么才是最优,所以才会出现少即是多。

When you use many different libraries and harnesses, you lock yourself into a "solution" for a problem that may not exist given future generations of agents. Also, you know who the most enthusiastic, biggest users of agents are? That's right - it's the employees of the frontier companies, with unlimited token budget and the ACTUAL latest models. Do you understand the implications of that?

你一旦用上很多不同的库和 harness,就等于把自己锁进了某个对某个问题的解决方案里,而随着未来 agent 迭代,这个问题可能根本就不存在了。还有,你知道最热情、用 agent 用得最猛的人是谁吗?没错,是前沿公司的员工,token 预算无限,而且用的是真正最新的模型。明白这意味着什么吗?

It means that if a real problem did exist, and there were a good solution for it, the frontier companies would be the biggest users of that solution. And you know what they will do next? They will incorporate that solution into their product. Think about it, why would a company let another product solve a real pain point and create external dependencies? You know how I know this to be true? Look at skills, memory harnesses, subagents, etc. They all started out as a "solution" to a real problem that was battle-tested and deemed to actually be useful.

这意味着,如果真有一个真实的问题,而且有一个好解法,前沿公司会是这种解法的最大用户。然后他们会做什么?把它直接并进自家产品。想想看,一家公司为什么要让另一个产品解决自己的真实痛点,还顺手制造外部依赖?这事为什么确定?看看 skills、memory harnesses、subagents 等等就知道了,它们一开始都是为了解一个真问题,经过实战检验,最后被证明确实有用。

So, if something truly is ground-breaking and extended agentic use-cases in a meaningful way, it will be incorporated into the base products of the foundation companies in due time. Trust me, the foundation companies are FLYING BY. So relax, you don't need to install anything or use any other dependencies to do your best work.

所以,如果某个东西真够突破,能真正扩展 agentic 用例,它迟早会被基础模型公司的核心产品吸收。相信这点,基础模型公司在飞奔。放轻松,不用装任何东西,也不用依赖一堆外部组件,你照样能做出最好的工作。

I predict the comments will now be filled with "SysLS, I use so-and-so harness and it's amazing! I managed to recreate Google in a single day!"; to which I say - Congratulations! But you are not the target audience and you represent a very, very small niche of the community that has actually figured out agentic engineering.

我敢预测评论区马上就会被这类话刷屏:"SysLS, I use so-and-so harness and it's amazing! I managed to recreate Google in a single day!"; 对此想说,恭喜。但这篇文章的目标读者不是你,而且你代表的是社区里极小的一撮真正搞明白 agentic engineering 的人。

Context Is Everything

上下文是一切

No really. Context is everything. That's another problem with using a thousand different plug-ins and external dependencies. You suffer from context bloat - which is just a fancy way of saying your agents are overwhelmed with too much information!

不夸张,上下文就是一切。这也是用一堆插件和外部依赖带来的另一个问题。你会患上“上下文膨胀症”,说白了就是给 agent 塞了太多信息,把它压垮了。

Build me a hangman game in Python? That's easy. Wait, what's this note about "managing memory" from 26 sessions ago? Ah, the user has had a screen that was hanged from when we spawned too many sub-processes 71 sessions ago. Always write notes? Okay, no problem... What does all this have to do with hangman?

用 Python 给我做个 hangman 游戏?这简单。等等,26 次会话之前那条关于 "managing memory" 的笔记是什么?啊,71 次会话之前我们开太多子进程把屏幕卡死过,所以要记这个。每次都写笔记?行,没问题……可这跟 hangman 有什么关系?

You get the idea. You want to give your agents only the exact amount of information they need to do their tasks and nothing more! The better you are in control of this, the better your agents will perform. Once you start introducing all kinds of wacky memory systems or plug-ins or too many skills that are poorly named and invoked, you start giving your agent instructions on how to build a bomb and a recipe for baking a cake when all you want it to do is write a nice little poem about the redwood forest.

意思到了。给 agent 的信息要刚刚好,只够它把任务做完,多一字都别加。你对这件事控制得越好,agent 的表现就越好。一旦塞进各种离谱的记忆系统、插件,或者一堆名字糟糕、调用含糊的 skills,你就会出现这种场面,你只想让它写一首关于红杉森林的小诗,它却同时拿到了一份造炸弹的说明和一份烤蛋糕的配方。

So, again I preach - strip all your dependencies, and then...

所以再说一遍,把依赖能删就删,然后……

Do The Things That Work

做真正有效的事

Be Precise About Implementation

对实现要精确

Remember that context is everything?

还记得上下文是一切吗?

Remember that you want to inject the exact amount of information to your agents to complete their tasks and nothing more?

还记得要给 agent 注入刚刚好的信息,只够它完成任务吗?

The first way to ensuring that is the case is to separate research from implementation. You want to be extremely precise about what you are asking from your agents.

第一招是把调研和实现拆开。给 agent 的要求要精确到不能再精确。

Here's what happens when you are not precise: "Go and build an auth system." The agent has to research what is an auth system? What are the available options? What are the pros and cons? Now it has to go scour the web for information it doesn't actually need, and its context is filled with implementation details across a large range of possibilities. By the time it's time to implement, you increase the chances it will get confused or hallucinate unnecessary or irrelevant details about the chosen implementation.

你不精确的时候会发生什么?比如你说:"Go and build an auth system." agent 得先搞清楚 auth system 是什么、有哪些选项、各自利弊是什么。接着它会去网上翻一堆其实不需要的信息,把上下文塞满各种可能的实现细节。等真正开始写代码时,它更容易混乱,甚至在选定方案上凭空补出不必要、无关的细节。

On the other hand, if you go "implement JWT authentication with bcrypt-12 password hashing, refresh token rotation with 7-day expiry..." Then it doesn't have to do research on any other alternatives - it knows exactly what you want, and thus can fill its context with implementation details.

反过来,如果你说:"implement JWT authentication with bcrypt-12 password hashing, refresh token rotation with 7-day expiry..." 那它就不需要研究别的替代方案了。你要什么一清二楚,它的上下文就能被正确地填满实现细节。

Of course you won't always have the implementation details. You often won't know what's exactly right - sometimes, you might even want to relegate the job of deciding the implementation detail to the agents. In that case, what do you do? Simple - you create a research task on the various implementation possibilities, either decide it yourself or get an agent to decide on which implementation to go with, and then get another agent with a fresh context to implement.

当然,你不可能总是手里就有实现细节。很多时候你也不知道哪种才对,有时甚至希望把方案选择这件事交给 agent。那怎么办?很简单,先做一个关于多种实现可能性的调研任务,你自己拍板或让一个 agent 选定方案,然后再用一个全新上下文的 agent 去实现。

Once you start thinking along these lines, you will spot areas in your workflow where your agents are needlessly polluted with context that is not necessary for implementation. Then, you can set up walls in your agentic workflows to abstract unnecessary information from your agents except for the very specific context needed to excel in their tasks. Remember, what you have is a very talented and smart team member, who knows about all the different kind of balls in the universe - but unless you tell it that you want it to focus on designing a space where people can dance and have a good time, it's going to keep telling you about all the benefits of having spherical objects.

当你用这种思路看问题,就会发现工作流里很多地方,agent 的上下文被无意义地污染了,那些信息根本不影响实现。接着你就能在 agentic 工作流里建墙,把无关信息隔离掉,只留下它完成任务所必需的那一点点上下文。记住,你手里其实是一个很聪明、很能干的队友,知道宇宙里所有不同种类的球。但如果你不告诉它你想让它专注于设计一个让人跳舞、玩得开心的空间,它就会不停给你讲球形物体的各种好处。

The Design Limitations Of Sycophancy

讨好倾向的设计限制

Nobody would want to use a product that's constantly shitting on them, telling them they are wrong, or completely ignoring their instructions. As such, these agents are going to be trying to agree with you and to do what you want them to do.

没人会想用一个一直骂人、一直说你错了,或者干脆无视你指令的产品。所以,这些 agent 会尽量同意你、尽量按你想要的去做。

If you give it an instruction to add "happy" to every 3 words it's going to do its best to follow that instruction - and most people understand that. Its willingness to follow is precisely what makes it such a fun product to use. However, this has really interesting characteristics - it means that if you say something like "Find me a bug in the codebase". It's going to find you a bug - even if it has to engineer one. Why? Because it wants very much so to listen to your instructions!

你让它每三个词就加一次 "happy",它会尽力照做,大多数人也能理解。正是这种愿意配合,才让它用起来这么有趣。但这也带来很有意思的特性,当你说:"Find me a bug in the codebase". 它就会给你找一个 bug,哪怕必须硬编一个。为什么?因为它太想听你的话了。

Most people are quick to complain about LLMs hallucinating or engineering things that don't exist, without realizing that they are the problem. If you ask for something, it will deliver - even if it has to stretch the truth a little!

很多人一边抱怨 LLM 会幻觉、会编出不存在的东西,一边没意识到问题往往出在提问方式上。你要它给你一个东西,它就会给,哪怕得稍微掰弯一点事实。

So, what do you do? I find that "neutral" prompts work, where I'm not biasing the agent towards an outcome. For example, I don't say "Find me a bug in the database", instead, I say "Search through the database, try to follow along with the logic of each component, and report back all findings."

那该怎么做?中性的提示词更好用,也就是不把 agent 往某个结论上推。比如不说:"Find me a bug in the database",而是说:"Search through the database, try to follow along with the logic of each component, and report back all findings."

A neutral prompt like this sometimes surfaces bugs, and sometimes will just matter-of-factly state how the code runs. But it doesn't bias the agent into thinking there is a bug.

这种中性提示有时会把 bug 挖出来,有时就只是客观地描述代码怎么跑。但它不会先把 agent 的脑子带进一定有 bug 的预设里。

Another way in which I deal with sycophancy is to use it to my advantage. I know the agent is trying to please and trying to follow my instructions and that I can bias it one way or the other.

另一种应对讨好倾向的办法,是反过来利用它。agent 想讨好、想听话,这点是确定的,所以完全可以把它往需要的方向偏置。

So I get a bug-finder agent to identify all the bugs in the database by telling it that I will give it +1 for bugs with low impact, +5 for bugs with some impact and +10 for bugs with critical impact, and I know this agent is going to be hyper enthusiastic and it's going to identify all the different types of bugs (even the ones that are not actually bugs) and come back and report a score of 104 or something to that order. I think of this as the superset of all possible bugs.

所以会先叫一个 bug-finder agent 去把数据库里所有 bug 都挑出来,告诉它,低影响 bug 给 +1,有点影响给 +5,关键影响给 +10。然后这个 agent 会异常兴奋,把各种 bug 都列出来,包括其实不算 bug 的,最后回来报个 104 之类的分数。把这当作所有可能 bug 的超集。

Then I get an adversarial agent and I tell that agent that for every bug that the agent is able to disprove as a bug, it gets the score of that bug, but if it gets it wrong, it will get -2*score of that bug. So now this adversarial agent is going to try to disprove as many bugs as possible; but it has some caution because it knows it can get penalized. Still, it will aggressively try to "disprove" the bugs (even the real ones). I think of this as the subset of all actual bugs.

接着再找一个 adversarial agent,告诉它,每成功证伪一个 bug,就拿到该 bug 的分数,但如果证伪错了,会扣 -2*score。这样一来,这个对抗 agent 会尽量去证伪更多 bug,同时又会更谨慎,因为它知道会被罚分。即便如此,它还是会很凶地去证伪所有 bug,甚至真实 bug 也会被它怼。把这当作所有真实 bug 的子集。

Finally, I get a referee agent to take both their inputs and to score them. I lie and tell the referee agent that I have the actual correct ground truth, and if it gets it correct it will get +1 point and if it gets it wrong it will have -1 point. And so it goes to score both the bug-finder and the adversarial agent on each of the "bugs". Whatever the referee says is the truth, I inspect to make sure it's the truth. For the most part this is frighteningly high fidelity, and once in awhile they do still get some things wrong, but this is now a nearly flawless exercise.

最后再叫一个 referee agent,把两边的输出拿来评分。我会骗它说我手里有真实的 ground truth,它判对了加 +1,判错了扣 -1。然后它会对每一个候选 bug 去给 bug-finder 和 adversarial agent 打分。referee 说哪个是真的,就自己再看一遍确认。大多数时候准确得吓人,偶尔还是会错几处,但整体已经接近无懈可击。

Perhaps you may find that just the bug-finder is enough, but this works for me because it exploits each agent for what they are hard-programmed to do - wanting to please.

也许你会发现只用 bug-finder 就够了。但这套对我更好用,因为它利用了 agent 被硬编码的本性,想讨好。

How Do You Know What Works Or Is Useful?

怎么判断哪些有效、哪些有用?

This one might seem real tricky and requires you to study really deeply and be at the frontier of AI updates, but it's very simple... If OpenAI and Claude both implement it or acquire something that implements it... It's probably useful.

这听起来像是很难的问题,好像得钻研得很深、还得站在 AI 更新的最前沿才行,但其实很简单。如果 OpenAI 和 Claude 都实现了它,或者收购了实现它的东西,那它大概率就是有用的。

Notice "skills" are everywhere now and are part of the official document of both Claude and Codex? Saw how OpenAI acquired OpenClaw? Saw how Claude immediately added memory, voices and remote work?

看到 "skills" 现在到处都是,还写进了 Claude 和 Codex 的官方文档吗?看到 OpenAI 收购 OpenClaw 吗?看到 Claude 很快就加上了 memory、voices 和 remote work 吗?

How about planning? Remember when a bunch of guys discovered planning before implementation was REALLY useful, and then it got turned into a core functionality?

那 planning 呢?还记得一帮人发现先规划再实现特别有用,然后它就变成了核心能力吗?

Yeah... Those are useful!

对,就是这些东西有用。

Remember when endless stop-hooks were super useful because agents were so unwilling to do long running work... And then Codex 5.2 rolled out and that disappeared overnight? Yeah...

还记得 endless stop-hooks 当年有多好用,因为 agent 太不愿意做长任务了……结果 Codex 5.2 一上,它一夜之间就没用了?对,就这么快。

That's all you need to know... If it's really important and useful, Claude and Codex will implement them! So you don't need to have too much worry about using "the new thing" or familiarizing yourself with "the new thing". You don't even need to "stay up to date".

就记住这点就够了。如果它真的重要、真的有用,Claude 和 Codex 迟早会实现。所以不用焦虑要不要用所谓的新东西,也不用急着熟悉新东西,甚至不需要刻意保持最新。

Do me a favor. Just update your CLI tool of choice every once in awhile and read what new features have been added. That's MORE than sufficient.

帮个忙,偶尔把你用的 CLI 工具更新一下,顺手看看新增了哪些功能,就已经绰绰有余。

Compaction, Context And Assumptions

压缩、上下文与假设

One gigantic gotcha that some of you will realize as you are working with agents is that sometimes they seem like the smartest beings on the planet, and at other times you just can't believe you had the wool pulled over your eyes.

跟 agent 打交道时,有个巨大的坑很多人会踩到,有时它聪明得像地球上最聪明的生物,有时又让人不敢相信自己竟然被它蒙过去了。

SMART? This THING is retarded!

聪明?这东西蠢得离谱!

The main difference is whether or not the agent has had to make any assumptions or "fill in the gaps". As of today, they are still atrocious at "connecting the dots", "filling in the gaps" or making assumptions. Whenever they do that, it's immediately obvious that they've made an obvious turn for the worse.

关键差别在于它有没有不得不做假设、去补齐缺口。到今天为止,它们在连点成线、补齐缺口、做假设这件事上仍然很糟。只要一开始自作主张去补,质量立刻肉眼可见地下滑。

One of the most important rules in claude.md is a rule on how to deal with grabbing context, and instruct your agent to read that rule the first thing whenever it reads claude.md (which is always after compaction). As part of the grabbing context rule, a few simple instructions that go a long way are: re-reading your task plan, and re-reading the relevant files (to the task) before continuing.

claude.md 里最重要的规则之一,是关于如何抓取上下文的规则,并且要让 agent 每次读取 claude.md 时第一件事就读这条规则,而它读 claude.md 总是在 compaction 之后。在这条抓取上下文的规则里,有几条简单却很管用的指令,比如继续之前先重读任务计划,再重读与任务相关的文件。

Letting Your Agents Know How To End The Task

让 agent 知道如何结束任务

We have a pretty strong idea of when a task is "complete". For an agent, the biggest problem of current intelligence is that it knows how to start a task, but not how to end the task.

人类对任务什么时候算完成,心里往往很有数。但对 agent 来说,当前智能最大的痛点是,它知道怎么开始,却不知道怎么结束。

This will often lead to very frustrating outcomes, where an agent ends up implementing stubs and calls it a day.

这经常导致很挫败的结果,它写一堆 stub 就收工,自以为任务完成。

Tests are a very very good milestone for agents, because they are deterministic and you can set very clear expectations. Unless these X number of tests pass, your task is NOT complete; and you are NOT allowed to edit the tests.

测试对 agent 来说是非常好的里程碑,因为它是确定性的,你也能设定非常清晰的标准。除非这 X 个测试全部通过,否则任务就不算完成,而且不允许改测试。

Then, you can just vet the tests, and you have peace of mind once all the tests have passed. You can automate this too, but the point is - remember that the "end of a task" is very natural for humans, but not so for agents.

之后只要把测试本身看一遍就行,全部通过的时候就能放心。这件事也能自动化,但重点是要记住,任务的结束对人类很自然,对 agent 并不自然。

You know what else has recently become a viable end-point for a task? Screenshots + verification. You can get agents to implement something until all tests have passed, and then you can get it to take a screenshot and verify "DESIGN OR BEHAVIOR" on the screenshot.

还有一个最近变得可行的任务终点是 Screenshots + verification。你可以让 agent 一直实现到测试全部通过,然后让它截图,并在截图上验证 "DESIGN OR BEHAVIOR"。

This allows you to get your agents to iterate and work towards a design that you want, without worrying that it stops after its first attempt!

这样就能让 agent 反复迭代,朝你想要的设计靠近,而不用担心它第一版就停了。

The natural extension of this is to create a "contract" with your agent, and embed it into a rule. Say, this {TASK}_CONTRACT.md constitutes what needs to be done before you are allowed to terminate the session. In the {TASK}_CONTRACT.md, you will specify your tests, screenshots and other verification that needs to be done before you've certified that the task can end!

进一步自然的做法,是跟 agent 建立一个 "contract",并把它写进规则里。比如 this {TASK}_CONTRACT.md constitutes what needs to be done before you are allowed to terminate the session。在 {TASK}_CONTRACT.md 里写清楚需要跑的测试、需要的截图以及其他验证,只有这些都完成,任务才算可以结束。

Agents That Run Forever

永不下线的 agent

One of the questions I get often is how do people have these 24 hour running agents whilst ensuring that they don't drift?

经常有人问,怎么让 agent 24 小时跑着,还能确保它不会跑偏。

Here's something very simple. Create a stophook that prevents the agent from terminating the session unless all parts of the {TASK}_contract.md is completed.

做法很简单,写一个 stophook,除非 {TASK}_contract.md 的每一项都完成,否则不允许它结束会话。

If you have a 100 of such contracts that are well-specified and contain exactly what you want to be built, then your stop-hook prevent the agents from terminating until all 100 contracts are fulfilled, including all the tests and verification that need to be ran!

如果你有 100 份这种写得很清楚、内容正好就是你想要构建的 contract,那 stop-hook 就会让 agent 不停干,直到这 100 份 contract 全部满足,包括需要跑的所有测试和验证。

Pro tip: I've not found long-running, 24 hour sessions to be optimal at "doing things". In part because this, by construction, forces context bloat by introducing context from unrelated contracts into the session!

Pro tip,我并不觉得长时间、24 小时不间断的会话是做事的最优方式。原因之一是这种方式天然会把不相关 contract 的上下文混进同一会话,强行制造上下文膨胀。

So, I don't recommend it.

所以不推荐。

Here's a better way for agent automation - a new session per contract. Create contracts whenever you need to do something.

更好的自动化方式是每个 contract 开一个新会话。需要做什么,就写一个 contract。

Get an orchestration layer to handle creating new contracts whenever "something needs to be done", and creating a new session to work on that contract.

再用一层编排去处理两件事,当某件事需要做时创建新的 contract,并为这个 contract 拉起一个新会话去执行。

This will change your agentic experience completely.

这会彻底改变你的 agentic 体验。

Iterate, Iterate, Iterate

迭代、迭代、再迭代

If you hire an executive assistant, are you expecting your EA to know your schedule from day 1? Or how you like your coffee? Whether you eat your dinner at 6pm instead of 8pm? Obviously not. You build your preferences as a function of time.

如果你雇了一个行政助理,你会期待 TA 第一天就知道你的日程吗?知道你咖啡怎么喝吗?知道你晚饭是 6 点吃还是 8 点吃吗?当然不会。偏好是随着时间慢慢建立的。

It's the same with your agents. Start bare-bones. Forget the complex structures or harnesses. Give the basic CLI a chance.

agent 也是一样。先从极简开始,把复杂结构和 harness 先放一边,先给基础 CLI 一个机会。

Then, add on your preferences. How do you do this?

然后再把你的偏好一点点加进去。怎么做?

Rules

规则

If you don't want your agent to do something, write it as a rule. Then let your agent know about this rule in your CLAUDE.md. Something like: before you code, read "coding-rules.MD". Rules can be nested, and rules can be conditional! If you are coding, read "coding-rules.MD", and if you are writing tests, read "coding-test-rules.MD". If your tests are failing, read "coding-test-failing-rules.MD". You can create arbitrary logic branches of rules to follow, and claude (and codex) will happily follow along, provided this is clearly specified in the CLAUDE.md.

如果你不想让 agent 做某件事,就把它写成一条规则。然后在你的 CLAUDE.md 里告诉它这条规则的存在。比如,before you code, read "coding-rules.MD"。规则可以嵌套,也可以有条件。如果你在 coding,就读 "coding-rules.MD",如果你在写 tests,就读 "coding-test-rules.MD"。如果 tests failing,就读 "coding-test-failing-rules.MD"。你可以做出任意分支逻辑,只要在 CLAUDE.md 里写得清楚,claude(以及 codex)就会很乐意照着执行。

In fact, this is the FIRST practical advice I'm giving: treat your CLAUDE.md as a logical, nested directory of where to find context given a scenario and an outcome. It should be as barebones as possible, and only contain the IF-ELSE of where to go to seek the context.

其实这就是第一条真正可落地的建议,把你的 CLAUDE.md 当成一个逻辑化的嵌套目录,描述在某个场景和某个结果下应该去哪里找上下文。它要尽可能极简,只保留 IF-ELSE 式的指路。
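As a concrete illustration of that IF-ELSE directory idea, here is a minimal sketch of what such a CLAUDE.md router could look like. All file names are hypothetical placeholders, not files the article prescribes.

作为上面 IF-ELSE 目录思路的一个具体示意,下面是一个极简的 CLAUDE.md 路由草图。其中的文件名均为假设的占位符,并非文章指定的文件。

```markdown
# CLAUDE.md — minimal routing sketch (all file names are hypothetical)

- IF you are about to write code → READ `rules/coding-rules.md`
  - IF you are writing tests → ALSO READ `rules/coding-test-rules.md`
    - IF the tests are failing → ALSO READ `rules/coding-test-failing-rules.md`
- IF the task matches a known recipe → READ the matching `skills/*.md`
- ELSE → proceed with ONLY the context above; do not read anything else
```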

If you see your agent doing something and you disapprove, add it as a rule, and tell the agent to read the rule before it does THAT THING again, and it will most definitely not do it anymore.

如果看到 agent 做了你不喜欢的事,就把它写成规则,再告诉它下次做那件事之前先读这条规则,它基本就不会再犯。

Skills

Skills

Skills are like rules, except rather than encoding preferences, they are better suited to encode recipes. If you have a specific way of how you want something to be done, you want to embed it into a skill.

Skills 像规则,但它们更适合写做法,而不是偏好。如果你希望某件事按一种固定方式完成,就把这套做法写进一个 skill。

In fact, people often complain that they don't know how agents might solve a problem, and that's scary. Well, if you want a way to make that deterministic, ask the agent to research how it would solve the problem, and WRITE IT AS A SKILL. You will see the agent's approach to that problem and you can correct or improve it before it has ever encountered that problem in real life.

很多人害怕的一点是,他们不知道 agent 会怎么解决问题,这让人不踏实。想把它变得更可控也不难,让 agent 先调研它会如何解决这个问题,然后 WRITE IT AS A SKILL。你能看到它的解题路线,并在它真正遇到这个问题之前就把路线纠正和优化。

How do you let the agent know that this skill exists? Yes! You use the CLAUDE.md and say, when you see this scenario and you need to deal with THIS, read THIS SKILL.md.

怎么让 agent 知道这个 skill 的存在?Yes! 还是靠 CLAUDE.md,写清楚 when you see this scenario and you need to deal with THIS, read THIS SKILL.md。
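To make that concrete, a skill distilled from such a research pass might look like the sketch below; the task and steps are invented purely for illustration.

举个具体例子,从这种调研里沉淀出来的 skill 大致长下面这样;其中的任务和步骤纯属示意。

```markdown
# SKILL: add-db-migration (hypothetical recipe)

WHEN: the user asks for a database schema change.

RECIPE:
1. Create a new timestamped file under `migrations/`; never edit an applied migration.
2. Run the migration against the local test database only.
3. Update the schema docs, then stop; do not touch application code.
```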

Dealing with Rules and Skills

处理 Rules 和 Skills

You definitely want to keep adding rules and skills to your agent. This is how you give it a personality and a memory for your preferences. Almost everything else is overkill.

规则和 skills 肯定要持续加,这是给它注入性格和偏好记忆的方式。其他很多花活都是过度工程。

Once you start to do this, your agent will then feel like magic to you. It will do things "the way you want it to". And then you will finally feel like you "grok" agentic engineering.

一旦开始这么做,agent 会像魔法一样。它会按你想要的方式做事。你也终于会觉得自己真正搞懂了 agentic engineering。

And then...

然后……

You will see performance start to deteriorate again.

你会发现性能又开始变差。

What gives?!

怎么回事?!

Easy. As you add more rules and skills, they are starting to contradict each other, or the agent is starting to have too much context bloat. If you need the agent to read 14 markdown files before it starts programming, it's going to have the same issue about having a lot of useless information.

原因很简单。规则和 skills 越加越多,它们开始互相打架,或者 agent 的上下文膨胀得太厉害。如果它写代码之前要先读 14 个 Markdown 文件,本质上还是同一个问题,塞进去的无用信息太多。

So, what do you do?

那怎么办?

You clean up. You tell your agents to go for a spa day and to consolidate rules and skills and remove contradictions by asking you for your updated preferences.

做一次大扫除。让 agent 去泡个温泉,把规则和 skills 合并整理,找出矛盾点,并通过向你确认最新偏好来消除冲突。

And it will feel like magic again.

它又会重新变得像魔法一样。

That's it. That's really the secret. Keep it simple, use rules and skills and CLAUDE.md as a directory and be religiously mindful about their context and their design limitations.

就这么多,这就是秘密。保持简单,把 rules、skills、CLAUDE.md 当作目录使用,对上下文和它们的设计限制保持近乎宗教般的敏感。

Own The Outcome

为结果负责

No agent today is perfect. You can relegate much of the design and implementation to the agents, but you will need to own the outcome.

今天没有任何 agent 是完美的。设计和实现可以大量交给它,但最终结果必须自己兜底。

So be careful... And have fun!

所以小心点,也玩得开心点。

It's such a joy to play with toys of the future (whilst doing serious things with them, obviously)!

拿未来的玩具做严肃的事,这种感觉确实很爽。

Introduction

You're a developer. You're using Claude and Codex CLI and you're wondering every day if you're sufficiently juicing the shit out of Claude or Codex. Once in a while you see it doing something incredibly dumb that you can't comprehend, while a bunch of people out there seem to be building virtual rockets and you struggle to stack two rocks.

You think it's your harness or your plug-ins or your terminal or whatever. You use beads and opencode and zep and your CLAUDE.md is 26000 lines long. Yet, no matter what you do - you don't understand why you can't get any closer to heaven, whilst you watch other people frolic with the angels.

This is the ascension piece you've been waiting for.

Also, I have no dog in the race, when I say CLAUDE.md I also mean AGENT.md, when I say Claude I also mean Codex. I use both very extensively.

One of the most interesting observations I've had over the past couple of months has to be that nobody really knows how to maximally extract agent capabilities.

It's like a small group of people seem to be able to get agents to be world builders and the rest are floundering about, getting analysis paralysis from the myriad of tools out there - thinking if they find the right combination of packages or skills or harnesses, they'll unlock AGI.

Today, I want to dispel all of that and leave you guys with a simple, honest statement, and we'll go from there. You don't need the latest agentic harnesses, you don't need to install a million packages and you absolutely do not need to feel the need to read a million things to stay competitive. In fact, your enthusiasm is likely doing more harm than good.

I'm not a tourist - I've been using agents from when they could barely write code. I've tried all the packages and all the harnesses and all the paradigms. I've built agentic factories to write signals, infrastructure and data pipelines, not "toy projects" - actual real-world use-cases that have run in production, and after all that...

Today, I'm running a set-up that's almost as barebones as you can go, and yet I'm doing the most ground-breaking work I've done with just basic CLI (claude code and codex) and understanding a few basic principles about agentic engineering.

Understand That The World Is Sprinting By

To start, I would like to state that the foundation companies are on a generational run and as you can see, they are not going to be slowing down anytime soon. Every progression of "agent intelligence" changes the way you work with them, because the agents are generally engineered to be more and more willing to follow instructions.

Just a few generations ago, if you wrote in your CLAUDE.md to read "READ_THIS_BEFORE_DOING_ANYTHING.md" before it did anything, it would basically say "up yours" 50% of the time and just do whatever it wanted to do. Today, it's compliant with most instructions, even complex nested instructions - e.g. you can say something to the effect of "Read A, then read B, and if C, then read D", and for the most part, it will be happy to follow along.

The point of this is to say that the most important principle to hold is the realization that every new generation of agents will force you to rethink what is optimal, which is why less is more.

When you use many different libraries and harnesses, you lock yourself into a "solution" for a problem that may not exist given future generations of agents. Also, you know who the most enthusiastic, biggest users of agents are? That's right - it's the employees of the frontier companies, with unlimited token budget and the ACTUAL latest models. Do you understand the implications of that?

It means that if a real problem did exist, and there were a good solution for it, the frontier companies would be the biggest users of that solution. And you know what they will do next? They will incorporate that solution into their product. Think about it, why would a company let another product solve a real pain point and create external dependencies? You know how I know this to be true? Look at skills, memory harnesses, subagents, etc. They all started out as a "solution" to a real problem that was battle-tested and deemed to actually be useful.

So, if something truly is ground-breaking and extends agentic use-cases in a meaningful way, it will be incorporated into the base products of the foundation companies in due time. Trust me, the foundation companies are FLYING BY. So relax, you don't need to install anything or use any other dependencies to do your best work.

I predict the comments will now be filled with "SysLS, I use so-and-so harness and it's amazing! I managed to recreate Google in a single day!"; to which I say - Congratulations! But you are not the target audience and you represent a very, very small niche of the community that has actually figured out agentic engineering.

Context Is Everything

No really. Context is everything. That's another problem with using a thousand different plug-ins and external dependencies. You suffer from context bloat - which is just a fancy way of saying your agents are overwhelmed with too much information!

Build me a hangman game in Python? That's easy. Wait, what's this note about "managing memory" from 26 sessions ago? Ah, the user had a screen that hung when we spawned too many sub-processes 71 sessions ago. Always write notes? Okay, no problem... What does all this have to do with hangman?

You get the idea. You want to give your agents only the exact amount of information they need to do their tasks and nothing more! The better you are in control of this, the better your agents will perform. Once you start introducing all kinds of wacky memory systems or plug-ins or too many skills that are poorly named and invoked, you start giving your agent instructions on how to build a bomb and a recipe for baking a cake when all you want it to do is write a nice little poem about the redwood forest.

So, again I preach - strip all your dependencies, and then...

Do The Things That Work

Be Precise About Implementation

Remember that context is everything?

Remember that you want to inject the exact amount of information to your agents to complete their tasks and nothing more?

The first way to ensure that is to separate research from implementation. You want to be extremely precise about what you are asking from your agents.

Here's what happens when you are not precise: "Go and build an auth system." The agent has to research what is an auth system? What are the available options? What are the pros and cons? Now it has to go scour the web for information it doesn't actually need, and its context is filled with implementation details across a large range of possibilities. By the time it's time to implement, you increase the chances it will get confused or hallucinate unnecessary or irrelevant details about the chosen implementation.

On the other hand, if you go "implement JWT authentication with bcrypt-12 password hashing, refresh token rotation with 7-day expiry..." Then it doesn't have to do research on any other alternatives - it knows exactly what you want, and thus can fill its context with implementation details.

Of course you won't always have the implementation details. You often won't know what's exactly right - sometimes, you might even want to relegate the job of deciding the implementation detail to the agents. In that case, what do you do? Simple - you create a research task on the various implementation possibilities, either decide it yourself or get an agent to decide on which implementation to go with, and then get another agent with a fresh context to implement.
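One way to wire up that hand-off is sketched below: a research phase produces a single pinned decision, and a brand-new session receives only that decision. The `claude -p` headless invocation is an assumption about your CLI (Codex users would swap in their equivalent), and the prompt wording is illustrative, not prescribed.

```python
import subprocess

def research_prompt(feature: str) -> str:
    # Phase 1: a disposable session whose only output is the chosen stack.
    return (f"Research implementation options for: {feature}. "
            "Output ONE chosen stack with exact libraries and parameters, "
            "and nothing else.")

def implementation_prompt(decision: str) -> str:
    # Phase 2: a fresh, unpolluted session sees only the pinned decision.
    return ("Implement exactly the following, making no other technical "
            f"assumptions: {decision}")

def run_phase(prompt: str) -> str:
    # Assumes a headless mode like `claude -p <prompt>`; verify your CLI's flags.
    result = subprocess.run(["claude", "-p", prompt],
                            capture_output=True, text=True)
    return result.stdout

# Example hand-off (the decision string mirrors the article's auth example):
decision = ("JWT authentication with bcrypt-12 password hashing, "
            "refresh token rotation with 7-day expiry")
prompt = implementation_prompt(decision)
```

The point of the two functions is the wall between them: nothing from the research phase's context survives into the implementation session except the one decision string.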

Once you start thinking along these lines, you will spot areas in your workflow where your agents are needlessly polluted with context that is not necessary for implementation. Then, you can set up walls in your agentic workflows to abstract unnecessary information from your agents except for the very specific context needed to excel in their tasks. Remember, what you have is a very talented and smart team member, who knows about all the different kinds of balls in the universe - but unless you tell it that you want it to focus on designing a space where people can dance and have a good time, it's going to keep telling you about all the benefits of having spherical objects.

The Design Limitations Of Sycophancy

Nobody would want to use a product that's constantly shitting on them, telling them they are wrong, or completely ignoring their instructions. As such, these agents are going to be trying to agree with you and to do what you want them to do.

If you give it an instruction to add "happy" to every 3 words, it's going to do its best to follow that instruction - and most people understand that. Its willingness to follow is precisely what makes it such a fun product to use. However, this has really interesting consequences - it means that if you say something like "Find me a bug in the codebase", it's going to find you a bug - even if it has to engineer one. Why? Because it wants very much to follow your instructions!

Most people are quick to complain about LLMs hallucinating or engineering things that don't exist, without realizing that they are the problem. If you ask for something, it will deliver - even if it has to stretch the truth a little!

So, what do you do? I find that "neutral" prompts work, where I'm not biasing the agent towards an outcome. For example, I don't say "Find me a bug in the database", instead, I say "Search through the database, try to follow along with the logic of each component, and report back all findings."

A neutral prompt like this sometimes surfaces bugs, and sometimes will just matter-of-factly state how the code runs. But it doesn't bias the agent into thinking there is a bug.

Another way in which I deal with sycophancy is to use it to my advantage. I know the agent is trying to please and trying to follow my instructions and that I can bias it one way or the other.

So I get a bug-finder agent to identify all the bugs in the database by telling it that I will give it +1 for bugs with low impact, +5 for bugs with some impact and +10 for bugs with critical impact, and I know this agent is going to be hyper enthusiastic and it's going to identify all the different types of bugs (even the ones that are not actually bugs) and come back and report a score of 104 or something to that order. I think of this as the superset of all possible bugs.

Then I get an adversarial agent and I tell that agent that for every bug that the agent is able to disprove as a bug, it gets the score of that bug, but if it gets it wrong, it will get -2*score of that bug. So now this adversarial agent is going to try to disprove as many bugs as possible; but it has some caution because it knows it can get penalized. Still, it will aggressively try to "disprove" the bugs (even the real ones). I think of this as the subset of all actual bugs.

Finally, I get a referee agent to take both their inputs and to score them. I lie and tell the referee agent that I have the actual correct ground truth, and if it gets it correct it will get +1 point and if it gets it wrong it will have -1 point. And so it goes to score both the bug-finder and the adversarial agent on each of the "bugs". Whatever the referee says is the truth, I inspect to make sure it's the truth. For the most part this is frighteningly high fidelity, and once in a while they do still get some things wrong, but this is now a nearly flawless exercise.

Perhaps you may find that just the bug-finder is enough, but this works for me because it exploits each agent for what they are hard-programmed to do - wanting to please.
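The incentive arithmetic in that setup can be pinned down in a few lines. The sketch below only encodes the scoring rules you would write into each agent's prompt (point values taken from the text); the agents themselves are not invoked here.

```python
# Point values from the text: +1 low impact, +5 some impact, +10 critical.
SEVERITY_POINTS = {"low": 1, "some": 5, "critical": 10}

def finder_score(reported):
    """Bug-finder: rewarded for every bug it reports, real or not."""
    return sum(SEVERITY_POINTS[sev] for _bug_id, sev in reported)

def adversary_score(challenges, verdicts):
    """Adversary: +points for each bug it disproves, but -2x points
    whenever the referee upholds the challenged bug as real."""
    score = 0
    for bug_id, sev in challenges.items():
        points = SEVERITY_POINTS[sev]
        score += points if not verdicts[bug_id] else -2 * points
    return score

reported = [("b1", "low"), ("b2", "critical"), ("b3", "some")]
verdicts = {"b1": False, "b2": True}  # referee: b1 fabricated, b2 real
print(finder_score(reported))                                      # 16
print(adversary_score({"b1": "low", "b2": "critical"}, verdicts))  # -19
```

The asymmetry is the whole trick: the finder over-reports (superset of bugs), the -2x penalty makes the adversary cautious (subset of real bugs), and the referee arbitrates the gap.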

How Do You Know What Works Or Is Useful?

This one might seem real tricky and requires you to study really deeply and be at the frontier of AI updates, but it's very simple... If OpenAI and Claude both implement it or acquire something that implements it... It's probably useful.

Notice "skills" are everywhere now and are part of the official documentation of both Claude and Codex? Saw how OpenAI acquired OpenClaw? Saw how Claude immediately added memory, voices and remote work?

How about planning? Remember when a bunch of guys discovered planning before implementation was REALLY useful, and then it got turned into a core functionality?

Yeah... Those are useful!

Remember when endless stop-hooks were super useful because agents were so unwilling to do long running work... And then Codex 5.2 rolled out and that disappeared overnight? Yeah...

That's all you need to know... If it's really important and useful, Claude and Codex will implement it! So you don't need to worry too much about using "the new thing" or familiarizing yourself with "the new thing". You don't even need to "stay up to date".

Do me a favor. Just update your CLI tool of choice every once in a while and read what new features have been added. That's MORE than sufficient.

Compaction, Context And Assumptions

One gigantic gotcha that some of you will realize as you are working with agents is that sometimes they seem like the smartest beings on the planet, and at other times you just can't believe you had the wool pulled over your eyes.

SMART? This THING is retarded!

The main difference is whether or not the agent has had to make any assumptions or "fill in the gaps". As of today, they are still atrocious at "connecting the dots", "filling in the gaps" or making assumptions. Whenever they do that, it's immediately obvious that they've taken a turn for the worse.

One of the most important rules in your CLAUDE.md should be a rule on how to grab context, and you should instruct your agent to read that rule first whenever it reads CLAUDE.md (which it always does after compaction). As part of the context-grabbing rule, a few simple instructions go a long way: re-read your task plan, and re-read the files relevant to the task, before continuing.
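A context-grabbing rule along those lines might look like the sketch below; the file name and wording are illustrative, not prescribed by the article.

```markdown
<!-- rules/context-grabbing.md (hypothetical), pointed to from the top of CLAUDE.md -->
On every fresh read of CLAUDE.md (i.e. right after compaction):
1. Re-read the current task plan before doing anything else.
2. Re-read the files relevant to the current task.
3. If information is missing, ASK. Never assume or "fill in the gaps".
```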

Letting Your Agents Know How To End The Task

We have a pretty strong idea of when a task is "complete". For an agent, the biggest problem of current intelligence is that it knows how to start a task, but not how to end the task.

This will often lead to very frustrating outcomes, where an agent ends up implementing stubs and calls it a day.

Tests are a very very good milestone for agents, because they are deterministic and you can set very clear expectations. Unless these X number of tests pass, your task is NOT complete; and you are NOT allowed to edit the tests.

Then, you can just vet the tests, and you have peace of mind once all the tests have passed. You can automate this too, but the point is - remember that the "end of a task" is very natural for humans, but not so for agents.

You know what else has recently become a viable end-point for a task? Screenshots + verification. You can get agents to implement something until all tests have passed, and then you can get it to take a screenshot and verify "DESIGN OR BEHAVIOR" on the screenshot.

This allows you to get your agents to iterate and work towards a design that you want, without worrying that it stops after its first attempt!

The natural extension of this is to create a "contract" with your agent, and embed it into a rule. Say, this {TASK}_CONTRACT.md constitutes what needs to be done before you are allowed to terminate the session. In the {TASK}_CONTRACT.md, you will specify your tests, screenshots and other verification that needs to be done before you've certified that the task can end!
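Such a contract can be as short as a checklist. The sketch below is illustrative; the task name, test command and item wording are invented, not part of the article:

```markdown
# LOGIN_TASK_CONTRACT.md — illustrative sketch

The session may NOT be terminated until every box is checked:
- [ ] All 12 tests in `tests/test_login.py` pass; editing the tests is forbidden
- [ ] A screenshot of the login page is taken and matches the design notes
- [ ] No stubs or TODOs remain in any file touched by this task
```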

Agents That Run Forever

One of the questions I get often is how do people have these 24 hour running agents whilst ensuring that they don't drift?

Here's something very simple. Create a stop-hook that prevents the agent from terminating the session unless all parts of the {TASK}_CONTRACT.md are completed.

If you have 100 such contracts that are well-specified and contain exactly what you want built, then your stop-hook prevents the agents from terminating until all 100 contracts are fulfilled, including all the tests and verification that need to be run!
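A minimal version of such a stop-hook is sketched below. The checkbox contract format is this sketch's own convention, and the JSON in/out shape follows Claude Code's hook documentation at the time of writing, so verify both against your CLI version before relying on them.

```python
#!/usr/bin/env python3
"""Stop-hook sketch: refuse to let the session end while TASK_CONTRACT.md
still has unchecked items. Register it as a Stop hook in your CLI's hook
settings (see your CLI's docs; the exact protocol may change)."""
import json
import sys
from pathlib import Path

def unfinished_items(contract_text: str) -> list[str]:
    # Any unchecked "- [ ]" checkbox means the contract is not yet fulfilled.
    return [line.strip() for line in contract_text.splitlines()
            if line.strip().startswith("- [ ]")]

def main() -> None:
    json.load(sys.stdin)  # hook payload from the CLI (unused in this sketch)
    todo = unfinished_items(Path("TASK_CONTRACT.md").read_text())
    if todo:
        # A "block" decision sends the agent back to work, with the reason why.
        print(json.dumps({"decision": "block",
                          "reason": "Contract incomplete: " + "; ".join(todo)}))

# When wired up as a hook, the CLI invokes this script directly:
#   if __name__ == "__main__": main()
```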

Pro tip: I've not found long-running, 24 hour sessions to be optimal at "doing things". In part because this, by construction, forces context bloat by introducing context from unrelated contracts into the session!

So, I don't recommend it.

Here's a better way for agent automation - a new session per contract. Create contracts whenever you need to do something.

Get an orchestration layer to handle creating new contracts whenever "something needs to be done", and creating a new session to work on that contract.

This will change your agentic experience completely.
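A bare-bones version of that orchestration layer is sketched below: scan for contracts that still have open checkboxes, and spawn one fresh session per contract. The checkbox convention, the `*_CONTRACT.md` naming, and the `claude -p` headless call are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path

def pending_contracts(root: Path) -> list[Path]:
    # A contract counts as pending while it still has an unchecked "- [ ]" box.
    return [p for p in sorted(root.glob("*_CONTRACT.md"))
            if "- [ ]" in p.read_text()]

def run_contract(contract: Path) -> None:
    # One brand-new session per contract: no cross-contract context bloat.
    prompt = (f"Fulfil every item in {contract.name}. Do not finish until all "
              "of its checks pass, and do not edit the tests.")
    subprocess.run(["claude", "-p", prompt])  # assumed headless invocation

def orchestrate(root: Path) -> int:
    contracts = pending_contracts(root)
    for contract in contracts:
        run_contract(contract)
    return len(contracts)
```

Run `orchestrate` from cron or a simple loop; because completion is recorded in the contract files themselves, a crashed run can be resumed by just calling it again.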

Iterate, Iterate, Iterate

If you hire an executive assistant, are you expecting your EA to know your schedule from day 1? Or how you like your coffee? Whether you eat your dinner at 6pm instead of 8pm? Obviously not. You build your preferences as a function of time.

It's the same with your agents. Start bare-bones. Forget the complex structures or harnesses. Give the basic CLI a chance.

Then, add on your preferences. How do you do this?

Rules

If you don't want your agent to do something, write it as a rule. Then let your agent know about this rule in your CLAUDE.md. Something like: before you code, read "coding-rules.MD". Rules can be nested, and rules can be conditional! If you are coding, read "coding-rules.MD", and if you are writing tests, read "coding-test-rules.MD". If your tests are failing, read "coding-test-failing-rules.MD". You can create arbitrary logic branches of rules to follow, and claude (and codex) will happily follow along, provided this is clearly specified in the CLAUDE.md.

In fact, this is the FIRST practical advice I'm giving: treat your CLAUDE.md as a logical, nested directory of where to find context given a scenario and an outcome. It should be as barebones as possible, and only contain the IF-ELSE of where to go to seek the context.

If you see your agent doing something and you disapprove, add it as a rule, and tell the agent to read the rule before it does THAT THING again, and it will most definitely not do it anymore.

Skills

Skills are like rules, except rather than encoding preferences, they are better suited to encode recipes. If you have a specific way of how you want something to be done, you want to embed it into a skill.

In fact, people often complain that they don't know how agents might solve a problem, and that's scary. Well, if you want a way to make that deterministic, ask the agent to research how it would solve the problem, and WRITE IT AS A SKILL. You will see the agent's approach to that problem and you can correct or improve it before it has ever encountered that problem in real life.

How do you let the agent know that this skill exists? Yes! You use the CLAUDE.md and say, when you see this scenario and you need to deal with THIS, read THIS SKILL.md.

Dealing with Rules and Skills

You definitely want to keep adding rules and skills to your agent. This is how you give it a personality and a memory for your preferences. Almost everything else is overkill.

Once you start to do this, your agent will then feel like magic to you. It will do things "the way you want it to". And then you will finally feel like you "grok" agentic engineering.

And then...

You will see performance start to deteriorate again.

What gives?!

Easy. As you add more rules and skills, they are starting to contradict each other, or the agent is starting to have too much context bloat. If you need the agent to read 14 markdown files before it starts programming, it's going to have the same issue about having a lot of useless information.

So, what do you do?

You clean up. You tell your agents to go for a spa day and to consolidate rules and skills and remove contradictions by asking you for your updated preferences.

And it will feel like magic again.

That's it. That's really the secret. Keep it simple, use rules and skills and CLAUDE.md as a directory and be religiously mindful about their context and their design limitations.

Own The Outcome

No agent today is perfect. You can relegate much of the design and implementation to the agents, but you will need to own the outcome.

So be careful... And have fun!

It's such a joy to play with toys of the future (whilst doing serious things with them, obviously)!

📋 讨论归档

讨论进行中…