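🧠 阿头学 · 💬 Discussion topics · 💰 Investing

LLM agents should use a CLI instead of function calling

Drop complex structured function calling and let the LLM agent work through a single Unix-style CLI tool. This fits the model's training distribution better, and the two-layer architecture and heuristic design described below can sharply reduce error rates and context consumption.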
2026-03-13

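Key points

The CLI is the LLM's native language, not a compromise. LLM training data is full of shell commands and pipe patterns, so writing `cat log | grep ERROR | wc -l` comes more naturally to a model than choosing among 15 functions with different schemas. A single-tool design moves the cognitive load from "which API do I pick" to "how do I compose commands", and the latter is a skill the model already has.

Error messages must include a next step, not just the failure. Traditional CLI errors are written for people who can Google, and an agent cannot. Every error should point to the right tool (for example "binary file, use see photo.png instead"). That turns 10 blind attempts into a 1-2 step correction and, in the post's example, saves roughly 50 seconds of inference (10 calls at about 5 seconds each).

Two-layer architecture: the execution layer stays raw, the presentation layer is built for the LLM. Output flowing through the underlying Unix pipes must not be truncated or decorated with metadata, otherwise the pipe chain breaks. All processing happens just before the result is returned to the LLM: binary interception, truncation of long text with a pointer to the full file, mandatory stderr attachment, and an appended exit code and duration. This keeps commands correct while showing the LLM the information it needs.

Progressive help lets the agent discover capabilities on its own instead of receiving all the documentation up front. Rather than stuffing 3,000 words of tool documentation into the system prompt, let the agent drill down on demand: a call with no arguments returns the list of subcommands, and a call with missing arguments returns the parameter format. This saves context and lets the agent learn the tools while doing real tasks, forming a long-term learning loop.

A consistent output format makes the agent smarter over time. Every result ends with metadata such as `[exit:0 | 12ms]`. The agent internalizes the pattern (exit:0 = success, exit:127 = command not found, 12ms = cheap, 45s = expensive) and gradually tunes its calling strategy. An inconsistent format makes every call feel like the first.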

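How this relates to us

What it means for ATou: If you are building an internal agent system or AI toolchain, this design is directly usable. You don't need custom JSON schemas. You wrap existing CLI tools behind a `run(command=...)` interface and add a binary guard and navigational errors. A good next step is to review whether your existing tools' help text and error messages are optimized for agents.

What it means for Neta: This is a deeper insight that interface design determines agent capability. Reliability does not come only from a smarter model: the clearer the tool interface and the more timely the feedback, the more reliable the agent. It bears directly on the design of standardized agent frameworks, especially tool discovery, error handling, and output format.

What it means for Uota: The approach transfers to any scenario where an LLM interacts with external systems: browser automation, code execution, RAG retrieval, workflow orchestration. The core principle is "don't invent a new protocol; reuse mature protocols the model has already learned", which helps both reduce agent-system complexity and improve reliability.

How to use it next: If your agent system currently uses multi-tool function calling, start with a small experiment: pick 5-10 high-frequency tools, wrap them as CLI commands, and compare success rate and token consumption. In parallel, review your error handling and output format and refactor them along the post's two-layer principle, making sure in particular that stderr is never dropped and long output is truncated correctly.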

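Discussion starters

1. How high is the tool-selection failure rate in your agent system? Is it because the model isn't smart enough, or because there are too many tools, the schemas are too complex, and the error messages are poor? If it's the latter, the CLI approach may be more cost-effective than upgrading the model.

2. When your agent makes a mistake, does it usually get it right in one step or retry blindly many times? Frequent blind retries point to a problem in error-message design. Rewriting error messages to the post's standard (including a suggested next step) should produce a visible drop in cost.

3. Does your agent system separate a presentation layer from an execution layer? If not, how are long text, binary data, and stderr handled today? How much does this separation affect reliability?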

r/LocalLLaMA • MorroHsu

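I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead.

English is not my first language. I wrote this in Chinese and translated it with AI help. The writing may have some AI flavor, but the design decisions, the production failures, and the thinking that distilled them into principles are mine.

I was a backend lead at Manus before the Meta acquisition. I've spent the last 2 years building AI agents, first at Manus, then on my own open-source agent runtime (Pinix) and agent (agent-clip). Along the way I came to a conclusion that surprised me:

A single run(command="...") tool with Unix-style commands outperforms a catalog of typed function calls.

Here's what I learned.

Why *nix

Unix made a design decision 50 years ago: everything is a text stream. Programs don't exchange complex binary structures or share memory objects. They communicate through text pipes. Small tools each do one thing well and are composed via | into powerful workflows. Programs describe themselves with --help, report success or failure with exit codes, and communicate errors through stderr.

LLMs made an almost identical decision 50 years later: everything is tokens. They only understand text and only produce text. Their thinking is text, their actions are text, and the feedback they receive from the world must be text.

These two decisions, made half a century apart from completely different starting points, converge on the same interface model. The text-based system Unix designed for human terminal operators (cat, grep, pipe, exit codes, man pages) isn't just usable by LLMs. It's a natural fit. When it comes to tool use, an LLM is essentially a terminal operator, one that's faster than any human and has already seen vast amounts of shell commands and CLI patterns in its training data.

This is the core philosophy of the *nix Agent: don't invent a new tool interface. Take what Unix has proven over 50 years and hand it directly to the LLM.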

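Why a single run

The single-tool hypothesis

Most agent frameworks give LLMs a catalog of independent tools:


tools: [search_web, read_file, write_file, run_code, send_email, ...]

Before each call, the LLM must make a tool selection: which one, and with what parameters? The more tools you add, the harder the selection, and accuracy drops. Cognitive load is spent on "which tool?" instead of "what do I need to accomplish?"

My approach: one run(command="...") tool, with all capabilities exposed as CLI commands.


run(command="cat notes.md")
run(command="cat log.txt | grep ERROR | wc -l")
run(command="see screenshot.png")
run(command="memory search deployment issue")
run(command="clip sandbox bash python3 analyze.py")

The LLM still chooses which command to use, but this is fundamentally different from choosing among 15 tools with different schemas. Command selection is string composition within a unified namespace. Function selection is context-switching between unrelated APIs.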

LLMs already speak CLI

Why are CLI commands a better fit for LLMs than structured function calls?

Because CLI is the densest tool-use pattern in LLM training data. Billions of lines on GitHub are full of:


# README install instructions
pip install -r requirements.txt && python main.py

# CI/CD build scripts
make build && make test && make deploy

# Stack Overflow solutions
cat /var/log/syslog | grep "Out of memory" | tail -20

I don't need to teach the LLM how to use a CLI. It already knows. This familiarity is probabilistic and model-dependent, but in practice it's remarkably reliable across mainstream models.

Compare two approaches to the same task:


Task: Read a log file, count the error lines

Function-calling approach (3 tool calls):
  1. read_file(path="/var/log/app.log") → returns entire file
  2. search_text(text=<entire file>, pattern="ERROR") → returns matching lines
  3. count_lines(text=<matched lines>) → returns number

CLI approach (1 tool call):
  run(command="cat /var/log/app.log | grep ERROR | wc -l")
  → 42

One call replaces three. Not because of any special optimization, but because Unix pipes natively support composition.

Making pipes and chains work

A single run isn't enough on its own. If run can only execute one command at a time, the LLM still needs multiple calls for composed tasks. So I built a chain parser (parseChain) in the command routing layer, supporting four Unix operators:


|   Pipe: stdout of previous command becomes stdin of next
&&  And:  execute next only if previous succeeded
||  Or:   execute next only if previous failed
;   Seq:  execute next regardless of previous result

With this mechanism, every tool call can be a complete workflow:


# One tool call: download → inspect
curl -sL $URL -o data.csv && cat data.csv | head 5

# One tool call: read → filter → sort → top 10
cat access.log | grep 500 | sort | head 10

# One tool call: try A, fall back to B
cat config.yaml || echo "config not found, using defaults"

N commands × 4 operators: the composition space grows dramatically. And to the LLM, it's just a string it already knows how to write.

The command line is the LLM's native tool interface.
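The post names the chain parser (parseChain) but does not show it, and the author's code is in Go. The following is a minimal Python sketch of the idea, not the author's implementation: `parse_chain` and `should_run` are hypothetical names, and the quote handling is an assumption (the post does not say how quoted operators are treated).

```python
from typing import List, Tuple

# Longest operators first so "||" is never mistaken for two "|" tokens.
_OPERATORS = ("&&", "||", "|", ";")


def parse_chain(command: str) -> List[Tuple[str, str]]:
    """Split a command string into (operator, segment) pairs.

    The first segment is paired with "" because nothing precedes it.
    Text inside single or double quotes is never treated as an operator.
    """
    segments: List[Tuple[str, str]] = []
    current: List[str] = []
    pending_op = ""
    quote = ""
    i = 0
    while i < len(command):
        ch = command[i]
        if quote:                      # inside quotes: copy until the closing quote
            current.append(ch)
            if ch == quote:
                quote = ""
            i += 1
            continue
        if ch in "\"'":                # opening quote
            quote = ch
            current.append(ch)
            i += 1
            continue
        op = next((o for o in _OPERATORS if command.startswith(o, i)), None)
        if op:
            segments.append((pending_op, "".join(current).strip()))
            current, pending_op = [], op
            i += len(op)
            continue
        current.append(ch)
        i += 1
    segments.append((pending_op, "".join(current).strip()))
    return segments


def should_run(op: str, prev_exit: int) -> bool:
    """Apply the four operator rules from the table above."""
    if op == "&&":
        return prev_exit == 0
    if op == "||":
        return prev_exit != 0
    return True                        # "", "|" and ";" always run the next segment
```

A router would walk the returned pairs in order, feeding stdout into stdin for `|` and consulting `should_run` with the previous exit code for the other operators.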

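Heuristic design: making the CLI guide the agent

Single-tool + CLI solves "what to use." But the agent still needs to know "how to use it." It can't Google. It can't ask a colleague. I use three progressive design techniques to make the CLI itself serve as the agent's navigation system.

Technique 1: Progressive --help discovery

A well-designed CLI tool doesn't require reading documentation, because --help tells you everything. I apply the same principle to the agent, structured as progressive disclosure: the agent doesn't load all documentation at once, but discovers details on demand as it goes deeper.

Level 0: Tool description → command list injection

The run tool's description is dynamically generated at the start of each conversation, listing all registered commands with one-line summaries:


Available commands:
  cat    — Read a text file. For images use see. For binary use cat -b.
  see    — View an image (auto-attaches to vision)
  ls     — List files in current topic
  write  — Write file. Usage: write <path> [content] or stdin
  grep   — Filter lines matching a pattern (supports -i, -v, -c)
  memory — Search or manage memory
  clip   — Operate external environments (sandboxes, services)
  ...

The agent knows what's available from turn one, but doesn't need every parameter of every command. That would waste context.

Note: There's an open design question here: injecting the full command list vs. on-demand discovery. As commands grow, the list itself consumes context budget. I'm still exploring the right balance. Ideas welcome.

Level 1: command (no args) → usage

When the agent is interested in a command, it just calls it. No arguments? The command returns its own usage:


→ run(command="memory")
[error] memory: usage: memory search|recent|store|facts|forget

→ run(command="clip")
  clip list                                — list available clips
  clip <name>                              — show clip details and commands
  clip <name> <command> [args...]          — invoke a command
  clip <name> pull <remote-path> [name]    — pull file from clip to local
  clip <name> push <local-path> <remote>   — push local file to clip

Now the agent knows memory has five subcommands and clip supports list/pull/push. One call, no noise.

Level 2: command subcommand (missing args) → specific parameters

The agent decides to use memory search but isn't sure about the format? It drills down:


→ run(command="memory search")
[error] memory: usage: memory search <query> [-t topic_id] [-k keyword]

→ run(command="clip sandbox")
  Clip: sandbox
  Commands:
    clip sandbox bash <script>
    clip sandbox read <path>
    clip sandbox write <path>
  File transfer:
    clip sandbox pull <remote-path> [local-name]
    clip sandbox push <local-path> <remote-path>

Progressive disclosure: overview (injected) → usage (explored) → parameters (drilled down). The agent discovers on demand, and each level provides just enough information for the next step.

This is fundamentally different from stuffing 3,000 words of tool documentation into the system prompt. Most of that information is irrelevant most of the time, which makes it pure context waste. Progressive help lets the agent decide when it needs more.

This also imposes a requirement on command design: every command and subcommand must have complete help output. It's not just for humans; it's for the agent. A good help message means one-shot success. A missing one means a blind guess.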
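To make Levels 1 and 2 concrete, here is a minimal Python sketch of a handler that answers with its own usage when arguments are missing. It is an illustration under assumptions, not the author's Go code: the function name, the use of exit code 1 for usage errors, and the placeholder success path are all hypothetical. The subcommand names and the `search` argument format are taken from the transcripts above; the post does not give arguments for the other subcommands, so they carry no spec here.

```python
from typing import Dict, List, Optional, Tuple

# Only `search` has its arguments spelled out in the post.
MEMORY_SUBCOMMANDS: Dict[str, Optional[str]] = {
    "search": "<query> [-t topic_id] [-k keyword]",
    "recent": None,
    "store": None,
    "facts": None,
    "forget": None,
}


def memory(args: List[str]) -> Tuple[str, int]:
    """Return (output, exit_code) for a `memory ...` invocation."""
    names = "|".join(MEMORY_SUBCOMMANDS)
    # Level 1: no (or unknown) subcommand -> list what exists.
    if not args or args[0] not in MEMORY_SUBCOMMANDS:
        return f"[error] memory: usage: memory {names}", 1
    sub, rest = args[0], args[1:]
    spec = MEMORY_SUBCOMMANDS[sub]
    # Level 2: known subcommand, required arguments missing -> show its format.
    if spec is not None and not rest:
        return f"[error] memory: usage: memory {sub} {spec}", 1
    # Placeholder for the real work, which is out of scope for this sketch.
    return f"(would run: memory {sub} {' '.join(rest)})".replace("  ", " "), 0
```

The point of the design is that the same entry point serves both execution and documentation, so the agent never needs a separate help tool.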

Technique 2: Error messages as navigation

Agents will make mistakes. The key isn't preventing errors. It's making every error point in the right direction.

Traditional CLI errors are designed for humans who can Google. Agents can't Google. So I require every error to contain both "what went wrong" and "what to do instead":


Traditional CLI:
  $ cat photo.png
  cat: binary file (standard output)
  → Human Googles "how to view image in terminal"

My design:
  [error] cat: binary image file (182KB). Use: see photo.png
  → Agent calls see directly, one-step correction

More examples:


[error] unknown command: foo
Available: cat, ls, see, write, grep, memory, clip, ...
→ Agent immediately knows what commands exist

[error] not an image file: data.csv (use cat to read text files)
→ Agent switches from see to cat

[error] clip "sandbox" not found. Use 'clip list' to see available clips
→ Agent knows to list clips first

Technique 1 (help) solves "what can I do?" Technique 2 (errors) solves "what should I do instead?" Together, the agent's recovery cost is minimal: usually 1-2 steps back to the right path.
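One way to enforce the "what went wrong plus what to do" rule is to make it impossible to build an error without a next step. The helper names below are hypothetical and the sketch is in Python rather than the author's Go; the message formats mirror the examples above.

```python
from typing import List


def nav_error(problem: str, next_step: str) -> str:
    """Build an error that always carries a suggested next command."""
    if not next_step:
        # Refuse to emit a dead-end error: the agent cannot look anything up.
        raise ValueError("every error must suggest a next step")
    return f"[error] {problem}. Use: {next_step}"


def unknown_command(name: str, available: List[str]) -> str:
    """Report an unknown command together with the commands that do exist."""
    return f"[error] unknown command: {name}\nAvailable: {', '.join(available)}"
```

For example, `nav_error("cat: binary image file (182KB)", "see photo.png")` reproduces the first message shown in "My design" above.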

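Real case: The cost of silent stderr

For a while, my code silently dropped stderr when calling external sandboxes: whenever stdout was non-empty, stderr was discarded. The agent ran pip install pymupdf and got exit code 127. stderr contained bash: pip: command not found, but the agent couldn't see it. It only knew the command failed, not why, and proceeded to blindly guess 10 different package managers:


pip install         → 127  (doesn't exist)
python3 -m pip      → 1    (module not found)
uv pip install      → 1    (wrong usage)
pip3 install        → 127
sudo apt install    → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓  (10th try)

10 calls, about 5 seconds of inference each. If stderr had been visible the first time, one call would have been enough.

stderr is the information agents need most, precisely when commands fail. Never drop it.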

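Technique 3: Consistent output format

The first two techniques handle discovery and correction. The third lets the agent get better at using the system over time.

I append consistent metadata to every tool result: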


file1.txt
file2.txt
dir1/
[exit:0 | 12ms]

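The LLM extracts two signals:

Exit codes (Unix convention, LLMs already know these):

exit:0 — success
exit:1 — general error
exit:127 — command not found

Duration (cost awareness):

12ms — cheap, call freely
3.2s — moderate
45s — expensive, use sparingly

After seeing [exit:N | Xs] dozens of times in a conversation, the agent internalizes the pattern. It starts anticipating: exit:1 means check the error, and a long duration means make fewer calls.

A consistent output format makes the agent smarter over time. Inconsistency makes every call feel like the first.

The three techniques form a progression: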


--help       →  What can I do?        →  Proactive discovery
Error Msg    →  What should I do?     →  Reactive correction
Output Fmt   →  How did it go?        →  Continuous learning

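Two-layer architecture: engineering the heuristic design

The section above described how the CLI guides agents at the semantic level. But to make it work in practice, there's an engineering problem: the raw output of a command and what the LLM needs to see are often very different things.

Two hard constraints of LLMs

Constraint A: The context window is finite and expensive. Every token costs money, attention, and inference speed. Stuffing a 10MB file into context doesn't just waste budget. It pushes earlier conversation out of the window, and the agent forgets.

Constraint B: LLMs can only process text. Binary data produces high-entropy, meaningless tokens through the tokenizer. It doesn't just waste context. It disrupts attention on surrounding valid tokens, degrading reasoning quality.

These two constraints mean that raw command output can't go directly to the LLM. It needs a presentation layer for processing. But that processing can't affect command execution logic, or pipes break. Hence, two layers.

Execution layer vs. presentation layer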


┌─────────────────────────────────────────────┐
│  Layer 2: LLM Presentation Layer            │  ← Designed for LLM constraints
│  Binary guard | Truncation+overflow | Meta   │
├─────────────────────────────────────────────┤
│  Layer 1: Unix Execution Layer              │  ← Pure Unix semantics
│  Command routing | pipe | chain | exit code │
└─────────────────────────────────────────────┘

When cat bigfile.txt | grep error | head 10 executes:


Inside Layer 1:
  cat output → [500KB raw text] → grep input
  grep output → [matching lines] → head input
  head output → [first 10 lines]

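If you truncate cat's output in Layer 1 → grep only searches the first 200 lines, producing incomplete results.
If you add [exit:0] in Layer 1 → it flows into grep as data, becoming a search target.

So Layer 1 must remain raw, lossless, and metadata-free. Processing only happens in Layer 2, after the pipe chain completes and the final result is ready to return to the LLM.

Layer 1 serves Unix semantics. Layer 2 serves LLM cognition. The separation isn't a design preference. It's a logical necessity.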

Layer 2's four mechanisms

Mechanism A: Binary guard (addressing Constraint B)

Before returning anything to the LLM, check whether it is text:


Null byte detected → binary
UTF-8 validation failed → binary
Control character ratio > 10% → binary

If image: [error] binary image (182KB). Use: see photo.png
If other: [error] binary file (1.2MB). Use: cat -b file.bin

The LLM never receives data it can't process.
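The three checks above translate almost directly into code. This is a Python sketch under assumptions, not the author's Go isBinary(): which control characters count (everything below 0x20 except tab, newline, and carriage return), the extension-based image test, and the KB-only size formatting are choices made here for illustration.

```python
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")  # assumed list


def is_binary(data: bytes, max_control_ratio: float = 0.10) -> bool:
    """Apply the three checks in order: null byte, UTF-8 validity, control ratio."""
    if not data:
        return False
    if b"\x00" in data:
        return True
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    control = sum(1 for ch in text if ord(ch) < 32 and ch not in "\t\n\r")
    return control / len(text) > max_control_ratio


def guard(data: bytes, path: str) -> str:
    """Return the text unchanged, or a navigational error for binary content."""
    if not is_binary(data):
        return data.decode("utf-8")
    size = f"{len(data) / 1024:.0f}KB"
    if path.lower().endswith(IMAGE_EXTENSIONS):
        return f"[error] binary image ({size}). Use: see {path}"
    return f"[error] binary file ({size}). Use: cat -b {path}"
```

Because the guard runs only on the final result, intermediate pipe stages still see the raw bytes, which keeps Layer 1 untouched.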

Mechanism B: Overflow mode (addressing Constraint A)


Output > 200 lines or > 50KB?
  → Truncate to first 200 lines (rune-safe, won't split UTF-8)
  → Write full output to /tmp/cmd-output/cmd-{n}.txt
  → Return to LLM:

    [first 200 lines]

    --- output truncated (5000 lines, 245.3KB) ---
    Full output: /tmp/cmd-output/cmd-3.txt
    Explore: cat /tmp/cmd-output/cmd-3.txt | grep <pattern>
             cat /tmp/cmd-output/cmd-3.txt | tail 100
    [exit:0 | 1.2s]

Key insight: the LLM already knows how to use grep, head, and tail to navigate files. Overflow mode turns "large data exploration" into a skill the LLM already has.
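A minimal Python sketch of overflow mode, assuming the 200-line and 50KB thresholds above. The function name `present`, the temp-directory default, and the byte cap applied to the preview when a single line is very long are assumptions of this sketch; the post only states that truncation is rune-safe.

```python
import os
import tempfile

MAX_LINES = 200
MAX_BYTES = 50 * 1024
DEFAULT_DIR = os.path.join(tempfile.gettempdir(), "cmd-output")


def present(output: str, call_id: int, out_dir: str = DEFAULT_DIR) -> str:
    """Return output unchanged if small, else a preview plus a pointer to the full file."""
    lines = output.splitlines()
    size = len(output.encode("utf-8"))
    if len(lines) <= MAX_LINES and size <= MAX_BYTES:
        return output

    # Persist the complete output so the agent can grep or tail it later.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"cmd-{call_id}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(output)

    head = "\n".join(lines[:MAX_LINES])
    head_bytes = head.encode("utf-8")
    if len(head_bytes) > MAX_BYTES:
        # Cut on a byte budget; errors="ignore" drops a trailing partial
        # code point so the preview never ends in a split character.
        head = head_bytes[:MAX_BYTES].decode("utf-8", errors="ignore")

    return (
        f"{head}\n\n"
        f"--- output truncated ({len(lines)} lines, {size / 1024:.1f}KB) ---\n"
        f"Full output: {path}\n"
        f"Explore: cat {path} | grep <pattern>\n"
        f"         cat {path} | tail 100"
    )
```

The exit-code footer is deliberately left out here, since it is appended by a separate step after truncation.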

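Mechanism C: Metadata footer


actual output here
[exit:0 | 1.2s]

Exit code + duration, appended as the last line of Layer 2. This gives the agent signals for success/failure and cost awareness without polluting Layer 1's pipe data.

Mechanism D: stderr attachment


When command fails with stderr:
  output + "\n[stderr] " + stderr

This ensures the agent can see why something failed, preventing blind retries.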
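Mechanisms C and D can be combined into one final formatting step. The following Python sketch is an illustration with hypothetical names (`finalize`, `format_duration`); the rule for switching between milliseconds and seconds is an assumption chosen to reproduce the "12ms", "3.2s", and "45s" examples from the post.

```python
def format_duration(seconds: float) -> str:
    """Render sub-second durations in ms, longer ones in seconds with one decimal."""
    if seconds < 1:
        return f"{seconds * 1000:.0f}ms"
    return f"{round(seconds, 1):g}s"


def finalize(stdout: str, stderr: str, exit_code: int, seconds: float) -> str:
    """Build the text returned to the LLM after the pipe chain has completed."""
    parts = [stdout.rstrip("\n")] if stdout else []
    # Mechanism D: on failure, stderr is attached even when stdout is non-empty.
    if exit_code != 0 and stderr:
        parts.append(f"[stderr] {stderr.rstrip()}")
    # Mechanism C: the footer is always the last line.
    parts.append(f"[exit:{exit_code} | {format_duration(seconds)}]")
    return "\n".join(parts)
```

With this in place, the failed `pip install` from the stderr case would return `[stderr] bash: pip: command not found` followed by `[exit:127 | ...]`, which is enough for a one-step correction.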

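Lessons learned: stories from production

Story 1: A PNG that caused 20 rounds of thrashing

A user uploaded an architecture diagram. The agent read it with cat and received 182KB of raw PNG bytes. The LLM's tokenizer turned those bytes into thousands of meaningless tokens that were crammed into the context. The LLM couldn't make sense of them and started trying different ways to read the file (cat -f, cat --format, cat --type image), getting the same garbage each time. After 20 rounds, the process was force-terminated.

Root cause: cat had no binary detection, and Layer 2 had no guard. Fix: add an isBinary() guard plus the error guidance "Use: see photo.png". Lesson: the tool result is the agent's eyes. Return garbage and the agent goes blind.

Story 2: Silent stderr and 10 blind retries

The agent needed to read a PDF. It tried pip install pymupdf and got exit code 127. stderr contained bash: pip: command not found, but the code dropped it, because stdout had content and the logic was "if stdout exists, ignore stderr."

The agent only knew the command failed, not why. What followed was a long trial-and-error:


pip install         → 127  (doesn't exist)
python3 -m pip      → 1    (module not found)
uv pip install      → 1    (wrong usage)
pip3 install        → 127
sudo apt install    → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓

10 calls, about 5 seconds of inference each. If stderr had been visible the first time, one call would have been enough.

Root cause: InvokeClip silently dropped stderr when stdout was non-empty. Fix: always attach stderr on failure. Lesson: stderr is the information agents need most, precisely when commands fail.

Story 3: The value of overflow mode

The agent analyzed a 5,000-line log file. Without truncation, the full text (about 200KB) would have been stuffed into context. The LLM's attention gets overwhelmed, response quality drops sharply, and earlier conversation is pushed out of the context window.

With overflow mode:


[first 200 lines of log content]

--- output truncated (5000 lines, 198.5KB) ---
Full output: /tmp/cmd-output/cmd-3.txt
Explore: cat /tmp/cmd-output/cmd-3.txt | grep <pattern>
         cat /tmp/cmd-output/cmd-3.txt | tail 100
[exit:0 | 45ms]

The agent saw the first 200 lines, understood the file structure, then used grep to pinpoint the issue: 3 calls total, under 2KB of context.

Lesson: giving the agent a map is far more effective than handing it the entire territory.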

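Boundaries and limitations

The CLI isn't a silver bullet. Typed APIs may be the better choice in these scenarios:

Strongly typed interactions: database queries, GraphQL APIs, and other cases requiring structured input/output. Schema validation is more reliable than string parsing.
High-security requirements: CLI string concatenation carries an inherent injection risk. With untrusted input, typed parameters are safer. agent-clip mitigates this through sandbox isolation.
Native multimodal: pure audio/video processing and other binary-stream scenarios, where the CLI's text pipe becomes a bottleneck.

Additionally, "no iteration limit" doesn't mean "no safety boundaries." Safety is ensured by external mechanisms:

Sandbox isolation: commands execute inside BoxLite containers, with no escape possible
API budgets: LLM calls have account-level spending caps
User cancellation: the frontend provides a cancel button, and the backend supports graceful shutdown

Hand Unix philosophy to the execution layer, hand the LLM's cognitive constraints to the presentation layer, and tie the two together with three progressive heuristic navigation techniques: help, error messages, and output format.

CLI is all agents need.

Source code (Go): github.com/epiral/agent-clip

Core files: internal/tools.go (command routing), internal/chain.go (pipes), internal/loop.go (two-layer agentic loop), internal/fs.go (binary guard), internal/clip.go (stderr handling), internal/browser.go (vision auto-attach), internal/memory.go (semantic memory).

Happy to discuss, especially if you've tried similar approaches or found cases where the CLI breaks down. The command discovery problem (how much to inject vs. let the agent discover) is something I'm still actively exploring.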

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Heres what I use instead.

English is not my first language. I wrote this in Chinese and translated it with AI help. The writing may have some AI flavor, but the design decisions, the production failures, and the thinking that distilled them into principles — those are mine.

I was a backend lead at Manus before the Meta acquisition. Ive spent the last 2 years building AI agents — first at Manus, then on my own open-source agent runtime (Pinix) and agent (agent-clip). Along the way I came to a conclusion that surprised me:

A single run(command=...) tool with Unix-style commands outperforms a catalog of typed function calls.

Heres what I learned.

我曾是 Manus 的后端负责人。做了两年智能体后，我彻底不再使用函数调用。以下是我改用的方案。

一个采用类 Unix 命令风格的 run(command=...) 工具，胜过一整套带类型的函数调用目录。

以下是我的心得。

Why *nix

Unix made a design decision 50 years ago: everything is a text stream. Programs dont exchange complex binary structures or share memory objects — they communicate through text pipes. Small tools each do one thing well, composed via | into powerful workflows. Programs describe themselves with --help, report success or failure with exit codes, and communicate errors through stderr.

LLMs made an almost identical decision 50 years later: everything is tokens. They only understand text, only produce text. Their thinking is text, their actions are text, and the feedback they receive from the world must be text.

These two decisions, made half a century apart from completely different starting points, converge on the same interface model. The text-based system Unix designed for human terminal operators — cat, grep, pipe, exit codes, man pages — isnt just usable by LLMs. Its a natural fit. When it comes to tool use, an LLM is essentially a terminal operator — one thats faster than any human and has already seen vast amounts of shell commands and CLI patterns in its training data.

This is the core philosophy of the *nix Agent: dont invent a new tool interface. Take what Unix has proven over 50 years and hand it directly to the LLM.

为什么选择 *nix

这就是 *nix Agent 的核心哲学：不要发明新的工具接口。把 Unix 50 年证明有效的东西，直接交给 LLM。

Why a single run

The single-tool hypothesis

Most agent frameworks give LLMs a catalog of independent tools:


tools: [search_web, read_file, write_file, run_code, send_email, ...]

Before each call, the LLM must make a tool selection — which one? What parameters? The more tools you add, the harder the selection, and accuracy drops. Cognitive load is spent on which tool? instead of what do I need to accomplish?

My approach: one run(command=...) tool, all capabilities exposed as CLI commands.


run(command=cat notes.md)
run(command=cat log.txt | grep ERROR | wc -l)
run(command=see screenshot.png)
run(command=memory search deployment issue)
run(command=clip sandbox bash python3 analyze.py)

The LLM still chooses which command to use, but this is fundamentally different from choosing among 15 tools with different schemas. Command selection is string composition within a unified namespace — function selection is context-switching between unrelated APIs.

LLMs already speak CLI

Why are CLI commands a better fit for LLMs than structured function calls?

Because CLI is the densest tool-use pattern in LLM training data. Billions of lines on GitHub are full of:


# README install instructions
pip install -r requirements.txt  python main.py

# CI/CD build scripts
make build  make test  make deploy

# Stack Overflow solutions
cat /var/log/syslog | grep Out of memory | tail -20

I dont need to teach the LLM how to use CLI — it already knows. This familiarity is probabilistic and model-dependent, but in practice its remarkably reliable across mainstream models.

Compare two approaches to the same task:


Task: Read a log file, count the error lines

Function-calling approach (3 tool calls):
  1. read_file(path=/var/log/app.log) → returns entire file
  2. search_text(text=entire file, pattern=ERROR) → returns matching lines
  3. count_lines(text=matched lines) → returns number

CLI approach (1 tool call):
  run(command=cat /var/log/app.log | grep ERROR | wc -l)
  → 42

One call replaces three. Not because of special optimization — but because Unix pipes natively support composition.

Making pipes and chains work

A single run isnt enough on its own. If run can only execute one command at a time, the LLM still needs multiple calls for composed tasks. So I make a chain parser (parseChain) in the command routing layer, supporting four Unix operators:


|   Pipe: stdout of previous command becomes stdin of next
  And:  execute next only if previous succeeded
||  Or:   execute next only if previous failed
;   Seq:  execute next regardless of previous result

With this mechanism, every tool call can be a complete workflow:


# One tool call: download → inspect
curl -sL $URL -o data.csv  cat data.csv | head 5

# One tool call: read → filter → sort → top 10
cat access.log | grep 500 | sort | head 10

# One tool call: try A, fall back to B
cat config.yaml || echo config not found, using defaults

N commands × 4 operators — the composition space grows dramatically. And to the LLM, its just a string it already knows how to write.

The command line is the LLMs native tool interface.

为什么只用一个 run

单工具假设

大多数智能体框架会给 LLM 一份彼此独立的工具目录：


tools: [search_web, read_file, write_file, run_code, send_email, ...]

我的做法：只提供一个 run(command=...) 工具，把所有能力都以 CLI 命令的方式暴露出来。


run(command=cat notes.md)
run(command=cat log.txt | grep ERROR | wc -l)
run(command=see screenshot.png)
run(command=memory search deployment issue)
run(command=clip sandbox bash python3 analyze.py)

LLM 本就会说 CLI

为什么 CLI 命令比结构化函数调用更适配 LLM？

因为 CLI 是 LLM 训练数据里最密集的工具使用模式。GitHub 上数十亿行文本里充满了：


# README install instructions
pip install -r requirements.txt  python main.py

# CI/CD build scripts
make build  make test  make deploy

# Stack Overflow solutions
cat /var/log/syslog | grep Out of memory | tail -20

我不需要教 LLM 怎么用 CLI——它早就会。这种熟悉程度取决于模型与概率分布，但在实践里，它在主流模型上的稳定性出奇地好。

对比同一个任务的两种做法：


Task: Read a log file, count the error lines

Function-calling approach (3 tool calls):
  1. read_file(path=/var/log/app.log) → returns entire file
  2. search_text(text=entire file, pattern=ERROR) → returns matching lines
  3. count_lines(text=matched lines) → returns number

CLI approach (1 tool call):
  run(command=cat /var/log/app.log | grep ERROR | wc -l)
  → 42

一次调用顶三次。不是因为什么特殊优化——而是因为 Unix 管道天然支持组合。

让管道与链式执行真正可用


|   Pipe: stdout of previous command becomes stdin of next
  And:  execute next only if previous succeeded
||  Or:   execute next only if previous failed
;   Seq:  execute next regardless of previous result

有了这个机制，每一次工具调用都可以是一整段完整工作流：


# One tool call: download → inspect
curl -sL $URL -o data.csv  cat data.csv | head 5

# One tool call: read → filter → sort → top 10
cat access.log | grep 500 | sort | head 10

# One tool call: try A, fall back to B
cat config.yaml || echo config not found, using defaults

N 个命令 × 4 个运算符——组合空间会急剧增大。而对 LLM 来说，这不过是一条它本来就会写的字符串。

命令行就是 LLM 的母语工具接口。

Heuristic design: making CLI guide the agent

Single-tool + CLI solves what to use. But the agent still needs to know how to use it. It cant Google. It cant ask a colleague. I use three progressive design techniques to make the CLI itself serve as the agents navigation system.

Technique 1: Progressive --help discovery

A well-designed CLI tool doesnt require reading documentation — because --help tells you everything. I apply the same principle to the agent, structured as progressive disclosure: the agent doesnt need to load all documentation at once, but discovers details on-demand as it goes deeper.

Level 0: Tool Description → command list injection

The run tools description is dynamically generated at the start of each conversation, listing all registered commands with one-line summaries:


Available commands:
  cat    — Read a text file. For images use see. For binary use cat -b.
  see    — View an image (auto-attaches to vision)
  ls     — List files in current topic
  write  — Write file. Usage: write path [content] or stdin
  grep   — Filter lines matching a pattern (supports -i, -v, -c)
  memory — Search or manage memory
  clip   — Operate external environments (sandboxes, services)
  ...

The agent knows whats available from turn one, but doesnt need every parameter of every command — that would waste context.

Note: Theres an open design question here: injecting the full command list vs. on-demand discovery. As commands grow, the list itself consumes context budget. Im still exploring the right balance. Ideas welcome.

Level 1: command (no args) → usage

When the agent is interested in a command, it just calls it. No arguments? The command returns its own usage:


→ run(command=memory)
[error] memory: usage: memory search|recent|store|facts|forget

→ run(command=clip)
  clip list                              — list available clips
  clip name                            — show clip details and commands
  clip name command [args...]         — invoke a command
  clip name pull remote-path [name]   — pull file from clip to local
  clip name push local-path remote  — push local file to clip

Now the agent knows memory has five subcommands and clip supports list/pull/push. One call, no noise.

Level 2: command subcommand (missing args) → specific parameters

The agent decides to use memory search but isnt sure about the format? It drills down:


→ run(command=memory search)
[error] memory: usage: memory search query [-t topic_id] [-k keyword]

→ run(command=clip sandbox)
  Clip: sandbox
  Commands:
    clip sandbox bash script
    clip sandbox read path
    clip sandbox write path
  File transfer:
    clip sandbox pull remote-path [local-name]
    clip sandbox push local-path remote-path

Progressive disclosure: overview (injected) → usage (explored) → parameters (drilled down). The agent discovers on-demand, each level providing just enough information for the next step.

This is fundamentally different from stuffing 3,000 words of tool documentation into the system prompt. Most of that information is irrelevant most of the time — pure context waste. Progressive help lets the agent decide when it needs more.

This also imposes a requirement on command design: every command and subcommand must have complete help output. Its not just for humans — its for the agent. A good help message means one-shot success. A missing one means a blind guess.

Technique 2: Error messages as navigation

Agents will make mistakes. The key isnt preventing errors — its making every error point to the right direction.

Traditional CLI errors are designed for humans who can Google. Agents cant Google. So I require every error to contain both what went wrong and what to do instead:


Traditional CLI:
  $ cat photo.png
  cat: binary file (standard output)
  → Human Googles how to view image in terminal

My design:
  [error] cat: binary image file (182KB). Use: see photo.png
  → Agent calls see directly, one-step correction

More examples:


[error] unknown command: foo
Available: cat, ls, see, write, grep, memory, clip, ...
→ Agent immediately knows what commands exist

[error] not an image file: data.csv (use cat to read text files)
→ Agent switches from see to cat

[error] clip sandbox not found. Use clip list to see available clips
→ Agent knows to list clips first

Technique 1 (help) solves what can I do? Technique 2 (errors) solves what should I do instead? Together, the agents recovery cost is minimal — usually 1-2 steps to the right path.

Real case: The cost of silent stderr

For a while, my code silently dropped stderr when calling external sandboxes — whenever stdout was non-empty, stderr was discarded. The agent ran pip install pymupdf, got exit code 127. stderr contained bash: pip: command not found, but the agent couldnt see it. It only knew it failed, not why — and proceeded to blindly guess 10 different package managers:


pip install         → 127  (doesnt exist)
python3 -m pip      → 1    (module not found)
uv pip install      → 1    (wrong usage)
pip3 install        → 127
sudo apt install    → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓  (10th try)

10 calls, ~5 seconds of inference each. If stderr had been visible the first time, one call would have been enough.

stderr is the information agents need most, precisely when commands fail. Never drop it.

Technique 3: Consistent output format

The first two techniques handle discovery and correction. The third lets the agent get better at using the system over time.

I append consistent metadata to every tool result:


file1.txt
file2.txt
dir1/
[exit:0 | 12ms]

The LLM extracts two signals:

Exit codes (Unix convention, LLMs already know these):

exit:0 — success

exit:1 — general error

exit:127 — command not found

Duration (cost awareness):

12ms — cheap, call freely

3.2s — moderate

45s — expensive, use sparingly

After seeing [exit:N | Xs] dozens of times in a conversation, the agent internalizes the pattern. It starts anticipating — seeing exit:1 means check the error, seeing long duration means reduce calls.

Consistent output format makes the agent smarter over time. Inconsistency makes every call feel like the first.

The three techniques form a progression:


--help       →  What can I do?        →  Proactive discovery
Error Msg    →  What should I do?     →  Reactive correction
Output Fmt   →  How did it go?        →  Continuous learning

启发式设计：让 CLI 引导智能体

技巧 1：渐进式的 --help 发现

Level 0：工具描述 → 注入命令列表

run 工具的描述会在每次对话开始时动态生成，列出所有已注册命令及其一行摘要：


Available commands:
  cat    — Read a text file. For images use see. For binary use cat -b.
  see    — View an image (auto-attaches to vision)
  ls     — List files in current topic
  write  — Write file. Usage: write path [content] or stdin
  grep   — Filter lines matching a pattern (supports -i, -v, -c)
  memory — Search or manage memory
  clip   — Operate external environments (sandboxes, services)
  ...

智能体从第一轮就知道有什么可用，但它不需要每个命令的每个参数——那会浪费上下文。

Level 1：command（无参数）→ 用法

当智能体对某个命令感兴趣时，它就直接调用它。不带参数？命令返回自己的用法：


→ run(command=memory)
[error] memory: usage: memory search|recent|store|facts|forget

→ run(command=clip)
  clip list                              — list available clips
  clip name                            — show clip details and commands
  clip name command [args...]         — invoke a command
  clip name pull remote-path [name]   — pull file from clip to local
  clip name push local-path remote  — push local file to clip

现在智能体知道 memory 有五个子命令，clip 支持 list/pull/push。一条调用，零噪声。

Level 2：command subcommand（缺参）→ 具体参数

智能体决定用 memory search，但不确定格式？就继续向下钻：


→ run(command=memory search)
[error] memory: usage: memory search query [-t topic_id] [-k keyword]

→ run(command=clip sandbox)
  Clip: sandbox
  Commands:
    clip sandbox bash script
    clip sandbox read path
    clip sandbox write path
  File transfer:
    clip sandbox pull remote-path [local-name]
    clip sandbox push local-path remote-path

渐进式披露：概览（注入）→ 用法（探索）→ 参数（钻取）。智能体按需发现，每一层都只提供下一步所需的信息。

技巧 2：把错误信息当作导航

智能体一定会犯错。关键不在于杜绝错误——而在于让每个错误都指向正确方向。

传统 CLI 的错误是为能 Google 的人设计的。智能体不能 Google。所以我要求每个错误都必须同时包含“哪里错了”和“该怎么做”：


Traditional CLI:
  $ cat photo.png
  cat: binary file (standard output)
  → Human Googles how to view image in terminal

My design:
  [error] cat: binary image file (182KB). Use: see photo.png
  → Agent calls see directly, one-step correction

更多例子：


[error] unknown command: foo
Available: cat, ls, see, write, grep, memory, clip, ...
→ Agent immediately knows what commands exist

[error] not an image file: data.csv (use cat to read text files)
→ Agent switches from see to cat

[error] clip sandbox not found. Use clip list to see available clips
→ Agent knows to list clips first

技巧 1（help）解决“我能做什么？”技巧 2（错误）解决“那我应该改做什么？”二者结合，智能体的恢复成本很低——通常 1–2 步就能回到正确路径。

真实案例：静默 stderr 的代价


pip install         → 127  (doesnt exist)
python3 -m pip      → 1    (module not found)
uv pip install      → 1    (wrong usage)
pip3 install        → 127
sudo apt install    → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓  (10th try)

10 次调用，每次推理大约 5 秒。如果第一次就能看到 stderr，一次调用就够了。

当命令失败时，stderr 恰恰是智能体最需要的信息。永远不要丢。

技巧 3：一致的输出格式

前两种技巧解决了发现与纠错。第三个技巧让智能体能随着时间推移越来越擅长使用系统。

我会在每次工具结果后追加一致的元信息：


file1.txt
file2.txt
dir1/
[exit:0 | 12ms]

LLM 会抽取两类信号：

退出码（Unix 约定，LLM 本来就懂）：

exit:0 — 成功
exit:1 — 一般错误
exit:127 — 未找到命令

耗时（成本感知）：

12ms — 便宜，可频繁调用
3.2s — 中等
45s — 昂贵，应谨慎使用

当在一次对话里看过几十次 [exit:N | Xs] 之后，智能体会内化这个模式。它开始提前预期——看到 exit:1 就知道要检查错误；看到耗时很长就会减少调用。

一致的输出格式会让智能体随着时间变聪明。不一致会让每次调用都像第一次。

三种技巧形成一条递进链：


--help       →  What can I do?        →  Proactive discovery
Error Msg    →  What should I do?     →  Reactive correction
Output Fmt   →  How did it go?        →  Continuous learning

Two-layer architecture: engineering the heuristic design

The section above described how CLI guides agents at the semantic level. But to make it work in practice, theres an engineering problem: the raw output of a command and what the LLM needs to see are often very different things.

Two hard constraints of LLMs

Constraint A: The context window is finite and expensive. Every token costs money, attention, and inference speed. Stuffing a 10MB file into context doesnt just waste budget — it pushes earlier conversation out of the window. The agent forgets.

Constraint B: LLMs can only process text. Binary data produces high-entropy meaningless tokens through the tokenizer. It doesnt just waste context — it disrupts attention on surrounding valid tokens, degrading reasoning quality.

These two constraints mean: raw command output cant go directly to the LLM — it needs a presentation layer for processing. But that processing cant affect command execution logic — or pipes break. Hence, two layers.

Execution layer vs. presentation layer


┌─────────────────────────────────────────────┐
│  Layer 2: LLM Presentation Layer            │  ← Designed for LLM constraints
│  Binary guard | Truncation+overflow | Meta   │
├─────────────────────────────────────────────┤
│  Layer 1: Unix Execution Layer              │  ← Pure Unix semantics
│  Command routing | pipe | chain | exit code │
└─────────────────────────────────────────────┘

When cat bigfile.txt | grep error | head 10 executes:


Inside Layer 1:
  cat output → [500KB raw text] → grep input
  grep output → [matching lines] → head input
  head output → [first 10 lines]

If you truncate cat's output in Layer 1, grep only searches the first 200 lines and produces incomplete results. If you add [exit:0] in Layer 1, it flows into grep as data and becomes a search target.

So Layer 1 must remain raw, lossless, and metadata-free. Processing only happens in Layer 2, after the pipe chain completes and the final result is ready to return to the LLM.

Layer 1 serves Unix semantics. Layer 2 serves LLM cognition. The separation isn't a design preference; it's a logical necessity.
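
The separation can be sketched in Go roughly as follows. This is an illustration under assumptions, not agent-clip's code: the stage type and the grep, head, runPipe, and present helpers are hypothetical names, and the stages are simplified in-memory stand-ins for real commands. The point it shows is that data flows between stages untouched and decoration is applied once, to the final result.

```go
package main

import (
	"fmt"
	"strings"
)

// stage is a hypothetical Layer 1 command: raw stdin in, raw stdout out.
type stage func(stdin string) string

// grep keeps the lines containing pattern (simplified, no regex).
func grep(pattern string) stage {
	return func(stdin string) string {
		var out []string
		for _, line := range strings.Split(stdin, "\n") {
			if strings.Contains(line, pattern) {
				out = append(out, line)
			}
		}
		return strings.Join(out, "\n")
	}
}

// head keeps the first n lines.
func head(n int) stage {
	return func(stdin string) string {
		lines := strings.Split(stdin, "\n")
		if len(lines) > n {
			lines = lines[:n]
		}
		return strings.Join(lines, "\n")
	}
}

// runPipe is Layer 1: no truncation and no metadata between stages.
func runPipe(input string, stages ...stage) string {
	data := input
	for _, s := range stages {
		data = s(data)
	}
	return data
}

// present is Layer 2: it only ever touches the final result.
func present(final string, exitCode int) string {
	return fmt.Sprintf("%s\n[exit:%d]", final, exitCode)
}

func main() {
	var b strings.Builder
	for i := 1; i <= 1000; i++ {
		if i%300 == 0 {
			fmt.Fprintf(&b, "line %d error\n", i)
		} else {
			fmt.Fprintf(&b, "line %d ok\n", i)
		}
	}
	raw := runPipe(b.String(), grep("error"), head(10))
	fmt.Println(present(raw, 0))
}
```

Because grep receives all 1000 lines, matches far past line 200 are still found; the footer is appended only after head has produced the final output.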

Layer 2's four mechanisms

Mechanism A: Binary Guard (addressing Constraint B)

Before returning anything to the LLM, check if it's text:


Null byte detected → binary
UTF-8 validation failed → binary
Control character ratio > 10% → binary

If image: [error] binary image (182KB). Use: see photo.png
If other: [error] binary file (1.2MB). Use: cat -b file.bin

The LLM never receives data it can't process.
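
A minimal Go sketch of the three checks is shown below. It is an approximation under assumptions: the post names an isBinary() guard but does not show its body, so the exact definition of "control character" here (code points below 0x20 other than tab, newline, and carriage return) is a guess.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// isBinary applies the three rules listed above. The thresholds follow the
// post; the control-character definition is an assumption of this sketch.
func isBinary(data []byte) bool {
	if len(data) == 0 {
		return false
	}
	if bytes.IndexByte(data, 0) >= 0 { // rule 1: null byte
		return true
	}
	if !utf8.Valid(data) { // rule 2: invalid UTF-8
		return true
	}
	control, total := 0, 0
	for _, r := range string(data) {
		total++
		// Tab, newline, and carriage return are normal in text output.
		if r < 0x20 && r != '\t' && r != '\n' && r != '\r' {
			control++
		}
	}
	return float64(control)/float64(total) > 0.10 // rule 3: control ratio
}

func main() {
	fmt.Println(isBinary([]byte("hello\nworld\n")))
	fmt.Println(isBinary([]byte{0x89, 'P', 'N', 'G', 0}))
}
```

The checks run cheapest-first, so most binary payloads are rejected on the null-byte scan before the rune loop is reached.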

Mechanism B: Overflow Mode (addressing Constraint A)


Output > 200 lines or > 50KB?
  → Truncate to first 200 lines (rune-safe, won't split UTF-8)
  → Write full output to /tmp/cmd-output/cmd-{n}.txt
  → Return to LLM:

    [first 200 lines]

    --- output truncated (5000 lines, 245.3KB) ---
    Full output: /tmp/cmd-output/cmd-3.txt
    Explore: cat /tmp/cmd-output/cmd-3.txt | grep <pattern>
             cat /tmp/cmd-output/cmd-3.txt | tail 100
    [exit:0 | 1.2s]

Key insight: the LLM already knows how to use grep, head, tail to navigate files. Overflow mode transforms large data exploration into a skill the LLM already has.
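
The truncation step might look like this in Go. It is a sketch under assumptions: truncate and present are hypothetical names, the spill directory is passed in rather than fixed to /tmp/cmd-output, and the metadata footer is left out to keep the example focused.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"unicode/utf8"
)

const (
	maxLines = 200
	maxBytes = 50 * 1024
)

// truncate returns the head of s, limited to maxLines lines and maxBytes
// bytes, without cutting inside a multi-byte UTF-8 sequence.
func truncate(s string) (head string, truncated bool) {
	end, lines := len(s), 0
	for i := 0; i < len(s); i++ {
		if s[i] == '\n' {
			lines++
			if lines == maxLines {
				end = i + 1
				break
			}
		}
	}
	if end > maxBytes {
		end = maxBytes
		for end > 0 && !utf8.RuneStart(s[end]) {
			end-- // back up to the start of the rune we landed in
		}
	}
	return s[:end], end < len(s)
}

// present spills the full output to dir and returns the head plus a notice.
func present(out string, n int, dir string) (string, error) {
	head, truncated := truncate(out)
	if !truncated {
		return out, nil
	}
	path := filepath.Join(dir, fmt.Sprintf("cmd-%d.txt", n))
	if err := os.WriteFile(path, []byte(out), 0o644); err != nil {
		return "", err
	}
	return fmt.Sprintf("%s\n--- output truncated (%d lines, %.1fKB) ---\nFull output: %s\nExplore: cat %s | grep <pattern>\n         cat %s | tail 100\n",
		head, strings.Count(out, "\n"), float64(len(out))/1024, path, path, path), nil
}

func main() {
	dir, err := os.MkdirTemp("", "cmd-output")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	big := strings.Repeat("line\n", 5000) // dummy data for the demo
	shown, err := present(big, 3, dir)
	if err != nil {
		panic(err)
	}
	fmt.Println(shown[len(shown)-200:])
}
```

Writing the full output to a file before returning means nothing is lost: the agent can still reach every line with commands it already knows.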

Mechanism C: Metadata Footer


actual output here
[exit:0 | 1.2s]

Exit code + duration, appended as the last line of Layer 2. Gives the agent signals for success/failure and cost awareness, without polluting Layer 1's pipe data.
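
A small Go sketch of the footer follows. The function names are hypothetical, and the rule for switching between milliseconds and seconds is an assumption inferred from the sample values (12ms, 1.2s) rather than taken from the project.

```go
package main

import (
	"fmt"
	"time"
)

// formatDuration prints sub-second durations in ms and longer ones in
// seconds with one decimal. The cutoff is an assumption of this sketch.
func formatDuration(d time.Duration) string {
	if d < time.Second {
		return fmt.Sprintf("%dms", d.Milliseconds())
	}
	return fmt.Sprintf("%.1fs", d.Seconds())
}

// withFooter appends "[exit:N | duration]" as the last line of the result.
func withFooter(output string, exitCode int, elapsed time.Duration) string {
	if output != "" && output[len(output)-1] != '\n' {
		output += "\n"
	}
	return fmt.Sprintf("%s[exit:%d | %s]", output, exitCode, formatDuration(elapsed))
}

func main() {
	fmt.Println(withFooter("file1.txt\nfile2.txt\ndir1/", 0, 12*time.Millisecond))
}
```

Keeping the footer on its own final line lets the agent find it in the same place on every call.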

Mechanism D: stderr Attachment


When command fails with stderr:
  output + "\n[stderr] " + stderr

Ensures the agent can see why something failed, preventing blind retries.
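
In Go, the rule could be written as below. The result struct and render function are hypothetical names for this sketch; the post only states the concatenation rule itself.

```go
package main

import "fmt"

// result is a hypothetical container for what a finished command produced.
type result struct {
	Stdout   string
	Stderr   string
	ExitCode int
}

// render attaches stderr whenever the command failed, even if stdout is
// non-empty, which is exactly the case that was being dropped.
func render(r result) string {
	out := r.Stdout
	if r.ExitCode != 0 && r.Stderr != "" {
		out += "\n[stderr] " + r.Stderr
	}
	return out
}

func main() {
	r := result{
		Stdout:   "partial output",
		Stderr:   "bash: pip: command not found",
		ExitCode: 127,
	}
	fmt.Println(render(r))
}
```

The condition depends only on the exit code and on stderr being present, never on whether stdout is empty.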


Lessons learned: stories from production

Story 1: A PNG that caused 20 iterations of thrashing

A user uploaded an architecture diagram. The agent read it with cat, receiving 182KB of raw PNG bytes. The LLM's tokenizer turned these bytes into thousands of meaningless tokens crammed into the context. The LLM couldn't make sense of it and started trying different read approaches (cat -f, cat --format, cat --type image), each time receiving the same garbage. After 20 iterations, the process was force-terminated.

Root cause: cat had no binary detection, and Layer 2 had no guard. Fix: an isBinary() guard plus the error guidance "Use: see photo.png". Lesson: The tool result is the agent's eyes. Return garbage and the agent goes blind.

Story 2: Silent stderr and 10 blind retries

The agent needed to read a PDF. It tried pip install pymupdf and got exit code 127. stderr contained "bash: pip: command not found", but the code dropped it, because there was some stdout output and the logic was "if stdout exists, ignore stderr".

The agent only knew it failed, not why. What followed was a long trial-and-error:


pip install         → 127  (doesn't exist)
python3 -m pip      → 1    (module not found)
uv pip install      → 1    (wrong usage)
pip3 install        → 127
sudo apt install    → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓

10 calls, ~5 seconds of inference each. If stderr had been visible the first time, one call would have sufficed.

Root cause: InvokeClip silently dropped stderr when stdout was non-empty. Fix: Always attach stderr on failure. Lesson: stderr is the information agents need most, precisely when commands fail.

Story 3: The value of overflow mode

The agent analyzed a 5,000-line log file. Without truncation, the full text (~200KB) was stuffed into context. The LLM's attention was overwhelmed, response quality dropped sharply, and earlier conversation was pushed out of the context window.

With overflow mode:


[first 200 lines of log content]

--- output truncated (5000 lines, 198.5KB) ---
Full output: /tmp/cmd-output/cmd-3.txt
Explore: cat /tmp/cmd-output/cmd-3.txt | grep <pattern>
         cat /tmp/cmd-output/cmd-3.txt | tail 100
[exit:0 | 45ms]

The agent saw the first 200 lines, understood the file structure, then used grep to pinpoint the issue — 3 calls total, under 2KB of context.

Lesson: Giving the agent a map is far more effective than giving it the entire territory.


Boundaries and limitations

CLI isn't a silver bullet. Typed APIs may be the better choice in these scenarios:

Strongly-typed interactions: Database queries, GraphQL APIs, and other cases requiring structured input/output. Schema validation is more reliable than string parsing.

High-security requirements: CLI's string concatenation carries inherent injection risks. In untrusted-input scenarios, typed parameters are safer. agent-clip mitigates this through sandbox isolation.

Native multimodal: Pure audio/video processing and other binary-stream scenarios where CLI's text pipe is a bottleneck.

Additionally, "no iteration limit" doesn't mean "no safety boundaries". Safety is ensured by external mechanisms:

Sandbox isolation: Commands execute inside BoxLite containers, no escape possible

API budgets: LLM calls have account-level spending caps

User cancellation: Frontend provides cancel buttons, backend supports graceful shutdown

Hand Unix philosophy to the execution layer, hand the LLM's cognitive constraints to the presentation layer, and use help, error messages, and output format as three progressive heuristic navigation techniques.

CLI is all agents need.

Source code (Go): github.com/epiral/agent-clip

Core files: internal/tools.go (command routing), internal/chain.go (pipes), internal/loop.go (two-layer agentic loop), internal/fs.go (binary guard), internal/clip.go (stderr handling), internal/browser.go (vision auto-attach), internal/memory.go (semantic memory).

Happy to discuss, especially if you've tried similar approaches or found cases where CLI breaks down. The command discovery problem (how much to inject vs. let the agent discover) is something I'm still actively exploring.




I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead.

A single run(command="...") tool with Unix-style commands outperforms a catalog of typed function calls.

Here's what I learned.

Why *nix

This is the core philosophy of the *nix Agent: don't invent a new tool interface. Take what Unix has proven over 50 years and hand it directly to the LLM.

Why a single run

The single-tool hypothesis

Most agent frameworks give LLMs a catalog of independent tools:


tools: [search_web, read_file, write_file, run_code, send_email, ...]

My approach: one run(command="...") tool, all capabilities exposed as CLI commands.


run(command="cat notes.md")
run(command="cat log.txt | grep ERROR | wc -l")
run(command="see screenshot.png")
run(command="memory search 'deployment issue'")
run(command="clip sandbox bash 'python3 analyze.py'")

LLMs already speak CLI

Why are CLI commands a better fit for LLMs than structured function calls?

Because CLI is the densest tool-use pattern in LLM training data. Billions of lines on GitHub are full of:


# README install instructions
pip install -r requirements.txt && python main.py

# CI/CD build scripts
make build && make test && make deploy

# Stack Overflow solutions
cat /var/log/syslog | grep "Out of memory" | tail -20

I don't need to teach the LLM how to use CLI; it already knows. This familiarity is probabilistic and model-dependent, but in practice it's remarkably reliable across mainstream models.

Compare two approaches to the same task:


Task: Read a log file, count the error lines

Function-calling approach (3 tool calls):
  1. read_file(path="/var/log/app.log") → returns entire file
  2. search_text(text=<entire file>, pattern="ERROR") → returns matching lines
  3. count_lines(text=<matched lines>) → returns number

CLI approach (1 tool call):
  run(command="cat /var/log/app.log | grep ERROR | wc -l")
  → 42

One call replaces three. Not because of special optimization — but because Unix pipes natively support composition.

Making pipes and chains work


|   Pipe: stdout of previous command becomes stdin of next
&&  And:  execute next only if previous succeeded
||  Or:   execute next only if previous failed
;   Seq:  execute next regardless of previous result

With this mechanism, every tool call can be a complete workflow:


# One tool call: download → inspect
curl -sL $URL -o data.csv && cat data.csv | head 5

# One tool call: read → filter → sort → top 10
cat access.log | grep "500" | sort | head 10

# One tool call: try A, fall back to B
cat config.yaml || echo "config not found, using defaults"

N commands × 4 operators: the composition space grows dramatically. And to the LLM, it's just a string it already knows how to write.
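
The chain semantics can be sketched in Go as follows. This is a simplified illustration, not the project's parser: splitChain and runChain are hypothetical names, the splitter ignores quoting (an operator inside a quoted string would be split incorrectly), and a single | is left inside each segment for a separate pipe executor to handle.

```go
package main

import (
	"fmt"
	"strings"
)

// step is one command plus the operator that precedes it ("" for the first).
type step struct {
	op  string // "", "&&", "||", or ";"
	cmd string
}

// splitChain splits on &&, || and ; without any quote handling.
func splitChain(line string) []step {
	var steps []step
	op, start := "", 0
	for i := 0; i < len(line); {
		switch {
		case strings.HasPrefix(line[i:], "&&"), strings.HasPrefix(line[i:], "||"):
			steps = append(steps, step{op, strings.TrimSpace(line[start:i])})
			op = line[i : i+2]
			i += 2
			start = i
		case line[i] == ';':
			steps = append(steps, step{op, strings.TrimSpace(line[start:i])})
			op = ";"
			i++
			start = i
		default:
			i++
		}
	}
	return append(steps, step{op, strings.TrimSpace(line[start:])})
}

// runChain evaluates left to right, as a shell does: && runs the next
// command only after success, || only after failure, ; unconditionally.
func runChain(line string, run func(cmd string) int) int {
	code := 0
	for _, s := range splitChain(line) {
		if s.op == "&&" && code != 0 {
			continue
		}
		if s.op == "||" && code == 0 {
			continue
		}
		code = run(s.cmd)
	}
	return code
}

func main() {
	// Fake runner for the demo: only "cat config.yaml" fails.
	run := func(cmd string) int {
		fmt.Println("run:", cmd)
		if cmd == "cat config.yaml" {
			return 1
		}
		return 0
	}
	code := runChain(`cat config.yaml || echo "config not found, using defaults"`, run)
	fmt.Println("exit:", code)
}
```

A skipped command leaves the previous exit code in place, which is what makes a && b || c fall through to c when a fails.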

The command line is the LLM's native tool interface.

Heuristic design: making CLI guide the agent

Technique 1: Progressive --help discovery

Level 0: Tool Description → command list injection

The run tool's description is dynamically generated at the start of each conversation, listing all registered commands with one-line summaries:


Available commands:
  cat    — Read a text file. For images use 'see'. For binary use 'cat -b'.
  see    — View an image (auto-attaches to vision)
  ls     — List files in current topic
  write  — Write file. Usage: write <path> [content] or stdin
  grep   — Filter lines matching a pattern (supports -i, -v, -c)
  memory — Search or manage memory
  clip   — Operate external environments (sandboxes, services)
  ...

The agent knows what's available from turn one, but doesn't need every parameter of every command, which would waste context.

Level 1: command (no args) → usage

When the agent is interested in a command, it just calls it. No arguments? The command returns its own usage:


→ run(command="memory")
[error] memory: usage: memory search|recent|store|facts|forget

→ run(command="clip")
  clip list                               — list available clips
  clip <name>                             — show clip details and commands
  clip <name> <command> [args...]         — invoke a command
  clip <name> pull <remote-path> [name]   — pull file from clip to local
  clip <name> push <local-path> <remote>  — push local file to clip

Now the agent knows memory has five subcommands and clip supports list/pull/push. One call, no noise.

Level 2: command subcommand (missing args) → specific parameters

The agent decides to use memory search but isn't sure about the format? It drills down:


→ run(command="memory search")
[error] memory: usage: memory search <query> [-t topic_id] [-k keyword]

→ run(command="clip sandbox")
  Clip: sandbox
  Commands:
    clip sandbox bash <script>
    clip sandbox read <path>
    clip sandbox write <path>
  File transfer:
    clip sandbox pull <remote-path> [local-name]
    clip sandbox push <local-path> <remote-path>

Progressive disclosure: overview (injected) → usage (explored) → parameters (drilled down). The agent discovers on-demand, each level providing just enough information for the next step.
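
One way to get this drill-down behavior is to key usage strings by command path, as in the Go sketch below. The usage table and the run stub are hypothetical; only the two usage strings are taken from the examples above, and the "empty command" message is invented for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// usage maps an incomplete command path to the usage line it should return.
var usage = map[string]string{
	"memory":        "memory search|recent|store|facts|forget",
	"memory search": "memory search <query> [-t topic_id] [-k keyword]",
}

// run returns a usage error when the command stops at a known prefix;
// otherwise it would execute the command (stubbed out here).
func run(command string) string {
	fields := strings.Fields(command)
	if len(fields) == 0 {
		return "[error] empty command" // hypothetical message
	}
	key := strings.Join(fields, " ")
	if u, ok := usage[key]; ok {
		return fmt.Sprintf("[error] %s: usage: %s", fields[0], u)
	}
	return "(would execute: " + key + ")"
}

func main() {
	fmt.Println(run("memory"))
	fmt.Println(run("memory search"))
	fmt.Println(run("memory search deployment"))
}
```

Each incomplete call costs one short line of context, so the agent pays only for the branch of the command tree it actually explores.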

Technique 2: Error messages as navigation

Agents will make mistakes. The key isn't preventing errors; it's making every error point in the right direction.

Traditional CLI errors are designed for humans who can Google. Agents can't Google. So I require every error to contain both "what went wrong" and "what to do instead":


Traditional CLI:
  $ cat photo.png
  cat: binary file (standard output)
  → Human Googles "how to view image in terminal"

My design:
  [error] cat: binary image file (182KB). Use: see photo.png
  → Agent calls see directly, one-step correction

More examples:


[error] unknown command: foo
Available: cat, ls, see, write, grep, memory, clip, ...
→ Agent immediately knows what commands exist

[error] not an image file: data.csv (use cat to read text files)
→ Agent switches from see to cat

[error] clip "sandbox" not found. Use 'clip list' to see available clips
→ Agent knows to list clips first

Technique 1 (help) solves "what can I do?" Technique 2 (errors) solves "what should I do instead?" Together, they keep the agent's recovery cost minimal: usually 1-2 steps to the right path.
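
A Go sketch of errors that carry their own next step is shown below. It rests on several assumptions: check and unknownCommand are hypothetical names, image detection uses the file extension instead of inspecting content, the file size is omitted from the cat message, and the "empty command" message is invented for the sketch.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

// commands is a hypothetical registry of available command names.
var commands = map[string]bool{
	"cat": true, "ls": true, "see": true, "write": true,
	"grep": true, "memory": true, "clip": true,
}

// imageExt is a simplified stand-in for real content detection.
var imageExt = map[string]bool{".png": true, ".jpg": true, ".jpeg": true, ".gif": true}

// unknownCommand says what went wrong and lists what exists instead.
func unknownCommand(name string) string {
	names := make([]string, 0, len(commands))
	for n := range commands {
		names = append(names, n)
	}
	sort.Strings(names)
	return fmt.Sprintf("[error] unknown command: %s\nAvailable: %s", name, strings.Join(names, ", "))
}

// check returns an error message with a suggested next step, or "" if the
// command looks fine.
func check(command string) string {
	fields := strings.Fields(command)
	if len(fields) == 0 {
		return "[error] empty command" // hypothetical message
	}
	name := fields[0]
	if !commands[name] {
		return unknownCommand(name)
	}
	if len(fields) < 2 {
		return ""
	}
	path := fields[1]
	isImage := imageExt[strings.ToLower(filepath.Ext(path))]
	switch {
	case name == "cat" && isImage:
		return fmt.Sprintf("[error] cat: binary image file. Use: see %s", path)
	case name == "see" && !isImage:
		return fmt.Sprintf("[error] not an image file: %s (use cat to read text files)", path)
	}
	return ""
}

func main() {
	fmt.Println(check("cat photo.png"))
	fmt.Println(check("see data.csv"))
	fmt.Println(check("foo"))
}
```

Every message names the exact command to try next, so the agent's correction is a copy of the hint rather than a guess.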

Real case: The cost of silent stderr


pip install         → 127  (doesn't exist)
python3 -m pip      → 1    (module not found)
uv pip install      → 1    (wrong usage)
pip3 install        → 127
sudo apt install    → 127
... 5 more attempts ...
uv run --with pymupdf python3 script.py → 0 ✓  (10th try)

10 calls, ~5 seconds of inference each. If stderr had been visible the first time, one call would have been enough.

stderr is the information agents need most, precisely when commands fail. Never drop it.

Technique 3: Consistent output format

The first two techniques handle discovery and correction. The third lets the agent get better at using the system over time.

I append consistent metadata to every tool result:


file1.txt
file2.txt
dir1/
[exit:0 | 12ms]

The LLM extracts two signals:

Exit codes (Unix convention, LLMs already know these):

exit:0 — success

exit:1 — general error

exit:127 — command not found

Duration (cost awareness):

12ms — cheap, call freely

3.2s — moderate

45s — expensive, use sparingly

Consistent output format makes the agent smarter over time. Inconsistency makes every call feel like the first.


