🧠 阿头学 · 💬 Discussion Topic

An LLM Personal Knowledge Base Is No Toy, but "Fully Autonomous" Is Overstated

Using an LLM as a "knowledge compiler" to build a local Markdown wiki is a genuinely valuable workflow, but the author leaps from "it runs" straight to "it can be autonomous and productized," which is clearly over-optimistic.
2026-04-03

Key Points

  • From Q&A to compilation: the author's real proposal is not "use an LLM to look things up" but "use an LLM to continuously compile raw material into a Markdown wiki." This judgment holds, because an archivable, traceable, incrementally enriched knowledge asset clearly compounds better than one-off conversations.
  • At small-to-medium scale, light RAG can beat heavy architecture: the author argues that at roughly 100 articles and 400K words there is no need for complex RAG up front. In a personal research setting this is very likely true; extrapolated to larger corpora or stricter accuracy requirements, the conclusion no longer stands.
  • Obsidian is only the frontend; the LLM is the primary author: the article's most radical and most consequential claim is that the wiki is written and maintained almost entirely by the LLM. This is tempting for efficiency but high-risk for reliability, because once an early summary is wrong, downstream links, answers, archives, and checks are all contaminated by it.
  • Outputs flow back, forming a knowledge flywheel: having the LLM emit Markdown, slides, and charts directly, then archiving them back into the wiki, is a strong closed-loop judgment. It turns an "answer" into an "asset," raises the value of later reuse, and comes closer to a real research workflow.
  • The product opportunity is real, but the bar is underestimated: "there is room for an incredible new product" is no fantasy, because the ingest-compile-query-archive loop has more product depth than a chat box. The author, however, clearly underestimates cost control, error governance, access management, and usability for ordinary users, which are exactly what decide whether it can be productized.

Relevance to Us

1. What it means for ATou: ATou should treat AI not merely as an answering machine but as a knowledge-asset organizer. The next step is to pick one vertical topic and run a minimal experiment on the raw/ → wiki → output-reflow loop, to test whether the "compounding" effect is real.
2. What it means for Neta: the article reinforces the product judgment that the knowledge-compilation layer matters more than the chat layer. The next step is to focus on summarization, indexing, linking, version checking, and output archiving, rather than racing first to build a fancier agent shell.
3. What it means for Uota: it suggests that Uota-style systems doing only retrieval Q&A will commoditize quickly; real differentiation lies in "automatically maintaining knowledge structure." The next step is to test which content must be human-reviewed and which can be fully auto-generated, to avoid closing the loop on errors.
4. What it means for all three: the most valuable lesson is not Obsidian or Markdown but the habit of condensing exploration into reusable artifacts. The next step is to establish mandatory source attribution and spot-check review; otherwise a knowledge base that "gets stronger with use" can easily become one that gets dirtier with use.

Discussion Prompts

1. If the knowledge base is written and maintained almost entirely by the LLM, are humans really "saving effort," or outsourcing errors to a system that is harder to audit?
2. At 100 articles and 400K words, is skipping complex RAG a pragmatic choice, or has the corpus simply not grown large enough to expose the problems?
3. Is the truly productizable part "the LLM auto-organizing knowledge," or "condensing a knowledge workflow into a visible, reflowable asset system"?

Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:

Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.
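
The post treats "compile" loosely: summarize whatever is new in raw/ into wiki pages. As a minimal sketch of the bookkeeping that keeps this incremental (the directory layout and the "same file stem means compiled" convention are my assumptions, not details from the post):

```python
from pathlib import Path

def find_uncompiled(raw_dir: str, wiki_dir: str) -> list[str]:
    """List raw/ documents that do not yet have a wiki page.

    Assumed convention: a raw file foo.pdf or foo.md counts as
    "compiled" once some wiki/**/foo.md exists. Feeding only the
    returned files to the LLM keeps each compile pass incremental.
    """
    compiled = {p.stem for p in Path(wiki_dir).rglob("*.md")}
    return sorted(
        str(p)
        for p in Path(raw_dir).rglob("*")
        if p.is_file() and p.stem not in compiled
    )
```

A deterministic diff like this also makes each LLM pass cheaper, since only new material needs to enter the context window.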

IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki; I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).

Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale.
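
A rough sketch of what such an auto-maintained index could look like as a deterministic helper (the one-summary-line-per-page format is my guess at a workable convention; the post does not specify one):

```python
from pathlib import Path

def build_index(wiki_dir: str, index_name: str = "index.md") -> str:
    """Rebuild a flat index: one line per wiki page, using the page's
    first non-heading line as its brief summary. An agent reads this one
    file first, then opens only the pages that look relevant, which is
    the "auto-maintained index instead of fancy RAG" pattern.
    """
    lines = ["# Index", ""]
    for page in sorted(Path(wiki_dir).rglob("*.md")):
        if page.name == index_name:
            continue  # never index the index itself
        summary = next(
            (ln.strip() for ln in page.read_text().splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")),
            "(no summary yet)",
        )
        lines.append(f"- [[{page.stem}]]: {summary}")
    text = "\n".join(lines) + "\n"
    (Path(wiki_dir) / index_name).write_text(text)
    return text
```

At ~100 articles the whole index fits comfortably in a context window, which is presumably why fancier retrieval never became necessary.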

Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.
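
For the slide output path, the only moving part is the Marp file format: a `marp: true` front matter block plus `---` as a slide separator. A minimal renderer might look like this (chunking the answer into heading/body sections is my assumption about how one would structure an LLM's output):

```python
def to_marp(title: str, sections: list[tuple[str, str]]) -> str:
    """Render an answer as a Marp-flavored markdown slide deck.

    Marp treats a file with `marp: true` in its front matter as slides
    and a standalone `---` line as a slide break; the resulting .md can
    be viewed as a deck in Obsidian via a Marp plugin or the marp CLI.
    """
    deck = ["---", "marp: true", "---", "", f"# {title}"]
    for heading, body in sections:
        deck += ["", "---", "", f"## {heading}", "", body]
    return "\n".join(deck) + "\n"
```

Because the deck is plain markdown, filing it back into the wiki costs nothing: it is immediately searchable and linkable like any other page.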

Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searchers), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.
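
The post's health checks are LLM-driven, but some of them can be made deterministic and run for free on every pass. One concrete example (my addition, in the spirit of the "find inconsistent data" check): flag `[[wikilinks]]` that point at pages which no longer exist.

```python
import re
from pathlib import Path

def broken_links(wiki_dir: str) -> dict[str, list[str]]:
    """Map each wiki page to the [[wikilink]] targets it references
    that have no corresponding .md page. Catches drift the moment a
    page is renamed or deleted, before the LLM compounds the error.
    """
    pages = {p.stem for p in Path(wiki_dir).rglob("*.md")}
    report: dict[str, list[str]] = {}
    for page in Path(wiki_dir).rglob("*.md"):
        # Capture the target part of [[target]], [[target|alias]], [[target#heading]].
        targets = re.findall(r"\[\[([^\]|#]+)", page.read_text())
        missing = [t.strip() for t in targets if t.strip() not in pages]
        if missing:
            report[page.name] = missing
    return report
```

A cheap pass like this is a useful complement to the LLM checks: its findings are exact, so they can gate the fuzzier, more expensive cleanup runs.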

Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web ui), but more often I want to hand it off to an LLM via CLI as a tool for larger queries.
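
A "small and naive search engine" over a wiki of this size really can be a linear scan; no index persistence, stemming, or embeddings needed. A sketch in that spirit (the scoring scheme is mine, not the post's):

```python
import re
from collections import Counter
from pathlib import Path

def search(wiki_dir: str, query: str, k: int = 5) -> list[str]:
    """Rank wiki pages by summed term frequency of the query words.

    Deliberately naive: lowercase word match, no ranking model. At
    ~100 articles a full scan per query is fast enough, which is the
    post's point about not reaching for heavy RAG too early.
    """
    terms = [t.lower() for t in re.findall(r"\w+", query)]
    scores: Counter[str] = Counter()
    for page in Path(wiki_dir).rglob("*.md"):
        words = Counter(re.findall(r"\w+", page.read_text().lower()))
        score = sum(words[t] for t in terms)
        if score:
            scores[str(page)] = score
    return [path for path, _ in scores.most_common(k)]
```

Exposed as a CLI, the same function doubles as an LLM tool: the agent issues a query, gets back file paths, and opens only those pages.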

Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.

📋 Discussion Archive

Discussion in progress…