🧠 ATou Study · 💰 Investing

A Hands-On Guide to Building an AI Data Agent in 2026

Replace the static semantic layer with dynamic context management and a knowledge base of user corrections, let the AI agent read the codebase directly and understand data on demand, and you can shrink a data-analyst hiring plan from 5 people to 1. But this scheme's general applicability, and how cheap it really is, are seriously overestimated.

2026-03-13

Core Takeaways

  • A paradigm shift from upfront modeling to dynamic understanding. The traditional data stack requires analysts to pre-define every metric, dimension, and relationship (the semantic layer), which is expensive to maintain and quick to go stale. The new approach has a context agent read the dbt DAG and application code at query time and assemble context on demand, so the system scales naturally with the codebase and nobody has to anticipate every question.
  • Turning tacit business knowledge into an asset. A "quirks" knowledge base distills user corrections into retrievable assets (e.g. "dedup the orders table on order_id"). Messy logic that used to live only in veteran employees' heads becomes institutional memory, every correction adds long-term value, and the system jumps from hand-maintained to self-improving.
  • A fault-tolerance loop of multi-dimensional self-scoring plus gap repair. The agent scores itself on structural correctness, execution reliability, and context alignment, produces a "context gap" brief exposing the parts with weak evidence, then repairs in layers (missing logic: retry with a revised prompt; context mismatch: send the scout agent back to fill in). This is a step above simply piling on documentation.
  • A "scout before you act" agent architecture. Before the main agent writes any SQL, a cheap context agent is dispatched to survey the terrain, which stops the large model from hallucinating table structures. This sub-agent plus main-agent decoupling transfers to any complex code-generation scenario.
  • Team roles shift from "query dispenser" to "knowledge-system maintainer." Analyst headcount drops from 4-5 to 1, but that one person's work changes from answering repetitive questions to defining core KPIs, reviewing quirks, and handling high-risk edge cases. In essence, team value moves from output to guardrails and knowledge maintenance.

What This Means for Us

  • For ATou: this system directly threatens the data analyst's "translator" role (repetitive data pulls get taken over by AI), but it also creates new high-value work (maintaining the knowledge system, handling edge cases). The next step is to proactively learn how to collaborate with AI, how to define and maintain core metrics, and how to review the business rules the AI learns, rather than passively wait to be replaced.
  • For Neta: this is a product-philosophy case study in when to abandon perfect upfront design and embrace shades of gray: hand-maintain only the core 20% of KPI definitions and let AI exploration plus user corrections cover the 80% long tail. This 80/20 balancing act applies directly to the reliability-versus-maintenance-cost tradeoff in any AI product.
  • For Uota: the moat in this architecture is not the model choice but the quality and depth of the knowledge base. Whoever most completely distills a company's tacit business rules, messy data logic, and metric definitions into retrievable assets can offer the most reliable data agent. Next steps: figure out how to accumulate that knowledge base automatically, how to prevent knowledge pollution, and how to keep human-reviewed items weighted above the AI's self-learned ones.
  • For decision-making in general: this case shows AI not simply replacing people but reshaping workflows and org structure. The same business goal can be met with far less headcount under an AI-driven architecture, provided you are willing to redesign processes, accept new maintenance costs, and trust the system's capacity to adapt.

Discussion Starters

  • The author claims the need for a semantic layer has "completely collapsed," but the quirks knowledge base and KPI definitions he proposes are essentially a fragmented, vector-retrieval-based, unstructured semantic layer reinvented. Is that progress, or a regression under a new name?
  • The system depends heavily on users spotting and correcting the AI's mistakes ("this table has duplicate rows"), but business users typically care only about results and know nothing about the underlying tables. If the generated SQL is subtly wrong and the numbers "look reasonable," how would a user ever correct it? Does this assumption hold up in practice?
  • Is the "friend's company" case representative? Can a startup or lightweight data scenario be extrapolated to mid-size and large enterprises, especially those with strict data governance, compliance requirements, and complex cross-department metric alignment?


I built the initial data stack at Notion in 2020, when the "Modern Data Stack" was first becoming a thing, and have spent some time over the last year consulting for or helping friends' companies with the next revolution: the "AI Data Stack". Here is a how-to guide for building a modern AI data agent that can 5x+ your data team's bandwidth. This particular build took me about 3 weeks end-to-end, including the third-party integrations (Slack/Notion) and the internal admin dashboard to manage it.


From the Modern Data Stack to the AI Data Stack

The "modern data stack" was about making your data accessible (Fivetran/Airbyte/etc), organized (DBT), queryable (Snowflake/Bigquery), and available anywhere (Census/Hightouch). The modern data stack made data accessible to analysts. The AI data stack makes data accessible to everyone, by replacing the human translation layer (write SQL, build dashboard, schedule report, interpret results) with agents that go from question to answer automatically.

Advanced companies were doing automated reporting with AI starting in late-2024/early-2025, layering models like GPT-4o on top of DBT MetricFlow or Cube.dev, but it took a lot of handholding and careful semantic-layer management. With the release of Opus 4.6, and specifically Opus 4.6 (even though it loses pitifully on coding tasks vs Codex 5.3 xhigh, it crushes Codex 5.3 xhigh in side-by-side testing for data analysis), the need for an advanced semantic layer has completely collapsed (which is great, because it was a pain in the ass to begin with) and can now be entirely replaced with context management.


Killing the Semantic Layer with Context Management

The semantic layer is a static, hand-maintained mapping between business concepts and SQL (metrics, dimensions, relationships). It is a pain in the ass to build and a pain in the ass to maintain. Context management is the inverse: instead of pre-defining everything the agent might need, you give it access to the raw source of truth (your DBT models, your app code) and let it investigate on demand, per-question.

Before writing any SQL, you just dispatch a sub-agent (I call it a context agent) to read the actual transformation logic in the dbt models and trace ref() dependencies upstream through the DAG. If you annotate your DBT files with metadata links into application code (highly recommended, and easy to automate) it will also validate business rules. It builds a brief for itself: relevant tables, column definitions, join paths, filters, dedup logic, caveats. This is the work a semantic layer was supposed to pre-compute, but now it happens dynamically at query time.
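The dependency-tracing half of that context agent is mostly file reading. Here is a minimal sketch, assuming a local checkout of the dbt project; the file layout and regex are illustrative, not from the original build:

```python
import re
from pathlib import Path

# Matches {{ ref('model_name') }} in dbt SQL, single or double quotes.
REF_RE = re.compile(r"\{\{\s*ref\(\s*['\"]([\w.]+)['\"]\s*\)\s*\}\}")

def trace_refs(model_name: str, models_dir: Path, seen=None) -> dict[str, str]:
    """Walk ref() dependencies upstream through the DAG, collecting each
    model's full SQL so the agent can read the actual transformation
    logic, not just file headers."""
    if seen is None:
        seen = {}
    if model_name in seen:
        return seen
    matches = list(models_dir.rglob(f"{model_name}.sql"))
    if not matches:
        return seen  # a source table or missing model: nothing to read
    sql = matches[0].read_text()
    seen[model_name] = sql
    for upstream in REF_RE.findall(sql):
        trace_refs(upstream, models_dir, seen)
    return seen
```

The returned mapping is what the context agent would summarize into its brief (tables, join paths, filters, dedup logic, caveats).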

The other great part about context management compared to the semantic layer is that it's adaptive. When a user corrects the agent ("no, that metric needs this filter" or "that table has duplicate rows, you need to dedup on X"), you can extract those corrections into durable, reusable knowledge that gets retrieved on future questions. This is the semantic layer rebuilding itself from usage, rather than requiring an analyst to maintain it.

Semantic layers were brittle: they required constant maintenance, couldn't cover edge cases, and created a bottleneck where the data team had to anticipate every question. Context management scales with the codebase (because it reads the codebase) and with usage (because it learns from corrections and input from the team).


Basic Setup

  1. Set up an agentic loop on your host machine. I recommend the Claude Agent SDK since Opus 4.6 > Codex 5.3 xhigh when it comes to data.

  2. Ensure your dbt repo and codebase (or codebases) are accessible from the agentic loop.

  3. Annotate your dbt base models (usually stg_) with links to application code and a high-level description of their functionality. This can mostly be done automatically.

  4. Build out a context sub-agent that runs before SQL generation. When a question comes in, before your main agent writes any SQL, dispatch a sub-agent whose only job is to investigate the data landscape. It reads the relevant model files (the full transformation logic, not just headers), traces upstream dependencies, checks application code if annotations exist, and returns a structured brief: tables, columns, join paths, filters, dedup rules, caveats. This brief gets injected into the main agent's context. The sub-agent is cheap (it's mostly reading files) and it's what keeps the main agent from hallucinating table structures.

  5. Build a knowledge store for learned corrections, or "quirks." When a user corrects the agent, extract the correction into a durable "quirk," a short, reusable piece of knowledge like "the orders table has duplicate rows per order when there are multiple shipments; always dedup on order_id." Store these with embeddings for semantic search. On each new question, run hybrid retrieval (vector similarity + keyword search) against your quirk store and inject the top matches alongside the context agent's brief. Over time this becomes your institutional knowledge base, the stuff that used to live in one analyst's head. I use Postgres with pgvector for vector similarity, OpenAI text-embedding-3-small for embeddings, and pg_textsearch for BM25 keyword search. It works fine.

  6. Add human-authored metric definitions for your core KPIs, the metrics that get asked about constantly and have precise definitions. Let your data team author structured definitions with inference guidance (how to calculate it, what filters apply, etc). These go into the same knowledge store and get retrieved the same way as quirks. Think of this as the 20% of the semantic layer that was actually useful, maintained by humans when they feel like it rather than required for the system to function.

  7. Connect this all to Slack (ensuring you set up multi-threading for the agents) and build a basic admin UI dashboard to monitor ongoing usage/results and enable editing of the knowledge store.
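Steps 5 and 6 can be sketched as an in-memory stand-in for the Postgres/pgvector + BM25 setup described above. The toy embeddings, the 0.7/0.3 blend, and the `reviewed` boost are illustrative assumptions, not the author's production values:

```python
import math
from dataclasses import dataclass

@dataclass
class Quirk:
    text: str
    embedding: list[float]   # in production: OpenAI text-embedding-3-small
    reviewed: bool = False   # human-approved items should outrank self-learned ones

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query_text, query_emb, quirks, top_k=3, reviewed_boost=1.5):
    """Blend vector similarity with keyword overlap, boosting reviewed items."""
    q_words = set(query_text.lower().split())
    scored = []
    for q in quirks:
        vec = cosine(query_emb, q.embedding)
        kw = len(q_words & set(q.text.lower().split())) / max(len(q_words), 1)
        score = (0.7 * vec + 0.3 * kw) * (reviewed_boost if q.reviewed else 1.0)
        scored.append((score, q))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:top_k]]
```

The top hits are what get injected into the main agent's context alongside the context agent's brief, subject to the fixed context budget discussed under Advanced Tuning.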


Advanced Tuning

Once the basic loop is working (context agent investigates, main agent writes SQL, answers come back) you're going to notice that sometimes the SQL is wrong in ways the agent doesn't catch. It'll run a query that executes fine but answers a slightly different question than what was asked, or it'll miss a filter, or it'll join on the wrong key and silently double-count. The agent doesn't know it got it wrong, so it confidently delivers a bad answer. A naive way to improve this would be to brute force adding more metrics definitions and quirks, or to add more documentation to your DBT files, but there's an elegant solution that's actually much simpler.

The fix is to make the agent score its own work. Here is how it works:

  1. It scores every SQL query on three dimensions: structural correctness, execution reliability, and context alignment. This uses deterministic signals plus Haiku for semantic/context checks (including subquestion coverage), with deterministic fallback if Haiku is unavailable.

  2. It then builds a context-gap brief from those scored queries and their evidence. The brief breaks the question into subquestions, marks what is fully supported by high-confidence evidence, and flags missing dimensions, entities, time windows, and business logic.

  3. That brief drives the recovery loop. If unresolved gaps remain and there is enough SQL signal, the agent retries with a targeted prompt focused on specific gaps; if the issue looks like schema/context mismatch, it reruns context enrichment first.
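The routing logic in step 3 can be sketched as follows. The score fields, threshold, and action names are hypothetical, not from the original system:

```python
from dataclasses import dataclass

@dataclass
class QueryScore:
    structural: float   # did the SQL parse and reference real tables?
    execution: float    # did it run without errors?
    alignment: float    # does it answer the question that was asked?

def next_action(score: QueryScore, unresolved_gaps: bool,
                sql_threshold: float = 0.6) -> str:
    """Route the recovery loop: retry with a targeted prompt when the SQL
    itself is sound, re-run context enrichment when it looks like a
    schema/context mismatch, otherwise deliver the answer."""
    if not unresolved_gaps:
        return "deliver"
    sql_signal = (score.structural + score.execution) / 2
    if sql_signal >= sql_threshold and score.alignment < 1.0:
        # The SQL is structurally fine; the gap is in what it asks.
        return "retry_with_targeted_prompt"
    # Weak SQL signal suggests the agent misread the schema/context.
    return "rerun_context_enrichment"
```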

For retrieval tuning, your hybrid search needs a few knobs. Collection weights let you balance how much to favor metrics vs. quirks in results. A reviewed-item multiplier ensures human-approved knowledge ranks higher than stuff the agent learned on its own (which may or may not be right). Use reciprocal-rank fusion to blend your vector and BM25 candidates into a single ranked list. Set a fixed context budget so you're not stuffing the entire knowledge store into every prompt, which also destroys performance.
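Reciprocal-rank fusion itself is small enough to show directly. This is a generic sketch; the `k=60` constant is the conventional choice from the RRF literature, and the per-list weights correspond to the collection weights mentioned above:

```python
def rrf_fuse(ranked_lists, k=60, weights=None):
    """Reciprocal-rank fusion: each candidate earns weight / (k + rank)
    from every list it appears in; summed scores give the final order."""
    weights = weights or [1.0] * len(ranked_lists)
    scores: dict[str, float] = {}
    for w, ranked in zip(weights, ranked_lists):
        for rank, item in enumerate(ranked, start=1):
            scores[item] = scores.get(item, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

An item near the top of both the vector list and the BM25 list beats an item that tops only one of them, which is exactly the behavior you want from the blend.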


Feel the AGI

This setup reduced what we thought was going to be an internal hiring plan for 4-5 data analysts at a friend's company to only hiring one. Customer Success and Sales can self-serve data in Slack now, even for complex questions. Product/Growth can self-serve data in Slack now and instantly save it to Notion. Welcome to the future.

Link: http://x.com/i/article/2029702407823970306



📋 Discussion Archive

Discussion in progress…