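🧠 阿头学 · 💬 Discussion topic

How Large Language Models Actually Work

Although this article accurately summarizes how the Transformer technology stack has converged, its fictional 2026 date seriously undermines its factual credibility, and it overstates the role of architecture while ignoring data moats.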
2026-06-09

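Key points

  • The architecture dividend is gone: Mainstream models have converged on the RoPE + SwiGLU + GQA combination, so architectural innovation alone is no longer the core of competition.
  • Knowledge is physically separated: FFN layers store factual knowledge while attention routes information, which gives model editing a physical basis.
  • Tokenization sets the cognitive boundary: Models process token IDs rather than text, which leaves basic reasoning tasks such as counting with a built-in weakness.
  • The timestamp destroys credibility: The article carries a 2026 publication date, dressing up predicted trends as established facts, which amounts to serious content fabrication.
  • Open source may not represent the whole industry: The author lets the conventions of LLaMA-family open models stand for the entire industry, overlooking architectural differences that closed models may have.

Relevance to us

  • For investors, the architecture story no longer works. The next step is to examine a project's data feedback loop rather than its technical white paper.
  • For product managers, prompt design has to avoid the "lost in the middle" effect. The next step is to place key instructions at the start and end of the context.
  • For engineers, the point is to understand how FFN layers store knowledge. The next step is to try techniques such as ROME for low-cost knowledge correction instead of full fine-tuning.
  • For decision makers, beware of the misleading idea that papers can be understood without math. The next step is to require the team to build up mathematical derivation skills to avoid false confidence.

Discussion starters

  • If architecture has fully converged, how can open models defend against the data moats of closed models?
  • Does claiming that you can read papers without touching the math lower the barrier, or does it create an illusion of understanding?
  • Can model-editing techniques such as ROME really modify facts safely without breaking the model?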

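Monday, June 1, 2026. About a 26-minute read.

This post is a walkthrough of how LLMs work. Modern LLMs are mostly built by stacking transformer blocks over and over, so once you understand the transformer machinery you have most of the picture.

I'll cover the core mechanisms inside modern transformer-based LLMs while staying away from the sticky math as much as possible. Don't get me wrong, the math is still worth learning, but this post can serve as an introduction.

Most modern LLMs share the same transformer-family skeleton. The differences come mainly from what data each one was trained on, how scale and configuration were chosen, and what further training was done afterwards. By the end, you should be able to read many modern LLM papers or model cards and know which piece of the architecture each section is talking about.

Here's the path:

  1. Tokens, how a string of text becomes a sequence of integers
  2. Embeddings, how those integers get meaning
  3. Positional encoding, how the model knows the original order of the tokens
  4. Attention, how tokens share information with each other
  5. Multi-head attention, how the model tracks many kinds of relationships at once
  6. Feed-forward network, where a large share of the model's stored structure lives
  7. Residual stream and layer normalization, why deep stacks can still be trained
  8. Predicting the next token, what the model actually outputs and how the generation loop works
  9. Architecture vs trained weights, what is broadly shared across modern LLMs and what differs

Image 1: Transformer pipeline from tokenization to next-token prediction

Tiny explainers appear throughout so that anyone can follow along, regardless of background.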


Tokenization

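Models don't read text directly. They read integer IDs, so some step has to turn your prompt into a sequence of integers.

That conversion step is called tokenization. A tokenizer takes a string and outputs a sequence of integers, each one pointing to an entry in a fixed vocabulary. Modern LLM vocabularies usually hold tens of thousands to a few hundred thousand entries.

Tiny explainer: token ID

A token ID is the integer the model uses to represent one vocabulary entry. The model works with that number, not the written word itself.

Tokens are usually not whole words. They are usually smaller subword pieces. For example, tokenization might be split into [“token”, “ization”], and running might be split into [“run”, “ning”]. The reason is efficiency. A whole-word vocabulary is too big and generalizes poorly to new words. A character-level vocabulary is too small and forces the model to learn even the simplest patterns from scratch. Subword tokenization sits in between. The most common pieces become single tokens, while rare or newly coined words are assembled from smaller pieces.

Tiny explainer: vocabulary

The vocabulary is the tokenizer's fixed list of pieces. Each piece has an ID, and the model can only directly receive IDs from that list.

This trade-off shows up in places people don't expect. The classic example is asking an LLM how many R's are in strawberry. Earlier LLMs often got it wrong. That doesn't mean the model can't count. The model isn't operating on letters directly. It processes token IDs that merely happen to spell out a word a human would break down letter by letter.

Image 2: Tokenization turns text into token IDs

Different model families use different tokenizers. GPT models use variants of Byte Pair Encoding. SentencePiece is common in LLaMA-style models. The choice affects compute, since fewer tokens usually means less work, and it affects things like multilingual coverage, but the basic shape is always the same. Text in, integers out.

Now that the prompt is a sequence of integers, the next step is to give those integers meaning.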


Embeddings

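A token ID like 1024 is just a row index. It has no meaning by itself. What gives it meaning is a giant table called the embedding matrix.

Every model has one. It has one row per vocabulary entry, and each row is a long vector of numbers. The length of each row is the model's hidden size. In many 7B-class models, each token corresponds to 4,096 numbers. Larger models usually use wider vectors.

Tiny explainer: vector

A vector is a list of numbers. In a transformer, every token becomes a vector so the model can do math with it.

When the tokenizer hands the model an integer, the model looks up the matching row in this table and uses that vector in place of the integer. That vector is the token's embedding. It is the model's internal representation of what the token means, learned during training.

Tiny explainer: embedding matrix

The embedding matrix is a lookup table. Token ID in, learned vector out.

These embeddings have an interesting property. Semantically similar tokens tend to end up at nearby positions in vector space. The vector for king is close to queen, and the vector for Paris is close to France. None of this is hard-coded. It emerges from training on enough text, because the model finds that placing tokens this way helps it predict text.

You can even do arithmetic on embeddings, and sometimes it works. The most famous example is king − man + woman ≈ queen. The geometry of embedding space really does carry semantic relationships, even though nobody told the model to build it that way.

Image 3: Embedding space analogy with semantic relationships

One thing needs to be clear. At this point every token has been replaced by its embedding, but the embedding alone says nothing about where the token sits in the sequence. The vector for dog is the same vector whether dog is the first word of the prompt or the fifth. That is a problem.

This is the gap positional encoding fills.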


Positional encoding

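Plain self-attention has no built-in representation of word order. Without some positional signal, it has no direct way to know whether dog came before bites or after it.

Word order changes meaning. So the model needs another piece: a way to inject each token's position into the computation.

Tiny explainer: positional encoding

Positional encoding is how the model gets order information. It tells the model where each token sits in the sequence.

The original transformer paper (Vaswani et al. 2017) gave each position its own pattern of numbers and added that pattern directly to each token's embedding before any other processing. Position 1 had one pattern, position 5 had another, and position 100 had yet another. The patterns came from sine and cosine waves at different frequencies. As a result, the embedding for dog at position 1 differed from the embedding for dog at position 5, because the position pattern added to it was different.

This worked, and sinusoidal encodings were chosen partly because they can extrapolate to sequence lengths not seen during training. But as models grew, additive position schemes still exposed two problems that became increasingly important.

First, the embedding had to carry both meaning and position in the same set of numbers. There is only so much you can pack in.

Second, learned absolute position embeddings in particular don't generalize cleanly. If training only covered prompts up to 2,048 tokens long, the model never saw position 5,000, and the embedding for that position was never learned in the same way.

Modern models mostly use a different scheme called Rotary Position Embeddings (RoPE). It was introduced by Su et al. in 2021 and is now used by LLaMA, Mistral, Gemma, Qwen, and most open-weight families. The intuition is that RoPE does not add position information to the token vector. Instead, it rotates the Query and Key vectors by an angle that depends on the token's position. A token at position 1 is rotated a little, and a token at position 100 is rotated more. When two tokens are later compared in attention, what matters is the difference between their Query and Key rotations, which encodes how far apart they are.

Tiny explainer: RoPE

RoPE stands for Rotary Position Embeddings. Instead of adding a position vector, it rotates the Query and Key vectors so that relative distance shows up during attention.

Image 4: Rotary position embeddings rotate vectors by position

The practical advantages are real. RoPE encodes relative position naturally, which is closer to what attention actually needs. It generalizes better to longer contexts, and it adds no extra parameters to the model.

Even with good positional encoding, modern LLMs still have a documented problem called lost in the middle (Liu et al. 2023). For very long prompts, models use information at the start and end more reliably than information buried in the middle. That is why prompt engineering tips such as putting important context first, or repeating key information at the end, really do help. The model does not treat every part of the prompt equally.

With token meaning and position both encoded, the next question is how tokens actually exchange information.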


Attention

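This is the mechanism that gave the architecture its name. Attention.

Inside every transformer layer, attention does one thing. It lets each token look at the other tokens it is allowed to see and decide which ones matter most for what comes next.

It does this by giving each token three roles at once. Each token is transformed into three new vectors called Query, Key, and Value (Q, K, V).

Tiny explainer: Q, K, V

Query means "what am I looking for," Key means "what can I be matched with," and Value is the information that gets copied over when the match is strong.

  • The Query asks what this token wants to find in other tokens
  • The Key states what this token offers to tokens looking at it
  • The Value carries what actually gets passed along once a match happens

The same token plays all three roles at the same time. The Q, K, V transformations come from learned matrices, so during training the model works out for itself what each token should look for and what it should offer.

Matching happens through a similarity score. Each token's Query is compared against the Key of every token it is allowed to see, using a scaled dot product. Intuitively, this measures how well the two vectors line up. The scaling step keeps the numbers stable before softmax.

Tiny explainer: dot product

A dot product is a simple way to score how aligned two vectors are. The more aligned, the stronger the match.

The match scores are then turned into weights with softmax. Softmax turns any set of numbers into a probability-like distribution that sums to 1. Tokens with higher match scores get higher weights, and those weights are used to take a weighted average of the value vectors.

Tiny explainer: softmax

Softmax turns raw scores into weights that add up to 1. High scores get large weights. Low scores get small weights.

An example. Take the sentence The cat that I saw yesterday was sleeping. When the model processes was, it has to work out who is doing the sleeping. The Query vector for was is compared, one by one, against the Key vectors of the tokens it is allowed to see. Its dot product with cat is relatively high, because the model has learned that a verb like was needs a subject and that subjects like cat produce Key vectors that line up well with it. Its dot product with yesterday is low. Softmax turns these scores into weights, with a high weight for cat and a low one for yesterday. The model then takes a weighted sum of the corresponding value vectors, so the value for cat dominates the result. The new representation of was is therefore shaped mostly by the value of cat. That is how a token several positions back becomes what the current word refers to.

GPT-style language models have one more specific constraint: they generate text left to right. A token at position 5 can only attend to positions 1 through 5. It cannot attend to positions 6, 7, or 8, because that content has not been generated yet. This is called causal masking. The implementation is simple: the match scores for future tokens are pushed so low that they get effectively zero weight after softmax.

Tiny explainer: causal masking

Causal masking blocks out future tokens. It keeps a decoder-only language model from peeking ahead while predicting the next token.

Image 5: Attention heatmap showing causal masking and high attention to cat

One interesting finding from interpretability research concerns a specialized kind of attention head called induction heads, found by Anthropic in 2022. These heads learn to spot patterns of the form A B … A in the prompt and predict that B comes next. When the model sees A for the second time, the induction head looks back to where A last appeared, checks what followed it, and copies that over. They are among the clearest known mechanisms behind in-context learning, the ability of an LLM to pick up a pattern from your prompt on the spot and continue it.

Tiny explainer: induction head

An induction head is an attention head that notices repeated patterns in the prompt and helps the model continue them.

Attention has one big cost. In full attention, each token is compared against all the tokens it is allowed to see, so doubling the prompt length roughly quadruples the work. This is why long prompts are expensive to run, and why so much recent research looks for ways to make attention more efficient, such as FlashAttention, sparse attention, and linear attention.

But a single attention head gives the model only one learned view of those relationships.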


Multi-head attention

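A single attention computation lets the model learn only one way of judging which tokens matter to which. That is nowhere near enough. Language has many kinds of relationships going on at once. Subject-verb agreement. Pronouns and the names they refer to. Long-range references across sentences. Word order and local phrase relations.

Multi-head attention solves this by running attention many times in parallel, with each parallel path working in its own smaller space. Each parallel path is called a head.

Tiny explainer: attention head

An attention head is one independent attention computation with its own learned projection matrices.

One point here is often explained wrong, including in many tutorials. Each head does not get a fixed slice of the original token vector. Each head has its own learned projection matrices that map the full token vector into its own smaller Q, K, and V vectors. So if a model has 4,096 numbers per token and 32 heads, each head typically works in a 128-dimensional space, but those 128 numbers are a learned projection of the full 4,096 dimensions, not a rigid slice. They are different views of the same token, not different fragments of it.

Each head runs its own attention independently. The outputs of all heads are then concatenated and passed through a final linear layer that mixes them back into one full-size vector. How to do that mixing is also learned.

Image 6: Multi-head attention combines specialized attention heads

The interesting part is that different heads tend to become partly specialized. Nobody tells the model what each head should do. The specialization emerges during training. Researchers have found heads that track grammatical relations, such as linking a verb to its object or an article to the noun it modifies. Some heads work out which name a pronoun refers to. Some track positional patterns, and there are induction heads and many other types. A single transformer layer may have 32 heads. Modern frontier models have dozens of layers. So a typical LLM often has over a thousand attention heads in total, each contributing its own learned view.

There is also a practical cost problem that drove a recent architectural change. Every head has to keep the Key and Value vectors for all previously generated tokens in memory, so that the model does not have to recompute everything from scratch when a new token is generated. This is called the KV cache, and it is the dominant memory cost when running an LLM on long contexts.

Tiny explainer: KV cache

The KV cache stores old Key and Value vectors during generation. That way the model does not have to recompute the whole prompt every time it adds a token.

Most modern decoder-only LLMs use a variant called Grouped-Query Attention (GQA). Instead of every head having its own keys and values, a group of query heads shares one key/value head. LLaMA-2 70B has 64 query heads but only 8 key/value heads. Mistral 7B has 32 query heads and 8 key/value heads. The result is accuracy close to full multi-head attention with much lower memory pressure and inference cost.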
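To make the memory saving concrete, here is a rough back-of-the-envelope sketch. The 64 versus 8 head counts come from the text above. The layer count (80), head size (128), sequence length, and 16-bit storage are assumptions chosen for illustration, and `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache Keys and Values for one sequence."""
    # The leading 2 counts one Key and one Value vector per head per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shape: 80 layers, head size 128, 4,096 cached tokens, 16-bit values.
full_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)

print(f"one K/V head per query head: {full_mha / 2**30:.2f} GiB")
print(f"8 shared K/V heads:          {gqa / 2**30:.2f} GiB")
print(f"reduction factor:            {full_mha // gqa}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads, and the saving grows linearly with context length.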

Tiny explainer: GQA

Grouped-Query Attention lets several query heads share a smaller number of key/value heads. This keeps the variety of query views while cutting the memory footprint of the KV cache.


Feed-forward network

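After attention has mixed information between tokens, each layer has a second step that gets talked about far less. That step is the feed-forward network.

If attention is tokens talking to each other, the feed-forward network is more like each token doing another round of processing on its own. It acts on each token's vector independently, with no mixing between tokens.

The feed-forward network does three things in order:

  1. Expand the token's vector to a larger size. The original transformer used 4x, and modern SwiGLU models often use other expansion ratios
  2. Apply a non-linear function
  3. Compress the vector back to its original size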
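The three steps above can be sketched in a few lines. This is a toy illustration with hand-picked weights, a 2x expansion, ReLU as the non-linearity, and no bias terms. All names and numbers are assumptions for the example. A trained model learns its weights and typically uses GELU or SwiGLU.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def feed_forward(x, w_up, w_down):
    hidden = matvec(w_up, x)        # 1. expand to the larger size
    hidden = relu(hidden)           # 2. apply the non-linearity
    return matvec(w_down, hidden)   # 3. compress back to the original size


# Hypothetical weights: 2 dimensions expanded to 4, then compressed back to 2.
W_UP = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_DOWN = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

print(feed_forward([3.0, -2.0], W_UP, W_DOWN))
print(feed_forward([-3.0, 2.0], W_UP, W_DOWN))
```

With these particular weights the network returns the absolute value of each input, something no single matrix multiplication can compute. That is a small demonstration of what the non-linear step buys.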

Image 7: Feed-forward network expands, transforms, and compresses each token vector

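What the non-linear step in the middle does is quite specific and worth understanding. A non-linearity is a function that bends its input. The simplest example is ReLU, which outputs zero for negative numbers and passes positive numbers through unchanged.

Tiny explainer: non-linearity

The non-linearity is what keeps the whole network from collapsing into one giant linear transformation.

Without it, the FFN would just be two linear layers stacked together, and stacked purely linear operations collapse. Mathematically, two linear layers in a row are equivalent to one linear layer. A hundred linear layers in a row are still equivalent to one. The non-linearity is what prevents this collapse. It is also why the FFN can do richer things than a single matrix multiplication.

The original transformer used ReLU. GPT and BERT switched to GELU. Modern models such as LLaMA, Mistral, and PaLM switched again to SwiGLU. The overall expand-then-compress structure has not changed. What keeps being iterated on is the non-linearity in the middle.

In dense transformer models, most parameters are not in attention but in the FFN. A large share of the weights sits in the feed-forward layers.

And these parameters are not just an undifferentiated pile of numbers. Much of the factual and semantic structure stored inside the model lives here. Researchers have found that certain neurons in the FFN correlate strongly with specific concepts or facts. One neuron might activate strongly on text about the Eiffel Tower. Another might activate on programming languages. Another on past-tense verbs. When the model knows that Paris is the capital of France, that fact is represented in the FFN weights and activations of particular layers.

This memory-like property has an interesting consequence. Researchers have found ways to directly modify certain facts in a trained model without retraining it. Methods such as ROME (Rank-One Model Editing) can change Eiffel Tower is in Paris to Eiffel Tower is in Rome by making one targeted low-rank edit to a specific FFN weight matrix. Afterwards the model tends to generate text consistent with the new association.

Some modern frontier models have started replacing the dense FFN with Mixture of Experts (MoE). Instead of one feed-forward network per layer, there are many parallel FFNs called experts, plus a small router network that decides which experts each token is sent to. Mixtral 8x7B has 8 experts per layer, and each token activates only 2 of them. Total parameter count rises substantially, but compute per token grows much more slowly because only a few experts actually run. This is how parameter count can keep scaling without inference cost scaling in proportion.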
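A minimal sketch of the routing idea described above, assuming top-2 selection with a softmax over the selected router scores (one common choice, not necessarily what any particular model does). The experts here are hypothetical stand-ins that simply scale their input. In a real model each expert is a full feed-forward network and the router scores come from a learned layer.

```python
import math


def route_top_k(router_scores, k=2):
    """Pick the k highest-scoring experts and softmax their scores into weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_scores[i] for i in chosen)
    exps = [math.exp(router_scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, experts, router_scores, k=2):
    """Weighted sum of the outputs of the chosen experts only."""
    output = [0.0] * len(x)
    for index, weight in route_top_k(router_scores, k):
        expert_out = experts[index](x)  # the other experts never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def make_expert(scale):
    """Hypothetical expert: multiplies every input value by `scale`."""
    return lambda vector: [scale * v for v in vector]


experts = [make_expert(float(s)) for s in range(8)]
scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # made-up router scores

print(route_top_k(scores))
print(moe_layer([1.0, 1.0], experts, scores))
```

Only two of the eight experts are evaluated for this token, which is why total parameters can grow faster than per-token compute.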

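Tiny explainer: MoE

Mixture of Experts means the model has several feed-forward networks, and each token is routed to only a few of them.

Mixtral 8x7B has 46.7 billion total parameters, but each token actually uses only about 12.9 billion of them. MoE has become a common choice for very large models because it lets parameter count keep growing without inference cost exploding in proportion.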


Residual stream and layer normalization

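The residual stream makes the model work by addition rather than replacement. After attention runs, or after the feed-forward network runs, the result usually does not replace the token's vector. It is added back, position by position. The new vector equals the old vector plus the output of that sub-block.

Tiny explainer: residual connection

A residual connection adds a block's output back onto the vector it started from. It gives information and gradients a shortcut through the network.

After thirty, fifty, or even a hundred layers, each layer's contribution has accumulated rather than simply overwriting the previous layer's vector. This running sum is called the residual stream. It has a rather curious property. The original input embedding keeps a direct additive path to later layers, even though along the way it is continually mixed with the contributions of each sub-block.

Image 8: Residual stream accumulates attention and feed-forward outputs

Residual connections were not invented for transformers. They come from ResNet, the architecture He et al. proposed for image recognition in 2015. The original motivation was direct: deep networks were simply hard to train. The training signal became too weak, and sometimes too strong, as it propagated back through many layers. The model could not really learn from its own mistakes. With a shortcut added, the signal can flow straight from output back to input. Networks hundreds of layers deep then became trainable. Transformers inherited the same trick.

In modern interpretability research, the residual stream has become the central object. Every component, every attention head, every feed-forward network, and even the final unembedding step reads from the residual stream and writes its result back to it.

The second piece, layer normalization, has a more practical justification. Without it, the residual stream cannot stay stable. After dozens of additions, the numbers either grow larger and larger or collapse toward zero. Either case makes training fail. Layer normalization rescales each token's vector back into a manageable range between sub-blocks.

Tiny explainer: layer normalization

Layer normalization rescales token vectors so that their numbers stay in a stable range during training.

The original 2017 transformer applied normalization after each sub-block, which is called post-norm. This works in shallow models, but training stability becomes harder and harder to maintain as depth increases. Modern transformers, from GPT-2 onward and including LLaMA and Mistral, usually apply normalization before each sub-block instead, which is called pre-norm. This is one of the changes that made very deep transformers easier to train.

The function itself has also changed. Many modern open models, such as LLaMA, Mistral, Gemma, and Phi, use a simpler variant called RMSNorm. The original layer normalization does two things: it first shifts each vector toward a zero center, then rescales the magnitudes. RMSNorm drops the shift and keeps only the scaling. Empirically, most of the benefit comes from the scaling, and it is cheaper to compute.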
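The difference between the two normalizations can be sketched side by side. This is a simplified illustration that leaves out the learned gain (and bias) parameters real layers carry. The `eps` value and the example vector are assumptions.

```python
import math


def layer_norm(x, eps=1e-6):
    """Shift to zero mean, then rescale to unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """Rescale by the root mean square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


token = [4.0, 6.0, 8.0, 10.0]  # hypothetical token vector

print(layer_norm(token))  # centered around zero
print(rms_norm(token))    # same signs and proportions as the input, smaller scale
```

RMSNorm skips computing and subtracting the mean, which is where the saving comes from.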

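Tiny explainer: RMSNorm

RMSNorm is a cheaper normalization method. It only rescales the vector's magnitude, without subtracting the mean first.

So that is the unglamorous machinery underneath. Without residual connections, very deep models would be much harder to train. Without layer normalization, the accumulating stream could blow up or collapse. With both, models a hundred layers deep become possible.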


Next-token prediction

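Once all the layers of attention and feed-forward processing are done, the model has one vector for every token in the sequence. During generation, to predict the next word, it takes only the final vector of the last token.

That final vector is converted into one number for every possible next token. If the vocabulary has 100,000 tokens, you get 100,000 numbers. These numbers are called logits. They are not probabilities yet. They can be any size, positive or negative.

Tiny explainer: logits

Logits are the raw scores for each possible next token. They only become probabilities after softmax.

Softmax then turns the logits into the model's probability distribution over all candidate next tokens. It is the same operation as before, just applied at a different place in the model.

The model usually does not simply pick the highest-probability token every time. Decoding settings control whether the output is more deterministic or more varied. Temperature changes how sharp the distribution is. Top-k and top-p restrict the choice to the most plausible candidates. This is why the same model can seem precise under one setting and more creative under another.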
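Here is a small sketch of how temperature and top-k interact with softmax. The logits are made-up numbers for a four-token vocabulary and `sample_top_k` is a hypothetical helper. Production decoders add top-p and other refinements not shown here.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Keep the k highest logits, renormalize, and sample one token ID."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for a 4-token vocabulary

print(softmax(logits, temperature=0.5))  # sharper: mass concentrates on token 0
print(softmax(logits, temperature=2.0))  # flatter: more variety when sampling
print(sample_top_k(logits, k=2, rng=random.Random(0)))  # always token 0 or 1
```

With `k=1` this reduces to always picking the highest-scoring token, which is the fully deterministic end of the range.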

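Tiny explainer: temperature

Temperature controls randomness during sampling. At low temperature the model is more conservative. At high temperature the output is more varied.

Once a token is chosen, it is appended to the input. The model then runs the next step on the longer sequence, usually reusing the KV cache so that it does not have to recompute the whole prefix. New attention for the new token. New feed-forward. New final vector. New prediction. The loop continues until the model emits an end-of-sequence token or hits a length limit. An entire passage of text is, in essence, produced by this loop one step at a time, one token per step.

This single objective, predicting the next token, is the core training signal of a base LLM. A base model is not trained directly for factual accuracy, conversational ability, reasoning, or coding. It is trained to predict the next token across a huge amount of text. Post-training then adjusts the model to follow instructions better, match preferences better, be safer, and behave more like a conversational assistant.

One more major efficiency innovation is worth knowing about: speculative decoding. A small, fast model proposes several tokens ahead, and the large model verifies them in parallel. If the proposed tokens also hold up under the large model's probability distribution, they are accepted directly. Otherwise generation falls back to the large model's own computation. Done correctly, the final output distribution matches what you would get from running the large model alone, but the loop can be much faster.

Tiny explainer: speculative decoding

Speculative decoding lets a smaller draft model guess ahead, then has the larger model verify several guessed tokens at once.

The next-token prediction loop is the simplest part of the whole architecture, yet it is what makes the entire system actually run.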


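Architecture vs trained weights

We have now walked through the core machinery: tokens, embeddings, positional encoding, attention, multi-head attention, the feed-forward network, the residual stream and normalization, and the next-token loop at the output. That is one complete pass through the base architecture.

So where do GPT, Claude, Gemini, and LLaMA really differ? The amount of public information varies, and closed models do not disclose every architectural choice. But at the level this post covers, they broadly fall within the same transformer-family design space.

Most modern transformer-based LLMs use the same overall structure: tokenization, embeddings, positional encoding, stacked transformer layers that each contain multi-head attention and a feed-forward network, plus the residual stream, layer normalization, and next-token prediction.

What varies between models is:

  1. The trained weights themselves, meaning the numbers learned from different training data at different scales
  2. Configuration, such as number of layers, vocabulary size, number of heads, parameter count, and whether the model is MoE or dense
  3. Post-training, including instruction tuning, learning from human feedback, and safety controls layered on top of the base model

Tiny explainer: weights

Weights are the learned numbers inside the model. Training keeps adjusting them until the model predicts text well.

Across many serious frontier and open-weight models, the modern transformer stack of 2023 to 2025 has gradually converged on a common set of choices, even though different teams arrived there independently. Pre-norm. RMSNorm. RoPE. SwiGLU. Grouped-Query Attention. And in some of the largest models, Mixture of Experts. These were not all invented at once. They accumulated bit by bit through roughly five years of refinement on top of the original 2017 design.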


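Where this is heading

This convergence on transformer-family architectures is rare in the history of machine learning. For most of the field's past, nearly every problem had its own specialized kind of network. One for image recognition. Another for language. A third for audio. Vision teams and language teams shared almost no methods.

Now transformer-style models appear across language, vision, audio, and multimodal systems. The transformer has swallowed a large part of the field.

This may change too. Mamba and other state-space models are a serious alternative, especially for very long sequences. Hybrid architectures are being explored. Mixture-of-experts has already changed what people at the frontier mean by the word architecture, a change that would have seemed quite exotic five years ago.

Still, the core mechanisms covered in this post, tokens, embeddings, positional encoding, attention, the feed-forward network, the residual stream and normalization, and next-token prediction, are the more durable part. Even if the architecture changes in the future, any sequence model has to solve these problems in some form.

If you have read this far, you can now go read many modern transformer papers or model cards and know which piece each section is about. That was the goal.

Feedback is very welcome. If anything here interests you, feel free to reach out to me on X. I enjoy meeting new people.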


Monday, June 1, 2026 - 26 min read

This post is a walkthrough of how LLMs work. Modern LLMs are mostly built by stacking transformer blocks over and over, so understanding the transformer machinery gets you most of the way there.

I’ll cover the core mechanisms inside modern transformer-based LLMs, without all that sticky math stuff. Don’t get me wrong, you should learn the math, but this can serve as an introduction.

Most modern LLMs share the same transformer-family skeleton. The differences come from what each one was trained on, the scale and configuration choices, and the post-training done on top. By the end, you should be able to read many modern LLM papers or model cards and know which piece of the architecture each section is talking about.

Here’s the path:

  1. Tokens, how a string of text becomes a sequence of integers
  2. Embeddings, how those integers get meaning
  3. Positional encoding, how the model knows what order the tokens came in
  4. Attention, how tokens share information with each other
  5. Multi-head attention, how the model tracks many kinds of relationships at once
  6. The feed-forward network, where a large share of the model’s stored structure lives
  7. The residual stream and layer normalization, what makes deep stacks trainable
  8. Predicting the next token, what the model actually outputs and how the generation loop works
  9. Architecture vs trained weights, what’s broadly shared across modern LLMs, and what’s different

Image 1: Transformer pipeline from tokenization to next-token prediction

Tiny explainers appear throughout so anyone can follow along, regardless of background.



Tokenization

Models don’t read text directly. They read integer IDs, so some step has to convert your prompt into a sequence of those integers.

That conversion step is called tokenization. A tokenizer takes a string and produces a sequence of integers, where each integer points to an entry in a fixed vocabulary. Modern LLM vocabularies usually contain tens of thousands to a few hundred thousand entries.

Tiny explainer: token ID

A token ID is the integer the model uses for one vocabulary entry. The model works with the number, not the written word itself.

Tokens aren’t usually whole words. They’re usually subword pieces. The word “tokenization” might split into [“token”, “ization”]. The word “running” might split into [“run”, “ning”]. The reason is efficiency. Whole-word vocabularies are too big and don’t generalize to new words. Character-level vocabularies are too small and force the model to learn even the simplest patterns from scratch. Subword tokenization sits in the middle. The most common pieces become single tokens, and rare or novel words get composed from smaller pieces.
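To make the subword idea concrete, here is a toy sketch. It uses greedy longest-match against a tiny hand-written vocabulary, which is a simplification and not how BPE or SentencePiece actually build or apply their vocabularies. `TOY_VOCAB` and `tokenize` are hypothetical names for illustration.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
TOY_VOCAB = {"token": 0, "ization": 1, "run": 2, "ning": 3, "straw": 4, "berry": 5}


def tokenize(text, vocab):
    """Greedy longest-match split of `text` into token IDs from `vocab`."""
    ids = []
    start = 0
    while start < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), start, -1):
            piece = text[start:end]
            if piece in vocab:
                ids.append(vocab[piece])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[start:]!r}")
    return ids


print(tokenize("tokenization", TOY_VOCAB))  # two IDs, one per subword piece
print(tokenize("strawberry", TOY_VOCAB))    # the model sees IDs, not letters
```

Once "strawberry" has become two integers, nothing in the input tells the model how many times a given letter appears.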

Tiny explainer: vocabulary

The vocabulary is the tokenizer’s fixed list of pieces. Each piece has an ID, and the model can only directly receive IDs from that list.

The trade-off shows up in places people don’t expect. The classic example: ask an LLM how many R’s are in “strawberry.” LLMs used to get it wrong. That’s not the model failing at counting. It’s the model not operating on letters directly, only token IDs that happen to spell out a word a human would split letter by letter.

Image 2: Tokenization turns text into token IDs

Different model families use different tokenizers. GPT models use Byte Pair Encoding variants. SentencePiece is common in LLaMA-style models. The choice matters for compute (fewer tokens means less work) and for things like multilingual coverage, but the basic shape is the same. Text in, integers out.

Now that the prompt is a sequence of integers, the next step is to give those integers meaning.



Embeddings

A token ID like 1024 is just a row index. It doesn’t mean anything by itself. The thing that gives it meaning is a giant table called the embedding matrix.

Every model has one. It has one row per entry in the vocabulary, and each row is a long vector of numbers. The length of each row is the model’s hidden size. In many 7B-class models, that means 4,096 numbers per token. Larger models usually use wider vectors.

Tiny explainer: vector

A vector is a list of numbers. In a transformer, each token becomes a vector so the model can do math with it.

When the tokenizer hands the model an integer, the model looks up that row and uses the vector instead. That vector is the token’s embedding. It’s the model’s representation of what that token “means,” learned during training.

Tiny explainer: embedding matrix

The embedding matrix is a lookup table. Token ID in, learned vector out.

The interesting property of these embeddings is that semantically similar tokens end up with similar vectors. The vector for “king” is close in space to the vector for “queen,” and the vector for “Paris” is close to “France.” None of this is hard-coded. It emerges from training on enough text, and the model learns these positions because they let it predict text well.

You can do arithmetic on embeddings and it sometimes works. The famous example is king − man + woman ≈ queen. The geometry of embedding space carries real semantic structure, even though nobody told the model to build it that way.
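The lookup and the analogy arithmetic can be illustrated with a toy table. The 2-D vectors below are chosen by hand so the analogy works exactly. They are assumptions for the example, not values from any trained model, where vectors have thousands of dimensions and analogies hold only approximately.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
EMBEDDINGS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(vector, exclude=()):
    """Word whose embedding is most similar to `vector`."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return max(candidates, key=lambda w: cosine(vector, EMBEDDINGS[w]))


def analogy(a, b, c):
    """Return the word closest to a - b + c, skipping the three inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    return nearest(target, exclude=(a, b, c))


print(analogy("king", "man", "woman"))
```

Excluding the input words from the search is a common convention in analogy demos; without it the nearest vector is often one of the inputs.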

Image 3: Embedding space analogy with semantic relationships

Worth being clear on: at this stage every token has been replaced by its embedding, but the embedding alone says nothing about where the token sits in the sequence. The vector for “dog” is the same vector whether “dog” is the first word in your prompt or the fifth. That’s a problem.

That’s the gap positional encoding fills.



Positional encoding

Plain self-attention doesn’t have a built-in representation of word order. Without some positional signal, it has no direct way to know that “dog” came before “bites” instead of after it.

Word order changes meaning. So the model needs another piece. It needs a way to inject the position of each token into the math.

Tiny explainer: positional encoding

Positional encoding is how the model gets order information. It tells the model where each token sits in the sequence.

The original transformer paper (Vaswani et al. 2017) did this by giving each position its own pattern of numbers and adding it directly to each token’s embedding before any other processing. Position 1 had one pattern, position 5 had a different pattern, position 100 had another. The patterns came from sine and cosine waves at different frequencies. Now the embedding for “dog” at position 1 was different from the embedding for “dog” at position 5, just because the position pattern added to it was different.
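As a rough sketch of the additive scheme, the snippet below builds the sine and cosine pattern for a position and adds it to an embedding. The formula follows the one given in Vaswani et al. 2017. The tiny 4-dimensional "dog" embedding is a made-up value for illustration.

```python
import math


def sinusoidal_position(pos, d_model, base=10000.0):
    """Position pattern: sine on even dimensions, cosine on odd ones."""
    pattern = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses a different wavelength.
        angle = pos / (base ** (i / d_model))
        pattern.append(math.sin(angle))
        pattern.append(math.cos(angle))
    return pattern[:d_model]


def add_position(embedding, pos):
    """Add the position pattern to a token embedding, element by element."""
    pattern = sinusoidal_position(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pattern)]


dog = [0.5, -0.25, 0.125, 0.75]  # hypothetical embedding for "dog"

print(add_position(dog, 1))  # "dog" at position 1
print(add_position(dog, 5))  # same token, different position, different vector
```

The same token now produces a different input vector at each position, which is how the model can tell positions apart.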

That worked, and sinusoidal encodings were chosen partly because they can extrapolate beyond the exact sequence lengths seen during training. But additive position schemes still had two problems that became important as models scaled up.

First, the embedding had to carry both meaning and position in the same set of numbers. There’s only so much you can pack in.

Second, learned absolute position embeddings in particular don’t generalize cleanly. If you trained on prompts up to 2,048 tokens long, the model never saw position 5,000 during training, and the embedding for that position was not learned in the same way.

Modern models mostly use a different scheme called Rotary Position Embeddings (RoPE), introduced by Su et al. in 2021 and now used in LLaMA, Mistral, Gemma, Qwen, and most other open-weight families. The intuition: instead of adding position info to each token’s vector, RoPE rotates the Query and Key vectors by an angle that depends on the token’s position. A token at position 1 gets a small turn, a token at position 100 gets a bigger turn. When two tokens are later compared during attention, what matters is the difference between their Query and Key rotations, which encodes how far apart they are.
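The relative-position property can be checked numerically. This sketch rotates each adjacent pair of dimensions by an angle proportional to the position. Pairing adjacent dimensions is one convention (some implementations pair the first and second halves of the vector instead), and the Query and Key values are made up.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)  # later pairs rotate more slowly
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def score(query, key, q_pos, k_pos):
    """Dot product of a rotated Query and a rotated Key."""
    q = rope_rotate(query, q_pos)
    k = rope_rotate(key, k_pos)
    return sum(a * b for a, b in zip(q, k))


q = [1.0, 0.5, -0.3, 0.8]  # hypothetical Query vector
k = [0.2, -0.7, 0.9, 0.4]  # hypothetical Key vector

# Both pairs of positions are 3 apart, so the two scores agree.
print(score(q, k, 5, 2), score(q, k, 105, 102))
# A different distance gives a different score.
print(score(q, k, 5, 4))
```

Shifting both tokens by the same amount leaves the score unchanged (up to floating-point rounding), which is the sense in which RoPE encodes relative rather than absolute position.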

Tiny explainer: RoPE

RoPE stands for Rotary Position Embeddings. Instead of adding a position vector, it rotates Query and Key vectors so relative distance shows up during attention.

Image 4: Rotary position embeddings rotate vectors by position

The practical advantages are real. RoPE encodes relative position naturally (which is closer to what attention actually wants). It generalizes better to longer contexts. And it doesn’t add new parameters to the model.

Even with good positional encoding, modern LLMs have a documented “lost in the middle” problem (Liu et al. 2023). They use information at the start and end of long prompts more reliably than information buried in the middle. That’s why prompt engineering tips like “put important context first” or “repeat key info at the end” actually help. The model isn’t using every part of your prompt equally well.

With token meaning and position both encoded, the next question is how tokens actually exchange information.



Attention

This is the mechanism that gave the architecture its name. Attention.

Inside every transformer layer, attention does one thing. It lets each token look at the other tokens it is allowed to see and decide which ones matter for what comes next.

It does this by giving each token three roles at once. Each token gets transformed into three new vectors, called Query, Key, and Value (Q, K, V).

Tiny explainer: Q, K, V

Query means “what am I looking for,” Key means “what do I match with,” and Value is the information that gets copied when the match is strong.

  • The Query asks, “what am I looking for from other tokens?”
  • The Key says, “this is what I offer to tokens looking at me.”
  • The Value carries, “this is what gets passed along when a match happens.”

The same token plays all three roles at the same time. The Q, K, V transformations are learned matrices, so the model figures out during training what each token should look for and what it should offer.

Matching happens through a similarity score. Each token’s Query is compared against the Key of each token it is allowed to see, using a scaled dot product. Intuitively, this measures how much the two vectors line up. The scaling keeps the numbers stable before softmax.

Tiny explainer: dot product

A dot product is a simple way to score how aligned two vectors are. Higher alignment means a stronger match.

The match scores then get turned into weights using softmax. Softmax takes any set of numbers and turns them into a probability-like distribution that sums to 1. Tokens with higher match scores get higher weights, and the weights are then used to take a weighted average of the value vectors.
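Putting the pieces together, a single-query version of scaled dot-product attention fits in a few lines. The vectors are invented for illustration. In a real model the Query, Key, and Value vectors come from learned projections and the computation is batched over all tokens and heads.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention over the visible tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors, one output dimension at a time.
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return weights, mixed


# Hypothetical vectors: one Key lines up with the Query, the other does not.
query = [2.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

weights, mixed = attend(query, keys, values)
print(weights)  # most of the weight lands on the first token
print(mixed)    # so the output is dominated by the first value vector
```

The output is a blend of value vectors, with the blend ratio set entirely by how well each Key matches the Query.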

Tiny explainer: softmax

Softmax turns raw scores into weights that add up to 1. Big scores get big weights, small scores get small weights.

An example. Consider the sentence “The cat that I saw yesterday was sleeping.” When the model processes “was,” it needs to figure out what’s doing the sleeping. The Query vector for “was” gets compared against the Key vectors of the tokens it is allowed to see. The dot product with “cat” is high, because the model has learned that verbs like “was” need a subject and that subjects like “cat” produce Key vectors that line up well. The dot product with “yesterday” is low. Softmax turns those scores into weights, “cat” gets a high weight, “yesterday” gets a low one. The model then takes a weighted sum of the corresponding value vectors, so the value for “cat” dominates the result. The new representation of “was” is now mostly shaped by the value of “cat.” That’s how a token several positions back becomes the referent.

There’s a constraint specific to GPT-style language models, which is that they generate text left to right. A token at position 5 is only allowed to attend to positions 1 through 5. It cannot attend to tokens at positions 6, 7, 8, because those haven’t been generated yet. This is called causal masking. The implementation is simple: future tokens get match scores so low they end up with effectively zero weight after softmax.
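A minimal sketch of the masking step, assuming raw match scores are already available as a square table. Positions here are 0-indexed, unlike the 1-indexed positions in the prose, and the scores are set equal so the effect of the mask is easy to read. Using negative infinity for masked scores is a common implementation choice, not the only one.

```python
import math


def causal_weights(scores):
    """Row i holds softmax weights over positions 0..i; later positions get 0."""
    rows = []
    for i, row in enumerate(scores):
        # Future positions (j > i) are set to -inf before softmax.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Hypothetical raw match scores for a 3-token prompt (all equal, for clarity).
scores = [[1.0, 1.0, 1.0] for _ in range(3)]

for row in causal_weights(scores):
    print(row)
```

Each row spreads its weight only over itself and earlier positions, giving the lower-triangular pattern typical of causal attention heatmaps.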

Tiny explainer: causal masking

Causal masking hides future tokens. It keeps a decoder-only language model from looking ahead while predicting the next token.

Image 5: Attention heatmap showing causal masking and high attention to cat

One of the most interesting findings in interpretability research is about specialized attention heads called induction heads, found by Anthropic in 2022. These heads learn to spot patterns of the form “A B … A” in the prompt and predict that B comes next. When the model sees “A” the second time, the induction head looks back to where “A” appeared before, sees what came after, and copies that. They’re one of the clearest known mechanisms behind in-context learning, the ability of an LLM to pick up a pattern from your prompt and continue it.

可解释性研究里有个很有意思的发现,和一种叫 induction heads 的专门 attention head 有关,这是 Anthropic 在 2022 年发现的。这类 head 学会识别提示词中形如 A B … A 的模式,并预测接下来是 B。当模型第二次看到 A 时,这个 induction head 会回看 A 上一次出现的位置,看看它后面跟着什么,然后把那个东西拷过来。它们是目前已知最清晰的机制之一,用来解释 in-context learning,也就是 LLM 能从你的提示词里现学一个模式并延续下去的能力。

Tiny explainer: induction head

An induction head is an attention head that notices repeated patterns in the prompt and helps continue them.

极简说明:induction head

induction head 是一种 attention head,它会注意提示词里的重复模式,并帮助模型把模式续下去。
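The "A B … A, so B comes next" behavior can be imitated with a plain lookup. This is only a sketch of what an induction head ends up doing, written as ordinary Python. A real head does this softly with attention weights, and `induction_guess` is a hypothetical name.

```python
def induction_guess(tokens):
    """Return the token that followed the most recent earlier copy of the last token."""
    current = tokens[-1]
    # Walk backwards over earlier positions, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # the last token has not appeared before

prompt = ["Mr", "Dursley", "was", "proud", "Mr"]
print(induction_guess(prompt))
```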

Attention has one big cost. In full attention, each token compares against all the tokens it is allowed to see, so doubling the prompt length roughly quadruples the work. This is why long prompts are expensive to run, and why a lot of recent research is about making attention more efficient (FlashAttention, sparse attention, linear attention).

Attention 有一个很大的代价。在完整 attention 里,每个 token 都要和它允许看到的所有 token 做比较,所以提示词长度翻倍,工作量大致会变成四倍。这就是长提示词为什么运行昂贵,也解释了最近很多研究为什么都在想办法让 attention 更高效,比如 FlashAttention、sparse attention、linear attention。
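The quadratic growth is easy to check by counting. A small sketch, assuming causal attention where token i compares against i tokens (itself included), and ignoring heads and layers, which only multiply the total by a constant:

```python
def causal_comparisons(n_tokens):
    # 1 + 2 + ... + n comparisons for one attention pass over the prompt.
    return n_tokens * (n_tokens + 1) // 2

for n in (1_000, 2_000, 4_000):
    print(n, causal_comparisons(n))

ratio = causal_comparisons(2_000) / causal_comparisons(1_000)
print(ratio)  # just under 4
```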

But one attention head only gives the model one learned view of those relationships.

但一个 attention head 只能给模型提供一种学出来的关系视角。



Multi-head attention


A single attention pass gives the model one way of deciding which tokens matter to which other tokens. That’s not enough. Language has many relationships happening at the same time. Subject and verb agreement. Pronouns and the names they refer to. Long-range references between sentences. Word order and local phrases.

一次单独的 attention 计算,只能让模型学会一种 token 与 token 之间的重要性判断方式。这远远不够。语言里同时发生着很多种关系。主谓一致。代词和它指代的名字。跨句子的远距离引用。词序和局部短语关系。

Multi-head attention solves this by running attention many times in parallel, with each parallel pass operating in its own smaller space. Each parallel pass is called a head.

Multi-head attention 的解决办法,是并行跑很多次 attention,每一条并行路径都在自己较小的空间里工作。每一条并行路径就叫一个 head。

Tiny explainer: attention head

An attention head is one independent attention pass with its own learned projections.

极简说明:attention head

attention head 就是一次独立的 attention 计算,它有自己学出来的投影矩阵。

This is the part that’s often described wrong, including in plenty of tutorials. Each head doesn’t get a literal slice of the original token vector. Each head has its own learned projection matrices that map the full token vector down to its own smaller Q, K, and V vectors. So if a model has 4,096 numbers per token and 32 heads, each head usually works in a 128-dimensional space (4,096 / 32), but those 128 numbers are a learned projection of the full 4,096, not a fixed slice. They are different “views” of the same token, not different chunks of it.

这里有个地方经常被讲错,很多教程也会说错。每个 head 并不是拿到原始 token 向量里一个固定切片。每个 head 都有自己学出来的投影矩阵,把完整 token 向量映射成自己较小的 Q、K、V 向量。所以如果一个模型每个 token 有 4096 个数字,一共 32 个 head,那每个 head 通常工作在 128 维空间里,但这 128 个数字是从完整 4096 维里学出来的投影,不是死板切一块下来。它们是同一个 token 的不同视图,不是不同碎片。

Each head runs its attention pass independently. Then the outputs of all the heads get concatenated and passed through a final linear layer that mixes them back into one full-size vector. The model learns that final mixing too.

每个 head 都独立跑自己的 attention。然后所有 head 的输出会被拼接起来,再经过最后一个线性层,把它们重新混成一条完整尺寸的向量。这一步如何混,模型也会自己学。
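A shape-level sketch of the projection, concatenation, and final mixing steps. The sizes (8 numbers per token, 2 heads) and the random weights are toy assumptions. To keep it short, the sequence has a single token, in which case each head's attention output is just that token's projected Value vector.

```python
import random

def matvec(matrix, vec):
    # matrix has shape (out_dim, in_dim); every output mixes ALL input numbers.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rng, out_dim, in_dim):
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

rng = random.Random(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads  # same arithmetic as 4096 // 32 == 128

token = [rng.uniform(-1, 1) for _ in range(d_model)]

# One learned Value projection per head, each reading the full token vector.
w_v = [random_matrix(rng, d_head, d_model) for _ in range(n_heads)]
head_outputs = [matvec(w, token) for w in w_v]

# Concatenate the head outputs, then mix them with a final learned matrix.
concat = [x for head in head_outputs for x in head]
w_o = random_matrix(rng, d_model, d_model)
mixed = matvec(w_o, concat)

print(len(head_outputs[0]), len(concat), len(mixed))
```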

Image 6: Multi-head attention combines specialized attention heads

What makes this interesting is that different heads often end up partially specialized. The model is never told what each head should do. Specialization emerges naturally during training. Researchers have found heads that track grammar (linking verbs to their objects, articles to their nouns), heads that figure out which pronoun refers to which name, heads that track positional patterns, induction heads, and many more. A single transformer layer might have 32 heads. A modern frontier model has dozens of layers. So a typical LLM has thousands of attention heads in total, each adding its own learned view.

真正有意思的地方在于,不同 head 往往会出现某种部分专门化。模型从来没人告诉它每个 head 该干什么。这种专门化是在训练中自然冒出来的。研究者发现,有的 head 会跟踪语法关系,比如把动词连到它的宾语,把冠词连到它修饰的名词。有的 head 会判断某个代词指的是哪个名字。有的 head 跟踪位置模式,还有 induction head,以及很多别的类型。单层 transformer 可能有 32 个 head。现代前沿模型会有几十层。所以一个典型 LLM 里,attention head 总数往往上千,每个都在贡献自己学到的一种视角。

There’s a practical cost concern that drove a recent architectural change. Each head needs to keep its Key and Value vectors in memory for all the tokens already generated, so that when a new token gets generated the model doesn’t have to recompute everything from scratch. This is called the KV cache, and it’s the main memory cost of running an LLM at long context lengths.

还有一个实际成本问题,推动了最近的一次架构变化。每个 head 都需要把已经生成过的所有 token 对应的 Key 和 Value 向量留在内存里,这样当新 token 生成时,模型就不用从头重新算一遍。这叫 KV cache,它是 LLM 在长上下文运行时最主要的内存开销。

Tiny explainer: KV cache

The KV cache stores old Key and Value vectors during generation. It saves the model from recomputing the whole prompt every time it adds a token.

极简说明:KV cache

KV cache 会在生成过程中存住旧的 Key 和 Value 向量。这样模型每增加一个 token 时,就不用把整段提示词全部重算。
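A toy sketch of the bookkeeping, assuming one head and one layer. `project` is a hypothetical stand-in for the learned Key and Value projections. The counters compare how many tokens get projected with a cache versus recomputing the whole prefix at every step.

```python
class KVCache:
    """Keeps one Key and one Value vector per token already processed."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

def project(vec, scale):
    # Hypothetical stand-in for a learned Key or Value projection.
    return [scale * x for x in vec]

token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]

cache = KVCache()
projected_with_cache = 0
projected_without_cache = 0
for step, vec in enumerate(token_vectors, start=1):
    cache.append(project(vec, 2.0), project(vec, 3.0))
    projected_with_cache += 1         # only the newest token is projected
    projected_without_cache += step   # the whole prefix would be redone

print(len(cache), projected_with_cache, projected_without_cache)
```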

Modern decoder-only LLMs mostly use a variant called Grouped-Query Attention (GQA). Instead of every head having its own keys and values, groups of heads share the same key and value heads. LLaMA-2 70B has 64 query heads but only 8 key/value heads. Mistral 7B has 32 query heads and 8 key/value heads. The result is nearly the same accuracy as full multi-head attention but with much less memory pressure and inference cost.

现代纯 decoder LLM 大多使用一种变体,叫 Grouped-Query Attention,也就是 GQA。它不是每个 head 都有自己单独的 key 和 value,而是让一组 query head 共享同一组 key/value head。LLaMA-2 70B 有 64 个 query head,但只有 8 个 key/value head。Mistral 7B 有 32 个 query head 和 8 个 key/value head。结果是,准确率几乎和完整 multi-head attention 一样,但内存压力和推理成本小得多。

Tiny explainer: GQA

Grouped-Query Attention lets multiple query heads share fewer key/value heads. That cuts KV-cache memory while keeping many query views.

极简说明:GQA

Grouped-Query Attention 让多个 query head 共享更少的 key/value head。这样既保留多种 query 视角,又能削减 KV cache 的内存占用。
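A back-of-the-envelope sketch of why sharing key/value heads matters. The head counts (64 query heads, 8 key/value heads) come from the text above. The layer count of 80, head size of 128, 4,096-token context, and 2 bytes per number are assumptions added here for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor 2: one Key vector and one Value vector per token, head, and layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    # Consecutive query heads form a group that shares one key/value head.
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

full_mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)

print(full_mha / 2**30, "GiB with one KV head per query head")
print(gqa / 2**30, "GiB with 8 shared KV heads")
print([kv_head_for(q, 64, 8) for q in (0, 7, 8, 63)])
```

Under these assumptions the cache shrinks by the same factor as the number of key/value heads, here 8x.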



Feed-forward network


After attention finishes mixing information between tokens, every layer has a second step that nobody talks about as much. The feed-forward network.

在 attention 完成 token 之间的信息混合之后,每一层还有第二步,只是平时没那么多人讲。那就是 feed-forward network。

Where attention is about tokens talking to each other, the feed-forward network is about each token, on its own, doing more processing. It runs on every token’s vector independently, with no cross-token mixing.

如果说 attention 是 token 彼此交谈,那么 feed-forward network 更像是每个 token 自己单独再做一轮处理。它会独立作用在每个 token 的向量上,不发生 token 之间的混合。

The feed-forward network does three things in order:

feed-forward network 按顺序做三件事:

  1. Expand the token’s vector to a larger size (the original transformer used 4x, while modern SwiGLU models often use different expansion sizes).
  2. Apply a non-linear function.
  3. Compress the vector back down to its original size.

  1. 把 token 的向量扩展到更大的尺寸,原始 transformer 用的是 4 倍,现代 SwiGLU 模型常用的是别的扩展比例
  2. 施加一个非线性函数
  3. 再把向量压回原来的尺寸
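The three steps above can be sketched on a single token vector. The weights here are hand-picked toy values (2 numbers expanded to 4, then compressed back to 2), biases are left out, and ReLU is used as the non-linearity. With these particular weights the block computes the absolute value of each input number.

```python
def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def relu(x):
    return max(0.0, x)

def feed_forward(vec, w_up, w_down):
    expanded = matvec(w_up, vec)            # 1. expand to a larger size
    activated = [relu(h) for h in expanded] # 2. apply the non-linearity
    return matvec(w_down, activated)        # 3. compress back down

# Toy weights: expand 2 -> 4, compress 4 -> 2.
w_up = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
w_down = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

print(feed_forward([2.0, -3.0], w_up, w_down))
```

The same function runs on every token's vector separately, which is why there is no loop over other tokens anywhere in it.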

Image 7: Feed-forward network expands, transforms, and compresses each token vector

That non-linear step in the middle is doing something specific that’s worth understanding. A non-linearity is a function that bends its input. The simplest one, ReLU, outputs zero for any negative number and passes positive numbers through unchanged.

中间那个非线性步骤干的事情很具体,值得弄明白。所谓非线性,就是一种会让输入发生弯折的函数。最简单的例子是 ReLU,负数输出零,正数原样通过。

Tiny explainer: non-linearity

A non-linearity is a function that prevents the network from collapsing into one big linear transformation.

极简说明:non-linearity

non-linearity 是防止整个网络塌成一个巨大线性变换的关键。

Without it, the FFN would just be two linear layers stacked together, and stacking pure linear math collapses. Two linear layers in a row are mathematically equivalent to a single linear layer, and a hundred linear layers in a row are still equivalent to one. The non-linearity is what stops that collapse, and it’s the reason the FFN can do something richer than a single matrix multiplication.

如果没有它,FFN 就只是两层线性层叠在一起,而纯线性运算叠起来是会塌缩的。数学上,两层线性层连着用,等价于一层线性层。一百层线性层连着用,本质上也还是等价于一层。真正阻止这种塌缩的,就是非线性。也正因如此,FFN 才能做出比一次矩阵乘法更丰富的事情。
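The collapse can be checked directly. A small sketch with two made-up 2x2 weight matrices: running the two linear layers in a row gives the same answer as one layer whose matrix is their product.

```python
def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

w1 = [[1.0, 2.0], [0.0, 1.0]]   # first linear layer
w2 = [[3.0, 0.0], [1.0, 1.0]]   # second linear layer
x = [1.0, -2.0]

two_layers = matvec(w2, matvec(w1, x))
one_layer = matvec(matmul(w2, w1), x)

print(two_layers, one_layer)
```

Put a non-linearity between `w1` and `w2` and no single matrix reproduces the result for every input.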

The original transformer used ReLU. GPT and BERT moved to GELU. Modern models like LLaMA, Mistral, and PaLM use SwiGLU. The expand-then-compress structure stayed the same. The non-linearity itself is what’s been iterated on.

最初的 transformer 用的是 ReLU。GPT 和 BERT 改成了 GELU。LLaMA、Mistral、PaLM 这样的现代模型又改用 SwiGLU。先扩张再压缩的整体结构没变,真正被不断迭代的是中间这个非线性本身。
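For reference, a sketch of the three non-linearities named above, written with the standard library only. The SwiGLU form follows the usual published definition (a SiLU-activated "gate" projection multiplied element-wise with a second "up" projection). The function and argument names are chosen here for illustration.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x times the standard normal CDF of x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # Also called Swish: x times sigmoid(x).
    return x / (1.0 + math.exp(-x))

def swiglu_hidden(gate_pre, up_pre):
    # gate_pre and up_pre are two separate linear projections of the same token.
    return [silu(g) * u for g, u in zip(gate_pre, up_pre)]

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), round(gelu(x), 4), round(silu(x), 4))
```

Because SwiGLU needs two expanding projections instead of one, models that use it often pick a smaller expansion size to keep the parameter count comparable.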

Most of the parameters in a dense transformer model live in the FFN, not in attention. A large share of the weights sit in feed-forward layers.

在稠密 transformer 模型里,大多数参数并不在 attention,而是在 FFN。很大一部分权重都躺在 feed-forward 层里。

And those parameters aren’t generic. They’re where much of the model’s stored factual and semantic structure lives. Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. One neuron might activate strongly on Eiffel-Tower-related text. Another on programming languages. Another on past-tense verbs. When a model “knows” that Paris is the capital of France, that fact is represented across FFN weights and activations in specific layers.

而且这些参数并不是泛泛的一堆数字。模型内部大量存储的事实和语义结构,就藏在这里。研究者发现,FFN 里的某些神经元会和特定概念或事实强相关。某个神经元可能会对和 Eiffel Tower 有关的文本强烈激活。另一个可能对编程语言激活。另一个可能对过去式动词激活。当模型知道 Paris 是 France 的首都时,这个事实会在特定层里的 FFN 权重和激活中被表示出来。

This stored-memory property has an interesting consequence. Researchers have figured out how to directly edit some facts in a trained model without retraining it. Methods like ROME (Rank-One Model Editing) can change “the Eiffel Tower is in Paris” to “the Eiffel Tower is in Rome” by making a targeted low-rank edit to a specific FFN weight matrix. The model then tends to generate text consistent with the edited association.

这种存储记忆的性质带来一个有意思的结果。研究者已经找到了办法,可以直接修改训练完成模型中的某些事实,而不必重新训练。像 ROME 这类方法,也就是 Rank-One Model Editing,可以通过对某个特定 FFN 权重矩阵做一次有针对性的低秩编辑,把 Eiffel Tower is in Paris 改成 Eiffel Tower is in Rome。之后模型往往会生成和这个新关联一致的文本。

Some modern frontier models have started replacing the dense FFN with something called Mixture of Experts (MoE). Instead of one feed-forward network per layer, the model has many parallel FFNs (called experts) and a tiny router network that picks which experts process each token. Mixtral 8x7B has 8 experts per layer; only 2 are activated for any given token. The total parameter count goes up substantially, but the compute per token grows much more slowly because only a few experts run. That’s how you scale parameter count without scaling inference cost in proportion.

一些现代前沿模型已经开始把稠密 FFN 替换成 Mixture of Experts,也就是 MoE。不是每层只有一个 feed-forward network,而是有很多个并行 FFN,叫 experts,再加一个很小的 router network,决定每个 token 该交给哪些 expert 处理。Mixtral 8x7B 每层有 8 个 expert,每个 token 实际只会激活其中 2 个。总参数量会明显上升,但每个 token 的计算量增长得慢得多,因为真正运行的只有少数 expert。这就是在不按比例增加推理成本的前提下,继续扩展参数规模的办法。

Tiny explainer: MoE

Mixture of Experts means the model has several feed-forward networks and routes each token through only a few of them.

极简说明:MoE

Mixture of Experts 的意思是,模型里有多套 feed-forward network,而每个 token 只会被路由到其中少数几套。

Mixtral 8x7B has 46.7 billion total parameters but uses about 12.9 billion per token. This has become a common option for very large models because it lets you keep growing the parameter count without making inference cost grow in proportion.

Mixtral 8x7B 总参数量有 467 亿,但每个 token 实际只会用到大约 129 亿参数。它已经成了超大模型里很常见的选项,因为它让参数量继续增长,同时又不必让推理成本同步按比例暴涨。
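A minimal sketch of top-2 routing over 8 experts. The experts are hypothetical stand-ins (each just scales the vector), and the router scores are made up. Taking a softmax over only the selected experts' scores is one common gating choice, assumed here.

```python
import math

def top_k_route(router_logits, k=2):
    # Pick the k highest-scoring experts, then softmax over just those scores.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(vec, experts, router_logits, k=2):
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(vec)
    for idx, gate in gates.items():
        expert_out = experts[idx](vec)  # only the chosen experts ever run
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, sorted(gates)

# Hypothetical experts: expert i multiplies the vector by i + 1.
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 2.0, 0.0, -0.5, 0.3, -2.0]

out, used = moe_layer([1.0, 2.0], experts, router_logits)
print(used, out)

active_fraction = 12.9 / 46.7  # active vs total parameters, figures from the text
print(round(active_fraction, 2))
```

In a real model the router scores come from a small learned linear layer applied to the token's vector, and each expert is a full feed-forward network.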



Residual stream and layer normalization


The residual stream is what makes the model “additive” instead of “replacing.” After attention runs, or after the feed-forward network runs, the result usually doesn’t replace the token’s vector. It gets added to it. Position by position. The new vector equals the old vector plus the sub-block’s output.

Residual stream 的作用,是让模型以相加的方式工作,而不是替换。attention 跑完之后,或者 feed-forward network 跑完之后,结果通常不会直接替换 token 的向量,而是逐位置加回去。新向量等于旧向量加上这个子模块的输出。

Tiny explainer: residual connection

A residual connection adds a block’s output back to the vector it started from. It gives information and gradients a shortcut through the network.

极简说明:residual connection

residual connection 会把一个模块的输出加回它原本起步的向量上。它给信息和梯度提供了一条穿过网络的捷径。
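A sketch of the "add, don't replace" rule. The two sub-blocks are hypothetical stand-ins: one contributes nothing, the other adds 0.5 to every number. Stacking a hundred silent sub-blocks leaves the original vector untouched, which is the shortcut path in action.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def run_layers(x, sub_blocks):
    for sub_block in sub_blocks:
        x = add(x, sub_block(x))  # new vector = old vector + sub-block output
    return x

def silent(vec):
    # Hypothetical sub-block that has nothing to contribute.
    return [0.0 for _ in vec]

def nudge(vec):
    # Hypothetical sub-block that writes a small update.
    return [0.5 for _ in vec]

embedding = [1.0, -2.0, 3.0]
unchanged = run_layers(embedding, [silent] * 100)
nudged = run_layers(embedding, [silent] * 99 + [nudge])

print(unchanged, nudged)
```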

Across thirty or fifty or a hundred layers, each layer’s contribution accumulates instead of simply overwriting the previous vector. That running sum is called the residual stream, and it has a strange property. The original input embeddings still have a direct additive path into late layers, mixed together with every sub-block’s contribution along the way.

经过三十层、五十层甚至上百层之后,每一层的贡献都是不断累积的,而不是简单覆盖前一层向量。这种一路累加下来的总和,就叫 residual stream。它有个挺奇特的性质。最初输入的 embedding 一直保留着一条直达后面层的加法路径,只不过一路上不断和各个子模块的贡献混在一起。

Image 8: Residual stream accumulates attention and feed-forward outputs

Residual connections weren’t invented for transformers. They came from ResNet (He et al. 2015), originally for image recognition. The motivation was that very deep networks were extremely hard to train. The training signal got too weak (or sometimes too strong) by the time it traveled back through many layers, so the early layers couldn’t actually learn from the model’s mistakes. Adding a shortcut path let the signal flow directly back from the output toward the input. Suddenly you could train networks with hundreds of layers. Transformers inherited the same trick.

Residual connection 不是为 transformer 发明的。它来自 ResNet,也就是 He 等人在 2015 年为图像识别提出的结构。最初动机很直接,深网络根本难以训练。训练信号在穿过很多层回传时会变得太弱,有时也会太强。模型没法真正从自己的错误里学到东西。加上一条捷径之后,信号就能从输出直接回流到输入。于是几百层深的网络也能训练了。transformer 继承了同样这招。

In modern interpretability research, the residual stream has become the central object. Every component, every attention head, every feed-forward network, even the unembedding step at the end, reads from the residual stream and writes back to it.

在现代可解释性研究里,residual stream 已经成了最核心的对象。每个组件、每个 attention head、每个 feed-forward network,甚至最后的 unembedding 步骤,都是从 residual stream 里读,再把结果写回去。

The second piece, layer normalization, exists for a much more practical reason. Without it, the residual stream would not stay stable. Numbers flowing through dozens of additions tend to either explode upward or collapse toward zero. Either way, training fails. Layer normalization rescales each token’s vector back into a controlled range between sub-blocks.

第二块东西,layer normalization,理由就更实际了。没有它,residual stream 不可能保持稳定。数字经过几十次相加之后,要么越滚越大,要么一路塌到接近零。两种情况都会让训练失败。layer normalization 会在子模块之间把每个 token 的向量重新缩放回一个可控范围。

Tiny explainer: layer normalization

Layer normalization rescales a token vector so its numbers stay in a stable range while the model trains.

极简说明:layer normalization

layer normalization 会重新缩放 token 向量,让其中的数字在训练过程中保持稳定范围。
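A minimal sketch of the rescaling itself. Real layers also multiply by a learned gain and add a learned bias afterwards, which is left out here. The `eps` value is a typical small constant, chosen as an assumption.

```python
import math

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    variance = sum((x - mean) ** 2 for x in vec) / len(vec)
    # Center on zero, then divide by the spread so the numbers stay in a fixed range.
    return [(x - mean) / math.sqrt(variance + eps) for x in vec]

small = layer_norm([1.0, 2.0, 3.0])
large = layer_norm([1000.0, 2000.0, 3000.0])
print(small)
print(large)  # nearly identical: the overall scale has been normalized away
```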

The original 2017 transformer applied normalization AFTER each sub-block (post-norm). This worked for shallow models but became harder to train reliably as depth increased. Modern transformers (GPT-2 onward, LLaMA, Mistral) commonly apply normalization BEFORE each sub-block (pre-norm). That’s one of the changes that made very deep transformers easier to train.

最早 2017 年的 transformer 是在每个子模块之后做 normalization,也就是 post-norm。浅层模型里这样能用,但层数一深,训练稳定性就越来越难保证。现代 transformer,也就是从 GPT-2 往后,包括 LLaMA、Mistral,通常会改成在每个子模块之前做 normalization,也就是 pre-norm。这正是让超深 transformer 更容易训练起来的改动之一。
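The difference between the two placements is one line of code. In this sketch `norm` is a simple RMS-style rescaling and `silent` is a hypothetical sub-block that outputs zeros, used to show what happens to the shortcut path in each case.

```python
import math

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def norm(vec, eps=1e-6):
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def post_norm_step(x, sublayer):
    # Original 2017 placement: normalize after adding the sub-block output.
    return norm(add(x, sublayer(x)))

def pre_norm_step(x, sublayer):
    # Modern placement: normalize only the sub-block's input.
    return add(x, sublayer(norm(x)))

def silent(vec):
    return [0.0 for _ in vec]

x = [3.0, 4.0]
print(pre_norm_step(x, silent))   # x comes through untouched
print(post_norm_step(x, silent))  # x itself has been rescaled
```

With pre-norm the running sum is never passed through the normalization, so the direct additive path from input to output stays intact.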

The function itself has also changed. Many modern open models (LLaMA, Mistral, Gemma, Phi) use a simpler variant called RMSNorm. The original layer normalization did two things at once: subtract the vector’s mean so its values are centered on zero, then rescale the size of the numbers. RMSNorm drops the centering step and keeps only the rescaling. Empirically, the rescaling carries most of the benefit while being cheaper to compute.

具体函数本身也变了。很多现代开源模型,比如 LLaMA、Mistral、Gemma、Phi,用的是一种更简单的变体,叫 RMSNorm。原来的 layer normalization 同时做两件事,先把每条向量往零中心平移,再重新缩放数字大小。RMSNorm 去掉了平移这一步,只保留缩放。经验上看,真正的大部分收益其实来自缩放,而且这样算起来更便宜。

Tiny explainer: RMSNorm

RMSNorm is a cheaper normalization method that rescales vector size without subtracting the mean first.

极简说明:RMSNorm

RMSNorm 是一种更便宜的归一化方法。它只缩放向量大小,不先减去均值。
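A sketch of RMSNorm without its learned gain, assuming a typical small `eps`. The example vector has a non-zero mean on purpose: RMSNorm leaves the mean in place, where layer normalization would have removed it.

```python
import math

def rms_norm(vec, eps=1e-6):
    # Root mean square of the numbers; no mean subtraction anywhere.
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

normed = rms_norm([10.0, 10.0, 10.0])
print(normed)  # roughly [1, 1, 1]; centering would have produced zeros

scaled = rms_norm([3.0, -4.0])
print(math.sqrt(sum(x * x for x in scaled) / len(scaled)))  # close to 1
```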

So that’s the unglamorous machinery. Without residual connections, very deep models become much harder to train. Without layer normalization, the running sum can blow up or collapse. With both, you get models a hundred or more layers deep.

所以这就是那套不太起眼的底层机械结构。没有 residual connection,超深模型会难训很多。没有 layer normalization,这条不断累加的流就可能爆掉,或者塌掉。有了这两样,才有可能做出上百层深的模型。



Next-token prediction


After all the layers of attention and feed-forward processing finish, the model has a vector for each token in the sequence. During generation, to predict the next token, it takes the final vector of the last token only.

等 attention 和 feed-forward 这些层层处理全部结束后,模型会为序列中的每个 token 都得到一条向量。在生成阶段,为了预测下一个词,它只取最后一个 token 的最终向量。

That last vector gets converted into one number per possible next token. If the vocabulary has 100,000 tokens, that’s 100,000 numbers. These numbers are called logits. They aren’t probabilities yet. They can be any size, positive or negative.

这条最后向量会被转换成每个可能下一个 token 对应的一个数字。如果词表里有 100000 个 token,那就会得到 100000 个数字。这些数字叫 logits。它们还不是概率。它们可以是任意大小,正的负的都行。

Tiny explainer: logits

Logits are raw scores for each possible next token. They become probabilities only after softmax.

极简说明:logits

logits 是针对每个可能下一个 token 的原始分数。只有经过 softmax 之后,它们才会变成概率。

A softmax turns those logits into the model’s probability distribution over possible next tokens. Same operation as before, different place in the model.

然后通过 softmax,把这些 logits 变成模型对所有候选下一个 token 的概率分布。和前面是同一个操作,只是发生在模型中的不同位置。
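A sketch of the last step with a made-up 4-token vocabulary. The final vector and the rows of the output ("unembedding") matrix are invented numbers. Each logit is the dot product of the final vector with one vocabulary row, and softmax turns the logits into probabilities.

```python
import math

def unembed(final_vector, unembedding_rows):
    # One raw score (logit) per vocabulary entry.
    return [sum(w * x for w, x in zip(row, final_vector)) for row in unembedding_rows]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "sleeping", "the"]
unembedding_rows = [[0.2, -1.0], [0.1, -0.5], [1.5, 2.0], [-0.3, 0.4]]
final_vector = [1.0, 2.0]

logits = unembed(final_vector, unembedding_rows)
probs = softmax(logits)
best = vocab[max(range(len(probs)), key=probs.__getitem__)]

print(logits)  # raw scores, some of them negative
print(probs)   # non-negative and summing to 1
print(best)
```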

The model usually does not just pick the highest-probability token every time. Decoding settings control how deterministic or varied the output is. Temperature changes how sharp the distribution is. Top-k and top-p limit the choices to the most plausible next tokens. That is why the same model can feel precise in one setting and more creative in another.

模型通常不会每次都直接选概率最高的 token。解码设置会控制输出到底更确定,还是更多变化。temperature 会改变分布有多尖锐。top-k 和 top-p 会把选择范围限制在最可信的一批候选里。这就是为什么同一个模型,在某种设置下会显得很精准,在另一种设置下又会更有创造性。

Tiny explainer: temperature

Temperature controls randomness during sampling. Low temperature makes the model more conservative; high temperature makes it more varied.

极简说明:temperature

temperature 控制采样时的随机性。温度低,模型更保守。温度高,输出更多变化。
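A sketch of the three decoding knobs, assuming the common definitions: temperature divides the logits before softmax, top-k keeps the k most likely tokens, and top-p keeps the smallest set of top tokens whose probabilities add up to at least p. Function names are chosen here for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def keep_only(probs, keep):
    # Zero out everything outside `keep`, then renormalize to sum to 1.
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [p / total if i in kept else 0.0 for i, p in enumerate(probs)]

def top_k(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return keep_only(probs, order[:k])

def top_p(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, running = [], 0.0
    for i in order:
        keep.append(i)
        running += probs[i]
        if running >= p:
            break
    return keep_only(probs, keep)

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])

rng = random.Random(0)
print(sample(top_k(softmax(logits), 2), rng))
```

Lower temperature piles probability onto the top token, and higher temperature flattens the distribution.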

Once a token is picked, it gets added to the input. The model runs the next step on the longer sequence, usually reusing the KV cache so it doesn’t recompute the whole prefix from scratch. New attention for the new token. New feed-forward. New final vector. New prediction. The loop continues until the model emits an end-of-sequence token or hits a length limit. A whole paragraph is just this loop, one token at a time.

一旦选中了一个 token,它就会被加到输入里。接着模型会在更长的序列上跑下一步,通常会复用 KV cache,这样就不用把整个前缀从头重算。新 token 的新 attention。新的 feed-forward。新的最终向量。新的预测。这个循环会一直继续,直到模型输出一个序列结束 token,或者撞上长度上限。整整一段文字,本质上就是这个循环一下一下跑出来的,每次一个 token。
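The loop itself fits in a few lines. In this sketch `next_token_fn` is a hypothetical stand-in for the whole model plus sampling step, and the toy "model" simply counts upward and then emits the end-of-sequence ID. KV-cache reuse is left out.

```python
def generate(next_token_fn, prompt_ids, eos_id, max_new_tokens):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)  # run the model on everything so far
        ids.append(next_id)           # the picked token becomes part of the input
        if next_id == eos_id:
            break                     # the model chose to stop
    return ids

EOS = 0

def toy_model(ids):
    # Hypothetical model: count upward, then stop after reaching 4.
    return ids[-1] + 1 if ids[-1] < 4 else EOS

print(generate(toy_model, [1, 2], eos_id=EOS, max_new_tokens=10))
print(generate(toy_model, [1, 2], eos_id=EOS, max_new_tokens=1))
```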

This single objective, predicting the next token, is the core training signal for a base LLM. The base model isn’t trained on factual accuracy, conversational ability, reasoning, or coding directly. It’s trained to predict the next token in massive amounts of text. Later post-training can then tune the model for instruction following, preference, safety, and conversational behavior.

这个单一目标,也就是预测下一个 token,是基础 LLM 的核心训练信号。基础模型并不是直接按事实准确性、对话能力、推理能力或写代码能力来训练的。它训练的就是在海量文本里预测下一个 token。之后再通过后训练,把模型调整到更会遵循指令、更符合偏好、更安全,也更像对话助手。

There’s been a major efficiency innovation worth knowing about. It’s called speculative decoding. A small fast model proposes several tokens ahead. The big model scores all of them in a single parallel pass. Each proposed token is then accepted or rejected by comparing the two models’ probabilities for it. Accepted tokens are kept. At the first rejection, the big model supplies the token instead and the rest of the draft is thrown away. Done correctly, the output distribution matches running the big model alone, but the loop can run much faster.

还有一个值得知道的大效率创新,叫 speculative decoding。一个小而快的模型先往前提议多个 token,大模型并行验证它们。如果这些提议 token 在大模型的概率分布下也成立,就直接接受。否则就回退到大模型自己算。只要做法正确,最终输出分布会和只跑大模型时一致,但整个循环能快很多。

Tiny explainer: speculative decoding

Speculative decoding uses a small draft model to guess ahead, then asks the larger model to verify several guessed tokens at once.

极简说明:speculative decoding

speculative decoding 会让一个较小的草稿模型先往前猜,然后让较大的模型一次性验证多个猜测 token。
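A sketch of the accept-or-reject rule for a single drafted token, following the standard published scheme: accept with probability min(1, p/q), otherwise resample from the leftover probability max(0, p − q). The two distributions are made-up numbers over a 2-token vocabulary, and a real system applies this to several drafted tokens in a row.

```python
import random

def speculative_accept(p_target, q_draft, drafted, rng):
    """Return (token, accepted) for one token drafted from q_draft."""
    accept_prob = min(1.0, p_target[drafted] / q_draft[drafted])
    if rng.random() < accept_prob:
        return drafted, True
    # Rejected: resample from what the big model wanted beyond the draft's share.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False

rng = random.Random(0)
p_target = [0.7, 0.3]   # big model's probabilities (made up)
q_draft = [0.4, 0.6]    # small draft model's probabilities (made up)

counts = [0, 0]
for _ in range(20_000):
    drafted = rng.choices([0, 1], weights=q_draft, k=1)[0]
    token, accepted = speculative_accept(p_target, q_draft, drafted, rng)
    counts[token] += 1

share_of_token_0 = counts[0] / sum(counts)
print(share_of_token_0)  # close to the big model's 0.7, not the draft's 0.4
```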

The next-token prediction loop is the simplest part of the architecture, but it’s what makes the whole thing work.

预测下一个 token 的循环,是整套架构里最简单的一部分,但也正是它让整个系统真正运转起来。



Architecture vs trained weights

架构与训练后权重

We’ve gone through the core mechanisms: tokens, embeddings, positional encoding, attention, multi-head attention, the feed-forward network, the residual stream and normalization, and the next-token loop on the output side. That’s the basic architecture in one pass.

前面已经走过了这套核心机制,tokens、embeddings、positional encoding、attention、multi-head attention、feed-forward network、residual stream 和 normalization,还有输出端的 next-token 循环。这就是这套基础架构的一次完整通关。

So what’s actually different between GPT and Claude and Gemini and LLaMA? Public details vary, and the proprietary models do not publish every architectural choice. But at the level this post is covering, they broadly sit in the same transformer-family design space.

所以 GPT、Claude、Gemini、LLaMA 到底真正差在哪。公开信息有多有少,闭源模型也不会公布每一个架构选择。但在这篇文章覆盖的这一层面上,它们大体都落在同一个 transformer 家族设计空间里。

Most modern transformer-based LLMs use the same broad structure: tokenization, embeddings, positional encoding, stacked transformer layers (each with multi-head attention and a feed-forward network), residual streams, layer normalization, and next-token prediction.

大多数现代基于 transformer 的 LLM 使用的是同一种大结构,tokenization、embeddings、positional encoding、层层堆叠的 transformer layer,每层里面有 multi-head attention 和 feed-forward network,再加上 residual stream、layer normalization,以及 next-token prediction。

What changes between models is:

模型之间变化的地方在于:

  1. The trained weights themselves, learned from different training data at different scales.
  2. The configuration: number of layers, vocabulary size, head count, parameter count, MoE or dense.
  3. The post-training: instruction tuning, learning from human feedback, safety controls applied on top of the base model.

  1. 训练后的权重本身,也就是在不同训练数据和不同规模下学出来的那些数字
  2. 配置,比如层数、词表大小、head 数量、参数规模,是 MoE 还是稠密模型
  3. 后训练,包括指令微调、从人类反馈中学习,以及叠加在基础模型之上的安全控制

Tiny explainer: weights

Weights are the learned numbers inside the model. Training changes those numbers until the model predicts text well.

极简说明:weights

weights 就是模型内部那些学出来的数字。训练会不断调整它们,直到模型能够把文本预测好。

The 2023-2025 “modern transformer” stack converged on a common set of choices across many serious frontier and open-weight models, even though different teams arrived at them independently. Pre-norm placement. RMSNorm. RoPE. SwiGLU. Grouped-Query Attention. Mixture of Experts in some of the largest models. None of these were invented at once. They accumulated over about five years of refinement on top of the original 2017 design.

2023 到 2025 年这套现代 transformer 堆栈,在很多严肃的前沿模型和开源权重模型之间,已经逐渐收敛出一组很常见的选择,尽管不同团队是各自独立走到这里的。Pre-norm。RMSNorm。RoPE。SwiGLU。Grouped-Query Attention。在一些最大的模型里还有 Mixture of Experts。这些东西不是一口气同时发明出来的,而是在最初 2017 年设计的基础上,经过大约五年不断打磨,一点点累积起来的。



Where this is going

这会走向哪里

The convergence around transformer-family architectures is unusual in machine learning history. For most of the field’s life, every problem had its own specialized network. Image recognition used one kind. Language used another. Audio used a third. Vision and language teams barely shared methods.

围绕 transformer 家族架构的这种收敛,在机器学习历史里是很少见的。这个领域过去的大部分时间里,基本是每个问题都有自己专门的一套网络。图像识别用一种。语言用另一种。音频再用第三种。视觉团队和语言团队几乎不共享方法。

Now transformer-style models show up across language, vision, audio, and multimodal systems. The transformer absorbed a huge part of the field.

而现在,transformer 风格模型已经同时出现在语言、视觉、音频和多模态系统里。transformer 吞掉了这个领域很大一块地盘。

That could change. Mamba and other state-space models are credible alternatives, especially for very long sequences. Hybrid architectures are being explored. Mixture-of-experts has already shifted what “the architecture” means at the frontier in ways that would have been considered exotic five years ago.

这也可能会变。Mamba 和其他 state-space model 是很有分量的替代路线,尤其适合超长序列。混合架构也在不断探索。Mixture-of-experts 已经改变了前沿领域里大家对架构这个词的理解,而这种变化放在五年前还会被视为相当异类。

But the core mechanisms in this post (tokens, embeddings, positional encoding, attention, the feed-forward network, the residual stream and normalization, and next-token prediction) are the durable parts. Even when the architecture changes, these are the problems any sequence model has to solve in some form.

不过,这篇文章里讲的那些核心机制,tokens、embeddings、positional encoding、attention、feed-forward network、residual stream 和 normalization,以及 next-token prediction,才是比较耐久的那部分。哪怕未来架构变了,这些问题任何序列模型都得以某种形式解决。

If you’ve made it this far, you can read many modern transformer papers or model cards and know which piece each section is talking about. That’s the goal.

如果你读到了这里,那你已经可以去读很多现代 transformer 论文或模型卡,并知道每一节在讲哪一块。这就是目标。

Feedback is extremely welcome. If any of this interests you, please reach out on X. I love making new friends.

非常欢迎反馈。如果这些内容里有任何一点让你感兴趣,欢迎在 X 上联系我。我很喜欢认识新朋友。


Monday, June 01, 2026 - 26 mins

This post is a walkthrough of how LLMs work. Modern LLMs are mostly built by stacking transformer blocks over and over, so understanding the transformer machinery gets you most of the way there.

I’ll cover the core mechanisms inside modern transformer-based LLMs, without all that sticky math stuff. Don’t get me wrong, you should learn the math, but this can serve as an introduction.

Most modern LLMs share the same transformer-family skeleton. The differences come from what each one was trained on, the scale and configuration choices, and the post-training done on top. By the end, you should be able to read many modern LLM papers or model cards and know which piece of the architecture each section is talking about.

Here’s the path:

  1. Tokens, how a string of text becomes a sequence of integers
  2. Embeddings, how those integers get meaning
  3. Positional encoding, how the model knows what order the tokens came in
  4. Attention, how tokens share information with each other
  5. Multi-head attention, how the model tracks many kinds of relationships at once
  6. The feed-forward network, where a large share of the model’s stored structure lives
  7. The residual stream and layer normalization, what makes deep stacks trainable
  8. Predicting the next token, what the model actually outputs and how the generation loop works
  9. Architecture vs trained weights, what’s broadly shared across modern LLMs, and what’s different

Image 1: Transformer pipeline from tokenization to next-token prediction

Tiny explainers appear throughout so anyone can follow along, regardless of background.


Tokenization

Models don’t read text directly. They read integer IDs, so some step has to convert your prompt into a sequence of those integers.

That conversion step is called tokenization. A tokenizer takes a string and produces a sequence of integers, where each integer points to an entry in a fixed vocabulary. Modern LLM vocabularies usually contain tens of thousands to a few hundred thousand entries.

Tiny explainer: token ID

A token ID is the integer the model uses for one vocabulary entry. The model works with the number, not the written word itself.

Tokens aren’t usually whole words. They’re usually subword pieces. The word “tokenization” might split into [“token”, “ization”]. The word “running” might split into [“run”, “ning”]. The reason is efficiency. Whole-word vocabularies are too big and don’t generalize to new words. Character-level vocabularies are too small and force the model to learn even the simplest patterns from scratch. Subword tokenization sits in the middle. The most common pieces become single tokens, and rare or novel words get composed from smaller pieces.

Tiny explainer: vocabulary

The vocabulary is the tokenizer’s fixed list of pieces. Each piece has an ID, and the model can only directly receive IDs from that list.
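A toy tokenizer makes the idea concrete. This sketch uses greedy longest-match against a tiny hand-written vocabulary, which is a simplification: real tokenizers such as Byte Pair Encoding learn their pieces and merge rules from data. The vocabulary and IDs below are invented.

```python
def tokenize(text, vocab):
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shorter and shorter ones.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

# Invented vocabulary: piece -> token ID.
vocab = {"token": 0, "ization": 1, "run": 2, "ning": 3, "s": 4}

print(tokenize("tokenization", vocab))
print(tokenize("running", vocab))
print(tokenize("tokens", vocab))
```

The model only ever sees the integer lists, never the letters inside each piece.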

The trade-off shows up in places people don’t expect. The classic example: ask an LLM how many R’s are in “strawberry.” LLMs used to get it wrong. That’s not the model failing at counting. It’s the model not operating on letters directly, only token IDs that happen to spell out a word a human would split letter by letter.

Image 2: Tokenization turns text into token IDs

Different model families use different tokenizers. GPT models use Byte Pair Encoding variants. SentencePiece is common in LLaMA-style models. The choice matters for compute (fewer tokens means less work) and for things like multilingual coverage, but the basic shape is the same. Text in, integers out.

Now that the prompt is a sequence of integers, the next step is to give those integers meaning.


Embeddings

A token ID like 1024 is just a row index. It doesn’t mean anything by itself. The thing that gives it meaning is a giant table called the embedding matrix.

Every model has one. It has one row per entry in the vocabulary, and each row is a long vector of numbers. The length of each row is the model’s hidden size. In many 7B-class models, that means 4,096 numbers per token. Larger models usually use wider vectors.

Tiny explainer: vector

A vector is a list of numbers. In a transformer, each token becomes a vector so the model can do math with it.

When the tokenizer hands the model an integer, the model looks up that row and uses the vector instead. That vector is the token’s embedding. It’s the model’s representation of what that token “means,” learned during training.

Tiny explainer: embedding matrix

The embedding matrix is a lookup table. Token ID in, learned vector out.
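In code the lookup is plain list indexing. A sketch with an invented 3-row, 3-number embedding matrix (real ones have one row per vocabulary entry and thousands of numbers per row):

```python
# Invented embedding matrix: row index = token ID.
embedding_matrix = [
    [0.1, 0.3, -0.2],
    [0.7, -0.1, 0.4],
    [-0.5, 0.2, 0.9],
]

def embed(token_ids, matrix):
    # Token ID in, learned vector out.
    return [matrix[token_id] for token_id in token_ids]

vectors = embed([2, 0, 2], embedding_matrix)
print(vectors)
```

Token ID 2 appears twice and gets the same vector both times, since the lookup knows nothing about position.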

The interesting property of these embeddings is that semantically similar tokens end up with similar vectors. The vector for “king” is close in space to the vector for “queen,” and the vector for “Paris” is close to “France.” None of this is hard-coded. It emerges from training on enough text, and the model learns these positions because they let it predict text well.

You can do arithmetic on embeddings and it sometimes works. The famous example is king − man + woman ≈ queen. The geometry of embedding space carries real semantic structure, even though nobody told the model to build it that way.
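The analogy can be reproduced on hand-built vectors. These 2-number embeddings are invented so that one axis means "royalty" and the other "gender". Learned embeddings are far larger and the relation only holds approximately there.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Invented embeddings: [royalty, gender].
embeddings = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

target = [k - m + w for k, m, w in
          zip(embeddings["king"], embeddings["man"], embeddings["woman"])]

nearest = max(embeddings, key=lambda word: cosine(embeddings[word], target))
print(target, nearest)
```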

Image 3: Embedding space analogy with semantic relationships

Worth being clear on: at this stage every token has been replaced by its embedding, but the embedding alone says nothing about where the token sits in the sequence. The vector for “dog” is the same vector whether “dog” is the first word in your prompt or the fifth. That’s a problem.

That’s the gap positional encoding fills.


Positional encoding

Plain self-attention doesn’t have a built-in representation of word order. Without some positional signal, it has no direct way to know that “dog” came before “bites” instead of after it.

Word order changes meaning. So the model needs another piece. It needs a way to inject the position of each token into the math.

Tiny explainer: positional encoding

Positional encoding is how the model gets order information. It tells the model where each token sits in the sequence.

The original transformer paper (Vaswani et al. 2017) did this by giving each position its own pattern of numbers and adding it directly to each token’s embedding before any other processing. Position 1 had one pattern, position 5 had a different pattern, position 100 had another. The patterns came from sine and cosine waves at different frequencies. Now the embedding for “dog” at position 1 was different from the embedding for “dog” at position 5, just because the position pattern added to it was different.
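A sketch of the sine and cosine patterns, using the formula from the original paper (base 10000, sine on even indices, cosine on odd ones). The 4-number "dog" embedding is invented, positions are counted from 0 here, and `d_model` is assumed to be even.

```python
import math

def sinusoidal_position(pos, d_model):
    vec = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses a different, progressively slower wave.
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec

def add_position(embedding, pos):
    pattern = sinusoidal_position(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pattern)]

dog = [0.2, -0.4, 0.7, 0.1]  # invented embedding
dog_at_1 = add_position(dog, 1)
dog_at_5 = add_position(dog, 5)
print(dog_at_1)
print(dog_at_5)
```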

That worked, and sinusoidal encodings were chosen partly because they can extrapolate beyond the exact sequence lengths seen during training. But additive position schemes still had two problems that became important as models scaled up.

First, the embedding had to carry both meaning and position in the same set of numbers. There’s only so much you can pack in.

Second, learned absolute position embeddings in particular don’t generalize cleanly. If you trained on prompts up to 2,048 tokens long, the model never saw position 5,000 during training, and the embedding for that position was not learned in the same way.

Modern models mostly use a different scheme called Rotary Position Embeddings (RoPE), introduced by Su et al. in 2021 and now used in LLaMA, Mistral, Gemma, Qwen, and most other open-weight families. The intuition: instead of adding position info to each token’s vector, RoPE rotates the Query and Key vectors by an angle that depends on the token’s position. A token at position 1 gets a small turn, a token at position 100 gets a bigger turn. When two tokens are later compared during attention, what matters is the difference between their Query and Key rotations, which encodes how far apart they are.

Tiny explainer: RoPE

RoPE stands for Rotary Position Embeddings. Instead of adding a position vector, it rotates Query and Key vectors so relative distance shows up during attention.
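
To make the "difference between rotations" idea concrete, here is a small sketch. It assumes adjacent numbers are rotated as 2-d pairs with the usual base of 10000; real implementations sometimes pair dimensions differently. The vectors `q` and `k` are arbitrary example values.

```python
import math

def rotate_pairs(vec, position, base=10000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # later pairs turn more slowly
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 0.8]   # example Query vector
k = [0.2, -0.7, 0.9, 0.4]   # example Key vector

# Same distance (4 tokens apart), very different absolute positions.
score_near_start = dot(rotate_pairs(q, 7), rotate_pairs(k, 3))
score_far_along = dot(rotate_pairs(q, 104), rotate_pairs(k, 100))
```

The two scores agree up to floating-point error, because only the offset between the positions survives the dot product.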

Image 4: Rotary position embeddings rotate vectors by position

The practical advantages are real. RoPE encodes relative position naturally (which is closer to what attention actually wants). It generalizes better to longer contexts. And it doesn’t add new parameters to the model.

Even with good positional encoding, modern LLMs have a documented “lost in the middle” problem (Liu et al. 2023). They use information at the start and end of long prompts more reliably than information buried in the middle. That’s why prompt engineering tips like “put important context first” or “repeat key info at the end” actually help. The model isn’t using every part of your prompt equally well.

With token meaning and position both encoded, the next question is how tokens actually exchange information.


Attention

This is the mechanism the original transformer paper, “Attention Is All You Need,” was named after. Attention.

Inside every transformer layer, attention does one thing. It lets each token look at the other tokens it is allowed to see and decide which ones matter for what comes next.

It does this by giving each token three roles at once. Each token gets transformed into three new vectors, called Query, Key, and Value (Q, K, V).

Tiny explainer: Q, K, V

Query means “what am I looking for,” Key means “what do I match with,” and Value is the information that gets copied when the match is strong.

  • The Query asks, “what am I looking for from other tokens?”
  • The Key says, “this is what I offer to tokens looking at me.”
  • The Value carries, “this is what gets passed along when a match happens.”

The same token plays all three roles at the same time. The Q, K, V transformations are learned matrices, so the model figures out during training what each token should look for and what it should offer.

Matching happens through a similarity score. Each token’s Query is compared against the Key of each token it is allowed to see, using a scaled dot product. Intuitively, this measures how much the two vectors line up. The scaling keeps the numbers stable before softmax.

Tiny explainer: dot product

A dot product is a simple way to score how aligned two vectors are. Higher alignment means a stronger match.

The match scores then get turned into weights using softmax. Softmax takes any set of numbers and turns them into a probability-like distribution that sums to 1. Tokens with higher match scores get higher weights, and the weights are then used to take a weighted average of the value vectors.

Tiny explainer: softmax

Softmax turns raw scores into weights that add up to 1. Big scores get big weights, small scores get small weights.
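
Put together, one token's attention step looks roughly like the sketch below. It assumes the standard scaling (divide by the square root of the vector size) and uses hand-picked 2-d vectors; `attend` is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query token."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors, one coordinate at a time.
    mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, mixed

query = [2.0, 0.0]                 # what the current token is looking for
keys = [[2.0, 0.0], [0.0, 2.0]]    # first key lines up with the query, second does not
values = [[1.0, 0.0], [0.0, 1.0]]
weights, mixed = attend(query, keys, values)
```

The first key gets most of the weight, so the output is dominated by the first value vector.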

An example. Consider the sentence “The cat that I saw yesterday was sleeping.” When the model processes “was,” it needs to figure out what’s doing the sleeping. The Query vector for “was” gets compared against the Key vectors of the tokens it is allowed to see. The dot product with “cat” is high, because the model has learned that verbs like “was” need a subject and that subjects like “cat” produce Key vectors that line up well. The dot product with “yesterday” is low. Softmax turns those scores into weights, “cat” gets a high weight, “yesterday” gets a low one. The model then takes a weighted sum of the corresponding value vectors, so the value for “cat” dominates the result. The new representation of “was” is now mostly shaped by the value of “cat.” That’s how a token several positions back becomes the referent.

There’s a constraint specific to GPT-style language models, which is that they generate text left to right. A token at position 5 is only allowed to attend to positions 1 through 5. It cannot attend to tokens at positions 6, 7, 8: at generation time those don’t exist yet, and during training the model must not peek at the tokens it is being asked to predict. This is called causal masking. The implementation is simple: scores for future positions are set to negative infinity (or a very large negative number) before softmax, so those positions end up with zero weight.

Tiny explainer: causal masking

Causal masking hides future tokens. It keeps a decoder-only language model from looking ahead while predicting the next token.
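
A minimal sketch of the mask, assuming scores are stored as a square list of lists where row `i` holds token `i`'s scores against every position. The function name is made up.

```python
import math

def causal_attention_weights(scores):
    """Row i may only put weight on columns 0..i; later columns are masked out."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)                           # the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three tokens with identical raw scores, so only the mask shapes the result.
weights = causal_attention_weights([[0.0, 0.0, 0.0]] * 3)
```

The first token can only attend to itself, the second splits its weight over two positions, and the third over all three.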

Image 5: Attention heatmap showing causal masking and high attention to cat

One of the most interesting findings in interpretability research is about specialized attention heads called induction heads, found by Anthropic in 2022. These heads learn to spot patterns of the form “A B … A” in the prompt and predict that B comes next. When the model sees “A” the second time, the induction head looks back to where “A” appeared before, sees what came after, and copies that. They’re one of the clearest known mechanisms behind in-context learning, the ability of an LLM to pick up a pattern from your prompt and continue it.

Tiny explainer: induction head

An induction head is an attention head that notices repeated patterns in the prompt and helps continue them.
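
As a caricature of the behavior (not of the actual learned mechanism, which is built out of attention weights), the "A B … A, so predict B" rule can be written as a plain lookup. The function name and the example tokens are invented.

```python
def induction_guess(tokens):
    """Find the most recent earlier copy of the last token and return what followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards, skipping the last token
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # the token has not appeared before

guess = induction_guess(["Mr", "Dursley", "was", "proud", ".", "Mr"])
```

Here the second "Mr" points back to the first one, and the guess is the token that came right after it.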

Attention has one big cost. In full attention, each token compares against all the tokens it is allowed to see, so doubling the prompt length roughly quadruples the work. This is why long prompts are expensive to run, and why a lot of recent research is about making attention more efficient (FlashAttention, sparse attention, linear attention).
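
The "roughly quadruples" claim is easy to check by counting comparisons. This sketch assumes causal attention, where token `i` compares against itself and every earlier token.

```python
def causal_comparisons(n):
    """Query-key comparisons for a prompt of n tokens: 1 + 2 + ... + n."""
    return n * (n + 1) // 2

ratio = causal_comparisons(8192) / causal_comparisons(4096)
```

Doubling the prompt from 4,096 to 8,192 tokens gives a ratio just under 4.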

But one attention head only gives the model one learned view of those relationships.


Multi-head attention

A single attention pass gives the model one way of deciding which tokens matter to which other tokens. That’s not enough. Language has many relationships happening at the same time. Subject and verb agreement. Pronouns and the names they refer to. Long-range references between sentences. Word order and local phrases.

Multi-head attention solves this by running attention many times in parallel, with each parallel pass operating in its own smaller space. Each parallel pass is called a head.

Tiny explainer: attention head

An attention head is one independent attention pass with its own learned projections.

Here’s the part that’s often described wrong, including in plenty of tutorials: each head doesn’t get a literal slice of the original token vector. Each head has its own learned projection matrices that map the full token vector down to its own smaller Q, K, and V vectors. So if a model has 4,096 numbers per token and 32 heads, each head usually works in a 128-dimensional space, but those 128 numbers are a learned projection of the full 4,096, not a fixed slice. They are different “views” of the same token, not different chunks of it.

Each head runs its attention pass independently. Then the outputs of all the heads get concatenated and passed through a final linear layer that mixes them back into one full-size vector. The model learns that final mixing too.
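
The project, attend, concatenate, mix sequence can be sketched as follows. This is a toy under stated assumptions: random matrices stand in for learned ones, the causal mask is left out for brevity, and plain lists replace the batched tensor math a real implementation would use. All names are illustrative.

```python
import math
import random

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(tokens, heads, w_out):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    concatenated = [[] for _ in tokens]
    for w_q, w_k, w_v in heads:
        # Each head projects the FULL token vector into its own small space.
        queries = [matvec(w_q, t) for t in tokens]
        keys = [matvec(w_k, t) for t in tokens]
        values = [matvec(w_v, t) for t in tokens]
        head_dim = len(queries[0])
        for i, q in enumerate(queries):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim) for k in keys]
            weights = softmax(scores)
            mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(head_dim)]
            concatenated[i].extend(mixed)  # head outputs sit side by side
    # The final linear layer mixes all heads back into one full-size vector.
    return [matvec(w_out, c) for c in concatenated]

rng = random.Random(0)
d_model, num_heads = 4, 2
head_dim = d_model // num_heads  # same arithmetic as 4096 // 32 == 128
heads = [tuple(random_matrix(head_dim, d_model, rng) for _ in range(3)) for _ in range(num_heads)]
w_out = random_matrix(d_model, num_heads * head_dim, rng)
tokens = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5], [0.3, 0.3, 0.3, 0.3]]
outputs = multi_head_attention(tokens, heads, w_out)
```

Each projection matrix has `d_model` columns, so every head sees the whole token vector before shrinking it to `head_dim` numbers.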

Image 6: Multi-head attention combines specialized attention heads

What makes this interesting is that different heads often end up partially specialized. The model is never told what each head should do. Specialization emerges naturally during training. Researchers have found heads that track grammar (linking verbs to their objects, articles to their nouns), heads that figure out which pronoun refers to which name, heads that track positional patterns, induction heads, and many more. A single transformer layer might have 32 heads. A modern frontier model has dozens of layers. So a typical LLM has thousands of attention heads in total, each adding its own learned view.

There’s a practical cost concern that drove a recent architectural change. Each head needs to keep its Key and Value vectors in memory for all the tokens already generated, so that when a new token gets generated the model doesn’t have to recompute everything from scratch. This is called the KV cache, and it’s the main memory cost of running an LLM at long context lengths.

Tiny explainer: KV cache

The KV cache stores old Key and Value vectors during generation. It saves the model from recomputing the whole prompt every time it adds a token.

Modern decoder-only LLMs mostly use a variant called Grouped-Query Attention (GQA). Instead of every head having its own keys and values, groups of heads share the same key and value heads. LLaMA-2 70B has 64 query heads but only 8 key/value heads. Mistral 7B has 32 query heads and 8 key/value heads. The result is nearly the same accuracy as full multi-head attention but with much less memory pressure and inference cost.

Tiny explainer: GQA

Grouped-Query Attention lets multiple query heads share fewer key/value heads. That cuts KV-cache memory while keeping many query views.
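
A back-of-the-envelope sketch of the saving. The head counts come from the text above; the 80 layers, head size of 128, and 2 bytes per number (16-bit storage) are assumptions added for the example, and the function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Every layer stores one Key and one Value vector per KV head for each token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

layers, head_dim = 80, 128  # assumed configuration, not stated in the text

full_mha = kv_cache_bytes_per_token(layers, num_kv_heads=64, head_dim=head_dim)
gqa = kv_cache_bytes_per_token(layers, num_kv_heads=8, head_dim=head_dim)
saving = full_mha // gqa
```

Under these assumptions, sharing 8 key/value heads across 64 query heads cuts the cache from about 2.5 MiB to 320 KiB per token, an 8x reduction.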


Feed-forward network

After attention finishes mixing information between tokens, every layer has a second step that nobody talks about as much. The feed-forward network.

Where attention is about tokens talking to each other, the feed-forward network is about each token, on its own, doing more processing. It runs on every token’s vector independently, with no cross-token mixing.

The feed-forward network does three things in order:

  1. Expand the token’s vector to a larger size (the original transformer used 4x, while modern SwiGLU models often use different expansion sizes).
  2. Apply a non-linear function.
  3. Compress the vector back down to its original size.
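
The three steps above can be sketched for a single token like this. The weights are hand-picked rather than learned, chosen so the toy network computes an absolute value, and ReLU stands in for the non-linear function.

```python
def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def relu(vec):
    return [max(0.0, v) for v in vec]

def feed_forward(x, w_up, w_down):
    hidden = matvec(w_up, x)        # 1. expand (here from 2 numbers to 4)
    hidden = relu(hidden)           # 2. apply the non-linear function
    return matvec(w_down, hidden)   # 3. compress back to the original size

w_up = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
w_down = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
out = feed_forward([1.0, -2.0], w_up, w_down)
```

With these weights the output is the absolute value of each input number, something no single matrix multiplication can produce.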

Image 7: Feed-forward network expands, transforms, and compresses each token vector

That non-linear step in the middle is doing something specific that’s worth understanding. A non-linearity is a function that bends its input. The simplest one, ReLU, outputs zero for any negative number and passes positive numbers through unchanged.

Tiny explainer: non-linearity

A non-linearity is a function that prevents the network from collapsing into one big linear transformation.

Without it, the FFN would just be two linear layers stacked together, and stacking pure linear math collapses. Two linear layers in a row are mathematically equivalent to a single linear layer, and a hundred linear layers in a row are still equivalent to one. The non-linearity is what stops that collapse, and it’s the reason the FFN can do something richer than a single matrix multiplication.
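
The collapse is easy to see with two small matrices. The numbers below are arbitrary integers so the comparison is exact.

```python
def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(vec):
    return [max(0, v) for v in vec]

w1 = [[1, 2], [3, 4]]
w2 = [[0, 1], [1, 1]]
x = [1, -1]

two_layers = matvec(w2, matvec(w1, x))       # layer 1, then layer 2
one_layer = matvec(matmul(w2, w1), x)        # a single pre-multiplied matrix
with_relu = matvec(w2, relu(matvec(w1, x)))  # same layers with a bend in between
```

`two_layers` and `one_layer` are identical, while `with_relu` is different, so no single matrix can stand in for the version with the non-linearity.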

The original transformer used ReLU. GPT and BERT moved to GELU. Modern models like LLaMA, Mistral, and PaLM use SwiGLU. The expand-then-compress structure stayed the same. The non-linearity itself is what’s been iterated on.
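
For reference, here is a sketch of a SwiGLU-style feed-forward step, assuming the common formulation with three matrices: a gate projection passed through SiLU, an "up" projection, and a "down" projection. The weights are arbitrary toy values and the names are illustrative.

```python
import math

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def silu(x):
    """SiLU (also called Swish): x times sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]  # how much of each channel to let through
    up = matvec(w_up, x)                         # the expanded content
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise gating
    return matvec(w_down, hidden)                # compress back down

w_gate = [[0.5, -0.2], [0.1, 0.9], [-0.7, 0.3]]
w_up = [[0.3, 0.8], [-0.6, 0.2], [0.4, 0.4]]
w_down = [[0.2, -0.5, 0.7], [0.9, 0.1, -0.3]]
out = swiglu_ffn([1.0, -1.0], w_gate, w_up, w_down)
```

The expand-then-compress shape is unchanged; the difference is that the bend in the middle is a learned gate instead of a fixed cutoff.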

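Most of the parameters in a dense transformer model live in the FFN, not in attention. In the original layout, with four square attention matrices and a 4x FFN expansion, the feed-forward weights are about twice the size of the attention weights, so roughly two-thirds of each layer sits in the FFN.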

And those parameters aren’t generic. They’re where much of the model’s stored factual and semantic structure lives. Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. One neuron might activate strongly on Eiffel-Tower-related text. Another on programming languages. Another on past-tense verbs. When a model “knows” that Paris is the capital of France, that fact is represented across FFN weights and activations in specific layers.

This stored-memory property has an interesting consequence. Researchers have figured out how to directly edit some facts in a trained model without retraining it. Methods like ROME (Rank-One Model Editing) can change “the Eiffel Tower is in Paris” to “the Eiffel Tower is in Rome” by making a targeted low-rank edit to a specific FFN weight matrix. The model then tends to generate text consistent with the edited association.

Some modern frontier models have started replacing the dense FFN with something called Mixture of Experts (MoE). Instead of one feed-forward network per layer, the model has many parallel FFNs (called experts) and a tiny router network that picks which experts process each token. Mixtral 8x7B has 8 experts per layer; only 2 are activated for any given token. The total parameter count goes up substantially, but the compute per token grows much more slowly because only a few experts run. That’s how you scale parameter count without scaling inference cost in proportion.

Tiny explainer: MoE

Mixture of Experts means the model has several feed-forward networks and routes each token through only a few of them.
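
A toy sketch of top-2 routing. It assumes the router scores are given directly (in a real model a small learned layer computes them from the token vector) and that the gate weights are a softmax over just the selected experts, which is one common choice. The "experts" here are stand-in functions that scale their input.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those scores."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    m = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    gates = route_top_k(router_logits, k)
    out = [0.0] * len(x)
    for index, gate in gates.items():  # only the chosen experts run at all
        expert_out = experts[index](x)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, sorted(gates)

# Eight stand-in experts; expert i multiplies its input by (i + 1).
experts = [lambda v, scale=i + 1: [scale * a for a in v] for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
out, used = moe_layer([1.0, 2.0], experts, router_logits)
```

Only experts 1 and 4 do any work for this token; the other six are skipped, which is where the compute saving comes from.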

Mixtral 8x7B has 46.7 billion total parameters but uses about 12.9 billion per token. This has become a common option for very large models because it lets you keep growing the parameter count without making inference cost grow in proportion.


Residual stream and layer normalization

The residual stream is what makes the model “additive” instead of “replacing.” After attention runs, or after the feed-forward network runs, the result usually doesn’t replace the token’s vector. It gets added to it. Position by position. The new vector equals the old vector plus the sub-block’s output.

Tiny explainer: residual connection

A residual connection adds a block’s output back to the vector it started from. It gives information and gradients a shortcut through the network.

Across thirty or fifty or a hundred layers, each layer’s contribution accumulates instead of simply overwriting the previous vector. That running sum is called the residual stream, and it has a strange property. The original input embeddings still have a direct additive path into late layers, mixed together with every sub-block’s contribution along the way.

Image 8: Residual stream accumulates attention and feed-forward outputs

Residual connections weren’t invented for transformers. They came from ResNet (He et al. 2015), originally for image recognition. The motivation was that very deep networks were hard to train: past a certain depth, adding layers made results worse, not better. The training signal got too weak (or sometimes too strong) by the time it traveled back through many layers, so the early layers couldn’t learn much from the model’s mistakes. Adding a shortcut path let the signal flow directly back from the output to the input. Suddenly you could train networks with hundreds of layers. Transformers inherited the same trick.

In modern interpretability research, the residual stream has become the central object. Every component, every attention head, every feed-forward network, even the unembedding step at the end, reads from the residual stream and writes back to it.

The second piece, layer normalization, exists for a much more practical reason. Without it, the residual stream would not stay stable. Numbers flowing through dozens of additions tend to either explode upward or collapse toward zero. Either way, training fails. Layer normalization rescales each token’s vector back into a controlled range between sub-blocks.

Tiny explainer: layer normalization

Layer normalization rescales a token vector so its numbers stay in a stable range while the model trains.

The original 2017 transformer applied normalization AFTER each sub-block (post-norm). This worked for shallow models but became harder to train reliably as depth increased. Modern transformers (GPT-2 onward, LLaMA, Mistral) commonly apply normalization BEFORE each sub-block (pre-norm). That’s one of the changes that made very deep transformers easier to train.

The function itself has also changed. Many modern open models (LLaMA, Mistral, Gemma, Phi) use a simpler variant called RMSNorm. The original layer normalization did two things at once: subtract the mean so each vector is centered on zero, then rescale the size of the numbers. RMSNorm drops the centering step and keeps only the rescaling. Empirically, the rescaling carries most of the benefit while being cheaper to compute.

Tiny explainer: RMSNorm

RMSNorm is a cheaper normalization method that rescales vector size without subtracting the mean first.
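
Here is a rough sketch of both normalizations and of a pre-norm block with residual additions. The learned scale (and, for layer normalization, bias) parameters are omitted to keep it short, and `attention` and `feed_forward` are placeholders for the real sub-blocks.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then rescale to unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Rescale by the root mean square; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def pre_norm_block(x, attention, feed_forward):
    """Normalize BEFORE each sub-block, then ADD its output to the running vector."""
    x = [a + b for a, b in zip(x, attention(rms_norm(x)))]
    x = [a + b for a, b in zip(x, feed_forward(rms_norm(x)))]
    return x

x = [1.0, 2.0, 3.0, 4.0]
silent = lambda v: [0.0] * len(v)       # a sub-block that contributes nothing
passed_through = pre_norm_block(x, silent, silent)
```

When both sub-blocks output zeros, the vector passes through unchanged: that is the direct additive path the residual stream provides.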

So that’s the unglamorous machinery. Without residual connections, very deep models become much harder to train. Without layer normalization, the running sum can blow up or collapse. With both, stacks of a hundred or more layers can be trained reliably.


Next-token prediction

After all the layers of attention and feed-forward processing finish, the model has a vector for each token in the sequence. During generation, to predict the next token, it takes the final vector of the last token only.

That last vector gets converted into one number per possible next token. If the vocabulary has 100,000 tokens, that’s 100,000 numbers. These numbers are called logits. They aren’t probabilities yet. They can be any size, positive or negative.

Tiny explainer: logits

Logits are raw scores for each possible next token. They become probabilities only after softmax.

A softmax turns those logits into the model’s probability distribution over possible next tokens. Same operation as before, different place in the model.

The model usually does not just pick the highest-probability token every time. Decoding settings control how deterministic or varied the output is. Temperature changes how sharp the distribution is. Top-k and top-p limit the choices to the most plausible next tokens. That is why the same model can feel precise in one setting and more creative in another.

Tiny explainer: temperature

Temperature controls randomness during sampling. Low temperature makes the model more conservative; high temperature makes it more varied.
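
These decoding knobs can be sketched in a few lines. This is an illustrative version, not any particular library's implementation: temperature divides the logits before softmax, top-k keeps the `k` most likely tokens, and top-p keeps the smallest set of tokens whose probabilities add up to at least `p`.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]  # low temperature sharpens the distribution
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs):
    total = sum(probs)
    return [p / total for p in probs]

def top_k(probs, k):
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:  # add tokens from most to least likely until p is covered
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize([q if i in keep else 0.0 for i, q in enumerate(probs)])

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs)[0]

logits = [2.0, 1.0, 0.0]
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
token = sample(top_p(top_k(softmax(logits), k=2), p=0.9), random.Random(0))
```

The same logits give a confident distribution at temperature 0.5 and a much flatter one at 2.0, and the filters simply zero out unlikely tokens before the draw.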

Once a token is picked, it gets added to the input. The model runs the next step on the longer sequence, usually reusing the KV cache so it doesn’t recompute the whole prefix from scratch. New attention for the new token. New feed-forward. New final vector. New prediction. The loop continues until the model emits an end-of-sequence token or hits a length limit. A whole paragraph is just this loop, one token at a time.
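
The loop itself is short. In this sketch `next_token_logits` is a stand-in for the whole model, greedy picking replaces sampling, and the KV cache is left out, so every step naively reprocesses the full sequence.

```python
def generate(next_token_logits, prompt_ids, eos_id, max_new_tokens):
    """Append one token at a time until end-of-sequence or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy choice
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

def toy_logits(ids, vocab_size=5):
    """Toy 'model' that always favors the previous token ID plus one."""
    favored = (ids[-1] + 1) % vocab_size
    return [1.0 if i == favored else 0.0 for i in range(vocab_size)]

full_run = generate(toy_logits, [0], eos_id=4, max_new_tokens=10)
cut_short = generate(toy_logits, [0], eos_id=4, max_new_tokens=2)
```

The first call stops because the toy model emits the end-of-sequence ID; the second stops because it runs out of budget.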

This single objective, predicting the next token, is the core training signal for a base LLM. The base model isn’t trained on factual accuracy, conversational ability, reasoning, or coding directly. It’s trained to predict the next token in massive amounts of text. Later post-training can then tune the model for instruction following, preference, safety, and conversational behavior.

There’s been a major efficiency innovation worth knowing about. It’s called speculative decoding. A small fast model proposes several tokens ahead. The big model scores all of them in one parallel pass. Each proposed token is then accepted or rejected by comparing the two models’ probabilities for it. At the first rejection, the remaining guesses are thrown away and the big model supplies that token itself. Done correctly, the output distribution matches running the big model alone, but the loop can run much faster.

Tiny explainer: speculative decoding

Speculative decoding uses a small draft model to guess ahead, then asks the larger model to verify several guessed tokens at once.
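
One round of this can be sketched as below, assuming the standard accept/reject rule: keep a drafted token with probability min(1, p/q), where p and q are the big and small models' probabilities for it, and on rejection resample from the leftover max(0, p − q). Both "models" are toy functions that return a fixed distribution, and the big model is called position by position here, where a real system would use one batched pass.

```python
import random

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs)[0]

def speculative_step(target_probs, draft_probs, prefix, num_draft, rng):
    """Draft num_draft tokens with the small model, then verify them with the big one."""
    drafted, draft_dists = [], []
    context = list(prefix)
    for _ in range(num_draft):  # 1. the small model guesses ahead
        q = draft_probs(context)
        token = sample(q, rng)
        drafted.append(token)
        draft_dists.append(q)
        context.append(token)

    accepted = list(prefix)
    for token, q in zip(drafted, draft_dists):  # 2. the big model checks each guess
        p = target_probs(accepted)
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(token)
        else:
            # Rejected: resample from the part of p that q under-covered, then stop.
            leftover = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            accepted.append(sample(leftover, rng))
            return accepted
    # 3. every guess survived, so the big model adds one more token for free
    accepted.append(sample(target_probs(accepted), rng))
    return accepted

target = lambda context: [0.7, 0.2, 0.1]  # toy big model
draft = lambda context: [0.5, 0.3, 0.2]   # toy small model
result = speculative_step(target, draft, [0], num_draft=3, rng=random.Random(2))
```

Each round adds at least one token and at most `num_draft + 1`. When the draft model agrees with the big model exactly, every guess is accepted.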

The next-token prediction loop is the simplest part of the architecture, but it’s what makes the whole thing work.


Architecture vs trained weights

We’ve gone through the core mechanisms: tokens, embeddings, positional encoding, attention, multi-head attention, the feed-forward network, the residual stream and normalization, and the next-token loop on the output side. That’s the basic architecture in one pass.

So what’s actually different between GPT and Claude and Gemini and LLaMA? Public details vary, and the proprietary models do not publish every architectural choice. But at the level this post is covering, they broadly sit in the same transformer-family design space.

Most modern transformer-based LLMs use the same broad structure: tokenization, embeddings, positional encoding, stacked transformer layers (each with multi-head attention and a feed-forward network), residual streams, layer normalization, and next-token prediction.

What changes between models is:

  1. The trained weights themselves, learned from different training data at different scales.
  2. The configuration: number of layers, vocabulary size, head count, parameter count, MoE or dense.
  3. The post-training: instruction tuning, learning from human feedback, safety controls applied on top of the base model.

Tiny explainer: weights

Weights are the learned numbers inside the model. Training changes those numbers until the model predicts text well.

The 2023-2025 “modern transformer” stack converged on a common set of choices across many serious frontier and open-weight models, even though different teams arrived at them independently. Pre-norm placement. RMSNorm. RoPE. SwiGLU. Grouped-Query Attention. Mixture of Experts in some of the largest models. None of these were invented at once. They accumulated over about five years of refinement on top of the original 2017 design.


Where this is going

The convergence around transformer-family architectures is unusual in machine learning history. For most of the field’s life, every problem had its own specialized network. Image recognition used one kind. Language used another. Audio used a third. Vision and language teams barely shared methods.

Now transformer-style models show up across language, vision, audio, and multimodal systems. The transformer absorbed a huge part of the field.

That could change. Mamba and other state-space models are credible alternatives, especially for very long sequences. Hybrid architectures are being explored. Mixture-of-experts has already shifted what “the architecture” means at the frontier in ways that would have been considered exotic five years ago.

But the core mechanisms in this post (tokens, embeddings, positional encoding, attention, the feed-forward network, the residual stream and normalization, and next-token prediction) are the durable parts. Even when the architecture changes, these are the problems any sequence model has to solve in some form.

If you’ve made it this far, you can read many modern transformer papers or model cards and know which piece each section is talking about. That’s the goal.

Feedback is extremely welcome. If any of this interests you, please reach out on X. I love making new friends.

