Chapter 01 · Part I · Foundations

Transformer Architecture Internals

8 practice sets · 7 coding problems

A modern large language model (LLM) looks intimidating from the outside, but underneath the marketing it is a remarkably uniform machine: a stack of $L$ nearly identical transformer blocks threaded onto a single shared communication channel, all built around one operation called attention. This mini-chapter builds that machine from the ground up, assuming only that you have seen a neural network and a bit of linear algebra before. By the end you should be able to follow a single token all the way from an integer id, through every sublayer of a decoder-only transformer, out to a probability distribution over the next token — and you should know the name and the job of every moving part (embeddings, the residual stream, queries/keys/values, heads, the feed-forward network, normalization, positional encodings, the KV cache) that the rest of this topic takes apart in detail.

The one job: predict the next token

Everything an LLM does at training time collapses to a single, almost embarrassingly simple objective: given a prefix of text, predict the next token. That is the whole game. The fluency, the coding, the apparent reasoning — all of it is squeezed out of a model that was only ever asked to guess what comes next.

To make this precise we need one preprocessing step. Raw text is first chopped into tokens (sub-word chunks) by a tokenizer (Topic 2), so a sentence becomes a sequence of integer ids $x_1,\dots,x_T$ drawn from a fixed vocabulary of size $V$ (commonly $30\text{k}$ – $200\text{k}$ entries). For example, the string "unhappiness" might become three tokens ["un", "happ", "iness"], i.e. three integers. A decoder-only transformer then models the probability of the whole sequence autoregressively, factorizing the joint probability strictly left to right, one token at a time:

p(x_1,\dots,x_T)=\prod_{t=1}^{T} p\!\left(x_t \mid x_{1},\dots,x_{t-1}\right).

Read this as: the probability of the text is the probability of the 1st token, times the probability of the 2nd given the 1st, times the 3rd given the first two, and so on. Each factor conditions only on tokens already seen — never on the future. The network is therefore one big function that maps a prefix $x_{<t}$ to a probability distribution over which token comes next. Generation is then just a loop: evaluate the function, sample a token from the predicted distribution, append it to the prefix, and repeat.
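As a minimal sketch of both ideas, the chain-rule product and the generation loop, here is a toy version in Python. The "model" is an assumption made purely for illustration: a hypothetical bigram table (`BIGRAM`) that looks only at the last token, where a real LLM would condition on the whole prefix. The names `next_token_probs`, `sequence_log_prob`, and `generate` are likewise illustrative.

```python
import math
import random

# Hypothetical toy "model": next-token probabilities depend only on the last
# token (a bigram table). A real LLM conditions on the whole prefix.
VOCAB = ["<bos>", "the", "cat", "sat"]
BIGRAM = {
    0: [0.0, 0.8, 0.1, 0.1],   # after <bos>
    1: [0.0, 0.1, 0.7, 0.2],   # after "the"
    2: [0.0, 0.1, 0.1, 0.8],   # after "cat"
    3: [0.0, 0.6, 0.2, 0.2],   # after "sat"
}

def next_token_probs(prefix):
    """Map a prefix of ids to a distribution over the next id."""
    return BIGRAM[prefix[-1]]

def sequence_log_prob(ids):
    """Chain rule: sum of log p(x_t | x_<t), starting from <bos>."""
    prefix = [0]
    total = 0.0
    for token in ids:
        total += math.log(next_token_probs(prefix)[token])
        prefix.append(token)
    return total

def generate(n_steps, rng):
    """Evaluate the function, sample a token, append it, repeat."""
    prefix = [0]
    for _ in range(n_steps):
        probs = next_token_probs(prefix)
        token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        prefix.append(token)
    return prefix[1:]

log_p = sequence_log_prob([1, 2, 3])      # "the cat sat": 0.8 * 0.7 * 0.8
sample = generate(5, random.Random(0))    # five sampled ids
```

Working in log-probabilities, as `sequence_log_prob` does, is the usual choice because a product of many small factors underflows quickly.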

The whole architecture in one picture

Before we zoom into any part, here is the entire pipeline, end to end. Hold this picture in your head; everything below is just one of these boxes opened up.

Text $\to$ ids $\to$ each id becomes a vector (embedding) $\to$ those vectors flow through $L$ identical blocks that gradually refine them $\to$ a final norm $\to$ an unembedding that turns each vector into $V$ raw scores (logits) $\to$ a softmax that turns scores into probabilities. We now walk this pipeline left to right.

From token ids to vectors: the embedding

A neural network multiplies matrices; it cannot multiply by the integer “ $4173$ .” So each id is first turned into a continuous vector. The embedding matrix $E\in\mathbb{R}^{V\times d}$ stores one learned row of length $d$ per vocabulary entry, and “embedding” a token is simply a table lookup: grab the row whose index is the token's id. The width $d$ — the model dimension, often written $d_{\text{model}}$ (e.g. $2048$ , $4096$ , $8192$ ) — is the dimension every internal vector lives in, from here to the very end. The same row is fetched for a given token wherever it appears, so the embedding by itself carries no notion of where the token sits in the sentence; we repair that later with positional encodings.

Hands-on · a 4-word vocabulary

Say $V=4$ (vocabulary $\{\text{the}=0,\ \text{cat}=1,\ \text{sat}=2,\ \text{mat}=3\}$ ) and $d=3$ , with

E=\begin{pmatrix}0.1 & 0.0 & {-0.2}\\ 0.9 & 0.3 & 0.0\\ {-0.4} & 0.7 & 0.5\\ 0.2 & {-0.1}& 0.8\end{pmatrix}.

The sentence “the cat sat” is ids $(0,1,2)$ . Embedding it just stacks rows $0,1,2$ :

X=\begin{pmatrix}0.1 & 0.0 & {-0.2}\\ 0.9 & 0.3 & 0.0\\ {-0.4}& 0.7 & 0.5\end{pmatrix}\in\mathbb{R}^{3\times 3}.

No arithmetic happened — a lookup is the cheapest layer in the network. From now on the model only ever sees $X$ , never the integers.
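In code the lookup is a single indexing step. The sketch below uses plain Python lists and the matrix $E$ from this example; the function name `embed` is an illustrative choice, not a library API.

```python
# Embedding matrix E from the example: one row of length d = 3 per vocabulary id.
E = [
    [0.1, 0.0, -0.2],   # id 0: "the"
    [0.9, 0.3, 0.0],    # id 1: "cat"
    [-0.4, 0.7, 0.5],   # id 2: "sat"
    [0.2, -0.1, 0.8],   # id 3: "mat"
]

def embed(ids):
    """Embedding is a table lookup: return row E[i] for each token id i."""
    return [E[i] for i in ids]

X = embed([0, 1, 2])   # "the cat sat" -> a 3 x 3 activation matrix
```

Calling `embed([1, 1])` returns the same row twice, which is the position-blindness noted above.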

Stacking the $T$ token vectors gives an activation matrix $X\in\mathbb{R}^{T\times d}$ (one row per position). It pays to separate two kinds of numbers from the very start, because they behave completely differently:

Parameters (a.k.a. weights) are the learned numbers — the embedding $E$ , the projection matrices inside attention, the FFN matrices. They are fixed after training and shared across all positions and all inputs.
Activations are the per-token vectors that flow through the network for one specific input. They depend on the actual tokens, are recomputed for every input, and — at long context — dominate memory.

The residual stream: a shared bus the whole network reads and writes

Here is the single most important structural idea in the transformer. A transformer never overwrites its representation of a token. Each block reads the current vector, computes an update, and adds it back:

x \;\leftarrow\; x + \operatorname{Sublayer}(x).

This additive skip is the residual connection, and the running sum that every block reads from and writes to is the residual stream.

Two consequences matter enormously. First, gradients flow. Because the path through “ $+x$ ” is the identity, the gradient has a clean highway straight back to early layers; this is what lets stacks of dozens to well over a hundred layers train without the gradient vanishing on the way down. Second, it gives a clean mental model used throughout interpretability: each sublayer reads features that earlier layers wrote and writes a small increment for later layers to use. The stream is a shared bus; the blocks are devices that read and write it.

Each block contains exactly two sublayers, run here in the now-standard pre-norm order: normalize, then multi-head self-attention, added back; normalize, then a feed-forward network, added back. The division of labor is sharp and worth memorizing: attention is the only place where tokens exchange information across positions; the FFN and the norms act on each token in isolation.
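The pre-norm ordering can be sketched as a few lines of Python. This is a structural sketch only: `norm`, `attention`, and `ffn` are placeholders passed in by the caller, and the stand-ins used below (`identity`, `zeros`, `plus_one`) are hypothetical sublayers chosen to make the additive behaviour visible.

```python
def pre_norm_block(x, norm, attention, ffn):
    """One pre-norm transformer block acting on a list of token vectors.
    `norm`, `attention`, and `ffn` are placeholders supplied by the caller;
    each maps a list of vectors to a list of vectors of the same shape."""
    def add(a, b):
        return [[ai + bi for ai, bi in zip(ra, rb)] for ra, rb in zip(a, b)]

    x = add(x, attention(norm(x)))   # mix across positions, then add back
    x = add(x, ffn(norm(x)))         # per-token update, then add back
    return x

# Stand-in sublayers (assumptions for illustration only).
identity = lambda rows: rows
zeros = lambda rows: [[0.0] * len(r) for r in rows]
plus_one = lambda rows: [[1.0] * len(r) for r in rows]

x = [[0.1, 0.0, -0.2], [0.9, 0.3, 0.0]]
unchanged = pre_norm_block(x, identity, zeros, zeros)     # sublayers write nothing
shifted = pre_norm_block(x, identity, plus_one, zeros)    # attention writes +1
```

When both sublayers write zero, the block is the identity map: that is the "clean highway" from the previous paragraph in its simplest form.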

Attention: a soft, learned dictionary lookup

Attention is how a token gathers information from other tokens. From the block input $X$ , three learned linear maps produce three different “views” of every token:

Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V,

the query, key, and value. The cleanest way to read these:

the query of a token is its question — “what kind of context am I looking for?”;
the key of a token advertises what it offers — “here is what I am about, match against me”;
the value is the content a token hands over if it is judged relevant.

A token compares its query against every key by a dot product (high dot product $=$ good match), turns those match-scores into weights that sum to one, and uses them to take a weighted average of the values. Written out, scaled dot-product attention is

\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\Big(\tfrac{QK^{\top}}{\sqrt{d_k}}\Big)\,V,

where $d_k$ is the dimension of each key/query vector. Three small details earn their keep.

The softmax, applied to each row, turns a row of raw scores $z$ into non-negative weights that sum to one, $\operatorname{softmax}(z)_j = e^{z_j}\big/\sum_k e^{z_k}$ , so every output is a convex combination (a weighted average) of value vectors: the model can blend information but cannot invent values out of nothing.

The scaling by $1/\sqrt{d_k}$ keeps the dot products from exploding: if query and key entries are independent with zero mean and unit variance, then $\operatorname{Var}(q\!\cdot\!k)=d_k$ , so raw scores have standard deviation $\sqrt{d_k}$ ; dividing by $\sqrt{d_k}$ cancels that growth. Without it, large $d_k$ would push the softmax into a saturated, near one-hot regime where one weight is $\approx 1$ and the gradient through the softmax is $\approx 0$ , so learning stalls.

The causal mask enforces autoregression: before the softmax, every score at a future position ( $j>i$ , i.e. a key that comes after the query) is set to $-\infty$ , so after the softmax its weight is exactly $0$ and token $i$ can attend only to positions $\le i$ . This single trick is what lets us train on all positions in parallel while still respecting the strict left-to-right factorization from the top of the chapter.
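The variance claim behind the $1/\sqrt{d_k}$ scaling can be checked in one line, under the assumption that the entries $q_i$ and $k_i$ are mutually independent with zero mean and unit variance (an idealization, not a property of trained weights):

\operatorname{Var}(q\!\cdot\!k)=\sum_{i=1}^{d_k}\operatorname{Var}(q_i k_i)=\sum_{i=1}^{d_k}\mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]=d_k,\qquad\text{so}\qquad \operatorname{Var}\!\Big(\tfrac{q\cdot k}{\sqrt{d_k}}\Big)=1.

The first equality uses independence across coordinates; the second uses $\mathbb{E}[q_i k_i]=0$ , which is where the zero-mean assumption enters.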

Diagram: the causal score matrix for a four-token sequence. The cells above the diagonal (future keys) are masked to $-\infty$ ; every other cell holds a raw match-score $s_{rc}=q_r\!\cdot\!k_c/\sqrt{d_k}$ . Softmax along each row turns that row into the weights used to average the values; for the last query, this gives four weights that sum to one.

Hands-on · attention on three tokens, by hand

Let the (already scaled, already masked) score rows for a $3$ -token sequence be

\text{row 1: } (0,\,{-}\infty,\,{-}\infty),\quad \text{row 2: } (0,\,\ln 4,\,{-}\infty),\quad \text{row 3: } (\ln 2,\,0,\,\ln 2).

Apply softmax to each row (recall $e^{0}=1,\ e^{\ln 4}=4,\ e^{\ln 2}=2$ ):

Token 1 can only see itself $\Rightarrow$ weights $(1,0,0)$ ; output $=v_1$ .
Token 2: weights $\big(\tfrac{1}{1+4},\tfrac{4}{1+4},0\big)=(0.2,\,0.8,\,0)$ ; output $=0.2\,v_1+0.8\,v_2$ .
Token 3: weights $\big(\tfrac{2}{2+1+2},\tfrac{1}{5},\tfrac{2}{5}\big)=(0.4,\,0.2,\,0.4)$ ; output $=0.4\,v_1+0.2\,v_2+0.4\,v_3$ .

That is all attention ever does: for each position, produce a normalized blend of the value vectors at positions it is allowed to see. Token 2 mostly copies itself with a $20\%$ glance back at token 1; token 3 splits its attention between the first and third tokens.
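The same three rows can be checked mechanically. This is a minimal sketch in plain Python; the one-hot value vectors in `V` are a hypothetical choice, made so that each output row simply displays its attention weights.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(score_rows, values):
    """For each query row, return the softmax-weighted average of `values`."""
    d = len(values[0])
    outputs = []
    for row in score_rows:
        w = softmax(row)
        outputs.append([sum(w[j] * values[j][c] for j in range(len(values)))
                        for c in range(d)])
    return outputs

# Already scaled, already masked scores from the example above.
scores = [
    [0.0, NEG_INF, NEG_INF],
    [0.0, math.log(4), NEG_INF],
    [math.log(2), 0.0, math.log(2)],
]
weights = [softmax(row) for row in scores]

# Hypothetical value vectors (one-hot, so each output row shows its weights).
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = attend(scores, V)
```

Subtracting the row maximum before exponentiating does not change the result, and it keeps `math.exp` from overflowing on large scores.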

The $T\times T$ score matrix is the source of attention's cost: it has $T^2$ entries per head, so a self-attention layer is $O(T^2 d)$ in the sequence length $T$ . That quadratic term is what makes long context expensive and motivates much of Topic 3.

Multiple heads, and the MHA / MQA / GQA / MLA family

One attention pattern per layer is limiting — a token might need to track grammatical agreement and copy a name and watch for a closing bracket all at once. So multi-head attention (MHA) runs $h$ attentions in parallel, each on its own $d_k=d/h$ -dimensional slice of the vectors. Different heads specialize, and their outputs are concatenated back to width $d$ and mixed by an output projection $W_O$ . Crucially, heads split the width rather than duplicate it, so $h$ heads cost essentially the same as one wide attention.

Attention is the only place tokens communicate; the norms and the FFN act on each token alone. A transformer block is therefore “mix across positions (attention), then think per position (FFN),” repeated $L$ times, with each step added into the residual stream.

There is a catch that drives a whole family of variants. During generation the keys and values of every past token must be kept around (the KV cache, below), and with full MHA that store is bulky. The fix is to share keys and values across query heads, trading a little quality for a much smaller cache. Multi-query attention (MQA) uses a single shared K/V for all heads. Grouped-query attention (GQA) interpolates: each group of query heads shares one K/V, so $G=h$ groups recovers MHA and $G=1$ recovers MQA — modern models pick something in between (say $8$ groups). Multi-head latent attention (MLA), from DeepSeek-V2, instead compresses K and V into a small shared latent vector that is what gets cached, then expands it back per head at use time; it shrinks the cache to a fraction of MHA while, in their ablations, matching or beating MHA quality. These three are the standard levers for the cache-vs-quality trade, and Topic 3 returns to them.
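The sharing pattern behind MHA, GQA, and MQA comes down to one integer division. The sketch below assumes that consecutive query heads form a group, a common but not universal layout; `kv_head_for_query_head` is a hypothetical helper name.

```python
def kv_head_for_query_head(q_head, n_heads, n_groups):
    """Index of the shared K/V head that query head `q_head` reads from,
    assuming consecutive query heads are grouped together."""
    if n_heads % n_groups != 0:
        raise ValueError("n_heads must be divisible by n_groups")
    heads_per_group = n_heads // n_groups
    return q_head // heads_per_group

h = 32
mha = [kv_head_for_query_head(i, h, n_groups=h) for i in range(h)]   # G = h
mqa = [kv_head_for_query_head(i, h, n_groups=1) for i in range(h)]   # G = 1
gqa = [kv_head_for_query_head(i, h, n_groups=8) for i in range(h)]   # G = 8
```

With $G=h$ every query head gets its own K/V head, with $G=1$ all of them read head $0$ , and with $G=8$ each run of four query heads shares one.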

The feed-forward network: where each token “thinks”

After attention has mixed information across tokens, the feed-forward network (FFN, also called the MLP) transforms each token's vector independently — the same small network is applied to every position, with no interaction between positions:

\operatorname{FFN}(x)=W_2\,\phi(W_1 x + b_1)+b_2 .

It projects up to a hidden width that is classically $4\times d_{\text{model}}$ , applies a pointwise nonlinearity $\phi$ (originally $\operatorname{ReLU}$ , today usually $\operatorname{GELU}$ or a gated variant), then projects back down to $d$ . Without $\phi$ the two linear maps would collapse into a single affine map with matrix $W_2 W_1$ , no more expressive than one linear layer; the nonlinearity is what gives the FFN its power. The $4\times$ expansion gives the network room to compute richer per-token features in a higher-dimensional space before compressing back.

Modern models often use gated FFNs such as SwiGLU, where one up-projection is multiplied elementwise by a gated version of a second; gating tends to improve quality, and to hold the parameter count fixed the hidden width is shrunk (e.g. to $\tfrac{8}{3}d$ instead of $4d$ ). Either way the FFN usually holds the majority of a model's non-embedding parameters.
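The $\tfrac{8}{3}d$ figure falls out of a parameter count. The sketch below ignores biases and assumes the gated FFN has three weight matrices (two up-projections and one down-projection); the width `d = 3072` is a hypothetical value picked only because it is divisible by $3$ .

```python
def ffn_params(d_model, d_hidden):
    """Weights in a two-matrix FFN (W_1 and W_2), biases ignored."""
    return 2 * d_model * d_hidden

def swiglu_params(d_model, d_hidden):
    """Weights in a gated FFN with three matrices (two up, one down),
    biases ignored."""
    return 3 * d_model * d_hidden

d = 3072                              # hypothetical width, divisible by 3
classic = ffn_params(d, 4 * d)        # 2 * d * 4d = 8 d^2
gated = swiglu_params(d, 8 * d // 3)  # 3 * d * (8/3)d = 8 d^2
```

Both counts come to $8d^2$ , so the gated variant spends the same budget on three narrower matrices instead of two wider ones.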

Normalization: keeping the activations sane

Deep residual stacks need their activations held at a stable scale, or the numbers drift and training destabilizes. That is the job of normalization. LayerNorm standardizes each token vector to zero mean and unit variance across its $d$ features, then rescales with a learned gain and shift. (Note it normalizes within a single token's vector, not across the batch — that is why it suits variable-length text where batch statistics would be meaningless.) RMSNorm simplifies this by dropping the mean-subtraction entirely and normalizing only by the root-mean-square:

\operatorname{RMSNorm}(x)=\frac{x}{\sqrt{\tfrac{1}{d}\sum_{i} x_i^2+\varepsilon}}\;\odot\; g ,

where $g\in\mathbb{R}^d$ is a learned per-feature gain and $\varepsilon$ is a tiny constant for numerical safety. Re-centering (subtracting the mean) turns out to be largely dispensable; keeping only the re-scaling gives comparable quality while being cheaper, which is why most recent LLMs use RMSNorm.
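A direct transcription of the formula for a single token vector, with LayerNorm alongside for comparison, might look as follows. The default `eps=1e-6` is an assumed value for illustration; real models choose their own.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm for one token vector: divide by the root-mean-square of its
    features, then scale each feature by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gain)]

def layer_norm(x, gain, shift, eps=1e-6):
    """LayerNorm for comparison: also subtracts the mean before rescaling."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * scale * g + b for v, g, b in zip(x, gain, shift)]

x = [3.0, 4.0]
y = rms_norm(x, gain=[1.0, 1.0])                        # RMS of y is about 1
z = layer_norm(x, gain=[1.0, 1.0], shift=[0.0, 0.0])    # mean of z is 0
```

Note that `rms_norm` keeps both entries positive (only the scale changes), whereas `layer_norm` re-centers them around zero.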

Where the norm sits matters as much as which norm it is. The diagrams here use pre-norm: normalize before each sublayer and leave the residual highway itself untouched. The original 2017 transformer used post-norm (normalize after the add), which is harder to train at depth because the clean additive highway gets renormalized at every block, disrupting the gradient flow we worked so hard to preserve. Pre-norm is now standard precisely because it keeps an unbroken identity path from input to output — though it brings its own quirk, a residual norm that tends to grow with depth, which later questions in this topic revisit.

Position: telling tokens where they are

Here is a subtle but crucial fact. Pure self-attention is permutation-equivariant: shuffle the input tokens and the outputs shuffle identically, because the scores depend only on content ( $QK^\top$ ), never on order. But “dog bites man” must mean something different from “man bites dog,” so order has to be injected somehow. Early models simply added a fixed sinusoidal or learned “position vector” to each embedding. Modern LLMs overwhelmingly use rotary position embeddings (RoPE), which instead rotate each query and key vector by an angle proportional to its position.
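A two-dimensional toy shows why rotating by position is attractive: after rotation, the query-key dot product depends only on the offset between the two positions, not on where the pair sits. The sketch below is a simplification under stated assumptions: it rotates a single 2-D pair with one arbitrary frequency `theta`, whereas RoPE as used in practice applies such rotations to many coordinate pairs with different frequencies.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by the angle position * theta.
    `theta` is an arbitrary illustrative frequency."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q, k = [1.0, 2.0], [0.5, -1.0]   # hypothetical query and key

# Same offset (query two positions after key), different absolute positions.
score_near = dot(rotate(q, 5), rotate(k, 3))
score_far = dot(rotate(q, 105), rotate(k, 103))
# A different offset gives a different score.
score_other = dot(rotate(q, 5), rotate(k, 0))
```

Here `score_near` and `score_far` agree up to floating-point error, while `score_other` differs.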

Output: from the final vector to a next-token distribution

After the last block, the residual stream is normalized once more and mapped by the output projection (or “unembedding”) $W_U\in\mathbb{R}^{d\times V}$ to a vector of $V$ logits — one raw, unnormalized score per vocabulary token. “Logit” just means the pre-softmax score, so a larger logit means a larger probability. A softmax over the logits gives the next-token distribution. Very often $W_U$ is set to the transpose of the embedding matrix $E$ (weight tying), which saves a large block of parameters and ties together how a token is read in and written out; some large-vocabulary models untie them for a little extra quality.
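With weight tying, computing the logits is a dot product of the final vector against every embedding row. The sketch below reuses the toy $E$ from the 4-word vocabulary example; `x_final` is a hypothetical final-layer vector, and `unembed_tied` is an illustrative name.

```python
import math

E = [
    [0.1, 0.0, -0.2],   # "the"
    [0.9, 0.3, 0.0],    # "cat"
    [-0.4, 0.7, 0.5],   # "sat"
    [0.2, -0.1, 0.8],   # "mat"
]

def unembed_tied(x, E):
    """Logits under weight tying: W_U = E^T, so logit j is the dot product
    of the final vector x with embedding row j."""
    return [sum(xi * ei for xi, ei in zip(x, row)) for row in E]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

x_final = [0.9, 0.3, 0.0]            # hypothetical final vector (after the norm)
logits = unembed_tied(x_final, E)    # one raw score per vocabulary entry
probs = softmax(logits)              # next-token distribution
```

Because `x_final` was chosen equal to the row for "cat", that token receives the largest logit: with tied weights, a vector that resembles a token's embedding votes for that token.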

From the distribution we pick the next token by decoding: greedy takes the argmax; temperature sampling divides the logits by a temperature $\tau$ before softmax to make the distribution sharper ( $\tau<1$ , more deterministic) or flatter ( $\tau>1$ , more random); top- $k$ /top- $p$ restrict sampling to the most probable tokens (Topic 13 goes deeper). Whatever the rule, we append the chosen token and loop.
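These decoding rules are each a few lines. The sketch below is a plain-Python illustration with hypothetical logits; the helper names are not from any particular library.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Divide logits by tau, then softmax. tau must be positive."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Greedy decoding: index of the largest logit."""
    return max(range(len(logits)), key=lambda j: logits[j])

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf so softmax gives them 0."""
    keep = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    return [z if j in keep else float("-inf") for j, z in enumerate(logits)]

logits = [2.0, 1.0, 0.0, -1.0]                        # hypothetical logits
sharp = softmax_with_temperature(logits, tau=0.5)     # more peaked
plain = softmax_with_temperature(logits, tau=1.0)
flat = softmax_with_temperature(logits, tau=2.0)      # closer to uniform
top2 = softmax_with_temperature(top_k_filter(logits, 2))
```

Sampling then draws an index from the resulting list of probabilities, for instance with `random.choices`.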

Generation, and the KV cache that makes it fast

Generation is the loop promised at the start: run the network, read the logits at the last position, pick a token, append it, repeat. Done naively, each new step would recompute attention over the entire prefix from scratch — wasteful, because the keys and values of all the past tokens do not change when you append a new one. The fix is the KV cache: compute each token's K and V once, store them, and at every step compute K/V only for the single new token and append it to the cache. This turns the per-step work from quadratic into linear in the context length.
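A toy single-head version makes the equivalence concrete. In this sketch the projections $W_Q, W_K, W_V$ are replaced by identity maps (an assumption for brevity, so $q=k=v=x$ ), and `KVCache` is a hypothetical class, not a library type. The causal mask is implicit: the cache only ever contains positions up to the current one.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def attend_one(q, keys, values):
    """Output for one query against all given keys/values (single head)."""
    scale = math.sqrt(len(q))
    w = softmax([dot(q, k) / scale for k in keys])
    return [sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))]

def project(x):
    """Toy stand-in for W_Q, W_K, W_V: identity maps, so q = k = v = x."""
    return x, x, x

def last_output_naive(xs):
    """Recompute K and V for the whole prefix, then attend from the last token."""
    keys = [project(x)[1] for x in xs]
    values = [project(x)[2] for x in xs]
    return attend_one(project(xs[-1])[0], keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Compute K/V only for the new token, append, attend over the cache."""
        q, k, v = project(x)
        self.keys.append(k)
        self.values.append(v)
        return attend_one(q, self.keys, self.values)

xs = [[0.1, 0.0, -0.2], [0.9, 0.3, 0.0], [-0.4, 0.7, 0.5]]
cache = KVCache()
cached_outputs = [cache.step(x) for x in xs]
naive_outputs = [last_output_naive(xs[: t + 1]) for t in range(len(xs))]
```

Both paths produce the same outputs; the cached path simply avoids re-deriving keys and values it has already seen.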

But the cache is not free: it grows linearly with sequence length, and with $\text{layers}\times\text{KV-heads}\times\text{head-dim}$ . At long context it, not the weights, dominates memory — which is exactly the pressure that MQA, GQA, and MLA were invented to relieve.

Hands-on · how big is the KV cache?

A rough but standard estimate. The cache stores, per token, the K and V vectors of every layer: that is $2$ (K and V) $\times\,L$ (layers) $\times\,n_{kv}$ (KV heads) $\times\,d_h$ (head dim) numbers. Take a $7$ B-class model with $L=32$ , $n_{kv}=32$ heads, $d_h=128$ , in fp16 ( $2$ bytes), at a context of $T=8192$ tokens for one sequence:

2\times 32\times 32\times 128 \times 2\,\text{B} = 524{,}288\ \text{B} \approx 0.5\,\text{MB \emph{per token}},

so $8192$ tokens cost $\approx 4\,$ GB, a sizable fraction of the roughly $14\,$ GB that $7$ B parameters occupy in fp16, and it scales with batch size. Now repeat the calculation with GQA at $n_{kv}=8$ : the cache shrinks $4\times$ to $\approx 1\,$ GB. That one-line saving is why grouped-query attention is so widely adopted in recent models.
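The estimate is easy to evaluate mechanically. The helper below is a sketch of the formula $2\times L\times n_{kv}\times d_h\times\text{bytes}$ per token; the function name and the `batch_size` parameter are illustrative additions, and sizes are reported in binary gigabytes ( $2^{30}$ bytes).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_number=2, batch_size=1):
    """Bytes needed to cache K and V for every layer, KV head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_number
    return per_token * n_tokens * batch_size

GIB = 1024 ** 3

per_token_mha = kv_cache_bytes(32, 32, 128, n_tokens=1)        # 524288 bytes
full_mha = kv_cache_bytes(32, 32, 128, n_tokens=8192) / GIB    # 4.0
full_gqa = kv_cache_bytes(32, 8, 128, n_tokens=8192) / GIB     # 1.0
```

Evaluated directly, the product is $524{,}288$ bytes per token, which gives $4$ GiB at $T=8192$ with $n_{kv}=32$ and $1$ GiB with $n_{kv}=8$ .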

Putting it together / what to watch for

The full mental model in one breath: ids $\to$ embedding $\to$ a residual stream that passes through $L$ identical blocks, each doing pre-norm $\to$ attention $\to$ add then pre-norm $\to$ FFN $\to$ add, with positions injected by RoPE inside attention and the causal mask enforcing left-to-right flow, $\to$ final norm $\to$ unembedding $\to$ logits $\to$ softmax $\to$ next token. One layer of nuance worth carrying forward: because every block only adds to the stream, the final logits can be read, up to the final norm, as a sum of contributions from each layer, and intermediate states of the stream can be decoded early with the unembedding (the “logit lens”). Small combinations of heads across two layers can also implement crisp algorithms; the classic example is an induction circuit that copies and continues a repeated pattern (“… Dursley … Durs” $\to$ “ley”).

A handful of recurring tensions drive almost every design choice in this topic, and naming them now will make the detailed questions feel familiar:

Attention's $O(T^2)$ cost and the linearly growing KV cache push toward GQA, MLA, and smarter caching (Topic 3).
Softmax saturation and low-precision (fp16) overflow push toward the $1/\sqrt{d_k}$ scaling, QK-normalization, and careful norm placement.
The need to encode position without breaking long-context generalization motivates RoPE.
Depth stability motivates the pre-norm residual stream (and fixes for its growing-norm side effect).

Keep this skeleton in mind, and the detailed questions that follow — RoPE derivations, KV-cache arithmetic, MLA's decoupled-RoPE trick, attention sinks, the softmax bottleneck, and the rest — will read as variations on parts you have already met.