Chapter 03 · Part I · Foundations

Attention Efficiency & Long Context

8 practice sets · 5 coding problems

Topic 1 built the transformer and noted, almost in passing, two facts that this chapter is entirely about. First, self-attention costs $O(T^2)$ in the sequence length $T$: every token looks at every other token, so doubling the context quadruples the attention work. Second, generating text keeps a steadily growing KV cache of every past token's keys and values. For a short chat those two facts are footnotes. For a model asked to read a $200$-page contract, a million-line codebase, or an hour of transcribed audio, they become the entire bill: in compute, in memory, and in the time you wait for the first word. “Attention efficiency and long context” is the engineering discipline of making that bill payable while keeping the answers good. This chapter assumes only Topic 1 (you know what $Q$, $K$, $V$, heads, softmax, RoPE, and the KV cache are). It builds a cost model first, showing exactly where the money goes, and then walks each family of fixes, so that every detailed question in the topic reads as a move in one budget.

Why long context is genuinely hard

It helps to separate the two pressures, because they are fought with completely different tools.

The first is compute. Attention compares each of $T$ queries against each of $T$ keys, producing a $T\times T$ matrix of scores. That matrix has $T^2$ entries; at $T=1{,}000$ it is a million, at $T=100{,}000$ it is ten billion, per head, per layer. The arithmetic grows with the square of the context, so the part of the network that felt free at chat length quietly comes to dominate at document length.

The second is memory. To generate token $t$ , the model needs the keys and values of all $t-1$ earlier tokens (that is what attention attends to). Recomputing them every step would be madness, so we store them — and that store, the KV cache, grows by one entry per token, per layer, forever. At long context the cache, not the model's weights, is what overflows the GPU. This is the KV-cache memory wall.

Topic 1 already introduced the standard notebook-shrinkers: MQA (all query heads share one key/value head), GQA (heads share key/value heads in groups — the modern default, e.g. $8$ groups), and MLA (compress key/value into a small latent vector that is cached and re-expanded per head). We will not re-derive them; this chapter shows where in the budget each one bites, alongside the other levers.


Two phases: prefill and decode

Generation runs in two phases with opposite cost profiles, and almost every optimization targets one or the other.

Prefill processes the whole prompt at once. All $T$ prompt tokens go through the network in parallel; attention sees a full $T\times T$ interaction; the work is one big matrix-multiply that keeps the hardware's arithmetic units busy. Prefill is a parallel sprint, and it is what sets your time-to-first-token.

Decode then emits new tokens one at a time. Each step runs the network for a single new token, produces one logit vector, samples, and appends. Decode is a long line of tiny, almost-serial steps — and it is where the KV cache is read over and over.

The bridge between the phases is that cache. During prefill we fill it once; during decode each step computes $Q,K,V$ for only the one new token, appends its $K,V$ , and lets the new query attend over the whole cache. That turns per-step work from $O(t^2)$ (recompute everything) into $O(t)$ (attend over $t$ cached entries) — the cache buys back the quadratic, at the price of growing without bound.
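To make the $O(t^2)$ versus $O(t)$ contrast concrete, here is a toy count of query-key dot products per decode step, per head and per layer. It is a counting sketch only, not a profiler; the function names are illustrative, and it assumes causal attention in which token $i$ scores against tokens $1..i$.

```python
def dot_products_without_cache(t: int) -> int:
    """Query-key dot products needed to recompute causal attention over t tokens."""
    # Token i scores against tokens 1..i, so the total is 1 + 2 + ... + t.
    return t * (t + 1) // 2


def dot_products_with_cache(t: int) -> int:
    """With a KV cache, only the newest query is scored against the t stored keys."""
    return t


for t in (10, 100, 1_000):
    print(t, dot_products_without_cache(t), dot_products_with_cache(t))
```

At $t=1{,}000$ the uncached step needs about five hundred times more dot products than the cached one, and the gap keeps widening linearly with $t$.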


The KV-cache budget — a worked number

It pays to size the cache exactly, because it drives nearly every long-context decision. The cache stores, per token, one $K$ vector and one $V$ vector in every layer. Multiply out the factors and the total in bytes is

$$\text{KV bytes} \;=\; \underbrace{2}_{K\text{ and }V} \;\times\; \underbrace{L}_{\text{layers}} \;\times\; \underbrace{n_{\text{kv}}}_{\text{KV heads}} \;\times\; \underbrace{d_h}_{\text{head dim}} \;\times\; \underbrace{T}_{\text{tokens}} \;\times\; \underbrace{B}_{\text{batch}} \;\times\; \underbrace{b}_{\text{bytes/elt}} .$$

Every symbol is a knob, and every cache-shrinking trick in this chapter is an attack on one of them: GQA/MQA/MLA shrink $n_{\text{kv}}$ , quantization shrinks $b$ , sliding windows and eviction cap $T$ .

Hands-on · how big does the cache get?

Take a $7$B-class model: $L=32$ layers, $n_{\text{kv}}=32$ heads (full MHA), $d_h=128$, fp16 so $b=2$ bytes, batch $B=1$. Per token, the cache holds

$$2\times 32\times 32\times 128 \times 2\,\text{B} \;=\; 524{,}288\ \text{B} \;\approx\; 0.5\ \text{MB per token}.$$

At a chat-length $T=2{,}000$ that is $\approx 1$ GB, a rounding error next to the $\sim 14$ GB of weights. Now stretch to $T=32{,}000$:

$$0.5\ \text{MB} \times 32{,}000 \;\approx\; 16\ \text{GB} \;>\; 14\ \text{GB of weights}.$$

The cache has overtaken the model. Swap in GQA with $n_{\text{kv}}=8$ instead of $32$ and the same context costs $16/4 = 4$ GB. That one substitution — and the fact that it barely dents quality — is why grouped-query attention is nearly universal. Notice too that the cache is strictly linear in $T$ : double the context, double the cache.
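The same arithmetic can be wrapped in a small calculator. This is a minimal sketch of the KV-bytes formula; the function and argument names are illustrative, and it reports sizes in GiB ($2^{30}$ bytes), so the figures come out slightly below the rounded "GB" numbers quoted in the text.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch=1, bytes_per_elt=2):
    """KV bytes = 2 (K and V) * L * n_kv * d_h * T * B * b."""
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_elt


GIB = 1024 ** 3

per_token = kv_cache_bytes(32, 32, 128, tokens=1)     # full MHA, fp16
mha_32k = kv_cache_bytes(32, 32, 128, tokens=32_000)  # full MHA at long context
gqa_32k = kv_cache_bytes(32, 8, 128, tokens=32_000)   # GQA with 8 KV heads

print(per_token)                       # 524288
print(mha_32k / GIB, gqa_32k / GIB)    # GQA is exactly one quarter of MHA
```

Because every factor enters as a plain product, halving any one of them (KV heads, bytes per element, tokens kept) halves the cache.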


At short context the weights dominate memory; at long context the KV cache does, and it grows linearly with sequence length, layers, and KV-heads. The single inequality “cache $>$ weights” is why this whole topic exists — and every cache trick (MQA/GQA/MLA, quantization, eviction, sliding windows) is an attack on one factor of the KV-bytes formula.

Why memory, not FLOPs, is the bottleneck: the memory hierarchy

Here is a fact that surprises newcomers: at decode the GPU's powerful matrix units mostly sit idle. Why? Because moving data is far slower than computing on it, and decode moves a lot of data per unit of arithmetic.

A GPU has a steep memory hierarchy. At the bottom is HBM (high-bandwidth memory) — the tens of gigabytes where weights and the KV cache live. It is large but, relative to the compute units, slow: a few terabytes per second. At the top is SRAM — a tiny on-chip scratchpad (kilobytes per processor) that is roughly an order of magnitude faster but far too small to hold a whole tensor. Every number a compute unit touches must be pulled up from HBM into SRAM, used, and the result pushed back down.


The relevant ratio is arithmetic intensity: how many floating-point operations you do per byte you move from HBM, $\text{intensity}=\text{FLOPs}/\text{bytes read}$ . The roofline model says attainable throughput is $\min(\text{peak FLOP/s},\ \text{intensity}\times\text{bandwidth})$ . Below a hardware “ridge point” you are memory-bound (the expensive matrix units wait on data); above it you are compute-bound (you finally saturate them). Decode attention sits far on the memory-bound side: each cached entry is read once and used for just a couple of multiply-adds, so intensity is roughly $O(1)$ , and wall-clock time is set by how fast you can stream the cache, not by FLOPs. This is why a smaller cache is the prize: fewer bytes to stream means a faster token.
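The roofline rule is short enough to evaluate directly. The sketch below uses a hypothetical accelerator (the peak and bandwidth figures are round numbers chosen for illustration, not the specification of any real chip) to show how far below the compute roof an intensity of about one FLOP per byte lands.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline: min(peak FLOP/s, intensity * bandwidth)."""
    return min(peak_flops, intensity * bandwidth)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity (FLOPs per byte) at which the two roofs meet."""
    return peak_flops / bandwidth


# Hypothetical accelerator, chosen only for illustration.
PEAK_FLOPS = 1.0e15   # 1000 TFLOP/s
BANDWIDTH = 3.0e12    # 3 TB/s

print("ridge point:", ridge_point(PEAK_FLOPS, BANDWIDTH), "FLOPs/byte")
for intensity in (1.0, 10.0, 100.0, 1000.0):
    achieved = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH)
    print(intensity, achieved / PEAK_FLOPS)  # fraction of peak actually usable
```

With these assumed numbers the ridge sits at a few hundred FLOPs per byte, so a kernel at intensity near one can use well under one percent of the arithmetic peak.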


This also explains a lever that looks like free money: batching raises intensity. The weights are read once from HBM but reused across all $B$ sequences in the batch, so more FLOPs ride on the same bytes — batching pushes decode rightward toward the compute roof and lifts throughput, right up until the per-sequence caches stop fitting. The tell-tale that you have hit the KV wall is that decode throughput stops scaling with batch size and flatlines: you are now bandwidth-bound on the cache, and the cure is to make the cache smaller, not to add compute.

FlashAttention: exact attention without the score matrix

The other half of the bill is the $T\times T$ score matrix $S=QK^\top$ in prefill. Materializing it is fatal at scale: at $T=32{,}000$ , one head's score matrix is $32{,}000^2$ entries $\times\,2$ bytes $\approx 2$ GB — per head, per layer. The textbook algorithm writes that whole matrix to HBM, reads it back to apply softmax, then reads it a third time to multiply by $V$ : three round-trips of an $O(T^2)$ object through slow memory.

FlashAttention computes the exact same answer without ever writing $S$ to HBM. Two things to be crystal clear about, because they are common exam traps. It is not an approximation — the output equals standard attention up to floating-point rounding. And it does not cut FLOPs — it still does $O(T^2)$ multiply-adds. What it slashes is memory I/O; it is “IO-aware,” and on memory-bound hardware that is exactly the bottleneck that matters.

The mechanism is tiling plus online softmax. Tiling: load a block of queries and a block of keys/values into fast SRAM, compute that block's partial scores and partial output there, accumulate into a running result, then move to the next block. The big matrix is born and consumed on-chip and never lands in HBM. The obstacle is that softmax needs a global normalizer, the sum $\sum_j e^{s_j}$ over the whole row, which you cannot know until you have seen every key. Online softmax fixes this by carrying, per query, just three running quantities (a max, a denominator, and an output accumulator) and rescaling them as each new block arrives.

Concretely, recall the numerically stable softmax subtracts the row max before exponentiating (so $e^{(\cdot)}$ never overflows). FlashAttention keeps, per query: a running max $m$ , a running denominator $\ell=\sum e^{s_j-m}$ , and a running output accumulator $o=\sum e^{s_j-m}\,v_j$ . When a new block reveals a larger max, every previously accumulated quantity was scaled by the old max and must be corrected. With new block max $\tilde m$ , set $m_{\text{new}}=\max(m,\tilde m)$ , multiply the old $\ell$ and old $o$ by the correction factor $e^{\,m-m_{\text{new}}}$ , then add the new block's freshly-scaled terms. After the last block, the output is $o/\ell$ — algebraically identical to softmaxing the full row, but computed in $O(T)$ memory instead of $O(T^2)$ .

Hands-on · online softmax over two blocks

Softmax-weighted-average the values $v=(10,20,30)$ with raw scores $s=(0,2,1)$ , but in two blocks, never holding all three scores at once. (The true answer: weights $\propto (e^0,e^2,e^1)=(1,7.389,2.718)$ , sum $11.107$ , so output $=\frac{1\cdot10+7.389\cdot20+2.718\cdot30}{11.107}=\frac{239.3}{11.107}\approx 21.5$ .)

Block 1 $=\{s_1{=}0,\,v_1{=}10\}$. Init $m=0$, $\ell=e^{0-0}=1$, $o=1\cdot 10=10$.

Block 2 $=\{s_2{=}2,v_2{=}20;\ s_3{=}1,v_3{=}30\}$. Local max $\tilde m=2$, so $m_{\text{new}}=\max(0,2)=2$. Rescale the old state by $e^{\,m-m_{\text{new}}}=e^{-2}\approx 0.1353$:

$$\ell \leftarrow 1\cdot 0.1353=0.1353,\qquad o \leftarrow 10\cdot 0.1353=1.353.$$

Add the new terms ($e^{2-2}=1$ for $v_2$, $e^{1-2}=0.3679$ for $v_3$):

$$\ell = 0.1353 + 1 + 0.3679 = 1.5032,\qquad o = 1.353 + 1\cdot 20 + 0.3679\cdot 30 = 32.39.$$

Final output $=o/\ell = 32.39/1.5032 \approx 21.5$, matching the one-shot answer. We never stored the full score row, and the running max kept every exponential safely $\le 1$.
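The two-block calculation can be replayed in a few lines. This is a scalar sketch of the online-softmax recurrence only: real kernels carry a vector-valued $o$ per query and process tiles on-chip, and the function names here are illustrative rather than taken from any FlashAttention implementation.

```python
import math


def online_softmax_average(blocks):
    """Softmax-weighted average of scalar values, consuming (scores, values) blocks."""
    m = -math.inf  # running max of the scores seen so far
    l = 0.0        # running denominator, sum of exp(s - m)
    o = 0.0        # running unnormalized output, sum of exp(s - m) * v
    for scores, values in blocks:
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first block
        l *= correction
        o *= correction
        for s, v in zip(scores, values):
            w = math.exp(s - m_new)
            l += w
            o += w * v
        m = m_new
    return o / l


def softmax_average(scores, values):
    """One-shot reference that holds the whole score row at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


blocks = [([0.0], [10.0]), ([2.0, 1.0], [20.0, 30.0])]
streamed = online_softmax_average(blocks)
reference = softmax_average([0.0, 2.0, 1.0], [10.0, 20.0, 30.0])
print(round(streamed, 3), round(reference, 3))
```

The streamed and one-shot results agree to floating-point rounding, which is the sense in which the tiled computation is exact.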


Successive versions refined only the engineering, never the result. FlashAttention-2 re-partitioned the GPU work for far higher utilization. FlashAttention-3 exploits the asynchrony of Hopper-class GPUs, overlapping data movement with matmul and softmax via warp specialization, and adds FP8 support. In fp16 it reaches roughly $740$ TFLOP/s on an H100, about $75\%$ of the chip's peak, up from the $\sim\!35\%$ that FlashAttention-2 achieves on the same hardware. All three are exact.

Bounding the quadratic: windows, sinks, sparsity

FlashAttention makes the $O(T^2)$ cheaper to run; it does not make it go away. To break quadratic scaling outright you must compute attention over fewer pairs.

The simplest cut is sliding-window (local) attention, popularized at scale by Mistral: let each token attend only to the previous $w$ tokens. Cost drops to $O(Tw)$ — linear in $T$ — and the cache caps at $w$ entries instead of growing forever. The bet is that language is mostly local: most of what you need to predict the next word sits nearby. Stack many layers, though, and information still propagates far, because a token's window overlaps its neighbor's, so influence travels a window per layer (much like a convolution's receptive field).

Pure windows have a sharp failure mode: drop the oldest tokens and you lose access to the start of the sequence (the system prompt, the question being answered). Worse, naively evicting the first tokens tends to crash quality outright. The reason is a real and slightly weird phenomenon: the first few tokens act as an attention sink. Because softmax weights must sum to one, a query that finds nothing especially relevant still has to put its probability mass somewhere; heads learn to dump that excess onto the always-visible, low-content opening tokens. The sink is a pressure-release valve. StreamingLLM turns this into a method: keep a few initial sink tokens plus a sliding window, evict the middle, and a model can stream indefinitely without the collapse that removing the sinks causes.
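The eviction rule described here (keep a few initial sink tokens plus a recent window, drop the middle) can be sketched as a function that returns which positions stay in the cache. The function name and the example sizes are hypothetical, and the sketch covers only the selection of positions, not how a serving system stores or re-indexes them.

```python
def streaming_kept_positions(seq_len: int, n_sink: int, window: int) -> list:
    """Positions kept in the cache: the first n_sink tokens plus the last `window`."""
    sinks = list(range(min(n_sink, seq_len)))
    # The recent window never overlaps the sinks, even for short sequences.
    recent_start = max(n_sink, seq_len - window)
    recent = list(range(recent_start, seq_len))
    return sinks + recent


print(streaming_kept_positions(seq_len=20, n_sink=4, window=8))
print(len(streaming_kept_positions(seq_len=1_000_000, n_sink=4, window=8)))
```

However long the stream gets, the cache holds at most `n_sink + window` entries, which is what lets generation continue indefinitely at constant memory.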


More general sparse attention keeps the softmax but lets each token attend to a structured subset of positions — typically dense for nearby tokens and sparse for distant ones (some strided or block-strided pattern), often plus a handful of global tokens that every position can see and that can see everything (Longformer-style). The intuition is that genuinely dense long-range dependencies are rare, so “dense locally, sparse globally” loses little while cutting cost below quadratic. The catch is that the pattern is hand-designed and can miss the one long-range link a task happens to need; choosing it well is the whole art.
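As a rough illustration of "dense locally, sparse globally", the sketch below counts how many query-key pairs a toy pattern allows: a local band plus a few global tokens that see, and are seen by, every position. The pattern is bidirectional and deliberately simplified, and the window size and global positions are assumptions picked for the example rather than any published configuration.

```python
def is_allowed(i: int, j: int, window: int, global_tokens: set) -> bool:
    """True if query i may attend to key j under a toy local-plus-global pattern."""
    return abs(i - j) <= window or i in global_tokens or j in global_tokens


def count_allowed_pairs(seq_len: int, window: int, global_tokens) -> int:
    g = set(global_tokens)
    return sum(
        is_allowed(i, j, window, g)
        for i in range(seq_len)
        for j in range(seq_len)
    )


seq_len = 512
sparse = count_allowed_pairs(seq_len, window=16, global_tokens=[0, 1])
print(sparse, seq_len * seq_len, sparse / (seq_len * seq_len))
```

The allowed pairs grow roughly as `seq_len * (2 * window + 1)` plus a term linear in the number of global tokens, so the fraction of the full matrix shrinks as the sequence gets longer.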


Linear / kernelized attention: trading exactness for $O(T)$

A bolder move removes the softmax entirely. Standard attention is forced to compute $S=QK^\top$ first, a $T\times T$ object, because softmax acts on it nonlinearly. Linear attention replaces softmax with a similarity that factorizes, $\text{sim}(q,k)=\phi(q)^\top\phi(k)$ for some feature map $\phi$. Then the numerator $\sum_j \phi(q_i)^\top\phi(k_j)\,v_j$ rearranges by simple associativity:

$$\big(\phi(Q)\,\phi(K)^\top\big)\,V \;=\; \phi(Q)\,\big(\phi(K)^\top V\big).$$

The left side builds the $T\times T$ matrix first ($O(T^2 d)$); the right side computes the small $d\times d$ matrix $\phi(K)^\top V$ first, then multiplies by $\phi(Q)$ ($O(T d^2)$). For long sequences ($T\gg d$) that is the difference between quadratic and linear in $T$.
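The associativity step is easy to check numerically. The sketch below evaluates both orderings on a small made-up example and compares them. It covers only the unnormalized, non-causal numerator, and the feature map $\phi(x)=\mathrm{elu}(x)+1$ is one common choice assumed here for illustration; the text does not fix a particular $\phi$.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def phi(m):
    """elu(x) + 1 applied elementwise: one possible positive feature map (an assumption)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


seq_len, d = 6, 2  # tiny made-up sizes
Q = [[math.sin(i + j) for j in range(d)] for i in range(seq_len)]
K = [[math.cos(2 * i + j) for j in range(d)] for i in range(seq_len)]
V = [[float(i - j) for j in range(d)] for i in range(seq_len)]

fq, fk = phi(Q), phi(K)
quadratic = matmul(matmul(fq, transpose(fk)), V)   # builds a seq_len x seq_len matrix
linear = matmul(fq, matmul(transpose(fk), V))      # builds only a d x d matrix

max_gap = max(abs(a - b) for ra, rb in zip(quadratic, linear) for a, b in zip(ra, rb))
print(max_gap)  # differs only by floating-point rounding
```

The second ordering never allocates anything whose size depends on the square of the sequence length, which is where the linear scaling comes from.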

The catch is quality. A fixed-size state cannot losslessly store arbitrarily long history — it is a lossy summary — so linear-attention and state-space models (Mamba-style) can stumble on exact long-range retrieval and copying, the tasks where you must reproduce a specific distant token verbatim. That is precisely why hybrid designs are popular: make most layers cheap (linear or windowed) and sprinkle in a few full-attention layers where verbatim recall matters — cheap state for the bulk, exact attention for the spots that need it. As a rule of thumb: for retrieval-heavy work, exact methods win (FlashAttention, plus ring attention, which shards one long sequence across devices and passes K/V blocks around a ring so no single device holds it all); for workloads that tolerate lossy memory, approximate methods win on cost.

Teaching a short-context model to use long context: RoPE extension

Suppose your model was pretrained with a $4$k window and you now feed it $100$k tokens. It breaks, but perhaps surprisingly not because of the cache (a bigger GPU fixes that). It breaks because of positional encoding. Recall RoPE rotates each query and key by an angle proportional to its position; in dimension pair $i$ (of $d/2$ pairs) the angle at position $\text{pos}$ is

$$\theta_i \;=\; \text{pos}\cdot \text{base}^{-2i/d},$$

so the attention score depends only on the relative offset between two tokens. Feed positions far past anything seen in training and the rotation angles wind into a regime the model has never encountered — the positions are out-of-distribution, and output quality collapses. The fixes all reshape those angles so they stay in a familiar range.

Position Interpolation (PI) is the bluntest: to go from trained length $L$ to target $L'$ , define the scale $s=L'/L$ and treat position $p$ as $p/s$ . This squeezes all $L'$ new positions back into the trained $[0,L]$ range, so every angle is in-distribution. It works, but it compresses everything uniformly, blurring fine distinctions between nearby tokens (the model can less easily tell “one apart” from “two apart”).
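A small sketch makes both the benefit and the cost of PI visible. It computes RoPE angles with an optional scale factor; the function name, the trained and target lengths, and the head dimension are assumptions chosen for the example.

```python
def rope_angles(pos: float, d: int, base: float = 10_000.0, scale: float = 1.0):
    """RoPE angles for one position; scale > 1 applies Position Interpolation."""
    p = pos / scale  # PI: treat position p as p / s
    return [p * base ** (-2 * i / d) for i in range(d // 2)]


trained_len, target_len = 4_096, 32_768   # hypothetical lengths
s = target_len / trained_len              # s = 8

# The last target position lands exactly on the last trained position.
assert rope_angles(target_len, 128, scale=s) == rope_angles(trained_len, 128)

# The cost: neighbouring tokens are now 1/s of a step apart in every dimension.
print(rope_angles(1, 128)[0], rope_angles(1, 128, scale=s)[0])  # 1.0 0.125
```

Every angle stays inside the range seen in training, but the angular gap between adjacent tokens shrinks by the same factor $s$ in all dimensions, which is the uniform blurring described above.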

NTK-aware scaling is subtler. Instead of squeezing all frequencies equally, it scales the RoPE base,

$$\text{base}' \;=\; \text{base}\cdot s^{\,d/(d-2)},$$

which interpolates the slow, low-frequency dimensions (the ones that track coarse, long-range position) while leaving the fast, high-frequency dimensions (which distinguish adjacent tokens) nearly untouched. You keep local resolution and only stretch the global scale — exactly where there is slack.

Hands-on · NTK base scaling, by the numbers

Extend by $s=8$ with head dimension $d=128$ . The exponent is $d/(d-2)=128/126\approx 1.0159$ , so $s^{1.0159}=8^{1.0159}$ . Since $8^{1}=8$ and $8^{0.0159}=e^{0.0159\ln 8}=e^{0.0331}\approx 1.034$ , we get $s^{1.0159}\approx 8.27$ . With a standard $\text{base}=10{,}000$ ,

$$\text{base}' \;=\; 10{,}000\times 8.27 \;\approx\; 82{,}700.$$

A larger base lengthens the wavelengths, and by more the slower the dimension: the slowest are stretched by about a factor of $s$ so the same absolute position lands at a gentler angle, while the fastest barely move.
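The base-scaling formula and its effect on each dimension can be checked in a few lines. This sketch assumes the RoPE wavelength of dimension pair $i$ is $2\pi\cdot\text{base}^{2i/d}$ (the period of the angle $\text{pos}\cdot\text{base}^{-2i/d}$); the function names are illustrative.

```python
import math


def ntk_scaled_base(base: float, s: float, d: int) -> float:
    """NTK-aware scaling: base' = base * s ** (d / (d - 2))."""
    return base * s ** (d / (d - 2))


def wavelength(i: int, d: int, base: float) -> float:
    """Tokens per full rotation in dimension pair i."""
    return 2 * math.pi * base ** (2 * i / d)


base, s, d = 10_000.0, 8.0, 128
new_base = ntk_scaled_base(base, s, d)
print(round(new_base))  # close to the hand-computed 82,700

for i in (0, d // 4, d // 2 - 1):
    stretch = wavelength(i, d, new_base) / wavelength(i, d, base)
    print(i, round(stretch, 3))  # 1.0 for the fastest pair, s for the slowest
```

The exponent $d/(d-2)$ is what makes the slowest pair stretch by exactly $s$: its wavelength ratio is $(\text{base}'/\text{base})^{(d-2)/d} = s$, while pair $0$ is left unchanged.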

YaRN combines the ideas and adds one more. It partitions RoPE dimensions by how many full cycles each completes over the training length: high-frequency dims (many cycles, local detail) are left to extrapolate, low-frequency dims (few cycles, global position) are interpolated, and a middle band is smoothly ramped between the two. This is the “NTK-by-parts” interpolation. On top, YaRN rescales the attention logits by a small temperature factor to counteract the entropy drift that longer contexts otherwise induce in the softmax. The payoff is striking: YaRN can take a model from $4$k to $128$k context by fine-tuning on roughly $0.1\%$ of the original pretraining tokens or less.
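The by-parts idea can be sketched as a per-dimension blend between the original frequency and the interpolated one. This is a simplified illustration under stated assumptions, not a reference implementation: the ramp thresholds `alpha` and `beta` (in full cycles over the training length) are illustrative defaults, the training length and scale are made up for the example, and the temperature rescaling of the logits is omitted.

```python
import math


def yarn_frequency(i, d, base, s, train_len, alpha=1.0, beta=32.0):
    """Blend interpolated and original RoPE frequency for dimension pair i."""
    theta = base ** (-2 * i / d)                 # original frequency, radians per token
    cycles = train_len * theta / (2 * math.pi)   # full rotations over the training length
    # gamma = 1 keeps the frequency (extrapolate); gamma = 0 divides it by s (interpolate).
    gamma = min(1.0, max(0.0, (cycles - alpha) / (beta - alpha)))
    return (1.0 - gamma) * theta / s + gamma * theta


d, base, s, train_len = 128, 10_000.0, 8.0, 4_096  # assumed example settings
for i in (0, 22, d // 2 - 1):
    original = base ** (-2 * i / d)
    print(i, yarn_frequency(i, d, base, s, train_len) / original)
```

With these settings the fastest pair keeps its frequency (ratio 1), the slowest is fully interpolated (ratio $1/s$), and pairs in the middle band land in between.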

Lost in the middle: attention is not retrieval

One last reality check before the questions: a long context window is not the same as using it well. Two pitfalls recur.

First, perplexity is not the test. A context-extended model can post fine perplexity yet fail needle-in-a-haystack retrieval past some length. Perplexity averages over mostly-easy nearby tokens; retrieval probes the single hard long-range dependency that perplexity barely weights. So always validate long context with a retrieval probe — and re-check that short-context scores did not quietly regress, since extension can degrade them.

Second, lost-in-the-middle. Models reliably use facts placed at the start and end of a long prompt but neglect the middle — accuracy as a function of where the answer sits is U-shaped. This is a position bias (from training-data structure and attention dynamics, reinforced by the attention sinks at the front), not a hard capacity limit, which is why it shows up even when the model technically “saw” the middle. It is also why retrieval-augmented pipelines bother to rank and place the most relevant chunks at the edges of the context rather than dumping everything in arbitrary order.


What to watch for

The practical loop for serving long context is now fairly standard, and it is just the budget applied in order. Pick an exact kernel (FlashAttention) so attention is never the memory bottleneck. Shrink the cache in order of cheapness: GQA first, then KV quantization (8- or 4-bit; cheap, but watch long-range copying, since a slightly corrupted key flips which token is retrieved — degraded needle-retrieval with unchanged MMLU is its signature), then sliding-window or heavy-hitter eviction, then MLA if you control the architecture. Use chunked prefill (process a giant prompt in slices) to cap time-to-first-token. Extend positions with YaRN-style scaling, and validate with retrieval, not perplexity, re-checking short context for regressions. Hold the two anchors — the KV-bytes formula and the roofline — in your head, and every question in this topic, from tile-size derivations to MLA's decoupled-RoPE trick to attention-sink eviction, becomes a single, recognizable move in the same budget.
