Chapter 13Part IV · Systems & Serving

Inference & Serving

8 practice sets · 5 coding problems

Training a model is a one-time cost; running it happens billions of times, and that is where the money, the electricity bill, and the user experience actually live. Once the weights are frozen the job changes completely. There is no backward pass, no optimizer, no gradients — only a forward pass that must turn a prompt into tokens as fast and as cheaply as possible, for many users at once. This mini-chapter is about that forward pass at scale. It assumes you have met the transformer (Topic 1): you know that the model maps a prefix of tokens to a distribution over the next token, and that during generation it keeps a KV cache of the keys and values of every past token so it need not recompute them. We build everything else from there.

The single fact that organizes the whole topic is slightly counter-intuitive, so let us state it up front and then earn it: for a modern decoder-only LLM, generating text is usually limited not by how fast the hardware can multiply numbers, but by how fast it can read the weights and the cache out of memory. Almost every serving technique you will meet — the KV cache, batching, PagedAttention, quantization, speculative decoding — is a different way to dodge that “memory wall.” Once you see the wall, the rest of the topic stops being a grab-bag of tricks and becomes a single coherent story.

From logits to a token: decoding strategies

Recall the model's output at each step: a vector of $V$ logits (one raw, unnormalized score per vocabulary token). A decoding strategy is the rule that turns those scores into the actual next token. The simplest is greedy decoding: take the $\arg\max$ , the single highest-scoring token. Greedy is deterministic and fine for tasks with one right answer, but for open-ended writing it is repetitive and dull, because it can never take the second-best road even when that road is more interesting.

Sampling instead draws a token at random from the softmax distribution, and three knobs shape that distribution. Temperature $T$ rescales the logits before the softmax, $p_i \propto \exp(z_i / T)$ . Lowering $T$ below $1$ sharpens the distribution toward the top token (more deterministic, more “safe”); raising it above $1$ flattens it (more random, more “creative”); and $T\to 0$ recovers greedy. Top- $k$ keeps only the $k$ highest-probability tokens and renormalizes, throwing away the long tail of nonsense. Top- $p$ (nucleus) sampling instead keeps the smallest set of tokens whose cumulative probability first reaches $p$ (say $0.9$ ); the pool automatically grows when the model is uncertain (many plausible continuations) and shrinks when it is confident (one obvious word). A close cousin, min- $p$ , keeps tokens whose probability is at least a fraction of the top token's probability, which adapts to confidence in a slightly different way and is sometimes preferred for creative text. Finally, beam search keeps several partial hypotheses alive at once and extends the best; it shines on tasks with a single correct output like translation, but tends to produce bland, high-probability mush on open-ended prompts, so chat models rarely use it.

Hands-on · temperature on three logits

Take logits $z=[\,2.0,\ 1.0,\ 0.1\,]$ . At $T=1$ the softmax is $\exp(z)/\sum\exp(z)$ . Using $e^{2.0}\approx 7.39,\ e^{1.0}\approx 2.72,\ e^{0.1}\approx 1.11$ (sum $\approx 11.2$ ), the probabilities are about $[\,0.66,\ 0.24,\ 0.10\,]$ . Now cool to $T=0.5$ : divide the logits by $0.5$ , i.e. double them to $[\,4.0,\ 2.0,\ 0.2\,]$ . Re-softmaxing gives roughly $[\,0.84,\ 0.11,\ 0.05\,]$ . Same model, same logits — but the top token's share jumped from $66\%$ to $84\%$ . Lowering temperature does not change which token is most likely; it changes how decisively the model commits to it.

Loading diagram…

A quick top- $p$ pass on sorted probabilities $[\,0.5,\,0.25,\,0.15,\,0.07,\,0.03\,]$ with $p=0.9$ : cumulative sums are $0.5,\,0.75,\,0.90$ , so the first three tokens reach $0.90$ and the last two are dropped; we renormalize over $\{0.5,0.25,0.15\}$ and sample. Decoding is the cheap, glamorous part of inference. Everything hard is in how fast we can get to the logits in the first place, which is what the rest of the chapter is about.

Two phases of generation: prefill vs. decode

Serving one request splits into two phases with opposite hardware personalities, and almost nothing about serving makes sense until you separate them.

In prefill, the model ingests the entire prompt at once. All $P$ prompt tokens go through the transformer in a single forward pass, fully in parallel (the causal mask lets every position be computed simultaneously), and at the end the model emits the logits for the first generated token. Prefill also fills the KV cache for the whole prompt in one shot.

In decode, the model generates the rest of the answer one token at a time. Each step is a forward pass over a single new token: compute its query, key, and value; attend against the cached keys and values of all previous tokens; produce one set of logits; sample one token; append its K and V to the cache; repeat. The sequence is strictly serial — step $t+1$ cannot start until step $t$ has chosen its token, because that token is the input to step $t+1$ .

This split drives the two latency numbers a user actually feels. Time-to-first-token (TTFT) — how long the user stares at a blank screen before the answer starts streaming — is dominated by prefill (plus any time spent waiting in the queue). Inter-token latency (ITL), also called time-per-output-token (TPOT) — the gap between successive streamed tokens, i.e. how fast the words appear — is set by decode. A timeline makes the shape vivid: one fat prefill block, then a long tail of thin decode steps.

Loading diagram…

Why does prefill “do a lot of math” while decode does little? Prefill multiplies the weight matrices against an $P\times d$ block of activations — big matrix–matrix products. Decode multiplies the same weight matrices against a single $1\times d$ vector — a matrix–vector product. The weights read from memory are identical in both cases, but prefill reuses each loaded weight across all $P$ tokens, whereas decode uses it for exactly one. That difference is the whole ballgame, and the next section names it precisely.

The memory wall: arithmetic intensity and the roofline

To say why decode is the bottleneck we need one definition. The arithmetic intensity of an operation is the number of arithmetic operations (FLOPs) it performs per byte it moves from memory:

\text{intensity} \;=\; \frac{\text{FLOPs}}{\text{bytes moved}} .

A GPU (or any accelerator) has two hard ceilings: a peak compute rate (FLOP/s, how fast the math units run) and a peak memory bandwidth (bytes/s, how fast data streams from high-bandwidth memory, HBM, into the chip). The roofline model puts these on one picture. If an operation's intensity is high, the math units are the limit and the operation is compute-bound; if intensity is low, you spend all your time waiting for bytes to arrive and the operation is memory-bound. The crossover point — the intensity at which the two ceilings meet — is the accelerator's $\text{peak FLOP/s} \big/ \text{peak bytes/s}$ , often a few hundred FLOPs per byte.

Loading diagram…

Single-token decode sits far down on the rising part of the roofline. To produce one token it must read every weight in the model (and stream the whole KV cache) out of HBM, then do only a single matrix–vector's worth of arithmetic with each weight. Enormous bytes, tiny FLOPs, low intensity — memory-bound. Prefill, multiplying big matrices over hundreds of prompt tokens at once, reuses each loaded weight many times, so its intensity lands above the crossover — compute-bound. Same model, same hardware; the phase decides which wall you hit.

Hands-on · bytes vs. FLOPs per decoded token

Take a $13$ B-parameter model in FP16 ( $2$ bytes per weight). Decoding one token must read essentially all the weights once:

\text{bytes/token} \;\approx\; 13\times10^{9} \times 2\,\text{B} \;=\; 26\ \text{GB}.

The arithmetic is about $2$ FLOPs per parameter for a matrix–vector multiply, so

\text{FLOPs/token} \;\approx\; 2 \times 13\times10^{9} \;=\; 26\ \text{GFLOP}.

That is roughly $1$ FLOP per byte moved. An accelerator with $\sim 1000$ FLOP/s for every $1$ byte/s of bandwidth would have to leave $99.9\%$ of its math units idle, starved for data — the textbook signature of a memory-bound kernel. Now turn the same numbers into a speed limit:

\text{tokens/sec} \;\approx\; \frac{\text{HBM bandwidth (bytes/s)}}{\text{bytes read per token}} \;=\; \frac{2\ \text{TB/s}}{26\ \text{GB}} \;\approx\; 77\ \text{tok/s},

no matter how fast the cores are. Halve the bytes (e.g. INT4 weights, $\approx 6.5$ GB/token) and you roughly double the speed to $\sim 150$ tok/s. This one inequality explains why “make the weights smaller” is the highest-leverage decode optimization there is.

Decode is memory-bound: per token you re-read all the weights and the entire KV cache. So the two highest-leverage moves are (1) move fewer bytes — quantization, GQA/MLA, KV compression — and (2) amortize the bytes you do move over more useful work — batching and speculative decoding. Prefill, being compute-bound, responds to a different set of knobs (chunking, more parallelism). Always ask first: which wall am I hitting?

The KV cache, and why it dominates memory

Decode is cheap in FLOPs precisely because of the KV cache. Without it, every step would re-attend over the whole prefix from scratch, and generating $n$ tokens would cost $O(n^2)$ work. With it, each token's K and V are computed once, stored, and reused, so each step is linear in the context length. But that store is not free, and at long context it is the memory hog. The cache size in bytes is

\text{KV bytes} \;=\; 2 \times b \times L \times h_{kv} \times d_{\text{head}} \times s \times (\text{bytes/elt}),

where the leading $2$ counts keys and values, $b$ is batch size, $L$ the number of layers, $h_{kv}$ the number of key/value heads, $d_{\text{head}}$ the head dimension, $s$ the sequence length, and bytes/elt is the precision (2 for FP16). It grows linearly with both context length and batch size, which is exactly the combination that explodes when you try to serve many long conversations at once.

Hands-on · KV cache for a cached system prompt

A Llama-3-70B-style config: $L=80$ layers, $h_{kv}=8$ KV heads (it uses GQA, so far fewer than its $64$ query heads), $d_{\text{head}}=128$ , FP16. Per token the cache stores

2 \times 80 \times 8 \times 128 \times 2\,\text{B} \;=\; 327{,}680\ \text{B} \;\approx\; 0.31\ \text{MB}.

A $2{,}000$ -token system prompt therefore needs $2000\times 0.31\,\text{MB} \approx 0.6$ GB of cache — per request. Cache $50$ distinct such prompts and you are at $\approx 30$ GB, a real chunk of a GPU. Had this model used full multi-head attention ( $h_{kv}=64$ ), every number above would be $8\times$ larger: $\approx 4.9$ GB for one prompt. That $8\times$ is why grouped-query attention is nearly universal in modern serving — it directly shrinks the term that blows up.

GQA (sharing one KV head across a group of query heads) and multi-head latent attention (MLA, which caches a small compressed latent instead of full K/V) attack this term at the architecture level. KV-cache quantization attacks it at serving time by storing K and V in INT8 or INT4. And the next idea, PagedAttention, attacks the waste in how the cache is laid out.

PagedAttention: the KV cache as paged virtual memory

Here is a subtle systems problem. We do not know in advance how long a generation will be. The naive fix is to reserve, per request, one big contiguous slab of memory sized for the maximum possible length. The result is brutal fragmentation: a request that stops after $50$ tokens still holds a slab sized for $2{,}000$ , and the gaps between slabs are too small to reuse. In classic serving systems this wasted $60$ – $80\%$ of KV memory, which directly caps how many requests fit on a GPU — and a smaller batch means lower throughput on a memory-bound workload.

PagedAttention, the idea at the heart of vLLM, borrows virtual memory paging from operating systems. The KV cache is chopped into small fixed-size blocks (“pages”), each holding the K/V for a handful of tokens. A request's logical sequence of tokens is mapped to a scattered set of physical blocks through a per-request block table, exactly as an OS maps a process's virtual pages to scattered physical frames. Blocks are allocated on demand, one at a time as the sequence grows, so almost nothing is reserved that is not used — the waste drops from $60$ – $80\%$ to under $\sim 4\%$ . The reclaimed memory becomes a larger effective batch, and reported throughput rises $2$ – $4\times$ versus prior systems. As a bonus, because blocks are addressed indirectly, two requests that share a prefix (the same long system prompt, a shared document) can point their block tables at the same physical blocks — this is prefix caching, computing the shared prefix's KV once and skipping redundant prefill for everyone after, cutting TTFT. A cache hit is detected by matching (typically hashing) the leading tokens; it must be invalidated when those tokens change, and in a multi-tenant API prefixes may only be shared when it is safe to do so.

Loading diagram…

Batching: amortizing the byte read across many users

Return to the memory wall. At batch size $1$ , decode reads all the weights to make one token — and then throws away the read. The cure is to read each weight once and immediately apply it to many requests' tokens. Batching stacks $b$ requests so the single expensive weight read serves $b$ tokens at once; throughput (total tokens/s across all users) climbs almost linearly until you start to approach the compute roof. The cost is latency: an individual request may wait for its batch to assemble and to step. This is the fundamental latency–throughput trade-off of serving — bigger batches mean more tokens per second but a slower experience for any single user.

Loading diagram…

Naive static batching squanders the win. Requests in a batch finish at different times (some answers are short, some long), but a static batch only releases when its slowest member is done, so finished slots sit idle. Continuous (in-flight) batching fixes this by scheduling at the granularity of a single decode step: after every step, any finished request leaves and a waiting request is admitted into the freed slot immediately. The batch stays full, GPU utilization stays high, and this is the single biggest throughput win in modern serving (vLLM, TensorRT-LLM).

Loading diagram…

Continuous batching interacts with two realities. Output lengths vary, so a flood of very long generations can hog slots and starve newcomers — which is why schedulers add priorities, max-token limits, and admission control. And serving quality is judged on a tail, not an average: the latency SLO (service-level objective) is usually stated as a percentile such as P99 (“ $99\%$ of requests finish at least this fast”). Tails matter because the mean can look healthy while the unlucky $1\%$ of users have a terrible time — and that $1\%$ is exactly where overload and bad scheduling first show up. A deployment can hit its throughput target yet fail its P99, which is almost always a scheduling problem, not a model problem.

Speculative decoding: more tokens per memory read

The cleverest trick exploits the memory wall head-on. Decode is slow because each big-model pass reads all the weights to produce just one token. Speculative decoding uses a small, cheap draft model to guess the next $k$ tokens, then runs the big target model once over all $k$ guesses in a single parallel pass — a mini-prefill — to check them all at the price of one memory read. If most guesses are right, you got several tokens for the cost of one big-model pass.

What keeps this honest is a modified rejection-sampling rule, so the output is not an approximation. Let $p$ be the target model's probability for a drafted token and $q$ the draft model's. Accept the token with probability

\min\!\Big(1,\ \tfrac{p}{q}\Big).

If the target likes the token at least as much as the draft did ( $p\ge q$ ) it is always accepted; if the target likes it less, it is accepted proportionally. On the first rejection you stop and resample that position from the residual distribution $\propto \max(0,\,p-q)$ (the leftover target probability the draft under-weighted), renormalized. One can prove the token you finally emit is distributed exactly as if you had sampled from the target directly — so speculative decoding is a pure speedup, never a quality change. Accepted tokens after the first rejection are discarded (the draft's later guesses were conditioned on a token that did not survive).

Loading diagram…

Hands-on · expected tokens and speedup

Suppose each drafted token is accepted independently with probability $\alpha$ , and the draft proposes $k$ tokens per round. Counting the bonus corrected token emitted on the first rejection, the expected number of tokens produced per target pass is the geometric-style sum

\mathbb{E}[\text{tokens}] \;=\; \frac{1-\alpha^{\,k+1}}{1-\alpha}.

For $\alpha=0.7,\ k=4$ : $\tfrac{1-0.7^{5}}{1-0.7} = \tfrac{1-0.168}{0.3} \approx 2.8$ tokens for the price of one big-model pass — nearly a $3\times$ decode speedup before accounting for the small draft cost. As $k\to\infty$ the ceiling is $\tfrac{1}{1-\alpha}$ tokens per pass (here $\tfrac{1}{0.3}\approx 3.3$ ), so there are diminishing returns to drafting deeper. This also explains a field observation: speculation flies on easy, predictable prompts (boilerplate, code with obvious continuations — high $\alpha$ ) and barely helps on hard ones (genuinely uncertain answers — low $\alpha$ ), because a draft cannot guess what the target itself is unsure about. The optimal $k$ trades the gain $\tfrac{1-\alpha^{k+1}}{1-\alpha}$ against the draft's per-token cost ratio $c$ ; if drafting is cheap ( $c$ small) you push $k$ higher, and if the draft is nearly as expensive as the target the whole scheme stops paying off.

A draft need not be a separate model: Medusa-style multi-token-prediction heads bolt extra output heads onto the target itself to predict the next few tokens, and tree-structured drafts verify several candidate continuations at once. The principle is unchanged — propose cheaply, verify in one memory-bound pass, keep the longest correct prefix.

Quantization: fewer bits per number

Because decode is memory-bound, fewer bytes per weight translates almost directly into faster decode, and that is the appeal of quantization: store numbers in fewer bits. FP16 uses $2$ bytes per weight; INT8 halves that, INT4 quarters it. The basic operation is to map a block of floating-point weights onto a small grid of integers via a per-channel (or per-group) scale: $w_{\text{int}} = \mathrm{round}(w/\text{scale})$ , and dequantize with $w \approx \text{scale}\cdot w_{\text{int}}$ . The scale is chosen so the largest weight in the block lands at the edge of the integer range.

Loading diagram…

Weight-only quantization compresses just the weights (activations stay in higher precision) and is the common default, because it directly buys the decode speedup with the least accuracy risk. GPTQ and AWQ are the two standard INT4 weight-only methods. GPTQ rounds the weights to the INT4 grid greedily while using a Hessian-based correction to compensate each rounding error in the not-yet-quantized weights. AWQ (activation-aware weight quantization) starts from a sharp observation: a weight multiplied by a large activation contributes more to the layer's output, so quantization error on those “salient” channels hurts far more than error elsewhere. Protecting just the $\sim 1\%$ most salient weight channels — by scaling them up before rounding (and scaling the matching activations down to compensate) — recovers most of the lost accuracy. Weight-and-activation quantization (e.g. INT8 on both) can additionally speed up the matrix multiplies, but it is harder, because activations contain large outliers that wreck a naive INT8 scale; SmoothQuant-style methods migrate that difficulty from the activations into the weights by a per-channel rescale, making both sides quantization-friendly. Finally, KV-cache quantization stores the cache itself in INT8/INT4, shrinking the term that dominates long-context memory.

Intuition

Quantization is lossy compression for weights, and the loss is invisible on easy tasks but can surface on hard ones. A model has slack: on a question like “capital of France?” a slightly-off weight still lands the answer, so MMLU-style benchmarks can be untouched by INT4. But long chains of reasoning or long-context retrieval compound small per-step errors, and INT4 (or aggressive KV quantization) can quietly degrade exactly those, while leaving the easy benchmark green. The practical rule: evaluate on the workload you actually serve — multi-step reasoning, long documents — not on a convenient proxy, and watch the calibration of output probabilities, not just top-1 accuracy. The same caution applies before further RL fine-tuning, where small numerical shifts can move the policy in unexpected ways.

Putting it together: a serving stack

A real deployment composes all of the above, and the discipline that ties it together is always the same: find the binding constraint and spend your engineering exactly there. Big models are split across GPUs with tensor parallelism; GQA or MLA shrink the KV cache at the architecture level; quantization shrinks the weights (decode) and optionally the cache; continuous batching keeps the batch full; PagedAttention kills fragmentation so the batch can be large; prefix caching skips repeated prefill; and speculative decoding squeezes several tokens out of each memory-bound big-model pass. Because prefill and decode have opposite personalities, a frontier idea is disaggregated serving: run prefill on one pool of hardware (tuned for compute throughput) and decode on another (tuned for memory bandwidth and low ITL), so a burst of long prompts entering prefill does not stall everyone else's token streaming. Chunked prefill similarly slices a long prompt into pieces interleaved with decode steps, so one giant prompt cannot freeze the stream.

This gives a clean diagnostic checklist, which also maps directly onto the questions in this topic. If TTFT is bad under load, suspect prefill — try chunked prefill or disaggregation. If ITL is bad, suspect decode — reach for quantization, GQA/MLA, kernel fusion, or speculative decoding. If throughput is high but P99 is terrible, suspect the scheduler — long generations are starving newcomers. If a quantization change leaves easy benchmarks intact but degrades reasoning or long-context, suspect that you compressed away precision the hard tasks needed. Every one of these is the same move in a different costume: name the wall, then move bytes, amortize bytes, or rebalance the schedule. Keep that lens, and the detailed questions ahead — roofline-driven designs, the speculative-decoding correctness proof, KV-cache arithmetic, MoE serving, and the rest — will read as variations on a story you already know.