Inference & Serving
8 practice sets · 5 coding problems
Training a model is a one-time cost; running it happens billions of times, and that is where the money, the electricity bill, and the user experience actually live. Once the weights are frozen the job changes completely. There is no backward pass, no optimizer, no gradients — only a forward pass that must turn a prompt into tokens as fast and as cheaply as possible, for many users at once. This mini-chapter is about that forward pass at scale. It assumes you have met the transformer (Topic 1): you know that the model maps a prefix of tokens to a distribution over the next token, and that during generation it keeps a KV cache of the keys and values of every past token so it need not recompute them. We build everything else from there.
The single fact that organizes the whole topic is slightly counter-intuitive, so let us state it up front and then earn it: for a modern decoder-only LLM, generating text is usually limited not by how fast the hardware can multiply numbers, but by how fast it can read the weights and the cache out of memory. Almost every serving technique you will meet — the KV cache, batching, PagedAttention, quantization, speculative decoding — is a different way to dodge that “memory wall.” Once you see the wall, the rest of the topic stops being a grab-bag of tricks and becomes a single coherent story.
From logits to a token: decoding strategies
Recall the model's output at each step: a vector of logits (one raw, unnormalized score per vocabulary token). A decoding strategy is the rule that turns those scores into the actual next token. The simplest is greedy decoding: take the , the single highest-scoring token. Greedy is deterministic and fine for tasks with one right answer, but for open-ended writing it is repetitive and dull, because it can never take the second-best road even when that road is more interesting.
Sampling instead draws a token at random from the softmax distribution, and three knobs shape that distribution. Temperature rescales the logits before the softmax, . Lowering below sharpens the distribution toward the top token (more deterministic, more “safe”); raising it above flattens it (more random, more “creative”); and recovers greedy. Top- keeps only the highest-probability tokens and renormalizes, throwing away the long tail of nonsense. Top- (nucleus) sampling instead keeps the smallest set of tokens whose cumulative probability first reaches (say ); the pool automatically grows when the model is uncertain (many plausible continuations) and shrinks when it is confident (one obvious word). A close cousin, min-, keeps tokens whose probability is at least a fraction of the top token's probability, which adapts to confidence in a slightly different way and is sometimes preferred for creative text. Finally, beam search keeps several partial hypotheses alive at once and extends the best; it shines on tasks with a single correct output like translation, but tends to produce bland, high-probability mush on open-ended prompts, so chat models rarely use it.
A quick top- pass on sorted probabilities with : cumulative sums are , so the first three tokens reach and the last two are dropped; we renormalize over and sample. Decoding is the cheap, glamorous part of inference. Everything hard is in how fast we can get to the logits in the first place, which is what the rest of the chapter is about.
Two phases of generation: prefill vs. decode
Serving one request splits into two phases with opposite hardware personalities, and almost nothing about serving makes sense until you separate them.
In prefill, the model ingests the entire prompt at once. All prompt tokens go through the transformer in a single forward pass, fully in parallel (the causal mask lets every position be computed simultaneously), and at the end the model emits the logits for the first generated token. Prefill also fills the KV cache for the whole prompt in one shot.
In decode, the model generates the rest of the answer one token at a time. Each step is a forward pass over a single new token: compute its query, key, and value; attend against the cached keys and values of all previous tokens; produce one set of logits; sample one token; append its K and V to the cache; repeat. The sequence is strictly serial — step cannot start until step has chosen its token, because that token is the input to step .
This split drives the two latency numbers a user actually feels. Time-to-first-token (TTFT) — how long the user stares at a blank screen before the answer starts streaming — is dominated by prefill (plus any time spent waiting in the queue). Inter-token latency (ITL), also called time-per-output-token (TPOT) — the gap between successive streamed tokens, i.e. how fast the words appear — is set by decode. A timeline makes the shape vivid: one fat prefill block, then a long tail of thin decode steps.
Why does prefill “do a lot of math” while decode does little? Prefill multiplies the weight matrices against an block of activations — big matrix–matrix products. Decode multiplies the same weight matrices against a single vector — a matrix–vector product. The weights read from memory are identical in both cases, but prefill reuses each loaded weight across all tokens, whereas decode uses it for exactly one. That difference is the whole ballgame, and the next section names it precisely.
The memory wall: arithmetic intensity and the roofline
To say why decode is the bottleneck we need one definition. The arithmetic intensity of an operation is the number of arithmetic operations (FLOPs) it performs per byte it moves from memory:
A GPU (or any accelerator) has two hard ceilings: a peak compute rate (FLOP/s, how fast the math units run) and a peak memory bandwidth (bytes/s, how fast data streams from high-bandwidth memory, HBM, into the chip). The roofline model puts these on one picture. If an operation's intensity is high, the math units are the limit and the operation is compute-bound; if intensity is low, you spend all your time waiting for bytes to arrive and the operation is memory-bound. The crossover point — the intensity at which the two ceilings meet — is the accelerator's , often a few hundred FLOPs per byte.
Single-token decode sits far down on the rising part of the roofline. To produce one token it must read every weight in the model (and stream the whole KV cache) out of HBM, then do only a single matrix–vector's worth of arithmetic with each weight. Enormous bytes, tiny FLOPs, low intensity — memory-bound. Prefill, multiplying big matrices over hundreds of prompt tokens at once, reuses each loaded weight many times, so its intensity lands above the crossover — compute-bound. Same model, same hardware; the phase decides which wall you hit.
Decode is memory-bound: per token you re-read all the weights and the entire KV cache. So the two highest-leverage moves are (1) move fewer bytes — quantization, GQA/MLA, KV compression — and (2) amortize the bytes you do move over more useful work — batching and speculative decoding. Prefill, being compute-bound, responds to a different set of knobs (chunking, more parallelism). Always ask first: which wall am I hitting?
The KV cache, and why it dominates memory
Decode is cheap in FLOPs precisely because of the KV cache. Without it, every step would re-attend over the whole prefix from scratch, and generating tokens would cost work. With it, each token's K and V are computed once, stored, and reused, so each step is linear in the context length. But that store is not free, and at long context it is the memory hog. The cache size in bytes is
where the leading counts keys and values, is batch size, the number of layers, the number of key/value heads, the head dimension, the sequence length, and bytes/elt is the precision (2 for FP16). It grows linearly with both context length and batch size, which is exactly the combination that explodes when you try to serve many long conversations at once.
GQA (sharing one KV head across a group of query heads) and multi-head latent attention (MLA, which caches a small compressed latent instead of full K/V) attack this term at the architecture level. KV-cache quantization attacks it at serving time by storing K and V in INT8 or INT4. And the next idea, PagedAttention, attacks the waste in how the cache is laid out.
PagedAttention: the KV cache as paged virtual memory
Here is a subtle systems problem. We do not know in advance how long a generation will be. The naive fix is to reserve, per request, one big contiguous slab of memory sized for the maximum possible length. The result is brutal fragmentation: a request that stops after tokens still holds a slab sized for , and the gaps between slabs are too small to reuse. In classic serving systems this wasted – of KV memory, which directly caps how many requests fit on a GPU — and a smaller batch means lower throughput on a memory-bound workload.
PagedAttention, the idea at the heart of vLLM, borrows virtual memory paging from operating systems. The KV cache is chopped into small fixed-size blocks (“pages”), each holding the K/V for a handful of tokens. A request's logical sequence of tokens is mapped to a scattered set of physical blocks through a per-request block table, exactly as an OS maps a process's virtual pages to scattered physical frames. Blocks are allocated on demand, one at a time as the sequence grows, so almost nothing is reserved that is not used — the waste drops from – to under . The reclaimed memory becomes a larger effective batch, and reported throughput rises – versus prior systems. As a bonus, because blocks are addressed indirectly, two requests that share a prefix (the same long system prompt, a shared document) can point their block tables at the same physical blocks — this is prefix caching, computing the shared prefix's KV once and skipping redundant prefill for everyone after, cutting TTFT. A cache hit is detected by matching (typically hashing) the leading tokens; it must be invalidated when those tokens change, and in a multi-tenant API prefixes may only be shared when it is safe to do so.
Batching: amortizing the byte read across many users
Return to the memory wall. At batch size , decode reads all the weights to make one token — and then throws away the read. The cure is to read each weight once and immediately apply it to many requests' tokens. Batching stacks requests so the single expensive weight read serves tokens at once; throughput (total tokens/s across all users) climbs almost linearly until you start to approach the compute roof. The cost is latency: an individual request may wait for its batch to assemble and to step. This is the fundamental latency–throughput trade-off of serving — bigger batches mean more tokens per second but a slower experience for any single user.
Naive static batching squanders the win. Requests in a batch finish at different times (some answers are short, some long), but a static batch only releases when its slowest member is done, so finished slots sit idle. Continuous (in-flight) batching fixes this by scheduling at the granularity of a single decode step: after every step, any finished request leaves and a waiting request is admitted into the freed slot immediately. The batch stays full, GPU utilization stays high, and this is the single biggest throughput win in modern serving (vLLM, TensorRT-LLM).
Continuous batching interacts with two realities. Output lengths vary, so a flood of very long generations can hog slots and starve newcomers — which is why schedulers add priorities, max-token limits, and admission control. And serving quality is judged on a tail, not an average: the latency SLO (service-level objective) is usually stated as a percentile such as P99 (“ of requests finish at least this fast”). Tails matter because the mean can look healthy while the unlucky of users have a terrible time — and that is exactly where overload and bad scheduling first show up. A deployment can hit its throughput target yet fail its P99, which is almost always a scheduling problem, not a model problem.
Speculative decoding: more tokens per memory read
The cleverest trick exploits the memory wall head-on. Decode is slow because each big-model pass reads all the weights to produce just one token. Speculative decoding uses a small, cheap draft model to guess the next tokens, then runs the big target model once over all guesses in a single parallel pass — a mini-prefill — to check them all at the price of one memory read. If most guesses are right, you got several tokens for the cost of one big-model pass.
What keeps this honest is a modified rejection-sampling rule, so the output is not an approximation. Let be the target model's probability for a drafted token and the draft model's. Accept the token with probability
If the target likes the token at least as much as the draft did () it is always accepted; if the target likes it less, it is accepted proportionally. On the first rejection you stop and resample that position from the residual distribution (the leftover target probability the draft under-weighted), renormalized. One can prove the token you finally emit is distributed exactly as if you had sampled from the target directly — so speculative decoding is a pure speedup, never a quality change. Accepted tokens after the first rejection are discarded (the draft's later guesses were conditioned on a token that did not survive).
A draft need not be a separate model: Medusa-style multi-token-prediction heads bolt extra output heads onto the target itself to predict the next few tokens, and tree-structured drafts verify several candidate continuations at once. The principle is unchanged — propose cheaply, verify in one memory-bound pass, keep the longest correct prefix.
Quantization: fewer bits per number
Because decode is memory-bound, fewer bytes per weight translates almost directly into faster decode, and that is the appeal of quantization: store numbers in fewer bits. FP16 uses bytes per weight; INT8 halves that, INT4 quarters it. The basic operation is to map a block of floating-point weights onto a small grid of integers via a per-channel (or per-group) scale: , and dequantize with . The scale is chosen so the largest weight in the block lands at the edge of the integer range.
Weight-only quantization compresses just the weights (activations stay in higher precision) and is the common default, because it directly buys the decode speedup with the least accuracy risk. GPTQ and AWQ are the two standard INT4 weight-only methods. GPTQ rounds the weights to the INT4 grid greedily while using a Hessian-based correction to compensate each rounding error in the not-yet-quantized weights. AWQ (activation-aware weight quantization) starts from a sharp observation: a weight multiplied by a large activation contributes more to the layer's output, so quantization error on those “salient” channels hurts far more than error elsewhere. Protecting just the most salient weight channels — by scaling them up before rounding (and scaling the matching activations down to compensate) — recovers most of the lost accuracy. Weight-and-activation quantization (e.g. INT8 on both) can additionally speed up the matrix multiplies, but it is harder, because activations contain large outliers that wreck a naive INT8 scale; SmoothQuant-style methods migrate that difficulty from the activations into the weights by a per-channel rescale, making both sides quantization-friendly. Finally, KV-cache quantization stores the cache itself in INT8/INT4, shrinking the term that dominates long-context memory.
Putting it together: a serving stack
A real deployment composes all of the above, and the discipline that ties it together is always the same: find the binding constraint and spend your engineering exactly there. Big models are split across GPUs with tensor parallelism; GQA or MLA shrink the KV cache at the architecture level; quantization shrinks the weights (decode) and optionally the cache; continuous batching keeps the batch full; PagedAttention kills fragmentation so the batch can be large; prefix caching skips repeated prefill; and speculative decoding squeezes several tokens out of each memory-bound big-model pass. Because prefill and decode have opposite personalities, a frontier idea is disaggregated serving: run prefill on one pool of hardware (tuned for compute throughput) and decode on another (tuned for memory bandwidth and low ITL), so a burst of long prompts entering prefill does not stall everyone else's token streaming. Chunked prefill similarly slices a long prompt into pieces interleaved with decode steps, so one giant prompt cannot freeze the stream.
This gives a clean diagnostic checklist, which also maps directly onto the questions in this topic. If TTFT is bad under load, suspect prefill — try chunked prefill or disaggregation. If ITL is bad, suspect decode — reach for quantization, GQA/MLA, kernel fusion, or speculative decoding. If throughput is high but P99 is terrible, suspect the scheduler — long generations are starving newcomers. If a quantization change leaves easy benchmarks intact but degrades reasoning or long-context, suspect that you compressed away precision the hard tasks needed. Every one of these is the same move in a different costume: name the wall, then move bytes, amortize bytes, or rebalance the schedule. Keep that lens, and the detailed questions ahead — roofline-driven designs, the speculative-decoding correctness proof, KV-cache arithmetic, MoE serving, and the rest — will read as variations on a story you already know.
