Chapter 08 · Part III · Post-Training & Alignment

SFT, Instruction Tuning, Data & PEFT

7 practice sets · 6 coding problems

A freshly pretrained language model is a strange creature: it has read a sizable fraction of the public internet and absorbed an astonishing amount of grammar, facts, and reasoning patterns, yet it has no idea that it is supposed to be helpful. Show it “What is the capital of France?” and a base model might happily continue with “What is the capital of Germany? What is the capital of Italy?” — a perfectly plausible continuation of a list of quiz questions, but not an answer. The model is doing exactly what pretraining trained it to do, which is to predict likely next tokens; nobody ever told it that when a human writes an instruction, the desired continuation is a useful response that then stops. This mini-chapter is about post-training: the comparatively cheap stages that take that raw next-token predictor and turn it into an assistant you can talk to. We focus on the first and most fundamental of those stages, supervised fine-tuning (SFT), and on the practical machinery around it — chat templates, loss masking, what makes training data good, and the parameter-efficient tricks (above all LoRA and QLoRA) that let you do all of this by training a tiny sliver of the weights. By the end you should be able to read every question in this topic — LoRA parameter counts, loss-mask vectors, QLoRA memory budgets, the alignment tax — as a variation on ideas built here from scratch.

The post-training pipeline: from base model to assistant

Modern LLM training happens in stages, and it helps to see the whole assembly line before zooming in. Pretraining (Topic 7) does the heavy lifting: trillions of tokens of raw text, one giant next-token-prediction run, costing millions of dollars and producing a base model that knows a great deal but follows no conventions. Everything after that is post-training, and it is cheap by comparison — thousands to millions of carefully chosen examples rather than trillions of scraped ones. Post-training has two main phases. First, SFT (this chapter) teaches the format of being an assistant: read a prompt, produce a helpful answer, stop. Second, preference optimization — RLHF, DPO, and the rest (Topics 9–11) — teaches the model which of several acceptable answers humans actually prefer, a signal SFT cannot express.

The boundary between these phases is conceptual, not architectural. The network never changes — same transformer, same weights being updated. What changes is the data and a couple of implementation details. That is the first thing to internalize: SFT is not a new kind of training, it is pretraining's loss pointed at different, curated data.

Why a base model needs SFT at all

It is worth dwelling on why the base model misbehaves, because the fix follows directly from the diagnosis. During pretraining the model saw web pages, books, and forums. In that data, a line that looks like a question is very often followed by more questions (a quiz, an FAQ index, a worksheet), or by a tangent, or by an advertisement — not reliably by a crisp answer. The model learned the true statistics of the internet, and on the internet “a helpful answer immediately follows every instruction” is simply false. So the base model has the knowledge to answer but not the habit.

What SFT actually optimizes

SFT uses the same loss as pretraining — next-token cross-entropy — on different data. Pretraining maximizes the likelihood of raw web text; SFT maximizes the likelihood of curated (prompt, response) pairs. For a target token sequence $y_1,\dots,y_T$ the per-example loss is

$$\mathcal{L}=-\sum_{t=1}^{T} m_t\,\log \pi_\theta\!\big(y_t \mid y_{<t}\big),$$

where $\pi_\theta$ is the model (its parameters are $\theta$), $\pi_\theta(y_t\mid y_{<t})$ is the probability it assigns to the correct next token $y_t$ given everything before it, and $m_t\in\{0,1\}$ is a loss mask. Cross-entropy here is just “negative log-probability of the right token”: it is large when the model was surprised by the true token and near zero when the model was confident and correct, so minimizing it pushes the model to put probability on the tokens that actually appear. The mask $m_t$ is the one genuinely new ingredient, and we devote a whole section to it below.
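To make the masked sum concrete, here is a minimal sketch that evaluates the loss above on made-up numbers. The probabilities are illustrative placeholders, not outputs of a real model, and `masked_nll` is a hypothetical helper name. It implements the formula as written (a sum); many training loops additionally divide by the number of unmasked tokens.

```python
import math

def masked_nll(token_probs, mask):
    """Sum of -log p(y_t | y_<t) over positions where mask is 1.

    token_probs[t] is the probability assigned to the correct token at
    position t (toy numbers here, not real model outputs).
    """
    if len(token_probs) != len(mask):
        raise ValueError("token_probs and mask must have the same length")
    return -sum(m * math.log(p) for p, m in zip(token_probs, mask))

# Three prompt positions (masked out) followed by two response positions.
probs = [0.10, 0.20, 0.05, 0.90, 0.50]
mask = [0, 0, 0, 1, 1]

loss = masked_nll(probs, mask)
# Only the last two positions contribute: -(log 0.9 + log 0.5).
print(round(loss, 4))  # 0.7985
```

Note that the poorly predicted prompt tokens (probabilities 0.10, 0.20, 0.05) add nothing to the loss, which is exactly the effect of $m_t=0$.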

Training uses teacher forcing: at every position the model is fed the ground-truth previous tokens $y_{<t}$ , not its own guesses, so the entire sequence is scored in a single parallel forward pass (this is exactly the parallelism the causal mask buys us, from Topic 1). This is the key mismatch with inference, where the model is autoregressive and must consume its own previous outputs token by token. A small early mistake at decode time is a situation the teacher-forced model was never trained on, so the error can compound — this gap between “always fed the truth” (training) and “fed your own outputs” (inference) is called exposure bias, and it is one of several reasons SFT alone is not the end of post-training.
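The gap between the two regimes can be illustrated with a deliberately tiny stand-in for a model: a lookup table that maps the previous token to the next one. The table `NEXT` and both helper functions are hypothetical and exist only for this sketch; a real model conditions on the whole prefix, not just one token.

```python
def teacher_forced_predictions(next_token, truth, bos="<s>"):
    """Predict each position from the ground-truth previous token."""
    prefix = [bos] + truth[:-1]
    return [next_token[prev] for prev in prefix]

def free_running_predictions(next_token, steps, bos="<s>"):
    """Predict each position from the model's own previous output."""
    out, prev = [], bos
    for _ in range(steps):
        prev = next_token[prev]
        out.append(prev)
    return out

def n_correct(pred, truth):
    return sum(p == t for p, t in zip(pred, truth))

# Toy "model": it makes one mistake, predicting "dog" after "the".
NEXT = {"<s>": "the", "the": "dog", "cat": "sat", "sat": "down",
        "dog": "barked", "barked": "loudly"}
truth = ["the", "cat", "sat", "down"]

tf = teacher_forced_predictions(NEXT, truth)
fr = free_running_predictions(NEXT, len(truth))
print(tf, n_correct(tf, truth))  # ['the', 'dog', 'sat', 'down'] 3
print(fr, n_correct(fr, truth))  # ['the', 'dog', 'barked', 'loudly'] 1
```

Under teacher forcing the single mistake costs one position, because the next step is fed the true token again. In free-running decoding the same mistake derails everything after it.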

Instruction tuning and generalizing to unseen instructions

Instruction tuning is SFT done on a broad, diverse mix of tasks phrased as instructions — summarize this, translate that, write code for this, explain that — each paired with a good response. The two terms are often used interchangeably; “instruction tuning” just emphasizes the breadth of task framing. The payoff, discovered across the FLAN/T0 line of work and now standard, is generalization: train on enough varieties of instruction and the model learns the abstract skill “follow the instruction in front of me,” not just the specific tasks in the training set. A model instruction-tuned on summarization, translation, and Q&A will then make a credible attempt at, say, “rewrite this email to sound more polite” — an instruction type it never explicitly saw. The format-following behavior transfers; only the breadth and quality of the demonstrations determine how far.

Chat templates: giving the conversation a structure

Real assistants handle multi-turn conversations with distinct roles: a system message (hidden setup instructions, e.g. “You are a helpful assistant; today is June 20, 2026”), user messages (what the human types), and assistant messages (what the model replies). But the transformer only ever sees a flat sequence of token ids. A chat template is the fixed recipe that serializes a list of role-tagged messages into one token stream, marking the boundaries with dedicated special tokens that did not exist during pretraining. A widely used format (ChatML-style) wraps each turn in <|im_start|>role ... <|im_end|>. Concretely, the messages


system:    You are a helpful assistant.
user:      What is 2+2?
assistant: 4

become the single token stream


<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
4<|im_end|>

Two things are worth noticing. First, the model learns to start producing content exactly when it sees the <|im_start|>assistant cue, and to emit an end-of-turn token (here <|im_end|>) when it is done — which is how a chat model knows to stop rather than ramble on, a behavior the base model conspicuously lacks. Second, an end-of-sequence / end-of-turn token is among the most important things SFT installs: a base model has never been rewarded for halting, so without this it would generate until it hits the context limit.

The single most important practical rule in this whole topic: the chat template used at inference must match the one used in training, byte for byte. A stray space, a renamed tag, or a missing newline puts the model slightly off the distribution it was trained on and silently degrades quality — the classic symptom being a model that “ignores the system prompt” or answers oddly, when in fact the template is subtly wrong.
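As a rough sketch of what a template does, the function below serializes role-tagged messages into the ChatML-style stream shown earlier. `render_chatml` is a hypothetical helper written for this example, and the newline after each `<|im_end|>` is an assumption; real tokenizers ship their own template, which is the one to use in practice.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Serialize role-tagged messages into one ChatML-style string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # At inference, end with the assistant cue so the model starts replying.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
rendered = render_chatml(msgs)
print(rendered)

# A one-character difference already produces a different string, and
# therefore a different token stream than the one used in training.
drifted = rendered.replace("<|im_start|>user\n", "<|im_start|>user \n")
print(rendered == drifted)  # False
```

The last two lines show why template drift is easy to miss: the two strings look nearly identical to a human reader but are not the same input to the model.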

Loss masking: train only on the response

Now the mask. We feed the whole templated sequence through the model as context, but we compute the loss only on the assistant's response tokens: we set $m_t=1$ on response tokens and $m_t=0$ on everything else — the system message, the user message, and the structural special tokens that introduce them. The reason is simple once stated: we do not want to teach the model to generate users' questions; we want to teach it to answer them. Prompt tokens are still read as context (the model must condition on the question to answer it), they just contribute no gradient. Skipping the mask would waste capacity learning to predict prompts — which are often repetitive boilerplate — and can actively pull the model toward parroting instructions back at the user.

Hands-on · which tokens carry the loss?

Take the templated example above and label each span of tokens with its mask $m_t$ (the special tokens are spelled out, the long bits abbreviated):

$m_t=0$	`<|im_start|>system`, “You… assistant.”, `<|im_end|>`
$m_t=0$	`<|im_start|>user`, “What is 2+2?”, `<|im_end|>`
$m_t=0$	`<|im_start|>assistant`
$m_t=\mathbf{1}$	`4`, `<|im_end|>`

Only the final row, the assistant's content "4" and its closing <|im_end|>, gets $m_t=1$. If the response were $20$ tokens long inside a $200$-token sequence, just those $\sim\!20$ positions would drive learning; the other $\sim\!180$ are pure context. So even though the forward pass processes every token, the gradient is computed from a small, deliberately chosen slice. (For multi-turn data you choose a policy: either train on every assistant turn, or unroll the conversation and train on only the final turn per example, masking all earlier turns as context.)
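The labeling exercise can be sketched in code. The example below uses a toy tokenization (each role header and each `<|im_end|>` is one token, and content is split on whitespace), which is an assumption made purely for readability; a real tokenizer would split differently. `tokens_and_mask` is a hypothetical helper, and it follows the "train on every assistant turn" policy.

```python
def tokens_and_mask(messages):
    """Return (tokens, mask) with mask=1 only on assistant content and its end token."""
    tokens, mask = [], []
    for msg in messages:
        is_assistant = msg["role"] == "assistant"
        # The role header is structural, so it is always masked out,
        # including the assistant header that cues the reply.
        tokens.append(f"<|im_start|>{msg['role']}")
        mask.append(0)
        body = msg["content"].split() + ["<|im_end|>"]
        tokens.extend(body)
        mask.extend([1 if is_assistant else 0] * len(body))
    return tokens, mask

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
tokens, mask = tokens_and_mask(msgs)
for tok, m in zip(tokens, mask):
    print(m, tok)
print(sum(mask), "of", len(mask), "positions carry loss")  # 2 of 15 positions carry loss
```

Including the closing `<|im_end|>` in the loss is what teaches the model to stop.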

A related efficiency trick is packing: because examples vary in length, padding each one to the maximum wastes compute, so we concatenate several short examples into one full-length sequence. The pitfall is that attention could then flow across the boundary between two unrelated examples, letting example B peek at example A — cross-contamination. The fix is a block-diagonal attention mask: each packed segment may attend only within itself, as if the others were not in the sequence at all.
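A small sketch of the mask for a packed sequence follows. It combines the block-diagonal rule with the usual causal rule for a decoder-only model (that combination is an assumption about the setup, since the text only names the block-diagonal part). `packed_attention_mask` is a hypothetical helper name.

```python
def packed_attention_mask(segment_lengths):
    """mask[q][k] == 1 when query position q may attend to key position k.

    A position may attend only to earlier-or-equal positions (causal)
    that belong to the same packed segment (block-diagonal).
    """
    seg_ids = [i for i, n in enumerate(segment_lengths) for _ in range(n)]
    total = len(seg_ids)
    return [
        [1 if (k <= q and seg_ids[k] == seg_ids[q]) else 0 for k in range(total)]
        for q in range(total)
    ]

# Example A has 2 tokens, example B has 3; they share one packed sequence.
m = packed_attention_mask([2, 3])
for row in m:
    print(row)
# [1, 0, 0, 0, 0]
# [1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0]
# [0, 0, 1, 1, 0]
# [0, 0, 1, 1, 1]
```

The zero block in the lower left is what stops example B from peeking at example A.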

Where the data comes from — and why quality beats quantity

Instruction data can be human-written (expensive, gold-standard) or synthetic (generated by a strong model). Three synthetic recipes recur:

Knowledge distillation. A large, capable teacher model generates the responses; a smaller student is fine-tuned to imitate them. The student inherits much of the teacher's behavior at a fraction of the cost — but also the teacher's quirks and its quality ceiling. Most strong open “small” chat models are distilled this way.
Rejection sampling (best-of-$N$, also “RFT”). The model itself generates $N$ candidate answers per prompt; a verifier or reward model keeps only the best, and you SFT on the survivors. Unlike distillation from an external teacher, this bootstraps from the model's own good samples, so it can only amplify behaviors the model already produces sometimes.
Self-instruct / synthetic prompts. A model invents new instructions and answers, expanding coverage cheaply — useful for filling gaps where you have no natural prompts.
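The rejection-sampling recipe above can be sketched as a short loop. Everything named here is a stand-in: `toy_generate` plays the role of sampling from the model (it just cycles through a fixed pool of answers), and `toy_score` plays the role of a verifier with a known answer key. The `threshold` parameter is also an assumption of this sketch.

```python
from itertools import cycle

def rejection_sample(prompts, generate, score, n, threshold):
    """Keep, per prompt, the best of n candidates if it clears the threshold."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda resp: score(prompt, resp))
        if score(prompt, best) >= threshold:
            kept.append((prompt, best))
    return kept

# Hypothetical stand-in for model sampling: cycles through fixed answers.
_pool = cycle(["3", "5", "4"])
def toy_generate(prompt):
    return next(_pool)

# Hypothetical verifier: exact match against a known answer key.
ANSWERS = {"2+2": "4", "1+2": "3", "3+4": "7"}
def toy_score(prompt, response):
    return 1.0 if response == ANSWERS[prompt] else 0.0

sft_pairs = rejection_sample(list(ANSWERS), toy_generate, toy_score, n=3, threshold=1.0)
print(sft_pairs)  # [('2+2', '4'), ('1+2', '3')]
```

The prompt "3+4" is dropped because no candidate is ever correct, which mirrors the limitation in the text: the method can only keep behavior the generator already produces at least sometimes.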

How much data do you actually need? Far less than intuition suggests. The influential LIMA result (“Less Is More for Alignment”) fine-tuned a strong base model on just $1{,}000$ carefully curated, diverse, high-quality examples and produced a capable instruction-follower; it even held coherent multi-turn conversations after adding only $\sim\!30$ hand-written dialogue chains. LIMA's explanation is the Superficial Alignment Hypothesis: a model's knowledge and capabilities are learned almost entirely during pretraining, and alignment mostly teaches it which format and style to use when responding. If that is even roughly right, then a small set of excellent demonstrations can suffice, because you are teaching style and convention, not facts.

SFT changes behavior and format, not knowledge. The model already knows the facts from pretraining; instruction tuning teaches it to surface them as a helpful, well-formatted answer. That is why a few thousand excellent examples can beat a million mediocre ones — and why a narrow or subtly-wrong dataset is dangerous: every bad example teaches a habit, and there are few enough of them that each one counts.

The practical takeaway is that quality, diversity, and correctness dominate sheer quantity. A subtly wrong label, a distribution narrow enough to over-fit one writing style, or a hidden formatting artifact can poison training even when each example “looks fine” in isolation. This is why the unit of curation matters: you deduplicate and filter not just tokens but whole prompts, conversations, sources, and generators, and you keep evaluation prompts strictly out of the training set to avoid contamination (a model that “memorized” the test scores well but has learned nothing transferable).
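A minimal sketch of prompt-level deduplication and decontamination is shown below. It uses exact matching after light normalization, which is an assumption chosen for brevity; near-duplicate detection would need fuzzier matching than this. `normalize` and `curate` are hypothetical helper names.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def curate(examples, eval_prompts):
    """Drop examples whose prompt repeats an earlier one or appears in the eval set."""
    blocked = {normalize(p) for p in eval_prompts}
    seen, kept = set(), []
    for ex in examples:
        key = normalize(ex["prompt"])
        if key in blocked or key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept

examples = [
    {"prompt": "What is 2+2?", "response": "4"},
    {"prompt": "what is  2+2?", "response": "Four"},          # duplicate prompt
    {"prompt": "Name the capital of France.", "response": "Paris"},  # in eval set
    {"prompt": "Translate 'hello' to Spanish.", "response": "hola"},
]
eval_prompts = ["Name the capital of France."]

clean = curate(examples, eval_prompts)
print([ex["prompt"] for ex in clean])
# ['What is 2+2?', "Translate 'hello' to Spanish."]
```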

The cost problem, and PEFT's answer

Full fine-tuning updates every weight, and the memory math is brutal. For each parameter, mixed-precision Adam stores roughly: the 16-bit weight and its 16-bit gradient ($2+2$ bytes), plus an fp32 master copy of the weight and two fp32 optimizer-state values, the running mean and variance ($4+4+4$ bytes), for about $16$ bytes per parameter. A $13$B model therefore needs $\sim\!16\times13\text{B}\approx 208$ GB just for weights, gradients, and optimizer state, before activations, which already exceeds the memory of a single high-end GPU. Parameter-efficient fine-tuning (PEFT) sidesteps this by freezing the pretrained weights and training only a small number of new parameters, so gradients and optimizer state exist only for that small set.
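The arithmetic can be checked with a few lines. The per-parameter breakdown below is one common accounting for mixed-precision Adam and should be read as an assumption; exact numbers vary by framework and configuration. Activations are deliberately left out, as in the text.

```python
# Assumed bytes per parameter for mixed-precision Adam (one common accounting).
BYTES_PER_PARAM = {
    "fp16 weight": 2,
    "fp16 gradient": 2,
    "fp32 master weight": 4,
    "fp32 Adam mean": 4,
    "fp32 Adam variance": 4,
}

def full_finetune_gb(n_params):
    """Memory for weights, gradients, and optimizer state, in GB (1e9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

print(sum(BYTES_PER_PARAM.values()))       # 16
print(full_finetune_gb(13_000_000_000))    # 208.0
```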

LoRA (Low-Rank Adaptation) is the dominant PEFT method. Its premise: the update a task needs, $\Delta W$ , is approximately low rank — it lives in a far smaller subspace than the full $d\times k$ weight matrix it adjusts. (The LoRA paper found that even rank $1$ or $2$ can capture a useful update of a matrix whose full dimension is over $12{,}000$ .) So instead of learning the full $\Delta W$ , LoRA factors it into two skinny matrices. For a frozen weight $W_0\in\mathbb{R}^{d\times k}$ , the adapted layer computes

$$h = W_0\,x + \Delta W\,x,\qquad \Delta W = \frac{\alpha}{r}\,B A,$$

with $B\in\mathbb{R}^{d\times r}$ , $A\in\mathbb{R}^{r\times k}$ , and the rank $r\ll \min(d,k)$ (often $8$ or $16$ ). Only $A$ and $B$ are trained; $W_0$ never receives a gradient. The scaling factor $\alpha/r$ controls how strongly the adapter is applied: $\alpha$ is a tunable constant, and dividing by $r$ keeps the effective scale roughly stable as you change rank, so you can retune $r$ without re-tuning the learning rate from scratch.

Two details make LoRA work cleanly. Initialization: $A$ is small random and $B=0$ , so $\Delta W = BA = 0$ at step $0$ — training starts exactly at the pretrained model and grows a controlled increment from there, never a random jolt that would damage the careful pretrained weights. Merging: because the adapter is just an added linear term, after training you can fold it back, $W' = W_0 + \frac{\alpha}{r}BA$ , recovering a single weight matrix with zero extra parameters or latency at inference — a real advantage over older adapter methods that add permanent extra layers.
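Both details can be verified on a tiny example. The sketch below uses plain Python lists instead of a tensor library so it runs anywhere; the dimensions, the standard deviation used for $A$, and the values assigned to $B$ "after training" are all made up for illustration, and the helper names are hypothetical.

```python
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_init(d, k, r, rng):
    """A is small random (r x k); B is all zeros (d x r), so B A = 0 at step 0."""
    A = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]
    return A, B

def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x); W0 is frozen."""
    scale = alpha / r
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * dl for b, dl in zip(base, delta)]

def merge_lora(W0, A, B, alpha, r):
    """Fold the adapter back in: W' = W0 + (alpha / r) * B A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * ba for w, ba in zip(wr, br)] for wr, br in zip(W0, BA)]

rng = random.Random(0)
d, k, r, alpha = 3, 4, 2, 4.0
W0 = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
x = [1.0, -2.0, 0.5, 3.0]

A, B = lora_init(d, k, r, rng)
y_start = lora_forward(W0, A, B, x, alpha, r)  # identical to W0 x at step 0

# Pretend training has moved B away from zero (illustrative values only).
B = [[0.1 * (i + j + 1) for j in range(r)] for i in range(d)]
y_adapted = lora_forward(W0, A, B, x, alpha, r)
y_merged = matvec(merge_lora(W0, A, B, alpha, r), x)
```

Here `y_start` matches the frozen layer's output, and `y_merged` matches `y_adapted` up to floating-point rounding, so the merged matrix reproduces the adapted layer with a single matrix multiply.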

Hands-on · LoRA's parameter savings

Take one $d\times k = 4096\times 4096$ projection and rank $r=8$ .

Full fine-tuning trains every entry: $4096\times 4096 = 16{,}777{,}216 \approx 16.8$ M parameters for this one matrix.
LoRA trains only $A$ ( $r\times k = 8\times 4096$ ) and $B$ ( $d\times r = 4096\times 8$ ):
$\underbrace{8\times 4096}_{A}+\underbrace{4096\times 8}_{B} = 2\times 4096\times 8 = 65{,}536 \approx 65\text{k}.$

The reduction factor is $16{,}777{,}216 / 65{,}536 = 256\times$ on this layer. Now plug in the scaling: with $\alpha=16,\ r=8$ , the adapter is applied at strength $\alpha/r = 16/8 = 2$ . If you later doubled the rank to $r=16$ at the same $\alpha=16$ , the scale would fall to $16/16 = 1$ — which is precisely why the $\alpha/r$ form exists: it absorbs rank changes so the effective update magnitude stays comparable. Summed across all adapted layers of a real model, LoRA typically trains well under $1\%$ of the parameters — the original paper reports up to a $10{,}000\times$ reduction in trainable parameters for GPT-3 175B with no loss in quality.
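The numbers in this worked example are easy to reproduce. The helper names below are hypothetical; the values are the ones used in the text.

```python
def full_params(d, k):
    return d * k

def lora_params(d, k, r):
    # A is r x k and B is d x r.
    return r * k + d * r

d, k, r = 4096, 4096, 8
full = full_params(d, k)
lora = lora_params(d, k, r)

print(full)          # 16777216
print(lora)          # 65536
print(full // lora)  # 256

alpha = 16
print(alpha / 8)     # 2.0  (scale at r = 8)
print(alpha / 16)    # 1.0  (scale at r = 16, same alpha)
```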

QLoRA, and the rest of the PEFT family

QLoRA pushes memory lower still by attacking the one thing LoRA leaves expensive: the frozen base, which you must still store. QLoRA quantizes the frozen base to $4$ -bit, using a custom NF4 (4-bit NormalFloat) format tuned for the bell-curve distribution of neural-net weights, while the LoRA adapters stay in higher precision. Quantization means representing each weight with fewer bits: fp16 uses $16$ bits ( $2$ bytes) per weight, whereas $4$ -bit uses just $4$ bits ( $16$ possible values), a $4\times$ shrink in storage. During the forward and backward pass the $4$ -bit weights are dequantized on the fly back to a higher precision to do the matrix multiplies; crucially, gradients flow only into the adapters, never into the quantized base. With one further trick — double quantization, which quantizes even the per-block scaling constants — QLoRA fine-tuned a $65$ B model on a single $48$ GB GPU while nearly matching full $16$ -bit quality. That accessibility is its whole point.
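To show the quantize-then-dequantize round trip, here is a simplified sketch. It is not NF4: it uses uniform, symmetric absmax quantization with one scale per block, which is a stand-in chosen because it fits in a few lines. The block values are made up, the helper names are hypothetical, and the storage estimate ignores the small overhead of the per-block scales.

```python
def quantize_block(weights, levels=7):
    """Map a block of floats to integer codes in [-levels, levels] plus one scale.

    With levels=7 this uses 15 of the 16 codes a 4-bit value can hold.
    """
    absmax = max(abs(w) for w in weights)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats for use in the matrix multiply."""
    return [c * scale for c in codes]

def base_weight_gb(n_params, bits):
    """Storage for the frozen base alone, in GB (1e9 bytes)."""
    return n_params * bits / 8 / 1e9

block = [0.31, -0.07, 0.12, -0.28, 0.02, 0.19, -0.15, 0.05]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(block, restored))
print(codes)
print(max_err <= scale / 2 + 1e-12)  # True: rounding error is at most half a step

print(base_weight_gb(65_000_000_000, 16))  # 130.0
print(base_weight_gb(65_000_000_000, 4))   # 32.5
```

The last two lines show the $4\times$ shrink for a $65$B base: about $130$ GB in 16-bit versus about $32.5$ GB in 4-bit, which is what makes a single $48$ GB GPU plausible.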

LoRA and QLoRA are not the only PEFT methods, and it helps to place them in a small family:

Adapters insert tiny new bottleneck layers inside each transformer block and train only those. They work, but unlike LoRA they cannot be merged away, so they add permanent inference latency.
Prefix tuning prepends a handful of trainable “virtual” key/value vectors to the attention of every layer, steering the model without touching its weights.
Prompt tuning is the lightest of all: it learns a few soft prompt embeddings prepended to the input and trains nothing else — cheap, but generally the weakest at hard adaptation.

DoRA and rank-stabilized LoRA are refinements of LoRA itself: DoRA decouples a weight's magnitude from its direction, and rank-stabilized LoRA rescales the adapter by $\alpha/\sqrt{r}$ instead of $\alpha/r$. Both aim to close part of the small gap to full fine-tuning on hard domain shifts where vanilla LoRA's fixed low rank is too restrictive. The general rule: prompt/prefix tuning $<$ LoRA $<$ full FT in adaptation power, and roughly the reverse in cost.

Catastrophic forgetting and the alignment tax

Fine-tuning is not free of side effects. The two to watch for are closely related. Catastrophic forgetting is when tuning hard on a narrow new task overwrites pretrained skills — teach the model intensely about, say, legal contracts and it may get noticeably worse at basic arithmetic or casual chat, because the same weights that encoded those abilities were dragged toward the new distribution. Mitigations are all variations on “don't move too far”: a smaller learning rate (SFT typically runs one to two orders of magnitude below pretraining), mixing replay of general data back into the SFT set, a KL penalty anchoring the model to the base, or simply using PEFT — LoRA's frozen base preserves the old knowledge by construction, which is one of its underrated advantages.
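One of the mitigations named above, the KL penalty, can be sketched on a single next-token distribution. The distributions, the weight `beta`, and the direction of the divergence (tuned model relative to base) are assumptions made for illustration; `anchored_loss` is a hypothetical helper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def anchored_loss(task_nll, tuned_probs, base_probs, beta=0.1):
    """Task loss plus a penalty for drifting away from the base model."""
    return task_nll + beta * kl_divergence(tuned_probs, base_probs)

base = [0.50, 0.30, 0.20]
near = [0.55, 0.28, 0.17]   # small drift from the base
far = [0.98, 0.01, 0.01]    # large drift from the base

print(kl_divergence(base, base))                     # 0.0
print(kl_divergence(near, base) < kl_divergence(far, base))  # True
print(anchored_loss(1.0, far, base) > anchored_loss(1.0, near, base))  # True
```

For the same task loss, the distribution that moved further from the base pays a larger penalty, which is the "don't move too far" idea in numeric form.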

The alignment tax is the subtler cousin: tuning a model to be more helpful, harmless, and instruction-following can slightly lower its raw capability on some benchmarks. You are reshaping the output distribution toward “what a good assistant says,” and that reshaping occasionally trims away some performance on narrow tasks. The trade is usually worth it — a slightly less peak-capable model that reliably follows instructions and refuses harmful requests is far more useful — but the cost is real and must be measured, not assumed away.

When SFT runs out of road — and what to watch for

SFT is powerful but bounded. It can only imitate demonstrations: it cannot learn that, of two acceptable answers, humans systematically prefer one, and it cannot discover good behaviors that are absent from its data. Failures that demonstrations cannot fix — subtle harmlessness trade-offs, ranking quality among already-good answers, calibrated refusals — are exactly what preference optimization and RL (the next topics) address. SFT is the right first move, not the last.

A handful of pitfalls recur, and naming them now will make the topic's questions feel familiar:

Format over substance. A model can ace an eval's format while getting worse at open-ended chat — it learned the template, not the skill. Confirm by testing on held-out formats and free-form prompts, and by separating “learned the skill” from “learned the format.”
Template mismatch. The #1 silent quality killer: a train/inference template that differs by a token. The usual symptom is a model that seems to ignore the system prompt.
Too many SFT epochs. Over-training sharpens (lowers the entropy of) the policy, which can make a later RL stage collapse sooner — more SFT is not always better. One epoch is a common default; up to $\sim\!3$ only for small niche datasets.
Synthetic-data collapse. A flywheel that repeatedly trains on its own generations drifts toward its teacher's quirks and loses diversity; it needs external checks and a stopping rule.
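The "sharpening" mentioned under too many SFT epochs refers to the entropy of the model's next-token distribution. A short sketch with made-up distributions shows what a lower-entropy policy looks like numerically; the two distributions are illustrative only.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

broad = [0.25, 0.25, 0.25, 0.25]   # spreads probability over four options
sharp = [0.97, 0.01, 0.01, 0.01]   # almost all mass on one option

print(round(entropy(broad), 4))          # 1.3863
print(entropy(sharp) < entropy(broad))   # True
```

A policy that looks like `sharp` at most positions rarely samples alternatives, which is one way to read the warning that a later RL stage has less to explore.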

The throughline of this chapter: SFT is cheap, fast, and the correct first step from base model to assistant, but it is the same next-token loss you already know, pointed at curated data and masked to the response. Its quality is capped by your data; PEFT makes it affordable; and its job is format and behavior, never the final word on what “good” means — that word belongs to the preference and RL stages that follow.