Chapter 10 · Part III · Post-Training & Alignment

Alignment Algorithms Zoo

8 practice sets · 5 coding problems

Topic 9 gave us the canonical alignment recipe: collect human preferences (“response A is better than response B”), fit a reward model to those preferences, then use reinforcement learning (PPO) to push a language model toward high reward while a KL penalty stops it from drifting into nonsense. That pipeline works (it is how the first generation of chat models was tuned), but it is a heavy machine. It juggles four models at once (the policy being trained, a frozen reference, a reward model, and a value/critic network), it involves an online RL loop that is famously finicky to stabilize, and it is expensive to run. This topic is the zoo of methods that grew up to do the same job more cheaply, more simply, or more robustly. The headline idea, the one that reorganized the whole field, is that you can skip the reward model and the RL loop entirely and still solve the same problem with a plain supervised-learning loss. That method is DPO, and once we have derived it, most other animals in the zoo turn out to be DPO with one knob turned. We build DPO carefully from scratch, then tour the variants, the inference-time tricks, and the trade-off against PPO.

The problem, in plain words: learn from comparisons, not scores

Here is the situation after supervised fine-tuning (SFT, Topic 9). The model can follow instructions, but for most prompts there is no single “correct” answer — there are many acceptable responses, and humans simply prefer some over others (more helpful, more honest, less rude). We cannot easily write down a number that says “this response is worth $7.3$ ”; what people can reliably do is compare: shown two responses to the same prompt, say which one they like better. So our raw material is a pile of preference pairs.

Let us pin down the data. A preference dataset is a set of triples $(x, y_w, y_\ell)$ where $x$ is a prompt, $y_w$ is the chosen (winning) response, and $y_\ell$ is the rejected (losing) response. Our model is a policy $\pi_\theta(y\mid x)$ — the probability it assigns to producing response $y$ given prompt $x$ — with trainable parameters $\theta$ . We also keep a frozen copy of the SFT model called the reference $\pi_{\mathrm{ref}}$ , which is where we started and which we do not want to wander too far from. Our goal: adjust $\theta$ so the policy reliably puts more probability on the $y_w$ 's than on the $y_\ell$ 's, without forgetting how to write fluent text.

The one equation everything hangs on

We start from the RLHF objective of Topic 9 and never really leave it. RLHF maximizes expected reward while a Kullback–Leibler (KL) penalty keeps the policy close to the frozen reference, so it cannot collapse into gibberish that happens to score well on the reward model:

\max_{\pi_\theta}\ \mathbb{E}_{x,\,y\sim\pi_\theta}\big[r(x,y)\big]\;-\;\beta\,\mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big).

Read it as a tug-of-war: the first term pulls the policy toward responses the reward $r(x,y)$ likes; the second term, the KL divergence (a measure of how far apart two probability distributions are), pulls it back toward $\pi_{\mathrm{ref}}$ . The knob $\beta>0$ sets who wins. Large $\beta$ glues you to the reference (safe, but you barely learn the preferences); small $\beta$ lets you chase reward hard (you learn fast, but risk degenerating). Keep $\beta$ in mind — it is the master dial of this entire topic.

This objective looks like it needs RL to solve, but it has a known closed-form optimum. The best possible policy under this objective is

\pi^{\star}(y\mid x)\;=\;\frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big), \qquad Z(x)=\!\sum_{y}\pi_{\mathrm{ref}}(y\mid x)\,e^{r(x,y)/\beta}.

In words: take the reference distribution and tilt it by the reward — multiply each response's reference probability by $e^{r/\beta}$ , so high-reward responses get boosted and low-reward ones get suppressed, then renormalize. The factor $Z(x)$ , the partition function, is just that renormalizer: it makes the probabilities sum to one. The catch is fatal in practice: $Z(x)$ sums over every possible response $y$ — an astronomical number of token sequences — so you cannot compute it directly. That single intractable sum is precisely why naive RLHF has to resort to sampling and a critic network. DPO's whole trick is to make $Z(x)$ disappear before we ever have to compute it.
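
To make the tilt concrete, here is a minimal Python sketch on a toy prompt with only three possible responses, so that $Z(x)$ can be summed exactly. The numbers in `ref_probs` and `rewards` and the helper names are invented for illustration; a real response space is far too large for this brute-force sum.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """Return (pi_star, Z) with pi_star(y) = pi_ref(y) * exp(r(y)/beta) / Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function; tractable only because the set is tiny
    return [w / z for w in weights], z

def objective(policy, ref_probs, rewards, beta):
    """Expected reward minus beta * KL(policy || reference) for one prompt."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl

# Hypothetical toy numbers: three candidate responses to a single prompt.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

for beta in (10.0, 1.0, 0.1):
    pi_star, z = tilted_policy(ref_probs, rewards, beta)
    print(f"beta={beta}: pi_star={[round(p, 3) for p in pi_star]}")
```

With a large $\beta$ the tilted policy stays close to the reference; with a small $\beta$ nearly all the mass moves to the highest-reward response. A useful sanity check under these definitions is that the objective evaluated at $\pi^\star$ equals $\beta\log Z(x)$.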

Reading the equation backwards: the implicit reward

The optimal-policy formula relates four things: the policy, the reference, the reward, and $Z(x)$ . We usually read it forwards — given a reward, here is the policy. DPO reads it backwards — given a policy, what reward must it correspond to? Solve for $r$ by taking logs and rearranging (one step per line):

\begin{align*} \pi^{\star}(y\mid x) &= \tfrac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,e^{r(x,y)/\beta}, \\ \log \pi^{\star}(y\mid x) &= -\log Z(x) + \log \pi_{\mathrm{ref}}(y\mid x) + \tfrac{1}{\beta}\,r(x,y), \\ \tfrac{1}{\beta}\,r(x,y) &= \log \pi^{\star}(y\mid x) - \log \pi_{\mathrm{ref}}(y\mid x) + \log Z(x), \\ r(x,y) &= \beta\,\log\frac{\pi^{\star}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} \;+\; \beta\log Z(x). \end{align*}

This is the crux. Any reward consistent with the RLHF objective is, up to the prompt-only term $\beta\log Z(x)$ , just a scaled log-ratio between a policy and the reference. So we never needed a separate reward model: the policy already is one. We give the policy-side quantity a name, the implicit reward:

\hat r(x,y)\;=\;\beta\,\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}.

It is the $\beta$ -scaled log-ratio of how much more likely the current policy makes $y$ than the reference did. Bump the policy's probability on $y$ above the reference's, and its implicit reward goes up. That is the entire reward model, hiding inside the language model; hence the DPO paper's subtitle, “Your Language Model is Secretly a Reward Model.”
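
The inversion can be checked numerically on a toy example. The sketch below (same invented three-response setup, all names hypothetical) builds the optimal policy from a known reward, then reads the reward back out of the policy as a scaled log-ratio. The recovered values should differ from the true rewards only by the prompt-only constant $\beta\log Z(x)$, so reward gaps are recovered exactly.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta):
    """beta * log(pi(y|x) / pi_ref(y|x)), computed from log-probabilities."""
    return beta * (logp_policy - logp_ref)

# Hypothetical toy setup: three responses, known rewards.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

# Forward direction: reward -> optimal policy.
weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Backward direction: policy -> implicit reward.
r_hat = [implicit_reward(math.log(p), math.log(q), beta)
         for p, q in zip(pi_star, ref_probs)]

# Every offset should be the same number, beta * log Z.
offsets = [r - rh for r, rh in zip(rewards, r_hat)]
print("offsets:", [round(o, 6) for o in offsets])
print("beta*log Z:", round(beta * math.log(z), 6))
```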

DPO: preferences become a classification loss

Now we connect the implicit reward to the preference data. The standard model of how a reward turns into a preference is Bradley–Terry: the probability that $y_w$ beats $y_\ell$ is a logistic function of their reward gap,

p(y_w \succ y_\ell\mid x) \;=\; \sigma\!\big(r(x,y_w)-r(x,y_\ell)\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}}.

The sigmoid $\sigma$ squashes any real number into $(0,1)$ : a big positive reward gap means “ $y_w$ almost surely wins,” a gap near zero means “coin flip.” We want to choose $\theta$ to make the model assign high probability to the comparisons we actually observed (the winners winning) — i.e. maximize the likelihood of the data, equivalently minimize its negative log-likelihood.
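
As a small illustration, the Bradley–Terry probability and the negative log-likelihood of a set of observed comparisons can be written in a few lines of Python. The reward values in `pairs` are invented, and the function names are ours rather than from any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_win_prob(r_w, r_l):
    """Bradley-Terry probability that the response with reward r_w beats r_l."""
    return sigmoid(r_w - r_l)

def preference_nll(reward_pairs):
    """Average negative log-likelihood of observed (winner, loser) reward pairs."""
    total = -sum(math.log(bt_win_prob(rw, rl)) for rw, rl in reward_pairs)
    return total / len(reward_pairs)

# Hypothetical rewards for three observed comparisons (winner first).
pairs = [(2.0, 0.5), (1.0, 1.2), (0.3, -1.0)]
print(preference_nll(pairs))
```

A reward function that ranks every observed winner far above its loser drives this quantity toward zero; one that is indifferent everywhere gives $\log 2$.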

Here is the magic step. Substitute the inverted reward $r = \beta\log\frac{\pi^\star}{\pi_{\mathrm{ref}}} + \beta\log Z(x)$ into the gap $r(x,y_w)-r(x,y_\ell)$ . Because $\beta\log Z(x)$ depends only on the prompt $x$ , it is identical for the winner and the loser, so it appears in both terms of the subtraction and cancels exactly:

\begin{align*} r(x,y_w)-r(x,y_\ell) &= \Big(\beta\log\tfrac{\pi^\star(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} + \beta\log Z(x)\Big) - \Big(\beta\log\tfrac{\pi^\star(y_\ell\mid x)}{\pi_{\mathrm{ref}}(y_\ell\mid x)} + \beta\log Z(x)\Big) \\ &= \beta\log\tfrac{\pi^\star(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\tfrac{\pi^\star(y_\ell\mid x)}{\pi_{\mathrm{ref}}(y_\ell\mid x)} \;=\; \hat r_w - \hat r_\ell. \end{align*}

The intractable $Z(x)$ — the whole reason naive RLHF was hard — is gone. We replace $\pi^\star$ (which we are trying to learn) with our trainable $\pi_\theta$ , take the negative log of the Bradley–Terry likelihood, and out falls Direct Preference Optimization, a plain binary-classification loss on log-ratios:

\mathcal{L}_{\mathrm{DPO}} =-\,\mathbb{E}_{(x,y_w,y_\ell)}\;\log\sigma\!\Big( \underbrace{\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}}_{\hat r_w\ :\ \text{chosen}} \;-\; \underbrace{\beta\log\tfrac{\pi_\theta(y_\ell\mid x)}{\pi_{\mathrm{ref}}(y_\ell\mid x)}}_{\hat r_\ell\ :\ \text{rejected}}\Big).

Let us name every symbol. $\pi_\theta$ is the model being trained; $\pi_{\mathrm{ref}}$ the frozen SFT reference; $\beta$ the same KL knob as before; $\sigma$ the sigmoid; the expectation $\mathbb{E}$ averages over preference triples in the dataset. The quantity inside the sigmoid is just $\hat r_w - \hat r_\ell$ , the implicit-reward margin between chosen and rejected. The loss is small when that margin is large and positive — i.e. when the policy has, relative to the reference, raised the chosen response's probability more than the rejected's. That is the whole method: raise the winner's log-ratio, lower the loser's, no reward model, no sampling, no RL loop, optimized exactly like ordinary supervised learning on a fixed dataset.
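
The loss is short enough to write out directly. The following is a plain-Python sketch, assuming each training example has already been reduced to four summed sequence log-probabilities; the tuple layout and function names are our own convention, and a real implementation would compute these with a tensor library so that gradients flow into $\pi_\theta$.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(batch, beta):
    """Mean DPO loss over a batch of preference pairs.

    Each item is a tuple of summed sequence log-probs:
    (policy_chosen, ref_chosen, policy_rejected, ref_rejected).
    """
    total = 0.0
    for lp_w, lr_w, lp_l, lr_l in batch:
        r_w = beta * (lp_w - lr_w)  # implicit reward of the chosen response
        r_l = beta * (lp_l - lr_l)  # implicit reward of the rejected response
        total += -log_sigmoid(r_w - r_l)
    return total / len(batch)

# Hypothetical batch: policy equal to the reference gives the tie value log 2.
print(dpo_loss([(-5.0, -5.0, -4.0, -4.0)], beta=0.1))
```

Only the reference-relative log-ratios matter: the loss starts at $\log 2$ when the policy equals the reference and falls as the chosen response gains probability relative to the rejected one.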

DPO turns alignment into classification. The KL-regularized RLHF problem has a closed-form optimal policy; inverting it shows the reward is a log-ratio $\beta\log(\pi_\theta/\pi_{\mathrm{ref}})$ ; substituting into Bradley–Terry makes the intractable normalizer $Z(x)$ cancel between chosen and rejected. What remains is a one-line, fully differentiable loss — the language model is its own reward model.

Why does this train sensibly? Look at the gradient. Differentiating the loss gives a clean form,

\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\,\mathbb{E}\Big[\;\underbrace{\sigma(\hat r_\ell - \hat r_w)}_{\text{weight } w}\;\big(\nabla_\theta\log\pi_\theta(y_w\mid x) - \nabla_\theta\log\pi_\theta(y_\ell\mid x)\big)\Big].

The bracket pushes up the log-probability of the chosen response and down that of the rejected one — exactly the dog-training move. The scalar weight $w=\sigma(\hat r_\ell - \hat r_w)$ is the clever part: it is large (near $1$ ) when the model currently has the pair ordered wrong (rejected scoring higher than chosen), and shrinks toward $0$ once the pair is comfortably correct. So the optimizer automatically spends its effort on the pairs it is still getting wrong and stops nagging about the ones it has already mastered.
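
One way to convince yourself of the gradient formula is a finite-difference check on a single pair, treating the two policy log-probabilities as the free variables. This is a sketch with arbitrary illustrative numbers; the helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_loss(lp_w, lp_l, lr_w, lr_l, beta):
    """DPO loss for one pair, as a function of summed log-probabilities."""
    margin = beta * (lp_w - lr_w) - beta * (lp_l - lr_l)
    return -math.log(sigmoid(margin))

def analytic_grads(lp_w, lp_l, lr_w, lr_l, beta):
    """d loss / d lp_w and d loss / d lp_l, using the weight w = sigmoid(-margin)."""
    margin = beta * (lp_w - lr_w) - beta * (lp_l - lr_l)
    w = sigmoid(-margin)
    return -beta * w, beta * w

def finite_diff(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Arbitrary illustrative values.
lp_w, lp_l, lr_w, lr_l, beta = -5.0, -4.0, -5.5, -6.0, 0.1

g_w, g_l = analytic_grads(lp_w, lp_l, lr_w, lr_l, beta)
num_w = finite_diff(lambda v: pair_loss(v, lp_l, lr_w, lr_l, beta), lp_w)
num_l = finite_diff(lambda v: pair_loss(lp_w, v, lr_w, lr_l, beta), lp_l)
print(g_w, num_w)
print(g_l, num_l)
```

The derivative with respect to the chosen log-probability is negative (gradient descent raises it) and the derivative with respect to the rejected one is positive (gradient descent lowers it), both scaled by the same weight.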

Hands-on · one DPO step, by hand

Take $\beta=0.1$ and one preference pair. Suppose the sequence log-probabilities (per-token log-probabilities summed over each response) are:

\log\pi_\theta(y_w)=-5.0,\quad \log\pi_{\mathrm{ref}}(y_w)=-5.5,\qquad \log\pi_\theta(y_\ell)=-4.0,\quad \log\pi_{\mathrm{ref}}(y_\ell)=-6.0.

The implicit rewards are the scaled log-ratios:

\begin{align*} \hat r_w &= \beta\,(\log\pi_\theta(y_w)-\log\pi_{\mathrm{ref}}(y_w)) = 0.1\,(-5.0-(-5.5)) = 0.1(0.5)=0.05,\\ \hat r_\ell &= \beta\,(\log\pi_\theta(y_\ell)-\log\pi_{\mathrm{ref}}(y_\ell)) = 0.1\,(-4.0-(-6.0)) = 0.1(2.0)=0.20. \end{align*}

The margin is $\hat r_w-\hat r_\ell = 0.05-0.20 = -0.15$ : negative, so right now the model implicitly rewards the rejected response more. The loss is $-\log\sigma(-0.15) = -\log(0.4626) \approx 0.771$ (vs. $\log 2\approx 0.693$ at a tie), confirming the pair is mis-ordered. The gradient weight is $w=\sigma(\hat r_\ell-\hat r_w)=\sigma(0.15)\approx 0.537$ : above $\tfrac12$ , so this mis-ordered pair gets a larger push than a tied pair would, moving $y_w$ up and $y_\ell$ down. Had the margin been $+2.0$ instead, $w=\sigma(-2.0)\approx0.12$ : the same pair, already correct, would receive less than a quarter of that push. The sign of the margin tells you which way the pair is ordered; the weight $w$ , which shrinks as the margin grows, tells you how hard the gradient pushes.
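
The arithmetic of this worked example is easy to reproduce. The short script below uses only the numbers given above; the variable names are ours.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1
# (policy log-prob, reference log-prob) for each response, from the example.
chosen = (-5.0, -5.5)
rejected = (-4.0, -6.0)

r_w = beta * (chosen[0] - chosen[1])      # implicit reward, chosen
r_l = beta * (rejected[0] - rejected[1])  # implicit reward, rejected
margin = r_w - r_l
loss = -math.log(sigmoid(margin))
weight = sigmoid(-margin)                 # gradient weight w
weight_if_correct = sigmoid(-2.0)         # weight had the margin been +2.0

print(f"r_w={r_w:.2f} r_l={r_l:.2f} margin={margin:.2f}")
print(f"loss={loss:.3f} w={weight:.3f} w_at_margin_2={weight_if_correct:.3f}")
```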

The offline zoo: same skeleton, different knobs

DPO has a few well-known soft spots, and each variant in the zoo is a targeted fix. The unifying view: every offline preference method is DPO with one of three knobs changed.

The three knobs. (a) Reward parameterization: reference-anchored $\beta\log\frac{\pi_\theta}{\pi_{\mathrm{ref}}}$ (DPO, IPO, KTO) versus reference-free (ORPO, SimPO). (b) Link/loss on the margin: log-sigmoid (DPO) versus squared (IPO) versus a prospect-theory value function (KTO). (c) Data shape: paired winner+loser (DPO, IPO, ORPO, SimPO) versus unpaired single good/bad labels (KTO).

IPO (Identity Preference Optimization) targets DPO's overfitting on clean data. The log-sigmoid loss is happiest as the margin $\hat r_w-\hat r_\ell\to+\infty$ : with near-deterministic preferences it keeps pushing the chosen log-ratio up and the rejected down without bound, effectively trampling the KL penalty. IPO drops Bradley–Terry and instead regresses the unscaled log-ratio gap $h=\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\log\tfrac{\pi_\theta(y_\ell\mid x)}{\pi_{\mathrm{ref}}(y_\ell\mid x)}$ to a fixed finite target with a squared loss, $\big(h-\tfrac{1}{2\beta}\big)^2$ ; in terms of implicit rewards, $\hat r_w-\hat r_\ell=\beta h$ is regressed to $\tfrac12$ . “Get the log-ratio gap to $\tfrac{1}{2\beta}$ , then stop”: a target that cannot be trampled, so the regularization survives.
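
The contrast between the two losses shows up clearly when both are evaluated at growing gaps. This sketch assumes the IPO loss is written on the unscaled log-ratio gap $h$ (chosen log-ratio minus rejected log-ratio) with target $\tfrac{1}{2\beta}$ , which is our reading of the IPO formulation; the function names are hypothetical.

```python
import math

def dpo_pair_loss(h, beta):
    """-log sigmoid(beta * h): keeps decreasing as the gap h grows."""
    return math.log1p(math.exp(-beta * h))

def ipo_pair_loss(h, beta):
    """Squared distance of the gap h from the finite target 1 / (2 * beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

beta = 0.1
for h in (0.0, 5.0, 50.0, 500.0):
    print(f"h={h:6.1f}  dpo={dpo_pair_loss(h, beta):.6f}  ipo={ipo_pair_loss(h, beta):.1f}")
```

The DPO loss is still (slightly) rewarded for pushing the gap from 50 to 500, whereas the IPO loss is minimized at $h=\tfrac{1}{2\beta}$ and penalizes overshooting it.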

KTO (Kahneman–Tversky Optimization) changes the data requirement. Pairs are costly to collect; often you only have a single thumbs-up or thumbs-down on one response. KTO borrows prospect theory from behavioral economics (humans feel losses more sharply than equal gains) and scores each example's implicit reward against a baseline (roughly the batch-average implicit reward), passing it through a value function that is concave on gains and convex on losses with built-in loss aversion. The result is an unpaired, per-example loss: feed it “good” and “bad” examples separately, no pairing needed, which is ideal when labels are abundant, imbalanced, or simply not paired.
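
A simplified sketch of such an unpaired loss is given below. It assumes a sigmoid-shaped value function around a scalar baseline and separate weights for desirable and undesirable examples, which is our understanding of the KTO objective; the baseline is passed in as a plain number rather than estimated from a batch, and all names and default values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_example_loss(log_ratio, desirable, baseline,
                     beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example loss on log_ratio = log(pi_theta(y|x) / pi_ref(y|x)).

    Desirable examples are rewarded for sitting above the baseline,
    undesirable examples for sitting below it. No paired response is needed.
    """
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - baseline)))
    return lambda_u * (1.0 - sigmoid(beta * (baseline - log_ratio)))

# Hypothetical unpaired examples: (log-ratio, thumbs-up?).
examples = [(2.0, True), (-1.0, True), (3.0, False), (-4.0, False)]
baseline = 0.0  # assumption: a fixed baseline instead of a batch estimate
mean_loss = sum(kto_example_loss(h, good, baseline) for h, good in examples) / len(examples)
print(mean_loss)
```

Setting `lambda_u` larger than `lambda_d` is one way to express loss aversion or to rebalance a dataset with few negative labels.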

ORPO (Odds Ratio Preference Optimization) removes a whole stage. DPO/IPO/KTO all keep a reference model $\pi_{\mathrm{ref}}$ in memory (a second forward pass) and usually need an SFT warm-up first. ORPO folds preference learning into SFT in one pass with no reference at all. Writing $p$ for the model's sequence likelihood, define the odds $\mathrm{odds}(y\mid x)=p/(1-p)$ ; the loss is the ordinary SFT negative-log-likelihood on the chosen response plus a small log-sigmoid penalty on the log-odds-ratio between chosen and rejected, weighted by $\lambda$ . One SFT-style run does both jobs.
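
In code, the ORPO objective for one pair is an SFT term plus a weighted odds-ratio term. The sketch below assumes `logp_chosen` and `logp_rejected` are the logs of the likelihood $p$ described above (strictly negative, so that $p<1$ ); the function names and the default `lam` are illustrative choices, not values from the text.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_odds(logp):
    """log(p / (1 - p)) computed from log p; requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """SFT negative log-likelihood on the chosen response plus odds-ratio penalty."""
    nll = -logp_chosen
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    penalty = -log_sigmoid(log_odds_ratio)
    return nll + lam * penalty

print(orpo_loss(-0.5, -2.0))  # chosen already more likely than rejected
print(orpo_loss(-2.0, -0.5))  # pair ordered the wrong way round
```

No reference log-probabilities appear anywhere, which is the point: only the trainable model is evaluated.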

SimPO (Simple Preference Optimization) also goes reference-free, but anchors to nothing at all: it replaces the log-ratio with the response's length-normalized average log-probability, $\tfrac{1}{|y|}\sum_t\log\pi_\theta(y_t\mid\cdot)$ , and adds a target margin $\gamma$ inside the sigmoid. Dropping $\pi_{\mathrm{ref}}$ halves memory and forward cost, and the $1/|y|$ term matches the per-token metric used at decode time. On benchmarks like AlpacaEval 2 it can beat DPO — but length normalization re-introduces a length bias: the model can inflate its reward simply by writing longer, so response length must be watched on evaluations.
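
The SimPO reward and loss for one pair can be sketched as follows. The inputs are summed sequence log-probabilities and token counts; the default values of `beta` and `gamma` are placeholders chosen for illustration, not recommended settings.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """-log sigmoid of the length-normalized reward gap minus a target margin."""
    r_w = beta * logp_chosen / len_chosen      # average log-prob per token, scaled
    r_l = beta * logp_rejected / len_rejected
    return -log_sigmoid(r_w - r_l - gamma)

# Same average per-token log-prob at different lengths gives the same reward,
# so the loss here depends only on the margin gamma.
print(simpo_loss(-10.0, 10, -20.0, 20))
```

Because of the margin $\gamma$ , a pair with equal rewards is still penalized more than the $\log 2$ of a plain tie: the chosen response has to win by at least $\gamma$ before the loss falls below that level.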

Spending compute at inference: best-of- $n$ and rejection sampling

Everything above changes the model's weights. A complementary family changes nothing and just spends extra compute at generation time. Best-of- $n$ (BoN) samples $n$ responses to a prompt, scores each with a reward model or verifier, and returns the best one. Quality rises with $n$ : if each independent sample is “good” with probability $p$ , the chance that at least one of $n$ is good is $1-(1-p)^n$ , which climbs fast. The cost is paid twice over: you do $n\times$ the generation compute, and you drift away from $\pi_{\mathrm{ref}}$ . That drift is bounded, in KL terms, by a famous estimate:

\mathrm{KL}\big(\pi_{\mathrm{BoN}}\,\|\,\pi_{\mathrm{ref}}\big)\;\le\;\log n-\frac{n-1}{n}.

The remarkable thing is that this grows only logarithmically in $n$ : doubling $n$ adds less than $\log 2\approx 0.69$ nats to the bound, which is why BoN is such a cheap way to buy alignment measured in KL. (Recent work shows this is an upper bound and the true KL is often smaller.) But more is not always better: with an imperfect scorer, a large $n$ starts finding responses that fool the scorer rather than genuinely satisfy the goal. This is inference-time reward hacking: quality as judged by a held-out human first improves with $n$ , then declines. Best-of- $16$ may help where best-of- $256$ hurts; that decline is a signal your verifier is the weak link.
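
Both formulas in this section are simple enough to tabulate. The sketch below evaluates the success probability $1-(1-p)^n$ and the KL bound $\log n-\tfrac{n-1}{n}$ for a few values of $n$ ; the per-sample success rate `p = 0.2` is an arbitrary illustrative choice.

```python
import math

def prob_at_least_one_good(p, n):
    """Chance that at least one of n independent samples is good."""
    return 1.0 - (1.0 - p) ** n

def bon_kl_bound(n):
    """Upper bound on KL(pi_BoN || pi_ref) in nats: log n - (n - 1) / n."""
    return math.log(n) - (n - 1) / n

p = 0.2  # hypothetical per-sample success probability
for n in (1, 2, 4, 16, 64, 256):
    print(f"n={n:4d}  P(any good)={prob_at_least_one_good(p, n):.4f}  "
          f"KL bound={bon_kl_bound(n):.3f}")
```

The success probability saturates quickly while the KL bound creeps up by less than $\log 2$ per doubling. Note that this table assumes a perfect scorer; it says nothing about the reward-hacking regime described above.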

Rejection-sampling fine-tuning (RAFT / RFT) makes the BoN gain stick: generate many samples per prompt, keep only the highest-scoring ones, and run ordinary SFT on that filtered set. It is simple, stable, and offline. Its ceiling, though, is the model's own sampling distribution — it can only amplify good behaviors the model already produces sometimes; it cannot conjure skills the model never exhibits. It is the natural first thing to try when you have a decent verifier and want a robust, low-drama improvement before reaching for full RL.
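
The data-building half of this recipe is just sample, score, filter. Here is a minimal sketch in which `sample_fn` and `score_fn` are hypothetical stand-ins for the model's sampler and the verifier; the toy versions below exist only so the example runs, and the subsequent SFT step is not shown.

```python
import random

def build_rft_dataset(prompts, sample_fn, score_fn, n_samples=8, keep_top=1):
    """For each prompt, draw n_samples responses and keep the keep_top best."""
    dataset = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda y: score_fn(prompt, y), reverse=True)
        dataset.extend((prompt, y) for y in ranked[:keep_top])
    return dataset

# Toy stand-ins (assumptions, not a real model or verifier).
rng = random.Random(0)

def toy_sample(prompt):
    return f"{prompt} -> {rng.randint(0, 9)}"

def toy_score(prompt, response):
    return int(response.split()[-1])

sft_data = build_rft_dataset(["q1", "q2"], toy_sample, toy_score, n_samples=8, keep_top=2)
for prompt, response in sft_data:
    print(prompt, "|", response)
```

Every kept response was produced by the sampler itself, which is the ceiling mentioned above: filtering can only select among behaviors the model already exhibits.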

Online vs. offline, and how GRPO relates

A theme cuts across the whole zoo: where does the training data come from? DPO and friends are offline — they learn from a fixed dataset of pairs that some other model (or an earlier version of this one) generated. PPO and GRPO (Topics 9 and 11) are online — at each step they sample fresh responses from the current policy and learn from those. The distinction matters because a contrastive loss like DPO only shapes the model where the training data lives; if the model has since drifted into a different region of response space, offline pairs no longer cover where it actually generates, and the signal goes stale.

Iterative / online DPO splits the difference: periodically generate fresh pairs from the current model, label them (with a reward model or an AI judge), and run another DPO round. This keeps the data “on-policy” and typically closes much of DPO's gap to full RL, at the cost of needing a labeler in the loop. GRPO (Group Relative Policy Optimization) is the online cousin most relevant here: for each prompt it samples a group of responses, uses their average reward as the baseline (so it needs no separate value network), and does a policy-gradient update — a lighter-weight PPO that is now standard for reasoning models. A useful way to file these: DPO learns from a frozen comparison; GRPO/PPO learn from the model's own live attempts.
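
The group baseline is the part of GRPO that is easiest to show in isolation. The sketch below computes advantages for one prompt's group of sampled responses by subtracting the group mean; dividing by the group standard deviation is an additional normalization found in common descriptions of GRPO and is an assumption here, as are the function name and the example rewards.

```python
import statistics

def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Advantage of each sampled response relative to its own group.

    The group mean plays the role of the baseline, so no value network
    is needed. With normalize=True the result is also divided by the
    group's (population) standard deviation.
    """
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize:
        return centered
    std = statistics.pstdev(rewards)
    return [c / (std + eps) for c in centered]

# Hypothetical rewards for four responses sampled for the same prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
print(group_relative_advantages([0.2, 0.9, 0.4], normalize=False))
```

If every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason group size and reward diversity matter in practice.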

DPO vs. PPO: the central trade-off

Since this whole topic exists to relate alternatives to the Topic 9 baseline, it is worth comparing the two head-to-head. In the idealized setting (preferences that follow Bradley–Terry, unlimited data, and a sufficiently expressive model) they share the same optimum, because both solve the KL-regularized RLHF objective; but they get there very differently and behave differently with finite data and compute.

Models in play. PPO needs four (policy, reference, reward model, value/critic); DPO needs two (policy and reference) and no reward model. Fewer moving parts means less to break.
Data. DPO is offline — a fixed preference set, reusable, no rollouts. PPO is online — it samples from the live policy every step and must score those samples, which is slow but keeps the signal on-policy.
The KL constraint. DPO's KL is static: $\beta$ is fixed up front and enters only through the loss, which targets the optimum for that $\beta$ directly without ever measuring the KL during training. PPO's KL is dynamic: it is measured and controlled per batch as training proceeds, which is more adaptive but adds tuning burden.
Stability & cost. DPO is stable and cheap, trains like supervised learning, and is the right place to start. PPO is harder to stabilize and more expensive, but, by exploring with fresh on-policy samples, tends to reach a slightly higher ceiling and resist some of DPO's failure modes.
Bottom line. Choose DPO for simplicity, speed, and fast data iteration (and remember: data quality usually matters more than which preference loss you pick). Reach for PPO/GRPO when you can afford the rollouts and need the last few points of quality, especially on verifiable or reasoning tasks.

Where the labels come from, and what to watch for

DPO and the rest all assume preference labels exist. Because human labels are slow and expensive, a major thread replaces them with AI feedback. In RLAIF, an LLM judges which of two responses is better, manufacturing the same $(x,y_w,y_\ell)$ triples DPO consumes. Constitutional AI (Anthropic) is the influential instance: a written list of principles drives a critique-and-revise loop — the model critiques its own draft against the principles, rewrites it, and the (original, revised) pairs plus AI-labeled comparisons become the training data. Self-rewarding models go further, using the same model as both generator and judge. The recurring danger across all three: the judge's blind spots silently become the policy's blind spots, so these loops typically improve for a round or two, then plateau or drift — they need an external, ideally human, check and a stopping rule.

A few pitfalls recur often enough to memorize. DPO can go bland: by chasing the chosen-minus-rejected margin, it sometimes lowers the probability of both responses (pushing the rejected down harder), and the freed-up probability mass can leak to off-distribution tokens, making the model duller — the standard fix is to add a small SFT/NLL term on the chosen response (the spirit of DPO+SFT, and built into ORPO). $\beta$ is the master dial: too high and the model barely leaves $\pi_{\mathrm{ref}}$ (preferences never learned); too low and it over-optimizes and degenerates — read it off the training curves (the implicit-reward margin and the KL growth) plus sample quality, with $\beta\approx 0.1$ a common default. Reference-free speed has a cost: SimPO's length normalization invites length exploitation. And any AI-in-the-loop method inherits its judge's flaws — always validate against a held-out, independent evaluation. With this map in hand, the detailed questions that follow — the full DPO and IPO derivations, the KTO and SimPO objectives, the BoN reward-hacking analysis, BOND and distillation, and online-vs-offline trade-offs — read as variations on the one equation we inverted at the start.
