Chapter 09 · Part III · Post-Training & Alignment

RLHF, RL & Preference Optimization (Core)

9 practice sets · 7 coding problems

By the end of Topic 8 we had a model that could follow instructions: supervised fine-tuning (SFT) taught it to imitate human-written demonstrations of good answers. But imitation has a ceiling. SFT can only copy the demonstrations it was shown; it has no way to learn that, of two perfectly fluent answers, one is more helpful, more honest, or less harmful than the other. Those qualities are exactly the ones we care about most, and they are exactly the ones we cannot write down as a clean loss function. This mini-chapter is about the standard fix — reinforcement learning from human feedback (RLHF) — which sidesteps the missing loss function by learning the objective itself from human comparisons, and then optimizing the model against it with reinforcement learning. We build the whole machine from scratch, assuming only that you have met SFT and a neural network's training loop. By the end you should be able to follow a model through all three RLHF stages and know the job of every part — the reward model, the KL leash, policy gradients, PPO, and GRPO — that the rest of this topic takes apart in detail.

Why we need RL after SFT

Start with the core obstacle. To train a network by gradient descent you need a differentiable loss: a number that says “how wrong was this output,” which you can push downhill. For next-token prediction that loss is easy — cross-entropy against the one true next token. For “was this answer helpful and honest and harmless?” there is no such formula. Helpfulness is not a token you can compare against; it is a fuzzy human judgement that depends on the whole answer at once.

Here is the key observation that unlocks everything. People are terrible at assigning absolute scores (ask ten annotators to rate an answer “out of 10” and you get ten different numbers), but they are quite good and quite consistent at comparisons: shown two answers $A$ and $B$ to the same prompt, most people will agree on which is better. RLHF is built entirely on this asymmetry. We collect a pile of pairwise comparisons, distil them into a learned scoring function, and then use that function as the objective we optimize.

The big picture: three stages

Classic RLHF, the recipe popularized by InstructGPT (the direct precursor of the first ChatGPT), runs in three stages, in order:

SFT (supervised fine-tuning). Fine-tune the base model on human-written demonstrations of good answers. This produces a competent starting model. A frozen copy of it is kept as the reference model $\pi_{\text{ref}}$ — the “known-good” anchor we will not let the model stray too far from.
Reward modeling. Collect human preferences — for each prompt, a pair of answers labelled “chosen” $y_w$ (the winner) and “rejected” $y_l$ (the loser) — and train a reward model (RM) $r_\phi(x,y)$ that maps any prompt–answer pair to a single scalar: “how good is this answer?”
RL. Treat the reward model as the objective. Use reinforcement learning to update the model (now called the policy $\pi_\theta$ ) so it generates higher-reward answers — while a KL penalty stops it from drifting too far from $\pi_{\text{ref}}$ .

Why not skip the RL and just fine-tune on the highest-reward answers we can find? Because the answers being scored are the model's own samples, and those change as the model trains. You are optimizing a moving target: improve the model and it produces new answers, which get new rewards, which demand new updates. That feedback loop — act, get scored, improve, repeat — is precisely what reinforcement learning is built to handle, and why it, rather than plain supervised learning, sits at stage 3. Here is the whole pipeline in one picture; everything below opens one of these boxes.

[Diagram: the RLHF pipeline in three stages (SFT, reward modeling, RL)]

Reward modeling: turning comparisons into a scalar

We need a function $r_\phi(x,y)$ that scores any answer, but all we have are comparisons. How do we get a number out of “$A$ beats $B$”? The standard bridge is the Bradley–Terry model, a classic statistical model of pairwise contests dating to the 1950s (the same kind of model used to rank sports teams or taste-test results). It assumes each answer has a hidden “strength” $r(x,y)$, and that the probability a judge prefers $y_w$ over $y_l$ is a logistic function of the gap between their strengths:

$$P(y_w \succ y_l \mid x) \;=\; \sigma\!\big(r(x,y_w)-r(x,y_l)\big), \qquad \sigma(z)=\frac{1}{1+e^{-z}}.$$

The sigmoid $\sigma$ squashes any real number into $(0,1)$, so this is a valid probability. Read it as: the bigger $y_w$'s lead in score, the more sure we are the human prefers it; if the scores are equal the gap is $0$ and $\sigma(0)=\tfrac12$, a coin flip, exactly right for two answers the model cannot tell apart. Notice that only the difference of scores appears, so adding the same constant to every score changes nothing: the scores are pinned down only up to an additive offset, and the RM learns relative quality, not an absolute grade.
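The preference probability is small enough to check by hand. The following is a minimal sketch in plain Python; the function names and the example scores are illustrative assumptions, not part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the answer scored r_w beats the one scored r_l."""
    return sigmoid(r_w - r_l)

# Equal scores: the gap is 0, so the judge is modeled as a coin flip.
p_equal = preference_prob(1.3, 1.3)

# Only the gap matters: shifting both scores by the same constant
# leaves the preference probability unchanged.
p_base = preference_prob(2.0, 0.5)
p_shifted = preference_prob(2.0 + 10.0, 0.5 + 10.0)

print(p_equal, p_base, p_shifted)
```

With a gap of $1.5$ the modeled preference probability is roughly $0.82$, and it is the same before and after the shift.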

To train the RM we maximize the likelihood it assigns to the comparisons humans actually made — equivalently, we minimize the negative log-likelihood. For one preference pair $(x,y_w,y_l)$ the loss is just minus the log of the probability above:

\begin{align*} \mathcal{L}_{\text{RM}}(\phi) &= -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log P(y_w\succ y_l\mid x)\Big] \\ &= -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log \sigma\big(r_\phi(x,y_w)-r_\phi(x,y_l)\big)\Big]. \end{align*}

Driving this down means making $\sigma(\text{gap})$ close to $1$ , i.e. pushing the chosen answer's score above the rejected one's on every pair. Architecturally the RM is cheap to build: take the SFT model, throw away its unembedding (the head that predicts the next token), and bolt on a tiny linear scalar head that reads the final hidden state of the last token and outputs one number. The body already understands language from pretraining; the head just learns to read off “quality.”
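As a rough sketch of the loss above, here is the pairwise negative log-likelihood computed over a few preference pairs in plain Python. The scores are made-up numbers standing in for reward-model outputs, and `rm_loss` is a hypothetical helper name; a real RM would compute the scores with a network and backpropagate through this loss.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log(sigmoid(z)) = -log(1 + exp(-z)); branch so exp never overflows.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def rm_loss(chosen_scores, rejected_scores):
    """Mean of -log sigma(r_chosen - r_rejected) over preference pairs."""
    gaps = [rw - rl for rw, rl in zip(chosen_scores, rejected_scores)]
    return -sum(log_sigmoid(g) for g in gaps) / len(gaps)

# Hypothetical RM scores for three preference pairs.
chosen = [1.2, 0.3, 2.0]
rejected = [0.4, 0.9, -1.0]  # the second pair is currently mis-ranked

loss_before = rm_loss(chosen, rejected)
# Raising every chosen score widens each gap and lowers the loss.
loss_after = rm_loss([s + 1.0 for s in chosen], rejected)
# A tied pair costs exactly log(2): the model is at coin-flip confidence.
loss_tied = rm_loss([0.0], [0.0])

print(loss_before, loss_after, loss_tied)
```

The mis-ranked pair contributes the largest term to the loss, which is exactly the pressure that pushes the chosen score above the rejected one.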

The objective: maximize reward, but stay near the reference

Now we have a scorer. The obvious move is to train the policy to maximize the expected reward of its own answers, $\mathbb{E}_{y\sim\pi_\theta}[r(x,y)]$ . This is a trap. The reward model is a proxy: it was trained on a narrow slice of answers and is only approximately right. A policy turned loose to maximize it will hunt down weird, off-distribution answers that the RM rates highly but a human would hate — gibberish that happens to trip the RM's circuits, or a tic the RM accidentally rewarded. This failure has a name: reward hacking, or over-optimization.

The fix is a leash: a penalty on the KL divergence between the policy and the frozen reference. KL divergence $\mathrm{KL}(\pi_\theta\|\pi_{\text{ref}})$ measures how far the policy's output distribution has drifted from the reference's (zero when identical, growing as they diverge). Subtracting a multiple of it gives the KL-regularized RLHF objective, the single equation the whole topic optimizes:

$$\max_{\pi_\theta}\;\; \mathbb{E}_{x}\Big[\,\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[\,r(x,y)\,\big] \;-\;\beta\,\mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)\Big].$$

The coefficient $\beta>0$ sets the leash length. Small $\beta$ lets the policy chase reward aggressively (and risk hacking); large $\beta$ keeps it timid and close to the trusted SFT model. Remarkably, this objective has a closed-form optimum — the reference distribution tilted by the exponentiated reward:

$$\pi^\star(y\mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\, \exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big), \qquad Z(x)=\sum_{y}\pi_{\text{ref}}(y\mid x)\,e^{r(x,y)/\beta}.$$

Read it as: start from the reference, then up-weight each answer by $e^{r/\beta}$ — good answers get boosted, bad ones suppressed, and the normalizer $Z(x)$ makes it sum to one again. As $\beta\!\to\!\infty$ the exponent vanishes and $\pi^\star\!\to\!\pi_{\text{ref}}$ (the leash is rigid); as $\beta\!\to\!0$ it collapses onto the single highest-reward answer (no leash, greedy). The sting is $Z(x)$ : it sums over every possible answer, an astronomically large set, so we cannot write $\pi^\star$ down or sample from it directly. That intractable normalizer is the entire reason we fall back on iterative RL to crawl toward $\pi^\star$ one batch at a time. (Topic 10's DPO is the slick trick that makes $Z(x)$ cancel for the pairwise case — but that is the next chapter.)
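When the answer set is tiny, $Z(x)$ is just a short sum, so the closed form can be checked directly. The sketch below assumes a toy prompt with only four candidate answers; the reference probabilities, rewards, and helper names are invented for illustration.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta) over a small answer set."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels when we divide by the normalizer.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, ref_probs, rewards, beta):
    """Expected reward minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, ref_probs)

# Toy setting (assumed numbers): four candidate answers for one prompt.
ref = [0.5, 0.3, 0.15, 0.05]
rewards = [0.0, 1.0, 2.0, 3.0]
beta = 1.0

pi_star = tilted_policy(ref, rewards, beta)
greedy = [0.0, 0.0, 0.0, 1.0]  # all mass on the single highest-reward answer

j_star = objective(pi_star, ref, rewards, beta)
j_ref = objective(ref, ref, rewards, beta)       # KL is 0, reward is unimproved
j_greedy = objective(greedy, ref, rewards, beta) # high reward, large KL bill

# The two limits described in the text.
nearly_ref = tilted_policy(ref, rewards, beta=1e6)      # rigid leash
nearly_greedy = tilted_policy(ref, rewards, beta=1e-2)  # almost no leash

print(pi_star, j_star, j_ref, j_greedy)
```

In this toy case the tilted policy scores higher on the regularized objective than either staying at the reference or jumping straight to the greedy answer, which is the trade-off $\beta$ controls.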

All of RLHF reduces to one objective: push reward up, keep KL to the reference down. Its exact optimum is $\pi^\star\!\propto\!\pi_{\text{ref}}\,e^{r/\beta}$ . Every algorithm in this topic — PPO, GRPO, and the rest — is just a practical recipe for climbing toward that target without ever computing the intractable normalizer $Z(x)$ .

A short RL primer: policy, reward, advantage

Before the algorithms, fix the vocabulary. In RL an agent in a state takes an action and receives a reward; the policy is the (stochastic) rule mapping states to actions. For a language model the mapping is almost embarrassingly direct: the state is the prompt plus the tokens generated so far, the action is the next token (or, viewed coarsely, the whole answer), the policy $\pi_\theta$ is the language model itself, and the reward is the RM score on the finished answer. One full generated answer is called a rollout. Generating a fresh batch of rollouts from the current policy and training on them is on-policy RL, which is what all the algorithms here do.

The central problem of policy optimization is credit assignment: an answer got reward $0.8$ , but which choices made it good? The tool for this is the advantage. Instead of asking “how much reward did this answer get?” (a raw, noisy number), we ask “how much better than average was it?” If $V(s)$ is the value — the average reward you'd expect from state $s$ — then the advantage is

$$A \;=\; R - V,$$

the reward minus the baseline you expected. Positive advantage means “better than expected, do more of this”; negative means “worse than expected, do less.” Subtracting a baseline is the single most important variance-reduction trick in RL: it doesn't change which direction we move on average, but it makes the signal far less noisy by centering it. Hold onto this — the whole difference between PPO and GRPO is how they compute the baseline.

Policy gradients: optimizing a sampling distribution

How do you do gradient descent on “expected reward of the model's own samples”? The answer is the policy-gradient (REINFORCE) estimator, and it rests on one algebraic trick. We want $\nabla_\theta\,\mathbb{E}_{y\sim\pi_\theta}[R(y)]$ , but $\theta$ appears inside the sampling distribution, which seems to block the gradient. The log-derivative trick — $\nabla_\theta \pi = \pi\,\nabla_\theta\log\pi$ — rewrites it as a plain expectation we can estimate by sampling:

\begin{align*} \nabla_\theta\,\mathbb{E}_{y\sim\pi_\theta}[R(y)] &= \mathbb{E}_{y\sim\pi_\theta}\big[\,R(y)\,\nabla_\theta\log\pi_\theta(y)\,\big] \\ &\approx \mathbb{E}_{y\sim\pi_\theta}\big[\,A(y)\,\nabla_\theta\log\pi_\theta(y)\,\big]. \end{align*}

The second line swaps the raw return $R$ for the advantage $A$ (same expected gradient, far less noise).
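The claim that a baseline leaves the expected gradient unchanged can be verified exactly on a toy problem. The sketch below assumes a three-action softmax policy (a stand-in for a language model with three possible answers) and computes the expectations by enumeration instead of sampling; all names and numbers are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(probs, y):
    """grad of log pi(y) w.r.t. the logits of a softmax policy: onehot(y) - probs."""
    return [(1.0 if k == y else 0.0) - probs[k] for k in range(len(probs))]

def expected_pg(logits, rewards, baseline=0.0):
    """Exact E_y[(R(y) - baseline) * grad log pi(y)], by enumerating every action."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, p_y in enumerate(probs):
        for k, s in enumerate(score(probs, y)):
            grad[k] += p_y * (rewards[y] - baseline) * s
    return grad

def second_moment(logits, rewards, baseline=0.0):
    """E_y[ ||(R(y) - baseline) * grad log pi(y)||^2 ]: spread of a one-sample estimate."""
    probs = softmax(logits)
    return sum(
        p_y * (rewards[y] - baseline) ** 2 * sum(s * s for s in score(probs, y))
        for y, p_y in enumerate(probs)
    )

logits = [0.2, -0.1, 0.5]
rewards = [1.0, 0.0, 3.0]
probs = softmax(logits)
value = sum(p * r for p, r in zip(probs, rewards))  # baseline V = expected reward

grad_raw = expected_pg(logits, rewards)                  # weights by raw reward R
grad_adv = expected_pg(logits, rewards, baseline=value)  # weights by advantage R - V

spread_raw = second_moment(logits, rewards)
spread_adv = second_moment(logits, rewards, baseline=value)

print(grad_raw, grad_adv)
print(spread_raw, spread_adv)
```

The two gradients agree to floating-point precision, while the advantage-weighted estimator has the smaller second moment in this example, so each sampled update is less noisy for the same average direction.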

In RLHF the per-token reward stream is sparse: $0$ at every token except the final one, where the RM score lands — minus a small per-token KL penalty $\beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}$ that bakes the leash directly into the reward (a token the policy now favors much more than the reference gets its effective reward docked). Spreading that single end-of-sequence reward back over the tokens that earned it is the per-token credit-assignment job that the next two algorithms tackle differently.
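The shape of that reward stream is easy to see on a short answer. This is a minimal sketch assuming we already have, for each generated token, the log-probability under the policy and under the reference; the helper name and all numbers are made up.

```python
def per_token_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token KL penalty everywhere, plus the RM score on the final token.

    logp_policy[t] and logp_ref[t] are the log-probabilities each model
    assigns to the token that was actually generated at position t.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # the sparse RM reward lands only at the end
    return rewards

# Assumed numbers for a four-token answer. Tokens 1 and 2 are ones the policy
# now favors more than the reference does, so they are docked.
logp_policy = [-0.5, -1.0, -0.2, -0.7]
logp_ref = [-0.5, -1.5, -0.9, -0.7]

rewards = per_token_rewards(rm_score=0.8, logp_policy=logp_policy,
                            logp_ref=logp_ref, beta=0.1)
print(rewards)
```

Only the last entry carries the RM score; the earlier entries are zero or slightly negative, which is the sparse signal that credit assignment has to spread back over the sequence.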

PPO: clipped updates that don't blow up

Plain policy gradient is wasteful: it draws a batch of expensive rollouts and takes one tiny gradient step before throwing them away. We'd like several steps per batch. But after the first step the policy has changed, so the batch is now stale — it was sampled from the old policy $\pi_{\text{old}}$ , not the current $\pi_\theta$ . Importance sampling corrects for this with the probability ratio

$$\rho_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)},$$

which is $1$ when nothing has changed and drifts away as the policy moves; the surrogate objective to maximize becomes $\rho_t\,\hat A_t$ . The danger: if $\rho_t$ wanders far from $1$ , a single update can lurch the policy somewhere catastrophic. PPO (Proximal Policy Optimization) tames this with a clipped surrogate:

$$\mathcal{L}^{\text{PPO}}_t = \min\!\Big(\rho_t\,\hat A_t,\;\; \operatorname{clip}(\rho_t,\,1-\epsilon,\,1+\epsilon)\,\hat A_t\Big), \qquad \epsilon\approx 0.2 .$$

Two pieces. The $\operatorname{clip}$ confines the ratio to $[1-\epsilon,\,1+\epsilon]=[0.8,\,1.2]$, capping how far one update can move things. The outer $\min$ makes it pessimistic: when an action looks good ($\hat A_t>0$), the objective stops rewarding the ratio once it exceeds $1+\epsilon$ (no runaway); when it looks bad ($\hat A_t<0$), the objective stops rewarding further decreases once the ratio falls below $1-\epsilon$, but if the ratio has risen above $1+\epsilon$ the unclipped term wins the $\min$ and you still pay full price to push it back down. The net effect is a trust region: updates stay in a safe neighborhood of $\pi_{\text{old}}$ where the stale data is still trustworthy. PPO computes its per-token advantages $\hat A_t$ with a learned value head (a “critic”: a second network, usually the size of the policy, that predicts $V(s_t)$). The full PPO-for-LLMs loop therefore juggles four models: the policy being trained, the critic value head, the frozen reward model, and the frozen reference for the KL term. That is a lot of GPU memory.

The clip is easiest to see as a picture. The plot below shows the surrogate $\mathcal{L}^{\text{PPO}}_t$ for a good action ( $\hat A_t>0$ ) as the ratio $\rho_t$ moves. Below $1+\epsilon$ the objective rises with $\rho_t$ (reinforce the action), but once $\rho_t$ passes $1+\epsilon$ the line goes flat: extra increase earns nothing, so the gradient is zero and the update stops. That flat ceiling is the trust region's edge — PPO simply refuses to be tempted into a huge, destabilizing step.
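The same curve can be tabulated numerically. The following is a small sketch of the per-token clipped surrogate in plain Python, with $\epsilon=0.2$ as in the text; the function names and the sampled ratios are illustrative choices.

```python
import math

def importance_ratio(logp_new: float, logp_old: float) -> float:
    """rho = pi_theta(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)

def ppo_clipped_term(rho: float, advantage: float, eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single token."""
    clipped = max(1.0 - eps, min(rho, 1.0 + eps))
    return min(rho * advantage, clipped * advantage)

# Good action (A = +1): the objective rises with rho, then goes flat at 1 + eps.
good = [ppo_clipped_term(r, advantage=1.0) for r in (0.5, 1.0, 1.5, 2.0)]

# Bad action (A = -1): flat once rho falls below 1 - eps,
# but unclipped (full penalty) when rho has risen above 1 + eps.
bad = [ppo_clipped_term(r, advantage=-1.0) for r in (0.5, 0.7, 1.0, 1.5)]

# An unchanged policy gives a ratio of exactly 1.
rho_same = importance_ratio(-1.3, -1.3)

print(good)
print(bad)
print(rho_same)
```

For the good action the values at $\rho_t=1.5$ and $\rho_t=2.0$ are identical, which is the flat ceiling; for the bad action the values at $\rho_t=0.5$ and $\rho_t=0.7$ are identical, while $\rho_t=1.5$ is penalized in full.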

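[Figure: the PPO clipped surrogate $\mathcal{L}^{\text{PPO}}_t$ plotted against the ratio $\rho_t$ for a good action ($\hat A_t>0$)]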

GRPO: drop the critic, use a group baseline

The critic exists only to supply a baseline $V(s)$ . GRPO (Group Relative Policy Optimization, introduced with DeepSeekMath and made famous by DeepSeek-R1) makes a sharp observation: we can get a baseline for free by sampling a whole group of $G$ answers to the same prompt and using the group's own mean reward as the baseline. No second network needed — the group is its own yardstick. The group-relative advantage of answer $i$ is its reward standardized within the group:

$$A_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)} {\operatorname{std}(r_1,\dots,r_G)},$$

and this single scalar is then broadcast to every token of answer $i$. The reading is plain: “how many standard deviations above the group average was this answer?” Answers that beat their peers get pushed up; answers that lose get pushed down. GRPO keeps PPO's clipped ratio and a KL term but deletes the value network, roughly halving the memory spent on trainable networks. That is why it became the workhorse for RLVR (RL with verifiable rewards): math and code, where the reward is simply “is the final answer correct?” and no learned RM is needed at all.

Hands-on · GRPO advantages for one group

Prompt: “ $10\times10=?$ ”. The policy samples $G=5$ answers; a verifier gives reward $1$ for correct, $0$ for wrong, yielding $r=\{1,0,0,1,0\}$ . Compute the baseline and the advantages.

\begin{align*} \operatorname{mean}(r) &= \tfrac{1+0+0+1+0}{5} = 0.4, \\ \operatorname{std}(r) &= \sqrt{\tfrac{2(1-0.4)^2 + 3(0-0.4)^2}{5}} = \sqrt{\tfrac{0.72+0.48}{5}} = \sqrt{0.24} \approx 0.49. \end{align*}

Now standardize each reward:

\begin{align*} \text{correct answers } (r_i=1):\quad A_i &= \tfrac{1-0.4}{0.49} \approx +1.22,\\ \text{wrong answers } (r_i=0):\quad A_i &= \tfrac{0-0.4}{0.49} \approx -0.82. \end{align*}

Every token of the two correct answers is nudged up with weight $+1.22$; every token of the three wrong ones is nudged down with weight $-0.82$. The policy learns “do more of what the winners did” with no critic, no value head, and no GAE, just the group acting as its own baseline. (Note: if all five answers had been correct, the mean would be $1$, every numerator $r_i-\operatorname{mean}(r)$ would be $0$, and the standard deviation would be $0$ as well, so implementations typically add a small constant to the denominator and every advantage comes out as $0$. There is nothing to learn from that prompt, which is why some GRPO variants discard such “all-same” groups.)
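The worked example above can be reproduced in a few lines. This sketch uses the population standard deviation (dividing by $G$, as in the hand calculation) and adds a small `eps` to the denominator; the value of `eps` and the function name are assumptions for illustration, and real implementations may differ in both.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population variance (divide by G), matching the worked example.
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# The group from the worked example: two correct answers, three wrong.
adv = grpo_advantages([1, 0, 0, 1, 0])

# An "all-same" group: every numerator is 0, so every advantage is 0
# and the prompt contributes no learning signal.
all_same = grpo_advantages([1, 1, 1, 1, 1])

print([round(a, 2) for a in adv])
print(all_same)
```

Rounded to two decimals, the correct answers get $+1.22$ and the wrong ones $-0.82$, matching the hand calculation, and the standardized advantages of a group always sum to zero.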

[Figure: rewards of the five sampled answers plotted against the group-mean baseline]

The group's mean reward is the baseline (dashed line); each answer's advantage is how far its reward sits above or below it. The two correct answers earn positive advantage and are reinforced; the three wrong ones are suppressed.

What to watch for

Three habits separate a working RLHF run from a broken one, and they preview the questions ahead:

The reward is a proxy, not the goal. The signature failure is over-optimization: the RM score climbs steadily while true (human / gold-RM) quality plateaus, then drops. The standard detector is gold-vs-proxy divergence on a holdout set; the standard knob is the KL budget — if the policy has drifted far from $\pi_{\text{ref}}$ (high KL) and humans like it less, $\beta$ is too small.
Watch reward, not loss. RL “loss” curves are nearly meaningless — they oscillate around zero by construction. The numbers that matter are mean reward / accuracy, KL to the reference, and generation entropy. Entropy collapsing toward zero warns that the policy has gone deterministic and will stop improving.
KL is the master dial. It is what separates “aligning a model” from “finding an adversarial input to your own reward model.” For open-ended helpfulness the KL leash is essential; for verifiable math/code (RLVR) it is often loosened or dropped, since a correctness check is much harder to hack than a learned RM.
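Two of the monitoring quantities named above, KL to the reference and generation entropy, can be sketched in a few lines. The code below assumes access to per-token log-probabilities of sampled tokens under both models and to a full next-token distribution; the helper names and numbers are hypothetical, and the KL estimate shown is the simple sample mean of the log-ratio.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_estimate(logp_policy, logp_ref):
    """Sample estimate of KL(policy || ref): mean of log pi - log pi_ref
    over tokens that were sampled from the policy."""
    diffs = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return sum(diffs) / len(diffs)

# Entropy is maximal for a uniform distribution and 0 for a deterministic one.
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])
h_peaked = entropy([0.97, 0.01, 0.01, 0.01])
h_collapsed = entropy([1.0, 0.0, 0.0, 0.0])

# Assumed log-probabilities for four sampled tokens.
kl_none = kl_estimate([-0.5, -1.0, -0.2, -0.7], [-0.5, -1.0, -0.2, -0.7])
kl_drift = kl_estimate([-0.5, -1.0, -0.2, -0.7], [-0.9, -1.8, -1.1, -0.7])

print(h_uniform, h_peaked, h_collapsed)
print(kl_none, kl_drift)
```

Tracked over training, a falling entropy together with a rising KL estimate is the pattern the second and third habits warn about.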

With these pieces in hand — comparisons $\to$ Bradley–Terry reward model $\to$ KL-regularized objective $\to$ policy gradients $\to$ PPO's clipped trust region $\to$ GRPO's group baseline — the detailed questions that follow (GAE, the k1-vs-k3 KL estimators, Dr. GRPO and DAPO, clip-higher, over-optimization curves, and RLVR) will read as variations on parts you have already met.