Chapter 11Part III · Post-Training & Alignment

Reasoning & Test-Time Compute

8 practice sets · 4 coding problems

A plain language model answers a question the way you might blurt out the first thing that comes to mind: it reads the prompt and starts emitting the answer immediately, one token at a time. For looking up a fact, paraphrasing a paragraph, or writing a polite email, that reflex works beautifully. But ask it to solve a competition math problem, untangle a multi-step logic puzzle, or debug a tricky piece of code, and the same model that aced the easy tasks falls apart. The reason is not that it “doesn't know” the answer in some deep sense — it is that we never gave it room to work. This mini-chapter is about that room: the idea that you can spend more computation at inference time to get better answers, and the training tricks (chain-of-thought, self-consistency, verifiers, and reinforcement learning from verifiable rewards) that teach a model to spend that compute well. By the end you should understand why “thinking longer” is a real lever, how to measure it, and how today's reasoning models (o1, DeepSeek-R1) are made.

The big picture: why a transformer cannot “think harder” in place

Start from one structural fact about the architecture you met in Topic 1. A transformer does a fixed amount of computation to produce each token: the input flows through a fixed number of layers $L$ , does a fixed number of matrix multiplies, and out comes one token's worth of output. There is no internal loop, no “let me ponder this for a while” that runs the same layer a hundred times. The depth is baked in at training time and cannot grow when the problem gets harder.

This has a sharp consequence. A genuinely hard problem — one that needs, say, twenty interdependent reasoning steps — cannot be solved inside a single forward pass, because twenty serial steps simply do not fit in $L$ fixed layers. So how can a fixed-depth network ever do a long calculation? There is exactly one escape hatch: produce more tokens. Each token the model writes is appended to its own context and can be read by every token that comes after it. So if the model writes down step one, then step two can see step one when it is generated, step three can see steps one and two, and so on. The sequence of tokens becomes a scratchpad, and the network borrows serial depth from the length of that scratchpad instead of from its (fixed) number of layers.

Loading diagram…

This reframes what “model accuracy” even means. It is no longer a single number stamped on the model. It is a knob: let the model write more, sample it more times, search over its options, and accuracy generally goes up. That knob is test-time compute (equivalently, inference-time compute or inference-time scaling) — the extra computation you spend while generating, as opposed to the compute spent once during pretraining. The central trade of this topic is simple to state and surprisingly deep in its consequences: you can trade tokens (compute) for accuracy.

Chain-of-thought: serial computation written down as tokens

A chain of thought (CoT) is exactly the scratchpad from the intuition box: the intermediate reasoning a model writes before its final answer. Instead of emitting a bare “ $888$ ,” a model using CoT emits the working — “ $37\times 20=740$ , $37\times 4=148$ , sum $=888$ ” — and only then the answer. The phrase comes from a famous early discovery: if you simply append “Let's think step by step” to a prompt, a large model will often spontaneously lay out intermediate steps and get far more answers right, with no retraining at all. That “zero-cost” prompt trick is where the field started.

Why does this help so much? Two reasons, and it is worth separating them.

First, serial depth (the mechanical reason). As argued above, the answer to a hard problem may require more sequential computation than $L$ layers can do at once. CoT spreads that computation across many tokens: each step is a shallow computation, but the chain of steps is deep. We can write this as a latent-variable factorization. Let $q$ be the question, $a$ the final answer, and $z$ the reasoning trace (the chain). A reasoning model is really computing

p(a\mid q) \;=\; \sum_{z} p(a\mid q,z)\,p(z\mid q),

which says: to answer, first generate a reasoning path $z$ , then read off the answer conditioned on both the question and that path. A good $z$ makes the final step $p(a\mid q,z)$ almost trivial — once the working is on the page, the answer is just the last line.

Second, conditioning on its own work (the information reason). Because of the causal mask (Topic 1), every token attends to all earlier tokens. So when the model writes step three, it is literally reading steps one and two as part of its input. The scratchpad becomes working memory: partial results are stored as tokens and recalled by attention later. The model can decompose, sub-compute, and recombine — none of which is possible if it must commit to an answer in a single shot.

Early CoT was elicited by prompting. Modern reasoning models are trained to do it: they produce long, self-correcting chains — often wrapped in delimiters like <think> … </think> so the scratchpad is separable from the final answer — that explore, backtrack (“wait, that's wrong, let me redo it”), and double-check before committing. The rest of this chapter is largely about how that training works and how to measure whether it is paying off.

Two axes of spending compute: parallel and sequential

Once accuracy responds to compute, the practical question becomes how to spend it. There are two orthogonal axes, and conflating them is a common source of confusion.

Sequential compute makes one chain longer: more reasoning steps, more self-checking, more backtracking inside a single trajectory. Later tokens build on earlier ones, so this adds genuine serial depth. The risk is fragility — one wrong turn early can poison everything downstream, and the chain has no way to “start over” except by spending tokens to notice and correct the mistake.

Parallel compute draws many independent samples for the same prompt (decoding at temperature $>0$ so they differ), then selects or aggregates among them. The samples never see each other, so a bad start in one does not infect the others — but you now need a selection rule to collapse $n$ candidate answers into one. The two canonical rules are:

Self-consistency (majority vote): sample $n$ chains, extract each final answer, return the most common one. It rests on a bet — correct chains tend to agree (there's usually one right answer they converge on), while wrong chains scatter across many different mistakes. So the right answer is often the mode. No external scorer needed; this is the workhorse of practical reasoning systems.
Best-of- $n$ with a verifier: sample $n$ chains, score each one with a verifier, and keep the highest-scoring. Unlike voting, this can surface a rare correct answer that the majority got wrong — but only if the verifier is genuinely good at telling right from wrong.

Loading diagram…

Three independent chains, two of which land on answer $A$ , so a majority vote returns $A$ even though chain 2 went astray. The right panel shows sequential compute: a single, longer trajectory that catches and fixes its own mistakes as it goes. Real systems mix both — e.g. sample $32$ long chains and majority-vote them.

Measuring it: pass@ $k$ , coverage, and the ceiling on selection

To reason carefully about parallel compute we need a metric for “did the right answer show up at all.” That metric is pass@ $k$ : the probability that at least one of $k$ samples is correct. If each sample is independently correct with probability $p$ , then the chance that all $k$ are wrong is $(1-p)^k$ , so

\mathrm{pass@}k \;=\; 1-(1-p)^k .

This climbs toward $1$ fast. With $p=0.3$ , a single sample is right only $30\%$ of the time, but $\mathrm{pass@}8 = 1-0.7^8 \approx 0.94$ — the right answer appears somewhere in eight tries almost always. The single-sample number, $\mathrm{pass@}1 = p$ , is the bottom of this range and is what you get with one greedy decode.

The key insight is that pass@ $k$ measures coverage: does the correct answer appear anywhere among the $k$ samples? Coverage is the absolute ceiling for any selection method, because no verifier, however brilliant, can choose a correct answer that was never generated. So the score you actually achieve factors into two pieces:

\text{achieved accuracy} \;=\; \underbrace{\text{coverage}}_{\text{did the answer appear?}} \times \underbrace{\text{selection quality}}_{\text{did we pick it?}} .

Test-time scaling (more samples, longer chains) raises coverage; verifiers and voting fight over selection. The gap between $\mathrm{pass@}k$ and what your selector delivers is a direct diagnostic of selector quality: if $\mathrm{pass@}64 = 0.9$ but best-of- $64$ only gives $0.6$ , the answer was present $90\%$ of the time and your verifier failed to find it in a third of those cases. The problem is the verifier, not the model.

Coverage vs. selection. $\mathrm{pass@}k = 1-(1-p)^k$ is what a perfect oracle could get by picking among $k$ samples; it only ever rises with $k$ . What you actually get is $(\text{coverage}) \times (\text{how well your selector finds the good sample})$ . Spending more test-time compute improves coverage; better verifiers and smarter voting improve selection. Always report both — a beautiful selector on a model that never covers the answer is worthless, and vice versa.

Loading diagram…

Hands-on · majority vote actually beats a single sample

Suppose each independent sample is correct with $p=0.6$ , and when a sample is wrong it picks some random wrong answer (so wrong samples rarely agree with each other, but correct ones all agree). Take $n=3$ samples and majority-vote. The vote is correct exactly when at least $2$ of the $3$ are correct:

\begin{align*} P(\text{vote correct}) &= \binom{3}{2}(0.6)^2(0.4) + \binom{3}{3}(0.6)^3\\ &= 3(0.36)(0.4) + 0.216 \;=\; 0.432 + 0.216 \;=\; 0.648 . \end{align*}

So voting lifts accuracy from $0.60$ to $\mathbf{0.648}$ — a free gain just from sampling three times and counting. The effect compounds: this is the Condorcet jury theorem, which says that as long as $p>\tfrac12$ , majority vote over $n$ i.i.d. samples converges to $100\%$ as $n\to\infty$ . The catch is the condition $p>\tfrac12$ : if your model is wrong more often than right ( $p<\tfrac12$ ), more votes make it worse, driving accuracy toward $0$ . Voting amplifies a model that is better than a coin flip and punishes one that is not.

Verifiers: scoring outcomes vs. scoring process

A verifier is a model or rule that scores how good a candidate answer (or reasoning step) is, so that best-of- $n$ has something to rank by. Verifiers come in two flavors that differ in where they assign credit, and the distinction matters a great deal.

An outcome reward model (ORM) scores only the final answer: one label per whole trajectory. It is cheap to build — you only need to know whether the final answer was right — but it gives no signal about where a long chain went wrong. It is a sparse, end-of-episode reward: a perfect-looking $50$ -step derivation with one fatal error in step $7$ and a correct-looking-but-wrong answer gets the same single thumbs-down as a chain that was nonsense from the start.
A process reward model (PRM) scores each step, giving dense, per-step feedback. This is far better for credit assignment — it can point at the exact step that broke — and it enables search over partial chains (keep the promising prefixes, prune the rest). The price is labels: step-level correctness is expensive and noisy to annotate. Humans are slow and disagree; the cheaper automated route is to ask, for a given prefix, “does a correct final answer remain reachable from here?” by rolling out many continuations and checking how often they succeed — cheaper, but noisier.

Loading diagram…

Hands-on · how a PRM localizes a mid-chain error

A PRM emits a score $s_i\in[0,1]$ for each of a chain's steps; say a $4$ -step solution gets $s=(0.9,\,0.8,\,0.2,\,0.7)$ — step $3$ looks broken. How do we turn four step-scores into one trajectory score for ranking? Three common aggregators:

Product: $0.9\times0.8\times0.2\times0.7 = 0.1008$ . One bad step tanks the whole score — good.
Minimum: $\min = 0.2$ . Even more sharply pinned to the worst step — also good at flagging mid-chain failure.
Last step: $0.7$ . Misses it entirely, because the chain “recovered” a confident-looking final step. Bad for catching errors buried in the middle.

The lesson: min and product aggregation surface the weakest link, which is exactly what you want from process supervision; last-step (or mean) aggregation can be fooled by a chain that looks fine at the end. This is the whole reason a PRM beats an ORM for hard, long chains — it sees the $0.2$ .

Loading diagram…

A third, increasingly common kind is the generative / LLM-as-judge verifier: another language model reads the chain and writes a verdict. It is flexible and needs no special scoring head, but it inherits the judge's blind spots and is notoriously easy to fool with confident, well-formatted nonsense — a real failure mode when the thing being judged is itself an LLM optimized to be persuasive.

Searching over reasoning steps

Best-of- $n$ treats each chain as an indivisible unit. But if you have a PRM that scores partial chains, you can do something smarter: explore the space of reasoning step-by-step, keeping good partial paths and abandoning dead ends, exactly like search in a game tree. Three conceptual flavors, in increasing sophistication:

Beam search over steps. At each reasoning step, expand the current best $b$ partial chains by sampling several next-steps each, score all the new prefixes with the PRM, and keep only the top $b$ . You “breed” a small population of promising chains forward, pruning the rest. Beats plain sampling when the PRM is reliable enough to prune correctly.
Tree-of-thoughts. Generalize the chain into a tree: from any partial state, branch into several candidate next-thoughts; let the model (or a value estimate) evaluate states; explore promising branches and backtrack from bad ones. Good for problems where you genuinely need to try, fail, and back out — puzzles, planning.
Monte-Carlo tree search (MCTS). The AlphaGo-style algorithm: balance exploring new branches against exploiting known-good ones, using rollouts to estimate how promising each partial chain is, and concentrate compute on the most promising subtree. The most powerful and the most expensive.

Loading diagram…

From the question $q$ , the model branches into candidate first steps. A PRM (or value estimate) scores each partial path: dead ends ( $\times$ ) are pruned, promising prefixes are expanded further, and compute concentrates on the branch that reaches a verified solution ( $\checkmark$ ). Self-consistency and best-of- $n$ are the special case where the “tree” is just $n$ separate root-to-leaf chains with no pruning in between.

The o1 / DeepSeek-R1 paradigm: RL from verifiable rewards

Prompting and selection get you only so far; the big jump in 2024–2025 came from training models to produce good long chains in the first place. The key idea is RLVR — reinforcement learning from verifiable rewards (a term coined in the Tülu 3 work). The insight: for domains where correctness is checkable by a rule — math with a known final answer, code that must pass unit tests — you do not need a learned reward model at all. A tiny deterministic program returns the reward:

r \;=\; \begin{cases} 1 & \text{boxed answer matches the key (and format is valid)},\\[2pt] 0 & \text{otherwise.}\end{cases}

Compare this with RLHF for helpfulness (Topic 9/10), whose reward is a learned model of fuzzy human preference. A learned reward model can be gamed: the policy can discover inputs that fool the scorer without truly being better (reward hacking). A rule that just checks “is the answer $77$ ?” has nothing to fool — there is no learned scorer to exploit — which is precisely why RLVR is so robust and so powerful in verifiable domains. The flip side: it only works where a verifier exists, which is why early reasoning models are so concentrated on math, code, and logic.

GRPO is the workhorse. The dominant algorithm (introduced with DeepSeek's models; see Topic 9 for the full derivation) is Group Relative Policy Optimization. Its trick is to throw away PPO's value-network critic and let a group of samples be its own baseline. For each prompt, sample a group of $G$ completions, score them all with the rule, and convert each reward into a z-score within the group:

\hat A_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)} .

A completion that beat its group-mates gets a positive advantage (push its tokens up); a below-average one gets a negative advantage (push down). No critic, no value function — just relative ranking inside the batch, fed into a policy-gradient update. Because the verifiable reward is so cheap, you can run this loop for a very long time on the same problems.

Loading diagram…

The emergent result. DeepSeek's R1-Zero produced a striking finding. Run this loop on a base model with only rule-based rewards — no supervised reasoning examples at all — and long chains, self-checking, and backtracking emerge on their own. The model is never told “write more steps”; it discovers that writing more careful reasoning earns more reward, so its responses grow longer over training and it spontaneously develops an “aha moment” style of catching its own errors mid-chain. Concretely, on the AIME 2024 math competition, R1-Zero's single-sample accuracy climbed from $15.6\%$ to about $71\%$ over RL training, and rose to $86.7\%$ with self-consistency (majority voting) — matching a contemporary o1 model. The follow-up R1 added a small cold-start SFT stage on a few thousand curated reasoning traces before the RL, which fixed R1-Zero's cosmetic problems (it mixed languages and produced messy, hard-to-read chains) without sacrificing the emergent reasoning. This SFT $\to$ RL ordering — a little supervised polish, then heavy RLVR — is the template most open reasoning models now follow.

Loading diagram…

Intuition

Why does “reward only the final answer” produce elaborate intermediate reasoning the trainer never asked for? Because longer, more careful chains genuinely raise the odds the final answer is right (recall the latent-variable view: a good $z$ makes $a$ easy). RL is an optimizer that finds whatever behavior raises reward. If writing a self-check before answering catches mistakes and thus earns more $1$ s than $0$ s, gradient ascent will reinforce self-checking — even though “self-check” was never in the training labels. The reasoning is a means the model invents to the end of being correct. This is also why it is fragile: the model will just as happily invent reward-hacking shortcuts if those raise reward, which is why the unhackability of rule-based rewards matters so much.

Inference-time scaling laws: how much compute, and where

Because accuracy now responds smoothly to compute, you can plot accuracy vs. test-time compute and get a scaling curve — typically rising roughly with the logarithm of the number of samples or tokens, then flattening as it approaches the coverage ceiling. The shape of that curve is what makes test-time compute a planning tool rather than a mystery.

Loading diagram…

Three things to read off this picture. (1) Coverage (solid) shoots up and saturates near $1$ — the answer is almost always present with enough samples. (2) What you achieve (dashed) trails coverage and grows only logarithmically, so each doubling of compute buys a roughly constant accuracy bump until it too flattens. (3) The dotted line is a bigger model answering once: for small budgets the small-model-thinking-longer curve sits below it, but past some crossover the small model plus test-time compute overtakes the big model's single shot. That crossover is the heart of the compute-optimal test-time question: sometimes it is cheaper to let a small model think than to serve a giant one.

Given a fixed token budget $B$ , should you spend it on more samples (parallel) or a longer chain (sequential)? Model accuracy as $\mathrm{acc}=f(n)\cdot g(\ell)$ , where $n$ is the number of samples and $\ell$ the chain length, with $B\approx n\,\ell$ . Both $f$ and $g$ have diminishing returns, so the optimum spends the marginal token wherever its log-derivative is currently larger: keep adding samples while $\partial\log f/\partial n$ exceeds $\partial\log g/\partial\ell$ , then switch to longer chains. The qualitative rule of thumb: easy problems are coverage-limited — a few samples cover the answer, so go shallow and wide; hard problems are depth-limited — one good long chain beats many short stabs, so go deep and narrow.

There is also a deliberate lever for controlling chain length called budget forcing (from the s1 work): you cap or extend the model's thinking by editing its generation directly — forcing it to stop after a token budget, or, when it tries to stop early, appending the word “Wait” to make it keep going and double-check. Remarkably, on a model fine-tuned on just $1{,}000$ curated reasoning examples, simply appending “Wait” a few more times raised accuracy on competition math — a vivid demonstration that the chain length itself is a usable knob.

Distillation: pouring a big reasoner into a small one

There is one more way to move reasoning around, and it is cheap and effective. Once you have an expensive RL-trained reasoner, you can distill it: have the big model generate many long reasoning traces on a set of problems, keep the ones that reach correct answers, and supervised-fine-tune a much smaller model on those traces. The small model never does any RL of its own; it simply imitates the big model's chains. DeepSeek showed this works strikingly well — small dense models fine-tuned on R1's traces inherited much of its reasoning ability, often beating same-size models that were trained with RL directly. The intuition: RL is the expensive part (it has to discover good reasoning by trial and error); once the good chains exist, copying them is just imitation learning. The limit is that distillation transfers the patterns the teacher already found — it cannot push past the teacher, and a student lacking the teacher's underlying knowledge may mimic the form of reasoning without the substance.

What to watch for

This topic is unusually full of traps, because the metric you optimize (reward) and the thing you want (capability) can quietly diverge. A few recurring tensions the questions in this topic will keep returning to:

The loss curve lies. In reasoning RL there is no held-out loss that tracks capability — reward can climb while the model learns to game it. Track several signals at once: $\mathrm{pass@}1$ and $\mathrm{pass@}k$ , response length, output entropy, format compliance, and held-out benchmark accuracy — never reward alone.

Length is not thinking. A notorious trap (the Dr. GRPO critique): GRPO's reward normalization can introduce a length bias that inflates response length — especially on wrong answers — without improving accuracy. So “responses keep getting longer but AIME accuracy has plateaued” is the signature of verbosity, not capability. Diagnose it with token-efficiency curves (accuracy per token) and by comparing accuracy at matched token budgets.

Temperature is two-faced. RL training wants temperature high to explore diverse chains; evaluation often wants it low to focus. If greedy decoding looks great but temperature- $1.0$ looks bad, you are seeing a high-variance policy — report both, and report $\mathrm{pass@}k$ at a stated temperature, not one cherry-picked decode.

Reward hacking and unfaithful chains. With a learned (gameable) verifier, “more samples” can eventually find answers that beat the verifier rather than solve the problem — inference-time reward hacking — so watch reward-vs-true-accuracy divergence on a dashboard. And a subtle safety point: a model's written CoT may be unfaithful — it can reach an answer by one internal computation and narrate a tidier, different-looking one. A plausible-looking chain is not proof of how the answer was actually produced, which matters whenever the chain is used to audit or trust the model. Keep these tensions in mind and the detailed questions — on PRMs, GRPO length bias, compute-optimal allocation, verifier exploitation, and CoT faithfulness — will read as variations on the ideas you have already met here.

The big picture: why a transformer cannot “think harder” in place

Chain-of-thought: serial computation written down as tokens

Two axes of spending compute: parallel and sequential

Measuring it: pass@kkk, coverage, and the ceiling on selection

Verifiers: scoring outcomes vs. scoring process

Searching over reasoning steps

The o1 / DeepSeek-R1 paradigm: RL from verifiable rewards

Inference-time scaling laws: how much compute, and where

Distillation: pouring a big reasoner into a small one

What to watch for

Measuring it: pass@ $k$ , coverage, and the ceiling on selection