Chapter 04 · Part II · Pretraining & Scale

Pretraining Objectives & Scaling Laws

8 practice sets · 4 coding problems

Topic 1 took a transformer apart and showed that all it ever computes, at each position, is a probability distribution over the next token. This chapter is about the other half of the story: how that machine is trained, and how we predict in advance how good it will get. Both halves turn out to be governed by a single objective so simple it is almost suspicious — guess the next token — and by a family of startlingly regular empirical curves, the scaling laws, that let an engineer forecast the loss of a multi-million-dollar run from a handful of cheap small ones. By the end you should be able to (i) write down and explain the pretraining loss and the three numbers everyone quotes about it (cross-entropy, perplexity, bits-per-token); (ii) sketch where a base model's training data comes from and why its quality matters as much as its quantity; (iii) explain the $C\approx 6ND$ compute rule and derive the factor $6$ ; and (iv) take a fixed compute budget and turn it into a concrete “train an $N$ -parameter model on $D$ tokens” recipe using the Chinchilla compute-optimal frontier.

The one objective: predict the next token

Before an LLM is ever turned into a chatbot, aligned, or taught to use tools, it goes through pretraining: it reads an enormous pile of raw text and learns to do one deceptively simple thing — guess the next token. Grammar, world facts, a little translation and arithmetic, a rough model of how the world hangs together — all of it falls out as a by-product of getting very good at that one game on enough data, with no human labels required.

Recall the setup from Topic 1. Raw text is chopped by a tokenizer into sub-word tokens drawn from a fixed vocabulary of size $V$ (a token is the atomic unit the model reads and predicts; one English word is on average $\approx 1.3$ tokens). A document becomes an integer sequence $x_1,x_2,\dots,x_T$ , and a decoder-only transformer models the probability of the whole sequence, factorized strictly left-to-right by the chain rule of probability:

p_\theta(x_1,\dots,x_T)=\prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{<t}\right),

where $\theta$ denotes all the model's learned parameters and $x_{<t}$ is shorthand for “every token before position $t$ .” At each position the model outputs a full probability distribution over the vocabulary for the next token, conditioned only on the tokens before it.

Cross-entropy: the loss that grades the guess

How do we grade a probability distribution against the token that actually came next? We use cross-entropy. At each position the model is scored on the (negative log) probability it assigned to the true next token; the training loss is the average of that over all positions:

\mathcal{L}(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta\!\left(x_t \mid x_{<t}\right).

Read it in plain words: at every step, look up the probability the model gave to the word that really came next, take its logarithm, flip the sign, and average. If the model put high probability on the truth, $\log p$ is near $\log 1=0$ and the loss is small; if it was confidently wrong, $p$ is tiny, $-\log p$ is huge, and it is punished hard. Minimizing this loss is identical to maximizing the likelihood of the real text — they are the same objective written two ways.

Why is this such a good deal? Because the causal (look-only-backward) mask from Topic 1 lets the transformer compute all $T$ of these conditional predictions in a single parallel forward pass. We therefore harvest $T$ training signals from one sequence essentially for free. And the “label” for position $t$ is just the input token at position $t{+}1$ — the data is its own answer key. This is what makes next-token prediction a self-supervised objective: any raw text, with no human annotation, is training data. That is the whole reason LLMs can be trained on trillions of tokens; nobody could ever label that much by hand.
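The "data is its own answer key" idea can be made concrete with a minimal sketch. The token ids below are hypothetical, and the helper name `next_token_pairs` is ours, not part of any library. Note that without a prepended start-of-sequence token, a sequence of length $T$ yields $T-1$ (context, target) pairs here, since the first token has no context.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs for causal next-token prediction.

    For every position t >= 1 the context is token_ids[:t] and the
    target is token_ids[t]; no external labels are involved.
    """
    return [(token_ids[:t], token_ids[t]) for t in range(1, len(token_ids))]


# Hypothetical token ids for a 5-token document.
tokens = [17, 4, 250, 9, 31]

# What a training batch typically stores: the sequence and the same
# sequence shifted left by one position.
inputs, labels = tokens[:-1], tokens[1:]

pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

The shifted `labels` list is exactly the list of targets in `pairs`, which is why no annotation step is needed.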

The choice of logarithm base only rescales the loss. With the natural log $\ln$ , the loss is in nats; with $\log_2$ , it is in bits; they differ by the constant factor $\ln 2\approx 0.693$ (one nat $=1/\ln 2\approx 1.4427$ bits). The objective and its minimizer are unchanged; only the unit on the $y$ -axis moves.

Perplexity and bits-per-token: the same number, made readable

Raw cross-entropy is hard to feel. Is a loss of $2.3$ good? Two derived quantities make it intuitive, and both are just the loss in disguise.

Perplexity exponentiates the loss, $\mathrm{PPL}=e^{\mathcal{L}}$ (with $\mathcal{L}$ in nats). It answers: at each step, how many equally-likely options does the model effectively feel torn between? A perplexity of $1$ means perfect prediction — all probability mass on the right token, so $\mathcal{L}=0$ and $\mathrm{PPL}=e^0=1$ . A model that has learned nothing and guesses uniformly over $V$ tokens has $\mathrm{PPL}=V$ : it is as confused as a fair $V$ -sided die. A good language model on English text lands somewhere in between — effectively choosing among a handful to a few dozen plausible continuations. Lower is better, and “the model is improving” always means perplexity (and loss) go down.

Bits-per-token is simply the loss measured in bits instead of nats: $\mathcal{L}/\ln 2$ . It has a beautiful interpretation through information theory: pretraining is compression, and bits-per-token is the model's compression rate. A model that predicts the next token well could be used to encode the text in few bits (the better you predict, the less surprise there is to transmit); a confused model needs more. This is the precise sense in which “learning” and “compressing the data” are the same activity.

Hands-on · loss, perplexity, and bits on a 3-token toy

A model reads three positions and assigns the true next token probabilities $0.5$ , $0.25$ , and $0.5$ . The per-token negative log-likelihoods (in nats) are $-\ln 0.5=0.693$ , $-\ln 0.25=1.386$ , $-\ln 0.5=0.693$ . The average cross-entropy is

\mathcal{L}=\tfrac{1}{3}(0.693+1.386+0.693)=\tfrac{1}{3}(2.772)=0.924\ \text{nats}.

Now convert. Perplexity: $\mathrm{PPL}=e^{0.924}\approx 2.52$ — the model is about as unsure as if it faced $2.5$ equally good choices at each step. Bits-per-token: $0.924/\ln 2\approx 1.33$ bits — on average it would take $1.33$ bits to encode each token under this model. A sanity check: if the model had instead been certain and correct every step ( $p=1$ ), every term would be $-\ln 1 = 0$ , giving $\mathcal{L}=0$ , $\mathrm{PPL}=1$ , $0$ bits — perfect prediction, nothing left to compress.
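The toy calculation above can be reproduced in a few lines. This is a small sketch using only the standard library; the function name `lm_metrics` and the vocabulary size `V` are illustrative assumptions.

```python
import math


def lm_metrics(true_token_probs):
    """Mean cross-entropy (nats), perplexity, and bits-per-token."""
    nll = [-math.log(p) for p in true_token_probs]
    loss_nats = sum(nll) / len(nll)
    return {
        "loss_nats": loss_nats,
        "perplexity": math.exp(loss_nats),
        "bits_per_token": loss_nats / math.log(2),
    }


# The three probabilities from the worked example.
toy = lm_metrics([0.5, 0.25, 0.5])
print({name: round(value, 3) for name, value in toy.items()})
# loss is about 0.924 nats, perplexity about 2.52, bits-per-token about 1.333

# A uniform guess over V tokens should have perplexity V.
V = 50_000  # hypothetical vocabulary size
uniform = lm_metrics([1.0 / V] * 10)
```

All three outputs come from the same list of probabilities, which is the sense in which they are one number in three units.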

Bits-per-byte (or bits-per-character) goes one step further: it re-expresses the loss per raw byte of the original text rather than per token. Why bother? Because two models with different tokenizers cut the same sentence into different numbers of tokens, so their per-token losses are not comparable — a model with a coarser vocabulary can “win” per token simply by predicting fewer, bigger chunks. The byte count is fixed by the text itself, not by anyone's tokenizer, so normalizing to it gives an apples-to-apples metric:

\text{bits-per-byte}=\frac{\mathcal{L}_{\text{nats/token}}}{\ln 2}\cdot\frac{\#\text{tokens}}{\#\text{bytes}}.
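A short sketch of the conversion, under stated assumptions: the two "tokenizers" and their losses below are invented purely for illustration, and bytes are counted on the UTF-8 encoding of the text.

```python
import math


def bits_per_byte(loss_nats_per_token, n_tokens, text):
    """Convert a per-token loss in nats to bits per UTF-8 byte of `text`."""
    n_bytes = len(text.encode("utf-8"))
    return (loss_nats_per_token / math.log(2)) * (n_tokens / n_bytes)


text = "Scaling laws are regular."  # 25 bytes in UTF-8

# Hypothetical model A: coarse tokenizer, 5 tokens, 2.0 nats per token.
bpb_a = bits_per_byte(2.0, 5, text)
# Hypothetical model B: fine tokenizer, 10 tokens, 1.2 nats per token.
bpb_b = bits_per_byte(1.2, 10, text)

print(round(bpb_a, 3), round(bpb_b, 3))
```

In this made-up case model B has the lower per-token loss but the higher bits-per-byte, which is the kind of reversal the normalization is meant to expose.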

Pretraining $=$ next-token prediction $=$ minimizing cross-entropy. Loss, perplexity ( $\mathrm{PPL}=e^{\mathcal{L}}$ ), and bits-per-token are three views of the same number ( $\mathcal{L}$ ); bits-per-byte is the one that survives a change of tokenizer and lets you compare models fairly.

A quick map of pretraining objectives

The topic's name is plural for a reason: next-token prediction is one point in a small design space, and it helps to see the alternatives so you know why the field settled where it did.

Causal / autoregressive LM (GPT, Llama, …): predict $x_t$ from $x_{<t}$ only. Generates text natively, one token at a time. This is what essentially all modern generative LLMs use.
Masked LM (BERT): randomly blank out $\sim 15\%$ of tokens and predict them from both sides of context. Great for producing representations (the encoder sees the whole sentence at once), but it cannot generate left-to-right and only trains on the masked fraction of tokens per pass, so it is far less sample-efficient as a generator.
Span corruption / denoising (T5): mask contiguous spans and have an encoder–decoder reconstruct them — a flexible middle ground for seq-to-seq tasks.
Prefix-LM: bidirectional attention over a given prefix, then causal generation after it — a hybrid that keeps full-context conditioning on the prompt.

The field converged on causal LM for general-purpose generative models because it (a) puts a learning signal on every token, (b) matches how the model is actually used (autoregressive generation), and (c) scales cleanly. The scaling laws below are stated for this causal objective, though the same methodology applies to any of them.

Where the tokens come from: the data pipeline

Trillions of training tokens do not exist in a clean folder somewhere; they are manufactured. At a high level the pipeline is a funnel that takes a vast, filthy pile of raw web text and squeezes out a much smaller, much cleaner stream worth training on.

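[Diagram: the data pipeline as a funnel, from web crawl through text extraction, quality filtering, and deduplication to mixture weighting.]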

Walk the funnel left to right. A web crawl (e.g. Common Crawl) yields petabytes of raw HTML — most of it boilerplate, menus, spam, and machine-generated junk. Text extraction strips the markup down to readable prose. Quality filtering then throws away the bulk of it: heuristic rules (is this English? too many symbols? a list of links?) and learned classifiers (does this “look like” a useful document?) keep only a small, high-value fraction. Deduplication removes near-identical copies — the web is enormously repetitive, and training on the same paragraph a thousand times wastes compute and encourages memorization. Finally, mixture weighting decides how much of each source to include: web text, code, math, books, multilingual data, each up- or down-weighted to hit target capabilities.
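Two of the funnel stages, heuristic filtering and deduplication, can be sketched in miniature. Everything here is a toy under loud assumptions: the thresholds in `looks_useful` are invented, and `dedup_key` only catches exact duplicates after case and whitespace normalization, whereas production pipelines typically use fuzzy near-duplicate detection.

```python
import hashlib
import re


def looks_useful(doc, min_words=20, max_symbol_ratio=0.3):
    """Toy heuristic filter; both thresholds are illustrative assumptions."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= max_symbol_ratio


def dedup_key(doc):
    """Hash of lowercased, whitespace-collapsed text (exact-duplicate key)."""
    normalized = re.sub(r"\s+", " ", doc.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(docs):
    """Apply the filter, then drop documents whose key was already seen."""
    seen, kept = set(), []
    for doc in docs:
        if not looks_useful(doc):
            continue
        key = dedup_key(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


article = "The quick brown fox jumps over the lazy dog. " * 5
raw = [article, article.upper(), "Home | About | Login", "$$$ >>> ### !!! " * 10]
cleaned = clean_corpus(raw)
print(len(raw), "->", len(cleaned))
```

Of the four inputs, the upper-cased copy is dropped as a duplicate, the menu line is too short, and the symbol-heavy line fails the ratio check, so only one document survives.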

Two non-obvious lessons drive practice here. First, quality beats quantity: a smaller, well-filtered dataset routinely trains a better model than a larger, dirtier one, because every junk token spends capacity and compute teaching the model nothing (or worse). Second, the mixture is a balancing act, not a max: upweighting code to boost coding implicitly downweights everything else and can quietly hurt other skills, so the proportions are tuned by small ablation runs rather than guessed. This is why “what data, in what proportions” is treated as seriously as any architectural choice — if architecture defines how a model learns, data defines what it learns.

Counting the cost: FLOPs and the $C\approx 6ND$ rule

To reason about “compute” we need a unit. A FLOP is one floating-point operation — a single multiply or a single add. The cost of a training run is the total number of FLOPs it consumes, a useful, roughly hardware-independent currency (a faster chip does the same FLOPs in less wall-clock time). The single most useful estimate in all of pretraining is

\boxed{\,C \approx 6\,N\,D\,}

where $C$ is total training FLOPs, $N$ is the number of model parameters (weights), and $D$ is the number of training tokens. These two knobs, $N$ and $D$ , are the entire game: making the model bigger ( $N$ ) or feeding it more data ( $D$ ).

Where does the factor $6$ come from? It is clean accounting, and worth seeing once. The dominant cost of a transformer is its matrix multiplications, and in a matmul every weight is used in exactly one multiply-and-add per token. A multiply-and-add is $2$ FLOPs. So:

Forward pass: each of the $N$ weights does $1$ multiply-add per token $\Rightarrow \approx 2N$ FLOPs per token.
Backward pass: computing gradients requires two passes' worth of work — one to propagate the gradient back to the activations, one to compute the gradient of each weight — so it costs about twice the forward pass, $\approx 4N$ FLOPs per token.

Adding them up: $2N+4N=6N$ FLOPs per token. Over $D$ training tokens that is $C\approx 6ND$ . (This counts only the weight matmuls and ignores attention's $O(T^2)$ term, which is negligible unless the context is very long.) As a single memorable anchor, a $7$ B model on $1$ T tokens costs $C\approx 6\times(7{\times}10^{9})\times 10^{12}\approx 4{\times}10^{22}$ FLOPs.
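The accounting above fits in a few lines. This is a sketch of the approximation only (weight matmuls, no attention term); the function name is ours. The Chinchilla and Gopher sizes are the ones quoted later in this chapter.

```python
def training_flops(n_params, n_tokens):
    """Approximate training FLOPs: 2N forward + 4N backward per token."""
    forward_per_token = 2 * n_params
    backward_per_token = 4 * n_params
    return (forward_per_token + backward_per_token) * n_tokens


c_7b = training_flops(7e9, 1e12)
print(f"{c_7b:.1e}")  # 4.2e+22

# Chinchilla (70B params, 1.4T tokens) vs. Gopher (280B params, ~300B tokens).
c_chinchilla = training_flops(70e9, 1.4e12)
c_gopher = training_flops(280e9, 300e9)
print(f"{c_chinchilla:.2e} vs {c_gopher:.2e}")
```

Under this estimate the two DeepMind models land within about 20% of each other in training compute despite a 4x difference in parameter count.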

Scaling laws: loss is a predictable function of $N$ and $D$

Here is the empirical discovery that made all of modern LLM development possible (Kaplan et al. 2020; Hoffmann et al. 2022, the “Chinchilla” paper). The held-out loss of a well-trained transformer falls as a smooth power law in model size $N$ and data $D$ — and it does so cleanly across many orders of magnitude. The Chinchilla parametric form captures the whole relationship with just five fitted constants:

L(N,D)=\underbrace{E}_{\text{irreducible}} +\underbrace{\frac{A}{N^{\alpha}}}_{\text{finite model}} +\underbrace{\frac{B}{D^{\beta}}}_{\text{finite data}} .

Read it term by term. $E$ is the irreducible loss — the entropy of language itself, the floor you could not beat even with an infinite model trained on infinite data, because real text is genuinely, partly unpredictable. $A/N^{\alpha}$ is the penalty for having only $N$ parameters; it shrinks as you add capacity. $B/D^{\beta}$ is the penalty for having seen only $D$ tokens; it shrinks as you train on more. The exponents $\alpha$ and $\beta$ say how fast each penalty melts away.

Hoffmann et al.'s fitted values, which everyone quotes, were

E\approx 1.69,\quad A\approx 406.4,\quad B\approx 410.7,\quad \alpha\approx 0.34,\quad \beta\approx 0.28 .

Because each penalty is a power law with a small positive exponent ( $\alpha,\beta<1$ ), returns diminish: each doubling of $N$ or $D$ buys a smaller drop in loss than the last. Doubling $N$ , for instance, multiplies the finite-model term by only $2^{-0.34}\approx 0.79$ . And because the curve is so smooth, the whole enterprise becomes predictable: you can fit the law on a ladder of small, cheap runs and extrapolate to the single huge run you actually want, often to within a few percent. That extrapolation is the reason anyone dares spend tens of millions of dollars on one training run.
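To see the diminishing returns numerically, the parametric form can be evaluated directly. This is a sketch that plugs in the fitted constants quoted above; the choice of model sizes and the fixed token count are illustrative.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """L(N, D) = E + A / N^alpha + B / D^beta with the quoted constants."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


# Hold D fixed and double N repeatedly: 1B, 2B, 4B, 8B, 16B parameters.
D_FIXED = 1.4e12
sizes = [1e9 * 2**k for k in range(5)]
losses = [chinchilla_loss(n, D_FIXED) for n in sizes]
drops = [before - after for before, after in zip(losses, losses[1:])]

for n, loss in zip(sizes, losses):
    print(f"N = {n:.0e}  L = {loss:.3f}")
print("drop per doubling:", [round(d, 4) for d in drops])
```

Each successive entry in `drops` is smaller than the one before it, and every loss stays above the floor `E`.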

[Figure: loss versus training compute (log scale), descending toward the irreducible floor $E$ .]

On log-log–style axes (compute on a log scale) the loss falls along a gentle, near-straight descent toward the dashed floor $E$ — the visual signature of a power law. It approaches but never reaches $E$ , no matter how much compute you pour in.

Compute-optimal training and the Chinchilla $\approx 20{:}1$ rule

Now the central practical question. Given a fixed budget $C=6ND$ , how should you split it between $N$ and $D$ ? You can buy a big model trained briefly, or a small model trained long — both can cost the same FLOPs. Which gives lower loss?

The way to find that balance experimentally is the IsoFLOP method (“iso” $=$ equal). Fix a compute budget $C$ . Now train several models of different sizes $N$ at that same budget: since $C$ is fixed and $C=6ND$ , a bigger $N$ automatically means fewer tokens $D=C/(6N)$ , and vice versa. Plot each model's final loss against its size. The curve is U-shaped: tiny models underfit (not enough brains), giant models are undertrained (they ran out of tokens before learning much), and the minimum in the middle marks the compute-optimal size for that budget. Repeat the whole sweep at several budgets, collect the minima, and you trace out the optimal frontier $N^\star(C),D^\star(C)$ .
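The shape of an IsoFLOP curve can be simulated without training anything. The sketch below uses the parametric loss law with Hoffmann et al.'s quoted constants as a stand-in for real training runs, which is an assumption: a real sweep trains and evaluates each model. The budget and the grid bounds are arbitrary choices.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    """Parametric loss law used here as a proxy for a trained model's loss."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def isoflop_sweep(c_flops, n_min=1e8, n_max=1e12, steps=200):
    """Return (N, D, loss) triples along one IsoFLOP curve, log-spaced in N."""
    curve = []
    for i in range(steps + 1):
        n = n_min * (n_max / n_min) ** (i / steps)
        d = c_flops / (6 * n)  # the budget fixes D once N is chosen
        curve.append((n, d, loss(n, d)))
    return curve


curve = isoflop_sweep(1e22)
best = min(curve, key=lambda row: row[2])
print(f"smallest model loss: {curve[0][2]:.3f}")
print(f"minimum: N = {best[0]:.2e}, D = {best[1]:.2e}, loss = {best[2]:.3f}")
print(f"largest model loss:  {curve[-1][2]:.3f}")
```

Both ends of the sweep have higher loss than the interior minimum, which is the U shape described above. The exact location of the minimum depends on the fitted constants, so it should not be read as a universal recommendation.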

This is exactly the experiment that produced Chinchilla. The headline result: Chinchilla (70B params, $1.4$ T tokens) beat Gopher (280B params, $\sim$ 300B tokens) using roughly the same compute, because Gopher was far too big for its data ( $\sim$ 1 token/param, badly undertrained), while Chinchilla sat near the sweet spot with more than $4\times$ the data. Across all three of the paper's analysis methods (varying tokens at fixed model size; IsoFLOP curves; and a direct parametric fit), the conclusion agreed: as the budget grows, grow $N$ and $D$ in roughly equal proportion. The famous summary is the

\textbf{$\approx 20$ tokens-per-parameter} \text{ heuristic: } \frac{D^\star}{N^\star}\approx 20 .

So the ratio is about $20$ — not $2$ , not $200$ .

Why “equal proportion” gives a constant ratio. The frontier exponents come straight from the algebra. Minimizing $L(N,D)$ subject to $C=6ND$ (a one-line Lagrange exercise; Topic 4's harder questions do it in full) gives optimal allocations that are themselves power laws of the budget,

N^{\star}\propto C^{a},\qquad D^{\star}\propto C^{b},\qquad a=\frac{\beta}{\alpha+\beta},\quad b=\frac{\alpha}{\alpha+\beta},\quad a+b=1 .

With Chinchilla's near-equal exponents $\alpha\approx\beta$ , both $a$ and $b$ are $\approx 0.5$ : doubling the budget should grow the model by roughly $\sqrt{2}\times$ and the data by roughly $\sqrt{2}\times$ . Because $N^\star$ and $D^\star$ then grow at nearly the same rate, their ratio $D^\star/N^\star$ stays roughly constant, and in Hoffmann et al.'s experiments that ratio lands near $20$ .
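The frontier exponents can be checked numerically. The sketch below computes $a$ and $b$ from the quoted $\alpha,\beta$ , then recovers $a$ independently by grid-searching the loss minimum along $C=6ND$ at two budgets. The budgets and grid resolution are arbitrary assumptions, and the loss law stands in for real runs.

```python
import math

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

a = BETA / (ALPHA + BETA)   # exponent of N* in C
b = ALPHA / (ALPHA + BETA)  # exponent of D* in C
print(round(a, 3), round(b, 3))  # 0.452 0.548


def best_n(c_flops, lo=8.0, hi=13.0, step=0.001):
    """Grid-search log10(N) for the loss minimum on the C = 6ND constraint."""
    best_log_n, best_loss = lo, float("inf")
    for i in range(int(round((hi - lo) / step)) + 1):
        log_n = lo + i * step
        n = 10.0**log_n
        d = c_flops / (6.0 * n)
        current = E + A / n**ALPHA + B / d**BETA
        if current < best_loss:
            best_log_n, best_loss = log_n, current
    return 10.0**best_log_n


# Two budgets a factor of 10^2 apart; the slope in log-log space estimates a.
n_small, n_large = best_n(1e21), best_n(1e23)
measured_a = math.log10(n_large / n_small) / 2.0
print(round(measured_a, 3))
```

With the quoted constants the exponents come out near $0.45$ and $0.55$ rather than exactly $0.5$ , so "equal proportion" is an approximation, and the grid-search slope agrees with the closed form to within the grid resolution.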

Hands-on · turn a FLOP budget into a model: $N^\star$ and $D^\star$

You have secured a cluster that will deliver $C=10^{24}$ FLOPs. What size model, on how many tokens, is compute-optimal?

Step 1: impose the recipe. Compute-optimal means $D=20N$ . Substitute into the cost rule:

C=6ND = 6\,N\,(20N) = 120\,N^{2}.

Step 2: solve for $N$ .

N=\sqrt{\frac{C}{120}}=\sqrt{\frac{10^{24}}{120}}=\sqrt{8.33{\times}10^{21}}\approx 9.13{\times}10^{10}.

So $N^\star\approx 91$ billion parameters.

Step 3: get $D$ from the ratio.

D=20N\approx 20\times 9.13{\times}10^{10}\approx 1.83{\times}10^{12}\ \text{tokens} \;(\sim 1.8\text{T}).

Step 4: sanity-check the budget. $6ND = 6\times 9.13{\times}10^{10}\times 1.83{\times}10^{12}\approx 1.0{\times}10^{24}$ FLOPs.

One line of arithmetic has turned a FLOP budget into a concrete “ $91$ B model on $1.8$ T tokens” recipe, which is the entire reason scaling laws exist. As a final cross-check on the loss law itself, plugging $N=7{\times}10^{9}$ , $D=1.4{\times}10^{12}$ into the parametric form gives $L\approx 1.69 + 406.4/(7{\times}10^{9})^{0.34} + 410.7/(1.4{\times}10^{12})^{0.28}\approx 1.69+0.18+0.16\approx 2.04$ nats.
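The same steps can be wrapped in a small helper. This is a sketch of the $D=20N$ heuristic only; the function name and the default ratio argument are ours. As a check, feeding in Chinchilla's own budget (about $5.88\times 10^{23}$ FLOPs, from $6\times 70\text{B}\times 1.4\text{T}$ ) should return roughly 70B parameters and 1.4T tokens.

```python
import math


def compute_optimal(c_flops, tokens_per_param=20.0):
    """Solve C = 6*N*D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(c_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n, d = compute_optimal(5.88e23)
print(f"N = {n:.2e}, D = {d:.2e}")  # N = 7.00e+10, D = 1.40e+12
```

Passing a different `tokens_per_param` gives the allocation for a deliberately over- or under-trained model at the same budget.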

Kaplan vs. Chinchilla: what changed

If scaling laws were the breakthrough of 2020, why is the Chinchilla paper of 2022 the one everyone cites? Because the two papers gave opposite practical advice. The earlier Kaplan et al. (2020) law concluded you should pour most of a growing budget into model size, scaling $N$ much faster than $D$ ( $N^\star\propto C^{0.73}$ ). That recommendation is exactly why GPT-3 was built gigantic ( $175$ B parameters) but trained on a relatively thin $300$ B tokens. Chinchilla re-ran the experiments more carefully and found the balanced $\approx 0.5/0.5$ split instead, under which GPT-3 was drastically undertrained and should have seen something like $3.7$ T tokens for its size.

The discrepancy traces to two methodological fixes. First, Kaplan counted only non-embedding parameters and otherwise mis-accounted for the embedding/unembedding weights, which skews the $N$ -vs- $D$ trade especially at small scale; counting parameters consistently shifts the exponents toward balance. Second, Kaplan used a single learning-rate schedule rather than re-tuning the decay to each run's token count — and a schedule tuned for a long run handicaps a short one, biasing the measured curves. Repair both and the two laws reconcile on Chinchilla's balanced exponents. The lesson that outlived the specific numbers: your measured scaling exponents are only as trustworthy as your experimental hygiene.

Overtraining: why deployed models ignore Chinchilla on purpose

Here is the twist that governs most models you actually use. Chinchilla-optimal minimizes the cost of training. But a model that will be served to users also pays an inference cost every time it answers, forever — and inference cost scales with $N$ (a bigger model is more expensive on every query). If you expect to serve billions of tokens, it is rational to deliberately pick a smaller model and overtrain it — pour in far more than $20$ tokens per parameter — so that it is permanently cheaper to run, even though that is “wasteful” by the training-only metric.

Formally, you stop minimizing $6ND_{\text{tr}}$ alone and instead minimize total lifetime FLOPs, training plus serving,

\underbrace{6\,N\,D_{\text{tr}}}_{\text{training}} \;+\; \underbrace{2\,N\,D_{\text{inf}}}_{\text{inference}},

where $D_{\text{inf}}$ is the number of tokens you expect to generate over the model's life ( $2N$ per token is the forward-pass cost from the $6ND$ derivation). The bigger $D_{\text{inf}}$ is, the more the optimum shifts toward smaller $N$ trained on more data. This is why the Llama-3 models ( $8$ B and $70$ B) were trained on a colossal $15$ T tokens — roughly $1{,}900$ tokens per parameter for the $8$ B, nearly $100\times$ past Chinchilla — and why Qwen3 reportedly used $\sim 36$ T. They are intentionally over the Chinchilla line, buying a small, cheap-to-serve model at the price of extra training. Llama-2's $2$ T tokens and Chinchilla's own $1.4$ T look modest only in hindsight.
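The lifetime-cost expression is easy to explore numerically. The sketch below compares FLOPs only, using the Chinchilla-sized and Llama-3-8B-sized configurations mentioned in the text; it says nothing about whether the two models reach the same quality, and the helper names are ours.

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Training (6*N*D_tr) plus serving (2*N*D_inf) FLOPs."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens


def tokens_per_param(n_params, train_tokens):
    return train_tokens / n_params


def break_even_inference_tokens(big, small):
    """Inference tokens beyond which `small` costs fewer lifetime FLOPs.

    Each argument is an (n_params, train_tokens) pair, and `small` is
    assumed to have fewer parameters but a larger training cost.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2 * (n_big - n_small)
    return extra_training / saving_per_token


print(tokens_per_param(8e9, 15e12))  # 1875.0

chinchilla_like = (70e9, 1.4e12)
overtrained_8b = (8e9, 15e12)
x = break_even_inference_tokens(chinchilla_like, overtrained_8b)
print(f"break-even after about {x:.2e} inference tokens")
```

In this cost-only comparison, the smaller model's extra training is paid back after roughly a trillion served tokens, and every token after that is cheaper.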

Emergent abilities — and the measurement caveat

Scaling laws predict loss beautifully and smoothly. Downstream capabilities are another matter. Some skills — multi-step arithmetic, certain reasoning tasks — appear to switch on suddenly as scale grows: flat, near-random performance for a long stretch, then a sharp jump to competence. These have been called emergent abilities, and they are genuinely striking.

But there is a deep caveat, and it is one the topic's questions push on hard. Much of that apparent suddenness is an artifact of the metric, not a real phase change in the model. Tasks like “solve this $5$ -digit multiplication” are often scored all-or-nothing (exact-match accuracy): get every digit right or score zero. Under such a brittle metric, a model whose per-digit probability is improving smoothly will look stuck at zero until it crosses the threshold where the whole answer finally clicks — producing a fake “emergence.” Swap to a smoother metric (per-token probability, partial credit) and the same underlying improvement looks gradual all along. The practical takeaway: loss is predictable; benchmark scores are not nearly as predictable, and a sudden jump on a harsh metric is not by itself evidence of a real discontinuity.
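The metric artifact can be reproduced with a toy model of scoring. The sketch assumes each token of an answer is correct independently with the same probability, which is a simplification, and the answer length of 10 tokens and the accuracy values are illustrative.

```python
def exact_match(per_token_accuracy, answer_length):
    """Probability that every token of the answer is right, assuming
    independent per-token errors (a simplifying assumption)."""
    return per_token_accuracy**answer_length


# Per-token accuracy improving smoothly, as it might with scale.
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
scores = [exact_match(p, 10) for p in per_token]

for p, score in zip(per_token, scores):
    print(f"per-token {p:.2f} -> exact match {score:.3f}")
```

The per-token column rises steadily, while the exact-match column sits near zero for the first several rows and then climbs steeply, even though nothing discontinuous happened underneath.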

What to watch for

A handful of recurring tensions drive almost every question in this topic; naming them now will make the detailed answers feel familiar.

Optimal for what? Chinchilla-optimal minimizes training cost. For a model with heavy inference traffic, minimize lifetime cost instead and you will overtrain a smaller model — which is what real deployed models do.
The coefficients are not universal. $E,A,B,\alpha,\beta$ depend on architecture, tokenizer, and data mixture. They shift for Mixture-of-Experts (where “active” vs. “total” parameters change the FLOP accounting), for new tokenizers (a token stops meaning the same thing), and for multimodal data. Re-fit; never paste DeepMind's numbers onto a different stack.
Loss is not capability. Scaling laws extrapolate loss cleanly; downstream skills are far harder to forecast, and harsh metrics can manufacture illusory “emergence.”
The data wall. The $B/D^{\beta}$ term assumes fresh tokens. High-quality text is finite, and repeating data gives diminishing then negative returns (memorization without generalization); empirically a few epochs are roughly as good as fresh data, but not many more. So $D$ cannot grow forever — which is exactly why data quality and mixture now matter as much as raw quantity.

Keep this skeleton in mind — objective, metrics, the $6ND$ rule, the compute-optimal frontier, and the practical reasons people deviate from it — and the detailed questions that follow (the Lagrangian derivation, inference-adjusted optima, MoE FLOP accounting, IsoFLOP fitting, the emergence debate) will read as variations on parts you have already met.
