LLMs Deep Dive
Chapter 04 · Part II · Pretraining & Scale

Pretraining Objectives & Scaling Laws

8 practice sets · 4 coding problems

Topic 1 took a transformer apart and showed that all it ever computes, at each position, is a probability distribution over the next token. This chapter is about the other half of the story: how that machine is trained, and how we predict in advance how good it will get. Both halves turn out to be governed by a single objective so simple it is almost suspicious (guess the next token) and by a family of startlingly regular empirical curves, the scaling laws, that let an engineer forecast the loss of a multi-million-dollar run from a handful of cheap small ones. By the end you should be able to (i) write down and explain the pretraining loss and the three numbers everyone quotes about it (cross-entropy, perplexity, bits-per-token); (ii) sketch where a base model's training data comes from and why its quality matters as much as its quantity; (iii) explain the $C\approx 6ND$ compute rule and derive the factor $6$; and (iv) take a fixed compute budget and turn it into a concrete “train an $N$-parameter model on $D$ tokens” recipe using the Chinchilla compute-optimal frontier.

The one objective: predict the next token

Before an LLM is ever turned into a chatbot, aligned, or taught to use tools, it goes through pretraining: it reads an enormous pile of raw text and learns to do one deceptively simple thing — guess the next token. Grammar, world facts, a little translation and arithmetic, a rough model of how the world hangs together — all of it falls out as a by-product of getting very good at that one game on enough data, with no human labels required.

Recall the setup from Topic 1. Raw text is chopped by a tokenizer into sub-word tokens drawn from a fixed vocabulary of size $V$ (a token is the atomic unit the model reads and predicts; one English word is on average $\approx 1.3$ tokens). A document becomes an integer sequence $x_1,x_2,\dots,x_T$, and a decoder-only transformer models the probability of the whole sequence, factorized strictly left-to-right by the chain rule of probability:

$$p_\theta(x_1,\dots,x_T)=\prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{<t}\right),$$

where $\theta$ denotes all the model's learned parameters and $x_{<t}$ is shorthand for “every token before position $t$.” At each position the model outputs a full probability distribution over the vocabulary for the next token, conditioned only on the tokens before it.

Cross-entropy: the loss that grades the guess

How do we grade a probability distribution against the token that actually came next? We use cross-entropy. At each position the model is scored on the (negative log) probability it assigned to the true next token; the training loss is the average of that over all positions:

$$\mathcal{L}(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta\!\left(x_t \mid x_{<t}\right).$$

Read it in plain words: at every step, look up the probability the model gave to the word that really came next, take its logarithm, flip the sign, and average. If the model put high probability on the truth, $\log p$ is near $\log 1=0$ and the loss is small; if it was confidently wrong, $p$ is tiny, $-\log p$ is huge, and it is punished hard. Minimizing this loss is identical to maximizing the likelihood of the real text: they are the same objective written two ways.
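The loss formula can be sketched in a few lines. This is a minimal illustration, assuming we already have the probability the model assigned to each true next token; the function name `cross_entropy` and the probability lists are made up for the example.

```python
import math

def cross_entropy(probs_of_true_token):
    """Average negative log-probability (in nats) given to the true next tokens."""
    total = sum(math.log(p) for p in probs_of_true_token)
    return -total / len(probs_of_true_token)

# Hypothetical probabilities the model assigned to the token that actually came next.
mostly_right = [0.9, 0.8, 0.95]
confidently_wrong_once = [0.9, 0.8, 0.001]

loss_good = cross_entropy(mostly_right)
loss_bad = cross_entropy(confidently_wrong_once)
print(f"mostly right: {loss_good:.3f} nats, one confident miss: {loss_bad:.3f} nats")
```

A single confident miss dominates the average, which is the "punished hard" behavior described above.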

Why is this such a good deal? Because the causal (look-only-backward) mask from Topic 1 lets the transformer compute all $T$ of these conditional predictions in a single parallel forward pass. We therefore harvest $T$ training signals from one sequence essentially for free. And the “label” for position $t$ is just the input token at position $t+1$: the data is its own answer key. This is what makes next-token prediction a self-supervised objective: any raw text, with no human annotation, is training data. That is the whole reason LLMs can be trained on trillions of tokens; nobody could ever label that much by hand.
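The "data is its own answer key" idea amounts to shifting the sequence by one position. The sketch below uses made-up token ids and a hypothetical helper `make_training_pairs`; note that without a begin-of-sequence token, a sequence of $T$ tokens yields $T-1$ input/target pairs.

```python
def make_training_pairs(token_ids):
    """Return (inputs, targets) where targets[i] is the token that follows inputs[i]."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets

token_ids = [17, 4, 250, 9, 31]  # made-up ids for illustration
inputs, targets = make_training_pairs(token_ids)

# Every position contributes one (context so far, true next token) training signal.
contexts = [(token_ids[: t + 1], targets[t]) for t in range(len(inputs))]
for context, target in contexts:
    print(context, "->", target)
```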

The choice of logarithm base only rescales the loss. With the natural log $\ln$, the loss is in nats; with $\log_2$, it is in bits; they differ by the constant factor $\ln 2\approx 0.693$ (one nat $=1/\ln 2\approx 1.4427$ bits). The objective and its minimizer are unchanged; only the unit on the $y$-axis moves.

Perplexity and bits-per-token: the same number, made readable

Raw cross-entropy is hard to feel. Is a loss of $2.3$ good? Two derived quantities make it intuitive, and both are just the loss in disguise.

Perplexity exponentiates the loss, $\mathrm{PPL}=e^{\mathcal{L}}$ (with $\mathcal{L}$ in nats). It answers: at each step, how many equally-likely options does the model effectively feel torn between? A perplexity of $1$ means perfect prediction: all probability mass on the right token, so $\mathcal{L}=0$ and $\mathrm{PPL}=e^0=1$. A model that has learned nothing and guesses uniformly over $V$ tokens has $\mathrm{PPL}=V$: it is as confused as a fair $V$-sided die. A good language model on English text lands somewhere in between, effectively choosing among a handful to a few dozen plausible continuations. Lower is better, and “the model is improving” always means perplexity (and loss) go down.

Bits-per-token is simply the loss measured in bits instead of nats: $\mathcal{L}/\ln 2$. It has a beautiful interpretation through information theory: pretraining is compression, and bits-per-token is the model's compression rate. A model that predicts the next token well could be used to encode the text in few bits (the better you predict, the less surprise there is to transmit); a confused model needs more. This is the precise sense in which “learning” and “compressing the data” are the same activity.

Bits-per-byte (or bits-per-character) goes one step further: it re-expresses the loss per raw byte of the original text rather than per token. Why bother? Because two models with different tokenizers cut the same sentence into different numbers of tokens, so their per-token losses are not comparable: a model with a finer-grained tokenizer can “win” per token simply by splitting the same text into more, individually easier pieces. The byte count is fixed by the text itself, not by anyone's tokenizer, so normalizing to it gives an apples-to-apples metric:

$$\text{bits-per-byte}=\frac{\mathcal{L}_{\text{nats/token}}}{\ln 2}\cdot\frac{\#\text{tokens}}{\#\text{bytes}}.$$
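The three conversions are one-liners. The sketch below turns the loss of $2.3$ nats from the question above into the other units; the token and byte counts are hypothetical (an assumed average of 4 bytes per token), chosen only to exercise the formula.

```python
import math

def perplexity(loss_nats):
    """Effective number of equally likely choices per step."""
    return math.exp(loss_nats)

def bits_per_token(loss_nats):
    """Same loss, measured in bits instead of nats."""
    return loss_nats / math.log(2)

def bits_per_byte(loss_nats, n_tokens, n_bytes):
    """Tokenizer-independent loss: bits per raw byte of the original text."""
    return bits_per_token(loss_nats) * n_tokens / n_bytes

loss = 2.3            # nats per token
n_tokens = 1_000      # hypothetical token count for some evaluation text
n_bytes = 4_000       # hypothetical byte count (assumes ~4 bytes per token)

print(f"perplexity      = {perplexity(loss):.2f}")
print(f"bits per token  = {bits_per_token(loss):.3f}")
print(f"bits per byte   = {bits_per_byte(loss, n_tokens, n_bytes):.3f}")
```

So a loss of $2.3$ nats corresponds to a perplexity of about 10: the model is roughly as uncertain as a fair ten-sided die at each step.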

Pretraining $=$ next-token prediction $=$ minimizing cross-entropy. Loss, perplexity ($\mathrm{PPL}=e^{\mathcal{L}}$), and bits-per-token are three views of the same number ($\mathcal{L}$); bits-per-byte is the one that survives a change of tokenizer and lets you compare models fairly.

A quick map of pretraining objectives

The topic's name is plural for a reason: next-token prediction is one point in a small design space, and it helps to see the alternatives so you know why the field settled where it did.

  • Causal / autoregressive LM (GPT, Llama, …): predict $x_t$ from $x_{<t}$ only. Generates text natively, one token at a time. This is what essentially all modern generative LLMs use.
  • Masked LM (BERT): randomly blank out $\sim 15\%$ of tokens and predict them from both sides of context. Great for producing representations (the encoder sees the whole sentence at once), but it cannot generate left-to-right and only trains on the masked fraction of tokens per pass, so it is far less sample-efficient as a generator.
  • Span corruption / denoising (T5): mask contiguous spans and have an encoder–decoder reconstruct them — a flexible middle ground for seq-to-seq tasks.
  • Prefix-LM: bidirectional attention over a given prefix, then causal generation after it — a hybrid that keeps full-context conditioning on the prompt.
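One concrete way to see the difference between the causal and prefix-LM objectives in the list above is the attention mask each one implies. This is a small sketch with hypothetical helper names; `mask[i][j]` being `True` means position `i` may look at position `j`.

```python
def causal_mask(seq_len):
    """Each position sees itself and everything before it, nothing after."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def prefix_lm_mask(seq_len, prefix_len):
    """Full (bidirectional) attention inside the prefix, causal attention after it."""
    return [[j <= i or j < prefix_len for j in range(seq_len)]
            for i in range(seq_len)]

def show(mask):
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))

show(causal_mask(4))
print()
show(prefix_lm_mask(4, prefix_len=2))
```

With `prefix_len=0` the prefix-LM mask reduces to the causal one, which is why the two objectives are often described as points on the same spectrum.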

The field converged on causal LM for general-purpose generative models because it (a) puts a learning signal on every token, (b) matches how the model is actually used (autoregressive generation), and (c) scales cleanly. The scaling laws below are stated for this causal objective, though the same methodology applies to any of them.

Where the tokens come from: the data pipeline

Trillions of training tokens do not exist in a clean folder somewhere; they are manufactured. At a high level the pipeline is a funnel that takes a vast, filthy pile of raw web text and squeezes out a much smaller, much cleaner stream worth training on.

Walk the funnel left to right. A web crawl (e.g. Common Crawl) yields petabytes of raw HTML — most of it boilerplate, menus, spam, and machine-generated junk. Text extraction strips the markup down to readable prose. Quality filtering then throws away the bulk of it: heuristic rules (is this English? too many symbols? a list of links?) and learned classifiers (does this “look like” a useful document?) keep only a small, high-value fraction. Deduplication removes near-identical copies — the web is enormously repetitive, and training on the same paragraph a thousand times wastes compute and encourages memorization. Finally, mixture weighting decides how much of each source to include: web text, code, math, books, multilingual data, each up- or down-weighted to hit target capabilities.
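The first stages of the funnel can be sketched as a toy pipeline. This is an illustration under strong simplifying assumptions, not how any production pipeline is built: the tag stripper is a crude regex, the quality rules and thresholds are invented, deduplication is exact-match after normalization (real systems also catch near-duplicates), and mixture weighting is left out entirely. All function names are hypothetical.

```python
import hashlib
import re

def extract_text(html):
    """Crude markup stripper: drop tags, collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()

def passes_quality(text, min_words=5, max_symbol_ratio=0.3):
    """Invented heuristics: enough words, not dominated by symbols."""
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def dedup_key(text):
    """Hash of the lowercased text with punctuation removed (exact-match dedup only)."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def run_pipeline(raw_pages):
    seen, kept = set(), []
    for html in raw_pages:
        text = extract_text(html)
        if not passes_quality(text):
            continue
        key = dedup_key(text)
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept

pages = [
    "<p>The quick brown fox jumps over the lazy dog.</p>",
    "<div>The quick brown fox jumps over the lazy dog!</div>",  # copy with different punctuation
    "<a href='#'>Home</a> | <a href='#'>Login</a>",             # navigation boilerplate
    "$$$ !!! ### @@@ %%% ^^^ &&& *** ((( )))",                  # symbol junk
]
clean = run_pipeline(pages)
print(clean)
```

Four raw pages go in and one document comes out, which mirrors the funnel shape: most of the input is discarded before training ever starts.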

Two non-obvious lessons drive practice here. First, quality beats quantity: a smaller, well-filtered dataset routinely trains a better model than a larger, dirtier one, because every junk token spends capacity and compute teaching the model nothing (or worse). Second, the mixture is a balancing act, not a max: upweighting code to boost coding implicitly downweights everything else and can quietly hurt other skills, so the proportions are tuned by small ablation runs rather than guessed. This is why “what data, in what proportions” is treated as seriously as any architectural choice — if architecture defines how a model learns, data defines what it learns.

Counting the cost: FLOPs and the $C\approx 6ND$ rule

To reason about “compute” we need a unit. A FLOP is one floating-point operation — a single multiply or a single add. The cost of a training run is the total number of FLOPs it consumes, a useful, roughly hardware-independent currency (a faster chip does the same FLOPs in less wall-clock time). The single most useful estimate in all of pretraining is

$$\boxed{\,C \approx 6\,N\,D\,}$$

where $C$ is total training FLOPs, $N$ is the number of model parameters (weights), and $D$ is the number of training tokens. These two knobs, $N$ and $D$, are the entire game: making the model bigger ($N$) or feeding it more data ($D$).

Where does the factor $6$ come from? It is clean accounting, and worth seeing once. The dominant cost of a transformer is its matrix multiplications, and in a matmul every weight is used in exactly one multiply-and-add per token. A multiply-and-add is $2$ FLOPs. So:

  • Forward pass: each of the $N$ weights does $1$ multiply-add per token $\Rightarrow \approx 2N$ FLOPs per token.
  • Backward pass: computing gradients requires two passes' worth of work (one to propagate the gradient back to the activations, one to compute the gradient of each weight), so it costs about twice the forward pass, $\approx 4N$ FLOPs per token.

Adding them up: $2N+4N=6N$ FLOPs per token. Over $D$ training tokens that is $C\approx 6ND$. (This counts only the weight matmuls and ignores attention's $O(T^2)$ term, which is negligible unless the context is very long.) As a single memorable anchor, a $7$B model on $1$T tokens costs $C\approx 6\times(7\times 10^{9})\times 10^{12}\approx 4\times 10^{22}$ FLOPs.
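The accounting above translates directly into code. The sketch keeps the forward and backward terms separate so the factor $6$ stays visible; the function name `training_flops` is made up, and like the rule itself it ignores attention's quadratic term.

```python
def training_flops(n_params, n_tokens):
    """Approximate training cost: 2N forward + 4N backward FLOPs per token."""
    forward = 2 * n_params * n_tokens    # one multiply-add (2 FLOPs) per weight per token
    backward = 4 * n_params * n_tokens   # roughly twice the forward pass
    return forward + backward

flops_7b_1t = training_flops(7e9, 1e12)
print(f"7B params on 1T tokens: {flops_7b_1t:.2e} FLOPs")
```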

Scaling laws: loss is a predictable function of $N$ and $D$

Here is the empirical discovery that made all of modern LLM development possible (Kaplan et al. 2020; Hoffmann et al. 2022, the “Chinchilla” paper). The held-out loss of a well-trained transformer falls as a smooth power law in model size $N$ and data $D$, and it does so cleanly across many orders of magnitude. The Chinchilla parametric form captures the whole relationship with just five fitted constants:

$$L(N,D)=\underbrace{E}_{\text{irreducible}} +\underbrace{\frac{A}{N^{\alpha}}}_{\text{finite model}} +\underbrace{\frac{B}{D^{\beta}}}_{\text{finite data}}.$$

Read it term by term. $E$ is the irreducible loss: the entropy of language itself, the floor you could not beat even with an infinite model trained on infinite data, because real text is genuinely, partly unpredictable. $A/N^{\alpha}$ is the penalty for having only $N$ parameters; it shrinks as you add capacity. $B/D^{\beta}$ is the penalty for having seen only $D$ tokens; it shrinks as you train on more. The exponents $\alpha$ and $\beta$ say how fast each penalty melts away.

Hoffmann et al.'s fitted values, which everyone quotes, were

$$E\approx 1.69,\quad A\approx 406.4,\quad B\approx 410.7,\quad \alpha\approx 0.34,\quad \beta\approx 0.28.$$

Because both reducible terms are power laws with small exponents, returns diminish: each doubling of $N$ shrinks the model term by only a factor $2^{-\alpha}\approx 0.79$, and each doubling of $D$ shrinks the data term by only $2^{-\beta}\approx 0.82$, so every doubling buys a smaller drop in loss than the last. And because the curve is so smooth, the whole enterprise becomes predictable: you can fit the law on a ladder of small, cheap runs and extrapolate to the single huge run you actually want, often to within a few percent. That extrapolation is the reason anyone dares spend tens of millions of dollars on one training run.
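The parametric form is easy to evaluate directly. The sketch below plugs in the fitted constants quoted above and doubles $N$ repeatedly at a fixed, arbitrarily chosen $D$ to make the diminishing returns visible; the constants are the rounded published values, so the outputs are illustrative rather than predictions for any particular training stack.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # rounded fitted constants

def chinchilla_loss(n_params, n_tokens):
    """Parametric loss: irreducible floor + finite-model term + finite-data term."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

D_FIXED = 1e12  # arbitrary fixed data budget for this illustration
sizes = [1e9 * 2**k for k in range(6)]  # 1B, 2B, 4B, ... 32B parameters
losses = [chinchilla_loss(n, D_FIXED) for n in sizes]
drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]

for n, loss in zip(sizes, losses):
    print(f"N = {n:.0e}: loss = {loss:.4f}")
print("drop per doubling:", [round(d, 4) for d in drops])
```

Each successive doubling of $N$ removes less loss than the one before, and no setting ever goes below $E$.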

On log-log–style axes (compute on a log scale) the loss falls along a gentle, near-straight descent toward the floor $E$, the visual signature of a power law. It approaches but never reaches $E$, no matter how much compute you pour in.

Compute-optimal training and the Chinchilla $\approx 20{:}1$ rule

Now the central practical question. Given a fixed budget $C=6ND$, how should you split it between $N$ and $D$? You can buy a big model trained briefly, or a small model trained long; both can cost the same FLOPs. Which gives lower loss?

The way to find that balance experimentally is the IsoFLOP method (“iso” means equal). Fix a compute budget $C$. Now train several models of different sizes $N$ at that same budget: since $C$ is fixed and $C=6ND$, a bigger $N$ automatically means fewer tokens $D=C/(6N)$, and vice versa. Plot each model's final loss against its size. The curve is U-shaped: tiny models underfit (not enough brains), giant models are undertrained (they ran out of tokens before learning much), and the minimum in the middle marks the compute-optimal size for that budget. Repeat the whole sweep at several budgets, collect the minima, and you trace out the optimal frontier $N^\star(C),D^\star(C)$.
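An IsoFLOP sweep can be simulated without training anything by letting the parametric loss stand in for the real runs. That substitution is an assumption of this sketch: in practice each point on the curve is an actual trained model. The budget and the grid of sizes are arbitrary choices for illustration.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # rounded fitted constants

def chinchilla_loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def isoflop_sweep(budget_flops, sizes):
    """For each model size, spend the whole budget: D = C / (6 N)."""
    rows = []
    for n in sizes:
        d = budget_flops / (6 * n)
        rows.append((n, d, chinchilla_loss(n, d)))
    return rows

budget = 1e21                                      # arbitrary compute budget
sizes = [10 ** (8 + 0.25 * k) for k in range(13)]  # 1e8 ... 1e11 parameters
rows = isoflop_sweep(budget, sizes)
best_n, best_d, best_loss = min(rows, key=lambda row: row[2])

for n, d, loss in rows:
    marker = "  <- minimum" if n == best_n else ""
    print(f"N = {n:9.2e}  D = {d:9.2e}  loss = {loss:.4f}{marker}")
```

The printed losses fall, bottom out, and rise again: the U-shape described above. Repeating the sweep at other budgets and collecting each `best_n` would trace the frontier.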

This is exactly the experiment that produced Chinchilla. The headline result: Chinchilla (70B params, $1.4$T tokens) beat Gopher (280B params, $\sim$300B tokens) using the same compute, because Gopher was far too big for its data ($\sim$1 token/param, badly undertrained), while Chinchilla sat near the sweet spot with $4\times$ more data. Across all three of the paper's analysis methods (varying tokens at fixed model size; IsoFLOP curves; and a direct parametric fit), the conclusion agreed: as the budget grows, grow $N$ and $D$ in roughly equal proportion. The famous summary is the $\approx 20$ tokens-per-parameter heuristic:

$$\frac{D^\star}{N^\star}\approx 20.$$

So the ratio is about $20$: not $2$, not $200$.
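Combined with $C\approx 6ND$, the 20:1 heuristic turns a budget into a recipe: substituting $D=20N$ gives $C\approx 120N^2$, so $N\approx\sqrt{C/120}$. A minimal sketch, with the hypothetical helper `chinchilla_recipe` and the ratio exposed as a parameter since 20 is a rule of thumb rather than a constant of nature:

```python
import math

def chinchilla_recipe(budget_flops, tokens_per_param=20.0):
    """Split a FLOP budget using C = 6 N D and D = r N, so N = sqrt(C / (6 r))."""
    n_params = math.sqrt(budget_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Sanity check against Chinchilla itself: 70B parameters on 1.4T tokens.
chinchilla_budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = chinchilla_recipe(chinchilla_budget)
print(f"budget {chinchilla_budget:.2e} FLOPs -> N = {n_params:.2e}, D = {n_tokens:.2e}")
```

Feeding in Chinchilla's own budget recovers its 70B / 1.4T configuration, because that model sits exactly at 20 tokens per parameter.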

Why “equal proportion” gives a constant ratio. The frontier exponents come straight from the algebra. Minimizing $L(N,D)$ subject to $C=6ND$ (a one-line Lagrange exercise; Topic 4's harder questions do it in full) gives optimal allocations that are themselves power laws of the budget,

$$N^{\star}\propto C^{a},\qquad D^{\star}\propto C^{b},\qquad a=\frac{\beta}{\alpha+\beta},\quad b=\frac{\alpha}{\alpha+\beta},\quad a+b=1.$$

With Chinchilla's near-equal exponents $\alpha\approx\beta$, both $a$ and $b$ are $\approx 0.5$: doubling the budget should scale the model by roughly $\sqrt{2}\times$ and the data by roughly $\sqrt{2}\times$. Because $N^\star$ and $D^\star$ then grow at nearly the same rate, their ratio $D^\star/N^\star$ stays roughly constant, and Chinchilla's estimates place that ratio near $20$.
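The frontier can also be written in closed form. Substituting $D=C/(6N)$ into the parametric loss and setting the derivative with respect to $N$ to zero gives $N^\star=G\,(C/6)^{a}$ and $D^\star=G^{-1}(C/6)^{b}$ with $G=(\alpha A/\beta B)^{1/(\alpha+\beta)}$. The sketch below evaluates this with the rounded constants quoted earlier. One caveat to keep in mind: with these particular rounded values $a\approx 0.45$ and $b\approx 0.55$ rather than exactly $0.5$, so the tokens-per-parameter ratio from this fit drifts with the budget and need not equal 20; treat the numbers as an illustration of the algebra, not as a recipe.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # rounded fitted constants

a = BETA / (ALPHA + BETA)    # exponent for N*
b = ALPHA / (ALPHA + BETA)   # exponent for D*
G = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))

def chinchilla_loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal(budget_flops):
    """Closed-form minimizer of the parametric loss subject to C = 6 N D."""
    n_star = G * (budget_flops / 6.0) ** a
    d_star = (budget_flops / 6.0) ** b / G
    return n_star, d_star

print(f"a = {a:.3f}, b = {b:.3f}")
for budget in (1e20, 1e22, 1e24):
    n_star, d_star = compute_optimal(budget)
    print(f"C = {budget:.0e}: N* = {n_star:.2e}, D* = {d_star:.2e}, "
          f"tokens/param = {d_star / n_star:.0f}")
```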

Kaplan vs. Chinchilla: what changed

If scaling laws were the breakthrough of 2020, why is the Chinchilla paper of 2022 the one everyone cites? Because the two papers gave opposite practical advice. The earlier Kaplan et al. (2020) law concluded you should pour most of a growing budget into model size, scaling $N$ much faster than $D$ ($N^\star\propto C^{0.73}$). That recommendation is exactly why GPT-3 was built gigantic ($175$B parameters) but trained on a relatively thin $300$B tokens. Chinchilla re-ran the experiments more carefully and found the balanced $\approx 0.5/0.5$ split instead, under which GPT-3 was drastically undertrained and should have seen something like $3.7$T tokens for its size.

The discrepancy traces to two methodological fixes. First, Kaplan counted only non-embedding parameters, leaving the embedding/unembedding weights out of $N$, which skews the $N$-vs-$D$ trade especially at small scale; counting parameters consistently shifts the exponents toward balance. Second, Kaplan used a single learning-rate schedule rather than re-tuning the decay to each run's token count, and a schedule tuned for a long run handicaps a short one, biasing the measured curves. Repair both and the two laws reconcile on Chinchilla's balanced exponents. The lesson that outlived the specific numbers: your measured scaling exponents are only as trustworthy as your experimental hygiene.

Overtraining: why deployed models ignore Chinchilla on purpose

Here is the twist that governs most models you actually use. Chinchilla-optimal minimizes the cost of training. But a model that will be served to users also pays an inference cost every time it answers, forever, and inference cost scales with $N$ (a bigger model is more expensive on every query). If you expect to serve billions of tokens, it is rational to deliberately pick a smaller model and overtrain it (pour in far more than $20$ tokens per parameter) so that it is permanently cheaper to run, even though that is “wasteful” by the training-only metric.

Formally, you stop minimizing $6ND_{\text{tr}}$ alone and instead minimize total lifetime FLOPs, training plus serving,

$$\underbrace{6\,N\,D_{\text{tr}}}_{\text{training}} \;+\; \underbrace{2\,N\,D_{\text{inf}}}_{\text{inference}},$$

where $D_{\text{inf}}$ is the number of tokens you expect to generate over the model's life ($2N$ per token is the forward-pass cost from the $6ND$ derivation). The bigger $D_{\text{inf}}$ is, the more the optimum shifts toward smaller $N$ trained on more data. This is why the Llama-3 models ($8$B and $70$B) were trained on a colossal $15$T tokens (roughly $1{,}900$ tokens per parameter for the $8$B, nearly $100\times$ past Chinchilla) and why Qwen3 reportedly used $\sim 36$T. They are intentionally over the Chinchilla line, buying a small, cheap-to-serve model at the price of extra training. Llama-2's $2$T tokens and Chinchilla's own $1.4$T look modest only in hindsight.
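The shift toward smaller models can be seen numerically. One way to set up the comparison, assumed here for illustration, is to fix a target loss, ask how many training tokens each model size needs to reach it under the parametric law, and then pick the size with the lowest lifetime FLOPs. The target loss of 2.0, the size grid, and the inference volume of $10^{13}$ tokens are all hypothetical choices; the constants are the rounded fitted values from earlier.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # rounded fitted constants

def tokens_needed(n_params, target_loss):
    """Training tokens required for a model of this size to reach target_loss."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        return None  # too small: cannot reach the target with any amount of data
    return (B / gap) ** (1.0 / BETA)

def lifetime_flops(n_params, target_loss, inference_tokens):
    d_train = tokens_needed(n_params, target_loss)
    if d_train is None:
        return None
    return 6 * n_params * d_train + 2 * n_params * inference_tokens

def best_size(target_loss, inference_tokens, sizes):
    """Model size with the lowest training + serving cost at equal quality."""
    costs = [(lifetime_flops(n, target_loss, inference_tokens), n) for n in sizes]
    feasible = [(cost, n) for cost, n in costs if cost is not None]
    return min(feasible)[1]

sizes = [10 ** (9 + 0.05 * k) for k in range(41)]  # 1e9 ... 1e11 parameters
target = 2.0                                       # hypothetical target loss

n_train_only = best_size(target, 0.0, sizes)
n_heavy_serving = best_size(target, 1e13, sizes)   # hypothetical lifetime inference volume

for label, n in (("training only", n_train_only), ("heavy serving", n_heavy_serving)):
    print(f"{label:>14}: N = {n:.2e}, D_tr = {tokens_needed(n, target):.2e}, "
          f"tokens/param = {tokens_needed(n, target) / n:.0f}")
```

With inference traffic included, the cheapest way to hit the same loss is a smaller model trained on more tokens, which is the overtraining pattern described above.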

Emergent abilities — and the measurement caveat

Scaling laws predict loss beautifully and smoothly. Downstream capabilities are another matter. Some skills — multi-step arithmetic, certain reasoning tasks — appear to switch on suddenly as scale grows: flat, near-random performance for a long stretch, then a sharp jump to competence. These have been called emergent abilities, and they are genuinely striking.

But there is a deep caveat, and it is one the topic's questions push on hard. Much of that apparent suddenness is an artifact of the metric, not a real phase change in the model. Tasks like “solve this $5$-digit multiplication” are often scored all-or-nothing (exact-match accuracy): get every digit right or score zero. Under such a brittle metric, a model whose per-digit probability is improving smoothly will look stuck at zero until it crosses the threshold where the whole answer finally clicks, producing a fake “emergence.” Swap to a smoother metric (per-token probability, partial credit) and the same underlying improvement looks gradual all along. The practical takeaway: loss is predictable; benchmark scores are not nearly as predictable, and a sudden jump on a harsh metric is not by itself evidence of a real discontinuity.
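The metric artifact is easy to reproduce with a toy model. Assume, purely for illustration, that each digit of a 10-digit answer is correct independently with the same probability; exact-match accuracy is then that probability raised to the tenth power. Both the independence assumption and the answer length are simplifications chosen for this sketch.

```python
def exact_match_rate(per_digit_accuracy, n_digits):
    """Chance that every digit is right, assuming independent digits."""
    return per_digit_accuracy ** n_digits

N_DIGITS = 10  # hypothetical answer length
per_digit = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
rows = [(p, exact_match_rate(p, N_DIGITS)) for p in per_digit]

for p, em in rows:
    print(f"per-digit accuracy {p:.2f} -> exact-match {em:.3f}")
```

The per-digit column climbs steadily while the exact-match column sits near zero and then shoots up at the end: the same smooth improvement, viewed through a harsh metric, looks like a sudden jump.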

What to watch for

A handful of recurring tensions drive almost every question in this topic; naming them now will make the detailed answers feel familiar.

  • Optimal for what? Chinchilla-optimal minimizes training cost. For a model with heavy inference traffic, minimize lifetime cost instead and you will overtrain a smaller model — which is what real deployed models do.
  • The coefficients are not universal. $E,A,B,\alpha,\beta$ depend on architecture, tokenizer, and data mixture. They shift for Mixture-of-Experts (where “active” vs. “total” parameters change the FLOP accounting), for new tokenizers (a token stops meaning the same thing), and for multimodal data. Re-fit; never paste DeepMind's numbers onto a different stack.
  • Loss is not capability. Scaling laws extrapolate loss cleanly; downstream skills are far harder to forecast, and harsh metrics can manufacture illusory “emergence.”
  • The data wall. The $B/D^{\beta}$ term assumes fresh tokens. High-quality text is finite, and repeating data gives diminishing then negative returns (memorization without generalization); empirically a few epochs are roughly as good as fresh data, but not many more. So $D$ cannot grow forever, which is exactly why data quality and mixture now matter as much as raw quantity.

Keep this skeleton in mind (objective, metrics, the $6ND$ rule, the compute-optimal frontier, and the practical reasons people deviate from it) and the detailed questions that follow (the Lagrangian derivation, inference-adjusted optima, MoE FLOP accounting, IsoFLOP fitting, the emergence debate) will read as variations on parts you have already met.