Chapter 17 · Part V · Frontiers

Research Engineering & Debugging

9 practice sets · 8 coding problems

Most of what separates a productive LLM researcher from a stuck one is not cleverness about architectures; it is discipline about experiments and bugs. Training code fails in a uniquely nasty way: it almost never crashes. A masking error, an off-by-one label shift, or a silent fp16 overflow does not raise an exception — the loss still goes down, the run still finishes, and you are left holding a number that looks plausible and is wrong. This mini-chapter is the “how to actually do the work” chapter. It builds, from the ground up, two intertwined skills: how to run experiments that give a clean signal (the empirical loop, controlled ablations, small proxies you can extrapolate from), and how to debug when the signal lies to you (reading loss curves, monitoring norms, hunting NaNs, writing tests, bisecting). Assume only that you have seen a neural network train before. Everything the rest of this topic dissects — DPO losses that fall while evals regress, GRPO runs with rising reward and zero gradients, caches that disagree with the full forward pass — is a special case of the habits built here.

The empirical loop: how research actually proceeds

LLM research is not a sequence of brilliant insights; it is a loop run thousands of times, each turn cheap and humble. You form a hypothesis (“RoPE base $\theta$ of $500\text{k}$ will help long context”), design the smallest controlled experiment that could possibly confirm or refute it, run it, read the signal, and only then decide whether to scale up. The whole craft is about keeping that loop tight: a question you can answer in twenty minutes on one GPU you will ask fifty times and learn fifty things; a question that needs a week on a cluster you will ask once and probably misread.

Two facts about scale make this loop the only sane way to work. First, if a change hurts at small scale it will almost never be rescued at large scale — you can confidently rule it out cheaply. Second, if it helps at small scale you must still confirm it on a meaningfully large proxy before you believe it, because small-scale wins do not always transfer. So small experiments are a filter, not an oracle: cheap to fail things, less cheap to confirm them. We return to this small-proxy methodology at the end.

Why ML bugs are different: the silent-failure problem

In ordinary software a bug usually trips an assertion or throws an error. In a training loop, the output is a scalar loss curve, and almost any differentiable function of the inputs produces a curve that goes down. The optimizer is relentless: point it at the wrong objective and it will happily minimize the wrong thing, giving you a smooth, healthy-looking loss that means nothing. The model can even compensate for your bug — learn around a slightly-wrong mask — so the symptom is muffled rather than absent.

Three properties make this worse. First, stochasticity: runs involve random initialization, data shuffling, and dropout, so a real bug and ordinary run-to-run noise look alike unless you control the seed. Second, scale: a failure may only appear at long context, at large batch size, or across many GPUs, none of which you want to spin up just to reproduce a crash. Third, cost: a single “did that fix it?” iteration can be hours or days. The entire discipline is therefore about shrinking the feedback loop — making bugs reproduce in seconds on a toy — and building tripwires that convert silent corruption into a loud, immediate failure.

ML bugs rarely crash; they corrupt. Because gradient descent will optimize any objective, a wrong implementation usually yields a healthy-looking loss curve and quietly-degraded results. The job of a research engineer is to turn silent corruption into a loud, fast, reproducible failure — and to treat any too-good number as guilty until proven innocent.

Reading a loss curve: the practitioner's electrocardiogram

The training loss is the heartbeat you stare at all day, and like a heartbeat it has a handful of recognizable shapes. Learning to name them on sight is the first debugging skill. The language-modeling loss is a cross-entropy, measured in nats (natural-log units) or bits; a from-scratch language model starts near $\ln V$ (the loss of uniform guessing over a $V$-token vocabulary, e.g. $\ln 50000 \approx 10.8$ nats) and a healthy small model settles somewhere around $2$–$4$ nats. Here are the archetypes.

Healthy decay. A steep initial drop (the model learns frequencies and the easy structure of language fast), bending into a long, gently-decreasing tail, slightly noisy but trending down. This is what you want to see.
Plateau / flat-line. The loss stops moving. Causes: a learning rate too low (slow crawl, you burn compute in a bad local region), a learning rate of zero from a scheduler/optimizer bug, a saturated nonlinearity, or the model has simply learned all it can from this data and stage. A plateau from step zero usually means no gradient is flowing at all.
Spike. A sudden jump up, then either a recovery (a recoverable spike, often from a bad batch of data interacting with the current parameter state) or no recovery. Spikes get more frequent at scale and are the single most common mid-run drama.
Divergence. The loss shoots up and never comes back, often to inf or NaN. Usual causes: learning rate too high, an overflow in low precision, or a bad initialization. This is a non-recoverable spike.
Data-order sawtooth. A regular, repeating zig-zag locked to the epoch or to the structure of the data loader — the loss dips on easy shards and rises on hard ones, or drops sharply at each epoch boundary because the model is re-seeing memorized data. A clean sawtooth is a data-pipeline fingerprint, not a learning signal: it usually means insufficient shuffling, a too-small dataset being repeated, or domains concatenated in blocks rather than interleaved.
Too good to be true. A loss far below what the task allows (e.g. $0.2$ nats where $2$–$4$ is expected). This is not success; it is almost always a leak: the model can see the answer (broken causal mask, label shift that lets it copy the input, or eval data in the training set).
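The $\ln V$ starting point quoted before the list of archetypes is worth computing for your own vocabulary, since it is the reference against which "plateau from step zero" and "too good to be true" are judged. A minimal sketch, where the function names are illustrative and the vocabulary size is the one from the text:

```python
import math

def uniform_loss_nats(vocab_size: int) -> float:
    """Cross-entropy of guessing uniformly over vocab_size tokens, in nats."""
    return math.log(vocab_size)

def nats_to_bits(nats: float) -> float:
    """Convert a loss from nats to bits (divide by ln 2)."""
    return nats / math.log(2)

start = uniform_loss_nats(50_000)
print(f"expected starting loss: {start:.2f} nats = {nats_to_bits(start):.2f} bits")
```

If the first logged loss of a from-scratch run is far from this value, that alone is worth investigating before reading the rest of the curve.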

Hands-on · diagnose the curve: spike vs sawtooth vs divergence

You see three runs. Run A: smooth decay, then at step $40\text{k}$ a sharp jump from $2.6$ to $6.1$, and over the next $300$ steps it slides back to $2.6$ and continues as before. Run B: a clean zig-zag of amplitude $\sim\!0.5$ that repeats every $\sim\!2{,}000$ steps, perfectly periodic, never trending up. Run C: a jump at step $40\text{k}$ that keeps climbing to inf within $50$ steps.

A is a recoverable spike — almost certainly a bad batch hitting a sensitive parameter state. Action: keep training; if it recurs, rewind to a checkpoint before the spike and skip those batches, or tighten gradient clipping.
B is a data-order sawtooth — the period gives it away. Something in the loader is structured: domains in blocks, an epoch boundary, or a dataset too small and repeating. Action: shuffle harder, interleave domains, check you are not silently looping a tiny dataset.
C is divergence — it does not recover. Action: this is usually LR-too-high or an overflow. Check the gradient-norm trace just before step $40\text{k}$ (did it explode?) and the dtype. Roll back and lower the LR or add/tighten clipping.

The single most useful companion plot is the gradient norm: a spike that is preceded by a gradient-norm blow-up is an optimization problem; a spike with a calm gradient norm is more likely a data problem.

Beyond the loss: gradient norm, activation norm, and downstream evals

The loss is necessary but not sufficient: a bug can waste a run while the loss curve looks fine throughout (a tensor-parallelism error is a well-known example). You want three more dashboards.

The gradient norm is the size of the whole gradient vector, $\lVert g\rVert_2 = \sqrt{\sum_i g_i^2}$, logged every step. In a healthy run it is moderate and fairly stable; a sudden spike in gradient norm is the earliest warning of an impending loss spike or divergence, often visible a step or two before the loss reacts. This is also why gradient clipping exists: if $\lVert g\rVert_2 > c$ (a threshold like $c=1.0$), the gradient is multiplied by $c/\lVert g\rVert_2$ so that its norm is exactly $c$ before the optimizer step, capping the damage a single bad batch can do. A gradient norm that is zero means no learning signal is reaching the weights (a detached graph, a zeroed loss mask); one that grows monotonically toward inf is your divergence in slow motion.
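The global norm and the clipping rule can be sketched in a few lines. This is a toy version on plain Python lists, with illustrative function names; in PyTorch the equivalent operation is `torch.nn.utils.clip_grad_norm_`, which also returns the pre-clip norm so it can be logged.

```python
import math

def global_grad_norm(grads):
    """L2 norm over every gradient entry, across all parameter tensors."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients by one factor so the global norm is at most max_norm."""
    norm = global_grad_norm(grads)
    if norm <= max_norm:
        return grads, norm
    scale = max_norm / norm
    return [[g * scale for g in tensor] for tensor in grads], norm

grads = [[3.0, 4.0], [12.0]]          # toy "parameter tensors" as flat lists
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, global_grad_norm(clipped))
```

Log the norm measured before clipping: the clipped value is capped at the threshold and hides exactly the blow-up you are trying to see.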

The activation norm is the typical magnitude of the values flowing through the network (e.g. the RMS of the residual stream per layer). If activations grow without bound with depth or with training step, you are heading for an overflow; this is exactly what techniques like z-loss (a small penalty that keeps output logits from growing large) and QK-norm (normalizing queries and keys before attention) are designed to tame. Watching per-layer activation and gradient norms also localizes instability: if one block's gradient norm dwarfs the rest, that block is destabilizing training, and the per-layer trace points straight at it.

Finally, downstream evals. The loss can be healthy while the model is silently broken, so run a few cheap downstream evaluations on intermediate checkpoints and, where possible, compare against a known reference model's intermediate checkpoints. Evals lagging a trusted reference — while the loss looks normal — is the canonical signature of a deep, silent bug. Alongside these, log the boring infrastructure metrics too (throughput in tokens/sec, GPU temperature and memory): a throughput drop with an unchanged loss is a systems bug, not a learning one.

The overfit-a-single-batch test: the cheapest sanity check there is

Before any real run, take one small batch — a handful of examples — and train on it alone, with dropout and weight decay turned off, for a few hundred steps. Ask one question: does the loss go to (nearly) zero?

It should. The model has vastly more parameters than it needs to memorize a handful of sequences, so a correct supervised setup can drive the training loss on that fixed batch arbitrarily close to $0$ (the perfect-prediction loss; cross-entropy is $0$ only when the model puts probability $1$ on the right token every time). If it cannot, the gradient is not flowing the way you think, and the test has just saved you a multi-day run.
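To make the procedure concrete, here is a deliberately tiny sketch: a next-token "model" that is nothing but a table of logits, trained by plain gradient descent on one fixed batch of four pairs. The model, the learning rate, and the step count are assumptions chosen for illustration, not a recipe for a real transformer.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def batch_loss_and_grads(table, batch):
    """Mean cross-entropy over the batch and its gradient w.r.t. the logit table."""
    grads = [[0.0] * len(row) for row in table]
    loss = 0.0
    for current_token, next_token in batch:
        probs = softmax(table[current_token])
        loss -= math.log(probs[next_token])
        for k, p in enumerate(probs):
            target = 1.0 if k == next_token else 0.0
            grads[current_token][k] += (p - target) / len(batch)
    return loss / len(batch), grads

def overfit_one_batch(batch, vocab_size, steps=500, lr=4.0):
    """Train on one fixed batch only; return the loss at every step."""
    table = [[0.0] * vocab_size for _ in range(vocab_size)]  # zero init: uniform guess
    history = []
    for _ in range(steps):
        loss, grads = batch_loss_and_grads(table, batch)
        history.append(loss)
        for row, grad_row in zip(table, grads):
            for k, g in enumerate(grad_row):
                row[k] -= lr * g
    return history

batch = [(0, 1), (1, 2), (2, 3), (3, 0)]   # four (current, next) token pairs
history = overfit_one_batch(batch, vocab_size=4)
print(f"first loss {history[0]:.3f}, last loss {history[-1]:.4f}")
```

The first loss should sit at $\ln 4$ (uniform guessing over four tokens) and the last should be close to zero; the same two checkpoints are what you look for when running the test on real training code.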

Hands-on · overfit one batch: reading the result

You wire up a new training loop and run the overfit test on a $4$-sequence batch. Interpret each outcome:

Loss $\to 0.0$ smoothly. Plumbing is sound: forward pass, loss, backward pass, and optimizer all connect. The simplest explanation is the right one: keep the test in your suite and move on.
Loss stalls at a positive plateau (say $\sim\!\ln 2 \approx 0.69$). The model cannot distinguish examples it should be able to memorize. Suspect a loss mask that zeros out the very tokens you meant to learn, a detached tensor breaking the graph (so gradients never reach the weights), or a learning rate of $0$. A plateau at exactly $\ln(\text{number of classes})$ is the model stuck at the uniform prior.
Loss reaches $\approx 0$ instantly, in one or two steps. Suspiciously easy. This is the too-good-to-be-true smell at toy scale: a label shift letting the model copy the input, or a broken causal mask letting it read the target. Re-run on random tokens; if it still hits $0$ immediately, the model is cheating, not learning.
Loss is NaN from the start. A numerical bug independent of learning — $\log(0)$ in the loss, an uninitialized weight, or an overflow. Fix this before anything else; NaNs poison everything downstream.

One batch, a few seconds, and you have separated “my code is wrong” from “my idea is wrong” — the most important distinction in all of research engineering.

Hunting NaNs and Infs

A NaN (“not a number”) is an IEEE-754 value produced by undefined operations: $0/0$, $\infty-\infty$, $0\cdot\infty$, $\sqrt{-1}$, or the logarithm of a negative number. ($\log 0$ itself evaluates to $-\infty$, which becomes NaN as soon as it is multiplied by zero or added to $+\infty$.) Once a NaN appears it propagates: any arithmetic touching a NaN returns NaN, so within one optimizer step every weight is NaN and your run is dead. The usual upstream cause is an Inf from overflow, and that is where dtype matters.

fp16 (half precision) has a $5$-bit exponent and a maximum finite value of $65{,}504$. A large attention score or logit easily exceeds that, overflows to $\infty$, and the next softmax or normalization yields NaN. bf16 keeps the $8$-bit exponent of fp32 (max $\approx 3.4\times 10^{38}$) at the cost of only $7$ mantissa bits: it almost never overflows, trading precision for range. That is exactly the trade you want for stable training, which is why modern pretraining uses bf16, not fp16. Other NaN sources: a learning rate too high, division by a near-zero denominator with no $\varepsilon$, $\log(0)$ in a hand-rolled loss, or bad data (an empty document, a token id out of range).
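The overflow-then-NaN chain can be reproduced without a GPU. The sketch below emulates fp16 with the standard library's half-precision `struct` format and emulates bf16 crudely by truncating an fp32 encoding to its top 16 bits (real hardware rounds rather than truncates; that simplification is an assumption of this example). The helper names are illustrative.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; overflow becomes inf, as it would on a GPU."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:          # struct refuses to pack values beyond the fp16 range
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Crude bf16: keep the top 16 bits of the fp32 encoding (truncation, not rounding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # inf - inf = nan when m is inf
    total = sum(exps)
    return [e / total for e in exps]

logit = 70_000.0                                # just past the fp16 maximum of 65504
fp16_probs = softmax([to_fp16(logit), to_fp16(1.0)])
bf16_probs = softmax([to_bf16(logit), to_bf16(1.0)])
print("fp16:", fp16_probs)
print("bf16:", bf16_probs)
```

The fp16 path turns the large logit into an infinity and the softmax into NaNs, while the bf16 path loses a little precision on the logit but stays finite.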

To localize a NaN, bisect the forward pass: install hooks that check each operation's output for non-finite values, and the first op that emits a NaN is your culprit (or sits one step downstream of it). The same divide-and-conquer logic applies in time: if a run goes NaN at step $40\text{k}$, the gradient-norm and activation-norm traces just before tell you whether it was a slow build-up (overflow) or an instantaneous bad batch.
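The "first op that emits a non-finite value" search can be sketched on a toy pipeline. Here the forward pass is a hypothetical list of named functions over Python floats; in a PyTorch model the same check would typically live in a hook registered with `register_forward_hook` on each module.

```python
import math

def first_nonfinite_op(ops, inputs):
    """Run named ops in order; return the name of the first whose output has NaN/Inf."""
    values = inputs
    for name, fn in ops:
        values = fn(values)
        if not all(math.isfinite(v) for v in values):
            return name
    return None

# Hypothetical three-stage "forward pass" on plain Python floats.
ops = [
    ("scale",  lambda xs: [x * 1e200 for x in xs]),
    ("square", lambda xs: [x * x for x in xs]),           # overflows to inf
    ("center", lambda xs: [x - max(xs) for x in xs]),     # inf - inf = nan
]
culprit = first_nonfinite_op(ops, [1.0, 2.0])
print("first non-finite output comes from:", culprit)
```

Note that the NaN only becomes visible in the last stage, but the search correctly blames the earlier stage that produced the infinity, which is why checking for Inf as well as NaN matters.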

Reproducibility and determinism

A result you cannot reproduce is not a result. The bedrock is the seed: fix the random seeds for every source of randomness (weight init, data shuffling, dropout) and a run becomes repeatable. Repeatability is also what lets you tell a real effect from noise: rerun the same config under a few different seeds, and the spread between those runs estimates how much run-to-run variation to expect. If a change moves the loss by less than the seed-to-seed spread, you have measured noise, not signal.
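One way to operationalize the seed-to-seed comparison is sketched below. The two-standard-deviation threshold and all loss values are assumptions made up for illustration; with only a few seeds this is a rough screen, not a statistical test.

```python
import statistics

def exceeds_seed_noise(baseline_losses, variant_losses, num_stdevs=2.0):
    """True if the mean improvement is larger than num_stdevs seed-to-seed stdevs."""
    noise = statistics.stdev(baseline_losses)          # spread across seeds, same config
    delta = statistics.mean(baseline_losses) - statistics.mean(variant_losses)
    return delta > num_stdevs * noise

# Made-up final losses from three seeds each, for illustration only.
baseline = [2.60, 2.64, 2.62]
small_change = [2.61, 2.59, 2.63]
large_change = [2.40, 2.43, 2.41]

print(exceeds_seed_noise(baseline, small_change))
print(exceeds_seed_noise(baseline, large_change))
```

The first variant moves the mean by less than the baseline's own spread and should be treated as noise; the second clears it comfortably.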

But seeds alone do not guarantee bit-exact determinism, because some GPU kernels are nondeterministic by design: atomic-add reductions accumulate in nondeterministic order, and floating-point addition is not associative, so $(a+b)+c$ and $a+(b+c)$ can differ in the last bits. Across GPUs, the order of an all-reduce can vary. Most of this nondeterminism is benign jitter in the low bits. The danger is when it masks or mimics a bug, so you want it controllable: frameworks expose deterministic-algorithm flags that trade some speed for reproducibility, invaluable when you need to confirm that two runs differ only because of the change you made.
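Non-associativity is easy to see on the CPU. The sketch below compares two groupings of the same sum, then mimics a reduction whose accumulation order depends on a seed (a stand-in for racing atomic adds; the helper name is illustrative).

```python
import math
import random

left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)                 # the two groupings round differently
print(math.isclose(left, right))     # but they agree within tolerance

def shuffled_sum(values, seed):
    """Sum the same numbers in a seed-dependent order, like a racing reduction."""
    order = list(values)
    random.Random(seed).shuffle(order)
    return sum(order)

values = [0.1 * k for k in range(1, 101)]
print(shuffled_sum(values, seed=0), shuffled_sum(values, seed=1))
```

With the order pinned (same seed) the result is repeatable bit for bit; with a different order it may drift in the last digits, which is the kind of jitter deterministic-algorithm flags such as PyTorch's `torch.use_deterministic_algorithms(True)` are meant to remove.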

Writing tests for model code

The fastest debugging is the bug you caught before it ran. A few small tests pay for themselves immediately.

Shape assertions. Put explicit assert statements on tensor shapes at the top of every forward pass. Shape mismatches are the one bug class that should crash, but broadcasting can hide them: a $[B,T]$ mask combined with $[B,1,T]$ logits silently broadcasts to $[B,B,T]$, turning a shape error into a correctness error. An assert turns it back into a loud crash.

Causal-mask leakage test. The defining property of a causal model is that a position cannot see the future. Test it directly: with dropout disabled, run a forward pass on a short sequence, record the logits at position $i$, then change a token at position $j>i$ and re-run. If the logits at position $i$ change beyond floating-point tolerance, the future is leaking and the mask is broken. This one test catches the single most common “too good” bug.
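The perturb-the-future test can be sketched against a toy model. The "model" here is a hypothetical single-head attention over scalar embeddings, written once with a causal mask and once without, so the test has something to pass and something to fail on.

```python
import math

def attention(xs, causal=True):
    """Toy single-head self-attention over scalar 'embeddings' xs."""
    n = len(xs)
    outputs = []
    for i in range(n):
        visible = range(i + 1) if causal else range(n)   # the mask under test
        scores = [xs[i] * xs[j] for j in visible]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        mixed = sum(w * xs[j] for w, j in zip(weights, visible))
        outputs.append(mixed / sum(weights))
    return outputs

def causal_model(xs):
    return attention(xs, causal=True)

def broken_model(xs):
    return attention(xs, causal=False)

def leaks_future(model, xs, i, j, new_value):
    """True if changing position j > i changes the model's output at position i."""
    assert j > i, "perturb a strictly later position"
    before = model(xs)[i]
    perturbed = list(xs)
    perturbed[j] = new_value
    after = model(perturbed)[i]
    return not math.isclose(before, after, rel_tol=1e-9, abs_tol=1e-12)

xs = [0.5, -1.0, 2.0, 0.3]
print(leaks_future(causal_model, xs, i=0, j=3, new_value=5.0))
print(leaks_future(broken_model, xs, i=0, j=3, new_value=5.0))
```

On a real model the same structure applies: swap `model` for a forward pass in eval mode and compare logits with a tolerance.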

Known-answer test (toy / unit test). A unit test pins one component against a known-correct answer; a toy test runs the real code path on inputs small enough to verify by hand (a $4$-token causal mask, a $2$-expert router, a $3$-token attention you worked out on paper). Because floating-point arithmetic is not associative, two mathematically-equal computations rarely produce bit-identical tensors, so compare with a tolerance, e.g. torch.allclose(a, b, rtol=1e-4, atol=1e-6), not ==. The canonical such test is cached-vs-full-forward equivalence: incremental decoding with a KV cache must produce the same logits (within tolerance) as one full forward pass over the whole sequence. The first position where they diverge points straight at the cache bug (a stale position id, wrong indexing, an off-by-one in the appended key/value).
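A toy version of the cached-vs-full equivalence test is sketched below, again on a hypothetical scalar attention. The `stale_cache_bug` flag injects one deliberate off-by-one in the appended key/value so the test has a failure to locate; the tolerances mirror the ones quoted above.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one scalar query over scalar keys and values."""
    scores = [query * k for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def full_forward(xs):
    """Causal attention over the whole sequence at once."""
    return [attend(xs[t], xs[: t + 1], xs[: t + 1]) for t in range(len(xs))]

def cached_decode(xs, stale_cache_bug=False):
    """Incremental decoding: append one key/value per step, attend over the cache."""
    cache, outputs = [], []
    for t, x in enumerate(xs):
        # The deliberate bug appends the previous token instead of the current one.
        cache.append(xs[t - 1] if stale_cache_bug and t > 0 else x)
        outputs.append(attend(x, cache, cache))
    return outputs

def first_divergence(a, b, rel_tol=1e-4, abs_tol=1e-6):
    """Index of the first position where two outputs disagree beyond tolerance."""
    for t, (u, v) in enumerate(zip(a, b)):
        if not math.isclose(u, v, rel_tol=rel_tol, abs_tol=abs_tol):
            return t
    return None

xs = [0.5, -1.0, 2.0, 0.3]
print(first_divergence(full_forward(xs), cached_decode(xs)))
print(first_divergence(full_forward(xs), cached_decode(xs, stale_cache_bug=True)))
```

Reporting the first diverging position, rather than a single pass/fail, is the useful part: it tells you which decoding step to inspect.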

Gradient checking. A forward-pass test cannot catch a wrong backward pass (a mis-derived custom gradient). Finite differences can. The central-difference estimate of the derivative of $f$ at $x$ is

$$\frac{\partial f}{\partial x} \;\approx\; \frac{f(x+h)-f(x-h)}{2h},$$

accurate to $O(h^2)$. Compare it to the analytic gradient via the relative error

$$\frac{\lvert g_{\text{analytic}} - g_{\text{numeric}}\rvert}{\lvert g_{\text{analytic}}\rvert + \lvert g_{\text{numeric}}\rvert + \varepsilon},$$

and pick $h$ in a sweet spot (around $10^{-5}$ in double precision): too large and truncation error from the $O(h^2)$ term dominates; too small and floating-point cancellation in the numerator $f(x+h)-f(x-h)$ destroys the digits. A relative error below $\sim 10^{-6}$ means your backward matches your forward.
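The check can be sketched on a scalar function whose derivative is known. The function, the evaluation point, and the deliberately wrong gradient are all illustrative choices; for real PyTorch autograd functions, `torch.autograd.gradcheck` performs a comparable finite-difference comparison.

```python
import math

def f(x):
    return x ** 3 + math.sin(x)

def grad_correct(x):
    return 3 * x ** 2 + math.cos(x)

def grad_buggy(x):
    return 3 * x ** 2 - math.cos(x)      # sign error in the hand-derived gradient

def central_difference(fn, x, h=1e-5):
    """Central-difference estimate of fn'(x), accurate to O(h^2)."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

def relative_error(analytic, numeric, eps=1e-12):
    return abs(analytic - numeric) / (abs(analytic) + abs(numeric) + eps)

x0 = 1.3
numeric = central_difference(f, x0)
err_ok = relative_error(grad_correct(x0), numeric)
err_bad = relative_error(grad_buggy(x0), numeric)
print(f"correct gradient: {err_ok:.2e}, buggy gradient: {err_bad:.2e}")
```

The correct gradient lands far below the $10^{-6}$ threshold, while the sign error produces a relative error of a few percent, many orders of magnitude above it.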

Bisection: binary search for the commit that broke it

When a metric regresses and you have a history of code changes, do not stare at diffs — bisect. git bisect performs binary search over commits: mark a known-good and a known-bad commit, and at each step test the midpoint, halving the suspect range until you land on the exact commit that broke the eval. Over $N$ commits this is $\log_2 N$ tests rather than $N$ — a regression introduced somewhere in $1024$ commits is found in about $10$ tests. The only requirement is a fast, deterministic “is it broken?” check to run at each step, which is exactly what the toy tests above give you.
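The logarithmic test count is easy to verify with a sketch of the search itself. This is the same binary search `git bisect` automates, written over commit indices, with a hypothetical `is_bad` check standing in for your fast regression test.

```python
def bisect_first_bad(num_commits, is_bad):
    """Find the first bad commit index, given commit 0 is good and the last is bad."""
    good, bad = 0, num_commits - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(mid):
            bad = mid          # regression is at mid or earlier
        else:
            good = mid         # regression is after mid
    return bad, tests

# Hypothetical history: the regression landed in commit 700 of 1024.
first_bad, tests = bisect_first_bad(1024, lambda commit: commit >= 700)
print(f"first bad commit: {first_bad}, found in {tests} tests")
```

The search assumes the check is monotone (every commit after the breaking one is also bad) and deterministic; a flaky check can send the bisection down the wrong half.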

Controlled ablations: change exactly one thing

To learn why something helped, you run an ablation: re-run with a single component removed or changed, holding everything else fixed, and measure the difference. The cardinal rule is change exactly one variable. If you simultaneously switch the optimizer and raise the learning rate and swap the data mix, and the loss improves, you have learned nothing — you cannot attribute the gain. The thing that ruins ablations is a confounder: a hidden variable that moved alongside the one you meant to test. Classic confounders are the random seed (always compare against the seed-to-seed noise band, or your “improvement” may be luck), the total token count (a change that lets you process more tokens per second will look better even if it is neutral per token), and the learning rate (many architectural changes are really just “this lets me use a higher LR” in disguise, so a fair comparison must re-tune the LR on both arms).

The data pipeline: the most common silent bug

If you remember one thing: the data pipeline is where the bugs hide. It touches every run, it rarely crashes, and its bugs masquerade as model behavior. The usual suspects:

Label / loss-mask off-by-one. A causal model predicts $x_{t+1}$ from position $t$, so targets are inputs shifted left by one: score logits[:, :-1] against labels[:, 1:]. Get it wrong and you train the model to predict the current token from itself, a trivial copy task with a suspiciously low loss.
Loss-mask errors. In post-training you score only assistant tokens, not the prompt or padding. A wrong mask trains on the prompt, on padding, or on special tokens, quietly polluting the objective (and, if it masks everything, gives zero gradient).
Tokenization mismatch, train vs eval. The model trains on text wrapped in one set of role markers and special tokens, but the eval (or the serving server) wraps prompts differently — a chat-template mismatch. No error fires; the model is simply run off-distribution and quality silently drops. Detect it by round-tripping: tokenize an example exactly as training does and exactly as eval does, and diff the token ids.
Truncation and packing bugs. Sequences truncated mid-example, documents packed together without a separator so attention bleeds across document boundaries, or a max-length that silently drops your longest (and often most informative) examples.
Insufficient dedup / shuffling. Near-duplicate documents inflate the effective weight of some data, and poorly shuffled or block-ordered data causes the sawtooth above; an under-deduplicated train set that overlaps the eval set produces the “every model is better than the last” illusion.

The fix is the same reflex as everywhere: do not trust the pipeline, print and eyeball a few fully-decoded training examples (with the loss mask overlaid) and verify by hand that the model sees what you intended.
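A small sketch of that eyeball check, combining the one-position label shift with the loss mask. The chat example and its role markers are hypothetical placeholders rather than any real template, and the tokens are kept as strings so the printout is readable.

```python
def next_token_pairs(tokens, is_assistant):
    """Pair each input position with its next-token target and that target's loss mask."""
    inputs = tokens[:-1]            # positions 0 .. T-2 are what the model reads
    targets = tokens[1:]            # each position predicts the following token
    scored = is_assistant[1:]       # the mask belongs to the target, so it shifts too
    return list(zip(inputs, targets, scored))

# Hypothetical chat example; the role markers are placeholders, not a real template.
tokens = ["<user>", "Hi", "<assistant>", "Hello", "!", "<end>"]
is_assistant = [False, False, False, True, True, True]

pairs = next_token_pairs(tokens, is_assistant)
for position, (inp, tgt, keep) in enumerate(pairs):
    print(f"{position}: {inp!r:>14} -> {tgt!r:<14} {'LOSS' if keep else 'masked'}")
```

Reading the printout row by row answers the two questions that matter: is every target the token after its input, and is the loss applied only where you intended.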

Profiling and chasing MFU

Speed is a research multiplier: a run twice as fast lets you ask twice as many questions. The headline efficiency number is MFU (Model FLOPs Utilization) — the fraction of the GPU's peak floating-point throughput your training actually achieves. If a model needs $C$ FLOPs of useful compute per step and the hardware's theoretical peak is $P$ FLOPs/s, then

$$\text{MFU} = \frac{C/\text{step time}}{P}.$$

A well-tuned large-model run reaches roughly $30\text{--}50\%$ MFU; the rest is lost to memory movement, communication between GPUs, and pipeline bubbles. To raise it you profile: capture a timeline of where each step's milliseconds go and look for the bottleneck. Is the GPU compute-bound (good, you are using it) or memory-bound (waiting on data movement)? Is it stalling on the data loader (the GPU sits idle while the CPU fetches the next batch, which shows up as a throughput drop together with low GPU utilization)? Is communication dominating (too much cross-GPU traffic)? The first move when throughput is mysteriously low is always the profiler, not guesswork: the same shrink-and-look discipline, applied to time instead of correctness.
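The MFU definition above translates directly into a one-line calculation. The numbers below are made up for illustration; in practice $C$ comes from a FLOPs count of your model and $P$ from the hardware vendor's peak figure for the dtype you train in.

```python
def mfu(flops_per_step, step_time_s, peak_flops_per_s):
    """Model FLOPs Utilization: achieved useful FLOPs/s divided by hardware peak."""
    return (flops_per_step / step_time_s) / peak_flops_per_s

# Made-up numbers for illustration: 4e14 useful FLOPs per step, 1.0 s per step,
# on hardware with a 1e15 FLOPs/s theoretical peak.
utilization = mfu(4e14, 1.0, 1e15)
print(f"MFU = {utilization:.0%}")
```

Because step time sits in the denominator of the achieved rate, logging MFU every step makes a data-loader stall or a communication slowdown visible as a drop in a single number.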

Small proxies, extrapolation, and scaling-aware hyperparameter transfer

You cannot afford to tune a frontier model directly, so you tune a small proxy and extrapolate. There are two ways to make a proxy: train the target-size model on far fewer tokens, or train a smaller model for the full schedule. The closer the proxy is to the target and the longer it trains, the more its conclusions transfer. As established earlier: a change that hurts the proxy is dead; a change that helps the proxy is a candidate that still needs confirmation at larger scale.

The deepest version of this idea is scaling-aware hyperparameter transfer, most cleanly realized by $\mu$P (the Maximal Update Parametrization). The problem it solves: normally, the optimal learning rate shifts as you make the model wider, so a learning rate tuned on a small model is wrong for a big one, and you would have to re-tune expensively at scale. $\mu$P re-parametrizes the network, carefully setting how initialization scales and how the per-layer learning rate scales with width, so that the optimal hyperparameters become (approximately) invariant to width. The payoff (“$\mu$Transfer”): tune the learning rate and other knobs once on a cheap small proxy, then transfer them zero-shot to the full model, collapsing what would be a fortune in tuning runs into essentially one. You do not need the derivation to use the result; you need to know that naive hyperparameter transfer across scale is unreliable, and that scaling-aware parametrizations exist precisely to make small-proxy tuning trustworthy.

What to watch for

The practical stakes are large because the failure mode is wasted confidence: a silently-buggy run does not warn you, it ships. Internalize the loop and the battery. The loop: hypothesize, run the smallest controlled experiment, read the signal (loss and gradient norm, activation norm, downstream eval), decide, repeat — keeping a toy that reproduces in seconds so you debug logic at small scale and reserve the GPUs for runs you already trust. The battery, before any costly job: overfit one batch to near-zero, a cached-vs-full equivalence test, a causal-mask leakage test, a chat-template round-trip, a fixed-seed determinism check, and a hand-eyeball of a few decoded training examples. Read every loss curve as an ECG — name the spike, the plateau, the sawtooth, the divergence on sight — and treat any too-good number as guilty until proven innocent. Change exactly one thing in every ablation, watch for confounders, and when a fix improves one metric while worsening another, resist declaring victory: that is usually the sign that you traded one behavior for another, not that you found the truth. The rest of this topic is a guided tour of these failure modes — label shifts, mask leaks, KV-cache mismatches, DPO and GRPO traps, distributed-training divergence, leaking evals — in concrete detail. You now have the map and the instruments.