Chapter 15 · Part V · Frontiers

Diffusion & Non-Autoregressive Language Models

8 practice sets · 4 coding problems

The big picture: escaping strict left-to-right

Every model in the earlier topics is autoregressive (AR): it writes a sequence one token at a time, strictly left to right, each new token conditioned only on the ones already produced. The joint probability factorizes exactly into a product of next-token terms, $p(x_1,\dots,x_L)=\prod_{t=1}^{L}p(x_t\mid x_{<t})$, and this clean factorization is what makes GPT-style models practical: you can train on every position at once (the causal mask) and sample with a simple loop. But that very structure bakes in two costs. First, generation is sequential: producing $L$ tokens needs $L$ forward passes that cannot overlap, so latency grows with output length no matter how much hardware you throw at it. Second, the model only ever sees left context while generating, which makes editing, infilling, and global planning awkward, and produces oddities like the reversal curse: a model that learned “A is B” often fails at “B is A,” because it is never trained to predict right-to-left.

Diffusion language models are the leading non-autoregressive alternative. Instead of committing to one token and marching on, they start from a fully corrupted sequence and refine the whole thing in parallel over a handful of steps, with full bidirectional attention at every step. The promise: produce a 200-token answer in 16 or 32 passes instead of 200, with every position able to see every other. The very same idea powers image generators (Stable Diffusion, Imagen, DALL-E 2); the only trick for text is choosing the right notion of “noise” for discrete symbols. This chapter builds that machinery from scratch, first the continuous-image intuition and then the discrete-text version that actually ships, and then surveys the model zoo and the practical hazards the questions in this topic poke at.

Continuous diffusion: corrupt with Gaussian noise, learn to denoise

It is easiest to meet diffusion where it was born: on continuous data like images, where a data point $x_0$ is a vector of real numbers (pixel values). Diffusion has two halves. The forward (noising) process is fixed and hand-designed: it gradually destroys the clean point $x_0$ by adding a little Gaussian noise at each of $T$ steps, until nothing of the original remains and we are left with a sample from a standard Gaussian. The reverse (denoising) process is what we learn: a neural network that, given a noised point, predicts how to step back toward cleaner data. To generate, you draw pure noise and run the reverse process all the way back to a clean sample.

The forward step for continuous data is just “mix in some noise.” A standard schedule (DDPM, denoising diffusion probabilistic models) lets you jump directly to the noised state at any time $t$ in closed form:

$$x_t = \sqrt{\bar\alpha_t}\;x_0 \;+\; \sqrt{1-\bar\alpha_t}\;\epsilon, \qquad \epsilon\sim\mathcal{N}(0,I),$$

where $\bar\alpha_t$ shrinks from $\bar\alpha_0\!\approx\!1$ (almost no noise) toward $0$ (almost all noise) as $t$ grows, and $\epsilon$ is the Gaussian noise that got mixed in. Read it plainly: $x_t$ is a weighted blend of the clean signal $x_0$ and fresh noise $\epsilon$, with the blend tilting from “mostly signal” to “mostly noise” over time. Because there is no chain to unroll (you can sample $x_t$ for any $t$ with one draw of $\epsilon$), training is cheap.

Hands-on · noising a single number

Forget images; take one scalar, $x_0 = 2.0$. Use a tiny three-step schedule with $\bar\alpha_t$ equal to $0.9,\,0.5,\,0.1$ at steps $t=1,2,3$, and suppose the noise draws happen to be $\epsilon_1{=}1.0$, $\epsilon_2{=}-1.0$, $\epsilon_3{=}0.5$. Apply $x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon$:

$t{=}1$: $\sqrt{0.9}\cdot 2 + \sqrt{0.1}\cdot 1.0 \approx 1.897 + 0.316 = 2.21$ (barely budged).
$t{=}2$: $\sqrt{0.5}\cdot 2 + \sqrt{0.5}\cdot(-1.0)\approx 1.414 - 0.707 = 0.71$ (signal and noise now carry equal weight).
$t{=}3$: $\sqrt{0.1}\cdot 2 + \sqrt{0.9}\cdot 0.5 \approx 0.632 + 0.474 = 1.11$ (the weight on $x_0$ has dropped to about $0.32$, so the signal is mostly washed out).

The clean value $2.0$ dissolves into noise as $t$ climbs. The reverse network's job is to look at, say, the $t{=}2$ value $0.71$ and nudge it back toward $2.0$ — one small denoising step at a time.
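
The arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the numbers from the example; the helper name `noise_scalar` is made up for illustration.

```python
import math

def noise_scalar(x0: float, alpha_bar: float, eps: float) -> float:
    """Closed-form forward step: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

x0 = 2.0
alpha_bars = [0.9, 0.5, 0.1]   # schedule from the example, t = 1, 2, 3
eps_draws = [1.0, -1.0, 0.5]   # the fixed noise draws from the example

noised = [noise_scalar(x0, a, e) for a, e in zip(alpha_bars, eps_draws)]
for t, x_t in enumerate(noised, start=1):
    print(f"t={t}: x_t = {x_t:.2f}")   # 2.21, 0.71, 1.11
```

Each $x_t$ is computed directly from $x_0$, without stepping through the earlier times, which is the "no chain to unroll" property in action.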

How does the network learn to reverse this? The pivotal trick is the “predict the change” view: rather than asking the network to output the clean $x_0$ directly, we train it to predict the noise $\epsilon$ that was added. Given $x_t$ and $t$, the model outputs $\epsilon_\theta(x_t,t)$, and the loss is simply how far that guess is from the true noise,

$$\mathcal{L} = \mathbb{E}_{t,\,x_0,\,\epsilon}\big\|\,\epsilon - \epsilon_\theta(x_t,t)\,\big\|^2 .$$

Once you can estimate the noise, you can subtract a bit of it off to take one step back toward clean data. This is the same lesson as residual connections (Topic 1): predicting a small difference is far easier than predicting the full target from scratch. Equivalently, knowing the noise is knowing the score $\nabla_{x_t}\log p_t(x_t)$ of the noised data distribution; the two differ only by the known scale factor $-1/\sqrt{1-\bar\alpha_t}$. The score is the direction in which data becomes more probable, so this denoising view and the “score-matching” view are two names for one thing. A closely related modern framing, flow matching, learns a smooth velocity field that transports noise to data along (often straight) paths; it tends to need fewer sampling steps but rests on the same corrupt-then-learn-to-undo skeleton.
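
To see why a noise estimate is as good as a clean-data estimate, the forward formula can be inverted. The sketch below is a toy scalar illustration, not a full DDPM sampler step: the function names are hypothetical, and the "network output" is faked by taking the true noise and optionally perturbing it.

```python
import math
import random

def forward_noise(x0, alpha_bar, eps):
    """x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

def estimate_x0(x_t, alpha_bar, eps_hat):
    """Solve the forward formula for x0, given a noise estimate eps_hat."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps_hat) / math.sqrt(alpha_bar)

def noise_mse(eps, eps_hat):
    """Per-sample training loss: squared error between true and predicted noise."""
    return (eps - eps_hat) ** 2

rng = random.Random(0)
x0, alpha_bar = 2.0, 0.5
eps = rng.gauss(0.0, 1.0)                 # the noise actually mixed in
x_t = forward_noise(x0, alpha_bar, eps)

perfect = estimate_x0(x_t, alpha_bar, eps)        # oracle noise estimate
sloppy = estimate_x0(x_t, alpha_bar, eps + 0.3)   # noise estimate off by 0.3
print(perfect, sloppy, noise_mse(eps, eps + 0.3))
```

A perfect noise estimate recovers $x_0$ exactly, and at $\bar\alpha_t = 0.5$ a noise error of $0.3$ becomes an $x_0$ error of $0.3$, so driving the noise loss down directly tightens the clean-data estimate.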

Discrete diffusion: what does “noise” mean for tokens?

Tokens are discrete symbols, not real numbers, so “add a little Gaussian” is meaningless — $\texttt{cat}+0.3$ is not a word. The general fix, D3PM (discrete denoising diffusion probabilistic models, Austin et al. 2021), replaces continuous addition with a transition matrix: at each forward step every token jumps to another token (or stays) according to fixed probabilities. Different choices of that matrix give different corruptions — a uniform matrix randomly swaps tokens for any other token; a matrix built from embedding nearest-neighbours swaps for similar tokens; and the one that won for language is the absorbing-state matrix, where the only place a token can jump to is a special $[\textsc{mask}]$ symbol.
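
As a concrete picture of two of those corruptions, here is a minimal sketch of a uniform and an absorbing-state transition matrix for a four-symbol toy vocabulary. The vocabulary, the per-step corruption probability `beta`, and the function names are all assumptions made for illustration; row $i$ holds the probabilities of jumping from token $i$ to each token $j$.

```python
def uniform_matrix(vocab_size, beta):
    """With probability beta, resample the token uniformly; otherwise keep it."""
    k = vocab_size
    return [[(1.0 - beta) * (i == j) + beta / k for j in range(k)]
            for i in range(k)]

def absorbing_matrix(vocab_size, beta, mask_id):
    """With probability beta, jump to mask_id; otherwise stay. The mask never leaves."""
    k = vocab_size
    q = [[0.0] * k for _ in range(k)]
    for i in range(k):
        if i == mask_id:
            q[i][i] = 1.0            # absorbing: [MASK] stays [MASK]
        else:
            q[i][i] = 1.0 - beta
            q[i][mask_id] = beta
    return q

vocab = ["the", "cat", "sat", "[MASK]"]   # toy vocabulary, mask is index 3
Q_uniform = uniform_matrix(len(vocab), beta=0.2)
Q_absorbing = absorbing_matrix(len(vocab), beta=0.2, mask_id=3)
for row in Q_absorbing:
    print(row)
```

In the absorbing matrix every non-mask row has exactly two non-zero entries (stay, or jump to the mask), which is what keeps the forward marginal so simple.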

That absorbing choice is so important it has its own name: masked (absorbing-state) discrete diffusion. The forward corruption is simply replacing tokens with $[\textsc{mask}]$. Concretely, with a masking schedule $\alpha_t$ that decreases from $\alpha_0\!=\!1$ (nothing masked) to $\alpha_1\!=\!0$ (everything masked), each token is, independently, kept with probability $\alpha_t$ and turned into $[\textsc{mask}]$ with probability $1-\alpha_t$:

$$q(x_t\mid x_0)=\prod_{i=1}^{L} q\!\left(x_t^{(i)}\mid x_0^{(i)}\right),\qquad q\!\left(x_t^{(i)}\mid x_0^{(i)}\right)= \begin{cases} \alpha_t & \text{if } x_t^{(i)}=x_0^{(i)},\\[2pt] 1-\alpha_t & \text{if } x_t^{(i)}=[\textsc{mask}]. \end{cases}$$

The mask token is called an absorbing state because once a position becomes $[\textsc{mask}]$ the forward process never changes it again: the mask only ever “absorbs,” never releases. That single property is why the marginal above is so clean: there is no multi-step chain to unroll, and the fraction of tokens masked at time $t$ is $1-\alpha_t$ in expectation. The reverse process does the opposite: given a partially masked sequence, the network predicts the original token at each masked position, and we unmask some of them. Repeat, and the masks dissolve into text.
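
The forward corruption is short enough to write out. The sketch below assumes tokens are plain strings and uses a hypothetical helper `mask_tokens`; it masks a long dummy sequence and checks that the masked fraction lands near $1-\alpha_t$.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, alpha_t, rng):
    """Keep each token independently with probability alpha_t, else replace with MASK."""
    return [tok if rng.random() < alpha_t else MASK for tok in tokens]

rng = random.Random(0)
tokens = [f"tok{i}" for i in range(10_000)]   # dummy sequence, long enough to average
alpha_t = 0.3
noised = mask_tokens(tokens, alpha_t, rng)
masked_fraction = noised.count(MASK) / len(noised)
print(f"masked fraction: {masked_fraction:.3f} (expected about {1 - alpha_t:.1f})")
```

On a short sequence the realized fraction fluctuates around $1-\alpha_t$; only the average is pinned down by the schedule.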

The BERT connection, and training as a masked cross-entropy

If “mask some tokens and predict them” rings a bell, it should: that is exactly BERT-style masked language modeling. The relationship is precise and worth internalizing. BERT masks a fixed fraction (about 15%) of tokens and learns to fill them in, but it is only ever used to extract representations — you cannot generate from it by repeatedly running it, because it never learned to handle other mask rates and there is no principled multi-step procedure. A masked diffusion LM is BERT generalized in two ways: (1) it trains across the full range of mask rates, from almost-clean to almost-fully-masked, by sampling the noise level $t$ at random each step; and (2) the diffusion framework supplies a principled iterative reverse process that turns “fill in the blanks” into a real generative model. BERT is the special case of a single, fixed noise level used for representation learning rather than generation.

How do we train the reverse network? We cannot maximize the exact log-likelihood $\log p_\theta(x_0)$ — it requires marginalizing over all the intermediate noised states, which is intractable. So, exactly as in VAEs, we maximize a tractable lower bound, the ELBO (evidence lower bound): a quantity $\mathcal{L}\le\log p_\theta(x_0)$ that we can actually compute and that gets pushed up by gradient ascent. “Evidence” is the data log-likelihood; “lower bound” because Jensen's inequality leaves a non-negative gap. Maximizing the ELBO raises a floor under the true likelihood. The remarkable fact about masked diffusion is that this ELBO collapses into something a practitioner already knows — a cross-entropy loss on the masked positions only. The recipe per training example is just:

sample a noise level $t\sim\mathcal{U}(0,1)$;
mask each token independently with probability $1-\alpha_t$;
run the bidirectional transformer once and predict the original token at every masked position;
take cross-entropy on those positions, weighted by a factor that depends on $t$.

Written out, the continuous-time bound is an expectation over the noise level of a noise-level-dependent weight times the masked cross-entropy:

$$-\mathcal{L}_{\text{ELBO}}=\mathbb{E}_{t}\left[\frac{-\alpha_t'}{1-\alpha_t}\; \mathbb{E}_{x_t}\sum_{i:\,x_t^{(i)}=[\textsc{mask}]}-\log p_\theta\!\left(x_0^{(i)}\mid x_t\right)\right],$$

where $\alpha_t'=\mathrm{d}\alpha_t/\mathrm{d}t$. Because the schedule decreases, $\alpha_t'\le 0$, so the weight $-\alpha_t'/(1-\alpha_t)$ is non-negative and the whole expression is a non-negative loss; for the linear schedule $\alpha_t=1-t$ the weight is simply $1/t$. The weight is the bookkeeping that turns “predict the masked tokens” into a proper bound: it tracks how much probability mass is being unmasked per unit time, so errors at clean steps (few masks) and noisy steps (many masks) are scored on a common footing. Because each step asks the model to fill masks given both left and right neighbours, the model is natively bidirectional. That is the structural cure for the reversal curse, since the model trains to predict in every direction rather than only forward.
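
The four-step recipe can be sketched as one Monte-Carlo sample of the loss. This is a toy under stated assumptions, not a training loop: it assumes the linear schedule $\alpha_t = 1-t$ (so the weight has magnitude $|\alpha_t'|/(1-\alpha_t) = 1/t$), uses a hypothetical mask id of `-1`, clips $t$ away from zero with an arbitrary `t_min`, and stands in for the transformer with a predictor that outputs a uniform distribution everywhere.

```python
import math
import random

MASK = -1  # hypothetical id standing in for [MASK]

def uniform_predictor(vocab_size):
    """Stand-in for the network: a uniform distribution at every position."""
    def predict(x_t):
        return [[1.0 / vocab_size] * vocab_size for _ in x_t]
    return predict

def masked_diffusion_loss(x0, predict_probs, rng, t_min=0.01):
    """One sample of the weighted masked cross-entropy (linear schedule assumed)."""
    t = rng.uniform(t_min, 1.0)                               # 1. noise level
    x_t = [MASK if rng.random() < t else tok for tok in x0]   # 2. mask w.p. 1 - alpha_t = t
    probs = predict_probs(x_t)                                # 3. one bidirectional pass
    ce = sum(-math.log(probs[i][x0[i]])                       # 4. CE on masked positions
             for i, tok in enumerate(x_t) if tok == MASK)
    return ce / t                                             #    weighted by 1 / t

rng = random.Random(0)
x0 = [0, 3, 1, 2, 2, 0, 1, 3]          # toy sequence over a 4-token vocabulary
predict = uniform_predictor(vocab_size=4)
n = 20_000
avg_loss = sum(masked_diffusion_loss(x0, predict, rng) for _ in range(n)) / n
exact_nll = len(x0) * math.log(4)      # true negative log-likelihood of a uniform model
print(f"average weighted loss {avg_loss:.2f} vs exact NLL {exact_nll:.2f}")
```

The uniform predictor makes a useful sanity check: at any noise level the expected number of masked positions is $tL$, so dividing by $t$ gives $L\log V$, which is exactly the negative log-likelihood of a uniform model. The averaged loss should land close to that value.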

A more theory-forward variant, SEDD (score-entropy discrete diffusion, Lou et al. 2023), instead of predicting masked tokens, learns the concrete score — ratios of marginal probabilities $p(y)/p(x)$ between neighbouring discrete states. This is the discrete analogue of the continuous score $\nabla_x\log p$ from the image case: it tells the sampler how much more probable one nearby state is than another. In the absorbing-state limit, the score-entropy objective reduces back to the masked cross-entropy above, so masked diffusion and SEDD are two faces of the same coin — one practical, one principled.

Sampling: start all-masked, unmask the confident ones first

Sampling produces a length-$L$ answer (lengths are fixed per block, or chosen up front). Start from all $[\textsc{mask}]$. At each reverse step the model predicts every masked position at once, but we only commit a few and re-mask the rest, because committing all of them simultaneously ignores the dependencies between positions: “the” and “cat” are predicted independently, and a confident-but-incoherent joint can result (each token plausible alone, the combination nonsense). Which tokens to commit? Confidence-based remasking: after the forward pass, keep the positions where the predicted distribution is sharpest (highest probability / lowest entropy) and re-mask the rest for the next round. This is easiest-first decoding, and it consistently beats committing random positions, because locking in the model's surest guesses gives the harder positions more context to lean on next round.
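
The control flow of confidence-based unmasking can be sketched as follows. The denoiser here is a hypothetical stand-in (`toy_model`) that returns a random token and a random confidence per position; a real model would return its most likely token and that token's probability. Only the loop structure is the point.

```python
import random

MASK = None

def toy_model(seq, vocab, rng):
    """Stand-in denoiser: one (token, confidence) guess per position."""
    return [(rng.choice(vocab), rng.random()) for _ in seq]

def confidence_unmask(length, per_step, vocab, rng):
    """Start all-masked; each pass, commit the per_step most confident masked positions."""
    seq = [MASK] * length
    passes = 0
    while MASK in seq:
        preds = toy_model(seq, vocab, rng)     # one parallel forward pass
        passes += 1
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:            # commit the surest guesses
            seq[i] = preds[i][0]
        # every other masked position stays masked and is re-predicted next pass
    return seq, passes

rng = random.Random(0)
sample, num_passes = confidence_unmask(length=20, per_step=4,
                                       vocab=["the", "cat", "sat", "down"], rng=rng)
print(num_passes, sample)
```

Committed positions are never revisited in this sketch; variants that allow re-masking already committed tokens exist but are not shown here.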

Hands-on · a masking / unmasking schedule

Take $L=20$ tokens and a linear schedule, $\alpha_t = 1-t$, so the expected masked fraction at time $t$ is $t$. At $t=0.5$, the expected number masked is $0.5\times 20 = 10$ tokens; at $t=0.25$ it is $5$. Now run the reverse process to generate, unmasking $4$ tokens per step from an all-masked start:

step	0 (start)	1	2	3	4	5
unmasked so far	0	4	8	12	16	20
still masked	20	16	12	8	4	0

After step 5 all $20$ are filled, so $\lceil 20/4\rceil = 5$ steps suffice. The headline: $5$ sequential forward passes instead of $20$ for an AR model — a $4\times$ cut in sequential dependency. The general rule for $L$ tokens at $r$ unmasked per step is $\lceil L/r\rceil$ steps; smaller $r$ means more steps but more refinement (each remaining position sees more committed context before it must commit).
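
The $\lceil L/r\rceil$ rule and the schedule it implies are easy to tabulate. A small sketch, with a made-up helper name:

```python
import math

def reverse_schedule(length, per_step):
    """Rows of (step, unmasked so far, still masked) for a fixed commit size."""
    steps = math.ceil(length / per_step)
    rows = []
    for s in range(steps + 1):
        unmasked = min(s * per_step, length)   # the last step may commit fewer tokens
        rows.append((s, unmasked, length - unmasked))
    return rows

for row in reverse_schedule(length=20, per_step=4):
    print(row)
# (0, 0, 20), (1, 4, 16), (2, 8, 12), (3, 12, 8), (4, 16, 4), (5, 20, 0)
```

With `per_step=3` the same call gives $\lceil 20/3\rceil = 7$ steps, the last of which commits only the two remaining tokens.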

The number of steps is the master dial. More steps means committing fewer tokens per step, hence more refinement and higher quality — this is exactly the knob behind inference-time scaling: spend more compute (more steps) to get better samples, with no retraining. The flip side is the practitioner's classic failure: good text at 64 steps, garbled at 8. Too few steps forces too many independent commitments at once, the parallel approximation breaks, and the output reads like several half-sentences welded together; the cure is more steps or smaller commit batches. With $S$ steps for $L$ tokens, the rough speedup over AR is $L/S$ when $S\ll L$ — but each diffusion pass processes the whole sequence (more FLOPs per pass than an AR decode step with a KV-cache), so diffusion wins on latency (fewer sequential passes) while often costing more total compute. That is the central trade: fewer sequential steps, more work per step.
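
The trade can be put in rough numbers. The sketch below uses a deliberately crude cost model that is an assumption of this example, not a measurement: "work" is counted as the number of token positions pushed through the network, an AR decoder with a KV-cache is charged one position per pass, and a diffusion pass is charged all $L$ positions. It ignores attention over the cached context, batching, and hardware utilization.

```python
def sequential_passes(length, steps=None):
    """AR needs one pass per token; diffusion needs one pass per step."""
    return length if steps is None else steps

def token_evaluations(length, steps=None):
    """Crude work proxy: token positions processed across all passes."""
    return length if steps is None else steps * length

L, S = 200, 16
speedup = sequential_passes(L) / sequential_passes(L, S)       # L / S
work_ratio = token_evaluations(L, S) / token_evaluations(L)    # S under this proxy
print(f"sequential speedup {speedup:.1f}x, work ratio {work_ratio:.0f}x")
# sequential speedup 12.5x, work ratio 16x
```

Under this proxy the latency win scales as $L/S$ while the extra work scales as $S$, which is why sweeping $S$ moves a model along the latency and compute trade-off rather than improving both at once.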

Masked diffusion $=$ train a bidirectional transformer to fill in $[\textsc{mask}]$ tokens at every noise level (a weighted masked cross-entropy that is secretly an ELBO), then generate by starting all-masked and iteratively unmasking the most confident positions. AR trades $L$ sequential steps for an exact left-to-right factorization; diffusion trades exactness for $S\ll L$ parallel steps and full bidirectional context.

A lineage: non-autoregressive machine translation

Diffusion did not invent parallel text generation; it inherited it. Non-autoregressive translation (NAT, Gu et al. 2017) tried to emit all target tokens in a single parallel pass and ran straight into the core difficulty: because the tokens are predicted independently, the decoder cannot coordinate, and you get the “multimodality” failure — “thank you” and “many thanks” both fit, so the parallel model averages them into “thank thanks.” The fixes that followed are precisely the moves diffusion formalized: iterative refinement (Lee et al. 2018) decodes in a few passes instead of one, and Mask-Predict / CMLM (Ghazvininejad et al. 2019) masks the lowest-confidence tokens and re-predicts them over several rounds — confidence-based remasking, years before LLaDA. Early NAT also leaned on sequence-level knowledge distillation: train the parallel model on the outputs of an AR teacher to simplify the target distribution. Masked diffusion LMs are this lineage, scaled up and given a probabilistic foundation.
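
The multimodality failure can be made concrete with the two translations from the example. The 50/50 probabilities below are invented for illustration. The sketch compares the true joint distribution with what a single-pass parallel decoder can represent, namely a product of per-position marginals.

```python
from itertools import product

# Toy target distribution: two equally good translations (probabilities assumed).
joint = {("thank", "you"): 0.5, ("many", "thanks"): 0.5}

def marginals(joint):
    """Per-position token probabilities implied by the joint."""
    length = len(next(iter(joint)))
    out = [dict() for _ in range(length)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            out[i][tok] = out[i].get(tok, 0.0) + p
    return out

def factorized(joint):
    """Distribution of a decoder that samples every position independently."""
    result = {}
    for combo in product(*[m.items() for m in marginals(joint)]):
        seq = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        result[seq] = prob
    return result

independent = factorized(joint)
for seq, p in sorted(independent.items()):
    print(" ".join(seq), p, "(true joint:", joint.get(seq, 0.0), ")")
```

The independent decoder puts probability $0.25$ on “thank thanks” and another $0.25$ on “many you,” although the true joint gives both zero. Iterative refinement fixes this by committing one position first, so the next pass conditions on it.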

The model zoo and the AR $\leftrightarrow$ diffusion spectrum

Real systems span a spectrum between fully sequential and fully parallel. LLaDA (8B) is a full-attention masked diffusion model trained from scratch on 2.3T tokens that reaches quality competitive with LLaMA3-8B (which trained on $\sim$ 15T) — proof the recipe scales without any RLHF. Dream (7B) reaches similar quality more cheaply by initializing from an AR checkpoint (Qwen2.5-7B), reusing the language knowledge already baked into AR weights so the any-order objective has less to learn. There are also sparse variants (an MoE diffusion LM routes each token to a few experts, combining conditional compute with parallel decoding). On the product side, Mercury (Inception Labs) and Gemini Diffusion (Google DeepMind) are commercial diffusion LMs whose headline use case is fast code generation: Mercury reports $1000+$ tokens/second on an H100 with Mercury Coder around $88\%$ on HumanEval, and Gemini Diffusion advertises similar speed — the existence proof that diffusion text models are production-viable, especially for code, where fill-in-the-middle (write the body given the signature and the return) is a natural fit for bidirectional infilling.

The most important practical hybrid is Block Diffusion (semi-autoregressive). Pure full-attention diffusion has two real weaknesses: with bidirectional attention over the whole sequence you cannot reuse a KV-cache, because every step re-reads every position, and the output length has to be fixed up front, which makes variable-length output awkward. Block diffusion splits the sequence into blocks, generates blocks autoregressively (left-to-right, so earlier blocks are frozen and cacheable) but uses diffusion within each block (parallel refinement). This recovers the KV-cache and variable length while keeping intra-block parallelism. It sits exactly between AR and full diffusion, and is how any-order (order-agnostic) generation gets reconciled with caching. “Any-order” simply means the model is not committed to a fixed left-to-right factorization: it can fill positions in whatever order confidence dictates, which is what makes infilling and editing natural, and also what defeats a KV-cache.
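
The two nested loops of block diffusion can be sketched as below. As with any toy of this kind, the denoiser is a hypothetical stand-in returning random tokens and confidences, and no real KV-cache is implemented; the comments only mark where caching would apply. Stopping early on an end-of-sequence token, which is how variable length would be handled, is omitted.

```python
import random

MASK = None

def toy_denoiser(prefix, block, vocab, rng):
    """Stand-in: one (token, confidence) guess per block position.
    A real model would attend to the cached prefix plus the whole block."""
    return [(rng.choice(vocab), rng.random()) for _ in block]

def block_diffusion_generate(num_blocks, block_size, per_step, vocab, rng):
    prefix = []      # finished blocks: frozen, so their keys/values could be cached
    passes = 0
    for _ in range(num_blocks):              # outer loop: autoregressive over blocks
        block = [MASK] * block_size
        while MASK in block:                 # inner loop: parallel refinement in the block
            preds = toy_denoiser(prefix, block, vocab, rng)
            passes += 1
            masked = sorted((i for i, tok in enumerate(block) if tok is MASK),
                            key=lambda i: preds[i][1], reverse=True)
            for i in masked[:per_step]:
                block[i] = preds[i][0]
        prefix.extend(block)                 # freeze the block and move right
    return prefix, passes

rng = random.Random(0)
text, total_passes = block_diffusion_generate(
    num_blocks=4, block_size=8, per_step=4, vocab=["a", "b", "c"], rng=rng)
print(len(text), total_passes)   # 32 tokens in 4 * 2 = 8 passes
```

Setting `block_size=1` degenerates to ordinary AR decoding, and a single block covering the whole sequence is full diffusion, which is the spectrum described above.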

Where AR still wins, and what to watch for

AR is not going anywhere. Its left-to-right factorization is exact, so AR models hit lower perplexity per token and remain the default for tasks where every token depends tightly on the last (tight reasoning chains, long structured outputs). AR also enjoys a mature KV-cache that makes per-step decoding cheap, a huge body of RLHF/RLVR tooling, and no awkwardness around choosing the output length up front. Diffusion's edge is concentrated where its assumptions pay off: low-latency generation of moderate-length outputs, infilling/editing, and code. Three practical hazards recur and connect directly to this topic's questions:

Alignment is harder. Running RLHF/DPO on a diffusion LM means differentiating through the ELBO, whose Monte-Carlo estimate (random $t$ , random masks) is high variance; naive preference optimization shows exploding gradients. The fix is variance-reduced schemes (e.g. VRPO): timestep-aware sampling and antithetic noise — paired, correlated mask draws whose errors cancel — to tame the estimator.
Quasi-AR collapse. A model that can use right-context may learn to ignore it and decode essentially left-to-right, throwing away the main advantage. Detect it by ablating right-context at inference and checking whether quality drops; if it does not, the model never relied on it.
The steps/quality dial is the master knob. Pick the default by sweeping step counts against a quality metric and reading off the knee of the curve: too few steps yields garbled parallel commitments, too many wastes the latency win that justified diffusion in the first place.
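
Why paired draws tame the estimator can be seen in a toy experiment. This is not VRPO itself. It is a generic common-random-numbers illustration with two invented per-noise-level loss curves, `loss_a` and `loss_b`, standing in for the quantities a preference objective would compare. The difference of their expectations is estimated either with independent noise levels or with one shared noise level per pair.

```python
import random
import statistics

def loss_a(t):
    """Hypothetical per-noise-level loss of one model (blows up at small t)."""
    return 1.0 / (t + 0.1)

def loss_b(t):
    """Hypothetical loss of a second, similar model."""
    return 1.0 / (t + 0.1) + 0.2 * t

def diff_estimate(n, rng, paired):
    """Monte-Carlo estimate of E[loss_a(t)] - E[loss_b(t)] from n draws."""
    total = 0.0
    for _ in range(n):
        t_a = rng.random()
        t_b = t_a if paired else rng.random()   # share the draw, or not
        total += loss_a(t_a) - loss_b(t_b)
    return total / n

rng = random.Random(0)
independent_runs = [diff_estimate(8, rng, paired=False) for _ in range(2000)]
paired_runs = [diff_estimate(8, rng, paired=True) for _ in range(2000)]
var_independent = statistics.variance(independent_runs)
var_paired = statistics.variance(paired_runs)
print(f"variance: independent {var_independent:.4f}, paired {var_paired:.6f}")
```

With shared draws the large, noisy $1/(t+0.1)$ term cancels exactly inside each pair and only the small $0.2\,t$ difference is left to estimate, so the variance drops by orders of magnitude while the estimate still centers on the same value ($-0.1$ here).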

Finally, a security wrinkle: bidirectional infilling creates a “fill-in-the-middle” jailbreak surface — fixing a benign prefix and suffix and letting the model fill the gap can route around guardrails an AR model would catch, precisely because the model was trained to satisfy both sides at once. Keep the spectrum picture in mind — AR at one end, full diffusion at the other, block diffusion in between — and the detailed questions on ELBO derivations, confidence remasking, the wall-clock crossover, VRPO, and the rest will read as variations on parts you have already met.
