Chapter 07 · Part II · Pretraining & Scale

Mixture-of-Experts

8 practice sets · 4 coding problems

A Mixture-of-Experts (MoE) is a transformer that carries a huge number of parameters but spends only a small slice of them on any one token. That single sentence is the whole idea, and it answers a question that haunts everyone who builds language models: how do you make a model know more without making it cost more to run? In a plain (“dense”) model those two things are welded together — more knowledge means more weights, and every extra weight is multiplied on every token, so the bill for each word you generate climbs in lockstep with the model's size. MoE pries them apart. This mini-chapter builds the idea from nothing, assuming only that you have met a transformer block and its feed-forward network before (Topic 1). By the end you should be able to follow a token as it enters an MoE layer, gets scored by a router, is sent to just a few experts, and comes back out — and you should understand routing and gating, top- $k$ selection, the all-important gap between total and active parameters, load balancing and router collapse, shared and fine-grained experts, and the distributed-systems twist (all-to-all communication) that makes MoE powerful but finicky. Every later question in this topic — DeepSeek-V3's $671$ B/ $37$ B design, FLOP-conservation proofs, collapse diagnoses — is a variation on the parts we assemble here.

The motivation: paying for capacity you don't use

Recall the anatomy of a transformer block (Topic 1): an attention sublayer, where tokens exchange information across positions, and a feed-forward network (FFN, also called the MLP), which transforms each token's vector on its own, with no cross-token interaction. The FFN is where a model does most of its per-token “thinking,” and it is also where most of the parameters live — typically two-thirds or more of the non-embedding weights. So if you want a smarter model, the FFN is the natural place to add capacity.

Here is the problem. A dense model runs every parameter on every token. Widen the FFN to hold twice as many weights and you have done two things at once: doubled its capacity and doubled the arithmetic it performs per token. Capacity and cost rise together, glued. For a $70$B-parameter dense model, generating a single token costs roughly $70$ billion multiply-adds, or about $140$ billion floating-point operations (one multiply and one add per weight), on every token, forever. That is the floor you accept the moment you choose $70$B parameters.

But do you really need all of that machinery for every token? When the model processes the word def in a Python snippet, the circuitry it has learned for conjugating French verbs sits idle but still gets multiplied through. When it reads a line of poetry, its knowledge of organic-chemistry nomenclature contributes nothing yet still runs. A single token rarely needs more than a sliver of the model's specialized knowledge at once. That observation is the seed of the whole topic.

Conditional computation: the core principle

The general name for “activate only the part of the network that is relevant to this input” is conditional computation. A dense network is unconditional: its computation graph is the same for every input, so every weight is touched every time. A conditionally computed network makes the path through itself depend on the data — different inputs light up different sub-networks.

MoE is the dominant way to realize this in modern LLMs. The recipe is simple: in some or all of the transformer blocks, replace the single FFN with a collection of $E$ parallel FFNs — the experts — and add a small learned router that, for each token, picks a few experts to run and skips the rest. Each expert is an ordinary FFN with its own independent weights; nothing about an expert is special except that it is one of many. Crucially, MoE touches only the FFN sublayer. The attention sublayer, the norms, the residual stream — all unchanged. (This is why MoE and the attention-efficiency tricks of Topic 3 stack cleanly: they modify different parts of the block.)

A dense FFN spends every parameter on every token. An MoE layer holds $E$ expert FFNs but runs only $k\ll E$ of them per token, chosen by a router. This decouples total parameters (which set capacity and memory, $\propto E$ ) from active parameters and FLOPs (which set per-token cost, $\propto k$ ). The slogan is “more parameters, roughly the same compute per token.” MoE replaces the FFN, not the attention.

The router and the gate

The router (also called the gating network) is the brain of the layer. In its standard form it is astonishingly small: a single learned matrix $W_g\in\mathbb{R}^{d\times E}$ that maps a token's hidden vector $x\in\mathbb{R}^d$ to one routing logit (or affinity score) per expert,

$$ s = W_g^{\top} x \in \mathbb{R}^{E}, \qquad s_i = \text{how well expert } i \text{ suits token } x . $$

That is one tiny matrix-vector product per token — negligible next to the experts themselves. From the affinities $s$ the router must do two separate things: choose which experts run, and weight how much each chosen expert's output counts.

Top-$k$ routing handles the choice: keep the $k$ experts with the largest affinities and discard the rest. Classic designs use $k=1$ or $k=2$ (fine-grained designs use more, such as the top-$8$ of DeepSeek-V3), while $E$ ranges from $8$ up into the hundreds. Write $\mathcal{T}=\operatorname{top\text{-}}k(s)$ for the chosen set. The router then produces a gate weight $g_i$ for each selected expert, commonly by a softmax taken only over the selected affinities (so the weights of the running experts sum to one):

$$ g_i = \frac{e^{s_i}}{\sum_{j\in\mathcal{T}} e^{s_j}}\quad (i\in\mathcal{T}),\qquad \sum_{i\in\mathcal{T}} g_i = 1 . $$

Finally, the layer's output is the gate-weighted sum of the chosen experts' outputs — experts outside $\mathcal{T}$ contribute exactly nothing:

$$ y \;=\; \sum_{i\in\mathcal{T}} g_i\,\mathrm{FFN}_i(x). $$

This $y$ is then added back into the residual stream, exactly as a dense FFN's output would be. The gates earn their keep twice over. They weight the blend, letting the layer lean more on a strongly matched expert. And because they are differentiable in $s$, they are, in this basic setup, the only channel through which gradients from the language-modeling loss reach the router: when an expert that was scaled by gate $g_i$ produces a useful output, the gradient flows back through $g_i$ and nudges $W_g$ to score that expert higher for similar tokens next time. The hard top-$k$ “pick” itself is a discrete choice and therefore non-differentiable, so the soft gate is what makes the router learnable at all.
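To make that gradient path concrete, here is a short derivation under the definitions above (softmax gates restricted to $\mathcal{T}$, expert outputs held fixed, $\mathcal{L}$ standing for the training loss). It is only the standard softmax-derivative calculation applied to the output equation, offered as a sketch rather than a description of any particular implementation.

$$
\frac{\partial g_i}{\partial s_j} = g_i\,(\delta_{ij} - g_j)\quad (i,j\in\mathcal{T}),
\qquad
\frac{\partial y}{\partial s_j} = \sum_{i\in\mathcal{T}} \frac{\partial g_i}{\partial s_j}\,\mathrm{FFN}_i(x) = g_j\,\big(\mathrm{FFN}_j(x) - y\big)\quad (j\in\mathcal{T}),
$$

$$
\frac{\partial \mathcal{L}}{\partial s_j} = g_j\,\Big\langle \frac{\partial \mathcal{L}}{\partial y},\ \mathrm{FFN}_j(x) - y \Big\rangle \quad (j\in\mathcal{T}),
\qquad
\frac{\partial \mathcal{L}}{\partial s_j} = 0 \quad (j\notin\mathcal{T}).
$$

Two things follow. Unselected experts receive no signal through this path, which is part of why imbalance can feed on itself. And with $k=1$ the restricted softmax gives $g_j=1$ and $y=\mathrm{FFN}_j(x)$, so the gradient vanishes; top-1 routers such as the Switch Transformer therefore take the gate from a softmax over all $E$ experts instead.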

Hands-on · a 4-expert, top-2 router by hand

A token arrives at an MoE layer with $E=4$ experts and top- $2$ routing. The router produces affinity logits

$$ s = (s_1,s_2,s_3,s_4) = (2.0,\ 1.0,\ 3.0,\ 0.0). $$

Step 1: softmax the logits (just to see the full distribution; the math uses $e^{2}\approx 7.39,\ e^{1}\approx 2.72,\ e^{3}\approx 20.09,\ e^{0}=1$, summing to $\approx 31.2$):

$$ \operatorname{softmax}(s)\approx(0.237,\ 0.087,\ 0.644,\ 0.032). $$

Step 2: pick the top-2. The two largest affinities are $s_3=3.0$ and $s_1=2.0$, so $\mathcal{T}=\{3,1\}$. Experts $2$ and $4$ are dropped.

Step 3: renormalize the gates over only the survivors. We do not reuse the full-softmax numbers above; we softmax again over just $\{s_3,s_1\}$:

$$ g_3=\frac{e^{3}}{e^{3}+e^{2}}=\frac{20.09}{20.09+7.39}\approx 0.731,\quad g_1=\frac{e^{2}}{e^{3}+e^{2}}\approx 0.269, $$

and indeed $g_1+g_3=1$.

Step 4: combine. Suppose expert 3 and expert 1 return the output vectors

$$ \mathrm{FFN}_3(x)=(1.0,\,0.0),\qquad \mathrm{FFN}_1(x)=(0.0,\,2.0). $$

The layer output is the gate-weighted sum:

$$ y = 0.731\,(1.0,0.0) + 0.269\,(0.0,2.0) = (0.731,\ 0.538). $$

That is the entire forward pass of an MoE layer for one token: score, pick top-2, renormalize, blend. Notice we ran $2$ of the $4$ experts, which is half the FFN compute of running all four, yet the layer stores $4\times$ the FFN parameters of a single expert.
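The four steps can be checked with a few lines of plain Python. This is a minimal sketch, not a production layer: the names `softmax` and `moe_forward` are made up for illustration, the experts are stand-in functions that return the fixed vectors from the example, and the router matrix and input are chosen only so that $W_g^{\top}x$ reproduces the logits $(2, 1, 3, 0)$.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, W_g, experts, k):
    """One token through a toy MoE layer: score, pick top-k, renormalize, blend.

    x       : list of d floats (the token's hidden vector)
    W_g     : d x E router matrix, given as a list of d rows
    experts : list of E callables, each mapping x to an output vector
    k       : number of experts to run
    Returns (y, gates), where gates maps 0-based expert index to gate weight.
    """
    num_experts = len(experts)
    # Routing logits s = W_g^T x, one score per expert.
    s = [sum(W_g[r][i] * x[r] for r in range(len(x))) for i in range(num_experts)]
    # Hard top-k choice.
    chosen = sorted(range(num_experts), key=lambda i: s[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the gates sum to one.
    gates = dict(zip(chosen, softmax([s[i] for i in chosen])))
    # Only the chosen experts are ever called.
    outputs = {i: experts[i](x) for i in chosen}
    dim = len(outputs[chosen[0]])
    y = [sum(gates[i] * outputs[i][j] for i in chosen) for j in range(dim)]
    return y, gates

# Worked example: logits (2, 1, 3, 0), top-2.
x = [1.0, 0.0]
W_g = [[2.0, 1.0, 3.0, 0.0],   # with x = (1, 0) this row is exactly s
       [0.0, 0.0, 0.0, 0.0]]
experts = [
    lambda x: [0.0, 2.0],   # expert 1 in the text (index 0)
    lambda x: [9.0, 9.0],   # expert 2, never run
    lambda x: [1.0, 0.0],   # expert 3 in the text (index 2)
    lambda x: [9.0, 9.0],   # expert 4, never run
]
y, gates = moe_forward(x, W_g, experts, k=2)
print({i + 1: round(g, 3) for i, g in gates.items()})  # {3: 0.731, 1: 0.269}
print([round(v, 3) for v in y])                        # [0.731, 0.538]
```

Real implementations batch this over many tokens and dispatch to experts in groups; the per-token arithmetic is the same.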

Total vs. active parameters — the central bookkeeping

The single most important number-sense in MoE is the gap between total and active parameters, so it is worth saying slowly. Total parameters count every weight stored in the model. They set the memory footprint (you must hold them all) and, loosely, the capacity: how much the model can know. Active parameters count only the weights actually touched while processing one token. With the standard rule of thumb that a forward-plus-backward pass costs about $6$ FLOPs per active parameter per token ($C\approx 6ND$, with $N$ the active parameter count and $D$ the number of tokens), and a forward-only pass about $2$, it is the active count that sets your training and inference compute bill.

A toy case makes it concrete. Suppose each expert is a copy of the dense FFN with $P$ parameters, there are $E=8$ experts, and we route top- $1$ . The MoE FFN then stores $8P$ parameters in total but activates only $1P$ per token: an $8{:}1$ total-to-active ratio. Capacity grew $8\times$ ; per-token FFN compute is unchanged. With top- $2$ you would activate $2P$ , an $8{:}2=4{:}1$ ratio. This ratio is so central it has a name — the sparsity, total $/$ active — and the clear recent trend is toward sparser models (higher ratios) because, holding active compute fixed, adding more idle experts reliably lowers loss.

Now the real designs fall into place:

Mixtral 8x7B (Mistral): $8$ experts per layer, top- $2$ routing, SwiGLU experts. About $46.7$ B parameters total but only $\sim\!12.9$ B active per token — so it generates text at roughly the speed and cost of a $13$ B dense model while reaching toward the quality of something much larger. (The total is not $8\times7$ B because attention and embeddings are shared, not multiplied.)
DeepSeek-V3: $671$ B parameters total but only $\sim\!37$ B active per token — a sparsity near $18{:}1$ . It has roughly the knowledge capacity of a $671$ B model at roughly the per-token cost of a $37$ B one.

This is precisely the True statement the warm-up asks you to confirm: MoE lets you grow capacity without proportionally growing per-token compute.
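The bookkeeping above is simple enough to script. The sketch below uses hypothetical helper names (`moe_ffn_params`, `sparsity`, `train_flops`) and only the figures already quoted in this section; the whole-model ratios include attention and embeddings, so they are not the same thing as the per-layer FFN ratio of the toy case.

```python
def moe_ffn_params(params_per_expert, num_experts, top_k):
    """Return (total, active) FFN parameter counts for one MoE layer."""
    total = num_experts * params_per_expert
    active = top_k * params_per_expert
    return total, active

def sparsity(total, active):
    """Total-to-active ratio, as defined in the text."""
    return total / active

def train_flops(active_params, tokens):
    """Rule-of-thumb training compute, C ~ 6 * N * D, with N the active count."""
    return 6 * active_params * tokens

# Toy case: E = 8 experts, measured in units of P (one expert's parameters).
P = 1
print(moe_ffn_params(P, 8, 1), sparsity(*moe_ffn_params(P, 8, 1)))  # (8, 1) 8.0
print(moe_ffn_params(P, 8, 2), sparsity(*moe_ffn_params(P, 8, 2)))  # (8, 2) 4.0

# Whole-model figures quoted above, in billions of parameters.
print(round(sparsity(46.7, 12.9), 2))  # Mixtral 8x7B: 3.62
print(round(sparsity(671, 37), 2))     # DeepSeek-V3: 18.14
```

Because `train_flops` depends only on the active count, two models with the same active parameters and the same token budget get the same estimate, however many idle experts one of them carries.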

There is a crucial caveat for serving, and it flips the usual intuition on its head. Even though only $k$ experts run per token, all experts must be resident in memory, because the next token might route to any of them. So MoE buys you cheaper compute at the price of a larger memory footprint, the reverse of a lever like activation checkpointing, which spends extra compute to save memory. This is why MoE inference tends to be memory-bound rather than compute-bound, and why expert offloading (parking rarely-used experts on slower CPU memory and fetching them on demand) is a real serving tactic. We return to this at the close.
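A back-of-the-envelope calculation shows how the footprint follows the total count rather than the active one. The helper name `weight_memory_gb` and the choice of 2 bytes per parameter (16-bit weights) are assumptions for illustration; the estimate covers weights only and ignores the KV cache, activations, and any quantization.

```python
def weight_memory_gb(total_params, bytes_per_param):
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return total_params * bytes_per_param / 1e9

# Mixtral 8x7B figures from the text, assuming 16-bit weights.
resident = weight_memory_gb(46.7e9, 2)      # every expert must be loaded
dense_equiv = weight_memory_gb(12.9e9, 2)   # a dense model of the active size
print(round(resident, 1), round(dense_equiv, 1))  # 93.4 25.8
```

Under these assumptions the model computes like a roughly 13B dense model but occupies memory like a roughly 47B one.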

Capacity, overflow, and token dropping

A subtle problem appears the moment you process tokens in batches rather than one at a time. Hardware loves dense, fixed-shape tensors; it hates ragged ones. But routing is data-dependent, so in any given batch one expert might be chosen by $300$ tokens and another by $5$ . To keep the math on regular tensors, each expert is given a fixed-size buffer — its expert capacity — and the dispatch fills these buffers. Capacity is set by a capacity factor $c$ (commonly $1.0$ to $1.25$ ) relative to the perfectly-balanced load:

$$ \text{capacity} \;=\; \Big\lceil c\cdot \frac{k\,T}{E}\Big\rceil , $$

where $T$ is the number of tokens in the batch, so $kT$ is the total number of routing slots (each token claims $k$ of them) spread over $E$ experts. If routing is uneven and more than “capacity” tokens select the same expert, the surplus tokens overflow. The usual reaction is dropping: the overflow is discarded, so those tokens skip that expert and it contributes nothing to their output at this layer (with top-$1$ routing, the whole FFN update is skipped). Importantly, the residual connection still carries the token forward unchanged, so a dropped token is not lost from the sequence; it just misses one FFN update. Dropping saves compute but loses information. The mirror-image cost is padding: any under-filled expert's buffer is topped up with dummy slots so all shapes stay fixed, which wastes compute on slots that hold no real token. A higher capacity factor $c$ means fewer drops but more padding, and a lower $c$ the reverse: a direct capacity-factor $\leftrightarrow$ dropping $\leftrightarrow$ compute trade-off.
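A small sketch makes the buffer mechanics visible at the level of individual tokens. The function names are hypothetical, routing is top-1, and the first-come-first-served rule for deciding which tokens overflow is an assumption made for simplicity; real systems may prioritize differently.

```python
import math

def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """capacity = ceil(c * k * T / E), the per-expert buffer size."""
    return math.ceil(capacity_factor * top_k * num_tokens / num_experts)

def dispatch(assignments, num_experts, capacity):
    """Fill fixed-size expert buffers from top-1 assignments, in token order.

    assignments : assignments[t] is the expert chosen by token t
    Returns (buffers, dropped, padding): buffers[e] lists the token ids that
    expert e will process, dropped lists the tokens that overflowed, and
    padding counts the unused (dummy) slots across all buffers.
    """
    buffers = [[] for _ in range(num_experts)]
    dropped = []
    for token, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token)
        else:
            dropped.append(token)  # skips this expert; the residual carries it on
    padding = sum(capacity - len(b) for b in buffers)
    return buffers, dropped, padding

# 8 tokens, 4 experts, top-1, c = 1.0, so each expert gets 2 slots.
assignments = [0, 0, 0, 1, 2, 0, 3, 1]
cap = expert_capacity(len(assignments), 4, 1, 1.0)
buffers, dropped, padding = dispatch(assignments, 4, cap)
print(cap)       # 2
print(buffers)   # [[0, 1], [3, 7], [4], [6]]
print(dropped)   # [2, 5]
print(padding)   # 2
```

Expert 0 was chosen four times but holds only two tokens, so two are dropped, while experts 2 and 3 each carry one dummy slot.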

Hands-on · capacity and a dropped-token count

Take $E=8$ experts, $T=1024$ tokens, top-$1$ routing, and $c=1.25$. The balanced load is $kT/E = 1024/8 = 128$ tokens per expert, so each expert's capacity is $\lceil 1.25\times 128\rceil = 160$ slots. Now suppose routing is skewed and one popular expert attracts $40\%$ of the batch, i.e. $0.40\times 1024 \approx 410$ tokens. It can process only $160$ of them, so it drops

$$ 410 - 160 = 250 \text{ tokens}, $$

nearly a quarter of the whole batch silently losing its FFN update at this layer. That is the everyday cost of imbalance — and the reason the next section exists. (Conversely, in a well-balanced model with a sane capacity factor, drop rates are routinely below $1\%$ and do little harm.)
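To see the capacity-factor trade-off on this same batch, the sketch below sweeps $c$ and counts dropped tokens against padded slots. The helper name is hypothetical, and the way the remaining 614 tokens are spread over the other seven experts is an assumption, since the example only fixes the popular expert's load.

```python
import math

def drops_and_padding(loads, capacity_factor):
    """Given per-expert routed-token counts, return (capacity, dropped, padded)."""
    num_experts = len(loads)
    total_slots = sum(loads)  # equals k * T
    capacity = math.ceil(capacity_factor * total_slots / num_experts)
    dropped = sum(max(0, n - capacity) for n in loads)
    padded = sum(max(0, capacity - n) for n in loads)
    return capacity, dropped, padded

# One expert takes 410 of 1024 tokens; the rest are split near-evenly (assumed).
loads = [410] + [88] * 5 + [87] * 2
for c in (1.0, 1.25, 2.0, 3.25):
    print(c, drops_and_padding(loads, c))
# 1.0  (128, 282, 282)
# 1.25 (160, 250, 506)
# 2.0  (256, 154, 1178)
# 3.25 (416, 0, 2304)
```

Eliminating drops entirely under this skew needs $c$ above $3$, at which point more than two-thirds of all slots are padding. Fixing the imbalance is far cheaper than buying capacity to absorb it.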

Load balancing: keeping experts busy and alive

Why would routing be skewed in the first place? Because nothing in the basic setup forces balance, and imbalance is self-reinforcing. Early in training a few experts happen to get slightly more traffic; more traffic means more gradient updates, so they improve faster; improving faster makes them look better to the router, so they attract still more traffic. It is a rich-get-richer loop. Its pathological endpoint is router collapse (or expert collapse): the router funnels nearly all tokens to a handful of experts, while the rest are starved of gradient and end up undertrained and effectively dead. You paid for $E$ experts and got the capacity of a few; the others are dead weight, and meanwhile the popular experts overflow and drop tokens. Collapse defeats the entire point of MoE.

The auxiliary load-balancing loss. The classic fix is to add a small extra term to the training objective that pushes expert usage toward uniform. In the canonical Switch Transformer form, for $E$ experts let $f_i$ be the fraction of tokens dispatched to expert $i$ and $P_i$ the mean routing probability the router assigned to expert $i$ , both averaged over the batch. The loss is

$$ \mathcal{L}_{\text{aux}} \;=\; \alpha\,E\sum_{i=1}^{E} f_i\,P_i , $$

scaled by a small coefficient $\alpha$ and added to the language-modeling loss. Two facts make this a well-chosen object. First, under perfectly uniform routing $f_i=P_i=1/E$, so $E\sum_i f_iP_i = E\cdot E\cdot \tfrac{1}{E^2}=1$ (ignoring $\alpha$): the loss bottoms out at $1$, which doubles as a convenient health gauge. A value near $1$ means balanced; larger means skewed. Second, the two factors play different roles. The dispatch fraction $f_i$ comes from a hard top-$k$ count, which is non-differentiable, so it carries no gradient; it is just a measured load. The probability $P_i$ is smooth, so the gradient flows entirely through it. The product $f_i P_i$ therefore creates a force that lowers the routing probability of experts that are currently both over-chosen and over-weighted, and relatively raises it for the under-used ones, gently flattening the expert-usage histogram.
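The formula is easy to evaluate by hand on batch statistics. The sketch below is an illustration under the top-1 definition above, with a hypothetical function name and made-up counts; it computes only the scalar value, whereas in a real training loop $P_i$ would be a differentiable tensor and $f_i$ a constant.

```python
def switch_aux_loss(dispatch_counts, mean_probs, alpha=0.01):
    """alpha * E * sum_i f_i * P_i for one batch (top-1 routing).

    dispatch_counts : tokens sent to each expert (hard counts, no gradient)
    mean_probs      : batch-mean router probability per expert
    alpha           : illustrative coefficient, not a recommendation
    """
    num_experts = len(dispatch_counts)
    total = sum(dispatch_counts)
    f = [n / total for n in dispatch_counts]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, mean_probs))

# With alpha = 1 the value reads directly as the health gauge.
balanced = switch_aux_loss([256, 256, 256, 256], [0.25] * 4, alpha=1.0)
skewed = switch_aux_loss([70, 10, 10, 10], [0.7, 0.1, 0.1, 0.1], alpha=1.0)
print(round(balanced, 6), round(skewed, 6))  # 1.0 2.08
```

Since $f_i$ is treated as a constant, the gradient with respect to $P_i$ is $\alpha E f_i$: the busier an expert currently is, the harder its routing probability is pushed down.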

The catch, and DeepSeek's fix. An auxiliary loss is a second, competing objective. Pushing the router toward uniformity can drag it away from the routing the language-modeling loss actually prefers, costing a little quality: you are paying a small “balance tax.” DeepSeek-V3 popularized an auxiliary-loss-free alternative that largely sidesteps the tax. The idea: maintain a per-expert bias term $b_i$ that is added to the affinities only for the top-$k$ selection, never to the gate value $g_i$ that scales the expert's output. After each step you simply nudge $b_i$ down for over-loaded experts and up for under-loaded ones, a plain feedback controller watching observed load. Because the bias steers who gets selected but never enters the gradient-bearing gate, balancing exerts no gradient force that fights the main loss. You get balance almost for free. (DeepSeek-V3 still keeps a very small sequence-wise auxiliary loss as a safeguard against extreme imbalance within a single sequence. It also switched the gate from a softmax over all experts to a per-expert sigmoid, which scales more gracefully as $E$ grows into the hundreds, since a single softmax over $256$ experts squashes most scores toward zero.)
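The feedback-controller idea can be sketched in a few lines. This is an illustration of the mechanism as described above, not DeepSeek's code: the function names, the fixed step size, and the toy affinities are all assumptions, and the gate here is a softmax over the selected raw affinities (for continuity with the rest of this chapter) rather than the sigmoid-based gate.

```python
import math

def route_with_bias(affinities, bias, k):
    """Select top-k by (affinity + bias) but gate with the raw affinities only."""
    num_experts = len(affinities)
    chosen = sorted(range(num_experts),
                    key=lambda i: affinities[i] + bias[i], reverse=True)[:k]
    # The bias never enters the gate: softmax over the chosen raw affinities.
    m = max(affinities[i] for i in chosen)
    exps = {i: math.exp(affinities[i] - m) for i in chosen}
    z = sum(exps.values())
    return chosen, {i: e / z for i, e in exps.items()}

def update_bias(bias, loads, step_size):
    """Nudge a bias down if its expert is above the mean load, up if below."""
    mean_load = sum(loads) / len(loads)
    new_bias = []
    for b, n in zip(bias, loads):
        if n > mean_load:
            new_bias.append(b - step_size)
        elif n < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias

def loads_for(batch, bias):
    """Count how many tokens in the batch pick each expert under top-1 routing."""
    loads = [0] * len(bias)
    for affinities in batch:
        chosen, _ = route_with_bias(affinities, bias, k=1)
        loads[chosen[0]] += 1
    return loads

# Four tokens, two experts; expert 0 is initially favoured by three of them.
batch = [[1.0, 0.9], [1.0, 0.85], [1.0, 0.5], [0.2, 1.0]]
bias = [0.0, 0.0]
before = loads_for(batch, bias)
bias = update_bias(bias, before, step_size=0.06)
after = loads_for(batch, bias)
print(before, after)  # [3, 1] [2, 2]

# The bias changes who is selected, but the gates still come from raw scores.
chosen_demo, gates_demo = route_with_bias([2.0, 1.0, 3.0, 0.0],
                                          [0.0, 5.0, 0.0, 0.0], k=2)
print(chosen_demo, {i: round(g, 3) for i, g in gates_demo.items()})
# [1, 2] {1: 0.119, 2: 0.881}
```

In the second call a large bias forces expert 1 into the selected set, yet its gate weight stays small because the gate sees only its raw affinity of $1.0$.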

Shared and fine-grained experts

Two refinements from the DeepSeekMoE line are worth knowing because later questions assume them.

A shared expert is an FFN that every token always passes through, in addition to its routed experts — it is never gated off. The motivation is a clean division of labor. Some computation is needed by almost every token: basic grammar, common-word handling, generic “housekeeping.” Without a shared expert, each routed expert wastes part of its capacity re-learning this common knowledge, and the redundancy is sheer waste. A shared expert absorbs the common case once, freeing the routed experts to specialize aggressively on the specific. In DeepSeek-V3 there is $1$ shared expert plus $256$ routed experts with top- $8$ routing, so each token runs $1+8 = 9$ experts in all.
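In the notation of the earlier output equation, a layer with one shared expert can be written as below. This is a sketch of the general pattern, with $\mathrm{FFN}_{\text{shared}}$ introduced here as a label for the always-on expert; the exact normalization of the routed gates varies between models.

$$
y \;=\; \mathrm{FFN}_{\text{shared}}(x) \;+\; \sum_{i\in\mathcal{T}} g_i\,\mathrm{FFN}_i(x).
$$

The shared term has no gate and no dependence on $\mathcal{T}$, so it receives gradient from every token regardless of how the router behaves.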

Fine-grained experts means slicing each expert into several smaller ones: replace $E$ experts of hidden size $H$ by $mE$ experts of hidden size $H/m$ , while activating $m\times$ as many of them. The FLOP budget is unchanged — each expert is $m\times$ cheaper to run, but $m\times$ as many run, so $m$ cancels (a fact the harder questions ask you to prove). What you buy with the same compute is sharper specialization. With more, narrower experts the router can assemble a more precise combination per token, and the number of possible top- $k$ subsets explodes combinatorially: choosing $8$ of $256$ experts gives vastly more distinct “recipes” than choosing $2$ of $8$ . Finer granularity means finer-grained conditional computation.
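Both claims, the unchanged compute and the growth in possible expert subsets, can be checked numerically. The sizes below are illustrative and not taken from any model, the helper name is hypothetical, and each expert is assumed to be a plain two-matrix FFN (a SwiGLU expert would use three matrices, which rescales every row by the same factor).

```python
from math import comb

def active_ffn_params(d_model, hidden, num_active, matrices_per_expert=2):
    """Weights touched per token: num_active experts, each with
    matrices_per_expert matrices of size d_model x hidden."""
    return num_active * matrices_per_expert * d_model * hidden

d, H, E, k = 1024, 4096, 8, 2  # illustrative sizes only
for m in (1, 2, 4, 8):
    experts, active = m * E, m * k
    params = active_ffn_params(d, H // m, active)
    print(m, experts, active, params, comb(experts, active))
# Every row reports 16777216 active parameters; the subset count starts at
# comb(8, 2) = 28 for m = 1 and is already comb(16, 4) = 1820 for m = 2.
```

The factor $m$ cancels because the active count scales as $(mk)\cdot(H/m)$, while the number of selectable combinations $\binom{mE}{mk}$ grows rapidly with $m$.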

The systems wrinkle: all-to-all and expert parallelism

With hundreds of experts you cannot fit them all on one GPU, so expert parallelism places different experts on different devices, a natural way to spread the memory of all those weights (see Topic 6's discussion of parallelism). But it creates a communication headache. The router scatters a batch's tokens across all experts wherever they live, so each device must ship its tokens to the devices holding their chosen experts, and then gather the results back. The collective operation for “every device sends a different slice of its data to every other device” is an all-to-all. An MoE layer needs two all-to-alls per forward pass: a dispatch (send each token to its experts) and a combine (send each expert's outputs back to the originating device).

Because an all-to-all moves data proportional to (tokens $\times$ $k$ $\times$ hidden size) across the interconnect, on a slow network it can dominate the step time — the router and experts may sit idle waiting for tokens to arrive. This is the central reason MoE complicates distributed training, and why MoE efficiency is so sensitive to interconnect bandwidth. It also explains a design choice from earlier: fixed expert capacity and token dropping partly exist to keep these communication tensors a predictable, fixed shape, since variable-sized all-to-alls are far harder to pipeline efficiently.
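A rough estimate shows why interconnect bandwidth matters so much. The sketch below treats the volume as tokens $\times$ $k$ $\times$ hidden size, as stated above, and is an upper bound because tokens routed to an expert on their own device never cross the interconnect. The function names, batch shape, and bandwidth figures are all hypothetical, and latency and compute overlap are ignored.

```python
def all_to_all_bytes(num_tokens, top_k, d_model, bytes_per_elem=2):
    """Upper bound on bytes moved by one all-to-all: tokens x k x hidden size."""
    return num_tokens * top_k * d_model * bytes_per_elem

def layer_comm_seconds(num_tokens, top_k, d_model, bandwidth_bytes_per_s,
                       bytes_per_elem=2):
    """Dispatch plus combine for one MoE layer's forward pass."""
    one_way = all_to_all_bytes(num_tokens, top_k, d_model, bytes_per_elem)
    return 2 * one_way / bandwidth_bytes_per_s

# Illustrative: 4096 tokens, top-2, hidden size 4096, 16-bit activations.
volume = all_to_all_bytes(4096, 2, 4096)
print(volume)  # 67108864 bytes, i.e. 64 MiB per all-to-all

for bandwidth in (25e9, 400e9):  # hypothetical slow and fast links, bytes/s
    ms = 1e3 * layer_comm_seconds(4096, 2, 4096, bandwidth)
    print(bandwidth, round(ms, 2))  # about 5.37 ms and 0.34 ms
```

Multiplied across every MoE layer and doubled again for the backward pass, the gap between those two figures is the difference between communication hiding behind compute and communication dominating the step.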

Training and inference: what changes

MoE shifts several practical realities, and it helps to name them now because the questions probe each.

Training is touchier than dense. The router is a discrete, data-dependent switch sitting in the middle of a differentiable network, so it can oscillate, collapse, or chase a moving target. It is sensitive to router precision (compute routing logits in higher precision to avoid noisy argmax flips), to batch composition (if a micro-batch is too narrow in domain, its routing statistics are biased and balancing suffers — so balance is best computed over a global batch across devices, not per local micro-batch), and to the load-balancing coefficient $\alpha$ (too small and experts collapse; too large and uniformity overrides the language objective). A common, cheaper route to an MoE is upcycling: initialize the experts as copies of a pretrained dense model's FFN, then continue training so they diverge — you inherit the dense model's knowledge instead of learning from scratch, though early on the cloned-identical experts must first break symmetry.

Inference is memory-bound, as flagged above: you hold all experts but run few, so the bottleneck is shuttling expert weights, not arithmetic. Batching is also subtler. At high batch sizes, tokens in a batch scatter across many experts, so latency can become spiky — a step waits on whichever expert got the most tokens (or on the all-to-all), and different batches stress different experts. This load-dependent jitter is the routing/batching cause of the inference-latency questions in this topic.

When MoE wins — and when it doesn't

MoE is not free quality; it is a specific trade. It tends to win when:

You are bottlenecked by training/serving compute (FLOPs) and have memory to spare. MoE gives more quality per training-FLOP than a dense model of equal active size, and serves at the latency of its small active footprint.
You want a very knowledgeable model but can tolerate a large memory/VRAM footprint to host all the experts.

It tends to be the wrong choice when memory is the binding constraint — e.g. on-device or single-GPU deployments where you cannot hold hundreds of experts, so a dense model of the same memory budget will simply be better. It also adds engineering complexity (all-to-all, balancing, capacity tuning) and can be harder to fine-tune: heavy SFT or RL post-training shifts the input distribution, which can disturb a delicately-balanced router and erode the specialization gains, so the MoE's edge over a comparable dense model sometimes shrinks after alignment. Keeping the router stable through post-training is its own discipline, and several questions in this topic are really about exactly that.

Putting it together / what to watch for

The mental model in one breath: in each MoE transformer block, replace the single FFN with $E$ expert FFNs (optionally plus an always-on shared expert); a tiny router scores experts per token, top- $k$ selection keeps a few, a restricted softmax (or sigmoid) gives gate weights, the chosen experts run (subject to fixed capacity, with overflow dropped or padded), and their gate-weighted sum is added to the residual stream. Total parameters $\propto E$ set capacity and memory; active parameters $\propto k$ set compute — decoupling the two is the whole point.

A handful of recurring tensions drive nearly every design choice that follows, and naming them now will make the detailed questions feel familiar:

Balance vs. quality. Load must be spread (aux loss, or DeepSeek's aux-loss-free bias) or experts collapse — yet over-forcing balance fights the language objective.
Capacity vs. compute. A higher capacity factor drops fewer tokens but wastes more on padding; a lower one is cheaper but loses information.
Memory vs. FLOPs. MoE is the rare lever that cuts compute while raising memory, reshaping serving (offloading, batching, memory-bound inference).
Specialization is fragile. Post-training distribution shift can disturb the router and erode the gains, so a stable router is the prize.

Hold this skeleton — router, top- $k$ gate, total/active split, capacity, balancing, all-to-all — and the detailed questions that follow (the aux-loss derivation and its minimum, FLOP-conservation proofs, DeepSeek-V3's bias trick, collapse diagnostics, MoE-vs-dense at fixed budget) will read as variations on parts you have already met.