Chapter 02Part I · Foundations

Tokenization & Embeddings

6 practice sets · 4 coding problems

A transformer is a machine for turning vectors into vectors. The world, though, hands us text: a stream of characters in some script, sprinkled with spaces, punctuation, emoji, URLs, and code. Two thin layers stand between the two. Tokenization chops the raw string into discrete units and assigns each a unique integer; embedding turns each integer into a learned vector the network can actually compute with. Together they are the model's eyes and mouth — the only place text enters and the only place it leaves — and a surprisingly large share of a model's everyday behaviour is decided right here, before a single attention head fires: how much text fits in the context window, what an API call costs, whether the model can do arithmetic or spell a word, even how cheaply it serves one language versus another. This chapter builds the whole input/output bridge from the ground up, assuming only that you have seen a neural network and a little linear algebra. By the end you should be able to follow a string all the way to a row of vectors and back to a probability over the next token, and know the job of every piece (tokenizers, BPE merges, bytes, special tokens, the embedding and unembedding matrices) that the rest of this topic takes apart.

The big picture: why chop text up at all?

A language model predicts the next token given the tokens so far. But “token” is a design choice, not a fact of nature, and the choice is a genuine three-way trade-off between vocabulary size, sequence length, and coverage (never being stuck on a word you have never seen). Walk through the obvious options and the tension becomes clear.

Suppose tokens are whole words, split on spaces. This keeps each unit meaningful, but the vocabulary is effectively unbounded: new names, typos, hashtags, compounds, and inflections keep arriving forever. Most words are rare, so the model wastes capacity on entries it sees a handful of times, learns nothing useful for them, and — worst of all — is simply stuck the moment a word it never saw in training appears. That last failure has a name, out-of-vocabulary (OOV), and it is fatal: there is no integer to feed the network. Word-level tokenizers also treat walk, walks, walked, and walking as four unrelated symbols, throwing away the shared root.

Suppose instead tokens are single characters. Now the vocabulary is tiny (a few dozen letters, or a few thousand for Chinese) and nothing is ever unknown. But sequences become very long — "transformer" is eleven steps — so the model must burn attention and compute stitching distant characters back into one concept, and its fixed context window holds far less actual text. A character model also has to learn, from scratch, that the letters c-a-t tend to travel together, a regularity a word model gets for free.

Loading diagram…

A subword is just a frequent chunk of text — something like " the", "tion", or " token" (note the leading space, which is genuinely part of the token, more on that shortly). Common words collapse to a single token; rare words shatter into a few familiar pieces; and, with the byte-level trick below, nothing is ever truly unknown. The full set of subwords a model knows is its vocabulary of size $V$ (typically $30$ k– $200$ k for modern LLMs), and each entry gets a unique integer token id in $\{0,\dots,V-1\}$ . A tokenizer is the deterministic function mapping a string to a list of ids (encoding) and back (decoding); it runs before the network sees anything and emits a list of integers, nothing more. The same string always yields the same ids for a fixed tokenizer — it is a lookup table and a fixed set of rules, not a learned probabilistic layer. A useful rule of thumb for English: one token is roughly $4$ characters, or about $\tfrac{3}{4}$ of a word, so $100$ tokens $\approx 75$ words.

Byte-Pair Encoding: learning a vocabulary by gluing frequent pairs

How do we choose the subwords? We do not hand-pick them; we learn them from data with the dominant recipe, Byte-Pair Encoding (BPE). The idea is almost embarrassingly simple and entirely greedy: start with the smallest possible pieces, then repeatedly find the most frequent adjacent pair and glue it into a new token. Each “glue the most frequent pair” step produces one merge rule; the ordered list of merge rules is the trained tokenizer. We stop when the vocabulary reaches its target size.

A trained BPE tokenizer is nothing but a base alphabet plus an ordered list of merge rules. Training discovers, by frequency, which adjacent pairs are worth gluing together. Encoding a new string just replays those merges in order, greedily, until none apply — which is fast, deterministic, and gives the same ids for the same string every time. A merge rule, in plain terms, is just the instruction “wherever you see X immediately followed by Y, replace them with the single token XY.”

A fully worked toy example makes the mechanism concrete. Take a tiny corpus of five words with the frequencies shown (the symbol _ marks a word boundary, so the tokenizer can tell a word-final piece from a mid-word one). We split every word into characters to start, so the base alphabet is $\{\texttt{d},\texttt{e},\texttt{i},\texttt{l},\texttt{n},\texttt{o},\texttt{r},\texttt{s},\texttt{t},\texttt{w},\texttt{\_}\}$ .

Hands-on · four BPE merges by hand

Corpus (frequency $\times$ word, each split into characters):

5\times\texttt{l o w \_}\quad 2\times\texttt{l o w e s t \_}\quad 6\times\texttt{n e w e r \_}\quad 3\times\texttt{w i d e r \_}\quad 2\times\texttt{n e w \_}

We repeatedly count every adjacent pair (weighted by word frequency) and merge the most frequent.

Merge 1. The pair (r,_) appears in newer_ and wider_, total $6+3=9$ times — the winner. Rule r _ $\to$ r_. Now newer_ reads n e w e r_.

Merge 2. Recount. The pair (e,r_) now appears $6+3=9$ times (still inside those two words) and wins. Rule e r_ $\to$ er_. So newer_ is now n e w er_.

Merge 3. Recount. The pair (e,w) appears in newer_ $(6)$ and new_ $(2)$ , total $8$ — the new winner. Rule e w $\to$ ew. (Note low also has w, but never e w.)

Merge 4. Recount. The pair (n,ew) appears in newer_ $(6)$ and new_ $(2)$ , total $8$ , and wins. Rule n ew $\to$ new.

The four learned rules, in order, are

\texttt{r\,\_}\to\texttt{r\_},\quad \texttt{e\,r\_}\to\texttt{er\_},\quad \texttt{e\,w}\to\texttt{ew},\quad \texttt{n\,ew}\to\texttt{new}.

After them, newer_ encodes as just two tokens new er_, and the word new_ as new _. To encode any new word at inference — say newer_ — we simply replay these four rules in order on its characters until none applies. That replay is all “encoding” ever is.

Loading diagram…

Notice the vocabulary does not simply grow by one per merge. Merge $4$ created new but, in this corpus, every n was followed by ew, so n and ew could both be dropped — the size can go up, stay flat, or even shrink for a step. The practical point stands: a handful of merges already turns frequent words into single tokens while leaving rare words expressible as a few pieces.

Going byte-level: why GPT-2 and friends start from bytes

The toy above started from characters, which raises an awkward question: which characters? There are over a million Unicode code points — every script, emoji, and symbol on the internet. Seeding the base vocabulary with all of them would be wasteful, and you would still hit OOV on anything you missed. Byte-level BPE (used by GPT-2 onward, and by Llama, DeepSeek, and most modern models) solves this cleanly. Every string, in any language, is stored as a sequence of bytes under UTF-8, and there are exactly $256$ possible byte values. So we seed the vocabulary with those $256$ base tokens and run BPE merges on top of bytes instead of characters.

This is exactly how GPT-2's tokenizer is built: $256$ base byte tokens, plus $50{,}000$ learned merges, plus $1$ special end-of-text token, for a vocabulary of $V = 50{,}257$ . Byte-level BPE also shares structure across languages for free — two scripts that share UTF-8 byte patterns share subword pieces — which is why it has become the default. One practical wrinkle: before any merging, the raw text is first split by a regular expression in a step called pre-tokenization, so that merges never glue across word or whitespace boundaries in undesirable ways (e.g. never merging the end of one word with the start of the next). This is also why a leading space tends to cling to the following word and ride along as part of the token: the tokenizer treats " token" (with its space) and a mid-word "token" as different units. Spaces are part of tokens, not separators thrown away.

Cousins of BPE: WordPiece and Unigram / SentencePiece

BPE is not the only subword algorithm; two relatives are worth knowing because the questions reference them.

WordPiece (used by BERT) is BPE's close cousin. It also starts small and merges pairs, but instead of merging the most frequent pair it merges the pair that most increases the training data's likelihood under a simple unigram model. Concretely it scores a candidate merge of $x$ and $y$ by $\tfrac{P(xy)}{P(x)\,P(y)}$ — the higher this ratio, the more $x$ and $y$ “want” to be together beyond what chance predicts. This favours pairs that are surprisingly co-occurrent (informative subwords) over pairs that are merely common.

Unigram, the default in Google's SentencePiece library, works backwards. Rather than building a vocabulary up by merging, it starts from a large pool of candidate subwords and prunes it down, repeatedly dropping the pieces whose removal hurts the corpus likelihood the least, until it hits the target size. At encode time, instead of replaying greedy merges, it considers all ways to segment a word and picks the most probable one by a Viterbi search over a probabilistic model. This makes Unigram inherently probabilistic — it can even sample alternative segmentations — where BPE is greedy and deterministic.

Loading diagram…

SentencePiece adds one more convenience: it treats the input as a raw stream (spaces become a visible _ marker) so it needs no language-specific pre-tokenizer, which is why multilingual models lean on it. Across all three algorithms, the single number people use to compare tokenizers is fertility — the average number of tokens per word. Lower fertility means each token carries more text, so for the same document you get shorter, cheaper sequences. It is the headline metric precisely because everything downstream (context usage, latency, cost) scales with token count.

Special tokens: punctuation for the model, not the reader

A real tokenizer's vocabulary is not only subwords of natural text. It also reserves a handful of special tokens — reserved ids that never appear in ordinary text and instead carry structural meaning the model learns to obey:

BOS / EOS (beginning- and end-of-sequence, e.g. GPT-2's single <endoftext|>|): mark where a document starts and stops, so the model knows when a sequence is complete and can stop generating.
PAD: a filler id used to make sequences in a batch the same length; it is masked out so it never contributes to the loss.
Chat / role tokens: special markers such as <|user|>, <|assistant|>, and <|system|> that delimit turns in a conversation. A chat template is just the fixed recipe that wraps each message in these tokens before tokenizing, so the model can tell who said what. This is purely a tokenization-level convention — the same flat sequence of ids the model always sees, with structure encoded by reserved tokens.

These tokens are how higher-level structure (turns, document boundaries, padding) is smuggled into a model that, underneath, only ever sees a flat list of integers.

From token ids to vectors: the embedding matrix

We now have a list of integer ids. But a token id is a categorical label, not a magnitude: id $5000$ is in no sense “larger than” id $10$ , and the network must not read meaning into the integers themselves. So before any real computation, every id is turned into a vector. The embedding matrix $E\in\mathbb{R}^{V\times d}$ stores one learned row of length $d$ per vocabulary entry, where $d$ is the model's hidden width (its model dimension $d_{\text{model}}$ , e.g. $4096$ ). Embedding token $t$ simply returns row $E_t\in\mathbb{R}^{d}$ .

An embedding is a table lookup, full stop. There is no matrix multiply at the input — “embedding token $t$ ” means “go fetch row $t$ of $E$ .” The vectors themselves are learned during training, right alongside every other weight, so over time the rows arrange themselves so that tokens used in similar ways end up with similar vectors. The integer id is just an address; the vector at that address is what carries meaning.

Hands-on · an embedding lookup, then cosine similarity

Take a toy vocabulary of three tokens with $d=2$ , and suppose training has produced

E=\begin{pmatrix} E_{\texttt{cat}}\\ E_{\texttt{dog}}\\ E_{\texttt{the}} \end{pmatrix} =\begin{pmatrix} 2 & 1\\ 3 & 1.5\\ {-}1 & 2 \end{pmatrix}.

Lookup. The sentence “the cat” is ids $(2,0)$ , so embedding it just stacks rows $2$ and $0$ :

X=\begin{pmatrix}{-}1 & 2\\ 2 & 1\end{pmatrix}\in\mathbb{R}^{2\times 2}.

No arithmetic — a gather of $d$ numbers per token. Cosine similarity. How close are cat and dog? Cosine similarity is $\cos\theta=\frac{u\cdot v}{\|u\|\,\|v\|}$ , the angle between the two vectors (ignoring length). With $u=(2,1)$ , $v=(3,1.5)$ :

u\cdot v = 2\cdot 3 + 1\cdot 1.5 = 7.5,\quad \|u\|=\sqrt{5}\approx 2.24,\quad \|v\|=\sqrt{11.25}\approx 3.35,

\cos\theta = \frac{7.5}{2.24\times 3.35}\approx \frac{7.5}{7.50}\approx 1.00.

cat and dog point the same direction — the model has placed them almost on top of each other. Against the $=({-}1,2)$ , though, $u\cdot v = 2({-}1)+1(2)=0$ : cosine $0$ , a right angle, “unrelated.” That geometry — similar words close, unrelated words orthogonal — is exactly what the embedding table learns.

Loading diagram…

So embedding is computationally trivial; its cost lives entirely in parameters. The table holds $V\times d$ of them: with $V=50{,}000$ and $d=4096$ that is already $\approx 2\times 10^{8}$ parameters, and it grows linearly with $V$ — one concrete reason vocabularies are not made arbitrarily large.

Where do the learned vectors come from? Distributional semantics

Why should the rows of $E$ end up meaningful at all? The answer is one of the oldest ideas in linguistics, captured by Firth's slogan “you shall know a word by the company it keeps.” A word's meaning is reflected in the contexts it appears in; words that appear in similar contexts (cat, dog, hamster) tend to mean similar things. This is distributional semantics, and it is exactly the signal next-token prediction exploits: to predict well, the model is forced to give tokens that behave alike similar vectors.

The classic demonstration predates LLMs: word2vec. Its skip-gram variant trains a tiny network on one task — given a center word, predict the words around it — and then throws the network away, keeping only its input embedding table. Because words sharing contexts get pushed toward similar vectors, the resulting space has startling structure: directions become meaningful, so that $E_{\texttt{king}} - E_{\texttt{man}} + E_{\texttt{woman}} \approx E_{\texttt{queen}}$ .

The output end: unembedding, logits, and weight tying

At the other end of the network the move is reversed. After the final layer, the hidden vector $h\in\mathbb{R}^{d}$ at the last position must become a probability over the whole vocabulary. The unembedding (output projection) is a matrix $U\in\mathbb{R}^{V\times d}$ producing logits $z = Uh\in\mathbb{R}^{V}$ — one raw score per token — and a softmax turns those scores into probabilities, $p_i = e^{z_i}\big/\sum_j e^{z_j}$ .

Note the asymmetry between the two ends. The input lookup touches a single row of $E$ and is nearly free. The output projection is a genuine $V\times d$ matrix multiply, costing about $2Vd$ floating-point operations per token generated. With $V=128{,}000$ and $d=8192$ that is $\approx 2.1\times 10^{9}$ FLOPs for the final projection of one token. A big vocabulary is therefore expensive at both ends — in parameters at the input, in compute at the output — even though the lookup half feels free.

Because $E$ and $U$ have identical shape $V\times d$ , a standard trick is weight tying: set $U = E^{\top}$ , reusing one matrix for input and output. This halves the embedding parameter count, tends to improve quality slightly, and ties together how a token is read in and written out — it has been common since Press & Wolf (2017) and is used by GPT-2 and many models since. The alternative, untied embeddings, keeps $E$ and $U$ separate for a little extra capacity, at the cost of doubling those parameters, which bites hard for a $256$ k-token vocabulary.

Loading diagram…

Choosing $V$ : the central trade-off and its effect on the matrices

Vocabulary size $V$ is the main knob, and it pulls in opposite directions at once. A larger $V$ makes each token cover more text, so a given document becomes fewer tokens; that shortens sequences, lowers attention cost per document, and improves coverage across many languages. This is exactly why vocabularies have grown over time — GPT-2's $50$ k, Llama 2's $32$ k, Llama 3's $128$ k, GPT-4o's roughly $200$ k — as multilingual coverage became a priority. But a larger $V$ also inflates the embedding and unembedding matrices ( $\propto V d$ parameters) and the per-token softmax FLOPs ( $\propto V d$ ). A smaller $V$ is cheap per step but yields longer sequences and weaker cross-lingual coverage. The result is a U-shaped total cost with a sweet spot in the middle, which is roughly why production vocabularies cluster in the tens-of-thousands to low-hundreds-of-thousands.

Loading diagram…

Why this thin layer matters

It would be easy to dismiss tokenization and embedding as plumbing, but their quirks surface all over the model's behaviour, and knowing where they bite is half the value of this chapter.

Arithmetic and spelling. The model cannot “look inside” a token. If a number like 1234 or the letters of a word are glued into a chunk the model never sees decomposed, then per-digit or per-character tasks become artificially hard. Splitting digits into individual tokens is a common mitigation.
Cost and fairness across languages. A script underrepresented in the merge-training corpus has high fertility, so the same sentence costs several times more tokens. That eats context window and money for exactly those users — a real fairness and economics issue, not a rounding error.
Glitch / undertrained tokens. A token can win a vocabulary slot during training yet almost never appear in the text the model trains on. Its embedding row stays essentially untrained — close to its random initialization — and prompting it can produce bizarre, off-distribution outputs.
The softmax bottleneck. Because logits are $z = Uh$ with $U\in\mathbb{R}^{V\times d}$ , the matrix of log-probabilities the model can express has rank at most $d$ . A low-dimensional model therefore cannot represent every conceivable next-token distribution over a large vocabulary (Yang et al., 2017) — a fundamental ceiling set right here at the edges.

The mental model to carry forward is a short, fixed pipeline: a string is split and mapped to ids by a deterministic BPE tokenizer; each id gathers a row of $E$ to become a vector; those vectors flow through the transformer; the final vector is projected by $U$ to $V$ logits; a softmax makes them a distribution; and we sample the next token, append it, and repeat. Everything in between is the rest of the book — but the floor and ceiling on what that machinery can achieve are set by the tokenizer and the two embedding matrices at its ends, and the questions that follow (fertility, byte-level vs. subword, weight tying, the softmax bottleneck, undertrained tokens) are all variations on the parts you have just met.

The big picture: why chop text up at all?

Byte-Pair Encoding: learning a vocabulary by gluing frequent pairs

Going byte-level: why GPT-2 and friends start from bytes

Cousins of BPE: WordPiece and Unigram / SentencePiece

Special tokens: punctuation for the model, not the reader

From token ids to vectors: the embedding matrix

Where do the learned vectors come from? Distributional semantics

The output end: unembedding, logits, and weight tying

Choosing VVV: the central trade-off and its effect on the matrices

Why this thin layer matters

Choosing $V$ : the central trade-off and its effect on the matrices