Chapter 16 · Part V · Frontiers

Multimodal / Vision-Language (Lighter)

8 practice sets · 3 coding problems

A language model, stripped of its mystique, is a machine that consumes a sequence of vectors and predicts the next one. Topic 1 walked a token from an integer id, through an embedding lookup, into the residual stream, and out to a probability distribution. Nothing in that pipeline actually cares that the vectors came from words. The attention layers see a list of $d_{\text{model}}$ -dimensional vectors; they mix and transform them; they never ask where the vectors were born. That indifference is the single most important fact in this chapter, because it means there is a back door. If we can somehow turn a picture into a few vectors of width $d_{\text{model}}$ that “mean the same kind of thing” as word vectors, we can drop them into the sequence and the language model will reason over them as if they were text. A model that does this is a vision-language model (VLM), and this mini-chapter builds one from the ground up, assuming only that you have read Topic 1.

The big picture: how do you feed an image to a token machine?

Start with the puzzle. A sentence is already a sequence — words in a row — so chopping it into tokens is natural. An image is not a sequence; it is a grid of pixels, a $224\times224\times3$ block of numbers (height, width, three color channels) with no obvious “first” element and no left-to-right order. The transformer wants a list of vectors. We have a slab of pixels. Bridging that gap is the entire architectural problem of multimodality, and essentially every modern system solves it the same way, in two moves:

1. Turn the image into a list of vectors using a vision encoder (a separate neural network, almost always a Vision Transformer). Its output is one vector per region of the image, typically a few hundred in total.
2. Translate those vectors into the LLM's space with a small connector, then splice them into the token sequence (or let the LLM attend to them), and run the language model as usual.

Why go to all this trouble? Because a staggering fraction of what people want from an assistant is visual: reading a chart, describing a photo, debugging a screenshot, parsing a scanned receipt, finding the button to click in a UI. A text-only model is blind to all of it. The bet of multimodality is that vision and language share enough abstract structure — the concept “dog” is the same whether you see one or read the word — that a single transformer can host both, if we can only get the pixels into a form it understands.

From pixels to tokens: the Vision Transformer (ViT)

The standard “eye” is the Vision Transformer, or ViT, and its central trick is to treat an image exactly the way a tokenizer treats text: chop it into pieces and embed each piece. There is no convolutional magic to memorize; it is the transformer you already know, fed a cleverly prepared input. Here is the recipe, step by step.

1. Patchify. Lay a grid over the image and cut it into non-overlapping square patches of side $P$ pixels (a typical $P$ is $14$ or $16$ ). A $224\times224$ image with $P=16$ becomes a $14\times14$ grid of patches. Each patch is a tiny tile of raw pixels, a block of $P\times P\times 3$ numbers.

2. Linearly embed each patch. Flatten each patch's $P\cdot P\cdot 3$ pixel values into one long vector and pass it through a single shared linear map to produce a vector of width $d$ . This vector is a patch token — the visual analogue of a word embedding. (Implementations usually do this as a single strided convolution with kernel and stride both equal to $P$ ; that is mathematically identical to “cut into patches, then multiply each by a weight matrix,” just packaged more efficiently.)

3. Add position embeddings. A patch needs to know where in the grid it sat, exactly as a word needs to know its position — without this, “sky above grass” and “grass above sky” would look identical. ViTs add a learned position vector to each patch token.

4. Run a transformer encoder. Feed the full set of patch tokens (optionally with a prepended learnable [CLS] token, a slot whose job is to summarize the whole image) through an ordinary transformer. Out comes a refined vector per patch.

So a ViT emits patches, not words: a grid of $d$ -dimensional vectors, one per region of the image, each having “looked at” the others through self-attention.
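Steps 1 to 3 of the recipe can be sketched in a few lines of NumPy. This is a minimal illustration, not a real ViT: the weights and position vectors are random stand-ins for learned parameters, the width `d = 1024` is an arbitrary illustrative choice, and the helper name `patchify` is made up for this example. Step 4 (the transformer encoder) is omitted.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Cut an (H, W, C) image into flattened, non-overlapping patches.

    Returns shape (num_patches, patch * patch * C), with patches ordered
    left to right, then top to bottom.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "sketch assumes exact tiling"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> one row per patch
    tiles = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a real image
P, d = 16, 1024                     # d is an illustrative encoder width

patches = patchify(image, P)                             # step 1: (196, 768)
W_embed = rng.normal(0.0, 0.02, (P * P * 3, d))          # step 2: shared linear map (random here)
pos_embed = rng.normal(0.0, 0.02, (patches.shape[0], d)) # step 3: position vectors (random here)
patch_tokens = patches @ W_embed + pos_embed             # what the encoder would consume

print(patches.shape, patch_tokens.shape)  # (196, 768) (196, 1024)
```

The same matrix `W_embed` is applied to every patch, which is why a strided convolution with kernel and stride `P` computes the same thing.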


Hands-on · counting patch tokens

The arithmetic every VLM engineer does in their sleep. For an $H\times W$ image cut into $P\times P$ patches with no overlap, the number of patches is

$$N_{\text{patch}}=\frac{H}{P}\times\frac{W}{P}.$$

A $224\times224$ image at $P=16$: $\frac{224}{16}=14$ per side, so $14\times14=\mathbf{196}$ patch tokens. Add the [CLS] token and the encoder processes $197$ tokens. Switch to $P=14$ on the same image: $\frac{224}{14}=16$ per side $\Rightarrow 16\times16=256$ tokens. Now a sharper $336\times336$ image at $P=14$: $\frac{336}{14}=24$ per side $\Rightarrow 24\times24=\mathbf{576}$ tokens. Notice the pattern: the token count scales with image area divided by patch area, so halving the patch side or doubling the image side quadruples it. Remember $196$: it is the canonical number, and it is exactly $14^2$.
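The same arithmetic as a small helper, useful for sanity-checking a token budget. The function name and the `cls_token` flag are illustrative, and the sketch assumes the image divides evenly into patches.

```python
def num_patch_tokens(height: int, width: int, patch: int, cls_token: bool = False) -> int:
    """Token count for an H x W image cut into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("sketch assumes the image divides evenly into patches")
    return (height // patch) * (width // patch) + (1 if cls_token else 0)

print(num_patch_tokens(224, 224, 16))                  # 196
print(num_patch_tokens(224, 224, 16, cls_token=True))  # 197
print(num_patch_tokens(224, 224, 14))                  # 256
print(num_patch_tokens(336, 336, 14))                  # 576
print(num_patch_tokens(448, 448, 16))                  # 784, four times 196
```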

Two of the ViT's outputs matter, and the distinction comes up constantly. The patch tokens are the full grid ( $196$ or $576$ vectors), each describing one local region — this is what a VLM feeds to the LLM so it can reason about where things are and how they relate. The pooled embedding (the [CLS] vector, or an average over all patches) is a single vector summarizing the whole image; it is great for classification or retrieval but spatially blind. A model that must say “the red cup is left of the laptop” needs the patch grid, not the pooled summary.

Where the encoder comes from: CLIP and contrastive pretraining

We do not usually train the vision encoder from scratch inside the VLM; we borrow one that already knows how to “see in a language-aware way.” The workhorse is CLIP (Contrastive Language–Image Pretraining). CLIP trains an image encoder and a text encoder together, from hundreds of millions of (image, caption) pairs scraped from the web — no hand-drawn class labels, just naturally occurring image–text pairs. Its goal is to build a shared embedding space: a single space in which a picture of a dog and the words “a photo of a dog” land on nearly the same point.

How do you train for that without labels? With a contrastive objective, and the idea is purely geometric. Take a batch of $N$ image–caption pairs. Encode the images into vectors $u_1,\dots,u_N$ and the captions into vectors $v_1,\dots,v_N$ , and $\ell_2$ -normalize them all onto the unit sphere (so “similarity” becomes the cosine of the angle between two vectors — $1$ means same direction, $0$ means orthogonal). Now form the $N\times N$ grid of all pairwise similarities. The $N$ matching pairs sit on the diagonal: image $i$ with its own caption $i$ . Every off-diagonal entry is a mismatch: image $i$ with somebody else's caption. The whole training signal is one sentence: pull each diagonal pair together; push every off-diagonal pair apart.


To make “pull together / push apart” into a trainable loss, CLIP reuses the softmax-cross-entropy machinery from Topic 1. Scale every similarity by a learned temperature $\tau$ (a single number that controls how sharp the softmax is) to get logits $S_{ij}=u_i^{\top}v_j/\tau$ . Now read each row as a tiny classification problem: “among these $N$ captions, which one belongs to image $i$ ?” The right answer is the diagonal, $j=i$ . Do the same down each column (“which image belongs to this caption?”). Averaging both directions gives the symmetric InfoNCE loss:

$$\mathcal{L}=\tfrac12\!\left[ -\frac1N\sum_{i}\log\frac{e^{S_{ii}}}{\sum_j e^{S_{ij}}} \;-\;\frac1N\sum_{j}\log\frac{e^{S_{jj}}}{\sum_i e^{S_{ij}}}\right].$$

The batch is the set of classes: each image is classified against the $N$ captions present, and vice versa. A bigger batch therefore supplies more negatives (more wrong captions to push away from), which is why CLIP-style training famously wants huge batches.
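A minimal NumPy sketch of the symmetric loss above, assuming the image and caption embeddings arrive as two arrays of shape `(N, d)` whose rows are matched pairs. The function names are made up for this example, and the temperature is passed in as a fixed number rather than learned.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(img: np.ndarray, txt: np.ndarray, tau: float) -> float:
    """Average of the image-to-text and text-to-image cross-entropies."""
    u = img / np.linalg.norm(img, axis=1, keepdims=True)  # l2-normalize images
    v = txt / np.linalg.norm(txt, axis=1, keepdims=True)  # l2-normalize captions
    logits = u @ v.T / tau                                # S[i, j] = u_i . v_j / tau
    diag = np.arange(len(u))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # each row: pick the caption
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # each column: pick the image
    return float(0.5 * (loss_i2t + loss_t2i))

aligned = np.eye(4)                       # four toy pairs with identical embeddings
shuffled = np.roll(aligned, 1, axis=0)    # every caption paired with the wrong image
print(symmetric_info_nce(aligned, aligned, tau=0.1))   # close to 0
print(symmetric_info_nce(aligned, shuffled, tau=0.1))  # roughly 10
```

When every similarity in the grid is equal, the loss comes out to $\log N$, the value for a model that cannot tell the pairs apart.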

Hands-on · the contrastive softmax on a $3\times3$ batch

Three pairs. Suppose temperature scaling has already produced these row logits (similarities of image $1$ against captions $1,2,3$ ):

$$\text{image 1 row: } S_{1\bullet}=(\,2,\;0,\;0\,),\qquad e^{2}\approx7.39,\; e^{0}=1.$$

The softmax probability that image $1$ matches its own caption (column $1$ ) is

$$p_{11}=\frac{e^{2}}{e^{2}+e^{0}+e^{0}}=\frac{7.39}{7.39+1+1}=\frac{7.39}{9.39}\approx 0.79,$$

so the per-example loss is $-\log 0.79\approx 0.24$ — small, because the model is already fairly confident the diagonal is the match. If instead all three similarities were equal, say $S_{1\bullet}=(0,0,0)$ , then $p_{11}=\tfrac13$ and the loss is $-\log\tfrac13\approx 1.10$ : maximal confusion. Training drives the diagonal logit up and the off-diagonal logits down until $p_{ii}\to1$ . Note the temperature's role: a small $\tau$ multiplies all similarities up, sharpening the softmax (more confident, harsher penalty for near-misses); a large $\tau$ flattens it.
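The numbers in this example are easy to reproduce with the standard library. In the sketch below, `match_probability` is an illustrative helper; the last two lines reuse the row $(2,0,0)$ as if it were a set of pre-temperature scores, purely to show the direction in which $\tau$ moves the softmax.

```python
import math

def match_probability(scores, index, tau=1.0):
    """Softmax probability of entry `index` after dividing scores by tau."""
    logits = [s / tau for s in scores]
    m = max(logits)                              # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    return exps[index] / sum(exps)

p = match_probability([2.0, 0.0, 0.0], 0)
print(round(p, 2), round(-math.log(p), 2))            # 0.79 0.24

p_flat = match_probability([0.0, 0.0, 0.0], 0)
print(round(p_flat, 2), round(-math.log(p_flat), 2))  # 0.33 1.1

p_sharp = match_probability([2.0, 0.0, 0.0], 0, tau=0.5)  # smaller tau: sharper
p_soft = match_probability([2.0, 0.0, 0.0], 0, tau=2.0)   # larger tau: flatter
print(round(p_sharp, 2), round(p_soft, 2))            # 0.96 0.58
```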

After this training, distance in the shared space means semantic mismatch and closeness means semantic match, so you can score how well any caption fits any image with a single dot product. That is why CLIP is the default “image–text model” and the default vision backbone for VLMs: its encoder already produces image vectors that live in a language-aware space. One subtlety the questions raise is the modality gap: even after training, image embeddings tend to cluster in one region of the sphere and text embeddings in another, slightly separated region. Matched pairs are still closer to each other than to random pairs, but the two clouds do not fully overlap. The gap is usually attributed to the two encoders' different initializations and to the fact that the contrastive loss only needs relative ordering (diagonal beats off-diagonal), not absolute coincidence. It rarely hurts retrieval, but it is worth knowing it exists.


Connecting vision to the LLM: the connector and three recipes

CLIP's encoder outputs patch vectors of its width $d$ , living in its space — not the LLM's width $d_{\text{model}}$ , and not aligned with how the LLM reads embeddings. So between the encoder and the LLM sits a small connector (also called a projector, adapter, or vision–language bridge) whose only job is to translate. There are three main ways to wire the connection, and the choice colors everything downstream.

Recipe 1 — a projector/adapter (the LLaVA style; early fusion). The simplest and now most common approach. Pass each patch token through a tiny MLP — in LLaVA-1.5, a two-layer MLP with a GELU nonlinearity — that maps it from $\mathbb{R}^{d}$ into $\mathbb{R}^{d_{\text{model}}}$ . After projection, each image token is just another entry in the sequence. Concatenate them in front of the text tokens (image first, then the question) and run the unmodified LLM over the whole stream. Beautifully simple, cheap to build, and the LLM's full depth attends to vision from layer one. The cost: every image token occupies a sequence slot, so it lengthens the context and, because the image tokens sit in ordinary self-attention, enlarges the KV cache by exactly their count.
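A toy NumPy sketch of Recipe 1, under stated assumptions: the widths (`d_vision = 64`, `d_model = 128`) are deliberately tiny stand-ins, the weights are random rather than trained, and GELU is computed with its common tanh approximation. It shows only the shape bookkeeping: project each patch token, then concatenate image tokens in front of the text embeddings.

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """tanh approximation of GELU (an assumption; exact GELU uses erf)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project(patch_tokens, W1, b1, W2, b2):
    """Two-layer MLP mapping vision width d to the LLM width d_model."""
    return gelu(patch_tokens @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_vision, d_model = 64, 128          # toy widths, far smaller than real models
n_img, n_txt = 576, 32               # 576 patch tokens, 32 text tokens

patch_tokens = rng.normal(size=(n_img, d_vision))   # stand-in for encoder output
text_embeds = rng.normal(size=(n_txt, d_model))     # stand-in for word embeddings

W1 = rng.normal(0.0, 0.02, (d_vision, d_model)); b1 = np.zeros(d_model)
W2 = rng.normal(0.0, 0.02, (d_model, d_model));  b2 = np.zeros(d_model)

image_embeds = project(patch_tokens, W1, b1, W2, b2)          # (576, 128)
sequence = np.concatenate([image_embeds, text_embeds], axis=0) # image first, then text
print(sequence.shape)  # (608, 128)
```

The sequence the LLM sees is 608 entries long, so the 576 image tokens cost 576 sequence slots and 576 KV-cache entries per layer.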

Recipe 2: cross-attention adapters (the Flamingo style; late fusion). Keep the text sequence short and do not put image tokens in it. Instead, splice new cross-attention layers into the (frozen) LLM at intervals; in these layers the text tokens attend out to the visual features, which never occupy sequence slots of their own. A crucial detail makes this safe: the output of each injected layer is multiplied by $\tanh(\alpha)$, with a learnable scalar $\alpha$ initialized to zero, before it is added back to the residual stream. At the start the new layer therefore contributes nothing and the pretrained LLM behaves exactly as before; vision is then faded in as $\alpha$ moves away from zero during training. This decouples the text context length from the image-token count (attractive when you have many images or video) at the price of new parameters and a more invasive architecture.
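The zero-initialized gate is simple enough to show on plain lists. This sketch covers only the gating step; `cross_attn_out` stands in for whatever the injected cross-attention layer computed, and the function name is made up for this example.

```python
import math

def gated_residual(hidden, cross_attn_out, alpha):
    """Add a tanh-gated cross-attention output to the residual stream."""
    gate = math.tanh(alpha)
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]

hidden = [0.5, -1.0, 2.0]          # a text token's residual-stream vector
visual = [10.0, 10.0, 10.0]        # stand-in for the cross-attention output

print(gated_residual(hidden, visual, alpha=0.0))   # unchanged: the gate is exactly 0
print(gated_residual(hidden, visual, alpha=0.1))   # vision starts to leak in
```

At `alpha = 0.0` the output equals `hidden` exactly, which is the property that keeps the frozen LLM's behavior intact at the start of training.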

Recipe 3 — native / early-fusion multimodal tokens. Rather than bolting a vision encoder onto a finished text LLM, train one model from the start on a stream that interleaves text tokens and image (or image-patch) tokens, sometimes from a shared or jointly-learned tokenizer. The model is multimodal at birth, which tends to produce tighter cross-modal reasoning, but it forfeits the convenience of reusing an off-the-shelf LLM and demands multimodal data and compute from day one. Several recent frontier models lean this way.


After projection (recipe 1), the image tokens are usually prepended to the text, they extend the sequence one-for-one, and positional encodings must be extended to cover them — a common scheme gives the image block its own positions (with 2-D-aware variants encoding row and column) before the text positions resume.

Three knobs, one goal — let language condition on vision. Projector/early fusion (LLaVA) puts image tokens in the sequence: minimal new parameters, but cost scales with image-token count through context length and KV cache. Cross-attention (Flamingo) keeps image features outside the sequence and lets text attend to them via added gated layers: the text context stays short, but you add parameters and architectural complexity. Native multimodal trains one model on interleaved tokens from the start: tightest fusion, highest data/compute bill. Same goal, different invoices.

How a VLM is actually trained: stages and freezing

Building a VLM out of a text LLM and a pretrained vision backbone is mostly a curriculum of what to freeze. The standard projector-style recipe has two stages.

Stage 1 — alignment / connector warm-up. Freeze both the vision encoder and the LLM; train only the projector, on a large, cheap pile of image–caption pairs. The connector's job is narrow — learn to land CLIP-space vectors in the LLM's embedding space — so a small trainable module and simple data suffice. Nothing else moves, so this stage is fast and stable.

Stage 2: visual instruction tuning. Unfreeze the LLM and keep training the projector (the vision encoder often stays frozen, or is fine-tuned only gently), and train on high-quality multimodal instructions: “describe this,” “answer this question about the chart,” “find the bug in this screenshot.” This is where the model learns to use vision to follow instructions, not merely caption.
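One way to keep the curriculum straight is to write it down as a table of trainable flags. The sketch below is only a bookkeeping aid: the stage and module names are hypothetical, and it assumes the projector stays trainable in stage 2 and the vision encoder stays fully frozen, which is one common choice rather than a rule.

```python
# True means the module's weights are updated in that stage.
STAGES = {
    "stage1_alignment": {
        "vision_encoder": False,
        "projector": True,
        "llm": False,
    },
    "stage2_instruction_tuning": {
        "vision_encoder": False,   # assumption: kept frozen here
        "projector": True,         # assumption: keeps training alongside the LLM
        "llm": True,
    },
}

def trainable_modules(stage: str) -> list[str]:
    """Names of the modules that receive gradient updates in a stage."""
    return sorted(name for name, is_trainable in STAGES[stage].items() if is_trainable)

print(trainable_modules("stage1_alignment"))           # ['projector']
print(trainable_modules("stage2_instruction_tuning"))  # ['llm', 'projector']
```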

Two practical pains recur in the questions. First, good multimodal instruction data is hard to get: each example must pair an image with an instruction and a faithful, genuinely image-grounded answer — laborious to write and easy to get subtly wrong. Teams lean on synthetic data (a strong model writes question–answer pairs about images), which scales beautifully but risks teaching the student the teacher's mistakes and hallucinations. Second, flooding the model with vision data can degrade its pure-text ability (a form of catastrophic forgetting); the standard fix is to mix in text-only data during multimodal training and watch text benchmarks as a guardrail against regression.

Counting the cost: “image $=$ many tokens,” resolution, and video

Here is the fact that drives most VLM serving headaches: an image is not one token, it is hundreds. A single $336\times336$ image at $P=14$ is $576$ tokens. Prepend it to a $1{,}000$-token prompt and the sequence is $1{,}576$ tokens, of which the image is more than a third ($576/1576\approx0.37$), so a single picture accounts for a large share of the prefill compute and the KV cache. And it gets worse with resolution: because token count grows with image area, doubling the side length quadruples the patches.
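Both claims are quick to check numerically. The helper names below are illustrative, and `tokens_for_side` assumes a square image whose side is a multiple of the patch size.

```python
def image_share(image_tokens: int, text_tokens: int) -> tuple[int, float]:
    """Total sequence length and the fraction of it taken by image tokens."""
    total = image_tokens + text_tokens
    return total, image_tokens / total

def tokens_for_side(side: int, patch: int = 14) -> int:
    """Patch-token count for a square image of the given side length."""
    return (side // patch) ** 2

total, share = image_share(576, 1000)
print(total, round(share, 3))                        # 1576 0.365

print(tokens_for_side(336), tokens_for_side(672))    # 576 2304
```

Doubling the side from 336 to 672 pixels takes the count from 576 to 2,304 tokens, a factor of four.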


This quadratic blow-up is why high-resolution schemes use tiling (the “AnyRes” idea): instead of shrinking a detailed image down to $336\times336$ — which would smear the small text on a chart into mush — split the image into a grid of tiles, encode each tile at native resolution, and add one downsized thumbnail for global context. If there are $N$ tiles of $P_t$ tokens each plus a thumbnail of $P_0$ tokens, the total is

$$N_{\text{img}}=N\cdot P_t + P_0,$$

which is what lets a VLM read fine print, at a token (and latency) cost that scales with the tile count. The same budget arithmetic governs video: a clip is just a stack of frames, so with $P$ tokens per frame (here $P$ counts tokens, not the patch side) and a context budget $B$ you can afford at most $f_{\max}=\lfloor B/P\rfloor$ frames. A one-hour clip exceeds any realistic budget if every frame is kept, so video forces aggressive frame sampling or temporal pooling (merging adjacent frames' tokens). You trade temporal coverage against per-frame detail, and the craft is spending the budget where motion or content actually changes rather than on static stretches.
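A short sketch of both budgets. All the concrete numbers (4 tiles, a 32,000-token budget, 144 tokens per pooled frame) are hypothetical values chosen for illustration, not figures from any particular model.

```python
def tiled_image_tokens(num_tiles: int, tokens_per_tile: int, thumbnail_tokens: int) -> int:
    """N_img = N * P_t + P_0 for a tiled image plus one thumbnail."""
    return num_tiles * tokens_per_tile + thumbnail_tokens

def max_frames(budget: int, tokens_per_frame: int) -> int:
    """Largest whole number of frames that fits in the token budget."""
    return budget // tokens_per_frame

# A 2 x 2 grid of tiles at 576 tokens each, plus a 576-token thumbnail.
print(tiled_image_tokens(4, 576, 576))   # 2880

# Video under a hypothetical 32,000-token budget.
print(max_frames(32_000, 576))           # 55 frames at full per-frame detail
print(max_frames(32_000, 144))           # 222 frames if each frame is pooled to 144 tokens
```

The last two lines show the trade named above: cutting per-frame tokens by four buys roughly four times the temporal coverage.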


The signature failure: hallucination and language priors

A VLM's characteristic failure mode is visual hallucination: confidently describing objects, text, or relationships that are simply not in the image. The root cause is subtle and important. The LLM half of the model is an enormously strong language prior, trained on oceans of text in which, say, “kitchen” co-occurs with “refrigerator.” Show it a kitchen with no fridge and ask “is there a refrigerator?”, and the model can answer “yes” by pattern-matching the prompt and its priors rather than by consulting the pixels. It produces a fluent, plausible answer without actually looking. This is the model “cheating” via language priors, and it is dangerous precisely because the wrong answer sounds exactly as confident as a right one.

The cure is better visual grounding — tying every claim to evidence in the image — pursued on three fronts. On data: include negative and counterfactual examples (questions whose honest answer is “no, that object is not here”) so the model is rewarded for saying no. On preference optimization: when you run RLHF or DPO for a VLM (Topics 7–8), the reward or preference labels must reward factual grounding, not just fluent style — prefer the response that correctly omits an absent object over the smoother one that invents it. (The usual over-optimization caveat applies: push too hard on a grounding proxy and the model turns terse and over-cautious.) On evaluation: you must separate “the model sees” from “the model guesses.” The cleanest probe is to hold the question fixed and swap the image: a model that truly grounds changes its answer when the image changes; one riding on priors gives the same answer regardless. Object-presence probes — asking about both present and absent objects and scoring the false-positive rate — measure hallucination far more honestly than a single captioning score.
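The two evaluation probes reduce to simple counting once the model's answers are collected. The sketch below assumes that collection has already happened; the record formats, function names, and the tiny example data are all made up for illustration.

```python
def false_positive_rate(records) -> float:
    """Share of absent-object questions the model answered 'yes' to.

    records: iterable of (object_present, model_said_yes) boolean pairs.
    """
    absent = [said_yes for present, said_yes in records if not present]
    if not absent:
        raise ValueError("need at least one absent-object question")
    return sum(absent) / len(absent)

def answer_change_rate(paired_answers) -> float:
    """Share of questions whose answer changed when the image was swapped.

    paired_answers: list of (answer_on_original, answer_on_swapped) pairs,
    where the swap was chosen so that the correct answer differs.
    """
    changed = sum(a != b for a, b in paired_answers)
    return changed / len(paired_answers)

# Hypothetical object-presence results: three questions about absent objects.
presence = [(True, True), (False, True), (False, False), (False, False), (True, False)]
print(round(false_positive_rate(presence), 2))    # 0.33

# Hypothetical image-swap results: the answer moved on two of four questions.
swaps = [("yes", "no"), ("yes", "yes"), ("no", "yes"), ("yes", "yes")]
print(answer_change_rate(swaps))                  # 0.5
```

A low false-positive rate together with a high answer-change rate is the signature of a model that consults the image; the reverse pattern suggests it is riding on priors.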


Other modalities, briefly

The patches-into-tokens template generalizes. Audio is usually turned into a spectrogram (a time–frequency image) and patched just like a picture, or encoded by a dedicated speech encoder, then projected into the LLM's space. Video, as above, is frames-as-images plus a temporal budget. The recurring pattern across all of them is identical to the vision story: a modality-specific encoder produces vectors, a small connector maps them to $d_{\text{model}}$ , and the LLM does the reasoning. Learn the image case and the rest is variations on a theme.

What to watch for / why it matters

Two threads run through every question in this topic, and naming them makes the rest feel familiar. The token thread: an image becomes a grid of patch tokens (ViT), borrowed from a contrastively-trained encoder (CLIP), projected into $d_{\text{model}}$ and either placed in the sequence (early fusion / LLaVA), attended to via gated cross-attention (Flamingo), or trained in natively — with cost scaling as image area, which makes resolution, tiling, and video all exercises in token budgeting. The grounding thread: because a powerful language prior can answer without looking, the genuinely hard engineering lives not in the architecture but in the data, rewards, and evaluations that force the model to use the pixels and let you verify that it did. Keep those two threads in mind — “how many tokens, and did it actually look?” — and the detailed questions on AnyRes tiling, InfoNCE derivations, Flamingo adapter FLOPs, video frame budgets, and hallucination evals will all read as variations on parts you have already met.
