Practice

Every problem, in one place

86 interactive coding problems and 1177+ theory & math questions across 17 chapters. Filter by difficulty, jump straight in, and track what you've solved.

01 Transformer Architecture Internals

Implement scaled dot-product attention with a causal mask in numpy · Coding · Easy
Implement RMSNorm from scratch in PyTorch · Coding · Easy
Implement multi-head attention from scratch in PyTorch incl · Coding · Medium
Implement RoPE applied to a [batch, heads, seq, head_dim] tensor · Coding · Medium
Implement grouped-query attention with configurable KV heads plus a KV cache for increment · Coding · Hard
Implement a single MLA layer (down-proj to latent, up-proj, decoupled RoPE) · Coding · Hard
Implement a numerically stable online-softmax attention pass (FlashAttention recurrence) i · Coding · Super-hard
Theory questions · 13 Q&A · Theory · Warm-up
Theory questions · 12 Q&A · Theory · Easy
Theory questions · 12 Q&A · Theory · Medium
Theory questions · 11 Q&A · Theory · Hard
Theory questions · 7 Q&A · Theory · Insane
Math questions · 6 Q&A · Hands-on / Math · Super easy
Math questions · 6 Q&A · Hands-on / Math · Medium
Math questions · 9 Q&A · Hands-on / Math · Hard

02 Tokenization & Embeddings

Given a merge list, implement a BPE tokenizer for a string · Coding · Easy
Implement BPE training (learn merges) from a corpus in pure Python · Coding · Medium
Implement Viterbi segmentation for a unigram-LM tokenizer given token log-probs · Coding · Hard
Implement a byte-level BPE end-to-end (train + encode + decode) over arbitrary UTF-8 b · Coding · Super-hard
Theory questions · 12 Q&A · Theory · Warm-up
Theory questions · 13 Q&A · Theory · Easy
Theory questions · 11 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 10 Q&A · Hands-on / Math · Super easy

03 Attention Efficiency & Long Context

Implement sliding-window causal attention in PyTorch · Coding · Easy
Implement a KV cache + incremental single-token decode loop for a small transformer · Coding · Medium
Implement grouped-query attention (GQA) in PyTorch by repeating/broadcasting KV heads acro · Coding · Medium
Implement a blocked/tiled attention forward pass (FlashAttention-style) with running max/s · Coding · Hard
Implement YaRN RoPE scaling (frequency grouping + attention-logit scaling) and demonstrate · Coding · Super-hard
Theory questions · 13 Q&A · Theory · Warm-up
Theory questions · 12 Q&A · Theory · Easy
Theory questions · 11 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 8 Q&A · Hands-on / Math · Super easy
Math questions · 4 Q&A · Hands-on / Math · Hard
Intuition questions · 7 Q&A · Experiments / Practitioner Intuition · Other

04 Pretraining Objectives & Scaling Laws

Implement next-token cross-entropy for a batch of logits/targets in numpy with padding mas · Coding · Easy
Given Chinchilla coefficients, return compute-optimal $N$ and $D$ for a budget $C$ (numeri · Coding · Medium
Fit $L(N,D)=E+A/N^{\alpha}+B/D^{\beta}$ to a synthetic $(N,D,\text{loss})$ grid via least · Coding · Hard
Implement an IsoFLOP analysis: from loss curves at several fixed-FLOP budgets, extract the · Coding · Super-hard
Theory questions · 13 Q&A · Theory · Warm-up
Theory questions · 12 Q&A · Theory · Easy
Theory questions · 10 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 7 Q&A · Hands-on / Math · Super easy
Math questions · 5 Q&A · Hands-on / Math · Hard
Intuition questions · 4 Q&A · Experiments / Practitioner Intuition · Other

05 Optimization & Training Dynamics

Implement AdamW from scratch in numpy for one parameter tensor · Coding · Easy
Implement a cosine LR schedule with linear warmup as a callable · Coding · Medium
Implement global gradient-norm clipping over a list of tensors · Coding · Medium
Implement the Muon step (momentum + Newton–Schulz orthogonalization) for 2D params, falli · Coding · Hard
Implement a mixed-precision loop (bf16 compute, fp32 master weights) with loss scaling on · Coding · Super-hard
Theory questions · 12 Q&A · Theory · Warm-up
Theory questions · 12 Q&A · Theory · Easy
Theory questions · 11 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 8 Q&A · Hands-on / Math · Super easy
Math questions · 5 Q&A · Hands-on / Math · Hard
Intuition questions · 6 Q&A · Experiments / Practitioner Intuition · Other

06 Infrastructure, Distributed Training & Scaling

Implement a function computing total + per-GPU memory for a given model/parallelism conf · Coding · Easy
Implement toy data-parallel SGD with manual gradient all-reduce (torch · Coding · Medium
Implement a 1F1B pipeline-schedule simulator reporting the bubble fraction for given (stag · Coding · Hard
Implement tensor-parallel linear layers (column-parallel then row-parallel) with correct f · Coding · Super-hard
Theory questions · 12 Q&A · Theory · Warm-up
Theory questions · 12 Q&A · Theory · Easy
Theory questions · 10 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 10 Q&A · Hands-on / Math · Super easy
Math questions · 1 Q&A · Hands-on / Math · Insane
Intuition questions · 6 Q&A · Experiments / Practitioner Intuition · Other

07 Mixture-Of-Experts

Implement top-$k$ routing (softmax $\to$ top-$k$ $\to$ renormalized gates) in PyTorch · Coding · Easy
Implement a full sparse MoE FFN with capacity, token dropping, and gate-weighted combinati · Coding · Medium
Implement the load-balancing aux loss and a training step demonstrating it equalizes exper · Coding · Hard
Implement expert-parallel dispatch/combine with a simulated all-to-all and verify outputs · Coding · Super-hard
Theory questions · 13 Q&A · Theory · Warm-up
Theory questions · 13 Q&A · Theory · Easy
Theory questions · 11 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 10 Q&A · Hands-on / Math · Super easy
Math questions · 1 Q&A · Hands-on / Math · Insane
Intuition questions · 4 Q&A · Experiments / Practitioner Intuition · Other

08 SFT, Instruction Tuning, Data & PEFT

Implement prompt-token loss masking given (prompt_len, total_len) per sample · Coding · Easy
Implement a LoRA-wrapped linear layer (frozen $W$ + trainable $B\cdot A$ scaled by $ · Coding · Medium
Implement sequence packing with a block-diagonal attention mask so packed samples can't at · Coding · Hard
Implement a synthetic-data pipeline: generate candidate examples, validate them with a rul · Coding · Hard
Implement QLoRA-style NF4 4-bit quantization of a weight matrix plus a LoRA adapter, verif · Coding · Super-hard
Build a dataset-audit tool that flags duplicated prompts, suspicious templates, tool-call · Coding · Super-hard
Theory questions · 12 Q&A · Theory · Warm-up
Theory questions · 12 Q&A · Theory · Easy
Theory questions · 13 Q&A · Theory · Medium
Theory questions · 13 Q&A · Theory · Hard
Theory questions · 7 Q&A · Theory · Insane
Math questions · 10 Q&A · Hands-on / Math · Super easy
Intuition questions · 7 Q&A · Experiments / Practitioner Intuition · Other

09 RLHF, RL & Preference Optimization (Core)

Implement the Bradley–Terry pairwise reward-model loss in PyTorch · Coding · Easy
Implement GAE (the backward recursion) given per-token rewards and value estimates · Coding · Medium
Implement the GRPO group-normalized advantage and the clipped token-level objective · Coding · Medium
Implement a minimal PPO update for LLMs: ratio, clipped surrogate, value loss, per-token K · Coding · Hard
Implement Dr · Coding · Hard
Implement an end-to-end toy RLHF loop on a “bandit-LM”: train an RM from synthetic prefe · Coding · Super-hard
Implement DAPO on a toy RLVR task: clip-higher, dynamic sampling (drop all-correct/all-wro · Coding · Super-hard
Theory questions · 12 Q&A · Theory · Warm-up
Theory questions · 13 Q&A · Theory · Easy
Theory questions · 10 Q&A · Theory · Medium
Theory questions · 13 Q&A · Theory · Hard
Theory questions · 7 Q&A · Theory · Insane
Math questions · 10 Q&A · Hands-on / Math · Super easy
Math questions · 10 Q&A · Hands-on / Math · Hard
Math questions · 2 Q&A · Hands-on / Math · Insane
Intuition questions · 12 Q&A · Experiments / Practitioner Intuition · Other

10 Alignment Algorithms Zoo

Implement the DPO loss given policy/reference logprobs for chosen/rejected and $\beta$ · Coding · Easy
Implement SimPO (length-normalized, reference-free) and KTO losses and unit-test on toy da · Coding · Medium
Implement best-of-$n$ selection given a reward/verifier over sampled completions · Coding · Medium
Implement on-policy distillation: sample from the student, score tokens under a (toy) teac · Coding · Hard
Implement iterative/online DPO: generate on-policy pairs, label with a toy preference func · Coding · Super-hard
Theory questions · 12 Q&A · Theory · Warm-up
Theory questions · 11 Q&A · Theory · Easy
Theory questions · 11 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 7 Q&A · Hands-on / Math · Super easy
Math questions · 7 Q&A · Hands-on / Math · Hard
Intuition questions · 6 Q&A · Experiments / Practitioner Intuition · Other

11 Reasoning & Test-Time Compute

Implement self-consistency: sample $N$ CoTs, extract answers, return the majority vote · Coding · Easy
Implement best-of-$n$ selection given a reward/verifier over sampled completions · Coding · Medium
Implement beam-search-over-reasoning-steps that expands/prunes partial CoTs using a PRM sc · Coding · Hard
Implement a rule-based-reward GRPO loop on a toy arithmetic task rewarding a correct boxed · Coding · Super-hard
Theory questions · 12 Q&A · Theory · Warm-up
Theory questions · 12 Q&A · Theory · Easy
Theory questions · 11 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 10 Q&A · Hands-on / Math · Super easy
Math questions · 1 Q&A · Hands-on / Math · Insane
Intuition questions · 5 Q&A · Experiments / Practitioner Intuition · Other

12 Evaluation, Reward Hacking & Alignment Methodology

Implement ECE given arrays of predicted confidences and correctness · Coding · Easy
Implement a pairwise LLM-as-judge harness with position-swap debiasing (run both orders, a · Coding · Medium
Implement a bootstrap confidence interval for win-rate from paired preference judgments · Coding · Hard
Implement an n-gram/embedding contamination detector that flags eval items overlapping a t · Coding · Super-hard
Theory questions · 13 Q&A · Theory · Warm-up
Theory questions · 13 Q&A · Theory · Easy
Theory questions · 13 Q&A · Theory · Medium
Theory questions · 12 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 10 Q&A · Hands-on / Math · Super easy
Math questions · 1 Q&A · Hands-on / Math · Insane
Intuition questions · 4 Q&A · Experiments / Practitioner Intuition · Other

13 Inference & Serving

Implement temperature + top-$k$ + top-$p$ sampling from a logits vector in numpy · Coding · Easy
Implement int8 symmetric per-channel weight quantization and dequantization for a linear l · Coding · Medium
Implement nucleus (top-$p$) sampling with correct renormalization and edge cases · Coding · Medium
Implement speculative decoding: draft proposes $k$ tokens, target verifies in one pass, ac · Coding · Hard
Implement a continuous-batching scheduler simulator with a paged KV cache that admits/evic · Coding · Super-hard
Theory questions · 13 Q&A · Theory · Warm-up
Theory questions · 13 Q&A · Theory · Easy
Theory questions · 11 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 9 Q&A · Hands-on / Math · Super easy
Math questions · 4 Q&A · Hands-on / Math · Hard
Intuition questions · 5 Q&A · Experiments / Practitioner Intuition · Other

14 Agents, Tool Use & Product Post-Training

Validate a tool-call object against a JSON schema (required fields and types) · Coding · Easy
Compute an agent success-rate metric over a list of trajectories · Coding · Easy
Implement a tool-calling loop with retries, timeouts, and error handling · Coding · Medium
Implement a code-evaluation harness that runs unit tests against generated solutions and r · Coding · Medium
Implement a preference-dataset builder from accepted/rejected suggestions, including a pos · Coding · Hard
Implement an inverse-propensity-weighted (IPS) off-policy evaluator for logged agent actio · Coding · Hard
Build a toy agent environment where the model can solve a task, call tools, fail safely, o · Coding · Super-hard
Theory questions · 14 Q&A · Theory · Warm-up
Theory questions · 14 Q&A · Theory · Easy
Theory questions · 10 Q&A · Theory · Medium
Theory questions · 8 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 9 Q&A · Hands-on / Math · Super easy
Math questions · 5 Q&A · Hands-on / Math · Hard
Intuition questions · 6 Q&A · Experiments / Practitioner Intuition · Other

15 Diffusion & Non-Autoregressive Language Models

Implement the forward masking process for a masked diffusion LM (mask a fraction $t$ of to · Coding · Easy
Implement a single reverse-denoising step: predict all masked tokens, keep the most confid · Coding · Medium
Implement a minimal masked-diffusion training loss (sample a mask rate, mask tokens, cross · Coding · Hard
Implement a small end-to-end masked diffusion LM sampler with confidence-based remasking a · Coding · Super-hard
Theory questions · 12 Q&A · Theory · Warm-up
Theory questions · 12 Q&A · Theory · Easy
Theory questions · 12 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 10 Q&A · Hands-on / Math · Super easy
Math questions · 2 Q&A · Hands-on / Math · Insane
Intuition questions · 4 Q&A · Experiments / Practitioner Intuition · Other

16 Multimodal / Vision-Language (Lighter)

Implement patch embedding (conv or unfold + linear) converting an image tensor to patch · Coding · Easy
Implement a projection adapter and interleave image + text embeddings into one sequence wi · Coding · Medium
Implement cross-attention adapter layers letting text tokens attend to frozen vision featu · Coding · Hard
Theory questions · 12 Q&A · Theory · Warm-up
Theory questions · 13 Q&A · Theory · Easy
Theory questions · 10 Q&A · Theory · Medium
Theory questions · 10 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 10 Q&A · Hands-on / Math · Super easy
Math questions · 1 Q&A · Hands-on / Math · Insane
Intuition questions · 3 Q&A · Experiments / Practitioner Intuition · Other

17 Research Engineering & Debugging

Write shape assertions for the tensors in an attention forward pass · Coding · Easy
Write a check that verifies labels are correctly shifted by one relative to inputs · Coding · Easy
Write unit tests for causal masking (no token may attend to the future) · Coding · Medium
Write a cached-vs-full-forward equivalence test for incremental decoding · Coding · Medium
Debug a broken GPT training notebook containing NaNs, a mask bug, and shifted labels · Coding · Hard
Debug a broken GRPO implementation with wrong grouping, wrong masks, and wrong old-logprob · Coding · Hard
Build a toy post-training stack with intentional bugs, then write tests that catch every o · Coding · Super-hard
Implement an automated regression harness that compares outputs, logits, losses, rewards, · Coding · Super-hard
Theory questions · 8 Q&A · Theory · Warm-up
Theory questions · 10 Q&A · Theory · Easy
Theory questions · 9 Q&A · Theory · Medium
Theory questions · 6 Q&A · Theory · Hard
Theory questions · 5 Q&A · Theory · Insane
Math questions · 6 Q&A · Hands-on / Math · Super easy
Math questions · 9 Q&A · Hands-on / Math · Medium
Math questions · 2 Q&A · Hands-on / Math · Insane
Intuition questions · 6 Q&A · Experiments / Practitioner Intuition · Other