Chapter 10: Alignment Algorithms Zoo

Hard · Problem 4 / 5

Implement on-policy distillation: sample from the student, score tokens under a (toy) teacher, and apply the reverse-KL policy-gradient update.
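One way to make the task concrete is to write out the objective. The block below is a sketch under assumptions, not the exercise's official formulation: $\pi_\theta$ denotes the student, $\pi_{\text{teacher}}$ the teacher, $r_t$ a per-token reward, and $G_t$ a discounted return-to-go. These symbols are introduced here for illustration. With $\gamma = 1$ the estimator is the exact score-function gradient of the negative sequence-level reverse KL; with $\gamma < 1$ it is a biased, lower-variance variant.

```latex
% Sequence-level reverse KL, with samples drawn from the student
\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{teacher}}\right)
  = \mathbb{E}_{x \sim \pi_\theta}\!\left[\sum_{t=1}^{T}
      \log \pi_\theta(x_t \mid x_{<t}) - \log \pi_{\text{teacher}}(x_t \mid x_{<t})\right]

% Per-token reward and discounted return-to-go (treated as constants in the gradient)
r_t = \log \pi_{\text{teacher}}(x_t \mid x_{<t}) - \log \pi_\theta(x_t \mid x_{<t}),
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k

% Policy-gradient estimator: ascend J, i.e. descend the reverse KL
\nabla_\theta J(\theta)
  \approx \mathbb{E}_{x \sim \pi_\theta}\!\left[\sum_{t=1}^{T}
      G_t \,\nabla_\theta \log \pi_\theta(x_t \mid x_{<t})\right]
```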

Implement the function/class skeleton in the editor. Any correct approach is accepted.

solution.py (Python)

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):

    def __init__(self, V, ctx=4):
        raise NotImplementedError

    def step_logits(self, last_tok):
        raise NotImplementedError

@torch.no_grad()
def rollout(student, V, B, T, start=0):
    raise NotImplementedError

def distill_step(student, teacher, opt, V, B=64, T=6, gamma=0.99):
    raise NotImplementedError
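The skeleton above leaves the design open, so the following is only one possible way to fill it in, a minimal sketch rather than a reference solution. Everything beyond the given signatures is an assumption: `TinyLM` is modeled as a bigram-style network (embedding width 16 is arbitrary), `ctx` is stored but unused because `step_logits` only sees the last token, `rollout` returns a `(prev, toks)` pair, the per-token reward is the teacher log-probability minus the student log-probability of the sampled token, and `gamma` discounts the return-to-go.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyLM(nn.Module):
    """Toy bigram-style LM: next-token logits depend only on the last token."""

    def __init__(self, V, ctx=4):
        super().__init__()
        self.V, self.ctx = V, ctx        # ctx kept for signature parity; unused in this sketch
        self.emb = nn.Embedding(V, 16)   # hidden width 16 is an arbitrary choice
        self.out = nn.Linear(16, V)

    def step_logits(self, last_tok):
        # last_tok: (N,) long tensor -> (N, V) logits
        return self.out(self.emb(last_tok))


@torch.no_grad()
def rollout(student, V, B, T, start=0):
    """Sample B sequences of T tokens from the student.

    Returns (prev, toks), each of shape (B, T): prev[:, t] is the token the
    model conditioned on when it sampled toks[:, t].
    """
    last = torch.full((B,), start, dtype=torch.long)
    prev, toks = [], []
    for _ in range(T):
        probs = F.softmax(student.step_logits(last), dim=-1)
        nxt = torch.multinomial(probs, 1).squeeze(1)
        prev.append(last)
        toks.append(nxt)
        last = nxt
    return torch.stack(prev, dim=1), torch.stack(toks, dim=1)


def distill_step(student, teacher, opt, V, B=64, T=6, gamma=0.99):
    """One on-policy distillation update; returns a sampled per-token reverse-KL estimate."""
    prev, toks = rollout(student, V, B, T)
    flat_prev = prev.reshape(-1)
    idx = toks.unsqueeze(-1)

    # Student log-probs of its own samples (with gradients).
    logp_s = F.log_softmax(student.step_logits(flat_prev), dim=-1).view(B, T, V)
    logp_s = logp_s.gather(-1, idx).squeeze(-1)              # (B, T)

    with torch.no_grad():
        # Teacher scores the student's samples; no gradient flows to the teacher.
        logp_t = F.log_softmax(teacher.step_logits(flat_prev), dim=-1).view(B, T, V)
        logp_t = logp_t.gather(-1, idx).squeeze(-1)          # (B, T)
        reward = logp_t - logp_s.detach()                    # negative per-token reverse KL sample
        # Discounted return-to-go, computed backwards in time.
        ret = torch.zeros_like(reward)
        running = torch.zeros(B)
        for t in reversed(range(T)):
            running = reward[:, t] + gamma * running
            ret[:, t] = running

    # REINFORCE surrogate: minimizing this ascends the expected return.
    loss = -(ret * logp_s).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return (-reward).mean().item()
```

To try it, build two `TinyLM(V)` instances, pass an optimizer over `student.parameters()` only, and call `distill_step` in a loop while watching the returned estimate. The estimate is noisy at small batch sizes, and this sketch uses no baseline, so adding one is a natural variance-reduction step.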
