Alignment Algorithms Zoo · Problem 4 of 5
Implement on-policy distillation: sample from the student, score tokens under a (toy) teacher, and apply the reverse-KL policy-gradient update.
Implement the function/class skeleton in the editor. Any correct approach is accepted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    def __init__(self, V, ctx=4):
        raise NotImplementedError

    def step_logits(self, last_tok):
        raise NotImplementedError

@torch.no_grad()
def rollout(student, V, B, T, start=0):
    raise NotImplementedError

def distill_step(student, teacher, opt, V, B=64, T=6, gamma=0.99):
    raise NotImplementedError
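One possible way to fill in the skeleton is sketched below. It is an illustration under stated assumptions, not the reference solution. The per-token reward is taken to be log p_teacher(x_t) - log p_student(x_t) on tokens sampled from the student, so maximizing the discounted return-to-go with a REINFORCE-style loss pushes down a sampled estimate of the reverse KL. The bigram structure of TinyLM, the embedding width of 16, the unused ctx argument, the batch-mean baseline, and the float returned by distill_step are all choices made for this sketch and are not specified by the problem.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyLM(nn.Module):
    """Toy bigram LM: next-token logits depend only on the last token (assumption)."""

    def __init__(self, V, ctx=4):
        super().__init__()
        self.V, self.ctx = V, ctx        # ctx is stored but unused in this sketch
        self.emb = nn.Embedding(V, 16)   # width 16 is an arbitrary choice
        self.out = nn.Linear(16, V)

    def step_logits(self, last_tok):
        # last_tok: (N,) long tensor -> logits: (N, V)
        return self.out(self.emb(last_tok))


@torch.no_grad()
def rollout(student, V, B, T, start=0):
    """Sample B sequences from the student; returns (B, T+1) with the start token first."""
    toks = torch.full((B, 1), start, dtype=torch.long)
    for _ in range(T):
        probs = F.softmax(student.step_logits(toks[:, -1]), dim=-1)
        toks = torch.cat([toks, torch.multinomial(probs, 1)], dim=1)
    return toks


def distill_step(student, teacher, opt, V, B=64, T=6, gamma=0.99):
    """One on-policy update; returns a sampled per-token reverse-KL estimate."""
    toks = rollout(student, V, B, T)
    prev, nxt = toks[:, :-1], toks[:, 1:]             # each (B, T)

    # Re-score the sampled tokens: student with gradients, teacher without.
    s_logp = F.log_softmax(student.step_logits(prev.reshape(-1)), dim=-1).view(B, T, V)
    with torch.no_grad():
        t_logp = F.log_softmax(teacher.step_logits(prev.reshape(-1)), dim=-1).view(B, T, V)
    s_tok = s_logp.gather(-1, nxt.unsqueeze(-1)).squeeze(-1)   # (B, T)
    t_tok = t_logp.gather(-1, nxt.unsqueeze(-1)).squeeze(-1)   # (B, T)

    # Per-token reward: negative sampled reverse-KL term, treated as a constant.
    reward = (t_tok - s_tok).detach()

    # Discounted return-to-go, computed backwards over time.
    ret = torch.zeros_like(reward)
    running = reward.new_zeros(B)
    for t in reversed(range(T)):
        running = reward[:, t] + gamma * running
        ret[:, t] = running

    # Batch-mean baseline per time step (a variance-reduction assumption).
    adv = ret - ret.mean(dim=0, keepdim=True)

    loss = -(adv * s_tok).mean()                       # REINFORCE surrogate
    opt.zero_grad()
    loss.backward()
    opt.step()
    return (s_tok - t_tok).mean().item()
```

A typical loop would build two TinyLM instances with the same V, wrap the student's parameters in an optimizer such as torch.optim.Adam, and call distill_step repeatedly while watching the returned estimate. With gamma=1 the return-to-go weights correspond to the sequence-level reverse KL; smaller values weight nearby tokens more heavily.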