Hard

Implement on-policy distillation

Alignment Algorithms Zoo · Problem 4 of 5

Chapter 10: Alignment Algorithms Zoo

Implement on-policy distillation: sample from the student, score tokens under a (toy) teacher, and apply the reverse-KL policy-gradient update.

Implement the function/class skeleton in the editor. Any correct approach is accepted.
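
One possible shape for a solution is sketched below. It is a minimal illustration under stated assumptions, not the expected skeleton: the student is a tabular bigram model (one row of logits per previous token), the toy teacher is a fixed table of next-token probabilities, and the update is per-token (myopic), ignoring how each token changes later contexts. All names here (TEACHER, sample_rollout, distill_step, mean_reverse_kl, train) are hypothetical. For a softmax row, the gradient of KL(pi || q) with respect to the logits equals the expectation over x ~ pi of (log pi(x) - log q(x)) * (onehot(x) - pi), which is what the step estimates from student samples.

```python
import math
import random

# Toy teacher (an assumption for illustration): row c holds q(next | prev=c).
TEACHER = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.2, 0.1, 0.1, 0.6],
    [0.6, 0.2, 0.1, 0.1],
]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_rollout(student_logits, length, rng, bos=0):
    """Sample one sequence from the student; return (context, token) pairs."""
    pairs, prev = [], bos
    for _ in range(length):
        probs = softmax(student_logits[prev])
        tok = rng.choices(range(len(probs)), weights=probs)[0]
        pairs.append((prev, tok))
        prev = tok
    return pairs


def distill_step(student_logits, teacher_probs, batch, length, lr, rng):
    """One on-policy step: sample from the student, score under the teacher,
    then descend the score-function estimate of the reverse-KL gradient."""
    vocab = len(student_logits)
    grad = [[0.0] * vocab for _ in range(vocab)]
    n = batch * length
    for _ in range(batch):
        for ctx, tok in sample_rollout(student_logits, length, rng):
            pi = softmax(student_logits[ctx])
            # Per-token reverse-KL sample: log pi(x) - log q(x).
            cost = math.log(pi[tok]) - math.log(teacher_probs[ctx][tok])
            for j in range(vocab):
                # d log pi(tok) / d logit_j = 1[j == tok] - pi[j]
                indicator = 1.0 if j == tok else 0.0
                grad[ctx][j] += cost * (indicator - pi[j]) / n
    for c in range(vocab):
        for j in range(vocab):
            student_logits[c][j] -= lr * grad[c][j]


def mean_reverse_kl(student_logits, teacher_probs):
    """Exact KL(student || teacher), averaged over contexts."""
    total = 0.0
    for logits, q in zip(student_logits, teacher_probs):
        pi = softmax(logits)
        total += sum(p * (math.log(p) - math.log(qi)) for p, qi in zip(pi, q))
    return total / len(teacher_probs)


def train(steps=200, batch=16, length=8, lr=2.0, seed=0):
    rng = random.Random(seed)
    vocab = len(TEACHER)
    student = [[0.0] * vocab for _ in range(vocab)]  # uniform start
    before = mean_reverse_kl(student, TEACHER)
    for _ in range(steps):
        distill_step(student, TEACHER, batch, length, lr, rng)
    return before, mean_reverse_kl(student, TEACHER)
```

Calling train() returns the mean reverse KL before and after training, which gives a quick check that the student is moving toward the teacher. A real solution would typically swap the tables for neural networks and let an autograd framework differentiate a surrogate loss, but the sample, score, update loop keeps the same structure.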
