Super-hard

Implement iterative/online DPO: generate on-policy pairs, label with a toy preference func

Chapter 10: Alignment Algorithms Zoo

Super-hard · Problem 5 / 5

Implement iterative/online DPO: generate on-policy pairs, label with a toy preference function, and run successive DPO rounds; show the win-rate trend.
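One possible shape for a solution, as a minimal sketch rather than a reference answer. It makes several toy assumptions that are not part of the problem statement: the "policy" is a softmax over a handful of discrete responses to a single prompt, the preference function is a hypothetical reward table in which response k has reward k, the reference policy is re-snapshotted from the current policy at the start of each round, and win-rate is measured exactly against the initial (uniform) policy. All function names are illustrative.

```python
import math
import random

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def softmax(logits):
    return [math.exp(lp) for lp in log_softmax(logits)]

def toy_preference(a, b, rewards):
    """Return (chosen, rejected) according to a hypothetical reward table."""
    return (a, b) if rewards[a] >= rewards[b] else (b, a)

def win_rate(policy, baseline, rewards):
    """Exact probability that a policy sample beats a baseline sample (ties count 0.5)."""
    total = 0.0
    for a, pa in enumerate(policy):
        for b, pb in enumerate(baseline):
            if rewards[a] > rewards[b]:
                total += pa * pb
            elif rewards[a] == rewards[b]:
                total += 0.5 * pa * pb
    return total

def dpo_round(logits, ref_logits, pairs, beta, lr):
    """One pass of SGD on the DPO loss over (chosen, rejected) pairs."""
    logits = list(logits)
    ref_logp = log_softmax(ref_logits)
    for w, l in pairs:
        logp = log_softmax(logits)
        margin = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
        # Loss is -log(sigmoid(margin)); its derivative w.r.t. margin is -(1 - sigmoid(margin)).
        g = -(1.0 - 1.0 / (1.0 + math.exp(-margin)))
        # d(margin)/d(logits) = beta * (onehot(w) - onehot(l)); the softmax terms cancel.
        logits[w] -= lr * g * beta
        logits[l] += lr * g * beta
    return logits

def iterative_dpo(num_responses=5, rounds=6, pairs_per_round=200,
                  beta=0.5, lr=0.5, seed=0):
    rng = random.Random(seed)
    rewards = list(range(num_responses))   # assumption: response k has reward k
    logits = [0.0] * num_responses         # start from a uniform policy
    init_policy = softmax(logits)
    history = [win_rate(init_policy, init_policy, rewards)]
    for _ in range(rounds):
        policy = softmax(logits)
        pairs = []
        for _ in range(pairs_per_round):
            # On-policy generation: both responses come from the current policy.
            a, b = rng.choices(range(num_responses), weights=policy, k=2)
            if a != b:                     # identical samples carry no preference signal
                pairs.append(toy_preference(a, b, rewards))
        # Assumption: the reference policy is the snapshot taken at the start of the round.
        logits = dpo_round(logits, logits, pairs, beta, lr)
        history.append(win_rate(softmax(logits), init_policy, rewards))
    return history

if __name__ == "__main__":
    for i, wr in enumerate(iterative_dpo()):
        print(f"round {i}: win-rate vs initial policy = {wr:.3f}")
```

Printing the returned list shows the per-round win-rate against the initial policy, which is the trend the problem asks for. Keeping the reference fixed at the initial policy instead of re-snapshotting it each round is an equally valid variant; in a real setting the tabular policy would be replaced by a language model and the reward table by a preference model or human labels.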

Implement the function/class skeleton in the editor. Any correct approach is accepted.

Hints