Alignment Algorithms Zoo · Problem 5 of 5
Implement iterative/online DPO: generate on-policy pairs, label with a toy preference function, and run successive DPO rounds; show the win-rate trend.
Implement the function/class skeleton in the editor. Any correct approach is accepted.
import torch
import torch.nn as nn
import torch.nn.functional as F

def seq_logp(model, toks):
    raise NotImplementedError

@torch.no_grad()
def sample(model, B):
    raise NotImplementedError

def oracle_pref(a, b):
    raise NotImplementedError
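
One possible way to fill in the skeleton is sketched below. It is not a reference solution, and several pieces are assumptions made only for illustration: the bigram policy TinyLM, the constants V, T and BOS, the helpers score, dpo_loss, win_rate and run, and the toy rule that the oracle prefers the sequence containing more copies of token V - 1. Each round freezes a copy of the current policy as the DPO reference, samples pairs from it, labels them with the oracle, drops ties, and takes a few gradient steps. The win rate is measured against the initial policy, with ties counted as half a win.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

V, T, BOS = 8, 6, 0  # assumed toy vocab size, sequence length, start token


class TinyLM(nn.Module):
    """Hypothetical bigram policy: logits depend only on the previous token."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, 16)
        self.out = nn.Linear(16, V)

    def forward(self, prev):                  # prev: (B, L) token ids
        return self.out(self.emb(prev))       # (B, L, V) next-token logits


def seq_logp(model, toks):
    """Summed log-probability of each sequence in toks (B, T)."""
    prev = torch.cat([torch.full_like(toks[:, :1], BOS), toks[:, :-1]], dim=1)
    logp = F.log_softmax(model(prev), dim=-1)
    return logp.gather(-1, toks.unsqueeze(-1)).squeeze(-1).sum(-1)


@torch.no_grad()
def sample(model, B):
    """Draw B sequences of length T autoregressively."""
    prev = torch.full((B, 1), BOS, dtype=torch.long)
    out = []
    for _ in range(T):
        probs = F.softmax(model(prev)[:, -1], dim=-1)
        prev = torch.multinomial(probs, 1)    # (B, 1)
        out.append(prev)
    return torch.cat(out, dim=1)


def score(x):
    return (x == V - 1).sum(-1)               # assumed toy reward: count of token V-1


def oracle_pref(a, b):
    """True where a is strictly preferred over b under the toy rule."""
    return score(a) > score(b)


def dpo_loss(policy, ref, win, lose, beta=0.1):
    with torch.no_grad():                      # reference model is frozen
        ref_margin = seq_logp(ref, win) - seq_logp(ref, lose)
    margin = seq_logp(policy, win) - seq_logp(policy, lose)
    return -F.logsigmoid(beta * (margin - ref_margin)).mean()


@torch.no_grad()
def win_rate(policy, base, n=2000):
    sa, sb = score(sample(policy, n)), score(sample(base, n))
    return ((sa > sb).float() + 0.5 * (sa == sb).float()).mean().item()


def run(rounds=3, steps=40, n_pairs=512, beta=0.1, seed=0):
    torch.manual_seed(seed)
    policy = TinyLM()
    base = copy.deepcopy(policy)               # fixed opponent for the win-rate trend
    rates = [win_rate(policy, base)]
    for _ in range(rounds):
        ref = copy.deepcopy(policy)            # this round's frozen reference
        a, b = sample(policy, n_pairs), sample(policy, n_pairs)  # on-policy pairs
        pref, keep = oracle_pref(a, b), score(a) != score(b)     # drop ties
        if keep.sum() == 0:
            break
        win = torch.where(pref.unsqueeze(-1), a, b)[keep]
        lose = torch.where(pref.unsqueeze(-1), b, a)[keep]
        opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
        for _ in range(steps):
            opt.zero_grad()
            dpo_loss(policy, ref, win, lose, beta).backward()
            opt.step()
        rates.append(win_rate(policy, base))
    return rates


if __name__ == "__main__":
    print([round(r, 3) for r in run()])       # win rate vs. initial policy per round
```

Calling run() returns one win rate per round, starting with the untrained policy against itself, so the list can be printed or plotted directly as the trend. Refreshing the reference every round is one common reading of "iterative DPO"; keeping the initial policy as the reference throughout would be an equally valid variant under the problem statement.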