
Chapter 10: Alignment Algorithms Zoo


Super-hard · Problem 5 / 5

Implement iterative/online DPO: generate on-policy pairs, label with a toy preference function, and run successive DPO rounds; show the win-rate trend.

Implement the function/class skeleton in the editor. Any correct approach is accepted.
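For reference, each round can minimize the standard DPO objective on the pairs labelled in that round. The toy setting here has no prompt, so the usual conditioning on a prompt is dropped. Resetting the reference policy to the policy from the start of the round is one common choice for the iterative variant and is an assumption here, as is the definition of win rate against the initial policy $\pi_0$ (ties counted as half a win).

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(y_w,\,y_l)}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w)}{\pi_{\mathrm{ref}}(y_w)}
        - \beta \log \frac{\pi_\theta(y_l)}{\pi_{\mathrm{ref}}(y_l)}
      \right)
    \right],
\qquad
\mathrm{winrate}(\pi_\theta)
  = \Pr_{a \sim \pi_\theta,\; b \sim \pi_0}\!\left[\, a \succ b \,\right]
    + \tfrac{1}{2}\Pr\!\left[\, a \sim b \,\right]
```

Here $y_w$ and $y_l$ are the preferred and rejected samples of a pair, $\sigma$ is the logistic function, and $\beta$ controls how strongly the policy is held close to the reference.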


solution.py (Python)


import torch
import torch.nn as nn
import torch.nn.functional as F

def seq_logp(model, toks):
    raise NotImplementedError

@torch.no_grad()
def sample(model, B):
    raise NotImplementedError

def oracle_pref(a, b):
    raise NotImplementedError
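One possible way to fill in the skeleton above is sketched below. It is a minimal illustration, not a reference solution, and it rests on several assumptions that are not part of the problem: a hypothetical bigram policy `TinyLM`, a toy oracle that prefers the sequence containing more copies of token 0, an extra `T` argument on `sample`, the helper names `dpo_loss`, `win_rate`, and `run_online_dpo`, and all hyperparameter values. Tied pairs are dropped from training, and the reference model is reset to the current policy at the start of every round.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Hypothetical bigram policy: next-token logits depend only on the previous token."""
    def __init__(self, vocab=8, dim=16):
        super().__init__()
        self.vocab = vocab
        self.emb = nn.Embedding(vocab + 1, dim)  # index `vocab` serves as BOS
        self.out = nn.Linear(dim, vocab)

    def forward(self, prev):
        return self.out(self.emb(prev))

def seq_logp(model, toks):
    """Summed log-probability of each sequence; toks (B, T) -> (B,)."""
    bos = torch.full((toks.size(0), 1), model.vocab, dtype=torch.long)
    prev = torch.cat([bos, toks[:, :-1]], dim=1)
    logp = F.log_softmax(model(prev), dim=-1)
    return logp.gather(-1, toks.unsqueeze(-1)).squeeze(-1).sum(-1)

@torch.no_grad()
def sample(model, B, T=6):
    """Draw B sequences of length T from the model (T is an assumed extra argument)."""
    prev = torch.full((B,), model.vocab, dtype=torch.long)
    out = []
    for _ in range(T):
        probs = F.softmax(model(prev), dim=-1)
        prev = torch.multinomial(probs, 1).squeeze(1)
        out.append(prev)
    return torch.stack(out, dim=1)

def oracle_pref(a, b):
    """Toy oracle (assumption): a beats b when it contains strictly more 0 tokens."""
    return (a == 0).sum(1) > (b == 0).sum(1)

def dpo_loss(policy, ref, chosen, rejected, beta=0.1):
    pi_margin = seq_logp(policy, chosen) - seq_logp(policy, rejected)
    with torch.no_grad():  # the reference model is frozen
        ref_margin = seq_logp(ref, chosen) - seq_logp(ref, rejected)
    return -F.logsigmoid(beta * (pi_margin - ref_margin)).mean()

@torch.no_grad()
def win_rate(policy, base, n=2000):
    """Fraction of policy samples the oracle prefers over base samples; ties count 0.5."""
    a, b = sample(policy, n), sample(base, n)
    wins, losses = oracle_pref(a, b), oracle_pref(b, a)
    ties = ~(wins | losses)
    return (wins.float() + 0.5 * ties.float()).mean().item()

def run_online_dpo(rounds=4, steps=50, B=256, beta=0.1, lr=1e-2, seed=0):
    torch.manual_seed(seed)
    policy = TinyLM()
    base = copy.deepcopy(policy).eval()      # fixed baseline for the win rate
    rates = [win_rate(policy, base)]
    for _ in range(rounds):
        ref = copy.deepcopy(policy).eval()   # reference = policy at round start
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        a, b = sample(policy, B), sample(policy, B)   # on-policy pairs
        a_wins, b_wins = oracle_pref(a, b), oracle_pref(b, a)
        keep = a_wins | b_wins               # drop ties: they carry no preference
        if keep.any():
            chosen = torch.where(a_wins[:, None], a, b)[keep]
            rejected = torch.where(a_wins[:, None], b, a)[keep]
            for _ in range(steps):
                loss = dpo_loss(policy, ref, chosen, rejected, beta)
                opt.zero_grad()
                loss.backward()
                opt.step()
        rates.append(win_rate(policy, base))
    return rates

if __name__ == "__main__":
    for r, wr in enumerate(run_online_dpo()):
        print(f"round {r}: win rate vs. initial policy = {wr:.3f}")
```

The first entry of the returned list is measured before any training, so it should sit near 0.5. Later entries show the trend across rounds. Once the policy emits mostly token 0, most sampled pairs tie, so less training signal remains and the curve is expected to flatten.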
