Chapter 09: RLHF, RL & Preference Optimization (Core)

Super-hard · Problem 6 / 7

Implement an end-to-end toy RLHF loop on a “bandit-LM”: train a reward model (RM) from synthetic preferences, optimize the policy with PPO + KL control, and plot gold vs. proxy reward to exhibit over-optimization. [OpenAI]

Implement the function/class skeleton in the editor. Any correct approach is accepted.

solution.py (Python)

import numpy as np

def softmax(z):
    raise NotImplementedError

def bt_fit(n_pairs=500, lr=0.1, steps=2000, noise=1.5):
    raise NotImplementedError

def train(beta, lr=0.2, steps=600):
    raise NotImplementedError
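
One possible way to fill in the skeleton is sketched below. It is a minimal illustration under stated assumptions, not a reference solution. The problem does not define the environment, so the number of arms `K`, the hidden `GOLD` reward vector, the fixed seeds, the noisy-annotator model, the uniform reference policy, the extra `clip` and `epochs` parameters, the `log_softmax` helper, and the dictionary returned by `train` are all choices made here for illustration. `bt_fit` is read as a Bradley-Terry fit. To keep the run deterministic, the PPO clipped surrogate is evaluated as an exact expectation over arms rather than from sampled rollouts.

```python
import numpy as np

K = 20                                          # number of arms/responses (assumed)
GOLD = np.random.default_rng(0).normal(size=K)  # hidden gold reward per arm (assumed)

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())                     # shift by the max for stability
    return e / e.sum()

def log_softmax(z):                             # hypothetical helper, not in the skeleton
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def bt_fit(n_pairs=500, lr=0.1, steps=2000, noise=1.5):
    """Fit Bradley-Terry scores to noisy synthetic preferences over GOLD."""
    rng = np.random.default_rng(1)
    i = rng.integers(K, size=n_pairs)
    j = (i + rng.integers(1, K, size=n_pairs)) % K   # always a different arm
    # Label is 1 when a noisy annotator prefers arm i over arm j (assumed model).
    y = (GOLD[i] - GOLD[j] + noise * rng.normal(size=n_pairs) > 0).astype(float)
    r = np.zeros(K)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(r[i] - r[j])))     # model P(i preferred over j)
        g = y - p                                    # d log-lik / d (r_i - r_j)
        grad = (np.bincount(i, weights=g, minlength=K)
                - np.bincount(j, weights=g, minlength=K))
        r += lr * grad / n_pairs                     # gradient ascent on mean log-lik
    return r - r.mean()                              # scores are shift-invariant

def train(beta, lr=0.2, steps=600, clip=0.2, epochs=4):
    """PPO-clip on policy logits with a KL penalty toward a uniform reference."""
    rm = bt_fit()                                    # proxy reward per arm
    log_ref = log_softmax(np.zeros(K))               # uniform reference (assumed)
    theta = np.zeros(K)                              # policy starts at the reference
    hist = {"gold": [], "proxy": [], "kl": []}
    for _ in range(steps):
        pi_old = softmax(theta)
        logp_old = log_softmax(theta)
        hist["gold"].append(pi_old @ GOLD)
        hist["proxy"].append(pi_old @ rm)
        hist["kl"].append(pi_old @ (logp_old - log_ref))
        shaped = rm - beta * (logp_old - log_ref)    # KL-shaped reward per arm
        adv = shaped - pi_old @ shaped               # baseline: expected shaped reward
        for _ in range(epochs):
            pi = softmax(theta)
            ratio = np.exp(log_softmax(theta) - logp_old)
            # The clipped surrogate has zero gradient where the clip is binding.
            clipped = (((ratio > 1 + clip) & (adv > 0))
                       | ((ratio < 1 - clip) & (adv < 0)))
            w = np.where(clipped, 0.0, pi_old * ratio * adv)
            # Gradient of sum_a pi_old[a] * ratio[a] * adv[a] w.r.t. the logits.
            theta += lr * (w - pi * w.sum())
    return {k: np.array(v) for k, v in hist.items()}
```

For the plot, one option is to call `train` over a sweep of `beta` values and draw each run's `proxy` and `gold` curves against `kl`, for example with matplotlib's `plt.plot`. A smaller `beta` lets the policy drift further from the reference, which is where a gap between proxy and gold reward would be expected to open. How pronounced that gap is in this sketch depends on the assumed `K`, `noise`, and seeds.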
