RLHF, RL & Preference Optimization (Core) · Problem 6 of 7
Implement an end-to-end toy RLHF loop on a “bandit-LM”: train an RM from synthetic preferences, optimize the policy with PPO + KL control, and plot gold vs proxy reward to exhibit over-optimization. [OpenAI]
Implement the function/class skeleton in the editor. Any correct approach is accepted.
import numpy as np

def softmax(z):
    raise NotImplementedError

def bt_fit(n_pairs=500, lr=0.1, steps=2000, noise=1.5):
    raise NotImplementedError

def train(beta, lr=0.2, steps=600):
    raise NotImplementedError
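One possible way to fill in the skeleton is sketched below; it is a minimal illustration under stated assumptions, not a reference solution. The number of arms `K`, the gold reward vector `GOLD`, the helper `log_softmax`, the `seed` argument, the extra `r_proxy` argument to `train`, and the `sweep` helper are all hypothetical additions that the skeleton does not specify. Preference labels are assumed to come from comparing gold rewards corrupted by Gaussian noise of scale `noise`. Because the policy is a softmax over a handful of arms, the sketch takes the exact gradient of the KL-regularized objective against a uniform reference policy instead of PPO's sampled, clipped surrogate.

```python
import numpy as np

K = 10  # number of arms ("responses"); hypothetical choice
GOLD = np.random.default_rng(0).normal(size=K)  # hypothetical gold reward per arm

def log_softmax(z):
    # Shift by the max for numerical stability before exponentiating.
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def softmax(z):
    return np.exp(log_softmax(z))

def bt_fit(n_pairs=500, lr=0.1, steps=2000, noise=1.5, seed=1):
    """Fit a Bradley-Terry reward model to noisy synthetic preferences."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, K, n_pairs)
    j = rng.integers(0, K, n_pairs)
    # Assumption: arm i is preferred when its noise-corrupted gold reward is larger.
    y = (GOLD[i] + noise * rng.normal(size=n_pairs)
         > GOLD[j] + noise * rng.normal(size=n_pairs)).astype(float)
    r = np.zeros(K)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(r[i] - r[j])))  # P(i preferred over j)
        g = (y - p) / n_pairs                     # per-pair log-likelihood gradient
        r += lr * (np.bincount(i, weights=g, minlength=K)
                   - np.bincount(j, weights=g, minlength=K))
    return r - r.mean()  # BT rewards are identified only up to a constant

def train(beta, r_proxy, lr=0.2, steps=600):
    """Maximize E_pi[r_proxy] - beta * KL(pi || uniform) by exact gradient ascent."""
    log_ref = np.full(K, -np.log(K))  # uniform reference policy
    theta = np.zeros(K)               # policy logits, initialized at the reference
    for _ in range(steps):
        logp = log_softmax(theta)
        pi = np.exp(logp)
        f = r_proxy - beta * (logp - log_ref)  # KL-penalized per-arm reward
        theta += lr * pi * (f - pi @ f)        # softmax policy gradient
    logp = log_softmax(theta)
    pi = np.exp(logp)
    kl = float(pi @ (logp - log_ref))
    return float(pi @ r_proxy), float(pi @ GOLD), kl

def sweep(betas=(3.0, 1.0, 0.3, 0.1, 0.03, 0.01)):
    # One reward model, many KL coefficients: smaller beta permits more drift.
    r_proxy = bt_fit()
    return [(b, *train(b, r_proxy)) for b in betas]

if __name__ == "__main__":
    for b, proxy, gold, kl in sweep():
        print(f"beta={b:<5} KL={kl:.3f} proxy={proxy:.3f} gold={gold:.3f}")
```

Plotting the `proxy` and `gold` columns against `KL` (for example with matplotlib, not included here) gives the over-optimization curve. Whether gold reward visibly falls while proxy reward keeps rising depends on the seed, the noise scale, and the number of preference pairs, so it may take a few settings to find a clear gap.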