Super-hard

RLHF, RL & Preference Optimization (Core) · Problem 6 of 7

Chapter 09 · RLHF, RL & Preference Optimization (Core)

Implement an end-to-end toy RLHF loop on a “bandit-LM”: train an RM from synthetic preferences, optimize the policy with PPO + KL control, and plot gold vs proxy reward to exhibit over-optimization. [OpenAI]

Implement the function/class skeleton in the editor. Any correct approach is accepted.
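
One possible approach is sketched below. It is a minimal illustration under stated assumptions, not a reference solution: the "bandit-LM" is assumed to be a single-context softmax policy over K discrete responses, the gold reward is a random vector, the reward model (RM) is a tabular Bradley-Terry fit, and the KL control is a fixed-coefficient penalty folded into the reward. All names (`train_reward_model`, `ppo_kl`, `run_toy_rlhf`) and hyperparameters are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_reward_model(gold, n_pairs, rng, epochs=200, lr=0.5):
    """Fit a tabular Bradley-Terry RM on noisy synthetic preferences."""
    k = len(gold)
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.sample(range(k), 2)
        # Synthetic labeler: a beats b with probability sigmoid(gold[a] - gold[b]).
        a_wins = rng.random() < sigmoid(gold[a] - gold[b])
        pairs.append((a, b) if a_wins else (b, a))
    rm = [0.0] * k
    for _ in range(epochs):
        grad = [0.0] * k
        for w, l in pairs:  # gradient of log sigmoid(rm[w] - rm[l])
            p = sigmoid(rm[w] - rm[l])
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        rm = [r + lr * g / n_pairs for r, g in zip(rm, grad)]
    return rm

def ppo_kl(rm, gold, rng, beta=0.05, iters=200, batch=64,
           ppo_epochs=4, lr=0.5, clip=0.2):
    """Clipped-surrogate PPO on the proxy reward with a KL penalty to a uniform reference."""
    k = len(rm)
    ref = softmax([0.0] * k)   # assumed reference policy: uniform
    theta = [0.0] * k          # policy logits, initialized at the reference
    history = []
    for _ in range(iters):
        old = softmax(theta)
        actions = rng.choices(range(k), weights=old, k=batch)
        # Proxy reward minus per-sample KL penalty.
        rewards = [rm[a] - beta * math.log(old[a] / ref[a]) for a in actions]
        baseline = sum(rewards) / batch
        advs = [r - baseline for r in rewards]
        for _ in range(ppo_epochs):
            pi = softmax(theta)
            grad = [0.0] * k
            for a, adv in zip(actions, advs):
                ratio = pi[a] / old[a]
                # The clipped objective has zero gradient outside the trust region.
                if (adv > 0 and ratio > 1 + clip) or (adv < 0 and ratio < 1 - clip):
                    continue
                for j in range(k):
                    grad[j] += adv * ratio * ((1.0 if j == a else 0.0) - pi[j])
            theta = [t + lr * g / batch for t, g in zip(theta, grad)]
        pi = softmax(theta)
        # Exact expectations are cheap because the action space is tiny.
        history.append({
            "kl": sum(p * math.log(p / q) for p, q in zip(pi, ref) if p > 0),
            "proxy": sum(p * r for p, r in zip(pi, rm)),
            "gold": sum(p * g for p, g in zip(pi, gold)),
        })
    return history

def run_toy_rlhf(seed=0, k=10, n_pairs=30, **ppo_kwargs):
    rng = random.Random(seed)
    gold = [rng.gauss(0.0, 1.0) for _ in range(k)]
    rm = train_reward_model(gold, n_pairs, rng)
    return gold, rm, ppo_kl(rm, gold, rng, **ppo_kwargs)
```

To produce the requested plot, the `"proxy"` and `"gold"` entries of the returned history can be plotted against `"kl"` (or the iteration index) with any plotting library. With few preference pairs the RM is noisy, so the proxy curve may keep rising while the gold curve flattens or falls; sweeping `beta` and `n_pairs` is one way to make that gap more or less pronounced.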