Hard

Implement Dr. GRPO

RLHF, RL & Preference Optimization (Core) · Problem 5 of 7

Chapter 09: RLHF, RL & Preference Optimization (Core)

Implement Dr. GRPO (GRPO with the length and std-normalization terms removed) and show, on a toy batch, that it changes the relative weighting vs GRPO. [DeepSeek]

Implement the function/class skeleton in the editor. Any correct approach is accepted.
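
One possible approach, offered as a minimal sketch rather than a reference solution, is shown below in plain Python. It assumes a single group of sampled responses with scalar rewards and a PPO-style clipped surrogate. The function names (`advantages`, `objective`, `per_token_weights`) and the `norm_const` parameter are hypothetical and chosen only for illustration. GRPO divides the centered rewards by the group standard deviation and averages each response over its own length. Dr. GRPO keeps only the centering and divides the token sum by a fixed constant.

```python
from math import sqrt

def advantages(rewards, use_std, eps=1e-6):
    """Group-relative advantages: centered rewards, optionally divided by the group std."""
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not use_std:
        return centered  # Dr. GRPO: no std normalization
    std = sqrt(sum(c * c for c in centered) / len(rewards))
    return [c / (std + eps) for c in centered]  # GRPO

def objective(ratios, rewards, dr_grpo, clip_eps=0.2, norm_const=1.0):
    """Clipped surrogate objective for one group (to be maximized).

    ratios[i][t] is pi_new / pi_old for token t of response i.
    norm_const is an assumed fixed normalizer used only by Dr. GRPO.
    """
    adv = advantages(rewards, use_std=not dr_grpo)
    total = 0.0
    for a, resp in zip(adv, ratios):
        terms = [
            min(r * a, max(1 - clip_eps, min(r, 1 + clip_eps)) * a)
            for r in resp
        ]
        # GRPO divides by the response's own length; Dr. GRPO by a constant.
        denom = norm_const if dr_grpo else len(resp)
        total += sum(terms) / denom
    return total / len(ratios)

def per_token_weights(rewards, lengths, dr_grpo, norm_const=1.0):
    """Coefficient on each token's gradient term at ratio == 1,
    up to the shared 1/G factor."""
    adv = advantages(rewards, use_std=not dr_grpo)
    return [a / (norm_const if dr_grpo else n) for a, n in zip(adv, lengths)]

# Toy batch: one correct short answer and three wrong answers of growing length.
rewards = [1.0, 0.0, 0.0, 0.0]
lengths = [10, 10, 40, 80]
grpo = per_token_weights(rewards, lengths, dr_grpo=False)
dr = per_token_weights(rewards, lengths, dr_grpo=True)
for n, r, g, d in zip(lengths, rewards, grpo, dr):
    print(f"len={n:3d} reward={r:.0f}  GRPO={g:+.4f}  DrGRPO={d:+.4f}")
```

On this toy batch, GRPO gives the 80-token wrong answer one eighth of the per-token penalty it gives the 10-token wrong answer, because each response is divided by its own length. Dr. GRPO assigns the same per-token weight to all three wrong answers. The std term additionally rescales every advantage in the group by the same factor, which matters when comparing groups with different reward spread.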
