RLHF, RL & Preference Optimization (Core) · Problem 5 of 7
Implement Dr. GRPO (GRPO with the length and std-normalization terms removed) and show, on a toy batch, that it changes the relative weighting vs GRPO. [DeepSeek]
Implement the function/class skeleton in the editor. Any correct approach is accepted.
import torch
def grpo_adv(rewards):
raise NotImplementedError
def dr_grpo_adv(rewards):
raise NotImplementedError
def grpo_objective(per_tok_pg, mask):
raise NotImplementedError
def dr_grpo_objective(per_tok_pg, mask):
raise NotImplementedErrorReady when you are
Submit your solution and a structured review appears here — verdict, score, and concrete feedback. Any correct approach passes.
Implement Dr. GRPO (GRPO with the length and std-normalization terms removed) and show, on a toy batch, that it changes the relative weighting vs GRPO. [DeepSeek]
Implement the function/class skeleton in the editor. Any correct approach is accepted.