RLHF, RL & Preference Optimization (Core) · Problem 3 of 7
Implement the GRPO group-normalized advantage and the clipped token-level objective. [DeepSeek]
Implement the function/class skeleton in the editor. Any correct approach is accepted.
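For reference, the two quantities named in the task can be written out as follows. This is a sketch of the commonly cited outcome-reward form of GRPO, not necessarily the exact variant the grader expects. The symbols $G$ (group size), $r_i$ (scalar reward of the $i$-th sampled completion $o_i$ for prompt $q$), and $\rho_{i,t}$ (token-level probability ratio) are notation introduced here. The KL penalty toward a reference policy is omitted because the skeleton takes no reference log-probabilities.

```latex
% Group-normalized advantage: one scalar per completion, shared by all of its tokens
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

% Token-level probability ratio between the new and old policies
\rho_{i,t} = \exp\!\big( \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) - \log \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t}) \big)

% Clipped surrogate objective (maximized); the training loss is its negative
J(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \min\!\Big( \rho_{i,t}\, \hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\, 1 - \varepsilon,\, 1 + \varepsilon)\, \hat{A}_i \Big)
```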
import torch
import torch.nn.functional as F


def grpo_advantage(rewards):
    raise NotImplementedError


def grpo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    raise NotImplementedError
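One possible way to fill in the skeleton is sketched below. It rests on several assumptions that the problem statement does not confirm: `rewards` holds one group of sampled completions along its last axis, `logp_new`, `logp_old`, and `mask` have shape `(batch, seq_len)`, `advantages` is either per-sequence `(batch,)` or already per-token, the standard deviation is the population (biased) estimate, and the loss averages over tokens within each sequence before averaging over sequences. The `std_eps` parameter is a hypothetical addition for numerical stability and is not part of the original signature. No KL penalty is included.

```python
import torch


def grpo_advantage(rewards, std_eps=1e-8):
    # rewards: (..., G) float tensor; the last axis is one group of G completions.
    # std_eps is a hypothetical extra argument guarding against zero variance.
    mean = rewards.mean(dim=-1, keepdim=True)
    # Assumption: population std (unbiased=False), which stays finite when G == 1.
    std = rewards.std(dim=-1, keepdim=True, unbiased=False)
    return (rewards - mean) / (std + std_eps)


def grpo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    # Broadcast per-sequence advantages (batch,) across the token axis.
    if advantages.dim() == logp_new.dim() - 1:
        advantages = advantages.unsqueeze(-1)
    mask = mask.to(logp_new.dtype)

    # Token-level ratio pi_new / pi_old; the old policy is treated as a constant.
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (elementwise minimum) surrogate, as in PPO-style clipping.
    per_token = torch.min(unclipped, clipped)

    # Assumption: masked mean over tokens per sequence, then mean over sequences.
    per_seq = (per_token * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)
    # Negate so that minimizing the loss maximizes the surrogate objective.
    return -per_seq.mean()
```

A different aggregation, such as a single masked mean over all tokens in the batch, is also seen in practice; swapping it in only changes the last two lines of `grpo_loss`.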