RLHF, RL & Preference Optimization (Core) · Problem 7 of 7
Implement DAPO on a toy RLVR task: clip-higher, dynamic sampling (drop all-correct/all-wrong groups), token-level loss, and overlong reward shaping. [NVIDIA]
Implement the function/class skeleton in the editor. Any correct approach is accepted.
import torch

def overlong_shaping(reward, length, soft_cap, hard_cap, penalty=1.0):
    raise NotImplementedError

def dapo_loss(logp_new, logp_old, rewards, lengths, mask, G, eps_low=0.2, eps_high=0.28, soft_cap=4, hard_cap=8):
    raise NotImplementedError
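One possible way to fill in the skeleton is sketched below. It is not a reference solution, and it rests on several assumptions the problem statement does not spell out: `logp_new`, `logp_old`, and `mask` have shape `(B, T)`; `rewards` and `lengths` have shape `(B,)`; `B` is a multiple of `G` and each prompt's `G` samples are stored contiguously; `hard_cap > soft_cap`; the length penalty is subtracted from the reward as a linear ramp between the two caps; dynamic sampling filters on the raw (unshaped) rewards; and advantages are normalized with the population standard deviation plus a `1e-6` stabilizer.

```python
import torch


def overlong_shaping(reward, length, soft_cap, hard_cap, penalty=1.0):
    # Assumes hard_cap > soft_cap. No penalty up to soft_cap, a linear ramp
    # between the caps, and the full penalty at or beyond hard_cap.
    reward = torch.as_tensor(reward, dtype=torch.float32)
    length = torch.as_tensor(length, dtype=torch.float32)
    ramp = ((length - soft_cap) / (hard_cap - soft_cap)).clamp(0.0, 1.0)
    return reward - penalty * ramp


def dapo_loss(logp_new, logp_old, rewards, lengths, mask, G,
              eps_low=0.2, eps_high=0.28, soft_cap=4, hard_cap=8):
    # Assumed layout: (B, T) token tensors, (B,) sequence tensors,
    # with the G samples of each prompt stored contiguously.
    mask = mask.to(logp_new.dtype)
    raw = torch.as_tensor(rewards, dtype=torch.float32).view(-1, G)

    # Dynamic sampling: keep only groups whose raw rewards are not all equal,
    # which drops all-correct and all-wrong groups.
    keep = raw.max(dim=1).values > raw.min(dim=1).values        # (B / G,)
    keep = keep.repeat_interleave(G).to(logp_new.dtype)         # (B,)

    # Overlong shaping, then group-normalized advantages.
    shaped = overlong_shaping(rewards, lengths, soft_cap, hard_cap).view(-1, G)
    mean = shaped.mean(dim=1, keepdim=True)
    std = ((shaped - mean) ** 2).mean(dim=1, keepdim=True).sqrt()
    adv = ((shaped - mean) / (std + 1e-6)).view(-1, 1)          # (B, 1)

    # Clip-higher: asymmetric clipping range [1 - eps_low, 1 + eps_high].
    ratio = torch.exp(logp_new - logp_old.detach())
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    # Token-level loss: average over all valid tokens of the kept sequences,
    # rather than averaging per sequence first.
    tok_mask = mask * keep.unsqueeze(1)
    return -(surrogate * tok_mask).sum() / tok_mask.sum().clamp(min=1.0)
```

If every group is filtered out, this sketch returns a zero loss instead of dividing by zero; a full training loop would typically resample prompts until the batch is full, which is outside the scope of the skeleton.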