Chapter 09: RLHF, RL & Preference Optimization (Core)

Super-hard · Problem 7 / 7

Implement DAPO on a toy RLVR task: clip-higher, dynamic sampling (drop all-correct/all-wrong groups), token-level loss, and overlong reward shaping. [NVIDIA]

Implement the function/class skeleton in the editor. Any correct approach is accepted.

solution.py (Python)

import torch

def overlong_shaping(reward, length, soft_cap, hard_cap, penalty=1.0):
    raise NotImplementedError

def dapo_loss(logp_new, logp_old, rewards, lengths, mask, G, eps_low=0.2, eps_high=0.28, soft_cap=4, hard_cap=8):
    raise NotImplementedError
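
One possible way to fill in the skeleton is sketched below. It is a minimal illustration, not a reference solution, and it rests on several assumptions the problem statement does not spell out: `logp_new`, `logp_old`, and `mask` are taken to have shape `(B*G, T)` with the `G` responses to each prompt stored contiguously; `rewards` and `lengths` have shape `(B*G,)`; "dynamic sampling" is reduced to masking out groups whose raw rewards are all identical (a full pipeline would resample to refill the batch); the overlong penalty is a linear ramp from 0 at `soft_cap` to `penalty` at `hard_cap`; and advantages are group-normalized shaped rewards with a small `1e-6` stabilizer.

```python
import torch

def overlong_shaping(reward, length, soft_cap, hard_cap, penalty=1.0):
    # Assumed shaping: no penalty up to soft_cap, then a linear ramp that
    # reaches `penalty` at hard_cap and stays there for longer responses.
    if hard_cap <= soft_cap:
        raise ValueError("hard_cap must be greater than soft_cap")
    reward = torch.as_tensor(reward, dtype=torch.float32)
    length = torch.as_tensor(length, dtype=torch.float32)
    excess = (length - soft_cap) / (hard_cap - soft_cap)
    return reward - penalty * excess.clamp(0.0, 1.0)

def dapo_loss(logp_new, logp_old, rewards, lengths, mask, G,
              eps_low=0.2, eps_high=0.28, soft_cap=4, hard_cap=8):
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    mask = mask.to(logp_new.dtype)

    # Dynamic sampling (toy version): drop groups where every raw reward is
    # the same, i.e. all-correct or all-wrong, since they carry no signal.
    raw = rewards.reshape(-1, G)
    keep = raw.max(dim=1).values != raw.min(dim=1).values      # (B,)

    # Group-relative advantages computed from the length-shaped rewards.
    shaped = overlong_shaping(rewards, lengths, soft_cap, hard_cap).reshape(-1, G)
    mean = shaped.mean(dim=1, keepdim=True)
    std = shaped.std(dim=1, keepdim=True, unbiased=False)
    adv = ((shaped - mean) / (std + 1e-6)).reshape(-1, 1)      # (B*G, 1)

    # Clip-higher: asymmetric clipping range [1 - eps_low, 1 + eps_high].
    ratio = torch.exp(logp_new - logp_old.detach())            # (B*G, T)
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high)
    per_token = -torch.minimum(ratio * adv, clipped * adv)

    # Token-level loss: average over all valid tokens of all kept responses,
    # rather than averaging per response first.
    keep_rows = keep.repeat_interleave(G).unsqueeze(1).to(mask.dtype)
    tok_mask = mask * keep_rows
    denom = tok_mask.sum()
    if denom == 0:
        # Every group was filtered out; return a zero that stays in the graph.
        return logp_new.sum() * 0.0
    return (per_token * tok_mask).sum() / denom
```

Filtering on raw rewards while normalizing shaped rewards is a design choice made here for illustration; filtering on the shaped rewards instead would also be defensible, and the problem statement accepts any correct approach.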
