Super-hard

Chapter 09: RLHF, RL & Preference Optimization (Core)

Problem 7 / 7

Implement DAPO on a toy RLVR task: clip-higher, dynamic sampling (drop all-correct/all-wrong groups), token-level loss, and overlong reward shaping. [NVIDIA]

Implement the function/class skeleton in the editor. Any correct approach is accepted.
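
As a starting point, the four components named in the problem can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not a reference solution: the function names (`overlong_penalty`, `keep_group`, `group_advantages`, `dapo_token_loss`) and array shapes are hypothetical, the clip defaults of 0.2 and 0.28 are common DAPO settings rather than values given in the problem, and the group-normalized advantage is the usual GRPO-style choice. A real solution would compute the loss in an autodiff framework so it can be backpropagated.

```python
import numpy as np

def overlong_penalty(length, max_len, cache_len):
    """Soft overlong shaping: 0 inside the budget, a linear ramp down to -1
    over the last `cache_len` tokens before `max_len`, and -1 beyond `max_len`."""
    safe_len = max_len - cache_len
    if length <= safe_len:
        return 0.0
    if length <= max_len:
        return (safe_len - length) / cache_len
    return -1.0

def keep_group(correct):
    """Dynamic sampling filter: keep a prompt's group only if it mixes correct
    and incorrect samples (all-correct or all-wrong groups carry no signal)."""
    correct = np.asarray(correct, dtype=bool)
    return bool(correct.any() and not correct.all())

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within one prompt's group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dapo_token_loss(logp_new, logp_old, advantages, mask,
                    eps_low=0.2, eps_high=0.28):
    """Clip-higher surrogate averaged over all response tokens in the batch.

    logp_new, logp_old: [N, T] per-token log-probs under new and old policies.
    advantages:         [N] one advantage per sampled response.
    mask:               [N, T] 1 for response tokens, 0 for padding.
    """
    ratio = np.exp(logp_new - logp_old)
    adv = np.asarray(advantages, dtype=float)[:, None]  # broadcast over tokens
    # Clip-higher: the upper bound (1 + eps_high) is looser than the lower one.
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = np.minimum(ratio * adv, clipped * adv)
    # Token-level loss: one global mean over tokens, not a mean of per-sample means.
    return -(per_token * mask).sum() / mask.sum()
```

For example, with identical old and new log-probs, advantages `[1, -1]`, and response lengths of 3 and 1 tokens, `dapo_token_loss` gives -0.5, whereas averaging per-sample means would give 0. That difference is the point of the token-level reduction: longer responses contribute proportionally more tokens to the gradient.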
