RLHF, RL & Preference Optimization (Core) · Problem 4 of 7
Implement a minimal PPO update for LLMs: ratio, clipped surrogate, value loss, per-token KL penalty to a reference. [OpenAI]
Implement the function/class skeleton in the editor. Any correct approach is accepted.
import torch
import torch.nn.functional as F


def ppo_step(logp_new, logp_old, logp_ref, values_new, returns, advantages, mask, eps=0.2, vf_coef=0.5, kl_coef=0.1):
    raise NotImplementedError


def _m(x, mask):
    raise NotImplementedError