Chapter 09: RLHF, RL & Preference Optimization (Core)

Hard · Problem 4 / 7

Implement a minimal PPO update for LLMs: ratio, clipped surrogate, value loss, per-token KL penalty to a reference. [OpenAI]

Implement the function/class skeleton in the editor. Any correct approach is accepted.

solution.py

import torch
import torch.nn.functional as F

def ppo_step(logp_new, logp_old, logp_ref, values_new, returns, advantages, mask, eps=0.2, vf_coef=0.5, kl_coef=0.1):
    raise NotImplementedError

def _m(x, mask):
    raise NotImplementedError
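
One possible way to fill in the skeleton is sketched below. It is not an official solution and rests on several assumptions that the problem statement does not fix: all inputs are per-token tensors of the same shape (for example batch × sequence length), `mask` is 1 on response tokens and 0 elsewhere, `_m` is a masked mean, the KL penalty is added to the loss as the simple sampled estimate `logp_new - logp_ref` (many RLHF pipelines fold it into the reward instead), and the function returns the scalar loss together with a dictionary of detached statistics.

```python
import torch


def _m(x, mask):
    # Assumed meaning of the helper: mean of x over positions where mask == 1.
    # The clamp avoids a 0/0 division if a batch contains no response tokens.
    return (x * mask).sum() / mask.sum().clamp(min=1)


def ppo_step(logp_new, logp_old, logp_ref, values_new, returns, advantages,
             mask, eps=0.2, vf_coef=0.5, kl_coef=0.1):
    mask = mask.to(logp_new.dtype)

    # Importance ratio pi_new / pi_old, computed from log-probabilities.
    ratio = torch.exp(logp_new - logp_old.detach())

    # Clipped surrogate: take the pessimistic (smaller) of the two objectives,
    # then negate because optimizers minimize.
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    policy_loss = -_m(torch.min(unclipped, clipped), mask)

    # Value loss: squared error against the (fixed) return targets.
    value_loss = 0.5 * _m((values_new - returns.detach()) ** 2, mask)

    # Per-token KL penalty to the reference policy.
    # Assumption: the sampled estimate log pi_new - log pi_ref, used as a loss term.
    kl = _m(logp_new - logp_ref.detach(), mask)

    loss = policy_loss + vf_coef * value_loss + kl_coef * kl
    stats = {
        "policy_loss": policy_loss.detach(),
        "value_loss": value_loss.detach(),
        "kl": kl.detach(),
        # Fraction of response tokens whose ratio fell outside [1 - eps, 1 + eps].
        "clip_frac": _m(((ratio - 1.0).abs() > eps).to(mask.dtype), mask).detach(),
    }
    return loss, stats
```

Calling `loss.backward()` on the returned loss sends gradients only through `logp_new` and `values_new`; the old-policy and reference log-probabilities, the returns, and the advantages are treated as constants. Masked positions contribute nothing to the loss, so padding and prompt tokens receive zero gradient.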
