Hard

Implement a minimal PPO update for LLMs: ratio, clipped surrogate, value loss, per-token KL penalty

RLHF, RL & Preference Optimization (Core) · Problem 4 of 7

Chapter 09: RLHF, RL & Preference Optimization (Core)

Implement a minimal PPO update for LLMs: ratio, clipped surrogate, value loss, per-token KL penalty to a reference. [OpenAI]

Implement the function/class skeleton in the editor. Any correct approach is accepted.
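
One possible shape for the four pieces named in the problem is sketched below. It is not the site's reference solution. It covers only the forward computation in NumPy, so that the arithmetic is easy to check; in a real training loop the same expressions would be written in an autograd framework, with `logp_new` and `values_new` coming from the current model and the total loss fed to an optimizer step. All function names and default hyperparameters are hypothetical. The sketch assumes right-padded batches of shape `(batch, tokens)`, a 0/1 `mask` over response tokens with at least one valid token per row, a scalar sequence reward added at the last response token, and the sampled estimate `logp_old - logp_ref` as the per-token KL term.

```python
import numpy as np

def masked_mean(x, mask):
    """Mean of x over positions where mask == 1."""
    return (x * mask).sum() / max(mask.sum(), 1.0)

def kl_shaped_rewards(logp_old, logp_ref, seq_reward, mask, kl_coef=0.1):
    """Per-token reward: KL penalty to the reference at every response token,
    plus the scalar sequence reward at the last response token."""
    kl = logp_old - logp_ref                      # sampled per-token KL estimate
    rewards = -kl_coef * kl * mask
    last = mask.sum(axis=1).astype(int) - 1       # assumes right padding
    rewards[np.arange(rewards.shape[0]), last] += seq_reward
    return rewards

def gae(rewards, values, mask, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over the token dimension.
    Returns (advantages, returns); padded positions get zero advantage."""
    batch, steps = rewards.shape
    adv = np.zeros_like(rewards)
    running = np.zeros(batch)
    for t in reversed(range(steps)):
        if t + 1 < steps:
            next_value = values[:, t + 1] * mask[:, t + 1]  # no bootstrap into padding
        else:
            next_value = np.zeros(batch)
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        running = (delta + gamma * lam * running) * mask[:, t]
        adv[:, t] = running
    return adv, adv + values

def ppo_losses(logp_new, logp_old, values_new, values_old, adv, returns, mask,
               clip_eps=0.2, vf_clip=0.2, vf_coef=0.5):
    """Clipped surrogate policy loss and clipped value loss.
    Returns (total_loss, policy_loss, value_loss)."""
    # Importance ratio pi_new / pi_old, computed per token in log space.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Pessimistic (min) objective, negated because we minimize.
    policy_loss = -masked_mean(np.minimum(unclipped, clipped), mask)

    # Value clipping keeps the new value prediction near the rollout-time one.
    v_clipped = values_old + np.clip(values_new - values_old, -vf_clip, vf_clip)
    v_err = np.maximum((values_new - returns) ** 2, (v_clipped - returns) ** 2)
    value_loss = 0.5 * masked_mean(v_err, mask)

    return policy_loss + vf_coef * value_loss, policy_loss, value_loss
```

A typical call order under these assumptions would be: compute `kl_shaped_rewards` once from the rollout log-probabilities, run `gae` on those rewards and the rollout-time values, then evaluate `ppo_losses` on each minibatch for a few epochs. Many implementations also normalize the advantages over the masked tokens before the policy loss; that step is left out here to keep the sketch short.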