Super-hard

Implement a rule-based-reward GRPO loop on a toy arithmetic task rewarding a correct boxed

Chapter 11: Reasoning & Test-Time Compute

Super-hard · Problem 4 / 4

Implement a GRPO loop with a rule-based reward on a toy arithmetic task. The reward combines an accuracy term for a correct boxed answer with a format term for well-formed <think>/<answer> tags. [DeepSeek]
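One way the rule-based reward could look is sketched below. The problem statement does not fix the reward weights, the exact tag layout, or the boxed-answer syntax, so `FORMAT_REWARD`, `ANSWER_REWARD`, the `\boxed{...}` convention, and the function name `rule_based_reward` are all assumptions made for illustration. The two terms are treated as independent and additive here, which is one reasonable reading of the task rather than the only one.

```python
import re

# Assumed weights: the problem statement does not specify them.
FORMAT_REWARD = 0.2
ANSWER_REWARD = 1.0

# Assumed format: the whole completion is <think>...</think><answer>...</answer>.
FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)
# Assumed answer syntax: LaTeX-style \boxed{...} with no nested braces.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def rule_based_reward(completion: str, target: int) -> float:
    """Return format reward plus accuracy reward for one completion."""
    reward = 0.0
    # Format term: the tags must wrap the entire completion, in order.
    if FORMAT_RE.fullmatch(completion):
        reward += FORMAT_REWARD
    # Accuracy term: the last boxed value must equal the target exactly.
    boxed = BOXED_RE.findall(completion)
    if boxed and boxed[-1].strip() == str(target):
        reward += ANSWER_REWARD
    return reward


if __name__ == "__main__":
    sample = "<think>2+3=5</think><answer>\\boxed{5}</answer>"
    print(rule_based_reward(sample, target=5))
```

Because both terms are pure string checks, the reward needs no learned reward model. An alternative design would grant the accuracy term only when the format check also passes.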

Implement the function/class skeleton in the editor. Any correct approach is accepted.
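As a starting point, here is a minimal sketch of the loop under strong simplifying assumptions. A tabular softmax policy over a fixed, hypothetical list of candidate completions stands in for a language model. The reward weights (0.2 and 1.0) are assumed, and the KL penalty to a reference policy that GRPO normally includes is omitted. What the sketch does keep is the group-relative advantage (rewards normalized by the group mean and standard deviation) and the clipped importance-ratio surrogate. All names (`grpo_train`, `group_advantages`, `CANDIDATES`) are illustrative.

```python
import math
import random
import re
import statistics

FORMAT_RE = re.compile(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

# Hypothetical completions for the prompt "2+3"; a real policy would generate text.
CANDIDATES = [
    "<think>2+3=5</think><answer>\\boxed{5}</answer>",  # correct, well formatted
    "<think>2+3=6</think><answer>\\boxed{6}</answer>",  # well formatted only
    "\\boxed{5}",                                        # correct only
    "6",                                                 # neither
]


def reward_fn(completion, target):
    """Assumed weights: 0.2 for format, 1.0 for a correct boxed answer."""
    fmt = 0.2 if FORMAT_RE.fullmatch(completion) else 0.0
    boxed = BOXED_RE.findall(completion)
    acc = 1.0 if boxed and boxed[-1].strip() == str(target) else 0.0
    return fmt + acc


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_train(candidates, target, steps=200, group_size=8,
               inner_epochs=2, lr=0.5, clip=0.2, seed=0):
    rng = random.Random(seed)
    n = len(candidates)
    logits = [0.0] * n  # tabular policy parameters
    for _ in range(steps):
        old_probs = softmax(logits)  # frozen sampling policy for this step
        group = rng.choices(range(n), weights=old_probs, k=group_size)
        rewards = [reward_fn(candidates[i], target) for i in group]
        advs = group_advantages(rewards)
        for _ in range(inner_epochs):
            probs = softmax(logits)
            grad = [0.0] * n
            for i, adv in zip(group, advs):
                ratio = probs[i] / old_probs[i]
                # Clipped surrogate: no gradient once the ratio has moved past
                # the clip range in the direction the advantage favors.
                if (adv > 0 and ratio > 1 + clip) or (adv < 0 and ratio < 1 - clip):
                    continue
                # d(ratio)/d(logit_j) = ratio * (1[j == i] - probs[j])
                for j in range(n):
                    indicator = 1.0 if j == i else 0.0
                    grad[j] += adv * ratio * (indicator - probs[j]) / group_size
            # Gradient ascent on the surrogate objective.
            logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    final_probs = grpo_train(CANDIDATES, target=5)
    for text, p in zip(CANDIDATES, final_probs):
        print(f"{p:.3f}  {text}")
```

When every completion in a group earns the same reward, all advantages are zero and the step leaves the policy unchanged, so the learning signal comes only from groups with mixed outcomes. Replacing the tabular policy with a real model would mean sampling text, summing token log-probabilities for the ratio, and adding the KL term back.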
