Chapter 11: Reasoning & Test-Time Compute

Super-hard · Problem 4 / 4

Implement a GRPO loop with a rule-based reward on a toy arithmetic task: the reward combines a correctness term for a correct boxed answer with a format term for well-formed <think>/<answer> tags. [DeepSeek]

Implement the function/class skeleton in the editor. Any correct approach is accepted.
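
The rule-based reward in the statement can be sketched on raw text as follows. This is one possible reading, not the reference solution: the tag layout, the `\boxed{...}` integer convention, the helper names `parse_completion` and `rule_reward`, and the `format_weight` of 0.5 are all assumptions made for illustration.

```python
import re

# Assumed layout (not given in the problem statement): reasoning inside
# <think>...</think>, followed by <answer>...</answer> holding a \boxed{...} integer.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{\s*(-?\d+)\s*\}")


def parse_completion(text):
    """Return (parsed, well_formed) for one sampled completion (hypothetical helper)."""
    well_formed = FORMAT_RE.fullmatch(text.strip()) is not None
    boxed = BOXED_RE.findall(text)
    parsed = int(boxed[-1]) if boxed else None  # the last boxed value wins
    return parsed, well_formed


def rule_reward(text, target, format_weight=0.5):
    """Correctness term (0 or 1) plus a weighted format term (assumed weight)."""
    parsed, well_formed = parse_completion(text)
    accuracy = 1.0 if parsed == target else 0.0
    return accuracy + format_weight * (1.0 if well_formed else 0.0)
```

Under these assumptions, a completion with both tags and the right boxed integer scores 1.5, a bare correct `\boxed{...}` scores 1.0, and a well-formed but wrong completion scores 0.5. The `(parsed, well_formed)` pair matches the arguments of the `reward` function in the skeleton.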

Hints


solution.py (Python)

import numpy as np

def make_actions(target):
    raise NotImplementedError

def reward(parsed, target, well_formed):
    raise NotImplementedError

def softmax(z):
    raise NotImplementedError
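
One way the skeleton might be filled in is sketched below, purely as an illustration since any correct approach is accepted. The assumptions are: the "policy" is a softmax over four hypothetical actions, each encoded as `(parsed_answer, well_formed)`; the format weight is 0.5; and `grpo_train` with its `group_size`, `lr`, and `steps` defaults is an invented driver. Because each group is sampled from the current policy and used for a single update, the GRPO probability ratio equals 1, so the clipped objective reduces to a policy-gradient step with group-normalized advantages. The KL penalty to a reference policy is omitted here for brevity.

```python
import numpy as np


def make_actions(target):
    # Assumed toy action set: (parsed_answer, well_formed) pairs standing in
    # for sampled completions. Index 0 is the only fully correct action.
    return [(target, True), (target, False), (target + 1, True), (target + 1, False)]


def reward(parsed, target, well_formed, format_weight=0.5):
    # Rule-based reward: correctness term plus a weighted format term.
    return float(parsed == target) + format_weight * float(well_formed)


def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def grpo_train(target=7, steps=300, group_size=8, lr=0.5, seed=0):
    """Hypothetical driver: on-policy GRPO-style updates on a softmax bandit."""
    rng = np.random.default_rng(seed)
    actions = make_actions(target)
    logits = np.zeros(len(actions))
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one group of completions (action indices) from the policy.
        idx = rng.choice(len(actions), size=group_size, p=probs)
        r = np.array([reward(actions[i][0], target, actions[i][1]) for i in idx])
        # Group-relative advantage: standardize rewards within the group.
        adv = (r - r.mean()) / (r.std() + 1e-8)
        # Gradient of sum_j adv_j * log pi(a_j) with respect to the logits:
        # sum_j adv_j * (onehot(a_j) - probs).
        grad = -probs * adv.sum()
        np.add.at(grad, idx, adv)
        logits += lr * grad / group_size
    return softmax(logits), actions
```

Calling `probs, actions = grpo_train()` returns the final action probabilities. In this setup the action with the correct answer and the correct format has the highest reward in every mixed group, so its logit never decreases and probability mass is expected to concentrate on it. When every sample in a group earns the same reward, the advantages are all zero and the step leaves the policy unchanged.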

