Reasoning & Test-Time Compute · Problem 4 of 4
Implement a GRPO loop with rule-based rewards on a toy arithmetic task: reward a correct boxed answer, plus a format reward for well-formed <think>/<answer> tags. [DeepSeek]
Implement the function/class skeleton in the editor. Any correct approach is accepted.
import numpy as np

def make_actions(target):
    raise NotImplementedError

def reward(parsed, target, well_formed):
    raise NotImplementedError

def softmax(z):
    raise NotImplementedError
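One possible way to fill in the skeleton is sketched here; it is an illustration under stated assumptions, not a reference solution. The problem does not say what the arguments mean, so this sketch assumes each "completion" is a pair (parsed, well_formed), that a correct answer earns 1.0 and well-formed tags earn an extra 0.2, and that the policy is a plain softmax over one logit per completion. The name grpo_train and all hyperparameters are hypothetical. The group-relative advantage is computed as (reward - group mean) / (group std + eps); because each group is used for a single on-policy update, the ratio clipping and KL penalty of full GRPO are omitted.

```python
import numpy as np

def make_actions(target):
    """Enumerate toy completions as (parsed_answer, well_formed) pairs (assumed encoding)."""
    return [(target + d, wf) for d in (-1, 0, 1) for wf in (False, True)]

def reward(parsed, target, well_formed):
    """Rule-based reward: 1.0 for a correct answer, plus 0.2 for well-formed tags (assumed weights)."""
    return (1.0 if parsed == target else 0.0) + (0.2 if well_formed else 0.0)

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def grpo_train(target=7, steps=300, group_size=8, lr=0.5, seed=0):
    """Hypothetical GRPO-style loop: sample a group, normalise rewards, ascend the policy gradient."""
    rng = np.random.default_rng(seed)
    actions = make_actions(target)
    logits = np.zeros(len(actions))  # uniform initial policy
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a group of completions from the current policy.
        idx = rng.choice(len(actions), size=group_size, p=probs)
        rewards = np.array([reward(actions[i][0], target, actions[i][1]) for i in idx])
        # Group-relative advantage: standardise rewards within the group.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # For a softmax policy, d log pi(a) / d logits = onehot(a) - probs.
        grad = np.zeros_like(logits)
        for i, a in zip(idx, adv):
            onehot = np.zeros_like(logits)
            onehot[i] = 1.0
            grad += a * (onehot - probs)
        logits += lr * grad / group_size
    return actions, softmax(logits)
```

Calling grpo_train() returns the action list and the final policy probabilities, so probs[actions.index((7, True))] gives the probability assigned to the correct, well-formed completion. Standardising within the group means a group whose rewards are all equal contributes essentially no update.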