# RLVR — Reinforcement Learning from Verifiable Rewards

v0.25.0 adds deterministic reward signals for GRPO training on math, code, and JSON-schema tasks. No reward model, no human labels.

## Enable it

```yaml
base: Qwen/Qwen3-8B
task: grpo

data:
  train: math_problems.jsonl
  format: chatml

training:
  reward_fn: verifiable
  verifiable_domain: math   # math | code | json_schema
  lr: 1e-5
  epochs: 1
  lora: { r: 16, alpha: 32 }
```

## Built-in reward functions

| Domain | What it does | Sandbox |
|---|---|---|
| `math` | Regex-extracts the final answer, compares numerically with tolerance | pure Python |
| `code` | Executes Python against expected outputs | subprocess, 5 s timeout, no network, restricted builtins, 10 KB output cap |
| `json_schema` | Validates output against a JSON Schema and scores completeness | pure Python |
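As a rough illustration of the `math` row above, here is a minimal sketch assuming regex answer extraction and a `math.isclose` comparison; the actual implementation in `soup_cli/trainer/rewards.py` may differ in its regex, tolerance handling, and reward scale.

```python
import math
import re

# Hedged sketch of a math verifiable reward, NOT the shipped
# implementation: take the last number in the completion and
# compare it to the ground truth within a relative tolerance.
ANSWER_RE = re.compile(r"-?\d+(?:\.\d+)?")

def math_reward(completion: str, answer: str, tol: float = 1e-4) -> float:
    """Return 1.0 if the final extracted number matches, else 0.0."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0  # no parsable number in the completion -> zero reward
    try:
        predicted = float(matches[-1])
        expected = float(answer)
    except ValueError:
        return 0.0
    return 1.0 if math.isclose(predicted, expected, rel_tol=tol) else 0.0
```

A binary 0/1 signal like this is deterministic and cheap, which is what makes it usable as a GRPO reward without a learned reward model.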

All three live in `soup_cli/trainer/rewards.py` and are routed through the existing GRPO trainer.
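The `json_schema` completeness scoring can be sketched in the same spirit. The required-key check below is an assumption for illustration; the real scorer in `soup_cli/trainer/rewards.py` validates against a full JSON Schema.

```python
import json

# Hedged sketch of json_schema-style scoring: parse the model
# output, then score completeness as the fraction of required
# keys that are present with the expected JSON type.
def json_reward(completion: str, required: dict) -> float:
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # unparsable output earns nothing
    if not isinstance(obj, dict):
        return 0.0
    hits = sum(
        1 for key, typ in required.items()
        if key in obj and isinstance(obj[key], typ)
    )
    return hits / len(required)  # completeness score in [0, 1]
```

Unlike the binary math reward, a fractional score gives the policy gradient a smoother signal for partially correct structured output.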

## Generate training data

```bash
soup data generate --template verifiable --domain math --count 500
```

The `verifiable` template in `soup_cli/data/templates/verifiable.py` emits problems with ground-truth answers you can verify at training time.
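The output is plain JSONL, one example per line. Purely as a hypothetical illustration (the field names below are assumptions, not the template's actual schema), a record might look like:

```python
import json

# Hypothetical shape of one emitted training example. The keys
# "prompt" and "answer" are assumptions for illustration only.
example = {"prompt": "What is 17 * 23?", "answer": "391"}
line = json.dumps(example)  # one line of the generated .jsonl file
```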

## Safety

- `code_exec` runs each completion in a short-lived subprocess with no network access and a restricted builtin set.
- `math_verify` never calls `eval()` on model output; answers are extracted by regex.
- `verifiable_domain` is a Pydantic `Literal`, so arbitrary strings can't reach the dispatcher.
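The execution safeguards can be sketched roughly as follows. This is a minimal illustration of the timeout and output-cap mechanics only, assuming a fresh `python -I` child process; the network isolation and restricted builtin set described above are not shown here.

```python
import subprocess
import sys
from typing import Optional

OUTPUT_CAP = 10 * 1024  # cap captured stdout at 10 KB

def run_candidate(code: str, timeout: float = 5.0) -> Optional[str]:
    """Run `code` in a short-lived isolated interpreter; None on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            timeout=timeout,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return None  # hangs are treated as failed completions
    if proc.returncode != 0:
        return None  # crashes and exceptions earn no reward
    return proc.stdout[:OUTPUT_CAP]

def code_reward(code: str, expected: str) -> float:
    """Binary reward: does the program print the expected output?"""
    out = run_candidate(code)
    return 1.0 if out is not None and out.strip() == expected.strip() else 0.0
```

Running each candidate in its own process means a crashing or hanging completion can never take down the trainer itself.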

## See also

- [Training methods](/docs/training) — GRPO
- [Autopilot](/docs/autopilot) — --goal reasoning picks RLVR when your data has ground truth