# GRPO Plus (v0.50.0)

22 new features, bringing GRPO parity with Unsloth and Axolotl.

## 7 GRPO variants

```yaml
task: grpo
training:
  grpo_variant: gspo   # standard | gspo | dapo | dr_grpo | bnpo | two_sided | rft
  grpo_delta: 0.5      # only for two_sided; valid range (0, 1]
  grpo_fp16: true      # explicit FP16 mixed precision
```
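As a rough illustration of where the variants differ, standard GRPO normalizes group rewards by the group's standard deviation, while Dr. GRPO keeps only the mean-centering. A minimal sketch (function and names are illustrative, not the trainer's internals):

```python
# Illustrative sketch of group-relative advantages; not the trainer's actual code.
from statistics import mean, stdev

def grpo_advantages(rewards: list[float], variant: str = "standard") -> list[float]:
    """Per-sample advantages relative to the rollout group."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if variant == "dr_grpo":
        # Dr. GRPO: mean-centering only, no std normalization.
        return centered
    # Standard GRPO: also divide by the group's reward std (epsilon guards zero std).
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [c / (sigma + 1e-6) for c in centered]

grpo_advantages([1.0, 0.0, 0.0, 1.0], "dr_grpo")  # [0.5, -0.5, -0.5, 0.5]
```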

## Long-context & vision GRPO

```yaml
training:
  long_context_grpo: true   # extends rope scaling into rollouts
  vision_grpo: true         # multimodal GRPO
  vllm_sleep_mode: true     # free vLLM memory between rollouts
```
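The sleep-mode idea: the inference engine releases GPU memory after the rollout phase so the optimizer step has room, then restores state before the next rollout. A minimal sketch of the pattern, with a stub engine standing in for vLLM (method names are illustrative, not vLLM's API):

```python
# Sketch of the sleep-mode pattern: free inference memory while training runs.
# DummyEngine is a stand-in for an inference engine; names are illustrative.
class DummyEngine:
    def __init__(self):
        self.awake = True
    def sleep(self):       # offload/free KV cache and weights
        self.awake = False
    def wake_up(self):     # reload state before the next rollout phase
        self.awake = True
    def generate(self, prompts):
        assert self.awake, "engine must be awake to generate"
        return [p + " ..." for p in prompts]

def grpo_step(engine, prompts, train_fn):
    rollouts = engine.generate(prompts)   # 1) rollout phase
    engine.sleep()                        # 2) release GPU memory to the trainer
    loss = train_fn(rollouts)             # 3) optimization phase
    engine.wake_up()                      # 4) restore engine for the next step
    return loss
```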

## Async rollout backends

```yaml
training:
  rollout_backend: openenv   # art | ruler | nemo_gym | openenv
  async_grpo_prefetch: true
```
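Async prefetch overlaps rollout generation with optimization: a producer keeps the next batches queued while the trainer consumes the current one. A self-contained sketch of that producer/consumer pattern (not the trainer's actual implementation):

```python
# Sketch of async rollout prefetch: a background thread generates the next
# rollout batches while the trainer consumes the current one.
import queue
import threading

def prefetching_rollouts(generate_batch, num_batches, depth=2):
    q = queue.Queue(maxsize=depth)       # bounded: at most `depth` batches ahead
    def producer():
        for i in range(num_batches):
            q.put(generate_batch(i))     # blocks once the queue is full
        q.put(None)                      # sentinel: no more batches
    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not None:
        yield batch

# The trainer iterates batches as they become ready:
batches = list(prefetching_rollouts(lambda i: [f"rollout-{i}"], num_batches=3))
```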

- `art` — Anthropic Research Tool
- `ruler` — long-horizon judging
- `nemo_gym` — NVIDIA Gym
- `openenv` — multi-turn agent environments

## Reference-model controls

```yaml
training:
  ref_model_ema_alpha: 0.99         # EMA-style reference-model refresh
  replay_buffer_size: 50000         # rollout replay buffer capacity
  tis_threshold: 1.5                # truncated importance sampling cap
  mask_truncated_completions: true  # exclude truncated completions from the loss
  defer_rerolling: true
  skip_zero_advantage: true         # skip samples whose advantage is zero
  off_policy_mask_threshold: 0.7
```
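Two of these controls are simple enough to sketch: the EMA reference refresh blends the policy into the reference model each update, and truncated importance sampling (TIS) caps the policy/behavior likelihood ratio to bound variance. A hedged sketch with illustrative parameter names:

```python
# Illustrative sketches of two reference-model controls; not the trainer's code.
import math

def ema_update(ref_params, policy_params, alpha=0.99):
    """EMA refresh: ref <- alpha * ref + (1 - alpha) * policy, per parameter."""
    return [alpha * r + (1 - alpha) * p for r, p in zip(ref_params, policy_params)]

def tis_ratio(logp_policy, logp_behavior, threshold=1.5):
    """Importance ratio pi/mu, truncated at `threshold` (cf. tis_threshold)."""
    return min(math.exp(logp_policy - logp_behavior), threshold)
```

A higher `alpha` means the reference model trails the policy more slowly; `threshold` bounds how much any single off-policy sample can be up-weighted.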

## New task: prm

`task: prm` trains a Process Reward Model (PRM), wired through the standard reward-model trainer. Compatibility gates for vision and GRPO are checked.
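For context on what a PRM scores: it assigns a reward to each intermediate reasoning step rather than a single outcome reward for the whole completion. A minimal sketch, where `score_step` stands in for the trained reward head and the min-aggregation is one common (assumed, not confirmed here) way to collapse step rewards into a trajectory score:

```python
# Sketch of process-level vs outcome-level rewards; names are illustrative.
def process_rewards(steps, score_step):
    """Score each intermediate reasoning step independently."""
    return [score_step(s) for s in steps]

def outcome_from_process(step_rewards):
    # One common aggregation: a trajectory is only as good as its weakest step.
    return min(step_rewards)

steps = ["2 + 2 = 4", "therefore 4 * 3 = 12", "answer: twelve"]
rewards = process_rewards(steps, lambda s: 1.0 if "=" in s else 0.0)
```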

All v0.50.0 features ship as schema-only; live trainer wiring lands in v0.50.1.