GRPO Plus (v0.50.0)
22 features. Unsloth + axolotl GRPO parity.
7 GRPO variants
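For orientation on what the variants change: the `gspo` choice, for example, corresponds to the published GSPO objective, which replaces per-token importance ratios with a sequence-level ratio (the geometric mean of token ratios) and uses group-normalized advantages. A minimal sketch under that reading; the function name and shapes are illustrative, not this project's API:

```python
import math

def gspo_surrogate(logp_new, logp_old, rewards, eps=0.2):
    """Sketch of a GSPO-style clipped surrogate for one rollout group.

    logp_new / logp_old: per-sequence lists of token log-probs under the
    current and rollout policies. rewards: one scalar per sequence.
    """
    # Group-normalized advantages: (r - mean) / std, guarding std == 0.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 if var > 0 else 1.0
    advs = [(r - mean) / std for r in rewards]

    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advs):
        # Sequence-level ratio: geometric mean of the token ratios.
        s = math.exp((sum(lp_new) - sum(lp_old)) / len(lp_new))
        clipped = min(max(s, 1 - eps), 1 + eps)
        total += min(s * adv, clipped * adv)  # PPO-style pessimistic clip
    return total / len(rewards)
```

The sequence-level ratio is what distinguishes `gspo` from `standard` token-level GRPO; the other variants alter normalization or clipping analogously.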
```yaml
task: grpo
training:
  grpo_variant: gspo   # standard | gspo | dapo | dr_grpo | bnpo | two_sided | rft
  grpo_delta: 0.5      # two_sided only; range (0, 1]
  grpo_fp16: true      # explicit FP16 mixed precision
```

Long-context & vision GRPO
```yaml
training:
  long_context_grpo: true   # extends RoPE scaling into rollouts
  vision_grpo: true         # multimodal GRPO
  vllm_sleep_mode: true     # free vLLM memory between rollouts
```

Async rollout backends
```yaml
training:
  rollout_backend: openenv   # art | ruler | nemo_gym | openenv
  async_grpo_prefetch: true
```

art — Anthropic Research Tool, ruler — long-horizon judging, nemo_gym — NVIDIA Gym, openenv — multi-turn agent envs.
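The point of `async_grpo_prefetch` is to overlap rollout generation with the optimizer step instead of alternating them. A minimal asyncio sketch of that overlap; `generate_rollouts` and `train_step` are stand-ins, not this project's API:

```python
import asyncio

async def generate_rollouts(batch_id: int) -> str:
    await asyncio.sleep(0.01)          # stand-in for a backend rollout call
    return f"rollouts-{batch_id}"

async def train_step(rollouts: str) -> None:
    await asyncio.sleep(0.01)          # stand-in for the optimizer step

async def train(num_batches: int = 3) -> list:
    seen = []
    # Kick off batch 0, then always prefetch batch i+1 while training on i.
    pending = asyncio.create_task(generate_rollouts(0))
    for i in range(num_batches):
        rollouts = await pending
        if i + 1 < num_batches:
            pending = asyncio.create_task(generate_rollouts(i + 1))
        await train_step(rollouts)     # batch i+1 generates concurrently
        seen.append(rollouts)
    return seen

if __name__ == "__main__":
    print(asyncio.run(train()))
```

With the prefetch, each training step hides one rollout's latency; without it, the two phases would simply alternate.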
Reference-model controls
```yaml
training:
  ref_model_ema_alpha: 0.99          # EMA-style reference-model refresh
  replay_buffer_size: 50000
  tis_threshold: 1.5                 # truncated importance sampling cap
  mask_truncated_completions: true
  defer_rerolling: true
  skip_zero_advantage: true
  off_policy_mask_threshold: 0.7
```

New task: prm
`task: prm` adds a Process Reward Model task, wired through the standard reward-model trainer. Gates: vision and GRPO compatibility are checked.
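A process reward model scores each intermediate reasoning step rather than only the final answer, so the per-step scores must be reduced to a single training signal. A toy sketch of that reduction; the aggregation choices here are common conventions, not this trainer's documented behavior:

```python
def prm_score(step_rewards: list, aggregate: str = "min") -> float:
    """Reduce per-step PRM scores to one sequence-level score."""
    if not step_rewards:
        raise ValueError("need at least one step score")
    if aggregate == "min":   # a chain is only as strong as its weakest step
        return min(step_rewards)
    if aggregate == "mean":
        return sum(step_rewards) / len(step_rewards)
    raise ValueError(f"unknown aggregate: {aggregate}")
```

For example, `prm_score([0.9, 0.8, 0.2])` returns `0.2`: one weak step sinks the whole chain under `min` aggregation.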
All v0.50.0 features ship as schema-only; live trainer wiring lands in v0.50.1.
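On the reference-model knobs above: `ref_model_ema_alpha` suggests an exponential-moving-average refresh of the reference weights, and `tis_threshold` a cap on importance ratios. A sketch of both mechanics under that reading; names are illustrative, and the live wiring only lands in v0.50.1:

```python
import math

def ema_ref_update(ref, policy, alpha=0.99):
    """EMA-style reference refresh: ref <- alpha*ref + (1-alpha)*policy."""
    return [alpha * r + (1 - alpha) * p for r, p in zip(ref, policy)]

def truncated_is_weight(logp_policy, logp_behavior, threshold=1.5):
    """Truncated importance sampling: cap the ratio to bound variance
    when training on stale (off-policy) rollouts."""
    return min(math.exp(logp_policy - logp_behavior), threshold)
```

A high alpha keeps the reference close to its old weights (slow drift), while the truncation keeps a single stale sample from dominating the gradient.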