Training Stability & Auto-Tuning (v0.32.0)
Pre-flight tuning plus in-training stability safety nets. All flags are opt-in.
LR Range Finder
Run a fast.ai-style geometric LR sweep before the real training run. Soup writes a JSON report with the recommended LR, the loss curve, and the divergence point.
```bash
soup train --config soup.yaml \
  --find-lr \
  --find-lr-start 1e-7 \
  --find-lr-end 1e-1 \
  --find-lr-steps 100 \
  --find-lr-output ./lr_finder.json
```

The report contains the geometric `lrs[]`, raw and EMA-smoothed `losses[]`, the recommended LR (steepest negative gradient before divergence), the LR with the minimum loss, and the divergence point, if any.
> v0.33.0: `--find-lr` now runs an in-process LR-sweep training loop. A NaN/Inf loss terminates the sweep early so `diverged_at` stays honest.
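As a rough guide to how the recommendation falls out of the report, here is a minimal sketch. The key names `lrs` and `losses_smoothed` and the helper `recommend_lr` are assumptions about the schema, not soup's actual API (`diverged_at` is the field named above):

```python
import json

import numpy as np

def recommend_lr(report_path: str) -> float:
    """Sketch: pick the LR at the steepest negative loss gradient
    before divergence. Key names are assumptions, not soup's schema."""
    with open(report_path) as f:
        report = json.load(f)

    lrs = np.array(report["lrs"])
    losses = np.array(report["losses_smoothed"])  # EMA-smoothed curve

    # Ignore everything at or past the divergence point, if one was found.
    end = report.get("diverged_at") or len(lrs)
    lrs, losses = lrs[:end], losses[:end]

    # Gradient of loss w.r.t. log(lr): the steepest descent marks the
    # region where the model is learning fastest without diverging.
    grads = np.gradient(losses, np.log(lrs))
    return float(lrs[np.argmin(grads)])

print(recommend_lr("./lr_finder.json"))
```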
Auto warmup schedule
```yaml
training:
  warmup_auto: true
  warmup_ratio: 0.03   # 3% of total update steps (default)
```

The computed warmup length is clamped to [10, 1000] steps, so tiny datasets still get some warmup and huge datasets don't waste half a million steps warming up.
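The clamping rule is simple enough to state as code; `auto_warmup_steps` is an illustrative name, not a soup API:

```python
def auto_warmup_steps(total_update_steps: int, warmup_ratio: float = 0.03) -> int:
    """Illustrative sketch of the documented rule: warmup is a fixed
    ratio of total update steps, clamped to [10, 1000]."""
    return min(1000, max(10, int(total_update_steps * warmup_ratio)))

assert auto_warmup_steps(100) == 10           # tiny run still warms up
assert auto_warmup_steps(20_000) == 600       # 3% of 20k update steps
assert auto_warmup_steps(10_000_000) == 1000  # huge run is capped
```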
Auto mixed-precision
```yaml
training:
  auto_mixed_precision: true
```

Picks bf16 on Ampere or newer, fp16 on Turing or on models known to be fp16-stable (Qwen2 / Qwen2.5 / Phi-3 / Phi-3.5), and disables mixed precision on pre-Pascal GPUs. Multi-version model-name pairs match the longest substring deterministically.
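A minimal sketch of the selection policy as described, assuming CUDA compute capability is the generation signal (8.x = Ampere+, 7.5 = Turing, below 6.0 = pre-Pascal); `pick_precision` and the tag list are illustrative, not soup's code:

```python
import torch

# Ordered longest-first to mirror the documented deterministic matching.
FP16_STABLE = ("qwen2.5", "qwen2", "phi-3.5", "phi-3")

def pick_precision(model_name: str) -> str:
    """Sketch of the documented policy; thresholds map compute
    capability to GPU generations (8.x = Ampere+, 7.5 = Turing)."""
    major, minor = torch.cuda.get_device_capability()
    if major >= 8:   # Ampere or newer: native bf16 support
        return "bf16"
    if major < 6:    # pre-Pascal: no reliable mixed precision
        return "no"
    name = model_name.lower()
    if (major, minor) >= (7, 5) or any(tag in name for tag in FP16_STABLE):
        return "fp16"
    return "no"
```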
Loss spike auto-recovery
Extends the loss watchdog: instead of stopping on a spike, it decays the LR and resumes.
```yaml
training:
  loss_watchdog: true   # required
  loss_spike_recovery: true
  loss_spike_recovery_max_attempts: 3
  loss_spike_recovery_lr_decay: 0.5
```

Capped at 3 attempts by default. Spike recovery writes a `spike_recovery.json` hint with the decayed LR for re-launch.
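A sketch of one recovery step under the stated behavior; the function name and the JSON field names are assumptions:

```python
import json

def on_loss_spike(lr: float, attempt: int,
                  max_attempts: int = 3, decay: float = 0.5) -> float | None:
    """Sketch of the documented behavior: decay the LR on each spike,
    give up after max_attempts, and persist a re-launch hint.
    The JSON field names are assumptions."""
    if attempt >= max_attempts:
        return None  # out of attempts: fall back to the plain watchdog stop
    new_lr = lr * decay
    with open("spike_recovery.json", "w") as f:
        json.dump({"attempt": attempt + 1, "decayed_lr": new_lr}, f)
    return new_lr
```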
Convergence detector
```yaml
training:
  convergence_detection: true
  convergence_window: 50
  convergence_rel_tol: 0.005
```

Surfaces `continue` / `early_stop` / `lower_lr` advice based on the loss curve.
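The exact heuristic isn't documented; one plausible reading is a windowed relative-improvement check, sketched below with hypothetical names:

```python
def convergence_advice(losses: list[float],
                       window: int = 50, rel_tol: float = 0.005) -> str:
    """Sketch of a windowed relative-improvement check; the exact rule
    soup applies is not documented, so this is an assumption."""
    if len(losses) < 2 * window:
        return "continue"
    prev = sum(losses[-2 * window:-window]) / window
    curr = sum(losses[-window:]) / window
    rel_improvement = (prev - curr) / max(abs(prev), 1e-12)
    if rel_improvement < 0:
        return "lower_lr"    # loss is creeping back up
    if rel_improvement < rel_tol:
        return "early_stop"  # flat within tolerance
    return "continue"
```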
VRAM pressure advisory
```yaml
training:
  grad_accum_auto_tune: true
  grad_accum_pressure_threshold: 0.92
```

Records peak memory each step. When pressure crosses the threshold, recommends a new (batch, accum) pair preserving the effective batch size (capped at accum=1024).
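A sketch of the rebalancing arithmetic, assuming peak memory scales roughly with per-device batch size; `retune` is an illustrative name, not soup's API:

```python
def retune(batch: int, accum: int, pressure: float,
           threshold: float = 0.92, max_accum: int = 1024) -> tuple[int, int]:
    """Sketch: halve the per-device batch and double accumulation while
    pressure is over the threshold, preserving effective batch size."""
    while pressure > threshold and batch > 1 and accum < max_accum:
        batch //= 2
        accum *= 2
        pressure /= 2  # rough assumption: peak memory scales with batch size
    return batch, accum

# e.g. (16, 4) at 0.97 pressure -> (8, 8): effective batch 64 preserved
print(retune(16, 4, pressure=0.97))
```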