Training Stability & Auto-Tuning (v0.32.0)

Pre-flight tuning + in-training stability nets. All flags are opt-in.

LR Range Finder

Run a fast.ai-style geometric LR sweep before the real training run. Soup writes a JSON report with the recommended LR, the loss curve, and the divergence point.

bash
soup train --config soup.yaml \
  --find-lr \
  --find-lr-start 1e-7 \
  --find-lr-end 1e-1 \
  --find-lr-steps 100 \
  --find-lr-output ./lr_finder.json

The report contains the geometric lrs[], raw + EMA-smoothed losses[], the recommended LR (steepest negative gradient before divergence), the LR with min loss, and the divergence point if any.

> v0.33.0: --find-lr now runs an in-process LR-sweep training loop. NaN/Inf loss terminates the sweep early so diverged_at is honest.

Auto warmup schedule

yaml
training:
  warmup_auto: true
  warmup_ratio: 0.03   # 3% of total update steps (default)

Clamped to [10, 1000] so tiny datasets get some warmup and huge datasets don't burn half a million wasted steps.

Auto mixed-precision

yaml
training:
  auto_mixed_precision: true

Picks bf16 on Ampere+, fp16 on Turing or known fp16-stable models (Qwen2 / Qwen2.5 / Phi-3 / Phi-3.5), no on pre-Pascal. Multi-version pairs match the longest substring deterministically.

Loss spike auto-recovery

Extends the watchdog: instead of stopping on a spike, decay LR and resume.

yaml
training:
  loss_watchdog: true                   # required
  loss_spike_recovery: true
  loss_spike_recovery_max_attempts: 3
  loss_spike_recovery_lr_decay: 0.5

Capped at 3 attempts by default. Spike recovery writes a spike_recovery.json hint with the decayed LR for re-launch.

Convergence detector

yaml
training:
  convergence_detection: true
  convergence_window: 50
  convergence_rel_tol: 0.005

Surfaces continue / early_stop / lower_lr advice based on the loss curve.

VRAM pressure advisory

yaml
training:
  grad_accum_auto_tune: true
  grad_accum_pressure_threshold: 0.92

Records peak memory each step. When pressure crosses the threshold, recommends a new (batch, accum) pair preserving effective batch (capped at accum=1024).