# Training Intelligence
v0.25.0 adds two training-time subsystems that no other CLI ships: catastrophic forgetting detection and checkpoint intelligence.
## Catastrophic forgetting detection
Runs a mini benchmark on the base model *before* training, then repeats it every N steps on the current checkpoint. If general-knowledge accuracy drops more than your threshold, the Rich dashboard turns yellow; cross the red line and you can auto-stop training.
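The warn/stop decision reduces to comparing the accuracy drop against the configured threshold. A minimal sketch of that classification — the "red line" at 2× the warn threshold is an illustrative assumption, not soup's actual cut-off:

```python
def forgetting_status(baseline_acc: float, current_acc: float,
                      threshold: float = 0.10) -> str:
    """Classify the drop from the pre-training baseline accuracy.

    Assumed tiers: drop > threshold -> dashboard turns yellow;
    drop > 2 * threshold -> red, eligible for auto-stop.
    """
    drop = baseline_acc - current_acc
    if drop > 2 * threshold:
        return "red"      # severe forgetting: auto-stop if forgetting_stop: true
    if drop > threshold:
        return "yellow"   # warn: dashboard highlights the regression
    return "green"
```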
### Config

```yaml
training:
  forgetting_detection: true
  forgetting_eval_steps: 100
  forgetting_threshold: 0.10       # warn if accuracy drops >10%
  forgetting_benchmark: mini_mmlu  # mini_mmlu | mini_common_sense | mini_instruction
  forgetting_stop: false           # auto-stop on severe forgetting
```

### Built-in benchmarks
Each is a 100-question set embedded in the source (no external downloads):
- `mini_mmlu` — diverse MMLU coverage (STEM / humanities / social sciences)
- `mini_common_sense` — common-sense reasoning
- `mini_instruction` — instruction-following quality
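Scoring an embedded benchmark is a plain accuracy computation over the question set. The item shape and names below are assumptions for illustration, not soup's internal format:

```python
from typing import Callable, Dict, List

# Illustrative shape of an embedded multiple-choice item (gold answer is an index).
Question = Dict[str, object]

def benchmark_accuracy(answer_fn: Callable[[Question], int],
                       questions: List[Question]) -> float:
    """Fraction of questions where the model picks the gold choice index."""
    correct = sum(1 for q in questions if answer_fn(q) == q["answer"])
    return correct / len(questions)
```

Running this once on the base model yields the baseline; re-running it on each checkpoint yields the delta the dashboard reports.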
## Checkpoint intelligence
HF Trainer's "best checkpoint" is simply the one with the lowest loss. But lower loss ≠ better model — overfitted checkpoints reach low loss with poor real-world quality. Checkpoint intelligence runs a quality eval *during* training and tags `best_quality` separately from `best_loss`.
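Tracking the two "bests" independently is the core of the idea. A minimal sketch (field names are assumptions, not soup's actual internals):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BestCheckpoints:
    """Track best-by-loss and best-by-quality checkpoints separately."""
    best_loss_step: Optional[int] = None
    best_loss: float = float("inf")
    best_quality_step: Optional[int] = None
    best_quality: float = float("-inf")

    def update(self, step: int, loss: float, quality: float) -> None:
        if loss < self.best_loss:                 # lower loss is better
            self.best_loss, self.best_loss_step = loss, step
        if quality > self.best_quality:           # higher quality score is better
            self.best_quality, self.best_quality_step = quality, step
```

The two can diverge, exactly as in the dashboard example below: a later step wins on loss while an earlier one keeps the quality crown.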
### Config

```yaml
training:
  checkpoint_intelligence: true
  checkpoint_eval_steps: 200
  checkpoint_eval_metric: composite  # judge | mmlu | custom | composite
  checkpoint_eval_tasks: eval.jsonl  # optional custom eval
  checkpoint_keep_top: 3             # delete the rest
  early_stop_on_regression: true
  early_stop_patience: 2
```

### Dashboard
```
Epoch 2/3 ████████ loss: 0.89 step 450/720
→ Loss best: step-450 (loss 0.89)
→ Quality best: step-300 (judge 8.2/10) ⭐
→ Gen knowledge: 91.2% (baseline 95.0%, -3.8% ⚠)
→ Last eval: +0.4 judge points (improving)
```

After training, the `best_quality` checkpoint is linked at `./output/best_quality/`.
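`early_stop_on_regression` with a patience of 2 plausibly means: stop once two consecutive quality evals fail to beat the running best. That semantics is an assumption; a sketch under it:

```python
from typing import List

def should_stop(scores: List[float], patience: int = 2) -> bool:
    """True once `patience` consecutive evals fail to beat the running best.

    `scores` is the history of per-eval quality scores (higher is better).
    """
    if not scores:
        return False
    # index of the (first) best score so far
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    evals_since_best = len(scores) - 1 - best_idx
    return evals_since_best >= patience
```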
## Storage
Both subsystems extend the SQLite experiment tracker (~/.soup/experiments.db) with two new tables:
- `checkpoint_quality(run_id, step, metric, score, is_best, created_at)`
- `forgetting_eval(run_id, step, benchmark, accuracy, baseline, delta, warning_level)`
Inspect them with `soup runs show <run_id>` or query the database directly.
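Direct queries are ordinary `sqlite3`. A sketch against an in-memory copy of the `checkpoint_quality` table documented above (the query itself is illustrative):

```python
import sqlite3
from typing import Optional, Tuple

def best_quality_checkpoint(conn: sqlite3.Connection,
                            run_id: str) -> Optional[Tuple[int, float]]:
    """Return (step, score) of the row flagged is_best for a run, if any."""
    return conn.execute(
        "SELECT step, score FROM checkpoint_quality "
        "WHERE run_id = ? AND is_best = 1 "
        "ORDER BY created_at DESC LIMIT 1",
        (run_id,),
    ).fetchone()
```

The same pattern works against `~/.soup/experiments.db` by connecting to that path instead of `:memory:`.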
## Safety
- Benchmark data is embedded in code — no external file loading at runtime.
- Eval intervals are bounded (10 ≤ steps ≤ 10,000) to prevent runaway eval overhead.
- Pruning only deletes files inside the run's `output_dir` and never follows symlinks outside it.
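The last two guards are easy to picture concretely. A sketch of both — the helper names are hypothetical, only the stated bounds and the no-symlink-escape rule come from the docs above:

```python
from pathlib import Path

def clamp_eval_steps(steps: int) -> int:
    """Bound eval intervals to 10 <= steps <= 10_000."""
    return max(10, min(steps, 10_000))

def safe_to_delete(candidate: Path, output_dir: Path) -> bool:
    """Allow deletion only if the path resolves inside output_dir.

    resolve() follows symlinks and collapses '..', so a link pointing
    outside the run directory fails this check.
    """
    return candidate.resolve().is_relative_to(output_dir.resolve())
```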
## Defaults
Autopilot turns both subsystems on by default. If you hand-write `soup.yaml`, flip the flags above.