Tracker & Eval Pro (v0.43.0)

`--tracker` allowlist

Closed allowlist: wandb | tensorboard | mlflow | swanlab | trackio | none.

bash
soup train --tracker mlflow

Mutually exclusive with the legacy --wandb / --tensorboard flags; the conflict is enforced in resolve_report_to.
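A minimal sketch of how a closed allowlist plus a mutual-exclusion check could look. The helper name `resolve_tracker`, its signature, and its return shape are assumptions for illustration; they are not the actual `resolve_report_to` implementation.

```python
from typing import List, Optional

# Closed allowlist, copied from the list above.
TRACKERS = frozenset({"wandb", "tensorboard", "mlflow", "swanlab", "trackio", "none"})


def resolve_tracker(
    tracker: Optional[str], wandb: bool = False, tensorboard: bool = False
) -> List[str]:
    """Hypothetical helper: validate --tracker and reject mixing with legacy flags."""
    legacy = [name for name, on in (("wandb", wandb), ("tensorboard", tensorboard)) if on]
    if tracker is None:
        # No --tracker given: fall back to whatever legacy flags were set.
        return legacy or ["none"]
    if tracker not in TRACKERS:
        raise ValueError(f"unknown tracker {tracker!r}; expected one of {sorted(TRACKERS)}")
    if legacy:
        raise ValueError("--tracker cannot be combined with --wandb / --tensorboard")
    return [tracker]
```

Rejecting unknown names up front means a typo such as `--tracker mlfow` fails before training starts rather than silently logging nowhere.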

PostHog telemetry (opt-IN)

Off by default. Setting SOUP_TELEMETRY=1 enables a hardware-info-only schema (soup_version / command / python / os / arch / duration). v0.43.0 contains no network code; the PostHog wire-up lands in v0.43.1.
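As an illustration of the opt-in gate, the sketch below builds an event with the six listed fields only when SOUP_TELEMETRY=1 is set. The function name `build_event` and the way each field is sourced are assumptions, not the shipped code; nothing is sent over the network.

```python
import os
import platform
from typing import Optional


def build_event(command: str, duration_s: float, soup_version: str) -> Optional[dict]:
    """Hypothetical builder: return the event payload, or None when telemetry is off."""
    # Opt-in: any value other than "1" (including unset) disables telemetry.
    if os.environ.get("SOUP_TELEMETRY") != "1":
        return None
    return {
        "soup_version": soup_version,
        "command": command,
        "python": platform.python_version(),
        "os": platform.system(),
        "arch": platform.machine(),
        "duration": duration_s,
    }
```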

NLG metrics

Pure-Python BLEU + ROUGE-1/2/L + effective_tokens_per_second. Closed allowlist NLG_METRICS = frozenset({"bleu","rouge_1","rouge_2","rouge_l"}).
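For readers unfamiliar with the metrics, here is a rough pure-Python sketch of ROUGE-N and ROUGE-L as F1 scores. Whitespace tokenization, the F1 variant, and the function names are assumptions; the actual implementation may tokenize or aggregate differently.

```python
from collections import Counter


def _ngrams(tokens, n):
    # Multiset of n-grams; empty when the text is shorter than n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, cand_total, ref_total):
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    cand, ref = _ngrams(candidate.split(), n), _ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate: str, reference: str) -> float:
    a, b = candidate.split(), reference.split()
    prev = [0] * (len(b) + 1)  # rolling row of the LCS table
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(a), len(b))
```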

KL-divergence calibration

bash
soup eval calibrate --before old.json --after new.json

Classifies the KL delta as OK / MINOR / MAJOR at 0.05 / 0.20 thresholds (mirrors quant-check).
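The thresholding can be sketched as follows. This assumes the two inputs reduce to discrete probability distributions of equal length and that each threshold is the inclusive lower bound of the next tier; the real JSON schema and the boundary handling are not specified here.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def classify(kl: float, minor: float = 0.05, major: float = 0.20) -> str:
    # Assumed tiers: [0, 0.05) OK, [0.05, 0.20) MINOR, [0.20, inf) MAJOR.
    if kl < minor:
        return "OK"
    if kl < major:
        return "MINOR"
    return "MAJOR"
```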

Model Arena (Elo)

bash
soup eval arena add chat-llama@v1 chat-llama@v2

K=32 default. 256-model cap, 1M-match cap. Tournament.ratings returns MappingProxyType to prevent mutation.
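A compact sketch of the Elo update with a read-only ratings view. The class name `SketchTournament`, the 1000-point starting rating, the `record` method, and raising `ValueError` at the caps are all assumptions for illustration; only K=32, the two caps, and the `MappingProxyType` return come from the notes above.

```python
from types import MappingProxyType

MAX_MODELS = 256
MAX_MATCHES = 1_000_000


class SketchTournament:
    def __init__(self, k: float = 32.0, initial: float = 1000.0):
        self._k, self._initial = k, initial
        self._ratings: dict = {}
        self._matches = 0

    @property
    def ratings(self):
        # Read-only live view: callers cannot assign into it.
        return MappingProxyType(self._ratings)

    def record(self, winner: str, loser: str) -> None:
        new_models = {winner, loser} - self._ratings.keys()
        if len(self._ratings) + len(new_models) > MAX_MODELS:
            raise ValueError("model cap reached")
        if self._matches >= MAX_MATCHES:
            raise ValueError("match cap reached")
        ra = self._ratings.setdefault(winner, self._initial)
        rb = self._ratings.setdefault(loser, self._initial)
        expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = self._k * (1.0 - expected_win)  # winner scored 1
        self._ratings[winner] = ra + delta
        self._ratings[loser] = rb - delta
        self._matches += 1
```

With two unrated models, the first result moves each rating by K/2, since the expected score is 0.5.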

New benchmarks

ceval, cmmlu, aider_polyglot (the live Aider Polyglot runner lands in v0.43.1).

Profiling helpers

  • memory_snapshot_context — CUDA torch.cuda.memory._record_memory_history wrapper
  • detect_anomaly_context — torch.autograd.set_detect_anomaly wrapper
  • nccl_bandwidth_check — reference table for h100 / a100 / v100 / rtx40-series (OK ≥80% / MINOR ≥50% / MAJOR <50%)
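The bandwidth tiers above amount to a ratio check against a per-GPU reference value. In the sketch below the reference figure is a caller-supplied parameter, because the actual table values are not listed here; the function name is hypothetical.

```python
def classify_bandwidth(measured_gbps: float, reference_gbps: float) -> str:
    """Hypothetical helper: compare measured bandwidth to a reference figure."""
    if reference_gbps <= 0:
        raise ValueError("reference bandwidth must be positive")
    ratio = measured_gbps / reference_gbps
    if ratio >= 0.80:
        return "OK"
    if ratio >= 0.50:
        return "MINOR"
    return "MAJOR"
```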

VS Code launch.json writer

bash
soup vscode init

Writes .vscode/launch.json with cwd containment + symlink TOCTOU guard.
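One way such a guard could be structured is sketched below: check that the target resolves inside the working directory, refuse symlinked path components, and open with `O_NOFOLLOW` where the platform provides it so a symlink swapped in after the check is not followed. The function names and the payload are assumptions, not the tool's real code.

```python
import os
from pathlib import Path


def safe_launch_path(cwd: Path) -> Path:
    root = cwd.resolve()
    vscode_dir = root / ".vscode"
    target = vscode_dir / "launch.json"
    if vscode_dir.is_symlink() or target.is_symlink():
        raise ValueError("refusing to write through a symlink")
    # Containment: the resolved target must stay under cwd (Python 3.9+).
    if not target.resolve().is_relative_to(root):
        raise ValueError("target escapes the working directory")
    return target


def write_launch_json(cwd: Path, payload: str) -> Path:
    target = safe_launch_path(cwd)
    target.parent.mkdir(exist_ok=True)
    # O_NOFOLLOW (POSIX) makes open() fail if the final component is a symlink,
    # narrowing the window between the check above and the write.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_NOFOLLOW", 0)
    fd = os.open(target, flags, 0o644)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(payload)
    return target
```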

soup data demo

4-bundle frozen registry: alpaca_demo / sharegpt_demo / dpo_demo / grpo_demo. Atomic copy via sibling temp file + os.replace.

bash
soup data demo alpaca_demo --output ./train.jsonl
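The sibling-temp-file pattern mentioned above can be sketched like this. `atomic_copy` is a hypothetical name; the point is that the temp file lives in the destination's directory, so `os.replace` stays on one filesystem and readers never see a half-written file.

```python
import os
import shutil
import tempfile
from pathlib import Path


def atomic_copy(src, dst) -> None:
    dst = Path(dst)
    # Create the temp file next to dst so the final rename is same-filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dst.parent, prefix=dst.name + ".", suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp_name)
        os.replace(tmp_name, dst)  # atomic swap; overwrites an existing dst
    except BaseException:
        # Clean up the temp file if the copy or the rename failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```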