Loop Hardening (v0.70.0)

Six surfaces that protect the training loop from failure modes that burn real GPU-hours. Mostly schema-only today: each Trainer callback validates its inputs and then raises NotImplementedError with an explicit v0.70.1 marker. The exceptions are the math kernels that don't need a Trainer hook (cluster separation, RM-ensemble divergence, echo-trap n-gram repetition), which are all LIVE.

`--reward-hack-detector` — InfoRM + RM-ensemble divergence

bash
soup train --task grpo --base-model meta-llama/Llama-3.1-8B \
  --reward-model registry://rm-v1 \
  --reward-hack-detector info_rm \
  --reward-hack-halt

Two detectors:

  • info_rm: InfoRM Cluster-Separation Index (Miao et al. 2024, [arXiv 2402.09345](https://arxiv.org/abs/2402.09345)). Drops when the policy collapses onto a degenerate reward-maximising subspace.
  • rm_ensemble: mean pairwise variance across an RM ensemble (capped at 32 members). When ensemble members disagree, the policy is likely exploiting one of them.
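As a rough illustration of the rm_ensemble signal, the sketch below takes one plausible reading of "variance across an RM ensemble": the population variance of member scores per sample, averaged over the batch. The function name, the `MAX_ENSEMBLE_SIZE` constant, and the input layout are assumptions for illustration, not the signature of Soup's compute_rm_ensemble_divergence.

```python
from statistics import pvariance

MAX_ENSEMBLE_SIZE = 32  # cap quoted above; the constant name is hypothetical


def rm_ensemble_divergence(scores: list[list[float]]) -> float:
    """Mean per-sample variance of rewards across ensemble members.

    scores[m][i] is the reward that member m assigns to sample i.
    """
    if not 2 <= len(scores) <= MAX_ENSEMBLE_SIZE:
        raise ValueError("ensemble needs between 2 and 32 members")
    n = len(scores[0])
    if n == 0 or any(len(row) != n for row in scores):
        raise ValueError("every member must score the same non-empty batch")
    # Variance across members for each sample, then the batch mean.
    per_sample = [pvariance([row[i] for row in scores]) for i in range(n)]
    return sum(per_sample) / n
```

Members that agree on every sample give a divergence of zero; the value grows as individual reward models pull apart on the same rollouts.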

The math kernels compute_cluster_separation, compute_rm_ensemble_divergence, and classify_hack_signal are LIVE, with OK / WARN / HACK bands at relative drops of 0.10 and 0.30. --reward-hack-halt stops the run on HACK (exit code 2). The cross-validator allows task in {grpo, ppo} only, requires a detector when halt=True, and rejects the mlx backend. Composes with v0.34 soup why for anomaly explanation. The live HF Trainer callback ships in v0.70.1.
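The banding can be pictured with a small sketch. It assumes the signal is compared against a positive baseline and that each threshold is inclusive; both are assumptions, as are the names `relative_drop` and `classify_drop`, which are not the real classify_hack_signal signature.

```python
WARN_DROP = 0.10  # thresholds quoted above
HACK_DROP = 0.30


def relative_drop(baseline: float, current: float) -> float:
    """Fraction by which `current` has fallen below `baseline` (never negative)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - current) / baseline)


def classify_drop(baseline: float, current: float) -> str:
    """Map a relative drop onto the OK / WARN / HACK bands."""
    drop = relative_drop(baseline, current)
    if drop >= HACK_DROP:
        return "HACK"
    if drop >= WARN_DROP:
        return "WARN"
    return "OK"
```

With this shape, a halt flag only has to check for the "HACK" label and exit with the documented code.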

`--uld-strategy` — cross-tokenizer Universal Logit Distillation

yaml
# soup.yaml
task: distill
training:
  uld_strategy: wasserstein   # or: topk_align
  uld_top_k: 32               # required for topk_align

Boizard et al. 2024 ([arXiv 2402.12030](https://arxiv.org/abs/2402.12030)). Llama → Mistral, Llama → Qwen — no shared vocabulary required.

  • wasserstein — 1-D Wasserstein distance over sorted teacher / student logits, no alignment (cheap, robust default)
  • topk_align — top-K teacher logits matched via BPE-overlap heuristic alignment (use when you have a good vocab-overlap heuristic and want sharper signal)
  • _MAX_VOCAB_SIZE=262144 covers multilingual SentencePiece + GPT-OSS 200K vocabularies
  • Gated to task='distill' and rejects mlx backend
  • Live projection module ships in v0.70.1
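To make the wasserstein strategy concrete, here is a minimal pure-Python sketch for a single token position. It follows the common formulation in which both distributions are turned into probabilities, sorted in descending order, and zero-padded to the same width before taking the absolute difference. Whether Soup's kernel sorts raw logits or probabilities is not stated here, so treat the softmax step and the function names as assumptions.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sorted_wasserstein(teacher_logits: list[float], student_logits: list[float]) -> float:
    """1-D Wasserstein-style distance between sorted probability vectors.

    Sorting removes any dependence on token identity, so the two
    vocabularies do not need to be aligned or even the same size.
    """
    t = sorted(softmax(teacher_logits), reverse=True)
    s = sorted(softmax(student_logits), reverse=True)
    width = max(len(t), len(s))
    t += [0.0] * (width - len(t))  # pad the smaller vocabulary with zeros
    s += [0.0] * (width - len(s))
    return sum(abs(a - b) for a, b in zip(t, s))
```

Because only the shape of each distribution matters, permuting either vocabulary leaves the distance unchanged, which is what makes the strategy usable across tokenizers.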

`--minillm-enabled` — reverse-KL with 3 stability tricks bundled

yaml
task: distill
training:
  minillm_enabled: true
  minillm_teacher_mix_ratio: 0.3
  minillm_length_normalize: true
  minillm_pretrain_anchor_weight: 0.1
  minillm_pretrain_anchor_path: ./pretrain.jsonl

Gu et al. 2024 ([arXiv 2306.08543](https://arxiv.org/abs/2306.08543)). All three §3 stability tricks bundled: teacher-mixed sampling (mix teacher samples into the on-policy rollout), length normalisation (per-token KL averaged), pretrain-loss anchor (regularise toward an anchor distribution at weight α).

Cross-validators reject silent no-ops:

  • anchor_weight=0 with anchor_path set → error
  • anchor_weight > 0 with path = None → error

Gated to task='distill'. Live callback ships in v0.70.1.
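The cross-validation rules above can be sketched as a frozen dataclass. The field names come from the YAML example; the class name, the defaults, and the decision to skip validation when MiniLLM is disabled are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MiniLLMConfigSketch:
    """Hypothetical stand-in for the MiniLLM fields of the training schema."""

    task: str
    minillm_enabled: bool = False
    minillm_teacher_mix_ratio: float = 0.0       # default is an assumption
    minillm_length_normalize: bool = True        # default is an assumption
    minillm_pretrain_anchor_weight: float = 0.0  # default is an assumption
    minillm_pretrain_anchor_path: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.minillm_enabled:
            return
        if self.task != "distill":
            raise ValueError("minillm_enabled requires task='distill'")
        weight = self.minillm_pretrain_anchor_weight
        path = self.minillm_pretrain_anchor_path
        if weight < 0:
            raise ValueError("anchor weight must be non-negative")
        # Either combination would leave the anchor silently inactive.
        if weight == 0 and path is not None:
            raise ValueError("anchor path set but anchor weight is 0")
        if weight > 0 and path is None:
            raise ValueError("anchor weight > 0 but no anchor path given")
```

Failing at construction time means a misconfigured anchor is reported before any GPU time is spent.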

`--rl-checkpoint-save-every-steps` — mid-epoch PPO/GRPO ckpt

bash
soup train --task ppo --base-model ... \
  --rl-checkpoint-save-every-steps 200 \
  --rl-checkpoint-keep-last 4 \
  --rl-checkpoint-include-optimizer \
  --rl-checkpoint-include-ref-model \
  --rl-checkpoint-include-rollout-buffer

TorchTune explicitly punts on mid-epoch checkpointing. Soup ships the schema today, with bounds save_every_steps ∈ [1, 10M] and keep_last ∈ [1, 100] (the oldest checkpoints are pruned first).

Composes with v0.32 spike recovery and v0.40 reference-model regen: after a PPO crash, recovery hops to the most recent mid-epoch ckpt instead of restarting the epoch. Live save_state / load_state ships in v0.70.1.
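The save cadence and retention policy can be sketched in a few lines. The bounds are the ones quoted above; the function names and the choice to identify checkpoints by step number are illustrative assumptions, not Soup's save_state / load_state API.

```python
MAX_SAVE_EVERY = 10_000_000  # "10M" upper bound quoted above
MAX_KEEP_LAST = 100


def should_save(step: int, save_every_steps: int) -> bool:
    """True on every save_every_steps-th optimizer step."""
    if not 1 <= save_every_steps <= MAX_SAVE_EVERY:
        raise ValueError("save_every_steps out of bounds")
    return step > 0 and step % save_every_steps == 0


def prune_checkpoints(saved_steps: list[int], keep_last: int) -> tuple[list[int], list[int]]:
    """Split saved checkpoint steps into (kept, pruned), dropping the oldest first."""
    if not 1 <= keep_last <= MAX_KEEP_LAST:
        raise ValueError("keep_last out of bounds")
    ordered = sorted(saved_steps)
    return ordered[-keep_last:], ordered[:-keep_last]
```

With the CLI example's values (save every 200 steps, keep the last 4), the fifth save at step 1000 would prune the checkpoint from step 200.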

`soup iterative-dpo` — sample → score → re-pair → retrain driver

bash
soup iterative-dpo --base-model registry://policy-v3 \
  --reward-model registry://rm-v1 \
  --prompts ./prompts.jsonl \
  --output-dir ./iter-dpo \
  --rounds 4 --pairs-per-round 4000

Frozen IterativeDPOPlan with a consecutive-`round_index` invariant and canonical per-round artifacts:

./iter-dpo/round-01/pairs.jsonl
./iter-dpo/round-01/adapter/
./iter-dpo/round-02/pairs.jsonl
./iter-dpo/round-02/adapter/
...

This fixed layout lets a crashed run resume cleanly. --plan-only renders the validated plan and exits 0; the live runner (which runs soup train --task dpo --resume as a subprocess between rounds) ships in v0.70.1.
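A frozen plan with a consecutive-round invariant might look like the sketch below. The directory layout mirrors the listing above; the class names, the `build_plan` helper, and the assumption that rounds are numbered from 1 are hypothetical and not the real IterativeDPOPlan definition.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RoundPlan:
    round_index: int
    pairs_path: str
    adapter_dir: str


@dataclass(frozen=True)
class IterativeDPOPlanSketch:
    output_dir: str
    rounds: tuple[RoundPlan, ...]

    def __post_init__(self) -> None:
        # Invariant: round indices are exactly 1, 2, ..., N with no gaps.
        expected = list(range(1, len(self.rounds) + 1))
        actual = [r.round_index for r in self.rounds]
        if not actual or actual != expected:
            raise ValueError(f"round_index must be consecutive from 1, got {actual}")


def build_plan(output_dir: str, num_rounds: int) -> IterativeDPOPlanSketch:
    """Derive the canonical per-round artifact paths under output_dir."""
    root = PurePosixPath(output_dir)
    rounds = tuple(
        RoundPlan(
            round_index=i,
            pairs_path=str(root / f"round-{i:02d}" / "pairs.jsonl"),
            adapter_dir=str(root / f"round-{i:02d}" / "adapter"),
        )
        for i in range(1, num_rounds + 1)
    )
    return IterativeDPOPlanSketch(output_dir=output_dir, rounds=rounds)
```

Because every path is derived from the round index alone, a resumed run can work out which rounds already have artifacts on disk without extra state.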

`--echo-trap-enabled` — RAGEN multi-turn n-gram repetition detector

bash
soup train --task grpo ... \
  --echo-trap-enabled \
  --echo-trap-threshold 0.6 \
  --echo-trap-halt

Zhu et al. 2025 ([arXiv 2504.14437](https://arxiv.org/abs/2504.14437)). A pure-Python kernel computes the n-gram repetition rate per trajectory plus a batch mean. When an agent's rollout collapses into "echoing itself" (the same n-gram pattern appearing repeatedly within and across turns), the detector catches it before the reward model rewards the degenerate policy.

OK / WARN / TRAP bands at repetition rates of 0.30 and 0.60. DoS caps: _MAX_NGRAM_N=32, _MAX_TRAJECTORY_TOKENS=1M, _MAX_BATCH_TRAJECTORIES=100k. Gated to task in {grpo, ppo} on non-mlx backends. Composes with v0.53.11 GRPOStabilityCallback. Live callback ships in v0.70.1; math kernel is LIVE.
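One simple way to define the repetition rate is the fraction of n-gram positions that repeat an n-gram already seen in the same trajectory. The sketch below uses that definition with the bands quoted above. The function names, the default n=3, and the inclusive thresholds are assumptions, and the real kernel may count repetition differently.

```python
from collections import Counter

MAX_NGRAM_N = 32  # mirrors the _MAX_NGRAM_N cap quoted above


def ngram_repetition_rate(tokens: list[int], n: int = 3) -> float:
    """Fraction of n-gram positions that duplicate an earlier n-gram."""
    if not 1 <= n <= MAX_NGRAM_N:
        raise ValueError("n out of bounds")
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0  # trajectory too short to contain a single n-gram
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    return (total - len(counts)) / total


def classify_batch(trajectories: list[list[int]], n: int = 3,
                   warn: float = 0.30, trap: float = 0.60) -> tuple[float, str]:
    """Return the batch-mean repetition rate and its OK / WARN / TRAP band."""
    if not trajectories:
        raise ValueError("batch must contain at least one trajectory")
    mean = sum(ngram_repetition_rate(t, n) for t in trajectories) / len(trajectories)
    band = "TRAP" if mean >= trap else "WARN" if mean >= warn else "OK"
    return mean, band
```

A trajectory of all-distinct tokens scores 0.0, while one that repeats a single token scores close to 1.0, so the batch mean rises as more rollouts start echoing.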

Numbers

+337 tests in v0.70.0 (11,487 → 11,824) across 6 new test files plus follow-up boundary tests. Combined v0.68 → v0.70 net: +803 tests across 17 new test files (251 → 268).

See also

  • [Adapter lifecycle (v0.67)](/docs/adapter-lifecycle) — soup adapters bisect finds which mid-epoch ckpt regressed.
  • [Anti-trend insurance (v0.68)](/docs/anti-trend-insurance) — soup distill-prompt + ULD pair up to bridge tokeniser gaps.
  • [Soup Loop (v0.58)](/docs/soup-loop) — iterative-DPO runs inside a soup loop iteration.