Soup Cans v2 + Live LR Finder (v0.33.0)

v0.33 graduates several v0.27–v0.32 stubs into live functionality and ships follow-ups across Soup Cans, training stability, multi-GPU launch, the registry, and RLVR sandboxing.

Soup Cans v2 — `soup can run` + `soup can publish`

```bash
# Run a .can end-to-end: extract → train → optional deploy
soup can run my-recipe.can --yes
soup can run my-recipe.can --yes --deploy --env-capture env.txt

# Publish to HF Hub as a dataset
soup can publish my-recipe.can --hf-hub user/my-recipe
```

soup can run requires an explicit --yes. Consent is mandatory because the command auto-downloads data and auto-trains. The manifest format is bumped from 1 to 2; the change is additive (a new deploy_targets field), so both v1 and v2 cans still load.
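Because the bump is additive, a loader can treat a v1 manifest as a v2 manifest with no deploy targets. The following is a minimal sketch of that idea, not the actual loader: it assumes a JSON-serialized manifest and a hypothetical `manifest_version` key (only `deploy_targets` is named in these notes).

```python
import json

SUPPORTED_VERSIONS = {1, 2}  # both formats still load


def load_manifest(text: str) -> dict:
    """Parse manifest JSON and normalize it to the v2 shape."""
    manifest = json.loads(text)
    # "manifest_version" is a hypothetical key name used for illustration.
    version = manifest.get("manifest_version", 1)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported manifest version: {version!r}")
    # v2 is additive: a v1 manifest simply has no deploy targets.
    manifest.setdefault("deploy_targets", [])
    return manifest
```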

Security: the extract dir is containment-checked. For the ollama deploy, each GGUF path returned by rglob is checked with realpath + commonpath against the extract dir to prevent symlink escape. A subprocess that hits the 24h cap raises TimeoutExpired, which is reported as rc=124 (coreutils convention). soup can publish validates repo_id, resolves the HF token via env / cache files, and caps commit messages to the first line and 200 characters (matches v0.29 push policy).
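The realpath + commonpath check and the timeout-to-124 mapping can be sketched with the standard library alone. This is an illustration under assumptions rather than the shipped code; the function names `is_contained` and `run_capped` are hypothetical.

```python
import os
import subprocess


def is_contained(candidate: str, root: str) -> bool:
    """True if candidate, after resolving symlinks, lies inside root."""
    real_root = os.path.realpath(root)
    real_candidate = os.path.realpath(candidate)
    try:
        return os.path.commonpath([real_root, real_candidate]) == real_root
    except ValueError:
        # Raised e.g. for paths on different Windows drives.
        return False


def run_capped(cmd: list[str], timeout_s: float) -> int:
    """Run cmd; map a timeout to exit code 124 like coreutils `timeout`."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return 124
```

Resolving both paths before comparing is what defeats a symlink inside the extract dir that points outside it: the comparison sees the link's real target, not its location.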

Live `--find-lr` training loop

The v0.32 stub curve is replaced with a real in-process LR-sweep training loop. A NaN/Inf loss terminates the sweep early, so diverged_at reflects the real point of divergence. The sweep falls back to a synthetic curve when prerequisites are missing (torch not installed, config load failure, or empty dataset), so CI without GPUs still produces a parseable report.

```bash
soup train --config soup.yaml --find-lr --find-lr-output ./lr_finder.json
```
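The early-termination behavior can be illustrated with a framework-free sketch. It assumes a geometric LR schedule and a caller-supplied `step_fn` standing in for one real training step. Only `diverged_at` is a name taken from these notes; the other keys and the function name are hypothetical, and the actual report schema may differ.

```python
import math
from typing import Callable, Optional


def lr_sweep(step_fn: Callable[[float], float], lr_min: float, lr_max: float,
             num_steps: int) -> dict:
    """Sweep the LR geometrically from lr_min to lr_max, one step per LR.

    step_fn(lr) runs one training step at that LR and returns the loss.
    """
    lrs, losses = [], []
    diverged_at: Optional[int] = None
    ratio = lr_max / lr_min
    for step in range(num_steps):
        lr = lr_min * ratio ** (step / max(num_steps - 1, 1))
        loss = step_fn(lr)
        if not math.isfinite(loss):
            # NaN or Inf: stop here and record where the sweep diverged.
            diverged_at = step
            break
        lrs.append(lr)
        losses.append(loss)
    return {"lrs": lrs, "losses": losses, "diverged_at": diverged_at}
```

The non-finite loss is deliberately not appended, which keeps the recorded curve serializable as strict JSON.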

Spike-recovery hint file

Loss-spike recovery now writes a spike_recovery.json hint with the decayed LR for re-launch:

```json
{"original_lr": 2e-4, "recovery_lr": 5e-5, "decay_factor": 0.25, "trigger_step": 482}
```

Live optimizer-state rewind and live DataLoader rebuild remain follow-ups (HF Trainer / TRL upstream constraints).
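The hint's fields are related by recovery_lr = original_lr × decay_factor (2e-4 × 0.25 = 5e-5 in the example). A rough sketch of producing and consuming such a file follows; the helper names and the `out_dir` parameter are assumptions, since these notes do not say where the file is written.

```python
import json
from pathlib import Path


def write_spike_hint(out_dir: str, original_lr: float, decay_factor: float,
                     trigger_step: int) -> Path:
    """Write spike_recovery.json with the decayed LR for a re-launch."""
    hint = {
        "original_lr": original_lr,
        "recovery_lr": original_lr * decay_factor,
        "decay_factor": decay_factor,
        "trigger_step": trigger_step,
    }
    path = Path(out_dir) / "spike_recovery.json"
    path.write_text(json.dumps(hint))
    return path


def read_recovery_lr(path: Path) -> float:
    """Read back the LR to use when re-launching the run."""
    return json.loads(path.read_text())["recovery_lr"]
```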

VRAM grad-accum advisory (live)

When VRAM pressure crosses the threshold, the advisory now prints a concrete recommended (batch, accum) pair preserving effective batch:

[advisory] VRAM at 94% — try batch=2, accum=8 (preserves effective batch=16)

One-shot per run; doesn't fire when CUDA is unavailable.
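These notes do not specify how the recommended pair is chosen. One plausible rule, sketched here as an assumption, is to pick the largest per-device batch that is at most half the current one and still divides the effective batch, then raise accumulation to compensate.

```python
from typing import Optional, Tuple


def suggest_batch_accum(batch: int, accum: int) -> Optional[Tuple[int, int]]:
    """Suggest a smaller (batch, accum) pair with the same effective batch."""
    effective = batch * accum
    # Largest per-device batch that is at most half the current one
    # and still divides the effective batch evenly.
    for new_batch in range(batch // 2, 0, -1):
        if effective % new_batch == 0:
            return new_batch, effective // new_batch
    return None  # batch is already 1; nothing left to shrink
```

Under this rule, a run at batch=4, accum=4 would get the pair (2, 8), which keeps the effective batch at 16.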

Auto-reexec under `accelerate`

soup train --gpus N now auto-reexecs under accelerate launch instead of just printing the command. Critical flags (--fsdp, --deepspeed, --resume, --wandb, --tensorboard, --yes) are forwarded as separate argv elements. Use --no-reexec to opt out and just print the command.

```bash
soup train --config soup.yaml --gpus 4               # auto-reexec
soup train --config soup.yaml --gpus 4 --no-reexec   # print command only
```
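Forwarding flags as separate argv elements means the command is built as a list and never passed through a shell, so values containing spaces need no quoting. The sketch below illustrates that. The module name `soup_cli` and the assumption about which flags carry values are hypothetical; `--num_processes` and `-m` are existing `accelerate launch` options.

```python
import os
from typing import Dict, List, Optional


def build_reexec_argv(gpus: int, config: str,
                      forwarded: Dict[str, Optional[str]]) -> List[str]:
    """Build the argv for re-launching training under `accelerate launch`."""
    argv = ["accelerate", "launch", "--num_processes", str(gpus),
            "-m", "soup_cli",  # hypothetical module name
            "train", "--config", config]
    for flag, value in forwarded.items():
        argv.append(flag)
        if value is not None:
            # Value is its own argv element, so spaces need no quoting.
            argv.append(value)
    return argv


def reexec(argv: List[str]) -> None:
    """Replace the current process; does not return on success."""
    os.execvp(argv[0], argv)
```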

Registry artifact attach

```bash
# Attach an eval JSON to an existing registry entry
soup eval custom --tasks evals/sanity.jsonl --model ./output \
  --attach-to-registry chat-llama@v1

# Auto-attach exported artifact to registry
soup export --model ./output --format gguf --registry-id chat-llama@v1
```

eval_results and tensorrt are now valid artifact kinds. lookup_entry_by_output_dir emits ResourceWarning when its 1000-row scan limit is hit (no silent miss).
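The scan-limit warning can be pictured as follows. This is a hypothetical stand-in named `find_entry`, not the real lookup_entry_by_output_dir, whose signature and storage layer are not shown in these notes; the `output_dir` row key is likewise assumed.

```python
import warnings
from itertools import islice
from typing import Iterable, Optional

SCAN_LIMIT = 1000


def find_entry(rows: Iterable[dict], output_dir: str,
               limit: int = SCAN_LIMIT) -> Optional[dict]:
    """Return the first row matching output_dir among the first `limit` rows."""
    scanned = 0
    for row in islice(rows, limit):
        scanned += 1
        if row.get("output_dir") == output_dir:
            return row
    if scanned >= limit:
        # The miss may be an artifact of the cap, so say so out loud.
        warnings.warn(
            f"scan stopped after {limit} rows without a match; "
            "entries beyond the limit, if any, were not checked",
            ResourceWarning,
            stacklevel=2,
        )
    return None
```

Python's default warning filters ignore ResourceWarning, so seeing it typically requires `-W default` or an explicit `warnings.simplefilter` call.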

v0.28 features expanded to DPO + Pretrain trainers

use_cut_ce, quantization_aware: "fp8", kernel_auto_compose, and activation_offloading now work on DPO and Pretrain trainers (in addition to SFT from v0.28). GRPO/KTO/ORPO/SimPO/IPO/PPO/RewardModel/Embedding still error at config-load with a precise multi-trainer message — full expansion arrives in v0.35.
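A config-load gate of this kind might look like the sketch below. The four feature names come from these notes; the lowercase task identifiers, the function name, and the error wording are assumptions for illustration.

```python
GATED_FEATURES = ("use_cut_ce", "quantization_aware",
                  "kernel_auto_compose", "activation_offloading")
SUPPORTED_TRAINERS = {"sft", "dpo", "pretrain"}  # task names are assumed


def check_gated_features(task: str, config: dict) -> None:
    """Raise at config-load time if a gated feature is on for the wrong task."""
    enabled = [name for name in GATED_FEATURES if config.get(name)]
    if enabled and task not in SUPPORTED_TRAINERS:
        raise ValueError(
            f"{', '.join(enabled)} not supported for task '{task}'; "
            f"supported trainers: {', '.join(sorted(SUPPORTED_TRAINERS))}"
        )
```

Failing at config load rather than mid-run means an unsupported combination is rejected before any model or data is loaded.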

RLVR OS-level isolation

GRPO's code_exec reward gains real OS-level sandboxing on top of the v0.25 RLIMIT/socket-patch baseline:

  • Linux: best-effort os.unshare(CLONE_NEWUSER|CLONE_NEWNET|CLONE_NEWPID). Falls back silently to RLIMIT + socket-patch on hardened kernels (unprivileged_userns_clone=0).
  • macOS: sandbox-exec wrapper with a default-deny profile. (allow mach-lookup) is narrowed to a 3-name allowlist (SecurityServer / notification_center / opendirectoryd.libinfo) to prevent DNS/NSURLSession bypass.
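The Linux best-effort path can be sketched as a try/except around os.unshare, which is available from Python 3.12 on Linux. The function name `try_isolate` and the injectable `unshare` parameter (there so the fallback can be exercised without touching real namespaces) are assumptions, not the shipped code.

```python
import os
from typing import Callable, Optional

# Linux clone(2) flag values, used when the os module lacks the constants
# (they were added alongside os.unshare in Python 3.12).
CLONE_NEWUSER = getattr(os, "CLONE_NEWUSER", 0x10000000)
CLONE_NEWPID = getattr(os, "CLONE_NEWPID", 0x20000000)
CLONE_NEWNET = getattr(os, "CLONE_NEWNET", 0x40000000)


def try_isolate(unshare: Optional[Callable[[int], None]] = None) -> bool:
    """Best-effort namespace isolation; True if it took effect."""
    unshare = unshare or getattr(os, "unshare", None)
    if unshare is None:
        return False  # non-Linux or Python < 3.12
    try:
        unshare(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWPID)
    except OSError:
        # Hardened kernel or missing privileges: caller keeps the
        # RLIMIT + socket-patch baseline.
        return False
    return True
```

A caller would run this in the sandboxed child process before executing untrusted code and, on a False result, continue with the existing baseline rather than abort.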

See also

  • [Soup Cans](/docs/soup-cans) — pack, fork, run, publish
  • [Training stability](/docs/training-stability) — LR finder + spike recovery
  • [Multi-GPU](/docs/multi-gpu) — auto-reexec
  • [Registry](/docs/registry) — artifact attach