Question 1

What's new in the latest Soup CLI release?

Accepted Answer

The newest release, v0.71.35, is the 'compliance pack': it makes a regulated fine-tune shippable with the paperwork it needs, and fixes GGUF export on Windows. 'soup init --template hipaa|soc2|eu-ai-act|sr-11-7' starts you from a regulation-shaped config. The honest design constraint is that Soup's compliance controls are CLI flags and commands rather than config keys, so a template cannot 'pre-wire audit-log on' as YAML: each one is a valid training config on a license-clean Apache-2.0 base, plus header comments naming that regime's exact commands (PHI scrubbing and air-gap for HIPAA; BOM, attest and sign for SOC 2; Annex XI plus energy tracking for the EU AI Act; repro-receipt plus diagnose and ship for SR 11-7). Templates went from 17 to 21. 'soup card -o MODELCARD.md' turns a Local Model Registry entry into a publishable, provenance-carrying Hugging Face model card: base model, the real training config, an eval scorecard, config and data sha256 hashes, lineage (ancestors) and a table of every registered artifact; adapter versus full model is inferred from the registered artifacts and falls back to the training config's LoRA rank, with Spectrum and LISA full fine-tunes correctly treated as dense, so the card sets the right 'library_name' and never misreports the model type. 'soup push --card ' renders that same card and uploads it as README.md, overriding the auto-generated one (a bad ref fails fast before any network call; Hugging Face hub only). 'soup ci init' writes '.github/workflows/soup-gate.yml', a pull-request gate chaining 'soup data validate' then 'soup expect' then 'soup ship --evidence', where exit 2 blocks the merge; every interpolated path is validated to stay under the repo root and shell-quoted, the branch and Python version are regex-gated, and the write is atomic, symlink-rejecting and refuses to clobber an existing workflow without --force. 
A new compliance quickstart walks template to PII scrub to train with receipt, Annex XI and energy to registry to BOM and attestation to scan, sign and verify to air-gap to model card to CI gate. The release also made GGUF export actually work on Windows, validated end to end for the first time against a locally built llama.cpp (SmolLM2-135M to q4_0, q4_k_m, q8_0 and f16, then 'soup deploy ollama', then live inference), which surfaced four real bugs that each independently broke the path: the export cloned llama.cpp into your current directory instead of ~/.soup because the lookup used a relative path; the first GGUF export ran llama.cpp's own requirements.txt through pip and downgraded torch 2.5.1+cu to 2.2.2+cpu, silently destroying CUDA in your training setup (Soup now installs only the convert script's extra dependencies, unpinned, never touching torch); a correctly built llama.cpp was not found because MSVC is a multi-config generator emitting build/bin/Release/llama-quantize.exe; and 'soup deploy ollama' failed on a relative GGUF path because Ollama resolves FROM against the Modelfile's directory. It also hardened model-card injection, a pre-existing hole in 'soup push': the Training section interpolated base, task, scheduler and recipe unescaped into a card published to the Hub, and since base and scheduler have no charset validator a crafted-but-valid config could smuggle raw HTML or a backtick that breaks out of the code span. The prior release, v0.71.34, added adapter algebra and LISA. 
'soup adapters arithmetic "coder + 0.5*math - toxic" --adapter coder=./a --adapter math=./b --adapter toxic=./c -o out/' applies task arithmetic (arXiv:2212.04089) to LoRA deltas and produces one merged adapter you can serve or merge further; the combine is signed and un-normalized element-wise over same-rank adapters, and the effective delta (B @ A) scales linearly with each coefficient via a square-root-of-absolute-value factor split rather than the quadratic a naive sum gives, so negation flips the delta and 0.5 halves it. That exactness holds on the self (diagonal) term, meaning scaling or negating a single adapter is exact, while a multi-adapter sum also carries cross-terms the element-wise combine does not cancel; and the mechanism (the delta really negates) is proven while the behavioural outcome, such as whether subtracting a toxicity adapter measurably reduces toxicity on a real safety eval, is not yet benchmarked. Mixed-rank inputs are refused with a clear 'harmonize rank' message, the expression parser is hand-written with no eval, it reuses the backdoor-scan gate (refusing a FAIL-scanned input unless --allow-unscanned) and a same-base-model check (--allow-cross-base to override), and paths are cwd-contained and symlink-rejecting. LISA (Layerwise Importance Sampled AdamW, arXiv:2403.17919) targets full-fine-tuning quality at LoRA-like memory: every N steps it freezes all decoder layers except a small random set, with embeddings and the head always trainable. Enable it with 'training.lisa_enabled: true' plus 'lisa_num_layers' and 'lisa_interval_steps' on a task: sft, transformers, text, quantization: none run; it is mutually exclusive with LoRA features and the other freeze mechanisms, and it is live on a 4 GB GPU for small models. Before that, v0.71.33 shipped 'soup draft', and its story is the measurement, not a speedup claim. 
'soup draft measure --target --draft --prompts p.jsonl' reports a draft's acceptance rate (the fraction of the target's own greedy tokens the draft would have proposed correctly, the teacher-forced metric the Medusa and EAGLE papers report) plus real plain-versus-assisted throughput, and exits 0 on success, 2 when acceptance falls below --min-acceptance (so CI can gate on it), or 1 on error. 'soup draft distill --target --draft-base --data d.jsonl -o draft/' distils your target into a tiny base via logit KD and emits a dense draft model ready to load as an assistant_model. 'soup draft list' shows what you have, and a local draft registry at ~/.soup/drafts.json is consulted by 'soup serve --auto-spec' before the built-in pairing table, so a draft you trained yourself is picked up automatically. Draft and target must share a tokenizer and a mismatched pair is refused up front, because speculative decoding proposes draft token ids into the target's vocabulary and a mismatch silently produces garbage rather than failing. The honest result: on the validated pair (SmolLM2-360M-Instruct target, SmolLM2-135M-Instruct draft) the stock draft already scored 69.3% acceptance, and distilling it moved that to 69.7% after 2 epochs and back to 69.3% after 10 epochs, no gain beyond noise, because a small same-family draft is already near its capacity ceiling for agreeing with the target and logit KD cannot buy capacity it does not have. Speculative decoding on that pair measured 0.55 to 0.64x of plain-decoding throughput, a net slowdown, since the draft's forward pass costs more than the tokens it saves at this size. So the 'faster serving' pitch was withdrawn rather than shipped, and the feature ships as the honest gate that tells you whether speculative decoding is worth enabling before you ship it. Whether distillation materially raises acceptance for a genuinely diverged fine-tune or a larger pair is unproven on a 4 GB box and is tracked as a scale issue; cross-tokenizer drafts are deferred.
v0.71.33 also fixed a hardware-fit gate that refused to train small models ('model_size_from_name' did not understand an M suffix, so SmolLM2-135M fell through to the 7B default and was blocked, and 1.7B was read as 7B), stopped 'soup serve --backend vllm' force-enabling trust_remote_code, made multi-adapter serving actually switch adapters per request instead of silently running the startup model, containment-checked llava and sharegpt4v image paths against image_dir so a crafted row cannot read arbitrary local files, stopped 'soup train --dry-run --gpus N' launching a real multi-GPU run, gave MLX SFT a real optimizer instead of passing None (which left the model untrained), and aligned the knowledge-distillation KD term with the CE term over causal-shifted positions. Earlier, v0.71.32 added ASR fine-tuning: 'soup train' with task 'asr' fine-tunes OpenAI Whisper on your own audio from plain JSONL rows of {"audio": "clip.wav", "text": "the transcript"}, wrapping Hugging Face Seq2SeqTrainer and WhisperProcessor; 'training.asr_lora: true' trains LoRA on the q/v projections so whisper-tiny (39M) and whisper-base (74M) fit a 4 GB GPU, and 'soup infer --task asr' reports per-row and corpus WER/CER via a pure-Python metrics module with no new dependency, where word accuracy (1 - WER) plugs into 'soup ship' as a higher-is-better task metric. And v0.71.31 shipped the judge-in-the-loop suite: 'soup train --task online_dpo' trains on-policy against an LLM judge wrapping TRL's OnlineDPOTrainer, 'soup data best-of-n' is BOND-lite rejection sampling, 'soup data evolve' is Evol-Instruct instruction evolution, and 'soup ship --task-mode pairwise' decides SHIP by a swap-debiased pairwise judge win-rate against a 0.5 coin-flip base. Totals as of v0.71.35: 16,001 tests across 313 test files, 142 ready-made recipes, 21 built-in templates, 25 training tasks, 19 data formats, 17 quantization formats, Python 3.10+, Apache-2.0.

Question 2

How do I fine-tune a model for HIPAA, SOC 2, the EU AI Act or SR 11-7 with Soup CLI?

Accepted Answer

The v0.71.35 compliance pack covers it end to end. Start with 'soup init --template hipaa|soc2|eu-ai-act|sr-11-7'. Be clear on what a template is, because the honest constraint shaped the design: Soup's compliance controls are CLI flags and commands, not config keys (audit_log, bom, attest, repro_receipt, annex_xi, track_energy, pii and decontaminate have zero matches in the config schema), so a template cannot 'pre-wire audit-log on' as YAML. Each one is instead a valid, immediately trainable config on a license-clean Apache-2.0 base, plus header comments naming that regime's exact commands: PHI scrubbing and air-gap for HIPAA, BOM/attest/sign for SOC 2, Annex XI plus energy tracking for the EU AI Act, repro-receipt plus diagnose and ship for SR 11-7. Built-in templates went from 17 to 21. The common path is then: 'soup data pii' and 'soup data decontaminate' to clean the data; 'soup train --repro-receipt receipt.json' (adding '--annex-xi annex_xi.md --track-energy --energy-country DEU --energy-out energy.json' for the EU AI Act); 'soup registry push'; 'soup bom emit --format both' for CycloneDX plus SPDX and 'soup attest emit --sign ed25519' for in-toto plus SLSA-3; 'soup adapters scan/sign/verify'; 'soup airgap-bundle' if you need to cross an air gap; then 'soup card <registry-id> -o MODELCARD.md' to publish. The audit log records every command automatically ('soup audit-log tail'). Finally 'soup ci init' writes .github/workflows/soup-gate.yml, a pull-request gate chaining 'soup data validate', then 'soup expect', then 'soup ship --evidence', where exit 2 blocks the merge.

Question 3

What is 'soup card' and 'soup ci init' in Soup CLI v0.71.35?

Accepted Answer

'soup card -o MODELCARD.md' (v0.71.35) turns a Local Model Registry entry into a publishable, provenance-carrying Hugging Face model card: base model, the real training config, an eval scorecard joined from the tracker, config and data sha256 hashes, lineage (ancestors), and a table of every registered artifact. Adapter versus full model is inferred from the registered artifacts and falls back to the training config's LoRA rank, with Spectrum ('unfrozen_parameters') and LISA ('lisa_enabled') full fine-tunes correctly treated as dense, so the card sets the right 'library_name' and never misreports the model type. That detail was found by a live smoke, not by tests: a real LoRA run with no attached artifacts rendered 'Type | Full model' and 'library_name: transformers', a false claim in a provenance document that 90 green tests had missed. 'soup push --model ./out --repo you/m --card ' renders the same card and uploads it as README.md, overriding the auto-generated one; a bad ref fails fast before any network call and it is Hugging Face only (it warns and ignores on other hubs). 'soup ci init [--data d.jsonl --suite s.yaml --evidence ev.json] [--branch main --python 3.11] [--force]' writes .github/workflows/soup-gate.yml, a fine-tuning CI gate that runs 'soup data validate' for dataset format compliance, then 'soup expect' for PII, token-length, refusal and judge expectations, then 'soup ship --evidence', where exit 2 blocks the merge. Every interpolated path is validated to stay under the repo root and shell-quoted (a '; rm -rf /' in a path is quoted as one token), the branch and Python version are regex-gated so they cannot break out of their YAML context, and the write is atomic, symlink-rejecting, and refuses to clobber an existing workflow without --force.
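
The path handling described above can be illustrated with a minimal Python sketch. It is not Soup's implementation: 'safe_repo_path' is a hypothetical helper, and it only shows the general technique of resolving a path, checking that it stays under the repo root, and shell-quoting it so it reaches the shell as a single token.

```python
import shlex
from pathlib import Path


def safe_repo_path(repo_root: str, user_path: str) -> str:
    """Return a shell-quoted path, or raise if it escapes repo_root.

    Hypothetical helper for illustration; not part of Soup's API.
    """
    root = Path(repo_root).resolve()
    candidate = (root / user_path).resolve()
    # Reject anything that resolves outside the repository root.
    if not candidate.is_relative_to(root):
        raise ValueError("path escapes the repo root")
    # Quote the user-supplied text so the shell sees exactly one token.
    return shlex.quote(user_path)
```

A plain path such as 'data/d.jsonl' comes back unchanged, while '; rm -rf /' comes back wrapped in single quotes and therefore cannot start a second command.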

Question 4

What is 'soup draft', and does training a speculative-decoding draft actually make serving faster?

Accepted Answer

'soup draft' (v0.71.33) trains and, above all, MEASURES a speculative-decoding draft, and the honest answer to whether it makes serving faster is: measure it, because on the pair we validated it did not. 'soup draft measure --target --draft --prompts p.jsonl' reports a draft's acceptance rate (the fraction of the target's own greedy tokens the draft would have proposed correctly, teacher-forced argmax agreement, the metric the Medusa and EAGLE papers report) plus real plain-versus-assisted throughput measured on your box; roughly 70% and up is where speculative decoding starts paying for the draft's forward pass, and '--min-acceptance 0.6' exits 2 below the floor so CI can gate on it (exit 0 = ok, 2 = below floor, 1 = error). 'soup draft distill --target --draft-base --data d.jsonl -o draft/' distils your target into a tiny base via logit KD through the existing 'task: distill' trainer and emits a dense model, because a PEFT adapter directory cannot be loaded as an assistant_model; flags include --steps, --device, --force and --plan-only. 'soup draft list' shows local drafts, and a registry at ~/.soup/drafts.json is consulted by 'soup serve --auto-spec' BEFORE the built-in pairing table, so a draft you trained yourself is picked up automatically. Draft and target must share a tokenizer and a mismatch is refused up front, because speculative decoding proposes draft token ids into the target's vocabulary and a mismatched pair silently produces garbage rather than failing. 
The measured result, which is the feature: on SmolLM2-360M-Instruct with a SmolLM2-135M-Instruct draft, the stock draft already scored 69.3% acceptance and distilling it moved that to 69.7% after 2 epochs and back to 69.3% after 10, no gain beyond noise, because a small same-family draft is already near its capacity ceiling for agreeing with the target and logit KD cannot buy capacity it does not have; assisted decoding measured 0.55 to 0.64x, a net slowdown, since the draft's forward pass costs more than the tokens it saves at that size. The 'faster serving' pitch was therefore withdrawn rather than shipped, and 'soup draft measure' is what tells you so before you ship. Whether distillation pays off on a larger or genuinely diverged pair is unproven on a 4 GB box; cross-tokenizer drafts are deferred.
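
As a rough illustration of the metric described above, the sketch below computes teacher-forced argmax agreement from two aligned token-id sequences and maps it to the CI exit codes. The function names are hypothetical, and treating the floor as inclusive is an assumption; it is not Soup's code.

```python
from typing import Sequence


def acceptance_rate(target_tokens: Sequence[int], draft_tokens: Sequence[int]) -> float:
    """Fraction of positions where the draft's argmax equals the target's greedy token.

    Both sequences are assumed to come from the same teacher-forced prefix,
    so position i in one corresponds to position i in the other.
    """
    if len(target_tokens) != len(draft_tokens):
        raise ValueError("sequences must be aligned position by position")
    if not target_tokens:
        raise ValueError("need at least one token")
    hits = sum(t == d for t, d in zip(target_tokens, draft_tokens))
    return hits / len(target_tokens)


def gate_exit_code(rate: float, min_acceptance: float) -> int:
    """0 = ok, 2 = below the floor (1 is reserved for errors).

    Assumption: a rate exactly equal to the floor passes.
    """
    return 0 if rate >= min_acceptance else 2
```

With three of four tokens matching, the rate is 0.75, which passes a floor of 0.6 and would fail a floor of 0.8.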

Question 5

What is 'soup adapters arithmetic' (LoRA task arithmetic) and LISA in Soup CLI v0.71.34?

Accepted Answer

'soup adapters arithmetic "coder + 0.5*math - toxic" --adapter coder=./coder-lora --adapter math=./math-lora --adapter toxic=./toxic-lora -o ./blended' (v0.71.34) applies task arithmetic (arXiv:2212.04089) to LoRA deltas: add to blend two skills, scale to dial one down, and, the differentiator, negate to subtract one. The output is a single loadable adapter (exit 0 = ok, 1 = refusal). The coefficient math is the point: a LoRA does not store its delta directly but two factors whose product is the effective weight change (delta W = B @ A), so scaling both factors by c scales the delta by c squared. For '- toxic' that means the coefficient -1 squares to +1: the minus sign cancels itself and the adapter is added back instead of removed. Soup splits the coefficient as the square root of its absolute value across the two factors and carries the sign separately, so the delta scales linearly (verified live on two real rank-8 adapters: the ratio of the norms for 2*a versus a is exactly 2.000). That exactness holds on the self (diagonal) term, so scaling or negating a single adapter is exact, while a multi-adapter sum also carries cross-terms this element-wise combine does not cancel. What is proven is the mechanism (the delta really negates), not the behavioural outcome: whether subtracting a toxicity adapter measurably reduces toxicity on a real safety eval is not yet benchmarked. Guards: mixed ranks are refused with a 'harmonize rank' message rather than silently approximated, a different base model is refused unless --allow-cross-base, a FAIL-scanned input is refused by the backdoor-scan gate unless --allow-unscanned, and paths are cwd-contained and symlink-rejecting. The expression parser is hand-written with no eval, so an adapter name can never become code.
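
The coefficient split can be checked on toy matrices. The following is a minimal pure-Python sketch, not Soup's implementation: 'scale_lora' is a hypothetical helper, putting the sign on the B factor is an assumption, and the cross-term part assumes the combine sums the scaled factors element-wise, which is one reading of the description above.

```python
import math


def matmul(x, y):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def scale(m, k):
    return [[k * v for v in row] for row in m]


def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def scale_lora(B, A, c):
    """Scale a LoRA pair so that the product B @ A is scaled by c, not c squared."""
    root = math.sqrt(abs(c))
    sign = -1.0 if c < 0 else 1.0
    # Both factors get sqrt(|c|); the sign goes on one factor only (assumed: B).
    return scale(B, sign * root), scale(A, root)


B1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]  # d_out x rank
A1 = [[2.0, 0.0, 1.0], [1.0, -1.0, 4.0]]    # rank x d_in
delta1 = matmul(B1, A1)

# Naive negation of both factors: the two minus signs cancel.
naive = matmul(scale(B1, -1.0), scale(A1, -1.0))

# Split negation: the delta really flips sign.
neg_B, neg_A = scale_lora(B1, A1, -1.0)
negated = matmul(neg_B, neg_A)

# Two adapters summed factor-wise pick up the cross-terms B1 @ A2 and B2 @ A1.
B2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A2 = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
combined = matmul(add(B1, B2), add(A1, A2))
sum_of_deltas = add(delta1, matmul(B2, A2))
```

Here 'naive' equals 'delta1' (the negation was lost), 'negated' equals minus 'delta1', and 'combined' differs from 'sum_of_deltas' by exactly the two cross-terms.
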
LISA (Layerwise Importance Sampled AdamW, arXiv:2403.17919) is the other half of v0.71.34 and targets full-fine-tuning quality at LoRA-like memory: where Spectrum picks the layers once, LISA re-samples a small random set of decoder layers every N steps and freezes the rest, with the input embeddings, LM head and final norm always trainable. Because only a handful of layers train at any moment and their optimizer state is cleared when re-frozen, peak optimizer memory is roughly that of the embeddings, the head and lisa_num_layers decoder layers, while every layer still gets updated over the run. Enable it with 'training.lisa_enabled: true' plus 'lisa_num_layers', 'lisa_interval_steps' and 'lisa_reset_optimizer' on a run with task: sft, the transformers backend, text modality and quantization: none. It is mutually exclusive with LoRA features, freeze_layers/freeze_ratio and Spectrum's unfrozen_parameters, since each independently decides what trains. It has been run live on a 4 GB GPU for small models.
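
The re-sampling schedule can be sketched without any training framework. This is an illustrative sketch only: 'lisa_schedule' is a hypothetical function, its parameters merely mirror the config keys named above, and uniform sampling without replacement is an assumption.

```python
import random


def lisa_schedule(num_layers: int, lisa_num_layers: int, lisa_interval_steps: int,
                  total_steps: int, seed: int = 0) -> dict[int, list[int]]:
    """Map each re-sampling step to the decoder layers left trainable.

    Embeddings, LM head and final norm are not listed because they stay
    trainable throughout.
    """
    rng = random.Random(seed)
    schedule = {}
    for step in range(0, total_steps, lisa_interval_steps):
        # All decoder layers not in this sample are frozen until the next re-sample.
        schedule[step] = sorted(rng.sample(range(num_layers), lisa_num_layers))
    return schedule
```

For a 30-layer model with 2 active layers and an interval of 10, a 50-step run re-samples at steps 0, 10, 20, 30 and 40.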

Question 6

Does 'soup export --format gguf' work on Windows?

Accepted Answer

Yes, as of v0.71.35, and it was validated end to end on Windows for the first time: llama.cpp built CPU-only with VS2022 BuildTools and cmake, SmolLM2-135M exported to q4_0 (92 MB), q4_k_m (105 MB), q8_0 (145 MB) and f16 (271 MB), then 'soup deploy ollama', then live inference. Doing that validation surfaced four real bugs, each of which independently broke the path, and which 361 green export and ollama tests had not caught. First, 'soup export --format gguf' cloned llama.cpp into your current directory: SOUP_DIR is the bare name '.soup' but the lookup used it relatively instead of anchoring to the home directory like the rest of the codebase, so the canonical ~/.soup/llama.cpp was never found and a fresh ~200 MB checkout was dropped into whatever directory you ran from. Second, and worst, the first GGUF export downgraded your PyTorch and broke CUDA: the auto-clone ran 'pip install -r <llama.cpp>/requirements.txt' into your interpreter, and llama.cpp pins torch~=2.2.1 against the CPU wheel index, so torch 2.5.1+cu became 2.2.2+cpu and transformers 4.57 became 4.46, meaning a user's first GGUF export silently destroyed their training setup; Soup now installs only the convert script's extra dependencies (gguf, sentencepiece, protobuf), unpinned, never touching torch, and non-fatally. Third, a correctly built llama.cpp was 'not found' on Windows because MSVC, like Xcode, is a multi-config generator and emits build/bin/Release/llama-quantize.exe, while only the flat single-config layout was searched. Fourth, 'soup deploy ollama' failed on a relative GGUF path with 'pull model manifest: file does not exist', because Ollama resolves FROM against the Modelfile's directory and Soup writes the Modelfile to a temp dir; the Modelfile now emits an absolute path. This closes the CPU-validatable half of the long-standing GGUF-on-Windows issue; the CUDA plus AWQ/GPTQ half remains open.
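
Three of those fixes come down to path handling, which the sketch below illustrates in Python. The function names are hypothetical and the exact candidate list is an assumption; only the build/bin/Release/llama-quantize.exe layout, the ~/.soup/llama.cpp location and the absolute FROM path come from the description above.

```python
from pathlib import Path
from typing import Optional


def default_llama_cpp_dir() -> Path:
    # Anchor to the home directory rather than the current working directory.
    return Path.home() / ".soup" / "llama.cpp"


def find_llama_quantize(llama_cpp_dir: Path) -> Optional[Path]:
    """Search both the flat and the multi-config build layouts."""
    bin_dir = llama_cpp_dir / "build" / "bin"
    candidates = [
        bin_dir / "llama-quantize",                   # flat, single-config layout
        bin_dir / "llama-quantize.exe",
        bin_dir / "Release" / "llama-quantize",       # multi-config generator (Xcode)
        bin_dir / "Release" / "llama-quantize.exe",   # multi-config generator (MSVC)
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None


def modelfile_from_line(gguf_path: str) -> str:
    # An absolute path does not depend on the directory the Modelfile is written to.
    return f"FROM {Path(gguf_path).resolve()}"
```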

Question 7

Can Soup CLI fine-tune Whisper (ASR / speech-to-text) models?

Accepted Answer

Yes. As of v0.71.32, 'soup train' with 'task: asr' fine-tunes OpenAI Whisper speech-recognition models on your own audio, locally. Data is plain JSONL rows of {"audio": "clip.wav", "text": "the transcript"} with 'data.format: asr' and a containment-checked 'data.audio_dir'; the trainer wraps Hugging Face Seq2SeqTrainer + WhisperProcessor (clips become 16 kHz mono log-mel input features, transcripts become the decoder labels). 'training.asr_lora: true' trains LoRA on the q/v projections instead of a full fine-tune, so whisper-tiny (39M) and whisper-base (74M) train on a 4 GB GPU; 'training.asr_language' and 'training.asr_task' (transcribe or translate) pin the decoder prefix and persist to an 'asr_generation.json' sidecar that inference restores automatically. 'soup infer --task asr --model <whisper-or-adapter> --input eval.jsonl --output preds.jsonl' transcribes a JSONL of clips and, when references are present, reports per-row and corpus WER/CER via a pure-Python metrics module with no new dependency; word accuracy (1 - WER) plugs into 'soup ship' as a higher-is-better task metric. Recipes: 'whisper-tiny-asr' and 'whisper-base-asr' (live-validated on a 4 GB RTX 3050: WER 1.000 to 0.000 on a memorization smoke at 0.4 of 4 GB) plus 'whisper-large-v3-asr' (parse-tested, needs 16 GB+). Audio decodes only through a hardened loader (soundfile pre-probe, symlink rejection, O_NOFOLLOW, byte caps); a non-Whisper base or an mlx/unsloth backend is rejected with a friendly error before any download.
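
For readers unfamiliar with the metrics, WER and CER are edit distances normalized by the reference length. The sketch below is a generic pure-Python version of that definition, written for illustration; it is not Soup's metrics module, and the function names are assumptions.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences (one row at a time)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Word accuracy is then simply 1 - wer(reference, hypothesis), which is higher-is-better.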

Question 8

What is the judge-in-the-loop suite in Soup CLI v0.71.31 (Online DPO, best-of-N, Evol-Instruct, soup ship pairwise)?

Accepted Answer

The judge-in-the-loop suite (v0.71.31) puts an LLM judge in the loop across the whole post-training workflow as four integrated commands sharing one swap-debiased pairwise-judge core, and no competitor (Unsloth, Axolotl, LLaMA-Factory, OpenPipe) ships it as an integrated CLI suite. 'soup train --task online_dpo' trains on-policy against a judge: each step samples two completions per prompt and a pairwise judge (or a reward model) names the winner 'chosen' and the loser 'rejected', wrapping TRL's OnlineDPOTrainer (config online_dpo_judge or reward_model, exactly one of the two; online_dpo_loss_type sigmoid or ipo; online_dpo_max_new_tokens; beta reuses dpo_beta; transformers and text only). 'soup data best-of-n --base --prompts p.jsonl --n 8 --judge -o sft.jsonl [--emit-pairs dpo.jsonl]' is best-of-N rejection sampling (BOND-lite): sample N completions locally, a judge scores each one pointwise, the winner becomes an SFT row with provenance and, with --emit-pairs, winner-versus-loser DPO pairs. 'soup data evolve --input seeds.jsonl --provider ollama|vllm --model --strategy depth|breadth --rounds N -o out.jsonl' is Evol-Instruct (WizardLM) instruction evolution that grows instruction diversity, completing the synthetic-data suite (Magpie, Forge, Persona, evolve). And 'soup ship --task-mode pairwise --judge-model ' adds a third task-win mode to the SHIP or DON'T-SHIP verdict: a true pairwise judge win-rate where the judge picks base versus tuned per prompt (swap-debiased, a win counts only when the (A,B) and (B,A) orders agree), the base is a 0.5 coin-flip and the tuned model wins only if its win-rate exceeds 0.5, fused with the catastrophic-forgetting gate and CI exit codes (0 = SHIP, 2 = DON'T SHIP, 1 = error). It is honest about scale: the training claim is proof-of-mechanism only, validated on SmolLM2-135M with a synthetic length-preferring judge on CPU, not a production RLHF claim (community scale ask tracked in issue #286). 
Because CI resolves trl 1.x, a runtime trl-version adapter uses the swap-debiased pairwise judge on trl 0.19.x and the same judge as a pointwise reward function on trl 1.x, a documented per-version behaviour difference. Judge and provider URLs are SSRF-validated, dataset paths are cwd-contained with symlink rejection and atomic writes, and untrusted model code defaults to trust_remote_code off. It adds the online-dpo-smollm2-135m recipe.
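
The swap-debiased win-rate can be sketched as follows. This is an illustration under stated assumptions, not Soup's judge core: the judge is any callable returning "A" or "B", the function names are hypothetical, and prompts where the two orders disagree are counted as non-wins in the denominator.

```python
from typing import Callable, Sequence

# (prompt, answer_a, answer_b) -> "A" or "B"
Judge = Callable[[str, str, str], str]


def debiased_win_rate(prompts: Sequence[str], base_answers: Sequence[str],
                      tuned_answers: Sequence[str], judge: Judge) -> float:
    """Share of prompts where the tuned answer wins in BOTH presentation orders."""
    if not prompts:
        raise ValueError("need at least one prompt")
    wins = 0
    for prompt, base, tuned in zip(prompts, base_answers, tuned_answers):
        tuned_first = judge(prompt, tuned, base)   # tuned shown as A
        tuned_second = judge(prompt, base, tuned)  # tuned shown as B
        if tuned_first == "A" and tuned_second == "B":
            wins += 1
    return wins / len(prompts)


def ship_exit_code(win_rate: float) -> int:
    """0 = SHIP only if the tuned model beats the 0.5 coin-flip base, else 2."""
    return 0 if win_rate > 0.5 else 2
```

A judge that always answers "A" regardless of content never produces an agreed win, so pure position bias yields a win-rate of 0 rather than 0.5.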

Question 9

What is PRM-guided GRPO in Soup CLI v0.71.30?

Accepted Answer

PRM-guided GRPO (v0.71.30) lets a trained Process Reward Model drive GRPO reinforcement learning, the o1-era process-supervision signal, and no OSS fine-tuning CLI ships this. 'soup train --task grpo' with training.prm_reward set to a PRM directory or Hugging Face id (from 'soup train --task prm') and training.prm_aggregate set to min, prod or last scores each reasoning step of a generated answer and folds the per-step scores into one reward that GRPO optimizes, replacing reward_fn; the default aggregation 'min' is weakest-link and the safe choice, and it is gated to grpo, transformers and text. v0.71.30 also bundles three ready-to-run toy rollout environments (soup_cli.envs.calculator, retrieval_qa and guess_number) so the GRPO openenv rollout path works with no external setup, with the recipes grpo-env-calculator, grpo-env-retrieval-qa and grpo-env-guess-number. It is proof-of-mechanism on SmolLM2-135M with a tiny synthetic PRM and synthetic reward, not a production reward-model claim (scale ask tracked in issue #286).
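
The three aggregation modes are simple to state in code. The sketch below assumes per-step scores are already available as floats; 'aggregate_step_scores' is a hypothetical name, not Soup's function.

```python
import math
from typing import Sequence


def aggregate_step_scores(step_scores: Sequence[float], mode: str = "min") -> float:
    """Fold per-step PRM scores into one scalar reward."""
    if not step_scores:
        raise ValueError("need at least one step score")
    if mode == "min":
        return min(step_scores)        # weakest link: one bad step caps the reward
    if mode == "prod":
        return math.prod(step_scores)  # every step discounts the reward
    if mode == "last":
        return step_scores[-1]         # only the final step counts
    raise ValueError(f"unknown aggregation mode: {mode}")
```

For scores 0.9, 0.2 and 0.8, 'min' gives 0.2 and 'last' gives 0.8, which shows why 'min' is the conservative choice: a strong final step cannot hide a weak middle one.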

Question 10

What is 'soup shrink' in Soup CLI v0.71.29, and how does depth pruning work?

Accepted Answer

'soup shrink --model [--drop-ratio 0.25 | --drop-layers N] --calib calib.jsonl [--heal heal.jsonl --heal-steps 200] [--tolerance 0.10] [-o shrunk] [--device cpu] [--attach-to-registry ] [--plan-only]' (v0.71.29) makes a model permanently smaller and faster by removing its least-useful decoder layers, optionally healing the damage by distillation, and printing a single SHIP / DON'T-SHIP perplexity verdict. It implements 'The Unreasonable Ineffectiveness of the Deeper Layers' (Gromov et al., arXiv:2403.17887): over a calibration set it ranks every contiguous block of decoder layers by the mean per-token angular distance the residual stream travels from the block's input to its output, drops the least-important block (the first and last decoder layers are always protected), reloads the sliced model so surviving layers re-index cleanly, and with '--heal' distills the full-depth original into the pruned student via an isolated LoRA logit-distillation 'soup train' subprocess whose adapter is then fused back — so the shipped artifact stays a single dense smaller model, not a base plus adapter. The verdict SHIPs only if the perplexity ratio (final / original) stays within '--tolerance' (default 10%), else DON'T SHIP; exit codes are 0 = SHIP, 2 = DON'T SHIP, 1 = error, and '--plan-only' prints the layer-importance table and the chosen block without writing anything. The arch allowlist v1 is Llama / Qwen / SmolLM (checked twice). It is validated live on SmolLM2-135M on a single RTX 3050: drop 25% took it from 30 to 22 layers (~21% of params) at perplexity x2.98 unhealed, while drop-4 plus a CPU heal recovered perplexity to x1.35. No fine-tuning CLI (Unsloth, Axolotl, LLaMA-Factory) ships one-command importance-ranked depth pruning with a distill-heal and a binary perplexity verdict. 
Security: --calib, --heal, --output-dir and every derived write path stay under cwd with realpath + commonpath + O_NOFOLLOW + symlink rejection re-validated right before each write (TOCTOU defence, including after the heal), the heal subprocess is an argv list with a timeout and a schema-validated config, subprocess output is control-char-stripped, input files are capped at 64 MiB / 10,000 calib rows, and --model defaults trust_remote_code off.
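
The block ranking behind the pruning step can be sketched in pure Python. This is an illustration of the angular-distance criterion as described above, not Soup's kernel: the function names and the layout of 'hidden' (hidden[l][t] is the residual-stream vector of token t entering layer l, with one extra entry for the final output) are assumptions.

```python
import math


def angular_distance(x, y):
    """Angle between two vectors, normalized to [0, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi


def least_important_block(hidden, block_size):
    """Return (start_layer, score) of the contiguous block that moves the stream least."""
    num_layers = len(hidden) - 1
    best_start, best_score = None, math.inf
    # Start at 1 and stop before the last layer so both ends stay protected.
    for start in range(1, num_layers - block_size):
        block_in, block_out = hidden[start], hidden[start + block_size]
        score = sum(angular_distance(a, b) for a, b in zip(block_in, block_out)) / len(block_in)
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score
```

A low score means the block's output points in almost the same direction as its input, so removing it changes the residual stream the least.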

Question 11

Does Soup CLI have an MCP server, and what does 'soup mcp serve' do?

Accepted Answer

Yes. 'soup mcp serve [--allow-mutating]' (v0.71.28) is a Model Context Protocol server that lets you drive Soup from any MCP client (Claude Code, Cursor, Cline, Continue) over stdio, so your coding agent can inspect data, search recipes, read runs and give ship verdicts without leaving the chat. No other fine-tuning CLI ships an MCP server. It exposes 16 tools: 14 read-only ones that each map to a Soup command and return JSON (advise, data_inspect, data_validate, data_score, data_doctor, recipes_search, recipes_show, runs_list, runs_show, registry_list, registry_show, profile, diagnose_evidence, ship_evidence), plus 2 plan-only mutating tools (train_start, export) that are always listed but refuse to run unless you pass '--allow-mutating', and even then only render the exact command that would run — they never execute training or export in v1. Transport is stdio only: there is no network listener, no HTTP, no SSE. stdout is reserved for the JSON-RPC channel (all human text goes to stderr), every path argument re-enters cwd-containment plus symlink rejection, output is control-char sanitized so a malicious dataset string cannot smuggle terminal escapes into the client, errors are path-free, and string (<= 4096 chars) / JSON (<= 16 MiB) / dataset (<= 1 GiB) / int bounds are enforced. It installs behind a new '[mcp]' extra ('pip install soup-cli[mcp]', dependency mcp>=1.2.0) that is lazy-imported so the core CLI stays PyTorch-free and SDK-free. A client connects with { "mcpServers": { "soup": { "command": "soup", "args": ["mcp", "serve"] } } }. The server exposes MCP Tools only — no Resources or Prompts. SSE/HTTP transport and live execution of the mutating tools are filed follow-ups; v1 is stdio-only and plan-only.
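
Two of the output and input guards mentioned above can be sketched in a few lines. This is a generic illustration, not the server's code: the helper names are hypothetical, and the exact set of stripped characters (C0 controls except tab and newline, plus DEL) is an assumption; only the 4096-character string bound comes from the text.

```python
import re

MAX_STRING_CHARS = 4096  # string bound quoted in the text

# Assumed set: C0 control characters except tab and newline, plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def check_string_arg(value: str) -> str:
    """Reject over-long string arguments before any work is done."""
    if len(value) > MAX_STRING_CHARS:
        raise ValueError("string argument exceeds 4096 characters")
    return value


def sanitize_output(text: str) -> str:
    """Drop control characters so terminal escape sequences lose their ESC byte."""
    return _CONTROL_CHARS.sub("", text)
```

An ANSI colour sequence such as ESC followed by '[31m' is reduced to the harmless literal '[31m' once the ESC byte is removed.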

Question 12

What is the Fine-tune Doctor (soup data doctor / soup data lint) in Soup CLI v0.71.27?

Accepted Answer

v0.71.27 'Fine-tune Doctor' is a pure-CPU, zero-GPU pre-flight that kills the top silent fine-tune failures before a single training step — something no competitor (Unsloth, Axolotl, LLaMA-Factory) ships. 'soup data doctor --model ' runs 8 chat-template compatibility checks against the real tokenizer: chat_template present, template renders cleanly, {% generation %} markers present, eos_in_labels (the number-one 'model never stops generating' bug — every assistant turn's trained span must actually contain an EOS/EOT token, checked on every turn not just the last), bos_duplication (template and tokenizer both prepending BOS), system_role support (Mistral-style templates reject a leading system turn), unknown_roles, and truncation_risk (p95 rendered length vs max_length). It uses the same OK/MINOR/MAJOR taxonomy as 'soup diagnose' with exit 0 on OK/MINOR and exit 2 on MAJOR, and '--show-mask N' colours each token trained-versus-masked through the REAL collator path (answer-only / per-message train-field / RAFT span-mask), not a reimplementation, so an assistant-mask bug is visible instantly. 'soup data lint ' is a preference-data linter for dpo/orpo/simpo/ipo/bco/kto that flags length_bias (chosen systematically longer than rejected, the number-one silent DPO degradation, reported as a Cohen's d effect size), label_imbalance (KTO desirable:undesirable ratio), near_duplicates (MinHash/LSH, reusing the dedup kernel), identical_pairs (chosen == rejected, zero preference signal), and prompt_leak (prompt echoed verbatim in the completion); pass '--model' for exact token-length bias instead of word count.
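
The length_bias check reports a Cohen's d effect size, which can be computed as below. This is a generic sketch using whitespace word counts and the pooled sample standard deviation; the function names and the 'chosen' / 'rejected' row keys are assumptions, and it is not Soup's linter.

```python
import math
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference between two samples (pooled sample std)."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("need at least two values per group")
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((len(a) - 1) * var_a + (len(b) - 1) * var_b)
                       / (len(a) + len(b) - 2))
    if pooled == 0:
        raise ValueError("both groups have zero variance")
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def length_bias(rows: Sequence[dict]) -> float:
    """Positive d means chosen responses are systematically longer than rejected ones."""
    chosen = [len(r["chosen"].split()) for r in rows]
    rejected = [len(r["rejected"].split()) for r in rows]
    return cohens_d(chosen, rejected)
```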

Question 13

What is 'soup spectrum scan' and how does Spectrum targeted training work?

Accepted Answer

'soup spectrum scan --model --top-percent 50 [--modules mlp,attn|all] [--output patch.yaml]' (v0.71.23) is native Spectrum targeted training. It streams a model's safetensors shards one tensor at a time — no model load, so peak RAM is the largest single weight matrix and you can scan even a 70B on a CPU box — computes a singular-value SNR per weight matrix with a Marchenko-Pastur noise threshold (arXiv:2406.06623), ranks layers within each module-type group, and prints the top --top-percent as a ready-to-paste 'training.unfrozen_parameters' YAML block. The SNR kernel is pure-numpy and transpose-invariant, so GPT-2 Conv1D weights (c_attn/c_fc/c_proj) are recognised alongside Llama-style names; results cache at '~/.soup/spectrum/.json'. The new schema field 'training.unfrozen_parameters: list[str]' takes regex patterns of parameter names to keep trainable: the SFT trainer freezes every parameter then unfreezes the matched set (a full fine-tune of a subset of layers, LoRA off). It requires task=sft, backend=transformers, modality=text, quantization=none and is mutually exclusive with LoRA features / freeze_layers / freeze_ratio / train_router_only / expand_layers. Patterns are ReDoS-validated (nested-unbounded-quantifier reject, 50k count cap, 512-char cap); Hub downloads route through the SSRF-hardened namespace-pinned loader and matrices above a 2^31-element SVD cap are skipped.
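
How a list of regex patterns selects the trainable subset can be sketched as follows. It is a minimal illustration, not the trainer's code: 'select_trainable' is a hypothetical helper, and matching with a substring search rather than a full match is an assumption.

```python
import re
from typing import Iterable, List


def select_trainable(param_names: Iterable[str], patterns: List[str]) -> List[str]:
    """Return the parameter names matched by at least one pattern.

    In a freeze-then-unfreeze scheme, everything not returned here stays frozen.
    """
    compiled = [re.compile(p) for p in patterns]
    return [name for name in param_names
            if any(rx.search(name) for rx in compiled)]


names = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.11.mlp.up_proj.weight",
]
# Keep only the MLP weights of layers 1 and 11 trainable.
trainable = select_trainable(names, [r"model\.layers\.(1|11)\.mlp\."])
```

Escaping the dots matters: with the literal '\.' after the layer index, the pattern for layer 1 cannot accidentally match layer 11 or 12.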

Question 14

How does 'soup train --reward-hack-detector' detect reward hacking in GRPO/PPO?

Accepted Answer

'soup train --reward-hack-detector info_rm|rm_ensemble [--reward-hack-halt]' (v0.70.0) is an early-warning system for reward hacking in GRPO/PPO runs. Two detectors: 'info_rm' tracks the InfoRM Cluster-Separation Index (Miao et al. 2024, arXiv 2402.09345), a calibrated metric on reward-model embeddings that drops when the policy collapses onto a degenerate reward-maximizing subspace; 'rm_ensemble' tracks the mean pairwise variance across an RM ensemble (cap 32), on the premise that when ensemble members disagree, the policy is exploiting one of them. The math kernels 'compute_cluster_separation', 'compute_rm_ensemble_divergence', and 'classify_hack_signal' are LIVE, with OK/WARN/HACK bands at 0.10 / 0.30 relative drop from the baseline window. '--reward-hack-halt' auto-stops on a HACK verdict (exit 2). Cross-validator: task in {grpo, ppo} only, halt=True requires a detector, and the mlx backend is rejected. Composes with v0.34 'soup why' for anomaly explanation. LIVE as of v0.71.11: the GRPO TrainerCallback now reads per-step rewards via a thread-safe capture buffer, classifies OK/WARN/HACK, logs the verdict, and halts on HACK.
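To make the 0.10 / 0.30 bands concrete, here is a minimal sketch of a relative-drop classifier. The function name 'classify_relative_drop', its signature, and the inclusive boundary handling are assumptions for illustration; this is not Soup's 'classify_hack_signal' kernel.

```python
def classify_relative_drop(baseline: float, current: float,
                           warn_drop: float = 0.10,
                           hack_drop: float = 0.30) -> str:
    """Hypothetical helper: map a metric's relative drop to OK / WARN / HACK."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # Relative drop versus the baseline window; negative means the metric rose.
    drop = (baseline - current) / baseline
    if drop >= hack_drop:      # assumption: boundaries are inclusive
        return "HACK"
    if drop >= warn_drop:
        return "WARN"
    return "OK"
```

With a baseline of 1.0, a current value of 0.85 is a 15% drop and lands in WARN, while 0.6 is a 40% drop and lands in HACK.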

Question 15

What is closed-loop reward-hacking auto-mitigation in Soup CLI v0.71.26?

Accepted Answer

'soup train --reward-hack-mitigation off|log_only|kl_control|pid_lagrangian' (config field 'training.reward_hack_mitigation', v0.71.26) makes a GRPO/PPO trainer DETECT reward hacking mid-run and SELF-CORRECT instead of only halting. No open-source RLHF library (TRL, Unsloth, Axolotl, OpenRLHF, verl) closes the loop from detection to automatic correction to continue — that is the whitespace it fills. Four modes: 'log_only' streams a per-step mitigation_log.jsonl (drop_pct, OK/WARN/HACK verdict, reward mean/std, completion-length trend, n-gram repetition) and provably never touches training dynamics; 'kl_control' is a reversible bang-bang controller with hysteresis that, when a multi-signal vote (InfoRM cluster-separation or RM-ensemble divergence plus length-trend and repetition) trips for a dwell window, raises the KL coefficient beta (clamped to a positive floor and ceiling, never crossing zero, written to both trainer.beta and trainer.args.beta for stock and variant GRPO and to args.kl_coef for PPO) and relaxes it once the signal recovers; 'pid_lagrangian' replaces bang-bang with a PID-Lagrangian controller (Stooke et al. 2020) that holds the hacking signal at a target with integral anti-windup, plus an escalation ladder (raise KL, then roll back to the last-good RL checkpoint, then early-stop with a plain-English give-up explanation via 'soup why'). Anti-gaming hardening adds per-signal EMA/median smoothing, conservative-on-disagreement voting, a reward-distribution-drift guard, and optional bounded reward shaping on the gamed proxy. It ships as proof-of-mechanism ONLY, validated on SmolLM2-135M plus a synthetic length-hacking task on a single RTX 3050 (all four modes live including a real mid-run rollback); PPO ships BETA (mechanism wired and unit-tested, on-GPU proof GRPO-only), and scale validation on 7B+ with real reward models is an open community question tracked in issue #286. +184 tests; the suite is now 16,001 across 313 files.
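The 'kl_control' description (bang-bang, hysteresis, dwell window, clamped beta) can be sketched roughly as follows. Every name and default here ('BangBangKL', the 0.04 starting beta, the trip/release levels) is a hypothetical choice for illustration, and the multi-signal vote is collapsed into a single scalar 'signal'.

```python
from dataclasses import dataclass


@dataclass
class BangBangKL:
    """Illustrative two-state controller; not Soup's implementation."""
    beta: float = 0.04         # current KL coefficient (assumed start value)
    base_beta: float = 0.04    # value to relax back to
    floor: float = 0.01        # beta never drops below this positive floor
    ceiling: float = 1.0       # ...and never exceeds this ceiling
    raise_factor: float = 2.0
    trip: float = 0.30         # signal level that counts as hacking
    release: float = 0.10      # lower level required before relaxing (hysteresis)
    dwell: int = 3             # tripped steps needed since the last recovery
    _hot_steps: int = 0
    _raised: bool = False

    def step(self, signal: float) -> float:
        if signal >= self.trip:
            self._hot_steps += 1
        elif signal <= self.release:
            # Signal recovered: reset the dwell counter and undo the raise.
            self._hot_steps = 0
            if self._raised:
                self.beta = max(self.floor, self.base_beta)
                self._raised = False
        # Between release and trip nothing changes (the hysteresis band).
        if self._hot_steps >= self.dwell and not self._raised:
            raised = self.beta * self.raise_factor
            self.beta = min(self.ceiling, max(self.floor, raised))
            self._raised = True
        return self.beta
```

Feeding the signals 0.5, 0.5, 0.5, 0.2, 0.05 raises beta on the third step, holds it while the signal sits in the band, and relaxes it once the signal falls to the release level.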

Question 16

What is '--uld-strategy' (Universal Logit Distillation) in Soup CLI v0.70?

Accepted Answer

'soup train --uld-strategy wasserstein|topk_align [--uld-top-k N]' (v0.70.0) is cross-tokenizer knowledge distillation (Boizard et al. 2024, arXiv 2402.12030) — Llama→Mistral, Llama→Qwen, no shared vocabulary required. 'wasserstein' computes a 1-D Wasserstein distance over sorted teacher / student logits, no token-level alignment needed (the cheap and robust default). 'topk_align' uses the top-K teacher logits matched via BPE-overlap heuristic alignment (use when you have a good vocab-overlap heuristic and want sharper signal). _MAX_VOCAB_SIZE=262144 covers multilingual SentencePiece + GPT-OSS 200K vocabularies. Gated to task='distill' and rejects mlx backend. LIVE as of v0.71.11 — the distill trainer now computes the real Wasserstein-1 / top-k-aligned loss, clamping teacher ids on a vocab-size mismatch.
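As a rough illustration of the 'wasserstein' strategy, the sketch below compares two distributions over different vocabularies by sorting each one and summing absolute differences. It assumes the comparison is done on softmax probabilities with zero-padding of the shorter vocabulary, which is how I read the ULD paper; 'uld_wasserstein' is a hypothetical name, not Soup's kernel.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uld_wasserstein(teacher_logits, student_logits):
    """Sum of |p_i - q_i| over descending-sorted probabilities.

    Sorting removes any dependence on token identity, so the two
    vocabularies never need to be aligned.
    """
    teacher = sorted(softmax(teacher_logits), reverse=True)
    student = sorted(softmax(student_logits), reverse=True)
    size = max(len(teacher), len(student))
    teacher += [0.0] * (size - len(teacher))   # pad the smaller vocabulary
    student += [0.0] * (size - len(student))
    return sum(abs(p - q) for p, q in zip(teacher, student))
```

Two uniform distributions over 2 and 4 tokens differ by 0.25 on the first two sorted slots and 0.25 on each padded slot, giving a distance of 1.0; a permutation of the same logits gives a distance of zero.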

Question 17

How does '--minillm-enabled' stabilize reverse-KL distillation?

Accepted Answer

'soup train --minillm-enabled [--minillm-teacher-mix-ratio 0.3] [--minillm-length-normalize true] [--minillm-pretrain-anchor-weight 0.1 --minillm-pretrain-anchor-path pre.jsonl]' (v0.70.0) is on-policy reverse-KL distillation (Gu et al. 2024, arXiv 2306.08543) with all three §3 stability tricks bundled — teacher-mixed sampling (mix ratio of teacher samples into the on-policy rollout), length normalisation (per-token KL averaged), and pretrain-loss anchor (regularise toward an anchor distribution at weight α). Cross-validators reject silent no-ops: anchor_weight=0 with anchor_path set, or anchor_weight > 0 with path None. Gated to task='distill'. LIVE as of v0.71.11 — the teacher-mixed, length-normalised reverse-KL term plus the optional pretrain anchor now train end-to-end (the anchor reader is cwd-contained + symlink-rejecting).

Question 18

What does '--rl-checkpoint-save-every-steps' fix that TorchTune punts on?

Accepted Answer

'soup train --rl-checkpoint-save-every-steps N [--rl-checkpoint-keep-last N] [--rl-checkpoint-include-optimizer] [--rl-checkpoint-include-ref-model] [--rl-checkpoint-include-rollout-buffer]' (v0.70.0) adds mid-epoch checkpointing to PPO and GRPO, a feature TorchTune explicitly punts on (their docs link out to 'restart the epoch on crash'). save_every_steps ∈ [1, 10M], keep_last ∈ [1, 100] (oldest pruned). Optional inclusions: optimizer state, reference model snapshot, rollout buffer. Composes with v0.32 spike recovery and v0.40 reference-model regen: when the recovery fires, it now hops to the most recent mid-epoch checkpoint instead of restarting the epoch. LIVE as of v0.71.11: it writes a real adapter + optimizer state + JSON manifest every N steps and prunes to --keep-last.

Question 19

What is 'soup iterative-dpo' and how does it differ from a single DPO run?

Accepted Answer

'soup iterative-dpo --base-model --reward-model --prompts --output-dir --rounds N --pairs-per-round N [--plan-only]' (v0.70.0) drives the iterative-DPO loop: sample from the current policy, score with the reward model, build new preference pairs, retrain DPO, repeat for N rounds. Frozen IterativeDPOPlan with a consecutive-round_index invariant and canonical per-round artifacts ('./out/round-NN/pairs.jsonl', './out/round-NN/adapter') so a crashed run resumes cleanly. '--plan-only' renders the validated plan and exits 0; the live runner shipped in v0.71.11 — each round samples completions from the previous round's adapter, then trains a fresh LoRA from the base on that round's harvested pairs.

Question 20

How does '--echo-trap-enabled' detect multi-turn agent degeneration?

Accepted Answer

'soup train --echo-trap-enabled [--echo-trap-threshold 0.6] [--echo-trap-halt]' (v0.70.0) is the RAGEN-style trajectory-degeneration detector for multi-turn agent RL (Zhu et al. 2025, arXiv 2504.14437). Pure-Python n-gram repetition rate per trajectory + a batch mean — when an agent's rollout collapses into 'echoing itself' (the same n-gram pattern appearing repeatedly within and across turns), this catches it before the reward model rewards the degenerate policy. OK / WARN / TRAP bands at 0.30 / 0.60. DoS caps _MAX_NGRAM_N=32, _MAX_TRAJECTORY_TOKENS=1M, _MAX_BATCH_TRAJECTORIES=100k. Gated to task in {grpo, ppo} non-mlx. Composes with v0.53.11 GRPOStabilityCallback. LIVE as of v0.71.11 — the GRPO callback now scores per-trajectory n-gram repetition, logs the verdict, and halts on TRAP when --echo-trap-halt is set.
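A minimal sketch of a per-trajectory n-gram repetition rate with a batch mean and the 0.30 / 0.60 bands. The exact definition of "repetition rate" is not given above, so the one used here (share of n-grams that duplicate an earlier n-gram) and both function names are assumptions.

```python
def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-grams that repeat an n-gram seen earlier in the trajectory."""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    repeats = len(grams) - len(set(grams))
    return repeats / len(grams)


def classify_batch(trajectories, n=3, warn=0.30, trap=0.60):
    """Return (batch mean rate, verdict) using OK / WARN / TRAP bands."""
    rates = [ngram_repetition_rate(t, n) for t in trajectories]
    mean = sum(rates) / len(rates)
    if mean >= trap:
        verdict = "TRAP"
    elif mean >= warn:
        verdict = "WARN"
    else:
        verdict = "OK"
    return mean, verdict
```

A trajectory of ten identical tokens has eight trigrams, seven of which are repeats, so its rate is 0.875 and a batch containing only that trajectory is classed as TRAP.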

Question 21

What is 'soup build' and how is it dbt-shaped?

Accepted Answer

'soup build <manifest.yaml> [--dry-run]' (v0.69.0) is a dbt-shaped DAG of dataset transforms with incremental materialisation. Closed SUPPORTED_MODEL_KINDS = {incremental, table, view}. Topo-sort via Kahn's algorithm. The 're-tokenise only changed rows' kernel: compute_row_hash (SHA-256 over canonical-JSON, 'id' field excluded) + incremental_diff(prev, new) → {added, changed, removed, unchanged}. DoS caps _MAX_MODELS=256, _MAX_REFS_PER_MODEL=32, _MAX_FILE_BYTES=1 MiB. '--dry-run' validates topology and exits 0. LIVE as of v0.71.6 — run_build materialises datasets with five built-in transforms (identity / drop_empty / lowercase / strip / dedup_exact); 'table' rebuilds, 'view' re-derives, 'incremental' re-transforms only the rows whose content hash changed (SQLite state store keyed by row hash + transform fingerprint). Custom transforms pass per-run via the Python API.
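The "re-tokenise only changed rows" idea can be sketched as a content hash plus a four-way diff. The helpers below ('row_hash', 'diff_rows') are illustrative stand-ins, and the assumption that rows are matched by their 'id' field is mine; the canonical-JSON settings are one reasonable choice rather than Soup's documented ones.

```python
import hashlib
import json


def row_hash(row):
    """SHA-256 over canonical JSON of the row, with the 'id' field excluded."""
    content = {k: v for k, v in row.items() if k != "id"}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_rows(prev, new):
    """Compare two row lists by id and content hash."""
    prev_h = {r["id"]: row_hash(r) for r in prev}
    new_h = {r["id"]: row_hash(r) for r in new}
    shared = [k for k in new_h if k in prev_h]
    return {
        "added": sorted(k for k in new_h if k not in prev_h),
        "removed": sorted(k for k in prev_h if k not in new_h),
        "changed": sorted(k for k in shared if new_h[k] != prev_h[k]),
        "unchanged": sorted(k for k in shared if new_h[k] == prev_h[k]),
    }
```

Only the ids in 'added' and 'changed' would need to be re-transformed; because 'id' is excluded from the hash, renumbering a row does not by itself count as a content change.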

Question 22

What does 'soup expect' assert about a chat dataset?

Accepted Answer

'soup expect ' (v0.69.0, LIVE) is a Great Expectations-style suite for chat data with a closed SUPPORTED_EXPECTATIONS allowlist: expect_no_pii (reuses v0.47 Presidio analyzer), expect_token_length_between (min / max bounds), expect_no_refusal_pattern (reuses v0.56 refusal detector), expect_chosen_preferred_over_rejected_by_judge (reuses v0.19 LLM-judge surface). It walks the 'text', 'content', 'output', 'prompt', 'instruction', 'response' top-level keys + 'messages[].content' nested arrays. _MAX_SUITE_LEN=64. Exit 0 = all expectations passed, 2 = validation rejection (suite shape invalid), 3 = expectations failed. Drop it into CI between 'soup data' and 'soup train'.

Question 23

How do 'soup data gen-magpie' and 'persona-mix' improve synthetic data diversity?

Accepted Answer

'soup data gen-magpie --base --provider ollama|vllm --target N [--quality-filter] [--plan-only]' (v0.69.0, LIVE as of v0.71.6) generates synthetic chat data via the Magpie chat-template-prefix harvest trick — prime the base model with just the assistant chat-template prefix (chatml / llama3 / gemma / mistral auto-detected) and let it complete the prompt itself. Live providers are 'ollama' and 'vllm' raw-completion endpoints (both SSRF-hardened, loopback-only); 'anthropic' is rejected (no raw-completion endpoint). '--quality-filter' drops low-quality rows via the v0.47 scorers; exact-duplicate instructions are de-duplicated. 'soup data persona-mix --prompts --n N --output [--personas ] [--styles ]' (v0.69.0, LIVE) is the Persona-Hub diversity sampler — bundled 12 personas × 5 writing styles by default (BYO Tencent 200k corpus via --personas / --styles). Deterministic by seed (random.Random(seed)); compute_topic_diversity = Shannon entropy over pooled whitespace tokens; atomic JSONL write with 100 MiB / 100k entries per loader caps.

Question 24

How does 'soup data brain-rot' detect AI slop in training data?

Accepted Answer

'soup data brain-rot <data.jsonl> [--strict] [--max-major-fraction 0.25]' (v0.69.0, LIVE) implements the arXiv 2510.13928 brain-rot detector with two pure-Python scorers: score_triviality (token-diversity inversion + '!!' / '??' punctuation runs + low-effort token density + length penalty) and score_popularity_signal (clickbait phrase scan + emoji U+1F300–U+1FAFF density). Worst-signal-wins composition: 1.0 − max(triviality, popularity). Bands match the v0.26 / v0.56 / v0.65 taxonomy: OK ≥ 0.85, MINOR ≥ 0.60, else MAJOR. refuse_if_rotten raises when MAJOR fraction exceeds threshold. '--strict' exits 3 on excessive MAJOR. English-keyword-only in v0.69.0; multilingual lands in v0.69.1.

Question 25

What is 'soup compile' and why does it hedge against prompt engineering winning over fine-tuning?

Accepted Answer

'soup compile --eval [--optimizer mipro|gepa|textgrad|copro|bootstrap_fewshot] [--max-iters N] [--output ] [--plan-only]' (v0.68.0) is a DSPy + GEPA + TextGrad prompt-program compiler — if prompt engineering paradigms keep winning over fine-tuning, you have the compilation surface inside Soup. Hard bound MAX_COMPILE_ITERS=1000. Closed optimiser allowlist of 5. validate_program_path requires '.py' extension, cwd containment, and os.lstat + S_ISLNK rejection via the shared paths.enforce_under_cwd_and_no_symlink helper. CompileResult enforces finite-score (no NaN / ±Inf) and iterations ≥ 0. Module is commands/compile_cmd.py because 'compile' is a Python builtin. LIVE as of v0.71.13 behind the '[compile]' extra (pip install 'soup-cli[compile]') — run_compile dispatches DSPy / GEPA / TextGrad optimisation; '--plan-only' still renders the plan and exits 0.

Question 26

What does 'soup distill-prompt' do?

Accepted Answer

'soup distill-prompt --traces --teacher --student --strategy sft|preference|kl [--output ] [--plan-only]' (v0.68.0) distils prompt-heavy traces (long system prompts, in-context examples) into small fine-tunes that internalise the prompt. Closed strategy set {sft, preference, kl}. Teacher / student IDs are capped at 512 chars; null-byte and empty values are rejected. Composes with the v0.70 cross-tokenizer ULD: distill-prompt picks the dataset shape, ULD bridges teacher / student vocabularies. LIVE as of v0.71.13: prepare_distill_dataset calls the teacher once per trace via the v0.20 providers (Ollama / Anthropic / vLLM) and emits sft/kl messages or preference chosen/rejected pairs.

Question 27

How does 'soup apple-adapter' target Apple FoundationModels?

Accepted Answer

'soup apple-adapter --direction hf-to-mlx|mlx-to-hf|hf-to-apple|mlx-to-apple --output [--sign/--no-sign] [--plan-only]' (v0.68.0) converts adapter formats between HuggingFace safetensors, Apple MLX npz, and Apple FoundationModels (iOS 26+ on-device LLM) adapter blobs, with optional Merkle-root signing. Closed direction allowlist of 4. Explicit S_ISDIR + symlink rejection on source. 'sign' must be a real bool (bool-as-int defence). Reuses v0.60 Part B signing infrastructure and extends the v0.25 MLX backend. Live convert_apple_adapter ships in v0.68.1 (Apple's spec is still moving).

Question 28

What is 'soup local-rl' and how does the personal-LLM flywheel work?

Accepted Answer

'soup local-rl {init, status, record, harvest, train}' (v0.68.0) is a personal-LLM feedback flywheel daemon (all subcommands LIVE; 'train' since v0.71.13). 'soup local-rl init --db ' creates a POSIX 0o600 SQLite DB with interactions + thumbs tables (idempotent CREATE TABLE IF NOT EXISTS). 'soup local-rl record --db --prompt --response --thumb up|down' uses parameterised inserts with 16 KiB prompt + 16 KiB response caps and null-byte rejection. 'soup local-rl status --db ' renders a Rich table of interactions / up / down counters. 'soup local-rl harvest --db ' walks thumbs by ts ASC and emits one DpoPair{prompt, chosen, rejected} per prompt that has both an up and a down (last-write-wins dedup), with an atomic JSONL write via tempfile.mkstemp + os.replace; operators feed this straight into 'soup train --task dpo'. 'soup local-rl train' is LIVE as of v0.71.13: '--once' harvests the latest thumbs DPO pairs and runs a real nightly DPO/KTO/ORPO train via a 'soup train' subprocess (a 'state' table skips re-runs with no new feedback or fewer than --min-pairs); without '--once' it renders an injection-safe systemd .service/.timer + launchd .plist scheduler scaffold.
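The harvest step can be pictured as a single pass over thumbs in timestamp order where later votes overwrite earlier ones. The sketch below works on in-memory tuples rather than the SQLite tables, and 'harvest_pairs' with its tuple layout is a hypothetical shape chosen for illustration.

```python
def harvest_pairs(thumbs):
    """Build DPO pairs from (ts, prompt, response, thumb) tuples.

    thumb is 'up' or 'down'. Rows are walked in ascending timestamp order,
    so the most recent 'up' and most recent 'down' per prompt win.
    """
    latest = {}
    for _ts, prompt, response, thumb in sorted(thumbs, key=lambda r: r[0]):
        latest.setdefault(prompt, {})[thumb] = response
    return [
        {"prompt": prompt, "chosen": votes["up"], "rejected": votes["down"]}
        for prompt, votes in latest.items()
        if "up" in votes and "down" in votes   # need both sides of a preference
    ]
```

A prompt with only thumbs-up feedback produces no pair, which is why a run can legitimately harvest fewer pairs than there are rated prompts.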

Question 29

What is 'soup adapters merge --strategy cmaes' and why is it different from linear / TIES / DARE / SVD?

Accepted Answer

'soup adapters merge --strategy cmaes --adapter --adapter ... --eval --budget 1h' (v0.67.0) runs Sakana-style evolutionary search over LoRA merge weights via pure-Python rank-mu CMA-ES (no 'cma' dependency). The optimiser parameterises N-1 logits, softmaxes them onto the simplex (sum=1, each ≥0), samples a population, keeps the elite half, and plateau-detects after 3 generations without improvement (converged=True). Bounds: 2–16 adapters, population [2, 256], generations [1, 10K], budget [60s, 24h] (reuses v0.57 blame.parse_budget). Operator-supplied eval_fn closure; failures swallowed with a sentinel -1e9 score so one broken eval doesn't crash the run (mirrors v0.40.3 proxy-failure isolation). Linear / TIES / DARE / SVD are still available as fast deterministic baselines; CMA-ES is for when you actually have a measurable eval and want the optimiser to search for you. LIVE as of v0.71.4 — the full CMA-ES loop now merges, materialises, and scores each candidate against the eval suite and writes the best-weighted merge to --output; '--canary ' adds an OK/MINOR/MAJOR verdict and the backdoor-scan + license gates run for cmaes too.

Question 30

What is the VeRA / VB-LoRA vector bank and why does it matter for multi-tenant serving?

Accepted Answer

'soup_cli.utils.vector_bank' (v0.67.0) is the storage format for VeRA / VB-LoRA: a shared random projection matrix P (d_model × d_model) plus a per-user scaling vector v_u (vector_dim floats). A 128-D scaling vector at fp32 is ≈512 bytes per user (128 × 4 bytes) vs. ~30 MB for a rank-16 LoRA on a 7B model, so thousands of per-user adapters cost well under a KB each instead of tens of MB per LoRA. Hosted vendors price by GPU-hour, not adapter count, so they structurally cannot offer these economics. Schema: bank name (kebab-case, ≤128 chars), base_model (≤512 chars), per-entry {user_id (≤256 chars), scaling_vector (floats, finite, no NaN/Inf)}. Bounds: vector_dim [1, 16K], up to 1M entries per bank, 16 MiB file-size cap, atomic JSON I/O via shared paths.atomic_write_text + cwd containment + symlink rejection. estimate_bank_size(num_users, vector_dim) is provided for capacity planning. LIVE as of v0.71.12: 'soup serve --bank <bank.json> [--bank-strength S]' reconstructs the shared projection + per-user scaling vectors and installs a decode-time forward hook; the active user is chosen per request via the 'X-User-Id' header (an unknown/absent id is a zero-delta no-op, so there is no cross-request leak), serving N personas at ~KB-per-user.

Question 31

What is MoLE per-token gating in Soup CLI v0.67?

Accepted Answer

task='moe_lora_routing' (v0.67.0) is Mixture of LoRA Experts: a gating network routes per-token activations to top-K task adapters via softmax over the hidden state. Config: MoleGatingConfig with num_task_adapters [2, 64], hidden_dim [1, 16K], temperature (1e-6, 100.0] finite, top_k ≤ num_task_adapters. Cross-validator rejects the mlx backend. Beyond 64 adapters the per-token softmax becomes the bottleneck; to route across more, make the gating hierarchical. LIVE as of v0.71.12: 'task: moe_lora_routing' with 'mole_task_adapters: [...]' trains a real per-token gating network that blends N frozen task LoRAs (mole_top_k / mole_temperature); only the router trains, and the gate is saved as 'mole_gate.pt' alongside the run.
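For intuition, here is a small sketch of top-K gating for one token: keep the K highest router scores, then softmax over the survivors with a temperature. Whether Soup applies top-K before or after the softmax is not stated above, so that ordering, like the name 'topk_gate', is an assumption.

```python
import math


def topk_gate(scores, top_k, temperature=1.0):
    """Return one blend weight per adapter; non-selected adapters get 0.0."""
    if not 1 <= top_k <= len(scores):
        raise ValueError("top_k must be in [1, number of adapters]")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = order[:top_k]
    peak = max(scores[i] for i in keep)           # subtract the max for stability
    exps = {i: math.exp((scores[i] - peak) / temperature) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]
```

The returned weights sum to 1 and would multiply each frozen adapter's delta for that token; lowering the temperature sharpens the blend toward the single best-scoring adapter.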

Question 32

How do 'soup adapters pr' adapter pull requests work?

Accepted Answer

'soup adapters pr --base-sha <hex> --adapter <path> --eval <eval_delta.json> --samples <sample_diffs.json>' (v0.67.0) renders a GitHub-shaped pull request from an adapter bundle {base SHA, dataset diff, adapter weights, eval-delta report} as review-friendly Markdown with an eval-delta table + per-sample baseline/candidate diffs. Bounds: ≤64 EvalDelta entries, ≤256 sample diffs, per-output cap 32 KiB to keep PRs reviewable. The _md_table_escape function neutralises backslash, pipe, newline, CR, and tab characters in operator-controlled cells against Markdown / Rich injection. JSON output is also available for the v0.68 GitHub Action that will post the PR comment. Composes with 'soup adapters diff' from v0.57.

Question 33

What does 'soup lock' do and how is it different from 'soup env lock'?

Accepted Answer

'soup lock write --base-model --base-sha <64hex> --dataset-sha <64hex> --env-hash <64hex>' (v0.67.0) computes closure_sha = SHA256(base_sha || dataset_sha || env_hash) and writes a JSON soup.lock with soup_version, base_model (≤512 chars), and the three SHAs. The point: commit soup.lock to git alongside soup.yaml so the entire team trains on identical (base, dataset, env). 'soup lock check' compares the 5 content fields and exits 3 on drift; soup_version + created_at are advisory only (legitimate operator upgrades don't trigger drift). 'soup env lock' is v0.64.0 — it produces the env_hash that feeds soup.lock; the two together close the reproducibility chain.
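The closure hash and drift check can be sketched in a few lines. The text above gives the formula but not the byte encoding, so concatenating the three lowercase hex strings as ASCII is an assumption, as are the helper names and my reading of which five fields count as "content fields".

```python
import hashlib

HEX_DIGITS = set("0123456789abcdef")

# Assumed content fields; soup_version and created_at are advisory only.
CONTENT_FIELDS = ("base_model", "base_sha", "dataset_sha", "env_hash", "closure_sha")


def closure_sha(base_sha, dataset_sha, env_hash):
    """SHA-256 over the concatenation base_sha || dataset_sha || env_hash."""
    parts = (base_sha, dataset_sha, env_hash)
    for part in parts:
        if len(part) != 64 or not set(part) <= HEX_DIGITS:
            raise ValueError("expected 64 lowercase hex characters")
    return hashlib.sha256("".join(parts).encode("ascii")).hexdigest()


def drifted_fields(lock, current):
    """List the content fields whose values differ between lock and current."""
    return [f for f in CONTENT_FIELDS if lock.get(f) != current.get(f)]
```

A non-empty result from 'drifted_fields' corresponds to the exit-3 drift case, and differences in 'soup_version' are ignored by construction.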

Question 34

What is 'soup adapters bisect' and how does it work?

Accepted Answer

'soup adapters bisect ... --eval-command "soup eval custom --checkpoint {ckpt}"' (v0.67.0) does binary search over an ordered checkpoint history to find the first regression boundary. The operator supplies a shell template with a {ckpt} placeholder — Soup uses shlex.split after shlex.quote(ckpt), argv-list mode, no shell=True. Probes both endpoints first to short-circuit all-OK or all-broken, then ~log₂(n) midpoint probes. Verdicts: ALL_OK or BROKEN_AT (returns first_broken checkpoint ID). Exit 3 on BROKEN_AT for cron-friendly automation. Input: 2..4096 unique ordered checkpoint IDs. Composes with the v0.66 live influence-blame runner — bisect tells you which checkpoint broke, blame attributes it to specific training rows.
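A sketch of the two pieces described above: turning the template into an argv list without a shell, and the endpoint-first binary search. 'build_argv' and 'bisect_first_broken' are hypothetical names, the probe is abstracted as an 'is_ok' callable, and the search assumes a regression persists in every later checkpoint once introduced.

```python
import shlex


def build_argv(template, ckpt):
    """Substitute a shell-quoted checkpoint id, then split into an argv list."""
    return shlex.split(template.replace("{ckpt}", shlex.quote(ckpt)))


def bisect_first_broken(checkpoints, is_ok):
    """Return the first broken checkpoint id, or None when all are OK."""
    if is_ok(checkpoints[-1]):
        return None                      # ALL_OK: newest checkpoint passes
    if not is_ok(checkpoints[0]):
        return checkpoints[0]            # broken from the very first checkpoint
    lo, hi = 0, len(checkpoints) - 1     # invariant: lo is OK, hi is broken
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_ok(checkpoints[mid]):
            lo = mid
        else:
            hi = mid
    return checkpoints[hi]
```

Because the quoted id survives as a single argv element, a checkpoint name containing spaces or shell metacharacters cannot inject extra arguments; the list could then be handed to 'subprocess.run' without 'shell=True'.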

Question 35

What does 'soup tunability' decide before you pick a base model?

Accepted Answer

'soup tunability --dataset <path> --candidates llama-3.1-8b qwen2.5-7b gemma-3-9b --probe-steps 100 --holdout-size 64' (v0.64.0) runs lightweight LoRA probes on a held-out dataset slice against each candidate base, measures training-loss deltas, and reports the Pareto frontier (best efficiency for cost). The candidate allowlist is closed and ships with licensing metadata (Apache-2.0, MIT, LLaMA-3, etc.). '--plan-only' dry-runs without probing; '--list' catalogues all bundled candidates. Bounds: probe_steps [10, 10000], holdout_size [10, 100000] rows. Output: per-candidate delta (base_loss - probe_loss), wall-clock seconds, estimated USD cost, Pareto frontier membership. Safety: path containment (is_under_cwd), null-byte rejection, symlink-escape rejection on dataset and output paths.
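Pareto-frontier membership over (loss delta, cost) can be computed with a simple dominance check. The tuple layout and the choice of exactly two objectives are assumptions made for this sketch; the real report also carries wall-clock time.

```python
def pareto_frontier(candidates):
    """candidates: list of (name, delta, cost).

    Higher delta (base_loss - probe_loss) is better, lower cost is better.
    A candidate is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, delta, cost in candidates:
        dominated = any(
            d >= delta and c <= cost and (d > delta or c < cost)
            for _other, d, c in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

With candidates (a, 0.5, $1), (b, 0.4, $2) and (c, 0.7, $3), b is dominated by a (worse delta at higher cost), while a and c both stay on the frontier as the cheap and the strong option.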

Question 36

What is 'soup plan' / 'soup apply' and how is it Terraform-shaped?

Accepted Answer

'soup plan --config soup.yaml --state ./soup.tfstate' (v0.64.0) reads a training configuration, computes cost / ETA / peak-VRAM / SHA-256 hashes, and writes an immutable state file. 'soup apply' re-reads the config, detects any drift (changed batch size, dataset SHA, base SHA), and refuses to proceed (exit 3) if drift is found. State shape: plain JSON with 'plan' (cost, ETA, SHA hashes), 'applied' (bool), 'applied_at' (ISO timestamp), 'run_id' (reserved). TOCTOU defense: config read as YAML with os.lstat before open; symlinks explicitly rejected. Peak VRAM calculated with a 10% safety margin; estimated cost + minutes included with spot pricing per GPU tier. Composes with v0.60 license-check for adapter merge and v0.67 soup.lock for the full reproducibility chain.

Question 37

How does 'soup env lock' detect ABI drift?

Accepted Answer

'soup env lock --output ./soup-env.lock' (v0.64.0) snapshots the current Python environment (version, CUDA major version, platform, every installed package) into a JSON lockfile. 'soup env check --lock ./soup-env.lock' compares the current env against the lock and exits 3 on ABI-sensitive drift (e.g. Python minor version change, CUDA major version change) that would invalidate training. Fields: soup_version, python_version, platform, cuda_version (or 'none'), ISO timestamp, per-package {name, version, source}. Atomic write to lock file; reads have a file-size cap (no unbounded YAML). The env_hash from 'soup env lock' feeds the v0.67 soup.lock closure for full team reproducibility.

Question 38

What does 'soup license-advisor' decide for a deploy target?

Accepted Answer

'soup license-advisor --target b2c|defense|embedded --license <id> --monthly-active-users 0' (v0.64.0) recommends license-clean base models for a deployment target. Returns ok (safe), warn (yellow), or block (red; exits 3) with a reason string and recommended/forbidden license lists. Per-license risk scales with expected MAU (e.g., LLAMA-3 OK for <1M MAU, risky above). Composes with the v0.60 license_matrix.check_license_compat gate on 'soup adapters merge' — so a 33-entry SPDX matrix governs adapter merges, while license-advisor governs base-model selection upstream.

Question 39

What new failure modes does 'soup eval behavior' / 'eval capability' / 'eval checklist' add in v0.65?

Accepted Answer

v0.65.0 takes failure-mode coverage from 6 to 10. 'soup eval behavior --battery xstest|harmbench|jailbreakbench|elephant|syceval --evidence ' diffs pre/post-FT on bundled safety / refusal / jailbreak / sycophancy probe sets with OK/MINOR/MAJOR via the v0.26/v0.56 thresholds (≥0.85 OK, ≥0.60 MINOR, else MAJOR). 'soup eval capability --suite full|fast|math|code' emits validated lm-eval-harness task IDs for 7 bundled benchmarks (MMLU-Pro, GPQA, BBEH, AIME, MATH-500, HumanEval+, SWE-bench-Verified). 'soup eval checklist ' runs CheckList-style MFT (minimum functionality, keyword in response) / INV (invariance under paraphrase) / DIR (directional perturbation) tests from a 1 MiB-capped YAML spec with up to 1,000 tests. Evidence files capped at 16 MiB with O_NOFOLLOW open against symlink swap.

Question 40

What does 'soup eval irt-subset' do?

Accepted Answer

'soup eval irt-subset <responses.jsonl> --size full|small|tiny' (v0.65.0) fits a 1-parameter Rasch Item Response Theory model to per-item correctness signals and selects a minimum-cost subset that preserves ranking power. The math: P(correct | ability θ, difficulty β) = σ(θ - β); β_i = -log(p̂_i / (1 - p̂_i)); item information I(β) = σ(-β) · σ(β) which peaks at β≈0 (50/50 items are most discriminating). Profiles: full (100%), small (~30%), tiny (~10%). Input shape: JSONL rows {item_id (str, ≤256 chars, no null bytes), correct (bool), score? (float)} up to 1M rows. Pure-Python math (no numpy/scipy in v0.65.0); 256 MiB file cap against unbounded uploads.
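The Rasch formulas above translate directly into code. In this sketch the subset is chosen by ranking items on their information at ability θ = 0, which is my assumption about the selection rule; the clamping epsilon and the function names are also illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def item_difficulty(p_correct, eps=1e-6):
    """beta_i = -log(p / (1 - p)), with p clamped away from 0 and 1."""
    p = min(max(p_correct, eps), 1.0 - eps)
    return -math.log(p / (1.0 - p))


def item_information(beta):
    """I(beta) = sigmoid(-beta) * sigmoid(beta); largest at beta = 0."""
    return sigmoid(-beta) * sigmoid(beta)


def select_subset(correct_rates, fraction):
    """Keep the most informative fraction of items (at least one)."""
    ranked = sorted(
        correct_rates,
        key=lambda item: item_information(item_difficulty(correct_rates[item])),
        reverse=True,
    )
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]
```

An item answered correctly half the time has difficulty 0 and information 0.25, the maximum, so it is kept ahead of items that nearly everyone passes or fails.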

Question 41

What is 'soup probe sae-diff' and which SAE repos are bundled?

Accepted Answer

'soup probe sae-diff --top-k 20' (v0.66.0) applies a Sparse Autoencoder encoder to pre- and post-FT activation batches, computes per-feature mean changes, and reports the top-K most-changed features. Purely descriptive (no verdict gate); designed for CI logging and model cards. The SAE repo allowlist is closed — Gemma Scope (2B/9B/27B residual-stream), EleutherAI Pythia SAEs, JBloomAus Llama SAEs, OpenAI GPT-2 SAE — and as of v0.71.8 '--auto-download' fetches an allowlisted SAE from the HF Hub into '~/.soup/sae-cache/' (validated against the allowlist BEFORE any network call, via an SSRF-hardened snapshot_download). Input shape: JSON with 'activations': [[...], ...] (2D float32 matrix [num_tokens, hidden_dim]) read via O_NOFOLLOW to prevent symlink swap. Output: SaeFeatureDiffReport with top-K features and per-feature {feature_id, delta, pre_mean, post_mean}. Bounds: top-k [1, 10K], up to 1M SAE features, up to 1M tokens per activation batch, 16 MiB evidence file cap.

Question 42

How does 'soup probe sleeper' work?

Accepted Answer

'soup probe sleeper --evidence ' (v0.66.0) applies a linear defection probe (synthetic, deterministic, keyed on base name for reproducibility) to a 2D activation tensor. Classifies per-token scores as defection or benign, returns flagged rate, and assigns OK/MINOR/MAJOR: ≤1% → OK, ≤5% → MINOR, >5% → MAJOR. Closed allowlist of bundled bases (Llama-3-8B, Gemma-2-9B, etc.) — metadata includes hidden_dim, threshold, description. Activation input: JSON {'activations': [[...], ...]} (2D float32 [num_tokens, hidden_dim]); 1M token cap. No evidence = OK report with 0 tokens (matches v0.56 diagnose neutral-mode policy). 16 MiB evidence cap, symlink rejection.

Question 43

What is 'soup probe interference' and when does it gate CI?

Accepted Answer

'soup probe interference <losses.json>' (v0.66.0) builds an N×N matrix of adapter pairwise compatibility scores from operator-measured losses. Input JSON has 'adapters' list and 'losses' dict with keys like 'a|b' (loss on domain A when both A and B loaded) and 'a|a' (baseline loss with A alone). Formula: score(A→B) = (loss(A_target | A+B) - loss(A_target | A alone)) / loss(A alone). Classification: |score| <5% → OK, <20% → MINOR, ≥20% → MAJOR. Bounds: 2–16 adapters (4–256 pairwise probes); adapter names ≤256 chars and markup-escaped against Rich/Markdown injection before render. Output: InterferenceMatrix with cells and worst-pair summary; exits 2 if worst ≥20% interference (gates CI merge).
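The score formula and bands can be exercised with a small sketch over the 'a|b' key convention. The function names and the returned tuple shape are hypothetical, and the matrix here skips the diagonal since 'a|a' is the baseline by definition.

```python
def interference_score(losses, a, b):
    """Relative change in A's loss when B is loaded alongside it."""
    baseline = losses[f"{a}|{a}"]
    return (losses[f"{a}|{b}"] - baseline) / baseline


def classify(score):
    magnitude = abs(score)
    if magnitude < 0.05:
        return "OK"
    if magnitude < 0.20:
        return "MINOR"
    return "MAJOR"


def interference_matrix(adapters, losses):
    """Return ({(a, b): (score, verdict)}, worst_pair) over off-diagonal cells."""
    cells = {}
    for a in adapters:
        for b in adapters:
            if a != b:
                score = interference_score(losses, a, b)
                cells[(a, b)] = (score, classify(score))
    worst = max(cells, key=lambda pair: abs(cells[pair][0]))
    return cells, worst
```

Note that the matrix is not symmetric: loading B may hurt A's domain badly while A barely affects B's, which is why both directions are probed.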

Question 44

How does 'soup ingest' pull traces from observability vendors?

Accepted Answer

'soup ingest --source langfuse|langsmith|helicone|openpipe|otel|openai-stored --logs --output ' (v0.63.0) normalises any observability export into one TraceRecord schema (trace_id / prompt / output / source / signal / metadata as a MappingProxyType so callers can't mutate). No network calls — the operator exports from the SaaS dashboard, then 'soup ingest' parses the file. Six sources at launch: LANGFUSE_KEY for Langfuse, LANGSMITH_API_KEY for LangSmith, HELICONE_API_KEY for Helicone, OPENPIPE_API_KEY for OpenPipe, OTEL_EXPORTER_OTLP_HEADERS for OpenTelemetry OTLP, OPENAI_API_KEY for OpenAI Stored Completions. PII warning prints once per invocation. Output feeds directly into the v0.26 'soup data from-traces' preference-pair builder and the v0.58 'soup loop' HarvestFn — closing the production-traces → training loop without paying per-trace fees to any vendor.

Question 45

What does 'soup prune-prompt' do?

Accepted Answer

'soup prune-prompt --input --output --min-frequency 0.95' (v0.63.0) detects the longest shared system-prompt prefix across all rows via binary search over up to 32 candidate lengths and strips it. By removing the shared boilerplate (e.g., 'You are a helpful AI assistant...'), the fine-tuned model internalizes it into its weights instead of repeating it on every turn — OpenPipe's signature trick at the data layer. Two-pass file read; capped at 100k rows. v0.63 fixed an O(N²) early-exit bug from the prototype detect_common_prefix function.
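One way to picture the prefix detection is a binary search over prefix length, accepting a length when a single prefix of that size covers at least 'min_frequency' of the rows. This is a sketch of that idea rather than Soup's 'detect_common_prefix'; how the tool combines '--min-frequency' with the search is my assumption.

```python
from collections import Counter


def shared_prefix(prompts, min_frequency=0.95):
    """Longest prefix shared by at least min_frequency of the prompts."""
    if not prompts:
        return ""
    need = min_frequency * len(prompts)

    def dominant(length):
        # Most common prefix of this exact length, if it is common enough.
        counts = Counter(p[:length] for p in prompts if len(p) >= length)
        if not counts:
            return None
        prefix, count = counts.most_common(1)[0]
        return prefix if count >= need else None

    # If a prefix of length L qualifies, so does every shorter one,
    # so the predicate is monotone and binary search applies.
    lo, hi = 0, max(len(p) for p in prompts)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if dominant(mid) is not None:
            lo = mid
        else:
            hi = mid - 1
    return dominant(lo) or ""
```

Each probe scans the rows once, so the whole search costs a logarithmic number of passes in the longest prompt length instead of one pass per candidate length.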

Question 46

What is 'soup ab' and how is it different from a fixed-N A/B test?

Accepted Answer

'soup ab --input --metric latency|judge_score|retry_rate --alpha 0.05 --beta 0.20 --effect-size 0.1' (v0.63.0) runs Wald's Sequential Probability Ratio Test (SPRT). The likelihood ratio is a martingale under the null, so Type-I error is controlled at every stopping time: you can peek at the results at any sample count and decide reject_h0 / accept_h0 / continue without breaking statistical guarantees. The SPRT log-likelihood ratio is compared against A = log((1-β)/α) and B = log(β/(1-α)); crossing A rejects the null, crossing B accepts it, and anything in between means keep sampling. v0.63 ships a CRITICAL fix for a sign error in the historical mSPRT implementation. Three metrics at launch: latency, judge_score, retry_rate. Input rows: {arm: 'control'|'treatment', : }.
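To show the stopping rule in action, here is a deliberately simplified one-sample SPRT on Bernoulli outcomes (for example, whether a request was retried). It is not the two-arm test the command runs: the rates 'p0' and 'p1', the function name, and the single-stream input are all assumptions for illustration.

```python
import math


def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.20):
    """Wald SPRT for H0: rate = p0 versus H1: rate = p1.

    Returns (decision, samples_used) where decision is one of
    'reject_h0', 'accept_h0', or 'continue'.
    """
    upper = math.log((1 - beta) / alpha)      # A: cross it to reject H0
    lower = math.log(beta / (1 - alpha))      # B: cross it to accept H0
    llr = 0.0
    for n, hit in enumerate(outcomes, start=1):
        if hit:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject_h0", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(outcomes)
```

With p0 = 0.1 and p1 = 0.3, three consecutive hits are enough to cross the upper boundary, while an unbroken run of misses needs seven samples to cross the lower one; the asymmetry comes from how much evidence each outcome carries.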

Question 47

What does 'soup drift-alarm' watch in production?

Accepted Answer

'soup drift-alarm --reference --live --threshold 0.2 [--slack-url ... --discord-url ...]' (v0.63.0) computes KL divergence between token-distribution snapshots at fine-tune time (reference) and production (live). Catches both behavioral drift ('model now outputs JSON when it didn't before') and vocabulary drift ('same 20 phrases on repeat'). Whitespace tokenization by default; pluggable tokenizers ship in v0.63.1. Optional Slack / Discord webhooks fire on drift, SSRF-validated: loopback HTTP only, RFC1918 / link-local / cloud-metadata IPs (169.254.0.0/16, 100.64.0.0/10, 198.18.0.0/15) rejected. Exit code 3 on drift for cron-friendly automation. Input rows: {token: '', log_prob: }.
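A minimal sketch of a KL-based drift score using whitespace tokenization. It works on raw text rather than the {token, log_prob} rows the command reads, and the direction KL(live || reference), the additive smoothing constant, and the name 'kl_drift' are all assumptions.

```python
import math
from collections import Counter


def kl_drift(reference_texts, live_texts, smoothing=1e-6):
    """KL(live || reference) over whitespace-token frequencies."""
    ref = Counter(tok for text in reference_texts for tok in text.split())
    live = Counter(tok for text in live_texts for tok in text.split())
    vocab = set(ref) | set(live)
    if not vocab:
        return 0.0
    # Additive smoothing keeps unseen tokens from producing log(0).
    ref_total = sum(ref.values()) + smoothing * len(vocab)
    live_total = sum(live.values()) + smoothing * len(vocab)
    kl = 0.0
    for tok in vocab:
        p = (live[tok] + smoothing) / live_total
        q = (ref[tok] + smoothing) / ref_total
        kl += p * math.log(p / q)
    return kl
```

A caller would compare the result to the threshold (0.2 in the example invocation) and treat anything above it as drift; a live stream dominated by a token the reference never saw produces a very large score.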

Question 48

How does Soup CLI handle GDPR right-to-be-forgotten?

Accepted Answer

Soup CLI v0.61.0 ships 'task: unlearn' as a first-class trainer with three methods: NPO (Negative Preference Optimization — DPO-shaped negative-only loss, needs reference model + retain set), SimNPO (length-normalized NPO, no reference model needed — faster on long sequences), and RMU (Representation Misdirection Unlearning — residual-stream noise on forget inputs, best for concept-level removal). Config: set 'task: unlearn', 'training.unlearn_method: npo|simnpo|rmu', 'training.unlearn_alpha' (0.0–10.0, default 0.5, retain-set weight), 'data.forget_set' (required JSONL or HF dataset ID), 'data.retain_set' (optional, strongly recommended for NPO). Verify with 'soup eval unlearning <run-id> --benchmark tofu|muse|wmdp' which scores Forget Quality / Model Utility / PrivLeak and emits an OK / MINOR / MAJOR verdict (MUSE + WMDP fixtures bundled since v0.71.1; WMDP forget-set probes ship redacted). The trainer is LIVE as of v0.71.9 — covariance-free NPO / length-normalised SimNPO / RMU representation-steering kernels run end-to-end, warning when run without a retain set.

Question 49

Can I patch a single fact in a model without retraining?

Accepted Answer

Yes. Soup CLI v0.61.0 ships 'soup edit set --base --method rome|memit|alphaedit --subject "" --target ""' for surgical fact patching. Three methods: ROME (Rank-One Model Editing, targets MLP layers — Meng 2022), MEMIT (Mass-Edit Memory in a Transformer — Meng 2023), and AlphaEdit (ROME projected orthogonal to the down-proj's top singular direction). LIVE as of v0.71.9 — covariance-free rank-1 weight-edit kernels apply the update and optionally save (on a tiny model a ROME edit moved P('Lyon' | 'The capital of France is') from 0.0016 → 0.96); a SQLite EditGovernor enforces the per-base edit cap + norm-blowup refusal across separate runs. Live GRACE codebook (epsilon-ball decode-time substitution) ships alongside. 'soup edit diff registry://before_id registry://after_id --probes probes.jsonl --top-k 10' compares two Registry entries and surfaces which facts changed with a citation visualizer. The Sequential Edit Governor caps per-base-model edits (default 10) and rejects edits that would amplify weight norms beyond threshold — preventing cascading drift across many sequential edits.
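
The idea behind a covariance-free rank-1 edit can be shown on a plain matrix. The sketch below assumes the simplest form, W' = W + (v - W k) kᵀ / (kᵀ k), which forces the layer to map a key vector k to a target value v while leaving every direction orthogonal to k untouched. It is a toy illustration with hypothetical names, not the ROME, MEMIT, or AlphaEdit kernels themselves.

```python
def matvec(weights, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def rank_one_edit(weights, key, target):
    """Return W' = W + (target - W key) key^T / (key^T key), so that W' key == target."""
    norm_sq = sum(x * x for x in key)
    if norm_sq == 0:
        raise ValueError("key must be non-zero")
    residual = [t - y for t, y in zip(target, matvec(weights, key))]
    return [
        [w + r * x / norm_sq for w, x in zip(row, key)]
        for row, r in zip(weights, residual)
    ]
```

Because the update is an outer product scaled by 1 / (kᵀ k), its norm grows when the residual is large relative to the key, which is the kind of blow-up a norm threshold like the governor's would catch.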

Question 50

What is RAFT and how does Soup CLI implement it?

Accepted Answer

RAFT (Retrieval Augmented Fine-Tuning, UC Berkeley 2024) teaches a model to answer queries from retrieved context that mixes golden and distractor documents. Soup CLI v0.62.0 ships 'data.format: raft' with per-row schema {query, golden_doc, distractor_docs[], answer}. Pair with 'training.ra_dit_stage: retriever|generator' for the RA-DIT two-stage pipeline: stage 1 trains a contrastive sentence-transformer ('ra_dit_retriever_model: sentence-transformers/all-MiniLM-L6-v2' default), stage 2 SFTs the main model on RAFT rows. LIVE as of v0.71.10: RAFT span-mask training masks the prompt to -100 and labels each doc '[doc-N]' so the model learns to cite the supporting document (reproducible shuffle via 'data.raft_shuffle_seed'), and the 'soup ra-dit' one-shot orchestrator trains retriever then generator in a single command, recording the trained retriever as the generator's paired retriever (auto-linked from the Registry when unset). Add 'training.citation_faithful: true' + 'citation_style: bracket|inline|footnote' + 'citation_recall_threshold: 0.85' to enforce source attribution at training time; the final save is refused if recall falls below threshold. Three new recipes shipped: raft-llama3-8b, ra-dit-retriever, ra-dit-llama3-8b.
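
The span-mask step can be sketched as follows. The row schema and the -100 mask value come from the description above; the prompt template, the toy whitespace tokenizer, and the function name are assumptions made only for illustration.

```python
import random

IGNORE_INDEX = -100  # label value that the loss skips

def build_raft_example(row, vocab, seed=0):
    """Return (input_ids, labels) for one RAFT row using a toy whitespace tokenizer."""
    docs = [row["golden_doc"], *row["distractor_docs"]]
    random.Random(seed).shuffle(docs)  # reproducible document order
    context = "\n".join(f"[doc-{i}] {doc}" for i, doc in enumerate(docs, start=1))
    prompt = f"{context}\nQuestion: {row['query']}\nAnswer:"  # hypothetical template

    def encode(text):
        # Assign each new token the next free id in the shared vocabulary.
        return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

    prompt_ids = encode(prompt)
    answer_ids = encode(row["answer"])
    input_ids = prompt_ids + answer_ids
    # Only the answer tokens contribute to the loss; the prompt is masked out.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels
```

Fixing the seed makes the position of the golden document repeatable across runs, which is what a setting like 'data.raft_shuffle_seed' is for.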

Question 51

Can I steer a model's behavior at inference time without fine-tuning?

Accepted Answer

Yes. Soup CLI v0.62.0 ships 'soup steer train --base --method caa|iti|repe --name --pairs ' for training activation steering vectors, then 'soup steer apply --name --strength ' to apply them at decode time (|strength| ≤ 10 enforced). Three methods: CAA (Contrastive Activation Addition, adds a learned vector to the residual stream), ITI (Inference-Time Intervention, shifts specific attention heads), and RepE (Representation Engineering, PCA-based direction extraction). Pairs JSONL: {positive: '', negative: ''}. Steering vectors register as the new 'steering_vector' artifact kind in the v0.26 Registry. LIVE as of v0.71.10: 'soup steer train' fits the CAA/ITI/RepE control vector from {positive, negative} pairs and persists a safetensors + config artifact, and 'soup serve --steer --steer-strength ' applies it at decode time via a forward hook. Use case: nudge safety, formality, or domain focus without a full SFT/DPO run.
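
As a rough picture of the CAA variant, the steering vector is the difference between mean activations on positive and negative examples, and applying it adds a scaled copy to the hidden state. The sketch below works on plain Python lists and assumes the activations have already been extracted at one layer. The function names are hypothetical, and only the |strength| ≤ 10 bound comes from the text above.

```python
MAX_STRENGTH = 10.0  # bound on |strength| described above

def mean_vector(rows):
    """Column-wise mean of equally sized activation vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def caa_vector(positive_acts, negative_acts):
    """Contrastive direction: mean(positive) minus mean(negative)."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]

def apply_steering(hidden, vector, strength):
    """Add the scaled steering vector to one hidden state."""
    if abs(strength) > MAX_STRENGTH:
        raise ValueError("strength must satisfy |strength| <= 10")
    return [h + strength * v for h, v in zip(hidden, vector)]
```

In a real model the addition would happen inside a forward hook on the chosen layer; a negative strength pushes the model away from the "positive" behaviour.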

Question 52

Does Soup CLI generate compliance docs for procurement / regulators?

Accepted Answer

Yes (v0.59.0 'Governance & Provenance'). 'soup bom emit --name --version --base-model --base-sha --config-sha --task --license --format cyclonedx|spdx|both' produces machine-learning Bills of Material in CycloneDX 1.6 (with the ML-BOM extension) and SPDX 2.3 AI-profile formats — both, in one shot. 'soup attest emit --stage extract|train|eval|export|publish --subject --sha --builder --invocation --sign unsigned|ed25519|sigstore' writes SLSA-3 in-toto attestations per stage. 'soup audit-log tail/rotate' keeps a HIPAA/SOC2-compliant JSONL audit log at ~/.soup/audit.jsonl with PII redaction. 'soup train --annex-xi --repro-receipt ' emits EU AI Act Annex XI/XII documentation plus SR 11-7 reproducibility receipts (every seed, kernel version, library version, dataset hash). All atomic writes (tempfile.mkstemp + os.replace) with symlink-rejection TOCTOU defense.
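
The atomic-write pattern named at the end of that answer looks roughly like this. It is a generic sketch of tempfile.mkstemp plus os.replace with a symlink check, written to illustrate the technique. The helper name is hypothetical and Soup's own checks may be stricter.

```python
import os
import tempfile

def atomic_write_text(path: str, text: str) -> None:
    """Write text so readers see either the old file or the new one, never a partial file."""
    if os.path.islink(path):
        raise OSError(f"refusing to write through symlink: {path}")
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the destination directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because os.replace renames over the destination name rather than opening it, a symlink planted between the check and the rename is replaced, not followed.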

Question 53

How does Soup CLI prevent supply-chain attacks on LoRA adapters?

Accepted Answer

Six controls in v0.60.0. (1) 'soup adapters scan ' spectral-analyses LoRA weights for backdoors: rank-1 dominance (warn at 50×, fail at 200×), top-singular-value energy concentration (warn > 75%, fail > 95%), Frobenius outliers (warn > 4σ, fail > 8σ), NaN/Inf. Exit 0/1/3 for OK/WARN/FAIL — pure numpy, no torch. (2) 'soup adapters sign ' computes a Merkle root over all files and writes .soup-signature.json — real ed25519 detached signatures are LIVE as of v0.71.2 via the '[sign]' extra ('--backend ed25519 --key ' or '--generate-key'; 'soup adapters verify --public-key ' does a cryptographic verify, failing closed on any tamper/wrong key). Sigstore keyless stays infra-blocked. (3) 'soup adapters verify --strict' refuses to load on signature mismatch. (4) 'soup adapters check-safetensors --strict' refuses pickle / .bin / .pt — the single biggest LoRA attack vector. (5) 'soup adapters merge --license [--license ] --license-override ""' enforces SPDX-license compatibility via a 33-entry matrix. (6) 'soup airgap-bundle --model --dataset ... --wheel ... --kernel ... --bundle-size-cap 100' packs everything for one-way data-diode transfer, ≤100 GiB cap, signed manifest. Plus namespace-pin TOFU: the first pull of an HF repo pins its owner SHA in a local SQLite cache so a hijacked org cannot silently swap your base model.
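
To show why a Merkle root over the adapter files detects any single-file tamper, here is a small sketch. The leaf encoding (file name plus content hash), the domain-separation prefixes, and the duplicate-last-node rule for odd levels are assumptions. The layout Soup actually writes to .soup-signature.json is not specified above.

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> str:
    """Hex Merkle root over byte leaves, duplicating the last node on odd levels."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()

def adapter_merkle_root(files: dict[str, bytes]) -> str:
    """Root over {relative_path: content}, with paths sorted so ordering is stable."""
    leaves = [
        name.encode("utf-8") + b"\x00" + hashlib.sha256(files[name]).digest()
        for name in sorted(files)
    ]
    return merkle_root(leaves)
```

A signature over this single root then covers every file: changing one byte in any file, or renaming a file, yields a different root and a failed verify.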

Question 54

What is 'soup loop' and how does the production data flywheel work?

Accepted Answer

'soup loop' (v0.58.0) runs the full production data flywheel from a single CLI: harvest production traces → distill into preference pairs → eval-gated DPO train → canary deploy → auto-rollback on regression. 'soup loop init --eval --baseline registry:// --monthly-budget 50usd --max-runs-per-day 3' creates an atomic '.soup/loop.yaml' state file. 'soup loop watch' runs as a long-running daemon (SIGTERM/SIGINT safe, reloads state every iteration so external pause/resume takes effect immediately). 'soup loop canary --traffic 5% --autoroll-on-regress' splits traffic via deterministic SHA-256 hash routing (±0.01% split granularity, threading-locked BucketStats); the verdict uses the v0.26 Quant-Lobotomy OK/MAJOR thresholds with a 30-sample minimum. 'soup loop pause' / 'soup loop resume' flip atomic status without killing the daemon. 'soup loop replay []' walks per-iteration manifests at '.soup-loops//iteration.json'. Budget guardrails enforce a daily-cap → estimate-sanity → monthly-budget check order; budget-skipped iterations produce no manifests. Every iteration is recorded as a frozen IterationRecord with gate verdict, canary verdict, shipped flag, rolled-back flag, and estimated cost.
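
Deterministic hash routing of the kind described for the canary step can be sketched like this. The 10,000-bucket count is inferred from the stated 0.01% granularity. The choice of request id as the routing key and the function name are assumptions.

```python
import hashlib

BUCKETS = 10_000  # one bucket per 0.01% of traffic

def route(request_id: str, canary_percent: float) -> str:
    """Send a stable slice of traffic to the canary based on a SHA-256 bucket."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    canary_buckets = round(canary_percent * 100)  # e.g. 5% -> 500 buckets
    return "canary" if bucket < canary_buckets else "stable"
```

The same id always lands in the same arm, so a user does not flip between models mid-session, and raising the percentage only ever moves ids from stable to canary.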

Question 55

What does 'soup advise' decide before you fine-tune?

Accepted Answer

'soup advise --goal ""' (v0.54.0) is a pre-flight decision engine that classifies the task (keyword + structural signals — tool_calls → tool_use, '' → reasoning, chat messages → input-extraction) across 7 task categories, profiles the dataset (row_count, avg input/output chars, type-token diversity, label variance, has_chosen_rejected, has_reasoning_traces — capped at 2,000-row sample), and emits a rubric verdict among PROMPT_ENG / RAG / SFT / DPO / GRPO. Heuristics: preference pairs → DPO; reasoning + ≥500 rows → GRPO; <50 rows → PROMPT_ENG; high-variance factual → RAG; default → SFT. Optional '--probe' runs a 100-step LoRA probe to put real numbers on each ROI estimate. '--record' appends an entry to '~/.soup/advise_history.jsonl' so future verdicts learn across projects (atomic, file-locked, 16 MiB cap). 'soup advise explain' prints the full rubric. 'soup advise compare ' compares two candidate datasets. Outputs the literal next command ('soup autopilot --data … --task sft') so the handoff is one paste away.
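
The heuristics listed above amount to a short decision cascade. The sketch below encodes them in the order they are stated; that precedence is an assumption, as are the 'high_variance_factual' field and the class and function names.

```python
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    row_count: int
    has_chosen_rejected: bool = False
    has_reasoning_traces: bool = False
    high_variance_factual: bool = False  # hypothetical summary flag

def advise_verdict(profile: DatasetProfile) -> str:
    """Map a dataset profile to one of PROMPT_ENG / RAG / SFT / DPO / GRPO."""
    if profile.has_chosen_rejected:
        return "DPO"  # preference pairs present
    if profile.has_reasoning_traces and profile.row_count >= 500:
        return "GRPO"
    if profile.row_count < 50:
        return "PROMPT_ENG"  # too little data to justify training
    if profile.high_variance_factual:
        return "RAG"
    return "SFT"
```

For example, a 30-row dataset with no preference pairs falls through to PROMPT_ENG, while the same 30 rows with chosen/rejected columns would be routed to DPO under this ordering.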

Question 56

How does 'soup eval design' derive an eval suite from your data?

Accepted Answer

'soup eval design --goal "..."' (v0.55.0) uses TF-IDF salience over your training data (10,000-row DoS-capped subsample) plus goal-keyword dispatch to draft a goal-conditioned eval suite with up to 5 dimensions and a scorer per dimension chosen from {exact_match, regex, judge, rlvr}: 'json' / 'code' / 'math' keywords → rlvr; 'classify' → exact_match; 'extract' → regex; default → judge. 'soup eval discover ' runs greedy farthest-first Jaccard-distance clustering (10,000-row cap) to surface held-out canaries + adjacent-skill probes + 25%-prefix memorization probes. 'soup eval lock ' freezes the suite as a SHA-256-checksummed 'eval_suite' artifact (canonical-JSON for stable hashes; registered in the v0.26 registry alongside the new 'canaries' kind). 'soup eval coverage --task ' does gap analysis against the v0.54.0 task taxonomy (each task has a recommended scorer set — e.g. reasoning → rlvr/judge, format_conversion → regex/rlvr). 'soup eval gate-install --baseline ' writes a '.git/hooks/pre-push' that paired-bootstraps a regression verdict against the baseline (configurable n_samples in [100, 100_000], ci_level in (0, 1), direction-aware metric handling) and blocks the push on regression.
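
Greedy farthest-first selection over Jaccard distance, the technique named for 'soup eval discover', can be sketched as follows. Token sets from lowercase whitespace splitting and the choice of the first row as the seed are assumptions. The function returns row indices that are maximally spread out.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |a ∩ b| / |a ∪ b|, with two empty sets treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def farthest_first(texts, k):
    """Pick up to k indices, each time taking the row farthest from everything chosen so far."""
    token_sets = [set(text.lower().split()) for text in texts]
    if not token_sets or k <= 0:
        return []
    chosen = [0]
    # dist[i] is the distance from row i to its nearest chosen row.
    dist = [jaccard_distance(s, token_sets[0]) for s in token_sets]
    while len(chosen) < min(k, len(texts)):
        nxt = max(range(len(texts)), key=dist.__getitem__)
        if dist[nxt] == 0.0:
            break  # every remaining row duplicates a chosen one
        chosen.append(nxt)
        dist = [min(d, jaccard_distance(s, token_sets[nxt])) for d, s in zip(dist, token_sets)]
    return chosen
```

Each pass costs one distance per row, so a row cap like the 10,000 mentioned above keeps the total work bounded at k × rows distance computations.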

Question 57

What does 'soup diagnose' check in a trained model?

Accepted Answer

'soup diagnose ' (v0.56.0) is a post-training model report card that runs 6 pure-function failure-mode probes and rolls them up into an OK/MINOR/MAJOR verdict (thresholds: ≥0.85 OK, ≥0.60 MINOR, else MAJOR). The probes: (1) forgetting — per-task Δ accuracy with tolerance band, extending the v0.25 forgetting baseline; (2) refusal — advbench / xstest delta over caller-supplied generators, 8,192-row scan cap; (3) format — JSON / regex / tool-call validity over RLVR verifiers with explicit ReDoS probe ('a' × 128); (4) mode_collapse — pairwise n-gram-Jaccard distance over K completions (k ∈ [2,32], ngram_n ∈ [1,8]); (5) memorization — training-prefix echo via partial-prompt continuation (1,000-row scan); (6) contamination — n-gram overlap with public benchmarks (combined-complexity cap rejects when |training| × |benchmark| > 1e9). Outputs JSON + a 6-cell SVG badge ('utils/diagnose/badge.py::render_badge_svg', html-escaped) that you can embed in model cards. '--attach-to-registry ' attaches it as the v0.56.0 'diagnose_report' artifact kind. Pair with 'soup train --diagnose-gate ' to refuse the final save on MAJOR regression (exits typer.Exit(code=2)).
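
As one example of a pure-function probe, here is a sketch of the mode_collapse measurement and the verdict thresholds quoted above. Treating the probe score as the mean pairwise n-gram Jaccard distance is an assumption, and the function names are illustrative. The k and ngram_n bounds and the 0.85 / 0.60 cut-offs come from the text.

```python
from itertools import combinations

def ngrams(text: str, n: int) -> set:
    """Set of whitespace-token n-grams in a completion."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def mode_collapse_score(completions, ngram_n=2):
    """Mean pairwise n-gram Jaccard distance: 0.0 means identical outputs, 1.0 fully distinct."""
    if not 2 <= len(completions) <= 32:
        raise ValueError("k must be in [2, 32]")
    if not 1 <= ngram_n <= 8:
        raise ValueError("ngram_n must be in [1, 8]")
    distances = []
    for a, b in combinations(completions, 2):
        grams_a, grams_b = ngrams(a, ngram_n), ngrams(b, ngram_n)
        union = grams_a | grams_b
        distances.append(1.0 - len(grams_a & grams_b) / len(union) if union else 0.0)
    return sum(distances) / len(distances)

def verdict(score: float) -> str:
    """Roll a probe score up into OK / MINOR / MAJOR."""
    if score >= 0.85:
        return "OK"
    if score >= 0.60:
        return "MINOR"
    return "MAJOR"
```

A model that returns the same completion K times scores 0.0 and lands on MAJOR, which is the case a '--diagnose-gate' would refuse to save.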

Question 58

What is 'soup adapters' and how is it like git for LoRA?

Accepted Answer

'soup adapters' (v0.57.0) ships 6 subcommands that treat LoRA adapters as first-class versioned objects. 'soup adapters diff ' computes per-layer Frobenius norm + relative drift + SVD effective rank — pure numpy, no torch, '.bin' rejected with an actionable 're-save as safetensors' message. 'soup adapters merge [c...] -o --strategy linear|ties|dare|svd' implements four pure-numpy merge strategies: linear weighted average, TIES (Yadav 2023 — trim by density / elect majority sign / disjoint average), DARE (Yu 2024 — Bernoulli drop with rescale + deterministic seed), and SVD low-rank reconstruction. 'soup adapters blame --dataset --layer --budget 5m --shards 4' plans leave-one-out layer ablation given a wall-clock budget ('60s'/'5m'/'2h' parser, bounds [60s, 24h], 30s minimum per shard); the live runner ships in v0.57.1. 'soup adapters branch -c --base --dataset ' names a LoRA config with SHA-256-pinned config + dataset + base-model hashes ('^[A-Za-z0-9][A-Za-z0-9._\-]{0,127}$' regex, 1 MiB config cap, 1,024-pointer cap, atomic POSIX 0o600). 'soup adapters checkout -o ' refuses to restore on SHA mismatch (drift detection). 'soup adapters branches' lists everything. Env override 'SOUP_BRANCHES_DIR' is containment-checked to $HOME / $CWD / $TMPDIR.
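
Of the four merge strategies, DARE is the easiest to show in a few lines: randomly drop entries of each adapter's delta, rescale the survivors by 1 / (1 - drop_rate) so the expected value is preserved, then average. The sketch below works on flat lists and uses Python's random module for the seeded Bernoulli mask. It stands in for the numpy implementation and is not a copy of it, and deriving per-adapter seeds as seed + index is an assumption.

```python
import random

def dare_sparsify(delta, drop_rate, seed=0):
    """Drop each entry with probability drop_rate and rescale the rest."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)  # deterministic mask for a given seed
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else value * scale for value in delta]

def dare_merge(deltas, drop_rate, seed=0):
    """Sparsify each adapter delta with its own derived seed, then average element-wise."""
    sparse = [dare_sparsify(delta, drop_rate, seed + i) for i, delta in enumerate(deltas)]
    return [sum(column) / len(sparse) for column in zip(*sparse)]
```

With drop_rate 0 the function reduces to a plain linear average, which makes it easy to sanity-check against the 'linear' strategy.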

Question 59

What is the Soup CLI registry?

Accepted Answer

The registry is a local SQLite-backed catalog of every fine-tune at ~/.soup/registry.db. Each entry stores the config, eval baseline, and parent lineage. You push runs with 'soup registry push --run-id --name --tag ', visualize the DAG with 'soup history ', diff two versions with 'soup registry diff', and promote a version to prod with a tag. Eval gates can reference 'registry://' as a baseline to catch regressions.

Question 60

What is Soup Autopilot?

Accepted Answer

Autopilot is a zero-config decision engine introduced in Soup CLI v0.25.0. You pass 'soup autopilot --model --data --goal chat' and Soup profiles the dataset, model, and GPU, then picks task, quantization, PEFT rank, batch size, learning rate, epochs, and max length — generating a soup.yaml with every choice justified.

Question 61

Can Soup CLI auto-push checkpoints to HuggingFace Hub?

Accepted Answer

Yes. Soup ships deep HuggingFace Hub integration. 'soup train --push-as user/my-model' uploads every save_steps checkpoint to HF Hub as a 'checkpoint-<N>' branch. Pair with '--hf-resume' to pull the latest branch and keep going after a spot-instance preemption. Set 'HF_ENDPOINT=https://hf.internal.example.com' to route to a self-hosted Hub (SSRF-hardened: loopback HTTP only, RFC1918 IPs rejected). 'soup deploy hf-space' creates a Gradio or Streamlit Space wrapping your model in one command.

Question 62

What is the unified preference dispatcher in Soup CLI v0.40?

Accepted Answer

v0.40.0 'Preference Variety' adds 'task: preference' as a unified dispatcher. Set 'training.preference_loss: dpo|simpo|orpo|ipo|bco' to pick the loss without renaming the task — making hyperparameter sweeps over the loss type itself trivial. Legacy 'task: dpo' / 'task: simpo' / etc. remain first-class. The release also ships BCO (Binary Classifier Optimization) as a new trainer with 'task: bco', two opt-in DPO controls (beta annealing via 'dpo_beta_schedule' + periodic reference-model refresh via 'dpo_ref_regen_epochs'), and a multi-objective preference-loss schema ('preference_loss_weights') that validates 2–5 entries summing to 1. Since v0.53.11 the multi-objective path is fully live: 'attach_weighted_preference_combine' computes per-batch DPO / IPO / SimPO / ORPO terms from the four TRL logprob tensors and combines them via 'combine_losses(terms, weights)', replacing the v0.40.1 primary-loss scaling shim.
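
The weighted combination can be pictured as below. The name 'combine_losses(terms, weights)' appears in the answer above, but this body is only a guess at its shape: it validates the 2 to 5 entries summing to 1 rule and returns the weighted sum of already-computed scalar loss terms. The non-negativity check and the tolerance are assumptions.

```python
import math

def validate_weights(weights: dict[str, float]) -> None:
    """Enforce the schema rule: 2 to 5 entries that sum to 1."""
    if not 2 <= len(weights) <= 5:
        raise ValueError("preference_loss_weights needs 2 to 5 entries")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")  # assumed constraint
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-6):
        raise ValueError("weights must sum to 1")

def combine_losses(terms: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-batch loss terms, e.g. {'dpo': ..., 'ipo': ...}."""
    validate_weights(weights)
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"no loss term computed for: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)
```

In training the terms would be tensors rather than floats, but the arithmetic is the same, so gradients flow to every loss with a non-zero weight.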

Question 63

What quantization formats does Soup CLI support?

Accepted Answer

Seventeen formats as of v0.53.0 'Quant Menu II'. Set via 'training.quantization' in soup.yaml: 4bit (bitsandbytes default), 8bit, none (fp16/bf16/full), gptq, awq, hqq:1bit … hqq:8bit (8 sub-variants), aqlm (extreme 2-bit), eetq (8-bit fast kernel SM75+), mxfp4, fp8 (training Hopper+), and bitnet_1.58 (v0.52.0). On top of that, v0.53.0 added a 14-entry Unsloth Dynamic 2.0 GGUF ladder (UD-Q8_K_XL … UD-IQ1_M), 12-entry IQ family, 10-entry Apple/ARM-friendly GGUF set, KV-cache types ('training.kv_cache_type: q8_0|bf16|f16|fp8' — FP8 Hopper-only), FP8 attention, NVFP4 (Blackwell), explicit 'unsloth_bnb_4bit', BNB double-quant. As of v0.53.1 every writer is live: 'soup merge --save-format 4bit | 4bit_forced' performs a single-shot BNB merge without dequant/requant; 'soup export --format torchao --quant-config' covers Int4WeightOnly / Int8DynActInt4 / Float8DynActFloat8 / NVFP4; 'soup export --format gguf-ud' runs the 3-stage llama.cpp imatrix pipeline. 'soup train' runs check_quant_distributed_compat() at startup to flag FSDP / ZeRO-3 incompatibilities before training begins.

Question 64

What is Multipack in Soup CLI?

Accepted Answer

Multipack (v0.37.0) is Soup's largest single throughput win on chat fine-tuning over uneven-length data. Instead of padding every sample to max_length, it uses First-Fit-Decreasing bin packing to group variable-length samples — eliminating padding waste. Set 'training.multipack: true' in soup.yaml. 18-architecture allowlist (Llama 3.x, Qwen 2/3, Mistral, Gemma 2/3, Phi 3/4, DeepSeek V2/V3, Mixtral, Falcon, StableLM, SmolLM2). Unknown architectures fail loudly at config-load. SFT / Pretrain only on the transformers backend.
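
First-Fit-Decreasing itself is short enough to show in full. This sketch packs sample lengths into bins of capacity max_length and returns the sample indices per bin. It illustrates the algorithm only and leaves out the attention-mask and position-id handling that a real packed batch needs.

```python
def first_fit_decreasing(lengths, max_length):
    """Pack samples into as few max_length bins as FFD manages; returns index lists per bin."""
    bins = []  # each entry: [remaining_capacity, [sample indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        size = lengths[i]
        if size > max_length:
            raise ValueError(f"sample {i} is longer than max_length")
        for slot in bins:
            if slot[0] >= size:  # first bin with room wins
                slot[0] -= size
                slot[1].append(i)
                break
        else:
            bins.append([max_length - size, [i]])
    return [slot[1] for slot in bins]
```

For lengths [5, 3, 7, 2, 8, 1, 4] and max_length 10, padding each sample separately would use seven sequences, while this packing fits all 30 tokens into three full bins.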

Battery	What it catches
`xstest`	Over-refusal on benign prompts
`harmbench`	Jailbreak resistance
`jailbreakbench`	Jailbreak prompt-pair contrasts
`elephant`	Sycophancy / opinion-shifting
`syceval`	Sycophantic alignment to user

Eval Depth (v0.65.0)

`soup eval behavior` — pre/post safety diff

`soup eval capability` — lm-eval-harness task surface

`soup eval checklist` — MFT / INV / DIR DSL

`soup eval irt-subset` — Rasch IRT cost-cut
