Question 1

What's new in the latest Soup CLI release?

Accepted Answer

v0.70.0 'loop hardening' protects the training loop from the failure modes that cost a real GPU-hour. It adds six surfaces, schema-first today with live callbacks landing in v0.70.1.

'--reward-hack-detector info_rm|rm_ensemble [--reward-hack-halt]' is a reward-hacking early-warning for GRPO/PPO: 'info_rm' tracks the InfoRM Cluster-Separation Index (Wang et al. 2024, arXiv 2402.09345); 'rm_ensemble' tracks mean pairwise variance across an RM ensemble (cap 32). The math kernels ('compute_cluster_separation', 'compute_rm_ensemble_divergence', 'classify_hack_signal') are LIVE with OK/WARN/HACK bands at 0.10 / 0.30 relative drop; '--reward-hack-halt' auto-stops on a HACK verdict. Composes with v0.34 'soup why' for anomaly explanation.

'--uld-strategy wasserstein|topk_align [--uld-top-k N]' adds cross-tokenizer Universal Logit Distillation (Boizard et al. 2024, arXiv 2402.12030) for pairs such as Llama→Mistral and Llama→Qwen, with no shared vocab needed; 'wasserstein' is 1-D Wasserstein over sorted logits with no alignment, 'topk_align' uses top-K teacher logits via BPE overlap. _MAX_VOCAB_SIZE=262144 covers multilingual SentencePiece + GPT-OSS 200K.

'--minillm-enabled' is on-policy reverse-KL distillation (Gu et al. 2024, arXiv 2306.08543) with all three stability tricks from §3 bundled: teacher-mixed sampling, length-normalisation, and pretrain-loss anchor. Cross-validators reject silent no-ops (anchor_weight=0 with anchor_path set, anchor_weight > 0 with path None).

'--rl-checkpoint-save-every-steps N [--rl-checkpoint-keep-last N] [--rl-checkpoint-include-optimizer] [--rl-checkpoint-include-ref-model] [--rl-checkpoint-include-rollout-buffer]' adds mid-epoch checkpointing for PPO/GRPO (TorchTune explicitly punts on this); save_every_steps ∈ [1, 10M], keep_last ∈ [1, 100]. Composes with v0.32 spike recovery and v0.40 ref-model regen: recovery hops to the most recent mid-epoch ckpt instead of restarting the epoch.

'soup iterative-dpo --base-model --reward-model --prompts --output-dir --rounds N --pairs-per-round N [--plan-only]' is the iterative DPO loop driver (sample → RM-score → re-pair → retrain) with a frozen IterativeDPOPlan, consecutive-round_index invariant, and canonical per-round artifacts ('./out/round-NN/pairs.jsonl', './out/round-NN/adapter').

'--echo-trap-enabled [--echo-trap-threshold 0.6] [--echo-trap-halt]' is the RAGEN-style trajectory-degeneration detector for multi-turn agent RL (Zhu et al. 2025, arXiv 2504.14437): pure-Python n-gram repetition rate per trajectory + batch mean, OK/WARN/TRAP bands at 0.30 / 0.60.

v0.69.0 'data engineering pro' turns dataset prep into a first-class engineering discipline. 'soup build [--dry-run]' is a dbt-shaped DAG of dataset transforms with incremental materialisation, closed SUPPORTED_MODEL_KINDS = {incremental, table, view}, topo-sort via Kahn's algorithm, SHA-256 row hashing, and an incremental_diff(prev, new) → {added, changed, removed, unchanged} kernel that re-tokenises only changed rows. DoS caps _MAX_MODELS=256, _MAX_REFS_PER_MODEL=32, _MAX_FILE_BYTES=1 MiB. 'soup expect ' (LIVE) is a Great-Expectations suite for chat data with a closed SUPPORTED_EXPECTATIONS set: expect_no_pii (v0.47 Presidio), expect_token_length_between, expect_no_refusal_pattern (v0.56 detector), expect_chosen_preferred_over_rejected_by_judge (v0.19 judge); exit 0 = pass, 2 = validation rejection, 3 = suite failure.

'soup data gen-magpie --base --provider ollama|anthropic|vllm --target N' generates synthetic data via chat-template-prefix harvest. 'soup data persona-mix --prompts --n N --output ' (LIVE) is a Persona-Hub diversity sampler with 12 personas × 5 styles bundled (BYO Tencent 200k via --personas / --styles), Shannon-entropy topic diversity, deterministic by seed. 'soup data brain-rot [--strict] [--max-major-fraction 0.25]' (LIVE) is the arXiv 2510.13928 brain-rot detector: two pure-Python scorers (score_triviality + score_popularity_signal), worst-signal-wins, OK ≥ 0.85 / MINOR ≥ 0.60 / else MAJOR (matches v0.26/v0.56/v0.65 bands); '--strict' exits 3 on excessive MAJOR.

v0.68.0 'anti-trend insurance' is a 5-bet hedge against paradigm shifts. 'soup compile --eval [--optimizer mipro|gepa|textgrad|copro|bootstrap_fewshot] [--max-iters N]' is a DSPy + GEPA + TextGrad prompt-program compiler; MAX_COMPILE_ITERS=1000, closed optimiser allowlist of 5. 'soup distill-prompt --traces --teacher --student --strategy sft|preference|kl' distils long-system-prompt traces into small fine-tunes, and composes with v0.70 ULD when both are live. 'soup compile-tools --eval [--optimizer textgrad|gepa]' optimises tool descriptions via textual gradients; composes with v0.46 Agent Forge. 'soup apple-adapter --direction hf-to-mlx|mlx-to-hf|hf-to-apple|mlx-to-apple --output [--sign]' converts adapter formats between HF safetensors, MLX npz, and Apple FoundationModels with optional Merkle-root signing (reuses v0.60 Part B signing). 'soup local-rl {init, status, record, harvest}' (LIVE) is a personal-LLM feedback flywheel daemon: POSIX 0o600 SQLite DB with interactions + thumbs tables, prompt/response 16 KiB caps, and JSONL DpoPair output from harvest that feeds straight into 'soup train --task dpo'; the live nightly-train scheduler ships in v0.68.1.

Together v0.68 + v0.69 + v0.70 land +803 tests (11,021 → 11,824) across 17 new test files (251 → 268). Totals: 11,824 tests, 23 training tasks, 18 data formats, 17 quantization formats, 116 recipes; Python 3.9+, Apache-2.0.

The three releases before those: v0.65 'eval depth' takes the failure-mode count from 6 to 10 with 'soup eval behavior' (xstest / harmbench / jailbreakbench / elephant / syceval), 'soup eval capability' (validated lm-eval-harness task IDs for MMLU-Pro / GPQA / BBEH / AIME / MATH-500 / HumanEval+ / SWE-bench-Verified), 'soup eval checklist' (CheckList MFT/INV/DIR DSL), and 'soup eval irt-subset' (Rasch IRT cost-cut). v0.66 'post-train x-rays' opens mechanistic interpretability for every fine-tune: 'soup probe sae-diff' (closed allowlist of Gemma Scope / EleutherAI Pythia / JBloomAus Llama / OpenAI GPT-2 SAEs), 'soup probe sleeper' (calibrated linear defection probe), 'soup probe interference' (N×N catastrophic-interference matrix, exits 2 on worst ≥20%), 'soup probe pack', plus live influence-function blame. v0.67 'adapter lifecycle finish' adds 'soup adapters merge --strategy cmaes --eval --budget 1h' (Sakana-style evolutionary search via pure-Python rank-mu CMA-ES, no 'cma' dep), 'soup_cli.utils.vector_bank' (VeRA / VB-LoRA storage at ~512 bytes per user vs. ~30 MB per LoRA), task='moe_lora_routing' (MoLE per-token gating with MoleGatingConfig num_task_adapters [2,64]), 'soup adapters pr --base-sha <hex> --adapter <path> --eval <json>' (GitHub-shaped PR markdown), 'soup lock write/show/check' (SHA256(base_sha || dataset_sha || env_hash) closure, exit 3 on drift), and 'soup adapters bisect <ckpt1> ... --eval-command "..."' (binary search over checkpoint history, ~log₂(n) probes, exit 3 on BROKEN_AT, composes with v0.66 influence-blame).

Question 2

How does 'soup train --reward-hack-detector' detect reward hacking in GRPO/PPO?

Accepted Answer

'soup train --reward-hack-detector info_rm|rm_ensemble [--reward-hack-halt]' (v0.70.0) is an early-warning system for reward hacking in GRPO/PPO runs. Two detectors: 'info_rm' tracks the InfoRM Cluster-Separation Index (Wang et al. 2024, arXiv 2402.09345) — a calibrated metric on reward-model embeddings that drops when the policy collapses onto a degenerate reward-maximizing subspace; 'rm_ensemble' tracks the mean pairwise variance across an RM ensemble (cap 32) — when ensemble members disagree, the policy is exploiting one of them. Math kernels 'compute_cluster_separation', 'compute_rm_ensemble_divergence', and 'classify_hack_signal' are LIVE with OK/WARN/HACK bands at 0.10 / 0.30 relative drop from the baseline window. '--reward-hack-halt' auto-stops on HACK verdict (exit 2). Cross-validator: task in {grpo, ppo} only, halt=True requires detector, rejects mlx backend. Composes with v0.34 'soup why' for anomaly explanation. Live HF Trainer callback ships in v0.70.1; today the schema is validated and the kernels are unit-tested end-to-end.
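The banding described above can be illustrated with a small sketch. This is not Soup's 'classify_hack_signal' kernel; the function names, the inclusive band edges, and the choice to treat a rise as a zero drop are all assumptions made for illustration.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Fractional drop of `current` below `baseline` (0.0 if it did not drop)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - current) / baseline)


def classify_drop(baseline: float, current: float,
                  warn_at: float = 0.10, hack_at: float = 0.30) -> str:
    """Map a relative drop onto OK / WARN / HACK (band edges assumed inclusive)."""
    drop = relative_drop(baseline, current)
    if drop >= hack_at:
        return "HACK"
    if drop >= warn_at:
        return "WARN"
    return "OK"
```

With a baseline of 1.0, a current value of 0.85 lands in WARN and 0.60 lands in HACK; a halting wrapper would stop the run only on the latter.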

Question 3

What is '--uld-strategy' (Universal Logit Distillation) in Soup CLI v0.70?

Accepted Answer

'soup train --uld-strategy wasserstein|topk_align [--uld-top-k N]' (v0.70.0) is cross-tokenizer knowledge distillation (Boizard et al. 2024, arXiv 2402.12030) — Llama→Mistral, Llama→Qwen, no shared vocabulary required. 'wasserstein' computes a 1-D Wasserstein distance over sorted teacher / student logits, no token-level alignment needed (the cheap and robust default). 'topk_align' uses the top-K teacher logits matched via BPE-overlap heuristic alignment (use when you have a good vocab-overlap heuristic and want sharper signal). _MAX_VOCAB_SIZE=262144 covers multilingual SentencePiece + GPT-OSS 200K vocabularies. Gated to task='distill' and rejects mlx backend. Live projection module ships in v0.70.1; the validation surface is live today.
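To make the 'wasserstein' idea concrete, here is a minimal pure-Python sketch of a sorted-distribution distance between two vocabularies of different size. It is an illustration only, not Soup's projection module: it assumes logits are softmaxed first, sorted in descending order, zero-padded to a common length, and compared with an L1 sum. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sorted_l1_distance(teacher_logits: Sequence[float],
                       student_logits: Sequence[float]) -> float:
    """L1 gap between descending-sorted, zero-padded probability vectors."""
    t = sorted(softmax(teacher_logits), reverse=True)
    s = sorted(softmax(student_logits), reverse=True)
    size = max(len(t), len(s))
    t += [0.0] * (size - len(t))   # pad the smaller vocabulary with zeros
    s += [0.0] * (size - len(s))
    return sum(abs(a - b) for a, b in zip(t, s))
```

Because both vectors are sorted before comparison, token identity never enters the loss, which is why no vocabulary alignment is needed.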

Question 4

How does '--minillm-enabled' stabilize reverse-KL distillation?

Accepted Answer

'soup train --minillm-enabled [--minillm-teacher-mix-ratio 0.3] [--minillm-length-normalize true] [--minillm-pretrain-anchor-weight 0.1 --minillm-pretrain-anchor-path pre.jsonl]' (v0.70.0) is on-policy reverse-KL distillation (Gu et al. 2024, arXiv 2306.08543) with all three §3 stability tricks bundled — teacher-mixed sampling (mix ratio of teacher samples into the on-policy rollout), length normalisation (per-token KL averaged), and pretrain-loss anchor (regularise toward an anchor distribution at weight α). Cross-validators reject silent no-ops: anchor_weight=0 with anchor_path set, or anchor_weight > 0 with path None. Gated to task='distill'. The live callback ships in v0.70.1; the config schema + cross-validator are live today.

Question 5

What does '--rl-checkpoint-save-every-steps' fix that TorchTune punts on?

Accepted Answer

'soup train --rl-checkpoint-save-every-steps N [--rl-checkpoint-keep-last N] [--rl-checkpoint-include-optimizer] [--rl-checkpoint-include-ref-model] [--rl-checkpoint-include-rollout-buffer]' (v0.70.0) adds mid-epoch checkpointing to PPO and GRPO, a feature TorchTune explicitly punts on (their docs link out to 'restart the epoch on crash'). save_every_steps ∈ [1, 10M], keep_last ∈ [1, 100] (oldest pruned). Optional inclusions: optimizer state, reference model snapshot, rollout buffer. Composes with v0.32 spike recovery and v0.40 reference-model regen: when recovery fires, it now hops to the most recent mid-epoch checkpoint instead of restarting the epoch. Live save_state / load_state ships in v0.70.1.

Question 6

What is 'soup iterative-dpo' and how does it differ from a single DPO run?

Accepted Answer

'soup iterative-dpo --base-model --reward-model --prompts --output-dir --rounds N --pairs-per-round N [--plan-only]' (v0.70.0) drives the iterative-DPO loop: sample from the current policy, score with the reward model, build new preference pairs, retrain DPO, repeat for N rounds. Frozen IterativeDPOPlan with a consecutive-round_index invariant and canonical per-round artifacts ('./out/round-NN/pairs.jsonl', './out/round-NN/adapter') so a crashed run resumes cleanly. '--plan-only' renders the validated plan and exits 0; the live runner (subprocess 'soup train --task dpo --resume' between rounds) ships in v0.70.1.

Question 7

How does '--echo-trap-enabled' detect multi-turn agent degeneration?

Accepted Answer

'soup train --echo-trap-enabled [--echo-trap-threshold 0.6] [--echo-trap-halt]' (v0.70.0) is the RAGEN-style trajectory-degeneration detector for multi-turn agent RL (Zhu et al. 2025, arXiv 2504.14437). Pure-Python n-gram repetition rate per trajectory + a batch mean — when an agent's rollout collapses into 'echoing itself' (the same n-gram pattern appearing repeatedly within and across turns), this catches it before the reward model rewards the degenerate policy. OK / WARN / TRAP bands at 0.30 / 0.60. DoS caps _MAX_NGRAM_N=32, _MAX_TRAJECTORY_TOKENS=1M, _MAX_BATCH_TRAJECTORIES=100k. Gated to task in {grpo, ppo} non-mlx. Composes with v0.53.11 GRPOStabilityCallback. Live callback ships in v0.70.1; the math kernel is live today.
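A rough sketch of an n-gram repetition rate with the bands above follows. The exact definition Soup uses is not given here, so this assumes "rate = share of n-grams that duplicate an n-gram already present in the same trajectory"; all function names are hypothetical.

```python
from typing import List, Sequence


def ngram_repetition_rate(tokens: Sequence[str], n: int = 3) -> float:
    """Share of n-grams in one trajectory that are repeats of another n-gram."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    distinct = len({tuple(tokens[i:i + n]) for i in range(total)})
    return (total - distinct) / total


def classify_rate(rate: float, warn_at: float = 0.30, trap_at: float = 0.60) -> str:
    """OK / WARN / TRAP banding (band edges assumed inclusive)."""
    if rate >= trap_at:
        return "TRAP"
    if rate >= warn_at:
        return "WARN"
    return "OK"


def batch_mean_rate(trajectories: List[Sequence[str]], n: int = 3) -> float:
    """Mean repetition rate across a batch of trajectories."""
    if not trajectories:
        return 0.0
    return sum(ngram_repetition_rate(t, n) for t in trajectories) / len(trajectories)
```

A trajectory that loops "a b c" three times scores 4/7 under this definition (WARN), while ten copies of one token score 7/8 (TRAP).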

Question 8

What is 'soup build' and how is it dbt-shaped?

Accepted Answer

'soup build <manifest.yaml> [--dry-run]' (v0.69.0) is a dbt-shaped DAG of dataset transforms with incremental materialisation. Closed SUPPORTED_MODEL_KINDS = {incremental, table, view}. Topo-sort via Kahn's algorithm. The 're-tokenise only changed rows' kernel: compute_row_hash (SHA-256 over canonical-JSON, 'id' field excluded) + incremental_diff(prev, new) → {added, changed, removed, unchanged}. DoS caps _MAX_MODELS=256, _MAX_REFS_PER_MODEL=32, _MAX_FILE_BYTES=1 MiB. '--dry-run' validates topology and exits 0 (LIVE). Live run_build materialiser (DuckDB / SQLite backend, live transform-resolver registry) ships in v0.69.1.
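The row-hash plus diff idea can be sketched as below. The names mirror the kernels mentioned above, but the signatures, the canonical-JSON settings, and the assumption that every row carries an 'id' key are guesses for illustration rather than Soup's actual code.

```python
import hashlib
import json
from typing import Any, Dict, List


def compute_row_hash(row: Dict[str, Any]) -> str:
    """SHA-256 over canonical JSON of the row, ignoring the 'id' field."""
    payload = {k: v for k, v in row.items() if k != "id"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def incremental_diff(prev: List[Dict[str, Any]],
                     new: List[Dict[str, Any]]) -> Dict[str, list]:
    """Bucket row ids into added / changed / removed / unchanged."""
    old = {row["id"]: compute_row_hash(row) for row in prev}
    cur = {row["id"]: compute_row_hash(row) for row in new}
    return {
        "added": sorted(i for i in cur if i not in old),
        "changed": sorted(i for i in cur if i in old and old[i] != cur[i]),
        "removed": sorted(i for i in old if i not in cur),
        "unchanged": sorted(i for i in cur if i in old and old[i] == cur[i]),
    }
```

Only the ids in "added" and "changed" would need re-tokenising; everything in "unchanged" can reuse the previous materialisation.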

Question 9

What does 'soup expect' assert about a chat dataset?

Accepted Answer

'soup expect ' (v0.69.0, LIVE) is a Great Expectations suite for chat data with a closed SUPPORTED_EXPECTATIONS allowlist: expect_no_pii (reuses v0.47 Presidio analyzer), expect_token_length_between (min / max bounds), expect_no_refusal_pattern (reuses v0.56 refusal detector), expect_chosen_preferred_over_rejected_by_judge (reuses v0.19 LLM-judge surface). Walks 'text', 'content', 'output', 'prompt', 'instruction', 'response' top-level keys + 'messages[].content' nested arrays. _MAX_SUITE_LEN=64. Exit 0 = all expectations passed, 2 = validation rejection (suite shape invalid), 3 = expectations failed. Drop into CI between 'soup data' and 'soup train'.

Question 10

How do 'soup data gen-magpie' and 'persona-mix' improve synthetic data diversity?

Accepted Answer

'soup data gen-magpie --base --provider ollama|anthropic|vllm --target N [--plan-only]' (v0.69.0) generates synthetic chat data via the Magpie chat-template-prefix harvest trick — prime the base model with just the assistant chat-template prefix and let it complete the prompt itself. Reuses the v0.20 provider stack. 'soup data persona-mix --prompts --n N --output [--personas ] [--styles ]' (v0.69.0, LIVE) is the Persona-Hub diversity sampler — bundled 12 personas × 5 writing styles by default (BYO Tencent 200k corpus via --personas / --styles). Deterministic by seed (random.Random(seed)); compute_topic_diversity = Shannon entropy over pooled whitespace tokens; atomic JSONL write with 100 MiB / 100k entries per loader caps.
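As an illustration of the diversity metric, Shannon entropy over pooled whitespace tokens can be computed as follows. The function name and the choice of log base 2 (bits) are assumptions; the source only states that the metric is Shannon entropy.

```python
import math
from collections import Counter
from typing import Iterable


def topic_diversity(texts: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the pooled whitespace-token distribution."""
    counts = Counter(token for text in texts for token in text.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Four equally frequent tokens give 2.0 bits, while a corpus that repeats a single token gives 0, so higher values indicate a more varied sample.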

Question 11

How does 'soup data brain-rot' detect AI slop in training data?

Accepted Answer

'soup data brain-rot <data.jsonl> [--strict] [--max-major-fraction 0.25]' (v0.69.0, LIVE) implements the arXiv 2510.13928 brain-rot detector with two pure-Python scorers: score_triviality (token-diversity inversion + '!!' / '??' punctuation runs + low-effort token density + length penalty) and score_popularity_signal (clickbait phrase scan + emoji U+1F300–U+1FAFF density). Worst-signal-wins composition: 1.0 − max(triviality, popularity). Bands match the v0.26 / v0.56 / v0.65 taxonomy: OK ≥ 0.85, MINOR ≥ 0.60, else MAJOR. refuse_if_rotten raises when MAJOR fraction exceeds threshold. '--strict' exits 3 on excessive MAJOR. English-keyword-only in v0.69.0; multilingual lands in v0.69.1.
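The worst-signal-wins composition and banding can be sketched like this. The two scorers themselves are not reproduced; the inputs are assumed to be already-computed triviality and popularity values in [0, 1], and the helper names (other than the band thresholds quoted above) are hypothetical.

```python
from typing import List


def quality_score(triviality: float, popularity: float) -> float:
    """Worst-signal-wins: the stronger rot signal determines the score."""
    return 1.0 - max(triviality, popularity)


def band(score: float) -> str:
    """OK >= 0.85, MINOR >= 0.60, otherwise MAJOR."""
    if score >= 0.85:
        return "OK"
    if score >= 0.60:
        return "MINOR"
    return "MAJOR"


def major_fraction(scores: List[float]) -> float:
    """Share of rows that fall in the MAJOR band."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if band(s) == "MAJOR") / len(scores)


def is_rotten(scores: List[float], max_major_fraction: float = 0.25) -> bool:
    """True when the MAJOR share exceeds the allowed fraction."""
    return major_fraction(scores) > max_major_fraction
```

In a strict mode, a True result from the last check is the condition that would map to a non-zero exit.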

Question 12

What is 'soup compile' and why does it hedge against prompt engineering winning over fine-tuning?

Accepted Answer

'soup compile --eval [--optimizer mipro|gepa|textgrad|copro|bootstrap_fewshot] [--max-iters N] [--output ] [--plan-only]' (v0.68.0) is a DSPy + GEPA + TextGrad prompt-program compiler — if prompt engineering paradigms keep winning over fine-tuning, you have the compilation surface inside Soup. Hard bound MAX_COMPILE_ITERS=1000. Closed optimiser allowlist of 5. validate_program_path requires '.py' extension, cwd containment, and os.lstat + S_ISLNK rejection via the shared paths.enforce_under_cwd_and_no_symlink helper. CompileResult enforces finite-score (no NaN / ±Inf) and iterations ≥ 0. Module is commands/compile_cmd.py because 'compile' is a Python builtin. Live run_compile orchestrator ships in v0.68.1.

Question 13

What does 'soup distill-prompt' do?

Accepted Answer

'soup distill-prompt --traces --teacher --student --strategy sft|preference|kl [--output ] [--plan-only]' (v0.68.0) distils prompt-heavy traces — long system prompts, in-context examples — into small fine-tunes that internalise the prompt. Closed strategy set {sft, preference, kl}. Teacher / student IDs capped at 512 chars; null-byte + empty rejected. Composes with v0.70 cross-tokenizer ULD when both are live — distill-prompt picks the dataset shape, ULD bridges teacher / student vocabularies. Live prepare_distill_dataset (tokeniser bridge + dataset prep) ships in v0.68.1.

Question 14

How does 'soup apple-adapter' target Apple FoundationModels?

Accepted Answer

'soup apple-adapter --direction hf-to-mlx|mlx-to-hf|hf-to-apple|mlx-to-apple --output [--sign/--no-sign] [--plan-only]' (v0.68.0) converts adapter formats between HuggingFace safetensors, Apple MLX npz, and Apple FoundationModels (iOS 26+ on-device LLM) adapter blobs, with optional Merkle-root signing. Closed direction allowlist of 4. Explicit S_ISDIR + symlink rejection on source. 'sign' must be a real bool (bool-as-int defence). Reuses v0.60 Part B signing infrastructure and extends the v0.25 MLX backend. Live convert_apple_adapter ships in v0.68.1 (Apple's spec is still moving).

Question 15

What is 'soup local-rl' and how does the personal-LLM flywheel work?

Accepted Answer

'soup local-rl {init, status, record, harvest, train}' (v0.68.0) is a personal-LLM feedback flywheel daemon (LIVE except 'train'). 'soup local-rl init --db ' creates a POSIX 0o600 SQLite DB with interactions + thumbs tables (idempotent CREATE TABLE IF NOT EXISTS). 'soup local-rl record --db --prompt --response --thumb up|down' parameterised inserts with 16 KiB prompt + 16 KiB response caps and null-byte rejection. 'soup local-rl status --db ' renders a Rich table of interactions / up / down counters. 'soup local-rl harvest --db ' walks thumbs by ts ASC and emits one DpoPair{prompt, chosen, rejected} per prompt with both an up and down (last-writes-win dedup), atomic JSONL write via tempfile.mkstemp + os.replace — operators feed this straight into 'soup train --task dpo'. 'soup local-rl train --backend ollama|mlx --model [--train-method dpo|kto|orpo]' (systemd / launchd cron glue) ships in v0.68.1.
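The harvest step can be sketched without the SQLite layer. This assumes feedback arrives as (ts, prompt, response, thumb) tuples and that "last-writes-win" means the most recent up and the most recent down per prompt are kept; the function names are hypothetical. The atomic write uses the tempfile.mkstemp + os.replace pattern named above.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str, str]  # (ts, prompt, response, thumb)


def harvest_pairs(events: Iterable[Event]) -> List[Dict[str, str]]:
    """One pair per prompt that has both an 'up' and a 'down' response."""
    latest: Dict[str, Dict[str, str]] = {}
    for _ts, prompt, response, thumb in sorted(events, key=lambda e: e[0]):
        latest.setdefault(prompt, {})[thumb] = response  # later rows overwrite
    return [
        {"prompt": p, "chosen": v["up"], "rejected": v["down"]}
        for p, v in latest.items()
        if "up" in v and "down" in v
    ]


def atomic_write_jsonl(path: str, rows: List[Dict[str, str]]) -> None:
    """Write to a temp file in the target directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Creating the temp file in the same directory as the target keeps the final os.replace on one filesystem, which is what makes the rename atomic.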

Question 16

What is 'soup adapters merge --strategy cmaes' and why is it different from linear / TIES / DARE / SVD?

Accepted Answer

'soup adapters merge --strategy cmaes --adapter --adapter ... --eval --budget 1h' (v0.67.0) runs Sakana-style evolutionary search over LoRA merge weights via pure-Python rank-mu CMA-ES (no 'cma' dependency). The optimiser parameterises N-1 logits, softmaxes them onto the simplex (sum=1, each ≥0), samples a population, keeps the elite half, and plateau-detects after 3 generations without improvement (converged=True). Bounds: 2–16 adapters, population [2, 256], generations [1, 10K], budget [60s, 24h] (reuses v0.57 blame.parse_budget). Operator-supplied eval_fn closure; failures swallowed with a sentinel -1e9 score so one broken eval doesn't crash the run (mirrors v0.40.3 proxy-failure isolation). Linear / TIES / DARE / SVD are still available as fast deterministic baselines; CMA-ES is for when you actually have a measurable eval and want the optimiser to search for you. Live auto-wiring of the eval suite is deferred to v0.67.1; v0.67.0 prints the validated plan today.

Question 17

What is the VeRA / VB-LoRA vector bank and why does it matter for multi-tenant serving?

Accepted Answer

'soup_cli.utils.vector_bank' (v0.67.0) is the storage format for VeRA / VB-LoRA: a shared random projection matrix P (d_model × d_model) plus a per-user scaling vector v_u (vector_dim floats). A 128-D scaling vector at fp32 is ≈512 bytes per user vs. ~30 MB for a rank-16 LoRA on a 7B model, so thousands of per-user adapters fit in a few MB in total, where the same number of LoRAs would need ~30 MB each. Hosted vendors price by GPU-hour, not adapter count, so they structurally cannot offer these economics. Schema: bank name (kebab-case, ≤128 chars), base_model (≤512 chars), per-entry {user_id (≤256 chars), scaling_vector (floats, finite, no NaN/Inf)}. Bounds: vector_dim [1, 16K], up to 1M entries per bank, 16 MiB file-size cap, atomic JSON I/O via shared paths.atomic_write_text + cwd containment + symlink rejection. estimate_bank_size(num_users, vector_dim) for capacity planning. Schema + disk I/O live in v0.67.0; live wiring into multi-adapter serve (v0.22) deferred to v0.67.1 (apply_bank_to_serve raises NotImplementedError with explicit marker).

Question 18

What is MoLE per-token gating in Soup CLI v0.67?

Accepted Answer

task='moe_lora_routing' (v0.67.0) is Mixture of LoRA Experts: a gating network routes per-token activations to top-K task adapters via softmax over the hidden state. Config: MoleGatingConfig with num_task_adapters [2, 64], hidden_dim [1, 16K], temperature (1e-6, 100.0] finite, top_k ≤ num_task_adapters. Cross-validator rejects mlx backend. Beyond 64 adapters the per-token softmax becomes the bottleneck; for more, make the gating hierarchical. Schema + cross-validator in v0.67.0; live gating-kernel training + serving dispatch deferred to v0.67.1.

Question 19

How do 'soup adapters pr' adapter pull requests work?

Accepted Answer

'soup adapters pr --base-sha <hex> --adapter <path> --eval <eval_delta.json> --samples <sample_diffs.json>' (v0.67.0) renders a GitHub-shaped pull request from an adapter bundle {base SHA, dataset diff, adapter weights, eval-delta report} as review-friendly Markdown with an eval-delta table + per-sample baseline/candidate diffs. Bounds: ≤64 EvalDelta entries, ≤256 sample diffs, per-output cap 32 KiB to keep PRs reviewable. The _md_table_escape function neutralises backslash, pipe, newline, CR, and tab characters in operator-controlled cells against Markdown / Rich injection. JSON output is also available for the v0.68 GitHub Action that will post the PR comment. Composes with 'soup adapters diff' from v0.57.

Question 20

What does 'soup lock' do and how is it different from 'soup env lock'?

Accepted Answer

'soup lock write --base-model --base-sha <64hex> --dataset-sha <64hex> --env-hash <64hex>' (v0.67.0) computes closure_sha = SHA256(base_sha || dataset_sha || env_hash) and writes a JSON soup.lock with soup_version, base_model (≤512 chars), and the three SHAs. The point: commit soup.lock to git alongside soup.yaml so the entire team trains on identical (base, dataset, env). 'soup lock check' compares the 5 content fields and exits 3 on drift; soup_version + created_at are advisory only (legitimate operator upgrades don't trigger drift). 'soup env lock' is v0.64.0 — it produces the env_hash that feeds soup.lock; the two together close the reproducibility chain.
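A minimal sketch of the closure computation follows. The source gives only SHA256(base_sha || dataset_sha || env_hash); this assumes '||' means concatenating the three lowercase 64-character hex strings as ASCII before hashing, which may not match Soup's byte layout. The function name is hypothetical.

```python
import hashlib

_HEX = set("0123456789abcdef")


def closure_sha(base_sha: str, dataset_sha: str, env_hash: str) -> str:
    """SHA-256 over the concatenated hex digests (concatenation format assumed)."""
    parts = []
    for name, value in (("base_sha", base_sha),
                        ("dataset_sha", dataset_sha),
                        ("env_hash", env_hash)):
        lowered = value.lower()
        if len(lowered) != 64 or not set(lowered) <= _HEX:
            raise ValueError(f"{name} must be 64 hex characters")
        parts.append(lowered)
    return hashlib.sha256("".join(parts).encode("ascii")).hexdigest()
```

Because any change to a single input changes the closure, a drift check can compare one digest and still report which of the three inputs moved by comparing the stored fields individually.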

Question 21

What is 'soup adapters bisect' and how does it work?

Accepted Answer

'soup adapters bisect ... --eval-command "soup eval custom --checkpoint {ckpt}"' (v0.67.0) does binary search over an ordered checkpoint history to find the first regression boundary. The operator supplies a shell template with a {ckpt} placeholder — Soup uses shlex.split after shlex.quote(ckpt), argv-list mode, no shell=True. Probes both endpoints first to short-circuit all-OK or all-broken, then ~log₂(n) midpoint probes. Verdicts: ALL_OK or BROKEN_AT (returns first_broken checkpoint ID). Exit 3 on BROKEN_AT for cron-friendly automation. Input: 2..4096 unique ordered checkpoint IDs. Composes with the v0.66 live influence-blame runner — bisect tells you which checkpoint broke, blame attributes it to specific training rows.
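The two mechanics described above, safe template expansion and endpoint-first bisection, can be sketched as follows. The function names are hypothetical, and the search assumes the history is monotone (once a checkpoint is broken, every later one is too). The is_ok callback stands in for running the eval command.

```python
import shlex
from typing import Callable, List, Optional, Sequence


def build_argv(template: str, ckpt: str) -> List[str]:
    """Quote the checkpoint, substitute it, then split into an argv list."""
    return shlex.split(template.replace("{ckpt}", shlex.quote(ckpt)))


def bisect_first_broken(ckpts: Sequence[str],
                        is_ok: Callable[[str], bool]) -> Optional[str]:
    """Return the first broken checkpoint, or None when all are OK."""
    if not is_ok(ckpts[0]):
        return ckpts[0]            # broken from the very start
    if is_ok(ckpts[-1]):
        return None                # ALL_OK: nothing to find
    lo, hi = 0, len(ckpts) - 1     # invariant: ckpts[lo] OK, ckpts[hi] broken
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_ok(ckpts[mid]):
            lo = mid
        else:
            hi = mid
    return ckpts[hi]
```

The resulting argv list can be passed to subprocess.run without shell=True, so a checkpoint ID containing spaces or shell metacharacters stays a single argument.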

Question 22

What does 'soup tunability' decide before you pick a base model?

Accepted Answer

'soup tunability --dataset <path> --candidates llama-3.1-8b qwen2.5-7b gemma-3-9b --probe-steps 100 --holdout-size 64' (v0.64.0) runs lightweight LoRA probes on a held-out dataset slice against each candidate base, measures training-loss deltas, and reports the Pareto frontier (best efficiency for cost). The candidate allowlist is closed and ships with licensing metadata (Apache-2.0, MIT, LLaMA-3, etc.). '--plan-only' dry-runs without probing; '--list' catalogues all bundled candidates. Bounds: probe_steps [10, 10000], holdout_size [10, 100000] rows. Output: per-candidate delta (base_loss - probe_loss), wall-clock seconds, estimated USD cost, Pareto frontier membership. Safety: path containment (is_under_cwd), null-byte rejection, symlink-escape rejection on dataset and output paths.

Question 23

What is 'soup plan' / 'soup apply' and how is it Terraform-shaped?

Accepted Answer

'soup plan --config soup.yaml --state ./soup.tfstate' (v0.64.0) reads a training configuration, computes cost / ETA / peak-VRAM / SHA-256 hashes, and writes an immutable state file. 'soup apply' re-reads the config, detects any drift (changed batch size, dataset SHA, base SHA), and refuses to proceed (exit 3) if drift is found. State shape: plain JSON with 'plan' (cost, ETA, SHA hashes), 'applied' (bool), 'applied_at' (ISO timestamp), 'run_id' (reserved). TOCTOU defense: config read as YAML with os.lstat before open; symlinks explicitly rejected. Peak VRAM calculated with a 10% safety margin; estimated cost + minutes included with spot pricing per GPU tier. Composes with v0.60 license-check for adapter merge and v0.67 soup.lock for the full reproducibility chain.

Question 24

How does 'soup env lock' detect ABI drift?

Accepted Answer

'soup env lock --output ./soup-env.lock' (v0.64.0) snapshots the current Python environment (version, CUDA major version, platform, every installed package) into a JSON lockfile. 'soup env check --lock ./soup-env.lock' compares the current env against the lock and exits 3 on ABI-sensitive drift (e.g. Python minor version change, CUDA major version change) that would invalidate training. Fields: soup_version, python_version, platform, cuda_version (or 'none'), ISO timestamp, per-package {name, version, source}. Atomic write to lock file; reads have a file-size cap (no unbounded JSON reads). The env_hash from 'soup env lock' feeds the v0.67 soup.lock closure for full team reproducibility.

Question 25

What does 'soup license-advisor' decide for a deploy target?

Accepted Answer

'soup license-advisor --target b2c|defense|embedded --license <id> --monthly-active-users 0' (v0.64.0) recommends license-clean base models for a deployment target. Returns ok (safe), warn (yellow), or block (red; exits 3) with a reason string and recommended/forbidden license lists. Per-license risk scales with expected MAU (e.g., LLAMA-3 OK for <1M MAU, risky above). Composes with the v0.60 license_matrix.check_license_compat gate on 'soup adapters merge' — so a 33-entry SPDX matrix governs adapter merges, while license-advisor governs base-model selection upstream.

Question 26

What new failure modes does 'soup eval behavior' / 'eval capability' / 'eval checklist' add in v0.65?

Accepted Answer

v0.65.0 takes failure-mode coverage from 6 to 10. 'soup eval behavior --battery xstest|harmbench|jailbreakbench|elephant|syceval --evidence ' diffs pre/post-FT on bundled safety / refusal / jailbreak / sycophancy probe sets with OK/MINOR/MAJOR via the v0.26/v0.56 thresholds (≥0.85 OK, ≥0.60 MINOR, else MAJOR). 'soup eval capability --suite full|fast|math|code' emits validated lm-eval-harness task IDs for 7 bundled benchmarks (MMLU-Pro, GPQA, BBEH, AIME, MATH-500, HumanEval+, SWE-bench-Verified). 'soup eval checklist ' runs CheckList-style MFT (minimum functionality, keyword in response) / INV (invariance under paraphrase) / DIR (directional perturbation) tests from a 1 MiB-capped YAML spec with up to 1,000 tests. Evidence files capped at 16 MiB with O_NOFOLLOW open against symlink swap.

Question 27

What does 'soup eval irt-subset' do?

Accepted Answer

'soup eval irt-subset <responses.jsonl> --size full|small|tiny' (v0.65.0) fits a 1-parameter Rasch Item Response Theory model to per-item correctness signals and selects a minimum-cost subset that preserves ranking power. The math: P(correct | ability θ, difficulty β) = σ(θ - β); β_i = -log(p̂_i / (1 - p̂_i)); item information I(β) = σ(-β) · σ(β) which peaks at β≈0 (50/50 items are most discriminating). Profiles: full (100%), small (~30%), tiny (~10%). Input shape: JSONL rows {item_id (str, ≤256 chars, no null bytes), correct (bool), score? (float)} up to 1M rows. Pure-Python math (no numpy/scipy in v0.65.0); 256 MiB file cap against unbounded uploads.
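The formulas above translate directly into a few lines of Python. The difficulty and information functions follow the stated math; the selection rule (rank items by information at θ = 0 and keep a fraction) and the clamping of p̂ away from 0 and 1 are assumptions, since the source does not spell out how the subset is chosen.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def item_difficulty(p_hat: float, eps: float = 1e-6) -> float:
    """beta = -log(p / (1 - p)), with p clamped so the logit stays finite."""
    p = min(max(p_hat, eps), 1.0 - eps)
    return -math.log(p / (1.0 - p))


def item_information(beta: float) -> float:
    """I(beta) = sigma(-beta) * sigma(beta); maximal (0.25) at beta = 0."""
    return sigmoid(-beta) * sigmoid(beta)


def select_subset(p_hats: Dict[str, float], fraction: float) -> List[str]:
    """Keep the most informative items (ranking rule is an assumption)."""
    ranked = sorted(
        p_hats,
        key=lambda item: item_information(item_difficulty(p_hats[item])),
        reverse=True,
    )
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]
```

An item answered correctly half the time has β = 0 and the maximum information of 0.25, so it survives even an aggressive cut, while near-universal passes or failures are dropped first.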

Question 28

What is 'soup probe sae-diff' and which SAE repos are bundled?

Accepted Answer

'soup probe sae-diff --top-k 20' (v0.66.0) applies a Sparse Autoencoder encoder to pre- and post-FT activation batches, computes per-feature mean changes, and reports the top-K most-changed features. Purely descriptive (no verdict gate); designed for CI logging and model cards. The SAE repo allowlist is closed and bundled — no auto-download: Gemma Scope (2B/9B/27B residual-stream), EleutherAI Pythia SAEs, JBloomAus Llama SAEs, OpenAI GPT-2 SAE. Input shape: JSON with 'activations': [[...], ...] (2D float32 matrix [num_tokens, hidden_dim]) read via O_NOFOLLOW to prevent symlink swap. Output: SaeFeatureDiffReport with top-K features and per-feature {feature_id, delta, pre_mean, post_mean}. Bounds: top-k [1, 10K], up to 1M SAE features, up to 1M tokens per activation batch, 16 MiB evidence file cap.

Question 29

How does 'soup probe sleeper' work?

Accepted Answer

'soup probe sleeper --evidence ' (v0.66.0) applies a linear defection probe (synthetic, deterministic, keyed on base name for reproducibility) to a 2D activation tensor. Classifies per-token scores as defection or benign, returns flagged rate, and assigns OK/MINOR/MAJOR: ≤1% → OK, ≤5% → MINOR, >5% → MAJOR. Closed allowlist of bundled bases (Llama-3-8B, Gemma-2-9B, etc.) — metadata includes hidden_dim, threshold, description. Activation input: JSON {'activations': [[...], ...]} (2D float32 [num_tokens, hidden_dim]); 1M token cap. No evidence = OK report with 0 tokens (matches v0.56 diagnose neutral-mode policy). 16 MiB evidence cap, symlink rejection.

Question 30

What is 'soup probe interference' and when does it gate CI?

Accepted Answer

'soup probe interference <losses.json>' (v0.66.0) builds an N×N matrix of adapter pairwise compatibility scores from operator-measured losses. Input JSON has 'adapters' list and 'losses' dict with keys like 'a|b' (loss on domain A when both A and B loaded) and 'a|a' (baseline loss with A alone). Formula: score(A→B) = (loss(A_target | A+B) - loss(A_target | A alone)) / loss(A alone). Classification: |score| <5% → OK, <20% → MINOR, ≥20% → MAJOR. Bounds: 2–16 adapters (4–256 pairwise probes); adapter names ≤256 chars and markup-escaped against Rich/Markdown injection before render. Output: InterferenceMatrix with cells and worst-pair summary; exits 2 if worst ≥20% interference (gates CI merge).
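The score formula and bands can be sketched over the 'a|b' key layout described above. The function names and the tuple-keyed result are hypothetical; the real InterferenceMatrix shape is not shown in the source.

```python
from typing import Dict, List, Tuple


def interference_score(loss_alone: float, loss_combined: float) -> float:
    """Relative change in A's loss when B is loaded alongside it."""
    return (loss_combined - loss_alone) / loss_alone


def classify(score: float) -> str:
    """|score| < 5% is OK, < 20% is MINOR, otherwise MAJOR."""
    magnitude = abs(score)
    if magnitude < 0.05:
        return "OK"
    if magnitude < 0.20:
        return "MINOR"
    return "MAJOR"


def build_matrix(adapters: List[str],
                 losses: Dict[str, float]) -> Dict[Tuple[str, str], Tuple[float, str]]:
    """Score every ordered pair (a, b) with a != b from 'a|a' and 'a|b' losses."""
    matrix = {}
    for a in adapters:
        baseline = losses[f"{a}|{a}"]
        for b in adapters:
            if a == b:
                continue
            score = interference_score(baseline, losses[f"{a}|{b}"])
            matrix[(a, b)] = (score, classify(score))
    return matrix
```

A CI gate would then look at the pair with the largest absolute score and fail when it reaches the MAJOR band.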

Question 31

How does 'soup ingest' pull traces from observability vendors?

Accepted Answer

'soup ingest --source langfuse|langsmith|helicone|openpipe|otel|openai-stored --logs --output ' (v0.63.0) normalises any observability export into one TraceRecord schema (trace_id / prompt / output / source / signal / metadata as a MappingProxyType so callers can't mutate). No network calls — the operator exports from the SaaS dashboard, then 'soup ingest' parses the file. Six sources at launch: LANGFUSE_KEY for Langfuse, LANGSMITH_API_KEY for LangSmith, HELICONE_API_KEY for Helicone, OPENPIPE_API_KEY for OpenPipe, OTEL_EXPORTER_OTLP_HEADERS for OpenTelemetry OTLP, OPENAI_API_KEY for OpenAI Stored Completions. PII warning prints once per invocation. Output feeds directly into the v0.26 'soup data from-traces' preference-pair builder and the v0.58 'soup loop' HarvestFn — closing the production-traces → training loop without paying per-trace fees to any vendor.

Question 32

What does 'soup prune-prompt' do?

Accepted Answer

'soup prune-prompt --input --output --min-frequency 0.95' (v0.63.0) detects the longest shared system-prompt prefix across all rows via binary search over up to 32 candidate lengths and strips it. By removing the shared boilerplate (e.g., 'You are a helpful AI assistant...'), the fine-tuned model internalizes it into its weights instead of repeating it on every turn — OpenPipe's signature trick at the data layer. Two-pass file read; capped at 100k rows. v0.63 fixed an O(N²) early-exit bug from the prototype detect_common_prefix function.
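
The prefix search can be sketched as a binary search over prefix length, which works because the share of rows agreeing on their most common prefix can only shrink as the prefix gets longer. This is a sketch under assumptions, not the shipped `detect_common_prefix`: the function names are hypothetical, and it treats `--min-frequency` as the minimum fraction of rows that must share the prefix.

```python
from collections import Counter

def shared_prefix(prompts, min_frequency=0.95):
    """Longest prefix shared by at least min_frequency of the prompts."""
    if not prompts:
        return ""

    def best(length):
        # Most common prefix of this length, if enough rows agree on it.
        counts = Counter(p[:length] for p in prompts if len(p) >= length)
        if not counts:
            return None
        prefix, hits = counts.most_common(1)[0]
        return prefix if hits / len(prompts) >= min_frequency else None

    lo, hi = 0, max(len(p) for p in prompts)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if best(mid) is not None:
            lo = mid        # a qualifying prefix of this length exists
        else:
            hi = mid - 1    # too long, shrink
    return best(lo) or ""

def strip_prefix(prompts, prefix):
    """Remove the shared boilerplate from every row that starts with it."""
    return [p[len(prefix):] if prefix and p.startswith(prefix) else p for p in prompts]
```

Each probe costs one pass over the rows, so the total work grows with the number of rows times the logarithm of the longest prompt length.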

Question 33

What is 'soup ab' and how is it different from a fixed-N A/B test?

Accepted Answer

'soup ab --input --metric latency|judge_score|retry_rate --alpha 0.05 --beta 0.20 --effect-size 0.1' (v0.63.0) runs Wald's Sequential Probability Ratio Test (SPRT). Under the null the likelihood ratio is a martingale (its logarithm is a supermartingale), so Type-I error stays controlled at every stopping time: you can peek at the results at any sample count and decide reject_h0 / accept_h0 / continue without breaking statistical guarantees. The running log-likelihood ratio is compared against an upper boundary A = log((1-β)/α) and a lower boundary B = log(β/(1-α)); crossing A rejects H0, crossing B accepts it, and anything in between means continue. v0.63 ships a CRITICAL fix for a sign error in the historical mSPRT implementation. Three metrics at launch: latency, judge_score, retry_rate. Input rows: {arm: 'control'|'treatment', : }.
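
A minimal sketch of the stopping rule, assuming a Bernoulli metric (for example a retry flag per request) with a null rate `p0` and an alternative rate `p1`. The function names and the Bernoulli likelihood are assumptions for illustration; only the two boundaries and the three decisions come from the answer.

```python
import math

def sprt_thresholds(alpha=0.05, beta=0.20):
    """Wald's boundaries for the log-likelihood ratio."""
    upper = math.log((1 - beta) / alpha)   # cross it: reject H0
    lower = math.log(beta / (1 - alpha))   # cross it: accept H0
    return upper, lower

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Run the test over 0/1 observations; return (decision, samples_used)."""
    upper, lower = sprt_thresholds(alpha, beta)
    llr = 0.0
    for n, x in enumerate(observations, start=1):
        # Add this observation's log-likelihood ratio under H1 vs H0.
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject_h0", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(observations)
```

With the defaults the boundaries are roughly 2.77 and -1.56, so a run of evidence in one direction stops the test early instead of waiting for a fixed N.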

Question 34

What does 'soup drift-alarm' watch in production?

Accepted Answer

'soup drift-alarm --reference --live --threshold 0.2 [--slack-url ... --discord-url ...]' (v0.63.0) computes KL divergence between token-distribution snapshots at fine-tune time (reference) and production (live). Catches both behavioral drift ('model now outputs JSON when it didn't before') and vocabulary drift ('same 20 phrases on repeat'). Whitespace tokenization by default; pluggable tokenizers ship in v0.63.1. Optional Slack / Discord webhooks fire on drift, SSRF-validated: loopback HTTP only, RFC1918 / link-local / cloud-metadata IPs (169.254.0.0/16, 100.64.0.0/10, 198.18.0.0/15) rejected. Exit code 3 on drift for cron-friendly automation. Input rows: {token: '', log_prob: }.
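
For intuition, the drift check can be sketched with whitespace tokenization and a smoothed KL divergence. Several details here are assumptions rather than documented behaviour: the direction KL(live || reference), the additive `eps` smoothing over the union vocabulary, and the helper names.

```python
import math
from collections import Counter

def token_counts(texts):
    """Whitespace-tokenize and count tokens across a list of strings."""
    return Counter(token for text in texts for token in text.split())

def kl_divergence(live, reference, eps=1e-9):
    """KL(live || reference) over the union vocabulary, with additive smoothing."""
    vocab = set(live) | set(reference)
    if not vocab:
        return 0.0
    live_total = sum(live.values()) + eps * len(vocab)
    ref_total = sum(reference.values()) + eps * len(vocab)
    kl = 0.0
    for token in vocab:
        p = (live[token] + eps) / live_total
        q = (reference[token] + eps) / ref_total
        kl += p * math.log(p / q)
    return kl

def drifted(live_texts, reference_texts, threshold=0.2):
    return kl_divergence(token_counts(live_texts), token_counts(reference_texts)) > threshold
```

Smoothing keeps the divergence finite when the live traffic contains tokens the reference snapshot never saw, which is exactly the case a drift alarm needs to survive.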

Question 35

How does Soup CLI handle GDPR right-to-be-forgotten?

Accepted Answer

Soup CLI v0.61.0 ships 'task: unlearn' as a first-class trainer with three methods: NPO (Negative Preference Optimization: a DPO-shaped, negative-only loss that needs a reference model and works best with a retain set), SimNPO (length-normalized NPO with no reference model, so it is faster on long sequences), and RMU (Representation Misdirection for Unlearning: residual-stream noise on forget inputs, best for concept-level removal). Config: set 'task: unlearn', 'training.unlearn_method: npo|simnpo|rmu', 'training.unlearn_alpha' (0.0–10.0, default 0.5, retain-set weight), 'data.forget_set' (required JSONL or HF dataset ID), 'data.retain_set' (optional, strongly recommended for NPO). Verify with 'soup eval unlearning <run-id> --benchmark tofu|muse|wmdp', which scores Forget Quality / Model Utility / PrivLeak and emits an OK / MINOR / MAJOR verdict.

Question 36

Can I patch a single fact in a model without retraining?

Accepted Answer

Yes. Soup CLI v0.61.0 ships 'soup edit set --base --method rome|memit|alphaedit --subject "" --target ""' for surgical fact patching. Three methods: ROME (Rank-One Model Editing, targets MLP layers; Meng 2022), MEMIT (Mass-Editing Memory in a Transformer; Meng 2023), and AlphaEdit (null-space-constrained editing that projects each update away from preserved knowledge). The plan / validation surface ships in v0.61.0; live application lands in v0.61.1. 'soup edit diff registry://before_id registry://after_id --probes probes.jsonl --top-k 10' compares two Registry entries and surfaces which facts changed with a citation visualizer. The Sequential Edit Governor caps per-base-model edits (default 10) and rejects edits that would amplify weight norms beyond threshold, preventing cascading drift across many sequential edits.

Question 37

What is RAFT and how does Soup CLI implement it?

Accepted Answer

RAFT (Retrieval Augmented Fine-Tuning, UC Berkeley 2024) teaches a model to use retrieved context (both golden and distractor documents) to answer queries. Soup CLI v0.62.0 ships 'data.format: raft' with per-row schema {query, golden_doc, distractor_docs[], answer}. Pair with 'training.ra_dit_stage: retriever|generator' for the RA-DIT two-stage pipeline: stage 1 trains a contrastive sentence-transformer ('ra_dit_retriever_model: sentence-transformers/all-MiniLM-L6-v2' default), stage 2 SFTs the main model on RAFT rows. Both stages reuse the existing trainer wrappers; single-command orchestration lands in v0.62.1. Add 'training.citation_faithful: true' + 'citation_style: bracket|inline|footnote' + 'citation_recall_threshold: 0.85' to enforce source attribution at training time: the final save is refused if recall falls below threshold. Three new recipes shipped: raft-llama3-8b, ra-dit-retriever, ra-dit-llama3-8b.

Question 38

Can I steer a model's behavior at inference time without fine-tuning?

Accepted Answer

Yes. Soup CLI v0.62.0 ships 'soup steer train --base --method caa|iti|repe --name --pairs ' for training activation steering vectors, then 'soup steer apply --name --strength ' to apply them at decode time (|strength| ≤ 10 enforced). Three methods: CAA (Contrastive Activation Addition, adds a learned vector to the residual stream), ITI (Inference-Time Intervention, shifts specific attention heads), and RepE (Representation Engineering, PCA-based direction extraction). Pairs JSONL: {positive: '', negative: ''}. Steering vectors register as the new 'steering_vector' artifact kind in the v0.26 Registry. Use case: nudge safety, formality, or domain focus without a full SFT/DPO run.

Question 39

Does Soup CLI generate compliance docs for procurement / regulators?

Accepted Answer

Yes (v0.59.0 'Governance & Provenance'). 'soup bom emit --name --version --base-model --base-sha --config-sha --task --license --format cyclonedx|spdx|both' produces machine-learning Bills of Material in CycloneDX 1.6 (with the ML-BOM extension) and SPDX 2.3 AI-profile formats — both, in one shot. 'soup attest emit --stage extract|train|eval|export|publish --subject --sha --builder --invocation --sign unsigned|ed25519|sigstore' writes SLSA-3 in-toto attestations per stage. 'soup audit-log tail/rotate' keeps a HIPAA/SOC2-compliant JSONL audit log at ~/.soup/audit.jsonl with PII redaction. 'soup train --annex-xi --repro-receipt ' emits EU AI Act Annex XI/XII documentation plus SR 11-7 reproducibility receipts (every seed, kernel version, library version, dataset hash). All atomic writes (tempfile.mkstemp + os.replace) with symlink-rejection TOCTOU defense.
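
The atomic-write pattern named at the end of the answer (tempfile.mkstemp + os.replace, with symlink rejection) looks roughly like this. `atomic_write_text` is a hypothetical helper written for illustration, not Soup's own function.

```python
import os
import tempfile

def atomic_write_text(path, text):
    """Write text so readers see either the old file or the new one, never a partial."""
    if os.path.islink(path):
        # Refuse to follow a symlink planted at the destination.
        raise ValueError(f"refusing to write through symlink: {path}")

    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Because the rename replaces the destination entry itself, a crash mid-write leaves the previous BOM or attestation intact rather than a truncated file.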

Question 40

How does Soup CLI prevent supply-chain attacks on LoRA adapters?

Accepted Answer

Six controls in v0.60.0. (1) 'soup adapters scan ' spectral-analyses LoRA weights for backdoors: rank-1 dominance (warn at 50×, fail at 200×), top-singular-value energy concentration (warn > 75%, fail > 95%), Frobenius outliers (warn > 4σ, fail > 8σ), NaN/Inf. Exit 0/1/3 for OK/WARN/FAIL — pure numpy, no torch. (2) 'soup adapters sign ' computes a Merkle root over all files and writes .soup-signature.json (unsigned backend in v0.60.0; ed25519 + sigstore in v0.60.1). (3) 'soup adapters verify --strict' refuses to load on signature mismatch. (4) 'soup adapters check-safetensors --strict' refuses pickle / .bin / .pt — the single biggest LoRA attack vector. (5) 'soup adapters merge --license [--license ] --license-override ""' enforces SPDX-license compatibility via a 33-entry matrix. (6) 'soup airgap-bundle --model --dataset ... --wheel ... --kernel ... --bundle-size-cap 100' packs everything for one-way data-diode transfer, ≤100 GiB cap, signed manifest. Plus namespace-pin TOFU: the first pull of an HF repo pins its owner SHA in a local SQLite cache so a hijacked org cannot silently swap your base model.
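
As an illustration of control (2), a Merkle root over an adapter directory can be computed as below. The construction is an assumption made for the sketch (leaves are SHA-256 over name, a NUL byte, and content, sorted by name, with the last node duplicated on odd levels); the actual layout of .soup-signature.json is not described in the answer.

```python
import hashlib

def merkle_root(files):
    """files: mapping of relative path -> bytes. Returns the root as hex."""
    if not files:
        raise ValueError("need at least one file")
    # Leaf hash binds the file name to its content; sorting makes the root order-independent.
    level = [
        hashlib.sha256(name.encode("utf-8") + b"\0" + files[name]).digest()
        for name in sorted(files)
    ]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Any change to a single file, or to a file name, changes the root, which is what lets a verifier refuse to load on mismatch.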

Question 41

What is 'soup loop' and how does the production data flywheel work?

Accepted Answer

'soup loop' (v0.58.0) runs the full production data flywheel from a single CLI: harvest production traces → distill into preference pairs → eval-gated DPO train → canary deploy → auto-rollback on regression. 'soup loop init --eval --baseline registry:// --monthly-budget 50usd --max-runs-per-day 3' creates an atomic '.soup/loop.yaml' state file. 'soup loop watch' runs as a long-running daemon (SIGTERM/SIGINT safe, reloads state every iteration so external pause/resume takes effect immediately). 'soup loop canary --traffic 5% --autoroll-on-regress' splits traffic via deterministic SHA-256 hash routing (±0.01% split granularity, threading-locked BucketStats); the verdict uses the v0.26 Quant-Lobotomy OK/MAJOR thresholds with a 30-sample minimum. 'soup loop pause' / 'soup loop resume' flip atomic status without killing the daemon. 'soup loop replay []' walks per-iteration manifests at '.soup-loops//iteration.json'. Budget guardrails enforce a daily-cap → estimate-sanity → monthly-budget check order; budget-skipped iterations produce no manifests. Every iteration is recorded as a frozen IterationRecord with gate verdict, canary verdict, shipped flag, rolled-back flag, and estimated cost.
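
Deterministic hash routing for the canary split can be sketched as follows. The `route` function and the choice of 10,000 buckets (one bucket per 0.01% of traffic) are assumptions for illustration; the answer only states that routing is SHA-256 based and deterministic.

```python
import hashlib

BUCKETS = 10_000  # one bucket = 0.01% of traffic (assumed granularity)

def route(request_id, canary_percent):
    """Send the same request_id to the same arm every time."""
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be within [0, 100]")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return "canary" if bucket < round(canary_percent * 100) else "baseline"
```

Hashing the identifier instead of drawing a random number means a user or session never flips between arms across retries, which keeps the canary statistics clean.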

Question 42

What does 'soup advise' decide before you fine-tune?

Accepted Answer

'soup advise --goal ""' (v0.54.0) is a pre-flight decision engine that classifies the task (keyword + structural signals — tool_calls → tool_use, '' → reasoning, chat messages → input-extraction) across 7 task categories, profiles the dataset (row_count, avg input/output chars, type-token diversity, label variance, has_chosen_rejected, has_reasoning_traces — capped at 2,000-row sample), and emits a rubric verdict among PROMPT_ENG / RAG / SFT / DPO / GRPO. Heuristics: preference pairs → DPO; reasoning + ≥500 rows → GRPO; <50 rows → PROMPT_ENG; high-variance factual → RAG; default → SFT. Optional '--probe' runs a 100-step LoRA probe to put real numbers on each ROI estimate. '--record' appends an entry to '~/.soup/advise_history.jsonl' so future verdicts learn across projects (atomic, file-locked, 16 MiB cap). 'soup advise explain' prints the full rubric. 'soup advise compare ' compares two candidate datasets. Outputs the literal next command ('soup autopilot --data … --task sft') so the handoff is one paste away.
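
The listed heuristics read naturally as a first-match rule chain. The sketch below assumes the rules fire in the order they are listed, which the answer does not state explicitly; the function and parameter names are hypothetical.

```python
def advise_verdict(row_count, has_chosen_rejected=False,
                   has_reasoning_traces=False, high_variance_factual=False):
    """Return one of PROMPT_ENG / RAG / SFT / DPO / GRPO (first matching rule wins)."""
    if has_chosen_rejected:
        return "DPO"            # preference pairs present
    if has_reasoning_traces and row_count >= 500:
        return "GRPO"           # reasoning data with enough rows
    if row_count < 50:
        return "PROMPT_ENG"     # too little data to fine-tune
    if high_variance_factual:
        return "RAG"            # facts vary too much to bake into weights
    return "SFT"
```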

Question 43

How does 'soup eval design' derive an eval suite from your data?

Accepted Answer

'soup eval design --goal "..."' (v0.55.0) uses TF-IDF salience over your training data (10,000-row DoS-capped subsample) plus goal-keyword dispatch to draft a goal-conditioned eval suite with up to 5 dimensions and a scorer per dimension chosen from {exact_match, regex, judge, rlvr}: 'json' / 'code' / 'math' keywords → rlvr; 'classify' → exact_match; 'extract' → regex; default → judge. 'soup eval discover ' runs greedy farthest-first Jaccard-distance clustering (10,000-row cap) to surface held-out canaries + adjacent-skill probes + 25%-prefix memorization probes. 'soup eval lock ' freezes the suite as a SHA-256-checksummed 'eval_suite' artifact (canonical-JSON for stable hashes; registered in the v0.26 registry alongside the new 'canaries' kind). 'soup eval coverage --task ' does gap analysis against the v0.54.0 task taxonomy (each task has a recommended scorer set — e.g. reasoning → rlvr/judge, format_conversion → regex/rlvr). 'soup eval gate-install --baseline ' writes a '.git/hooks/pre-push' that paired-bootstraps a regression verdict against the baseline (configurable n_samples in [100, 100_000], ci_level in (0, 1), direction-aware metric handling) and blocks the push on regression.
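
Greedy farthest-first selection over Jaccard distance, the idea behind 'soup eval discover', can be sketched like this. Tokenizing on lowercase whitespace and seeding with the first row are assumptions for the example, as are the function names.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two token sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def farthest_first(rows, k):
    """Pick up to k row indices, each as far as possible from those already chosen."""
    if not rows or k <= 0:
        return []
    token_sets = [set(row.lower().split()) for row in rows]
    chosen = [0]  # assumption: seed with the first row
    # Distance from every row to its nearest chosen row.
    min_dist = [jaccard_distance(ts, token_sets[0]) for ts in token_sets]
    while len(chosen) < min(k, len(rows)):
        candidates = [i for i in range(len(rows)) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [
            min(d, jaccard_distance(ts, token_sets[nxt]))
            for d, ts in zip(min_dist, token_sets)
        ]
    return chosen
```

The rows picked this way spread across the data rather than clustering around the most common phrasing, which is why the approach suits canary and adjacent-skill probes.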

Question 44

What does 'soup diagnose' check in a trained model?

Accepted Answer

'soup diagnose ' (v0.56.0) is a post-training model report card that runs 6 pure-function failure-mode probes and rolls them up into an OK/MINOR/MAJOR verdict (thresholds: ≥0.85 OK, ≥0.60 MINOR, else MAJOR). The probes: (1) forgetting — per-task Δ accuracy with tolerance band, extending the v0.25 forgetting baseline; (2) refusal — advbench / xstest delta over caller-supplied generators, 8,192-row scan cap; (3) format — JSON / regex / tool-call validity over RLVR verifiers with explicit ReDoS probe ('a' × 128); (4) mode_collapse — pairwise n-gram-Jaccard distance over K completions (k ∈ [2,32], ngram_n ∈ [1,8]); (5) memorization — training-prefix echo via partial-prompt continuation (1,000-row scan); (6) contamination — n-gram overlap with public benchmarks (combined-complexity cap rejects when |training| × |benchmark| > 1e9). Outputs JSON + a 6-cell SVG badge ('utils/diagnose/badge.py::render_badge_svg', html-escaped) that you can embed in model cards. '--attach-to-registry ' attaches it as the v0.56.0 'diagnose_report' artifact kind. Pair with 'soup train --diagnose-gate ' to refuse the final save on MAJOR regression (exits typer.Exit(code=2)).

Question 45

What is 'soup adapters' and how is it like git for LoRA?

Accepted Answer

'soup adapters' (v0.57.0) ships 6 subcommands that treat LoRA adapters as first-class versioned objects. 'soup adapters diff ' computes per-layer Frobenius norm + relative drift + SVD effective rank — pure numpy, no torch, '.bin' rejected with an actionable 're-save as safetensors' message. 'soup adapters merge [c...] -o --strategy linear|ties|dare|svd' implements four pure-numpy merge strategies: linear weighted average, TIES (Yadav 2023 — trim by density / elect majority sign / disjoint average), DARE (Yu 2024 — Bernoulli drop with rescale + deterministic seed), and SVD low-rank reconstruction. 'soup adapters blame --dataset --layer --budget 5m --shards 4' plans leave-one-out layer ablation given a wall-clock budget ('60s'/'5m'/'2h' parser, bounds [60s, 24h], 30s minimum per shard); the live runner ships in v0.57.1. 'soup adapters branch -c --base --dataset ' names a LoRA config with SHA-256-pinned config + dataset + base-model hashes ('^[A-Za-z0-9][A-Za-z0-9._\-]{0,127}$' regex, 1 MiB config cap, 1,024-pointer cap, atomic POSIX 0o600). 'soup adapters checkout -o ' refuses to restore on SHA mismatch (drift detection). 'soup adapters branches' lists everything. Env override 'SOUP_BRANCHES_DIR' is containment-checked to $HOME / $CWD / $TMPDIR.
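
The wall-clock budget rules for 'soup adapters blame' can be sketched as a small parser. The function names are hypothetical; the accepted suffixes, the [60s, 24h] bounds, and the 30-second per-shard minimum follow the answer.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

def parse_budget(text):
    """Parse '60s' / '5m' / '2h' into seconds, enforcing the [60s, 24h] bounds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised budget: {text!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if not 60 <= seconds <= 24 * 3600:
        raise ValueError("budget must be between 60s and 24h")
    return seconds

def seconds_per_shard(budget, shards):
    """Split the budget across shards, refusing plans under 30s per shard."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    per_shard = parse_budget(budget) / shards
    if per_shard < 30:
        raise ValueError("each shard needs at least 30s")
    return per_shard
```

For the flags shown in the answer, '--budget 5m --shards 4' gives each shard 75 seconds.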

Question 46

What is the Soup CLI registry?

Accepted Answer

The registry is a local SQLite-backed catalog of every fine-tune at ~/.soup/registry.db. Each entry stores the config, eval baseline, and parent lineage. You push runs with 'soup registry push --run-id --name --tag ', visualize the DAG with 'soup history ', diff two versions with 'soup registry diff', and promote a version to prod with a tag. Eval gates can reference 'registry://' as a baseline to catch regressions.

Question 47

What is Soup Autopilot?

Accepted Answer

Autopilot is a zero-config decision engine introduced in Soup CLI v0.25.0. You pass 'soup autopilot --model --data --goal chat' and Soup profiles the dataset, model, and GPU, then picks task, quantization, PEFT rank, batch size, learning rate, epochs, and max length — generating a soup.yaml with every choice justified.

Question 48

Can Soup CLI auto-push checkpoints to HuggingFace Hub?

Accepted Answer

Yes. Soup ships deep HuggingFace Hub integration. 'soup train --push-as user/my-model' uploads every save_steps checkpoint to HF Hub as a 'checkpoint-<N>' branch. Pair with '--hf-resume' to pull the latest branch and keep going after a spot-instance preemption. Set 'HF_ENDPOINT=https://hf.internal.example.com' to route to a self-hosted Hub (SSRF-hardened: loopback HTTP only, RFC1918 IPs rejected). 'soup deploy hf-space' creates a Gradio or Streamlit Space wrapping your model in one command.

Question 49

What is the unified preference dispatcher in Soup CLI v0.40?

Accepted Answer

v0.40.0 'Preference Variety' adds 'task: preference' as a unified dispatcher. Set 'training.preference_loss: dpo|simpo|orpo|ipo|bco' to pick the loss without renaming the task — making hyperparameter sweeps over the loss type itself trivial. Legacy 'task: dpo' / 'task: simpo' / etc. remain first-class. The release also ships BCO (Binary Classifier Optimization) as a new trainer with 'task: bco', two opt-in DPO controls (beta annealing via 'dpo_beta_schedule' + periodic reference-model refresh via 'dpo_ref_regen_epochs'), and a multi-objective preference-loss schema ('preference_loss_weights') that validates 2–5 entries summing to 1. Since v0.53.11 the multi-objective path is fully live: 'attach_weighted_preference_combine' computes per-batch DPO / IPO / SimPO / ORPO terms from the four TRL logprob tensors and combines them via 'combine_losses(terms, weights)', replacing the v0.40.1 primary-loss scaling shim.
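
The weight validation and the weighted combination can be illustrated in plain Python. This is a sketch with scalar losses and hypothetical names (`validate_weights`, `combine_weighted`); the real 'combine_losses' operates on tensors and its exact signature beyond (terms, weights) is not shown in the answer.

```python
import math

ALLOWED_LOSSES = {"dpo", "simpo", "orpo", "ipo", "bco"}

def validate_weights(weights):
    """Check the 2-5 entry rule and that the weights sum to 1."""
    if not 2 <= len(weights) <= 5:
        raise ValueError("preference_loss_weights needs 2 to 5 entries")
    unknown = set(weights) - ALLOWED_LOSSES
    if unknown:
        raise ValueError(f"unknown losses: {sorted(unknown)}")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")  # assumption for the sketch
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-6):
        raise ValueError("weights must sum to 1")

def combine_weighted(terms, weights):
    """Weighted sum of per-loss terms for one batch."""
    validate_weights(weights)
    return sum(weights[name] * terms[name] for name in weights)
```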

Question 50

What quantization formats does Soup CLI support?

Accepted Answer

Seventeen formats as of v0.53.0 'Quant Menu II'. Set via 'training.quantization' in soup.yaml: 4bit (bitsandbytes default), 8bit, none (fp16/bf16/full), gptq, awq, hqq:1bit … hqq:8bit (8 sub-variants), aqlm (extreme 2-bit), eetq (8-bit fast kernel SM75+), mxfp4, fp8 (training Hopper+), and bitnet_1.58 (v0.52.0). On top of that, v0.53.0 added a 14-entry Unsloth Dynamic 2.0 GGUF ladder (UD-Q8_K_XL … UD-IQ1_M), 12-entry IQ family, 10-entry Apple/ARM-friendly GGUF set, KV-cache types ('training.kv_cache_type: q8_0|bf16|f16|fp8' — FP8 Hopper-only), FP8 attention, NVFP4 (Blackwell), explicit 'unsloth_bnb_4bit', BNB double-quant. As of v0.53.1 every writer is live: 'soup merge --save-format 4bit | 4bit_forced' performs a single-shot BNB merge without dequant/requant; 'soup export --format torchao --quant-config' covers Int4WeightOnly / Int8DynActInt4 / Float8DynActFloat8 / NVFP4; 'soup export --format gguf-ud' runs the 3-stage llama.cpp imatrix pipeline. 'soup train' runs check_quant_distributed_compat() at startup to flag FSDP / ZeRO-3 incompatibilities before training begins.

Question 51

What is Multipack in Soup CLI?

Accepted Answer

Multipack (v0.37.0) is Soup's largest single throughput win on chat fine-tuning over uneven-length data. Instead of padding every sample to max_length, it uses First-Fit-Decreasing bin packing to group variable-length samples — eliminating padding waste. Set 'training.multipack: true' in soup.yaml. 18-architecture allowlist (Llama 3.x, Qwen 2/3, Mistral, Gemma 2/3, Phi 3/4, DeepSeek V2/V3, Mixtral, Falcon, StableLM, SmolLM2). Unknown architectures fail loudly at config-load. SFT / Pretrain only on the transformers backend.
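
First-Fit-Decreasing itself is easy to sketch: sort samples longest first, then drop each one into the first pack that still has room. The function below is an illustration of the algorithm only; it ignores the attention-mask and position-id handling a real packer needs, and its name is hypothetical.

```python
def pack_first_fit_decreasing(lengths, max_length):
    """Group sample indices into packs whose total token count fits max_length."""
    packs = []  # each pack: {"free": remaining tokens, "items": sample indices}
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for index in order:
        size = lengths[index]
        if size > max_length:
            raise ValueError(f"sample {index} is longer than max_length")
        for pack in packs:
            if pack["free"] >= size:   # first pack with enough room
                pack["items"].append(index)
                pack["free"] -= size
                break
        else:
            packs.append({"free": max_length - size, "items": [index]})
    return [pack["items"] for pack in packs]

# Six samples that would each be padded to 1024 tokens fit into three packs.
packs = pack_first_fit_decreasing([700, 300, 512, 200, 100, 512], 1024)
```

In this example the six samples carry 2,324 real tokens; padding each to 1,024 would process 6,144 token slots, while the three packs process 3,072.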

Question 52

What LoRA quality features does Soup CLI ship?

Accepted Answer

Five PEFT-surface improvements (v0.39.0). PiSSA initializes LoRA from the SVD of the base weight for faster early convergence. ReLoRA fires every N steps to magnitude-prune the adapter and clear optimizer state — useful for very long runs. Per-pattern rank/alpha lets you map module name patterns to integer ranks. Surgical patches auto-fire for Gemma 4 ClippableLinear and fused-MoE 3-D expert dropout. The 17 built-in templates live as soup_cli/templates/*.yaml with a manifest.json index.

Question 53

How does Soup CLI explain training anomalies?

Accepted Answer

'soup why' (v0.34.0) reads the most recent (or named) run and surfaces plain-English diagnoses with concrete next steps. Detects NaN/Inf loss, plateau (≥30 steps with <0.5% change), divergence (loss > 3× initial), persistent high gradient norm, learning rate outside [1e-6, 5e-3]. Pure rule-based — no model calls. 'soup tui' opens a full-screen Textual dashboard. 'soup train --profile' records a torch.profiler Chrome-trace. Crash bundles auto-write a self-contained .crash JSON with redacted secrets when training fails.
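
A rule-based check in the spirit of 'soup why' might look like the sketch below. The thresholds come from the answer, but their exact interpretation is assumed: plateau is read as the last 30 finite losses spanning less than 0.5% of the window's first value, and divergence as the latest finite loss exceeding 3× the first one.

```python
import math

def diagnose_losses(losses, plateau_steps=30, plateau_tolerance=0.005,
                    divergence_factor=3.0):
    """Return a list of findings for a loss curve (pure rules, no model calls)."""
    findings = []
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        findings.append("nan_or_inf")

    finite = [x for x in losses if math.isfinite(x)]
    if finite and finite[-1] > divergence_factor * finite[0]:
        findings.append("divergence")

    if len(finite) >= plateau_steps:
        window = finite[-plateau_steps:]
        spread = max(window) - min(window)
        # Relative change across the window, guarded against a zero baseline.
        if window[0] != 0 and spread / abs(window[0]) < plateau_tolerance:
            findings.append("plateau")
    return findings
```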

Question 54

Does Soup CLI's inference server support speculative decoding and structured output?

Accepted Answer

Yes (v0.30.0). Speculative decoding via '--speculative-decoding draft-model' or '--auto-spec' (auto-pairs Llama 3.1/3.3/4, Qwen 2.5/3, Mistral Large, Mixtral, DeepSeek V3/R1, Gemma 2/3). Structured output via '--structured-output json --json-schema s.json' or '--structured-output regex --regex-pattern ...'. vLLM prefix caching for RAG/agent workloads via '--prefix-cache'. Dynamic LoRA hot-swap via 'POST /v1/adapters/activate/<name>'. Live continuous-batching dashboard plus '/metrics' endpoint via '--dashboard'. OpenTelemetry tracing via '--trace --trace-endpoint http://localhost:4317'.

Question 55

Does Soup CLI default to safe model loading?

Accepted Answer

Yes (v0.36.0 'Correctness First'). 'soup train', 'chat', 'serve', 'data download', 'eval auto' now require '--trust-remote-code' to load any HF model that ships custom Python (auto_map in config.json). First-party orgs (Meta, Mistral, Qwen, Google, etc.) suppress the warning panel; everything else prints a REMOTE CODE WARNING before loading. Tokenizers without a chat template raise a ValueError instead of silently building garbage strings. Raw Jinja chat-template strings reject filesystem-touching directives (include/import/from/macro/extends) at config-load.
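
The chat-template check can be approximated with a regular expression over Jinja block tags. This is a simplified sketch with hypothetical names; a production validator would more likely walk the parsed template than pattern-match the raw string.

```python
import re

# Block tags that can pull in other templates or touch the filesystem.
_FORBIDDEN_TAG = re.compile(r"\{%[-+]?\s*(include|import|from|macro|extends)\b")

def validate_chat_template(template):
    """Raise ValueError if the template uses a forbidden directive."""
    match = _FORBIDDEN_TAG.search(template)
    if match:
        raise ValueError(f"chat template uses forbidden directive: {match.group(1)}")
    return template
```

Ordinary control flow such as a for-loop over messages passes, while a template that tries to include another file is rejected before any model is loaded.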

Question 56

What license is Soup CLI under?

Accepted Answer

Soup CLI is Apache-2.0 licensed as of v0.29.0 (previously MIT). Downstream redistributors must retain the NOTICE file per §4(d).

Question 57

Does Soup CLI support Apple Silicon?

Accepted Answer

Yes. Soup v0.25.0 added an MLX backend for M1–M4 Macs. Install with 'pip install soup-cli[mlx]', set 'backend: mlx' in soup.yaml, and run SFT, DPO, or GRPO natively on unified memory without CUDA.

Question 58

How do I fine-tune Llama with Soup CLI?

Accepted Answer

Install with 'pip install soup-cli', then run 'soup recipes use llama3.1-8b-sft' to drop a vetted soup.yaml, point 'data.train' at your dataset, and run 'soup train'. Soup ships 116 recipes covering Llama 3.1/3.2/4 Scout, Qwen 2.5/3, Gemma 3, Mistral, Phi-4, DeepSeek R1/V3.

Question 59

What training methods does Soup CLI support?

Accepted Answer

23 methods, including SFT, DPO, GRPO (with RLVR verifiable rewards for math/code/JSON), PPO, KTO, ORPO, SimPO, IPO, BCO, Pretrain, Embedding, Reward Model, and the unified 'preference' dispatcher (set 'training.preference_loss: dpo|simpo|orpo|ipo|bco' to swap the loss without renaming the task). PEFT options include LoRA, QLoRA, DoRA, rsLoRA, VeRA, OLoRA, plus PiSSA SVD init, ReLoRA magnitude-prune cycles, and per-pattern rank/alpha overrides.

Question 60

How do I install Soup CLI?

Accepted Answer

Install from PyPI: 'pip install soup-cli'. Requires Python 3.9+. Optional extras: 'soup-cli[fast]' for Unsloth 2-5x speedup, 'soup-cli[mlx]' for Apple Silicon, 'soup-cli[serve]' for the inference server, 'soup-cli[ui]' for the web dashboard, 'soup-cli[remote]' for fsspec + s3fs / gcsfs / adlfs remote loaders, 'soup-cli[trackers]' for MLflow / SwanLab / Trackio (v0.53.8), 'soup-cli[mix]' for the scikit-optimize Bayesian mix optimizer (v0.53.10), and 'soup-cli[data-pro]' for proper langdetect + Presidio PII (v0.53.10).

Question 61

What models does Soup CLI support?

Accepted Answer

Soup CLI ships 116 recipes spanning text (Llama 3.1/3.2/4 Scout + Maverick, Qwen 2.5, Qwen 3 8B/14B/32B + 30B MoE + 235B-A22B, Gemma 3, Mistral, Mixtral 8x7B/8x22B, Phi-4, DeepSeek R1 + V3), vision (Llama-3.2-Vision 11B/90B, Pixtral 12B, Qwen2-VL 7B/72B, InternVL 2.5, MiniCPM-V 2.6), audio (Qwen2-Audio, SeamlessM4T v2, Whisper-large-v3), reasoning (all 6 DeepSeek-R1-Distill sizes, Qwen3-Coder 30B, Phi-4 reasoning), small/edge (SmolLM2 135M-1.7B, Phi-3.5-mini, Llama-3.2 1B/3B), and domain specialists (BioMistral, Meditron, CodeLlama, Magicoder, Mathstral, Nemotron-4 340B). Works with any of the 340,000+ text-generation models on Hugging Face Hub.

Question 62

What new modalities did Soup add in v0.52 'Modality II'?

Accepted Answer

v0.52.0 adds three new task families plus BitNet quant and reasoning-effort dispatch — all live trainers as of v0.53.2. (1) TTS: 'task: tts' + 'modality: audio_out' with 5 family allowlists (Orpheus, Sesame-CSM, Llasa, Spark, Oute) and per-family emotion vocabularies. (2) Classification heads: 'task: classifier' / 'reranker' / 'cross_encoder' with 'num_labels' + 'label_names' validators — single-label / multi-label / cross-encoder all live via 'ClassifierTrainerWrapper'. (3) Distillation: 'task: distill' + 'teacher_model' + 'distill_divergence' (kl / forward_kl / reverse_kl / js) + 'distill_temperature ∈ [0.05, 100]' — live via 'DistillTrainerWrapper' (independent 'trust_remote_code' per side). Plus 'quantization: bitnet_1.58' (text, sft/pretrain/dpo only — non-MLX), 'reasoning_effort: low|medium|high' for gpt-oss reasoning dispatch (injects '<|reasoning_effort|>' into chat turns), EBFT (structured/strided) + GDPO (standard/length_normalized/margin) loss kernels live, and MoE expert quant ('moe_expert_quant: nf4|int8_rowwise', 'train_router_only').

Question 63

What is GRPO Plus (v0.50) and what variants ship?

Accepted Answer

v0.50.0 'GRPO Plus' brings full unsloth + axolotl parity for GRPO — live since v0.53.11. Pick a variant with 'training.grpo_variant ∈ {standard, gspo, dapo, dr_grpo, bnpo, two_sided, rft}': every non-standard variant now ships a real loss kernel via the 'make_grpo_trainer_variant' LRU-cache factory subclassing GRPOTrainer. Pair 'two_sided' with 'grpo_delta ∈ (0, 1]' for asymmetric clipping. 'grpo_fp16: true' opts into explicit FP16 mixed precision (CPU/MPS/XPU → no AMP; CUDA → bf16 by default, fp16 when flag set). Long-context GRPO ('long_context_grpo'), vision GRPO ('vision_grpo' — known-VLM-base gate added in v0.53.3), async rollout prefetch ('async_grpo_prefetch'), reference-model EMA ('ref_model_ema_alpha' via GRPOStabilityCallback live in v0.53.11), bounded-deque replay buffer ('replay_buffer_size'), truncated-completion masking, and zero-advantage skipping all live. Rollout backends 'art' / 'ruler' / 'nemo_gym' / 'openenv' for multi-turn agentic RL. 'task: prm' (Process Reward Model) live in v0.53.11 with reward head + per-step MSE.

Question 64

Does Soup CLI ship a plugin system?

Accepted Answer

Yes (v0.45.0). 'soup plugins list/install/enable/disable' manages a public plugin / hook system built on the BasePlugin Protocol with PluginSpec registration (name + version validation). v0.45.0 also added an OpenAI ↔ Anthropic Messages converter, server-side tools allowlist with WebSearchConfig, n-gram speculative-decoding schema, a 15-entry external integrations catalog, advanced trainer-plugin allowlist, and a Data Recipe DAG parser with topological sort. As of v0.53.6 'SoupPluginCallback' is live across every HF Trainer (all 13 trainers) — plugin hooks fan in on every step / epoch event, and a single bad plugin can't crash training (exceptions caught + logged). v0.53.7 made the Data Recipe DAG runner ('soup data recipe --execute') live for 6 node kinds (seed / llm_text / code / judge / validator / sampler) with atomic per-node checkpoints + resume, and lazy-imported 6 trainer plugins (grokfast / spectrum / llmcompressor / sonicmoe / cce_plugin / math_verify). v0.46.0 layered Agent Forge on top: parse OpenAPI 3.x / MCP manifests / GraphQL introspection into tool-calling SFT datasets via 'soup agent synth' + 'soup agent train' + 'soup agent eval', plus 'soup deploy autopilot' for picking PEFT + quant + spec-decoding combos against a 10-profile hardware catalog (live measure-mode in v0.53.1).

CLI Reference

Core Commands

`soup init`

`soup train`

`soup chat`

`soup serve`

`soup export`

`soup merge`

`soup push`

`soup eval`

`soup deploy`

`soup infer`

`soup migrate`

`soup recipes`

Data Commands

Experiment Commands

Other Commands

v0.54 — Pre-flight Decision

v0.55 — Eval Design

v0.56 — Diagnose (Model Report Card)

v0.57 + v0.67 — Adapter VCS

v0.58 — Production Data Flywheel

v0.59-v0.62 — Governance, Supply Chain, Unlearn, RAG

v0.63 — Production Trace Ecosystem

v0.64 — Pre-flight & Tooling

v0.65 — Eval Depth (Failure Modes 6 → 10)

v0.66 — Post-train X-rays

Global Flags