Post-train X-rays (v0.66.0)
Mechanistic interpretability on every fine-tune. Four probe families surface what *changed inside* the model, not just what changed in its outputs. Purely descriptive: there is no auto-mitigation, and the reports are designed for CI logging and model cards.
`soup probe sae-diff` — Sparse Autoencoder feature movement
soup probe sae-diff ./gemma_scope.safetensors ./pre_acts.json ./post_acts.json \
  --top-k 20 --output ./sae_diff.json

Encode pre-FT and post-FT activation batches through a Sparse Autoencoder; report the top-K features whose mean activation moved the most. Bundled SAE-repo allowlist (no auto-download):
- Gemma Scope (2B / 9B / 27B residual-stream)
- EleutherAI Pythia SAEs
- JBloomAus Llama SAEs
- OpenAI GPT-2 SAE
Bounds: top-k [1, 10K], up to 1M features, up to 1M tokens per batch, 16 MiB evidence cap.
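The comparison described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the encoder is taken to be a simple `ReLU(x @ W_enc + b_enc)` (the bundled SAEs may use a different activation), and the names `relu_encode`, `mean_feature_activations`, and `sae_diff` are hypothetical, not part of the `soup` CLI or its file formats.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def relu_encode(x: Vector, w_enc: Sequence[Vector], b_enc: Vector) -> List[float]:
    """Assumed encoder: ReLU(x @ W_enc + b_enc), W_enc shaped [d_model][n_features]."""
    return [
        max(0.0, sum(x[d] * w_enc[d][f] for d in range(len(x))) + b_enc[f])
        for f in range(len(b_enc))
    ]


def mean_feature_activations(
    batch: Sequence[Vector], w_enc: Sequence[Vector], b_enc: Vector
) -> List[float]:
    """Average each SAE feature's activation over every token in the batch."""
    if not batch:
        raise ValueError("activation batch is empty")
    totals = [0.0] * len(b_enc)
    for x in batch:
        for f, value in enumerate(relu_encode(x, w_enc, b_enc)):
            totals[f] += value
    return [t / len(batch) for t in totals]


def sae_diff(
    pre: Sequence[Vector],
    post: Sequence[Vector],
    w_enc: Sequence[Vector],
    b_enc: Vector,
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return (feature_index, post_mean - pre_mean) for the top_k largest moves."""
    if not 1 <= top_k <= 10_000:  # mirrors the documented top-k bound
        raise ValueError("top_k must be in [1, 10000]")
    pre_mean = mean_feature_activations(pre, w_enc, b_enc)
    post_mean = mean_feature_activations(post, w_enc, b_enc)
    deltas = [(f, post_mean[f] - pre_mean[f]) for f in range(len(b_enc))]
    # Rank by absolute movement so features that switched off also surface.
    deltas.sort(key=lambda item: abs(item[1]), reverse=True)
    return deltas[:top_k]
```

Ranking by absolute delta is a design choice of this sketch; whether the tool ranks signed or absolute movement is not stated here.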
`soup probe sleeper` — defection-agent classifier
soup probe sleeper llama-3-8b --evidence ./activations.json --output ./sleeper.json

Calibrated linear defection probe (per-base, deterministic) applied to a 2D activation tensor. Reports flagged-token rate and verdict:
| Flagged rate | Verdict |
|---|---|
| ≤1% | OK |
| ≤5% | MINOR |
| >5% | MAJOR |
Bundled-base allowlist (Llama-3-8B, Gemma-2-9B, …); no evidence → OK report with 0 tokens (matches v0.56 neutral-mode policy). 16 MiB cap; symlink rejection.
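As a rough illustration of how a linear probe and the verdict table fit together, the sketch below flags a token when its probe logit is positive and maps the flagged rate to a verdict. The probe weights, bias, logit threshold of 0, and the function names are all assumptions made for this example; only the rate thresholds and the no-evidence behavior come from the text above.

```python
from typing import Dict, List, Sequence, Union


def flag_tokens(
    activations: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> List[bool]:
    """Apply a linear probe to each token row; flag when the logit is positive.

    The zero threshold is an assumption; a calibrated probe may use another cut.
    """
    return [
        sum(a * w for a, w in zip(row, weights)) + bias > 0.0
        for row in activations
    ]


def verdict(flagged_rate: float) -> str:
    """Map a flagged-token rate in [0, 1] to the documented verdict bands."""
    if flagged_rate <= 0.01:
        return "OK"
    if flagged_rate <= 0.05:
        return "MINOR"
    return "MAJOR"


def sleeper_report(
    activations: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> Dict[str, Union[int, float, str]]:
    """Build a small report dict (hypothetical shape, not the tool's schema)."""
    if not activations:
        # No evidence: neutral OK report with 0 tokens, as described above.
        return {"tokens": 0, "flagged_rate": 0.0, "verdict": "OK"}
    flags = flag_tokens(activations, weights, bias)
    rate = sum(flags) / len(flags)
    return {"tokens": len(flags), "flagged_rate": rate, "verdict": verdict(rate)}
```

Because the probe is a fixed linear map with no sampling, the same activations always produce the same report, which is consistent with the "deterministic" property stated above.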
`soup probe interference` — N×N adapter compatibility matrix
soup probe interference ./losses.json --output ./matrix.json

Input is operator-measured per-pair losses; output is an N×N catastrophic-interference matrix:
score(A → B) = (loss(A_target | A+B) - loss(A_target | A alone)) / loss(A_target | A alone)

| Score | Verdict |
|---|---|
| <5% | OK |
| <20% | MINOR |
| ≥20% | MAJOR (exit 2, gates CI) |
Bounds: 2..16 adapters (4..256 pairs). Adapter names are limited to 256 characters and are markup-escaped before rendering to prevent injection.
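The scoring rule and verdict bands above can be sketched as follows. The input layout (a dict of standalone losses plus a dict keyed by ordered adapter pairs) is hypothetical, since the schema of `losses.json` is not described here, and treating the diagonal as 0 is likewise an assumption of this example.

```python
from typing import Dict, Tuple


def interference_score(loss_alone: float, loss_paired: float) -> float:
    """Relative increase in A's target loss once B is applied alongside A."""
    if loss_alone <= 0:
        raise ValueError("baseline loss must be positive")
    return (loss_paired - loss_alone) / loss_alone


def verdict(score: float) -> str:
    """Map a score (as a fraction, so 0.05 means 5%) to the documented bands."""
    if score < 0.05:
        return "OK"
    if score < 0.20:
        return "MINOR"
    return "MAJOR"


def build_matrix(
    alone: Dict[str, float],
    paired: Dict[Tuple[str, str], float],
) -> Dict[Tuple[str, str], float]:
    """Compute score(A -> B) for every ordered pair of adapters.

    alone[a] is the loss on A's target with A alone; paired[(a, b)] is the loss
    on A's target with A+B. Diagonal entries are set to 0.0 by assumption.
    """
    names = sorted(alone)
    if not 2 <= len(names) <= 16:  # mirrors the documented adapter bound
        raise ValueError("expected between 2 and 16 adapters")
    matrix: Dict[Tuple[str, str], float] = {}
    for a in names:
        for b in names:
            if a == b:
                matrix[(a, b)] = 0.0
            else:
                matrix[(a, b)] = interference_score(alone[a], paired[(a, b)])
    return matrix


def exit_code(matrix: Dict[Tuple[str, str], float]) -> int:
    """Return 2 if any pair is MAJOR (the CI-gating case), else 0."""
    return 2 if any(verdict(s) == "MAJOR" for s in matrix.values()) else 0
```

For example, with a standalone loss of 1.0 and a paired loss of 1.25, the score is 0.25, which falls in the MAJOR band and would gate CI under this sketch.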
`soup probe pack` — bundled probes per base
soup probe pack llama-3-8b --output ./pack.json
soup probe pack --list

Per-base manifest of calibrated probes (sleeper / sae / truth / harm). Metadata only: no weights embedded (v0.66 ships schema; weights fetcher in v0.66.x). 1..32 probes per pack; per-field caps against operator-controlled-input bloat.
Live influence-function blame
Bundled in v0.66 alongside the probe family: a DataInf-style row attribution runner that walks the training data, computes each example's influence on a target output, and ranks the most causal rows. It composes with v0.67 `soup adapters bisect`: bisect tells you *which checkpoint* broke, blame tells you *which rows* caused it.
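To make "DataInf-style" concrete, here is a minimal single-layer sketch of the closed-form influence approximation from the DataInf method, operating on per-example gradient vectors that are assumed to be already computed. The function names, the flat-list gradient representation, and ranking by absolute score are assumptions of this example and do not describe the runner's actual interface.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def datainf_scores(
    train_grads: Sequence[Sequence[float]],
    target_grad: Sequence[float],
    damping: float,
) -> List[float]:
    """Approximate the influence of each training row on the target loss.

    Uses the per-sample rank-one (Sherman-Morrison) inverse of the damped
    Hessian, averaged over rows, for a single parameter block:
      score_k = (1/damping) * (mean_i[c_i * (g_i . g_k)] - v . g_k)
      c_i     = (v . g_i) / (damping + ||g_i||^2)
    """
    if damping <= 0:
        raise ValueError("damping must be positive")
    if not train_grads:
        raise ValueError("no training gradients supplied")
    n = len(train_grads)
    coeffs = [dot(target_grad, g) / (damping + dot(g, g)) for g in train_grads]
    scores = []
    for g_k in train_grads:
        correction = sum(c * dot(g_i, g_k) for c, g_i in zip(coeffs, train_grads)) / n
        scores.append((correction - dot(target_grad, g_k)) / damping)
    return scores


def rank_rows(scores: Sequence[float], top_k: int) -> List[int]:
    """Return indices of the top_k rows by absolute influence."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return order[:top_k]
```

Under this sign convention, a negative score suggests that upweighting the row would lower the target loss, and a positive score suggests the opposite. The double loop is quadratic in the number of rows, which is acceptable for a sketch but not for large datasets.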
See also
- [Diagnose](/docs/diagnose) — v0.56 6-probe report card; v0.66 adds 4 more probes on top.
- [Adapter lifecycle](/docs/adapter-lifecycle) — v0.67 bisect uses v0.66 blame for row-level attribution.