Eval Depth (v0.65.0)
Failure-mode coverage goes from 6 to 10. The v0.56 diagnose report card stays the daily driver; v0.65 adds four optional, deeper probes that you opt into when a run warrants closer inspection.
`soup eval behavior` — pre/post safety diff
soup eval behavior <run-id> --battery xstest \
  --evidence ./responses.json --output ./behavior.json

Bundled safety, refusal, jailbreak, and sycophancy probe sets, scored pre-FT vs. post-FT against the v0.26 / v0.56 thresholds: OK ≥ 0.85, MINOR ≥ 0.60, MAJOR otherwise.
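The threshold bands can be sketched in a few lines of Python. This is only an illustration: the function name `severity` is hypothetical, and it assumes scores lie in [0, 1] with higher being better and that anything under 0.60 counts as MAJOR.

```python
def severity(score):
    """Map a probe score to a band (assumed: higher is better, range 0..1)."""
    if score >= 0.85:
        return "OK"
    if score >= 0.60:
        return "MINOR"
    return "MAJOR"


def regressed(pre_score, post_score):
    """True when fine-tuning moved a probe into a worse band (hypothetical helper)."""
    order = {"OK": 0, "MINOR": 1, "MAJOR": 2}
    return order[severity(post_score)] > order[severity(pre_score)]
```

Comparing bands rather than raw scores keeps small score jitter within a band from being reported as a regression.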
Five batteries at launch:
| Battery | What it catches |
|---|---|
| xstest | Over-refusal on benign prompts |
| harmbench | Jailbreak resistance |
| jailbreakbench | Jailbreak prompt-pair contrasts |
| elephant | Sycophancy / opinion-shifting |
| syceval | Sycophantic alignment to user |
Evidence is operator-supplied JSON with the keys `pre_responses`, `post_responses`, and `oracle`; soup does no auto-rollout of the model. The file is capped at 16 MiB and opened with O_NOFOLLOW to guard against symlink swaps.
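A minimal sketch of what such a guarded load could look like, assuming a POSIX system. The function name `load_evidence` and the error messages are hypothetical, and the value types under each key are not specified here, so the sketch only checks that the keys exist.

```python
import json
import os

MAX_EVIDENCE_BYTES = 16 * 1024 * 1024  # 16 MiB cap
REQUIRED_KEYS = {"pre_responses", "post_responses", "oracle"}


def load_evidence(path):
    """Open without following a symlink, enforce the size cap, check the keys."""
    # O_NOFOLLOW makes the open fail if the final path component is a symlink.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    with os.fdopen(fd, "rb") as handle:
        # Size is checked on the open descriptor, not the path, so the file
        # cannot be swapped between the check and the read.
        if os.fstat(handle.fileno()).st_size > MAX_EVIDENCE_BYTES:
            raise ValueError("evidence file exceeds 16 MiB")
        data = json.loads(handle.read())
    if not isinstance(data, dict):
        raise ValueError("evidence must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"evidence is missing keys: {sorted(missing)}")
    return data
```

Calling `load_evidence("./responses.json")` returns the parsed object, or raises `OSError` when the path is a symlink.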
`soup eval capability` — lm-eval-harness task surface
soup eval capability <run-id> --suite full --output ./capability.json

Validated lm-eval-harness task IDs for 7 bundled benchmarks: MMLU-Pro, GPQA, BBEH, AIME, MATH-500, HumanEval+, and SWE-bench-Verified. Four profiles: full, fast, math, code.

The operator runs the harness; `soup eval capability` validates the task IDs and emits the runbook.
`soup eval checklist` — MFT / INV / DIR DSL
# spec.yaml
tests:
  - name: capital_facts
    kind: mft  # minimum functionality
    prompts: ["What is the capital of France?"]
    expected: ["Paris"]
  - name: paraphrase_stable
    kind: inv  # invariance under paraphrase
    prompts: ["Capital of FR?", "Tell me FR capital"]
    expected: ["Paris"]
  - name: more_polite
    kind: dir  # directional perturbation
    prompts: ["You're rude.", "You're being unhelpful."]
    expected: ["apolog"]

A CheckList-style behavioral DSL. Each spec holds up to 1,000 tests and is capped at 1 MiB of YAML; the file is opened through `enforce_under_cwd_and_no_symlink`.
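To make the spec shape concrete, here is a rough sketch of validating and scoring an already-parsed spec. Everything beyond the field names and the 1,000-test cap is an assumption: the helper names `validate_spec` and `check_test` are hypothetical, and the pass rule (every response contains at least one `expected` substring, case-insensitively) is a guess suggested by the `"apolog"` stem, not documented behavior.

```python
MAX_TESTS = 1000  # per-spec cap
VALID_KINDS = {"mft", "inv", "dir"}


def validate_spec(spec):
    """Raise ValueError if the parsed spec breaks the cap or uses an unknown kind."""
    tests = spec.get("tests", [])
    if len(tests) > MAX_TESTS:
        raise ValueError(f"spec has {len(tests)} tests; the cap is {MAX_TESTS}")
    for test in tests:
        if test.get("kind") not in VALID_KINDS:
            raise ValueError(f"unknown kind in test {test.get('name')!r}")
    return tests


def check_test(test, responses):
    """Assumed rule: every response contains at least one expected substring."""
    needles = [needle.lower() for needle in test["expected"]]
    return all(
        any(needle in response.lower() for needle in needles)
        for response in responses
    )


# The dict mirrors one entry of spec.yaml after YAML parsing.
spec = {
    "tests": [
        {
            "name": "capital_facts",
            "kind": "mft",
            "prompts": ["What is the capital of France?"],
            "expected": ["Paris"],
        }
    ]
}
```

With this rule, `check_test(spec["tests"][0], ["The capital is Paris."])` passes, while a response that never mentions Paris fails.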
`soup eval irt-subset` — Rasch IRT cost-cut
soup eval irt-subset ./responses.jsonl --size small --output ./plan.json

Fits a 1-parameter Rasch IRT model to per-item correctness signals and selects a minimum-cost subset that preserves ranking power:

- `full` = 100% of items
- `small` = ~30% (Rasch information-weighted)
- `tiny` = ~10%
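The math: P(correct | θ, β) = σ(θ - β), with item information I(β) = σ(-β) · σ(β), which is the general Rasch information σ(θ - β) · σ(β - θ) evaluated at θ = 0. Information peaks at β ≈ 0, so items with a roughly 50/50 chance of a correct answer discriminate best. The kernel is pure Python, with no numpy or scipy dependency.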
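The formulas translate directly into standard-library Python. The sketch below is illustrative rather than the shipped kernel: it assumes item difficulties β have already been fitted, and `select_subset` is a hypothetical top-k-by-information selection that may differ from the actual weighting.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def p_correct(theta, beta):
    """Rasch model: probability that ability theta answers difficulty beta correctly."""
    return sigmoid(theta - beta)


def item_information(beta, theta=0.0):
    """Fisher information P * (1 - P); at theta = 0 this equals sigmoid(-beta) * sigmoid(beta)."""
    p = p_correct(theta, beta)
    return p * (1.0 - p)


def select_subset(difficulties, fraction):
    """Keep the most informative share of items (assumed selection rule)."""
    keep = max(1, math.ceil(fraction * len(difficulties)))
    ranked = sorted(
        difficulties,
        key=lambda item_id: item_information(difficulties[item_id]),
        reverse=True,
    )
    return ranked[:keep]
```

Information is symmetric in β around 0 and tops out at 0.25, so very easy and very hard items are the first to be dropped.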
256 MiB JSONL cap; item-ID validated (≤256 chars, no null bytes).
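The item-ID rule is simple enough to state as a predicate. This is a sketch under the assumption that "chars" means Python string length; `is_valid_item_id` is a hypothetical name.

```python
MAX_ITEM_ID_CHARS = 256


def is_valid_item_id(item_id):
    """Accept strings of at most 256 characters that contain no null byte."""
    return (
        isinstance(item_id, str)
        and len(item_id) <= MAX_ITEM_ID_CHARS
        and "\x00" not in item_id
    )
```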
See also
- [Diagnose](/docs/diagnose) — v0.56 6-probe report card, the lighter daily driver.
- [Post-train x-rays](/docs/post-train-xrays) — v0.66 mechanistic interpretability probes that stack on top of v0.65 behaviour evals.