`soup eval design` — derive evals from your training data
Before v0.55 you wrote your eval suite by hand. Now Soup drafts one from your data.
```shell
soup eval design data.jsonl --goal "polite customer support chat" --num-dimensions 5
```

How it works:
1. TF-IDF salience picks up to `num_dimensions` (default 5) salient terms across the dataset, computed over a subsample capped at 10,000 rows (DoS cap).
2. Goal-keyword dispatch maps each dimension to a scorer:
- json / code / math → rlvr (verifiable reward)
- classify → exact_match
- extract → regex
- default → judge
3. Output: a frozen EvalDesign (JSON) with one EvalDimension per row.
Scorer allowlist: {exact_match, regex, judge, rlvr}.
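Steps 1 and 2 can be sketched in a few lines. This is an illustrative sketch only: `pick_dimensions`, `dispatch_scorer`, and `_DISPATCH` are hypothetical names, not Soup's internals; the 10,000-row cap, the default of 5, and the keyword-to-scorer table are the parts taken from the docs above.

```python
import re
from collections import Counter
from math import log

_SUBSAMPLE = 10_000  # DoS cap from the docs

def pick_dimensions(rows, num_dimensions=5):
    """Rank terms by a simple TF-IDF salience over the capped dataset."""
    docs = [re.findall(r"[a-z]+", r.lower()) for r in rows[:_SUBSAMPLE]]
    df = Counter(t for d in docs for t in set(d))   # document frequency
    tf = Counter(t for d in docs for t in d)        # term frequency
    n = len(docs)
    # Terms present in every row carry no signal (idf = 0), so skip them.
    scores = {t: tf[t] * log(n / df[t]) for t in df if df[t] < n}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [t for t, _ in ranked[:num_dimensions]]

# Keyword table from the docs; first match wins, `judge` is the fallback.
_DISPATCH = [
    ({"json", "code", "math"}, "rlvr"),
    ({"classify"}, "exact_match"),
    ({"extract"}, "regex"),
]

def dispatch_scorer(goal: str) -> str:
    """Goal-keyword dispatch onto the scorer allowlist."""
    words = set(goal.lower().split())
    for keys, scorer in _DISPATCH:
        if keys & words:
            return scorer
    return "judge"
```

The dispatch is deliberately dumb string matching; the interesting guarantee is the closed allowlist, so a design can never reference a scorer the runner doesn't ship.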
`soup eval discover` — canaries
```shell
soup eval discover data.jsonl --num-clusters 5 --per-cluster 3
```

Three sets:
- Held-out canaries — greedy farthest-first clustering on Jaccard distance (`_CLUSTER_SUBSAMPLE = 10_000`).
- Adjacent-skill probes — neighbours that fall just outside the training distribution.
- Memorization probes — 25%-prefix truncation. If the trained model can continue the rest of a training row from its 25% prefix, it memorized that row.
Per-group cap: _MAX_CANARIES_PER_GROUP = 1024.
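The two mechanical pieces, farthest-first selection over Jaccard distance and prefix truncation, look roughly like this. A sketch under assumptions: the helper names and token-set representation are illustrative, not Soup's actual code.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def farthest_first(rows, k):
    """Greedy farthest-first traversal: each pick maximizes the distance
    to its nearest already-chosen row, spreading canaries across the data."""
    sets = [set(r.split()) for r in rows]
    chosen = [0]  # seed with the first row
    while len(chosen) < min(k, len(rows)):
        best = max(
            (i for i in range(len(rows)) if i not in chosen),
            key=lambda i: min(jaccard_distance(sets[i], sets[c]) for c in chosen),
        )
        chosen.append(best)
    return [rows[i] for i in chosen]

def memorization_probe(row: str, frac: float = 0.25) -> str:
    """Keep the first 25% of a training row; the model must continue the rest."""
    cut = max(1, int(len(row) * frac))
    return row[:cut]
```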
`soup eval lock` — pin the suite
```shell
soup eval lock my-design.json
```

Locks the design as a SHA-256-checksummed `eval_suite` artifact via `canonicalise_design_bytes` (canonical JSON for stable hashes across runs). The frozen `LockedSuite` (path / sha256 / dimension_count) is registered in the v0.26 registry alongside the new `canaries` artifact kind.
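The lock step hinges on one idea: hash canonical bytes, not whatever `json.dumps` happens to emit. A minimal sketch, assuming "canonical JSON" means sorted keys plus compact separators (the docs don't spell out the exact canonicalisation):

```python
import hashlib
import json

def canonicalise_design_bytes(design: dict) -> bytes:
    """Sorted keys + compact separators => byte-identical output for
    semantically identical designs, so the SHA-256 is stable across runs."""
    return json.dumps(
        design, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    ).encode("utf-8")

def lock_suite(design: dict) -> dict:
    """Illustrative LockedSuite payload (sha256 / dimension_count)."""
    raw = canonicalise_design_bytes(design)
    return {
        "sha256": hashlib.sha256(raw).hexdigest(),
        "dimension_count": len(design.get("dimensions", [])),
    }
```

Without canonicalisation, re-serialising the same design with a different key order would produce a different checksum and spuriously invalidate the lock.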
`soup eval coverage` — gap analysis
```shell
soup eval coverage my-design.json --task reasoning
```

Checks the locked design against the v0.54.0 `TASK_CATEGORIES` taxonomy and the `_RECOMMENDED_SCORERS` allowlist (e.g. reasoning → (rlvr, judge), format_conversion → (regex, rlvr)). Returns a `CoverageReport` with concrete gap recommendations.
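The gap check reduces to a set difference against the recommended-scorer table. A hypothetical sketch: only the two `_RECOMMENDED_SCORERS` entries shown in the docs are real, and `coverage_report` is an assumed shape for `CoverageReport`, not Soup's actual dataclass.

```python
# Only these two entries come from the docs; the table is otherwise assumed.
_RECOMMENDED_SCORERS = {
    "reasoning": ("rlvr", "judge"),
    "format_conversion": ("regex", "rlvr"),
}

def coverage_report(design_scorers: set, task: str) -> dict:
    """Compare the scorers a locked design uses against the recommendation
    for a task category, and turn each gap into a concrete suggestion."""
    recommended = _RECOMMENDED_SCORERS.get(task, ())
    missing = [s for s in recommended if s not in design_scorers]
    return {
        "task": task,
        "missing": missing,
        "recommendations": [f"add a {s} dimension for {task}" for s in missing],
    }
```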
`soup eval gate-install` — git regression gate
```shell
soup eval gate-install --baseline run-id-7f3a
```

Writes `.git/hooks/pre-push` (written atomically, POSIX mode 0o755) that:
1. Runs your locked eval suite on the current head.
2. Compares each GateThresholds metric (task_accuracy / refusal_rate / format_validity / p95_latency_ms) against the baseline via paired_bootstrap_ci(baseline, candidate, n_samples, ci_level, seed).
- n_samples ∈ [100, 100_000]
- ci_level ∈ (0, 1)
3. decide_regression uses direction-aware metric handling via _METRIC_DIRECTION — higher-is-better for accuracy, lower-is-better for latency.
4. Refuses the push on a RegressionVerdict of REGRESSED.
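The statistics in steps 2–3 can be sketched as follows. Assumptions: "paired bootstrap" is taken to mean resampling per-example (baseline, candidate) score pairs together; the string verdicts stand in for the `RegressionVerdict` enum; the directions for `refusal_rate` and `format_validity` are guesses, since the docs only name accuracy and latency explicitly.

```python
import random

def paired_bootstrap_ci(baseline, candidate, n_samples=1000, ci_level=0.95, seed=0):
    """Bootstrap CI on the mean candidate-minus-baseline difference,
    resampling paired per-example diffs so correlation is preserved."""
    assert 100 <= n_samples <= 100_000 and 0 < ci_level < 1
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    means = []
    for _ in range(n_samples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((1 - ci_level) / 2 * n_samples)]
    hi = means[int((1 + ci_level) / 2 * n_samples) - 1]
    return lo, hi

# +1 = higher is better, -1 = lower is better. Accuracy and latency are
# from the docs; the refusal_rate / format_validity signs are assumptions.
_METRIC_DIRECTION = {
    "task_accuracy": +1,
    "refusal_rate": -1,
    "format_validity": +1,
    "p95_latency_ms": -1,
}

def decide_regression(metric, ci):
    """REGRESSED only when the whole CI sits on the bad side of zero."""
    lo, hi = ci
    if _METRIC_DIRECTION[metric] > 0:
        return "REGRESSED" if hi < 0 else "OK"
    return "REGRESSED" if lo > 0 else "OK"
```

Requiring the entire interval to clear zero, rather than comparing point estimates, is what keeps the gate from flaking on noisy eval runs.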
The hook script is rendered via render_pre_push_hook with shlex.quote for every interpolated path — no shell injection.
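The injection-safe rendering comes down to quoting every interpolated value before it touches the shell. A rough sketch, with loud caveats: the real `render_pre_push_hook` template is not shown in the docs, and the `soup eval run` invocation inside it is invented for illustration; only the `shlex.quote` discipline, the atomic write, and the 0o755 mode are from the docs.

```python
import shlex
from pathlib import Path

def render_pre_push_hook(suite_path: str, baseline: str) -> str:
    # shlex.quote every interpolated value so a hostile path or run id
    # cannot break out of the command line.
    return (
        "#!/bin/sh\n"
        f"exec soup eval run {shlex.quote(suite_path)} "
        f"--baseline {shlex.quote(baseline)}\n"
    )

def install_hook(repo: Path, script: str) -> None:
    hook = repo / ".git" / "hooks" / "pre-push"
    tmp = hook.with_suffix(".tmp")
    tmp.write_text(script)
    tmp.chmod(0o755)   # POSIX executable bits before the rename
    tmp.replace(hook)  # atomic rename: the hook is never half-written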
See also
- [Quant-check](/docs/quant-check) — same idea, but for quant-induced regression
- [Eval-gated training](/docs/eval-gate) — halt training when quality drops
- [Registry](/docs/registry) — where `eval_suite` and `canaries` artifacts live