# Eval-Gated Training

v0.26.0 adds declarative eval suites that run at epoch boundaries. If a task score falls below its threshold — or regresses against a baseline — training halts before you waste another epoch.

## The gate file

Every entry in `tasks:` is one of three types: `custom`, `judge`, or `benchmark`. Each task has a `name` (used as the baseline key) and a numeric `threshold`.

```yaml
# evals/gate.yaml
suite: chat-quality
tasks:
  - type: custom
    name: tool_calls
    tasks: evals/tool_calls.jsonl
    scorer: exact            # exact | contains | regex | semantic
    threshold: 0.70

  - type: judge
    name: chat_judge
    prompts: evals/judge_prompts.jsonl
    judge_model: ollama://llama3.1     # or https://... or http://localhost
    threshold: 7.0

  - type: benchmark
    name: mmlu
    benchmark: mini_mmlu
    threshold: 0.60
```
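
The `scorer` modes for `custom` tasks can be read as simple string comparisons. A minimal Python sketch of how `exact`, `contains`, and `regex` might score a prediction against a reference (`semantic` would need an embedding model, so it is left out; the function names here are illustrative, not the tool's actual API):

```python
import re

def score_one(scorer: str, prediction: str, reference: str) -> float:
    """Return 1.0 on a match, 0.0 otherwise (illustrative semantics)."""
    if scorer == "exact":
        return float(prediction.strip() == reference.strip())
    if scorer == "contains":
        return float(reference in prediction)
    if scorer == "regex":
        return float(re.search(reference, prediction) is not None)
    raise ValueError(f"unsupported scorer: {scorer}")

def task_score(scorer: str, pairs: list[tuple[str, str]]) -> float:
    """Mean per-example score; this is what gets compared to `threshold`."""
    return sum(score_one(scorer, p, r) for p, r in pairs) / len(pairs)
```

Under this reading, a `threshold: 0.70` on `tool_calls` means at least 70% of the JSONL examples must match.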

## Enable it

Either inline in `soup.yaml`:

```yaml
training:
  eval_gate: ./evals/gate.yaml
```

…or on the command line:

```bash
soup train --config soup.yaml --gate ./evals/gate.yaml
```

## Post-hoc verdict

Run the gate standalone against any model:

```bash
soup eval gate --suite ./evals/gate.yaml
# ✓ mmlu:        0.648 (baseline 0.643, +0.005) PASS
# ✗ chat_judge:  7.1   (baseline 8.2, -1.1)     REGRESSION
# → verdict: FAIL
```

A task fails if its score is below its threshold *or* if it drops more than the configured regression threshold (default 0.05) below the supplied baseline.

## Baselines

- `registry://<id>` — pulls eval results for the referenced [registry](/docs/registry) entry
- `./baseline.json` — a JSON map of `{task_name: score}`
- omitted — tasks are judged only against their `threshold`
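
For the file form, a small sketch of reading and validating the `{task_name: score}` map (the validation rules here are assumptions, not documented behavior):

```python
import json

def parse_baseline(data: dict) -> dict[str, float]:
    """Validate a {task_name: score} map and normalize scores to float."""
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in data.values()):
        raise ValueError("baseline scores must be numeric")
    return {name: float(score) for name, score in data.items()}

def load_baseline(path: str) -> dict[str, float]:
    """Load a ./baseline.json-style file keyed by task name."""
    with open(path) as f:
        return parse_baseline(json.load(f))
```

The keys must match each task's `name` in the gate file, since that is the baseline key.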

## Judge URL allowlist

`judge_model` must use one of: `ollama://`, `https://`, or `http://localhost` / `http://127.0.0.1`. Any other scheme is rejected at load time — an SSRF guard on the eval path.
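
The check amounts to a scheme-and-host allowlist. A minimal Python sketch of one way to implement it (an assumption about the behavior, not the tool's actual code):

```python
from urllib.parse import urlparse

LOOPBACK_HOSTS = {"localhost", "127.0.0.1"}

def judge_url_allowed(url: str) -> bool:
    """Allow ollama:// and https:// anywhere; http:// only to loopback."""
    parsed = urlparse(url)
    if parsed.scheme in ("ollama", "https"):
        return True
    if parsed.scheme == "http":
        return parsed.hostname in LOOPBACK_HOSTS
    return False  # file://, ftp://, gopher://, etc. are rejected
```

Note that plain `http://` to any non-loopback host is rejected, which is what blocks a gate file from pointing the judge at internal network addresses.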

## See also

- [Registry](/docs/registry) — the typical baseline source
- [Evaluation](/docs/experiments) — the broader eval platform