Data Engineering Pro (v0.69.0)
Dataset prep stops being "throw a JSONL at the trainer" and becomes a first-class engineering discipline. dbt-shaped DAG, Great-Expectations suite, Magpie synth, Persona-Hub diversity, and the arXiv 2510.13928 brain-rot detector.
`soup build` — dbt-shaped DAG of dataset transforms
    # manifest.yaml
    models:
      - name: chats_raw
        kind: table
        sql: SELECT prompt, response FROM source.thumbs WHERE thumb = 'up'
      - name: chats_decontaminated
        kind: incremental
        refs: [chats_raw]
        sql: SELECT * FROM `{chats_raw}` WHERE NOT contaminated
      - name: chats_train
        kind: view
        refs: [chats_decontaminated]
        sql: SELECT * FROM `{chats_decontaminated}`

    soup build manifest.yaml --dry-run   # validates topology, exits 0 (LIVE today)
    soup build manifest.yaml             # materialises (v0.69.1)

- Closed `SUPPORTED_MODEL_KINDS = {incremental, table, view}`
- Topo-sort via Kahn's algorithm
- Re-tokenise only changed rows: `compute_row_hash` (SHA-256 over canonical JSON, `id` field excluded) + `incremental_diff(prev, new)` → `{added, changed, removed, unchanged}`
- DoS caps: `_MAX_MODELS=256`, `_MAX_REFS_PER_MODEL=32`, `_MAX_FILE_BYTES=1 MiB`
- Live `run_build` materialiser (DuckDB / SQLite backend, live transform-resolver registry) ships in v0.69.1
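
The topology check and the changed-row detection listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the shipped implementation: the function names `sketch_topo_sort`, `sketch_row_hash`, and `sketch_incremental_diff` are hypothetical, rows are assumed to be dicts carrying an `id` key, and "canonical JSON" is assumed to mean sorted keys with compact separators.

```python
import hashlib
import json
from collections import deque


def sketch_topo_sort(models):
    """Order model names so every ref precedes its dependents (Kahn's algorithm)."""
    refs = {m["name"]: list(m.get("refs", [])) for m in models}
    for name, deps in refs.items():
        for dep in deps:
            if dep not in refs:
                raise ValueError(f"{name} refs unknown model {dep}")
    indegree = {name: len(deps) for name, deps in refs.items()}
    dependents = {name: [] for name in refs}
    for name, deps in refs.items():
        for dep in deps:
            dependents[dep].append(name)
    # Start from models with no refs; sorted() only makes the order reproducible.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in dependents[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(refs):
        raise ValueError("cycle detected in model refs")
    return order


def sketch_row_hash(row, id_field="id"):
    """SHA-256 over a canonical JSON dump of the row, with the id field excluded."""
    payload = {k: v for k, v in row.items() if k != id_field}
    # Assumption: canonical JSON = sorted keys, compact separators, UTF-8.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sketch_incremental_diff(prev, new, id_field="id"):
    """Classify row ids as added / changed / removed / unchanged by content hash."""
    prev_hashes = {r[id_field]: sketch_row_hash(r, id_field) for r in prev}
    new_hashes = {r[id_field]: sketch_row_hash(r, id_field) for r in new}
    diff = {"added": [], "changed": [], "removed": [], "unchanged": []}
    for row_id, digest in new_hashes.items():
        if row_id not in prev_hashes:
            diff["added"].append(row_id)
        elif prev_hashes[row_id] != digest:
            diff["changed"].append(row_id)
        else:
            diff["unchanged"].append(row_id)
    diff["removed"] = [rid for rid in prev_hashes if rid not in new_hashes]
    return diff
```

Only the ids in `added` and `changed` would need re-tokenising. Excluding the id from the hash means the digest describes row content alone, so a row whose id is reassigned but whose content is untouched hashes the same.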
`soup expect` — Great Expectations for chat data (LIVE)
    # suite.yaml
    expectations:
      - expect_no_pii
      - expect_token_length_between: {min: 32, max: 2048}
      - expect_no_refusal_pattern
      - expect_chosen_preferred_over_rejected_by_judge:
          judge: openai/gpt-4o-mini
          min_win_rate: 0.55

    soup expect ./data.jsonl ./suite.yaml
    # exit 0 = pass; 2 = validation rejection; 3 = suite failure

Closed allowlist: `expect_no_pii` (reuses v0.47 Presidio), `expect_token_length_between`, `expect_no_refusal_pattern` (reuses v0.56 refusal detector), `expect_chosen_preferred_over_rejected_by_judge` (reuses v0.19 judge surface). Walks the `text`, `content`, `output`, `prompt`, `instruction`, `response` top-level keys plus `messages[].content` arrays. `_MAX_SUITE_LEN=64`. Drop into CI between `soup data` and `soup train`.
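
To make the field walk concrete, here is a minimal sketch of a token-length expectation over one record. It is illustrative only: `iter_text_fields` and `token_length_between` are hypothetical names, whitespace splitting stands in for whatever tokenizer the real check uses, and summing tokens across all fields (rather than checking each field separately) is an assumption.

```python
# Top-level keys named in the text above.
TEXT_KEYS = ("text", "content", "output", "prompt", "instruction", "response")


def iter_text_fields(record):
    """Yield string values from the top-level text keys and messages[].content."""
    for key in TEXT_KEYS:
        value = record.get(key)
        if isinstance(value, str):
            yield value
    messages = record.get("messages")
    if isinstance(messages, list):
        for message in messages:
            if isinstance(message, dict) and isinstance(message.get("content"), str):
                yield message["content"]


def token_length_between(record, min_tokens, max_tokens):
    """Return True when the record's total token count lies in [min, max]."""
    # Assumption: whitespace tokens approximate the real tokenizer.
    total = sum(len(text.split()) for text in iter_text_fields(record))
    return min_tokens <= total <= max_tokens
```

Non-string values are skipped rather than coerced, so a malformed record contributes zero tokens and fails any suite with a positive minimum.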
`soup data gen-magpie` — synthetic data via chat-template-prefix harvest
    soup data gen-magpie --base meta-llama/Llama-3-8B \
      --provider ollama --target 50000 \
      --output ./synth.jsonl

Magpie trick: prime the base model with just the assistant chat-template prefix and let it complete the prompt itself. Reuses the v0.20 provider stack (Ollama / Anthropic / vLLM). `_MAX_TARGET_ROWS=1_000_000`, `_MAX_BASE_MODEL_LEN=512`. Live `run_magpie` loop + v0.47 quality-filter chain integration ship in v0.69.1.
`soup data persona-mix` — Persona-Hub diversity sampler (LIVE)
    soup data persona-mix --prompts ./prompts.jsonl \
      --n 20000 --output ./diverse.jsonl
    # Optional BYO: --personas tencent-200k.jsonl --styles styles.jsonl

Bundled 12 personas × 5 writing styles (BYO Tencent 200k corpus via `--personas` / `--styles`). Deterministic by seed (`random.Random(seed)`). `compute_topic_diversity` = Shannon entropy over pooled whitespace tokens. Atomic JSONL write + cwd-contained input + `enforce_under_cwd_and_no_symlink` on output. Caps: 100 MiB / 100k entries per loader.
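
The seeded sampling and the entropy metric can be sketched as follows. This is a rough illustration, not the bundled code: the persona and style lists are made-up placeholders, `persona_mix` and `topic_diversity` are hypothetical names, and measuring entropy in bits (log base 2) is an assumption.

```python
import math
import random
from collections import Counter

# Hypothetical stand-ins for the bundled persona and style lists.
PERSONAS = ["a retired nurse", "a physics teacher", "a startup founder"]
STYLES = ["terse", "formal", "chatty"]


def persona_mix(prompts, n, seed=0):
    """Draw n (prompt, persona, style) rows from a private, seeded RNG."""
    rng = random.Random(seed)  # isolated from the global random state
    return [
        {
            "prompt": rng.choice(prompts),
            "persona": rng.choice(PERSONAS),
            "style": rng.choice(STYLES),
        }
        for _ in range(n)
    ]


def topic_diversity(texts):
    """Shannon entropy in bits over whitespace tokens pooled across all texts."""
    counts = Counter(token for text in texts for token in text.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps a run reproducible even when other code touches the global generator.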
`soup data brain-rot` — AI-slop detector (LIVE)
    soup data brain-rot ./data.jsonl --strict --max-major-fraction 0.25
    # exit 3 if MAJOR fraction > 0.25

arXiv 2510.13928 brain-rot detector. Two pure-Python scorers:
- `score_triviality`: token-diversity inversion + `!!` / `??` punctuation runs + low-effort token density + length penalty
- `score_popularity_signal`: clickbait phrase scan + emoji U+1F300–U+1FAFF density

Worst-signal-wins: 1.0 − max(triviality, popularity). Bands match v0.26/v0.56/v0.65: OK ≥ 0.85, MINOR ≥ 0.60, else MAJOR. `refuse_if_rotten` raises when the MAJOR fraction exceeds the threshold. English-keyword-only in v0.69.0; multilingual lands in v0.69.1.
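
The combination rule, the bands, and the refusal threshold are simple enough to sketch directly. The helper names below (`emoji_density`, `combine`, `band`, `check_not_rotten`) are hypothetical, the two scorer inputs are assumed to lie in [0, 1], and the scorers themselves are not reproduced because their weights are not given here.

```python
def emoji_density(text):
    """Fraction of characters falling in the U+1F300..U+1FAFF range."""
    if not text:
        return 0.0
    hits = sum(1 for ch in text if 0x1F300 <= ord(ch) <= 0x1FAFF)
    return hits / len(text)


def combine(triviality, popularity):
    """Worst-signal-wins: the larger rot signal sets the quality score."""
    return 1.0 - max(triviality, popularity)


def band(score):
    """Map a quality score to the OK / MINOR / MAJOR taxonomy."""
    if score >= 0.85:
        return "OK"
    if score >= 0.60:
        return "MINOR"
    return "MAJOR"


def check_not_rotten(bands, max_major_fraction=0.25):
    """Raise when the share of MAJOR rows exceeds the threshold."""
    fraction = bands.count("MAJOR") / len(bands) if bands else 0.0
    if fraction > max_major_fraction:
        raise ValueError(
            f"MAJOR fraction {fraction:.2f} exceeds {max_major_fraction:.2f}"
        )
    return fraction
```

Taking the maximum rather than an average means a row that is strongly clickbait but lexically diverse still lands in MAJOR; one clean signal cannot dilute the other.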
Cross-cutting hardening
Refactored 3 duplicate TOCTOU blocks behind a new shared `paths.enforce_under_cwd_and_no_symlink` helper (code-review CRITICAL fix). v0.70 and future releases reuse it.
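
For readers unfamiliar with this kind of guard, a containment check of roughly this shape is what the helper's name suggests. The sketch below is a guess at the general idea, not the project's code: `ensure_under_cwd_no_symlink` is a hypothetical name, and the exact rules the real helper applies are not described here.

```python
from pathlib import Path


def ensure_under_cwd_no_symlink(path):
    """Return the resolved path, or raise if it is a symlink or escapes the cwd."""
    candidate = Path(path)
    # Reject a symlink at the final path component outright.
    if candidate.is_symlink():
        raise ValueError(f"refusing symlink: {candidate}")
    cwd = Path.cwd().resolve()
    # resolve() also follows symlinked parent directories, so a parent that
    # points outside the working directory is caught by the check below.
    resolved = candidate.resolve()
    if resolved != cwd and cwd not in resolved.parents:
        raise ValueError(f"path escapes working directory: {candidate}")
    return resolved
```

A check like this narrows the window but does not close it: the file can still be swapped between the check and the later open. Callers that need a stronger guarantee on POSIX typically also open with `os.O_NOFOLLOW`.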
Numbers
+262 tests in v0.69.0 (11,225 → 11,487) across 5 new test files. 3 POSIX-only symlink tests skip on Windows.
See also
- [Anti-trend insurance (v0.68)](/docs/anti-trend-insurance): the TOCTOU helper this consolidates was first lifted here.
- [Loop hardening (v0.70)](/docs/loop-hardening): `soup expect` gates data going into `--reward-hack-detector` runs.
- [Eval depth (v0.65)](/docs/eval-depth): same OK / MINOR / MAJOR taxonomy.