Data Forge & Quality Moat (v0.47.0)

soup data forge — synthetic data pipeline

The pipeline runs four stages: chunk the input, judge each chunk, active-prune, and write JSONL with provenance.

bash
soup data forge --input ./docs --output ./synth.jsonl

Each row carries provenance (source doc, chunk offset, judge score).
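As a rough illustration of how the provenance fields might be used downstream, the sketch below parses a few JSONL rows and keeps those above a judge-score threshold. The field names (`text`, `source_doc`, `chunk_offset`, `judge_score`) and the sample rows are assumptions made for this example. The actual schema written by `soup data forge` is not specified here and may differ.

```python
import json

# Hypothetical rows: the field names below are assumed for illustration,
# not taken from the real `soup data forge` output schema.
SAMPLE_JSONL = """\
{"text": "What does the installer check first?", "source_doc": "docs/install.md", "chunk_offset": 0, "judge_score": 0.91}
{"text": "Misc footer text", "source_doc": "docs/install.md", "chunk_offset": 2048, "judge_score": 0.12}
{"text": "How are checkpoints resumed?", "source_doc": "docs/train.md", "chunk_offset": 512, "judge_score": 0.77}
"""


def load_rows(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def keep_high_scoring(rows, min_score):
    """Keep rows whose (assumed) judge_score field is at least min_score."""
    return [row for row in rows if row["judge_score"] >= min_score]


rows = load_rows(SAMPLE_JSONL)
kept = keep_high_scoring(rows, min_score=0.5)
for row in kept:
    # Provenance lets each kept row be traced back to its source chunk.
    print(row["source_doc"], row["chunk_offset"], row["judge_score"])
```

Because every row names its source document and offset, a low-scoring or suspicious row can be traced back to the exact chunk that produced it.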

soup data score — composite quality scorecard

One pass combines five filters: PII, toxicity, language detection, educational value, and benchmark decontamination.

bash
soup data score ./train.jsonl --output ./scored.jsonl

Each filter is also addressable individually:

  • soup data decontaminate: drop rows that overlap public benchmarks (n-gram heuristic)
  • soup data toxicity: keyword baseline today; Llama-Guard backend planned for v0.47.1
  • soup data langdetect: assign a 2-letter language code to each row
  • soup data pii: flag email / phone / SSN / credit-card patterns
  • soup data educational: educational-value score per row, in the range [0, 1]
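To make two of the filters above more concrete, here is a minimal sketch of pattern-based PII flagging and an n-gram overlap check. It is not the implementation used by `soup`. The regular expressions are deliberately simple and cover only email and SSN, and the function names, the sample benchmark text, and the n-gram size of 8 are all assumptions chosen for illustration.

```python
import re

# Illustrative patterns only; a production PII filter would need many more
# cases (phone numbers, credit cards, locale-specific formats).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def flag_pii(text):
    """Return the sorted names of the patterns that match anywhere in text."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlaps_benchmark(row_text, benchmark_texts, n=8):
    """True if row_text shares at least one word n-gram with any benchmark text.

    n=8 is an arbitrary choice for this sketch, not a documented default.
    """
    row_grams = word_ngrams(row_text, n)
    return any(row_grams & word_ngrams(b, n) for b in benchmark_texts)


benchmark = ["The quick brown fox jumps over the lazy dog near the river bank"]
print(flag_pii("Contact jane@example.com, SSN 123-45-6789"))
print(overlaps_benchmark("As noted, the quick brown fox jumps over the lazy dog today", benchmark))
```

A smaller `n` catches more paraphrased overlap at the cost of more false positives, so the threshold is worth tuning against a held-out sample.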