Data Augmentation
`soup data augment` expands a JSONL dataset with LLM-driven rewrites. Added in v0.25.0.
Usage
```bash
# Rephrase each example 3x
soup data augment \
  --input data.jsonl \
  --output augmented.jsonl \
  --strategy rephrase \
  --count 3

# Translate to other languages
soup data augment --input data.jsonl --strategy translate --lang ru,zh,es

# Rewrite in multiple styles
soup data augment --input data.jsonl --strategy style --styles formal,casual,technical
```

Strategies
| Strategy | Purpose |
|---|---|
| `rephrase` | Preserve meaning, diversify wording |
| `translate` | Add examples in other languages for multilingual fine-tuning |
| `style` | Rewrite in a different tone (formal, casual, technical, etc.) |
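
One way to picture the three strategies is as different instruction templates wrapped around the same source text. The sketch below is illustrative only: `build_instruction` and its template wording are hypothetical and are not taken from soup, whose actual prompts are not shown in this document.

```python
from typing import Optional


def build_instruction(strategy: str, text: str, target: Optional[str] = None) -> str:
    """Return an illustrative LLM instruction for one augmentation strategy.

    Hypothetical helper: the template wording is an assumption, not soup's.
    """
    if strategy == "rephrase":
        # Keep the meaning, vary the wording.
        return f"Rephrase the following text, preserving its meaning:\n{text}"
    if strategy == "translate":
        if not target:
            raise ValueError("translate needs a target language, e.g. 'ru'")
        return f"Translate the following text into {target}:\n{text}"
    if strategy == "style":
        if not target:
            raise ValueError("style needs a target style, e.g. 'formal'")
        return f"Rewrite the following text in a {target} style:\n{text}"
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    print(build_instruction("translate", "Hello, world", target="es"))
```

Here `target` stands in for one entry of `--lang` or `--styles`; how soup actually pairs entries with examples is not specified here.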
Providers
Augmentation reuses the providers from `soup data generate`: `openai`, `ollama`, `anthropic`, `server`, `vllm`. Select one with `--provider`.
```bash
soup data augment \
  --input data.jsonl \
  --strategy rephrase \
  --count 2 \
  --provider ollama \
  --dedup \
  --validate
```

Flags
| Flag | Meaning |
|---|---|
| `--strategy` | `rephrase` / `translate` / `style` |
| `--count` | Augmentation multiplier (default 2, max 10) |
| `--lang` | Target languages for `translate` |
| `--styles` | Style list for `style` |
| `--validate` | Run format validation on the augmented output |
| `--dedup` | Deduplicate augmented + original data |
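
To make the effect of `--dedup` concrete, here is a minimal sketch of deduplicating original and augmented records together. It assumes exact-match comparison on the whole record; the document does not say how soup compares records, so treat `dedup_records` and the `text` field as hypothetical.

```python
import json


def dedup_records(records):
    """Drop exact duplicate records while keeping first-seen order."""
    seen = set()
    unique = []
    for record in records:
        # Serialize with sorted keys so key order does not affect equality.
        key = json.dumps(record, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


if __name__ == "__main__":
    original = [{"text": "Hello world"}]
    augmented = [{"text": "Hi, world"}, {"text": "Hello world"}]
    combined = dedup_records(original + augmented)
    print(len(combined))  # 2: the repeated "Hello world" record is dropped
```

Passing the originals first means that, when a rewrite collides with a source example, the source example is the copy that survives.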
Safety
- Input/output paths are resolved and constrained to the current working directory.
- Output format must match the input format, enforced after generation.
- `--count` is capped at 10 to prevent accidental 1000× blow-ups.
- Requests honor `--requests-per-minute`, inherited from `soup data generate`.
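
The path and count checks above can be sketched roughly as follows. This is not soup's implementation: `resolve_inside_cwd` and `check_count` are hypothetical names, and the lower bound of 1 on the count is an assumption (the document states only the cap of 10).

```python
from pathlib import Path

MAX_COUNT = 10  # cap stated in this document


def resolve_inside_cwd(path_str: str) -> Path:
    """Resolve a path and reject it if it escapes the current working directory."""
    cwd = Path.cwd().resolve()
    resolved = (cwd / path_str).resolve()
    try:
        # relative_to raises ValueError when `resolved` is not under `cwd`.
        resolved.relative_to(cwd)
    except ValueError:
        raise ValueError(f"path escapes working directory: {path_str}") from None
    return resolved


def check_count(count: int) -> int:
    """Accept counts from 1 (assumed lower bound) up to MAX_COUNT."""
    if not 1 <= count <= MAX_COUNT:
        raise ValueError(f"count must be between 1 and {MAX_COUNT}, got {count}")
    return count


if __name__ == "__main__":
    print(resolve_inside_cwd("data.jsonl").name)
    print(check_count(3))
```

Resolving before comparing matters: a path such as `sub/../../data.jsonl` looks local until `..` segments and symlinks are collapsed.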