Data Augmentation

soup data augment expands a JSONL dataset with LLM-driven rewrites. Added in v0.25.0.

Usage

```bash
# Rephrase each example 3x
soup data augment \
  --input data.jsonl \
  --output augmented.jsonl \
  --strategy rephrase \
  --count 3

# Translate to other languages
soup data augment --input data.jsonl --strategy translate --lang ru,zh,es

# Rewrite in multiple styles
soup data augment --input data.jsonl --strategy style --styles formal,casual,technical
```

Strategies

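| Strategy  | Purpose                                                        |
|-----------|----------------------------------------------------------------|
| rephrase  | Preserve meaning, diversify wording                            |
| translate | Add examples in other languages for multilingual fine-tuning   |
| style     | Rewrite in formal, casual, technical, or other tones           |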

Providers

Augmentation reuses the generate providers: openai, ollama, anthropic, server, vllm. Select with --provider.

```bash
soup data augment \
  --input data.jsonl \
  --strategy rephrase \
  --count 2 \
  --provider ollama \
  --dedup \
  --validate
```
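
The --validate flag in the command above runs format validation on the augmented output. This page does not spell out the checks, so the following is only a rough sketch of one plausible check: every output line parses as a JSON object with the same top-level keys as the first input record. The function name `mismatched_records` and the `prompt` / `response` field names are hypothetical and are not part of soup.

```python
import json


def mismatched_records(input_lines, output_lines):
    """Return indices of output lines that do not match the input's key set.

    Assumption: "format" means the set of top-level JSON keys, taken from
    the first input record. The real validator may check more or less.
    """
    reference = set(json.loads(input_lines[0]).keys())
    bad = []
    for i, line in enumerate(output_lines):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(i)  # not valid JSON at all
            continue
        if not isinstance(record, dict) or set(record.keys()) != reference:
            bad.append(i)  # wrong shape or wrong keys
    return bad


# Hypothetical records, for illustration only.
inp = ['{"prompt": "Hi", "response": "Hello"}']
out = [
    '{"prompt": "Hey", "response": "Hello"}',  # matches
    '{"text": "Hey"}',                         # different keys
    'not json',                                # unparseable
]
print(mismatched_records(inp, out))  # [1, 2]
```

A check of this kind is cheap to run on the whole output file, which makes it a reasonable default after any LLM-driven rewrite.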

Flags

| Flag       | Meaning                                        |
|------------|------------------------------------------------|
| --strategy | rephrase / translate / style                   |
| --count    | Augmentation multiplier (default 2, max 10)    |
| --lang     | Target languages for translate                 |
| --styles   | Style list for style                           |
| --validate | Run format validation on the augmented output  |
| --dedup    | Deduplicate augmented + original data          |
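
To make the effect of --dedup concrete, here is a minimal sketch of exact-match deduplication over original plus augmented JSONL lines. It assumes two records are duplicates when they are equal after key order is normalized; soup's actual comparison (for example, whether it also catches near-duplicates) is not documented here. The helper name `dedup_jsonl_lines` and the `prompt` / `response` fields are hypothetical.

```python
import json


def dedup_jsonl_lines(lines):
    """Drop exact-duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Canonical form: sorted keys, so key order does not affect equality.
        key = json.dumps(record, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records, for illustration only.
original = ['{"prompt": "Hi", "response": "Hello"}']
augmented = [
    '{"prompt": "Hey", "response": "Hello"}',
    '{"response": "Hello", "prompt": "Hi"}',  # same as the original, keys reordered
]
merged = dedup_jsonl_lines(original + augmented)
print(len(merged))  # 2
```

Keeping the first occurrence means original records win over augmented copies when the originals are listed first.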

Safety

  • Input/output paths are resolved and constrained to the current working directory.
  • Output format must match the input format, enforced after generation.
  • --count is capped at 10 to prevent accidental 1000× blow-ups.
  • Requests honor --requests-per-minute inherited from soup data generate.
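
The first and third safeguards can be illustrated with a short sketch. It is not soup's implementation: `resolve_inside` and `check_count` are hypothetical helpers, and the lower bound of 1 on the count is an assumption (the page only states the maximum of 10).

```python
from pathlib import Path

MAX_COUNT = 10  # cap stated in the Safety list


def resolve_inside(path_str, base):
    """Resolve path_str against base and reject anything that escapes base."""
    base_resolved = Path(base).resolve()
    # Joining with an absolute path_str yields path_str itself, so absolute
    # paths outside base are rejected by the same check.
    candidate = (base_resolved / path_str).resolve()
    if not candidate.is_relative_to(base_resolved):  # requires Python 3.9+
        raise ValueError(f"{path_str!r} resolves outside {base_resolved}")
    return candidate


def check_count(count):
    """Reject multipliers outside 1..MAX_COUNT (lower bound is an assumption)."""
    if not 1 <= count <= MAX_COUNT:
        raise ValueError(f"--count must be between 1 and {MAX_COUNT}, got {count}")
    return count


cwd = Path.cwd()
print(resolve_inside("augmented.jsonl", cwd).name)  # augmented.jsonl
try:
    resolve_inside("../outside.jsonl", cwd)
except ValueError as exc:
    print("rejected:", exc)
```

Resolving before comparing is what matters here: a purely textual prefix check would accept paths such as `subdir/../../outside.jsonl`.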