# Data Pipeline Pro (v0.42.0)

18 features across 6 parts. Axolotl + LLaMA-Factory data-layer parity.
## 5 new data formats

```yaml
data:
  train: ./data/train.jsonl
  format: prm        # PRM stepwise-supervised
  # | pre_tokenized  # LF tokenized_path / Axolotl empty
  # | input_output   # Axolotl template-free segments+labels
  # | video
  # | multimodal     # Axolotl content-parts schema
```

## Remote URIs
7-entry scheme allowlist for object stores. Bucket names must match `^[a-zA-Z0-9][a-zA-Z0-9._\-]{0,62}$`. URIs carrying userinfo, fragments, or query strings are rejected.
```yaml
data:
  train: s3://my-bucket/train.jsonl # s3 | gs | gcs | az | abfs | abfss | oci
  streaming: true
  buffer_size: 10000
  shards: 16
```

Live fsspec loaders ship in v0.42.1 (schema-only in v0.42.0).
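The scheme allowlist and bucket regex above can be sketched as a standalone check. This is illustrative only — `validate_remote_uri` is our name, not the library's API — and it assumes `urllib.parse` splits object-store URIs adequately:

```python
import re
from urllib.parse import urlparse

# 7-entry scheme allowlist from the release notes.
ALLOWED_SCHEMES = {"s3", "gs", "gcs", "az", "abfs", "abfss", "oci"}
# Bucket-name regex from the release notes.
BUCKET_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9._\-]{0,62}$")

def validate_remote_uri(uri: str) -> bool:
    """Screen a remote dataset URI: allowlisted scheme, clean authority,
    bucket name matching the regex."""
    parsed = urlparse(uri)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False
    # Userinfo, query strings, and fragments are rejected outright.
    if "@" in parsed.netloc or parsed.query or parsed.fragment:
        return False
    return bool(BUCKET_RE.match(parsed.netloc))
```

Under this sketch, `s3://my-bucket/train.jsonl` passes, while any URI carrying `user@`, `?query`, or `#fragment` fails.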
## Multi-dataset interleave
```yaml
data:
  train:
    - ./data/sft.jsonl
    - ./data/preference.jsonl
  interleave:
    strategy: probs # concat | under | over | probs
    probs: [0.7, 0.3]
```

Probs are validated as 2–32 entries summing to 1.0 ± 1e-6, each in (0, 1].
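That validation rule is compact enough to sketch directly (`validate_probs` is a hypothetical helper name, not the library's API):

```python
def validate_probs(probs: list[float]) -> bool:
    """Interleave probabilities: 2-32 entries, each in (0, 1],
    summing to 1.0 within 1e-6."""
    if not 2 <= len(probs) <= 32:
        return False
    if any(not (0.0 < p <= 1.0) for p in probs):
        return False
    return abs(sum(probs) - 1.0) <= 1e-6
```

The ± 1e-6 tolerance matters: `[0.7, 0.3]` sums to `0.9999999999999999` in binary floating point, so an exact equality check would wrongly reject it.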
## Advanced masking + vocab expansion

New options: `mask_history`, `train_on_prompt` (mutually exclusive with `train_on_responses_only`), `eval_on_each_dataset`, `split_thinking` (Qwen3 `<think>` masking), `image_min_pixels` / `image_max_pixels`, `image_resize_algorithm`, `video_fps`, `video_maxlen`.
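A hedged sketch of how these options might sit together in a config — the option names come from the list above, but every value below is illustrative, not a documented default:

```yaml
data:
  mask_history: true        # illustrative value
  train_on_prompt: false    # mutually exclusive with train_on_responses_only
  eval_on_each_dataset: true
  split_thinking: true      # Qwen3 <think> masking
  image_min_pixels: 262144  # illustrative pixel bounds
  image_max_pixels: 1048576
  video_fps: 2.0
  video_maxlen: 128
```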
```yaml
data:
  add_new_tokens: ["<thought>", "</thought>"]
  new_special_tokens: ["<|im_end|>"]
  resize_vocab: true
```

## soup data ingest
Convert PDF / DOCX / MD / TXT into JSONL.
```shell
soup data ingest mybook.pdf --output mybook.jsonl
```

Lazy-imports pypdf / python-docx so missing optional deps don't crash `soup data --help`.
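The lazy-import pattern described above can be sketched as follows; `optional_import` and `require` are illustrative names, not the tool's internals:

```python
import importlib

def optional_import(name: str):
    """Return the named module if installed, else None, deferring the
    failure until the backend is actually needed (so --help still works)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

def require(name: str, hint: str):
    """Resolve an optional backend at use time, with an actionable error."""
    mod = optional_import(name)
    if mod is None:
        raise RuntimeError(f"{name} is required for this input format ({hint})")
    return mod

# Hypothetical use inside a PDF ingest branch:
#     pypdf = require("pypdf", "pip install pypdf")
#     reader = pypdf.PdfReader(path)
```

Because nothing is imported at module load, a missing `pypdf` or `python-docx` only surfaces when a PDF or DOCX is actually ingested.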
## AOT tokenize cache
```shell
soup data preprocess soup.yaml --output ./cache
```

Plans the cache key (16-char SHA-256 of dataset + tokenizer + max_length + format). The live tokenize loop ships in v0.42.1.
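A minimal sketch of such a cache key, assuming the four fields named above are serialized deterministically and hashed (`cache_key` is our name; the tool's exact serialization is not specified here):

```python
import hashlib
import json

def cache_key(dataset: str, tokenizer: str, max_length: int, fmt: str) -> str:
    """16-char SHA-256 over exactly the fields that invalidate the cache."""
    payload = json.dumps(
        {"dataset": dataset, "tokenizer": tokenizer,
         "max_length": max_length, "format": fmt},
        sort_keys=True,  # deterministic field order -> stable key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Changing any one field — dataset path, tokenizer, max_length, or format — yields a different key, so a stale cache directory is never silently reused.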