Data Pipeline Pro (v0.42.0)

18 features across 6 parts, bringing data-layer parity with Axolotl and LLaMA-Factory.

5 new data formats

```yaml
data:
  train: ./data/train.jsonl
  format: prm            # PRM stepwise-supervised
  # | pre_tokenized      # LLaMA-Factory tokenized_path / Axolotl empty
  # | input_output       # Axolotl template-free segments+labels
  # | video
  # | multimodal         # Axolotl content-parts schema
```
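
To ground the new schemas, here are hypothetical record shapes for two of the formats, sketched in Python. The field names are illustrative assumptions, not the tool's guaranteed on-disk schema; the `input_output` shape follows Axolotl's template-free convention.

```python
import json

# Hypothetical record shapes -- illustrative assumptions, not exact schemas.

# input_output (Axolotl template-free): segments with per-segment loss labels.
input_output_record = {
    "segments": [
        {"label": False, "text": "SYSTEM: You are a helpful assistant.\n"},  # loss-masked
        {"label": True, "text": "The answer is 42."},                        # trained on
    ]
}

# prm (stepwise supervision): one correctness label per reasoning step.
prm_record = {
    "prompt": "What is 3 * (4 + 5)?",
    "steps": ["4 + 5 = 9", "3 * 9 = 27"],
    "step_labels": [1, 1],  # 1 = good step, 0 = bad step
}

print(json.dumps(prm_record))  # one JSON object per line in train.jsonl
```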

Remote URIs

A 7-entry scheme allowlist covers the supported object stores (s3, gs, gcs, az, abfs, abfss, oci). Bucket names must match `^[a-zA-Z0-9][a-zA-Z0-9._\-]{0,62}$`; URIs carrying userinfo, a fragment, or a query string are rejected.

```yaml
data:
  train: s3://my-bucket/train.jsonl    # s3 | gs | gcs | az | abfs | abfss | oci
  streaming: true
  buffer_size: 10000
  shards: 16
```

Live `fsspec` loaders ship in v0.42.1 (schema-only in v0.42.0).
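
A minimal sketch of the URI checks described above, assuming a straightforward `urllib.parse` implementation (the function name is ours, not the tool's API):

```python
import re
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"s3", "gs", "gcs", "az", "abfs", "abfss", "oci"}  # the 7-entry allowlist
BUCKET_RE = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9._\-]{0,62}$")

def validate_remote_uri(uri: str) -> None:
    parts = urlparse(uri)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme {parts.scheme!r} not in allowlist")
    if parts.username or parts.password:
        raise ValueError("userinfo is rejected")
    if parts.query or parts.fragment:
        raise ValueError("query strings and fragments are rejected")
    if not BUCKET_RE.match(parts.netloc):
        raise ValueError(f"invalid bucket name: {parts.netloc!r}")

validate_remote_uri("s3://my-bucket/train.jsonl")       # passes
# validate_remote_uri("s3://user@my-bucket/x.jsonl")    # raises: userinfo
```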

Multi-dataset interleave

```yaml
data:
  train:
    - ./data/sft.jsonl
    - ./data/preference.jsonl
  interleave:
    strategy: probs        # concat | under | over | probs
    probs: [0.7, 0.3]
```

`probs` is validated as 2–32 entries, each in (0, 1], summing to 1.0 ± 1e-6.
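
Those constraints are mechanical enough to sketch; the validator below mirrors the stated rules, and the final line shows an illustrative weighted draw (the names are ours, and the real sampler may differ):

```python
import random

def validate_probs(probs: list[float]) -> None:
    if not 2 <= len(probs) <= 32:
        raise ValueError("probs must have 2-32 entries")
    if any(not 0.0 < p <= 1.0 for p in probs):
        raise ValueError("each prob must be in (0, 1]")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must sum to 1.0 within 1e-6")

probs = [0.7, 0.3]
validate_probs(probs)  # passes
dataset_idx = random.choices(range(len(probs)), weights=probs)[0]  # 0 ~70%, 1 ~30%
```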

Advanced masking + vocab expansion

- `mask_history`
- `train_on_prompt` (mutually exclusive with `train_on_responses_only`)
- `eval_on_each_dataset`
- `split_thinking` (Qwen3 `<think>` masking)
- `image_min_pixels` / `image_max_pixels`
- `image_resize_algorithm`
- `video_fps`
- `video_maxlen`

```yaml
data:
  add_new_tokens: ["<thought>", "</thought>"]
  new_special_tokens: ["<|im_end|>"]
  resize_vocab: true
```
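
A sketch of what that config implies downstream, using standard Hugging Face `transformers` calls; that the tool wraps something equivalent is our assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_tokens(["<thought>", "</thought>"])  # add_new_tokens
tokenizer.add_special_tokens({"additional_special_tokens": ["<|im_end|>"]})  # new_special_tokens
model.resize_token_embeddings(len(tokenizer))  # resize_vocab: true
```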

`soup data ingest`

Convert PDF / DOCX / MD / TXT into JSONL.

```bash
soup data ingest mybook.pdf --output mybook.jsonl
```

Lazily imports `pypdf` / `python-docx` so missing optional dependencies don't crash `soup data --help`.
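
The lazy-import pattern it describes looks roughly like this (a sketch; the helper and its error message are ours):

```python
import importlib

def _require(module_name: str, pip_name: str):
    # Import only when a command actually needs the dependency,
    # so `soup data --help` never touches optional packages.
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this command; "
            f"install it with `pip install {pip_name}`"
        ) from exc

def ingest_pdf(path: str) -> str:
    pypdf = _require("pypdf", "pypdf")
    reader = pypdf.PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```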

AOT tokenize cache

```bash
soup data preprocess soup.yaml --output ./cache
```

Plans the cache key: a 16-character SHA-256 prefix over dataset + tokenizer + max_length + format. The live tokenize loop ships in v0.42.1.
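
The key derivation is simple enough to sketch. The serialization and separator below are our guesses; only the hash, the 16-character truncation, and the inputs come from the release notes:

```python
import hashlib

def cache_key(dataset: str, tokenizer: str, max_length: int, fmt: str) -> str:
    payload = "\x1f".join([dataset, tokenizer, str(max_length), fmt])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

print(cache_key("./data/train.jsonl", "my-org/my-tokenizer", 4096, "prm"))
```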