Multipack — FFD Bin-Packing Sampler (v0.37.0)

Multipack is Soup's largest single throughput win for chat fine-tuning on uneven-length data. Instead of padding every sample to max_length, it uses First-Fit-Decreasing (FFD) bin packing to group variable-length samples into bins approaching batch_size × max_seq_length, eliminating padding waste.

```yaml
training:
  multipack: true
  packing: false   # mutually exclusive with multipack
```
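
The packing step itself is plain FFD. Here is a minimal sketch of the grouping, assuming per-sample token lengths are known up front (`pack_ffd` and its signature are illustrative, not Soup's actual API):

```python
# Hypothetical sketch of First-Fit-Decreasing packing: sort lengths
# descending, then place each sample into the first bin with room.
def pack_ffd(lengths: list[int], capacity: int) -> list[list[int]]:
    """Group sample indices into bins whose total length <= capacity."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: list[list[int]] = []   # sample indices per bin
    room: list[int] = []         # remaining capacity per bin
    for i in order:
        for b, free in enumerate(room):
            if lengths[i] <= free:       # first fit
                bins[b].append(i)
                room[b] -= lengths[i]
                break
        else:                            # no bin fits: open a new one
            bins.append([i])
            room.append(capacity - lengths[i])
    return bins

# e.g. pack_ffd([900, 700, 300, 100], capacity=1024)
# -> [[0, 3], [1, 2]]  (two bins instead of four padded rows)
```

Sorting descending first is what makes first-fit effective: large samples claim fresh bins early and small samples backfill the gaps. The linear scan over bins per item is also where the O(N²) worst case noted under DoS hardening comes from.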

How it composes

  • Multipack picks WHICH samples go together (FFD packing).
  • `packing_cross_doc_attn_mask` sets HOW the attention mask is built (block-diagonal causal; sketched after this list).
  • The two compose cleanly: enable both for FlashAttention-incompatible backends; when FlashAttention is available, the varlen path is auto-selected instead.
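
For intuition, here is a 2D simplification of the block-diagonal causal mask (Soup's actual builder is 4D, per the allocation cap below; `build_packed_mask` is a hypothetical name):

```python
import torch

def build_packed_mask(seg_lens: list[int]) -> torch.Tensor:
    """True where token q may attend to token k: same segment and k <= q."""
    total = sum(seg_lens)
    seg_ids = torch.repeat_interleave(
        torch.arange(len(seg_lens)), torch.tensor(seg_lens)
    )
    same_doc = seg_ids[:, None] == seg_ids[None, :]          # block diagonal
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    return same_doc & causal

# build_packed_mask([2, 3]) lets positions 0-1 and 2-4 attend only
# within their own document, causally.
```

The block-diagonal structure is exactly what stops a packed sample from attending to its bin-mates.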

Architecture allowlist

18 supported architectures: Llama 3.x, Qwen 2/3, Mistral, Gemma 2/3, Phi 3/4, DeepSeek V2/V3, Mixtral, Falcon, StableLM, SmolLM2.

Unknown architectures fail loudly at config-load instead of silently no-opping (critical fix vs Axolotl's silent-miss footgun).
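
A hypothetical illustration of that fail-loud gate (the allowlist contents and error text here are assumptions, not Soup's literal code):

```python
# Abridged, lowercase model_type strings; the real allowlist has 18 entries.
MULTIPACK_ARCHS = {"llama", "qwen2", "qwen3", "mistral", "gemma2", "gemma3", "phi3"}

def check_multipack_support(model_type: str) -> None:
    """Raise at config load rather than silently skipping packing."""
    if model_type not in MULTIPACK_ARCHS:
        raise ValueError(
            f"multipack: true is unsupported for model_type={model_type!r}; "
            f"supported: {sorted(MULTIPACK_ARCHS)}"
        )
```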

Scope

Multipack is limited to the sft / pretrain trainers on the transformers backend. Preference / RLHF trainers and the MLX backend each get a distinct error message naming the actual reason. Live wiring of the sampler into HF Trainer's _get_train_sampler lands in v0.37.1; v0.37.0 ships the schema gate plus the helper builder (mirroring the v0.27.0 MII stub-then-live pattern).
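
A sketch of that schema gate, assuming `trainer` / `backend` field names (the error strings are paraphrased, not Soup's exact messages):

```python
def validate_multipack_scope(trainer: str, backend: str) -> None:
    """Distinct, reason-bearing errors for each unsupported combination."""
    if backend != "transformers":
        raise ValueError(
            f"multipack requires the transformers backend; got backend={backend!r}"
        )
    if trainer not in {"sft", "pretrain"}:
        raise ValueError(
            f"multipack is not supported for trainer={trainer!r}; "
            "only sft and pretrain are packable"
        )
```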

DoS hardening

  • FFD packer caps at 1M items (algorithm is O(N²) worst-case)
  • 4D mask builder caps allocations at 2³¹ cells
  • Chat-template Jinja analyzer caps at 128 KB
  • Every numeric input rejects bool explicitly (Python's bool subclasses int; see the sketch after this list)
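
A minimal sketch of such a check (`require_positive_int` is an illustrative name, not Soup's helper):

```python
# In Python, bool is a subclass of int, so isinstance(True, int) is True
# and must be excluded explicitly before the type check passes.
def require_positive_int(value: object, name: str, cap: int) -> int:
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{name} must be an int, got {type(value).__name__}")
    if not (0 < value <= cap):
        raise ValueError(f"{name} must be in (0, {cap}], got {value}")
    return value

# require_positive_int(True, "max_items", 1_000_000) -> TypeError,
# even though isinstance(True, int) is True.
```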

Jinja template analyzer

The JinjaTemplateAnalyzer (also new in v0.37.0) walks chat-template ASTs to discover non-standard message.<field> references (tool_calls, name, weight, train). Those fields feed the v0.36.0 train_on_messages_with_train_field path, so per-message training masks are aware of fields beyond role / content. The analyzer parses templates without rendering them, so a crafted soup.yaml cannot trigger SSRF.
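
A minimal sketch of AST-only field discovery using jinja2's public parse API (Soup's JinjaTemplateAnalyzer internals may differ; the size cap mirrors the 128 KB DoS limit above):

```python
from jinja2 import Environment, nodes

MAX_TEMPLATE_BYTES = 128 * 1024  # mirrors the 128 KB analyzer cap

def message_fields(template_source: str) -> set[str]:
    """Collect message.<field> attribute references without ever rendering."""
    if len(template_source.encode("utf-8")) > MAX_TEMPLATE_BYTES:
        raise ValueError("chat template exceeds the 128 KB cap")
    ast = Environment().parse(template_source)  # parse only, never render
    fields: set[str] = set()
    for node in ast.find_all(nodes.Getattr):
        if isinstance(node.node, nodes.Name) and node.node.name == "message":
            fields.add(node.attr)
    return fields

# message_fields('{% for message in messages %}{{ message.tool_calls }}{% endfor %}')
# -> {'tool_calls'}
```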