# Correctness First (v0.36.0)

Four failure modes Soup used to swallow silently now fail loudly.

## Assistant-only loss masking

By default, Soup masks every non-assistant token with `-100` so the SFT loss reflects only what the model should *generate*. Toggle via `data.train_on_responses_only` (default `true`):

```yaml
data:
  train: data.jsonl
  train_on_responses_only: true   # default
  # OR per-message control:
  # train_on_messages_with_train_field: true
```

When the tokenizer ships a chat template with `{% generation %}` markers, the mask is exact. Without those markers, Soup falls back to an incremental tokenize-delta walk and flags the resulting mask as approximate.
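
Where the markers are present, the exact mask can be read straight off the tokenizer. A minimal sketch using the public Hugging Face `transformers` API (not Soup's internals; the checkpoint name is a placeholder for any model whose template carries the markers):

```python
# Sketch: exact assistant-token masking via {% generation %} markers.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/model-with-generation-markers")

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4."},
]

enc = tok.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,  # needs {% generation %} in the template
)

# -100 everywhere the model should not be trained to generate.
labels = [
    tid if keep else -100
    for tid, keep in zip(enc["input_ids"], enc["assistant_masks"])
]
```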

## `--trust-remote-code` opt-in

`soup train`, `chat`, `serve`, `data download`, and `eval auto` now require `--trust-remote-code` to load any HF model that ships custom Python (an `auto_map` entry in `config.json`).

```bash
soup train --config soup.yaml --trust-remote-code
```

Models from first-party orgs (Meta, Mistral, Qwen, Google, etc.; 15 orgs in the allowlist) skip the warning panel; everything else prints a REMOTE CODE WARNING panel before loading.
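
For orientation, a hypothetical sketch of the gate, showing where the flag and the allowlist enter; every name below is illustrative, not Soup's actual code:

```python
import json
from pathlib import Path

FIRST_PARTY_ORGS = {"meta-llama", "mistralai", "Qwen", "google"}  # excerpt

def ships_remote_code(model_dir: Path) -> bool:
    """A model ships custom Python iff its config.json declares an auto_map."""
    config = json.loads((model_dir / "config.json").read_text())
    return "auto_map" in config

def check_remote_code(repo_id: str, model_dir: Path, trust: bool) -> None:
    if not ships_remote_code(model_dir):
        return
    if not trust:
        # Loud failure: refuse to import unvetted Python without the flag.
        raise SystemExit(
            f"{repo_id} ships custom Python; re-run with --trust-remote-code"
        )
    if repo_id.split("/")[0] not in FIRST_PARTY_ORGS:
        print(f"REMOTE CODE WARNING: about to import code from {repo_id}")
```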

## Chat-template hardening

Tokenizers without a chat template now raise a `ValueError` with a suggested fix instead of silently building garbage `f"{role}: {content}"` strings.

```yaml
data:
  train: data.jsonl
  chat_template: chatml   # or: llama3, qwen2.5, mistral, gemma3, phi4, deepseek-r1, or a raw Jinja string
```

Raw Jinja strings are validated at config load: null bytes, templates over 64 KB, and filesystem-touching directives (`{% include %}`, `{% import %}`, `{% from %}`, `{% macro %}`, `{% extends %}`) are all rejected.
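
A minimal sketch of those checks, assuming a plain regex scan over the template source; illustrative only, not Soup's config loader:

```python
import re

MAX_TEMPLATE_BYTES = 64 * 1024
FS_DIRECTIVES = re.compile(r"{%-?\s*(include|import|from|macro|extends)\b")

def validate_chat_template(template: str) -> None:
    if "\x00" in template:
        raise ValueError("chat_template contains a null byte")
    if len(template.encode("utf-8")) > MAX_TEMPLATE_BYTES:
        raise ValueError("chat_template is larger than 64 KB")
    match = FS_DIRECTIVES.search(template)
    if match:
        raise ValueError(
            f"chat_template uses a forbidden directive: {{% {match.group(1)} %}}"
        )
```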

## OOM-probe auto batch size

```yaml
training:
  batch_size: auto                  # unchanged
  auto_batch_size_strategy: probe   # NEW: 'static' | 'probe' | 'auto' (default)
```

Replaces the static memory formula with a real probe loop: try a batch size, halve on OOM, then double back up to a ceiling. The picked size is cached at `~/.soup/batch_cache.json`, keyed on `(model, max_length, quantization, lora_r, gpu_name, gpu_memory_gb)`, so repeat runs short-circuit. The cache file gets best-effort `0o600` permissions after an atomic rename.
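
An illustrative sketch of the probe strategy and the keyed cache; all names and the cache layout below are assumptions, not Soup's implementation:

```python
import json
import os
import tempfile
from pathlib import Path

CACHE_PATH = Path.home() / ".soup" / "batch_cache.json"

def probe_batch_size(try_step, start: int = 64, ceiling: int = 512) -> int:
    """try_step(bs) runs one forward/backward pass and raises on CUDA OOM."""
    size = start
    while size >= 1:
        try:
            try_step(size)      # halve until one step succeeds...
            break
        except RuntimeError:    # treated as OOM for this sketch
            size //= 2
    if size == 0:
        raise RuntimeError("even batch size 1 does not fit")
    while size * 2 <= ceiling:
        try:
            try_step(size * 2)  # ...then double while it still fits
            size *= 2
        except RuntimeError:
            break
    return size

def save_cached_size(key: str, size: int) -> None:
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    cache[key] = size
    fd, tmp = tempfile.mkstemp(dir=CACHE_PATH.parent)
    with os.fdopen(fd, "w") as f:
        json.dump(cache, f)
    os.replace(tmp, CACHE_PATH)   # atomic rename, then best-effort perms
    try:
        os.chmod(CACHE_PATH, 0o600)
    except OSError:
        pass
```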