Preference Variety (v0.40.0)

Five preference losses live behind one config knob. Pick a loss without renaming your task, anneal β over training, periodically refresh the frozen reference model, and blend losses on a forward-looking multi-objective surface (schema-only in this release).

BCO (Binary Classifier Optimization)

Same input format as DPO; rows are split internally into TRL's BCO unpaired schema ({prompt, completion, label}).

```yaml
task: bco
data:
  train: ./data/preferences.jsonl
  format: dpo
training:
  bco_beta: 0.1
  lora: { r: 64, alpha: 16 }
  quantization: 4bit
```

bco_beta defaults to 0.1 and must be greater than 0 (schema gt=0). A new soup init --template bco template ships in the box.
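
To make the internal split concrete, here is a minimal sketch of how one DPO-format pair could expand into two unpaired BCO rows. The function name is illustrative; it is not the tool's actual conversion code.

```python
# Illustrative only: how one DPO-format pair {prompt, chosen, rejected}
# could expand into two unpaired BCO rows {prompt, completion, label}.
def dpo_pair_to_bco_rows(row: dict) -> list[dict]:
    return [
        {"prompt": row["prompt"], "completion": row["chosen"], "label": True},
        {"prompt": row["prompt"], "completion": row["rejected"], "label": False},
    ]

rows = dpo_pair_to_bco_rows(
    {"prompt": "Summarize:", "chosen": "A short summary.", "rejected": "Off-topic text."}
)
# -> one desirable row (label=True) and one undesirable row (label=False)
```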

Unified preference dispatcher

Use task: preference plus training.preference_loss to swap losses without renaming the task; hyperparameter sweeps over the loss type itself become trivial.

```yaml
task: preference
data:
  train: ./data/preferences.jsonl
  format: dpo
training:
  preference_loss: dpo   # or simpo, orpo, ipo, bco
```

Legacy task: dpo / task: simpo / etc. remain first-class — the unified surface is additive.
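
A minimal sketch of what a sweep over the loss type could look like, assuming configs like the one above are written to disk one per loss value; the output layout and any launch command are assumptions, not part of the tool.

```python
# Illustrative sweep over preference_loss itself: one config variant per loss.
import copy
import pathlib

import yaml  # PyYAML

base = {
    "task": "preference",
    "data": {"train": "./data/preferences.jsonl", "format": "dpo"},
    "training": {"preference_loss": "dpo"},
}

pathlib.Path("sweep").mkdir(exist_ok=True)
for loss in ["dpo", "simpo", "orpo", "ipo", "bco"]:
    cfg = copy.deepcopy(base)
    cfg["training"]["preference_loss"] = loss
    pathlib.Path(f"sweep/preference_{loss}.yaml").write_text(
        yaml.safe_dump(cfg, sort_keys=False)
    )
# Each file differs only in training.preference_loss; launch them however you
# normally launch runs (exact CLI invocation not shown here).
```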

KL-controlled DPO variants

Anneal β over training and periodically refresh the reference model.

```yaml
task: dpo   # or task: preference + preference_loss: dpo, or task: ipo
training:
  dpo_beta: 0.1
  dpo_beta_schedule: linear   # linear | cosine | exponential
  dpo_beta_end: 0.01
  dpo_ref_regen_epochs: 2     # copy student → ref model every 2 epochs
```

Both controls are gated to DPO-family tasks (dpo, ipo, or preference with preference_loss in {dpo, ipo}). Transformers backend only.

The BetaScheduleCallback resolves total_steps lazily in on_train_begin so the schedule sees the real state.max_steps populated by the HF Trainer. Reference-model regeneration is suppressed at epoch 0, which avoids copying the untrained student.
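
A minimal sketch of the schedule math and the lazy total-steps resolution, written as a Hugging Face TrainerCallback. The attribute the trainer reads β from (trainer.beta below) and the exact exponential form are assumptions; this is not the tool's BetaScheduleCallback.

```python
import math

from transformers import TrainerCallback


class BetaScheduleSketch(TrainerCallback):
    """Anneal β from beta_start to beta_end over training (illustrative)."""

    def __init__(self, trainer, beta_start=0.1, beta_end=0.01, kind="linear"):
        self.trainer = trainer        # assumed to read trainer.beta each step
        self.beta_start = beta_start
        self.beta_end = beta_end
        self.kind = kind
        self.total_steps = None

    def on_train_begin(self, args, state, control, **kwargs):
        # Resolved lazily so the schedule sees the real max_steps that the
        # HF Trainer populates just before training starts.
        self.total_steps = max(state.max_steps, 1)

    def on_step_begin(self, args, state, control, **kwargs):
        t = min(state.global_step / self.total_steps, 1.0)  # progress in [0, 1]
        if self.kind == "linear":
            beta = self.beta_start + (self.beta_end - self.beta_start) * t
        elif self.kind == "cosine":
            beta = self.beta_end + 0.5 * (self.beta_start - self.beta_end) * (
                1 + math.cos(math.pi * t)
            )
        else:  # "exponential": geometric interpolation (assumed form)
            beta = self.beta_start * (self.beta_end / self.beta_start) ** t
        self.trainer.beta = beta      # hypothetical hook; real wiring may differ
```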

Multi-objective preference loss (schema-only in v0.40.0)

```yaml
task: preference
training:
  preference_loss_weights: {dpo: 0.7, bco: 0.3}
```

The schema validates 2–5 entries summing to 1.0 (±1e-6). A single-entry mapping is rejected with an actionable message pointing at scalar preference_loss, the field is mutually exclusive with scalar preference_loss, and it is rejected on the MLX backend.
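
A minimal sketch of the bounds check described above, assuming the weights arrive as a plain dict; the function name and error wording are illustrative, not the tool's actual validator.

```python
ALLOWED_LOSSES = {"dpo", "simpo", "orpo", "ipo", "bco"}

def validate_preference_loss_weights(weights: dict[str, float]) -> None:
    # Mirrors the documented schema rules; not the tool's actual validator.
    if len(weights) == 1:
        raise ValueError(
            "preference_loss_weights with a single entry is not allowed; "
            "use scalar preference_loss instead."
        )
    if not 2 <= len(weights) <= 5:
        raise ValueError("preference_loss_weights must have 2-5 entries.")
    unknown = set(weights) - ALLOWED_LOSSES
    if unknown:
        raise ValueError(f"unknown preference losses: {sorted(unknown)}")
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"weights must sum to 1.0 (±1e-6), got {total}")

validate_preference_loss_weights({"dpo": 0.7, "bco": 0.3})  # passes
```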

Live runtime weighted-loss combination lands in v0.40.1; v0.40.0 fails fast with an actionable NotImplementedError if you actually try to train (the same stub-then-live pattern as v0.27.0 MII, v0.37.0 multipack, v0.38.0 quant menu, and v0.39.0 ReLoRA).
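
The stub itself could be as small as the following sketch; the function name and message wording are illustrative.

```python
def combine_preference_losses(weights):
    # Hypothetical v0.40.0 stub: the config passes schema validation, but any
    # attempt to actually train with blended losses fails fast.
    raise NotImplementedError(
        "preference_loss_weights is schema-only in v0.40.0; "
        "runtime weighted-loss combination lands in v0.40.1."
    )
```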

Stats

  • Net +118 tests (4538 → 4656 across 136 files)
  • BCO trainer + dispatcher + β schedule math + ref-model regen TOCTOU + multi-objective schema bounds

See also

  • [DPO training guide](/docs/dpo-training-guide) — preference dataset format
  • [Trace-to-preference](/docs/trace-to-preference) — harvest pairs from production logs
  • [Registry](/docs/registry) — track preference variants in the lineage DAG