# DPO training guide: align LLMs with human preferences
Direct Preference Optimization (DPO) aligns language models with human preferences without needing a reward model — it's simpler and more stable than RLHF/PPO.
## When to use DPO
- You have a dataset of "chosen" vs "rejected" responses
- You want to reduce hallucinations and off-topic answers
- You want alignment without the complexity of PPO
Use DPO after SFT. A typical pipeline: Pretrain → SFT → DPO.
## 1. DPO dataset format
```json
[
  {
    "prompt": "Explain quantum entanglement.",
    "chosen": "Quantum entanglement is a physical phenomenon where...",
    "rejected": "Idk, something quantum."
  }
]
```

Save as `preferences.json`.
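Before training, it can help to sanity-check the file. Here is a minimal validation sketch; the `load_preferences` helper is illustrative, not part of Soup CLI:

```python
import json

def load_preferences(path):
    """Load a preference file and verify every record has non-empty
    'prompt', 'chosen', and 'rejected' string fields."""
    with open(path) as f:
        records = json.load(f)
    for i, rec in enumerate(records):
        for key in ("prompt", "chosen", "rejected"):
            value = rec.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"record {i}: missing or empty {key!r}")
    return records
```

Catching a malformed record here is much cheaper than discovering it mid-training.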
## 2. Config
```yaml
base:
  model: ./runs/my-sft-model/latest  # Start from SFT checkpoint
  task: dpo
data:
  train: preferences.json
  format: dpo
training:
  backend: transformers
  epochs: 1
  learning_rate: 5.0e-7
  batch_size: 2
  gradient_accumulation_steps: 8
  beta: 0.1
  max_seq_length: 2048
lora:
  enabled: true
  r: 16
  alpha: 32
```

Key DPO hyperparameters:

- `beta: 0.1` — KL penalty weight. A higher value keeps the policy closer to the reference model.
- `learning_rate: 5.0e-7` — DPO needs a much smaller learning rate than SFT.
- `epochs: 1` — DPO overfits quickly; it rarely needs more than 1–2 epochs.
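To see exactly what `beta` controls, here is the per-pair DPO objective written out in plain Python. This is a sketch of the published DPO loss, not Soup's internal implementation; each argument is assumed to be a log-probability summed over the response tokens:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * reward margin)."""
    # Implicit rewards: log-prob ratio between the policy and the
    # frozen reference (SFT) model, scaled by beta
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as softplus(-margin)
    return math.log1p(math.exp(-margin))
```

A larger `beta` amplifies the margin, so any drift away from the reference model changes the loss more sharply, which is why it acts as a KL-style penalty weight.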
## 3. Train
```bash
soup train --config dpo.yaml
```

## 4. Evaluate
Compare the DPO model against the SFT baseline:
```bash
soup eval compare \
  --base ./runs/my-sft-model/latest \
  --candidate ./runs/my-dpo-model/latest \
  --judge gpt-4
```

## DPO variants in Soup CLI
Soup supports several preference-optimization methods — swap `task:` to change the algorithm:

- `task: dpo` — Direct Preference Optimization
- `task: orpo` — ORPO (combines SFT + DPO in one step, no reference model)
- `task: simpo` — SimPO (length-normalized, no reference model)
- `task: ipo` — IPO (more stable than DPO on noisy preference data)
- `task: kto` — KTO (works with unpaired binary labels)
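The reference-free variants replace DPO's policy/reference log-ratio with a simpler quantity. As an illustration, here is a sketch of the SimPO objective from the SimPO paper; the default `beta` and `gamma` values below are illustrative, not Soup's defaults:

```python
import math

def simpo_loss(chosen_logp, rejected_logp, chosen_len, rejected_len,
               beta=2.0, gamma=0.5):
    """Per-pair SimPO loss: length-normalized, no reference model."""
    # Average log-prob per token replaces the policy/reference log-ratio,
    # which removes both the reference model and the length bias
    margin = beta * (chosen_logp / chosen_len - rejected_logp / rejected_len)
    # gamma is a target margin the chosen response must clear
    return math.log1p(math.exp(-(margin - gamma)))
```

Because no reference forward pass is needed, SimPO (like ORPO) roughly halves the memory and compute per step compared with DPO.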
## Related
- [Training methods reference](/docs/training)
- [Fine-tune Llama 3.1 with LoRA](/docs/fine-tune-llama-3-1-lora)