# Fine-tune Gemma 3 with QLoRA (single GPU)

QLoRA combines 4-bit quantization of the base model with LoRA adapters, making it possible to fine-tune Gemma 3 12B on a single 16 GB GPU (RTX 4080, A4000).
## Why QLoRA?

- 4× memory reduction vs. LoRA on a full-precision base model
- Near-identical quality to full fine-tuning (~99% of benchmark scores per the QLoRA paper)
- Works on consumer hardware
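To see where the memory reduction comes from, here is a back-of-the-envelope estimate for the base weights of a 12B-parameter model (adapter, optimizer, and activation memory excluded; these are rough estimates, not measurements):

```python
params = 12e9  # 12B parameters

fp16_gb = params * 2 / 1024**3   # 2 bytes per weight in fp16/bf16
nf4_gb = params * 0.5 / 1024**3  # ~4 bits per weight in NF4

print(f"fp16 weights: ~{fp16_gb:.1f} GB")  # ~22.4 GB
print(f"NF4 weights:  ~{nf4_gb:.1f} GB")   # ~5.6 GB
```

The fp16 weights alone would overflow a 16 GB card before any activations are allocated; the 4-bit base leaves room for LoRA adapters, gradients, and activations.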
## 1. Install

```bash
pip install 'soup-cli[fast]'
```

## 2. Config
```yaml
base:
  model: google/gemma-3-12b-it
  task: sft
data:
  train: train.json
  format: alpaca
training:
  backend: unsloth
  quant: 4bit
  epochs: 3
  learning_rate: 2.0e-4
  batch_size: 1
  gradient_accumulation_steps: 16
  max_seq_length: 2048
lora:
  enabled: true
  r: 16
  alpha: 16
  use_rslora: true
  target_modules: [q_proj, k_proj, v_proj, o_proj]
```

Note the key flags:

- `quant: 4bit`: 4-bit NF4 quantization of the base model
- `use_rslora: true`: rank-stabilized LoRA (v0.21.0+), better suited to larger models
- `batch_size: 1` with `gradient_accumulation_steps: 16`: effective batch of 16 on tight VRAM
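As a sanity check on those numbers, a small sketch computing the effective batch size and a rough trainable-parameter count for one LoRA-adapted projection (the 4096×4096 layer shape is an illustrative assumption, not a value taken from this config or from Gemma 3):

```python
# Effective batch size: gradients are accumulated over several
# micro-batches before each optimizer step.
batch_size = 1
gradient_accumulation_steps = 16
effective_batch = batch_size * gradient_accumulation_steps
print(effective_batch)  # 16

# LoRA replaces a frozen (d_in x d_out) weight update with two small
# matrices: A (d_in x r) and B (r x d_out), so only these are trained.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return d_in * r + r * d_out

# Hypothetical projection shape, for illustration only.
per_proj = lora_params(4096, 4096, r=16)
print(per_proj)  # 131072 trainable params per projection
```

With `r: 16` the trainable adapter is a tiny fraction of the frozen base, which is why gradients and optimizer state fit alongside the 4-bit weights.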
## 3. Train

```bash
soup train --config gemma3.yaml
```

Monitor VRAM with `nvidia-smi` in another terminal. You should see a ~14 GB peak on Gemma 3 12B.
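If you want that VRAM number programmatically rather than eyeballing the `nvidia-smi` table, a sketch using its CSV query mode (`--query-gpu=memory.used --format=csv,noheader` is standard `nvidia-smi` usage; the parsing helper is our own):

```python
import subprocess

def parse_mib(line: str) -> int:
    """Parse a line like '13980 MiB' from nvidia-smi CSV output."""
    return int(line.strip().split()[0])

def gpu_memory_used_mib() -> list[int]:
    # nvidia-smi prints one line per GPU, e.g. "13980 MiB"
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_mib(line) for line in out.strip().splitlines()]

print(parse_mib("13980 MiB"))  # 13980
```

Polling this in a loop during the first few hundred steps is enough to catch the peak, since memory use is roughly steady once the first optimizer step has run.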
## 4. Merge and export

```bash
# Dequantize, merge LoRA, save full model
soup export --adapter ./runs/gemma3/latest --format hf --output ./gemma3-merged

# Or export directly to GGUF q4_k_m
soup export --adapter ./runs/gemma3/latest --format gguf --quant q4_k_m
```

## Common issues
OOM during the backward pass? Reduce `max_seq_length` to 1024 or enable gradient checkpointing:

```yaml
training:
  gradient_checkpointing: true
```

Loss spikes? Enable the loss watchdog (v0.24.0+):
```yaml
training:
  loss_watchdog:
    enabled: true
    max_spike: 2.0
```

## Related
- [Export to GGUF and Ollama](/docs/export-to-gguf-ollama)
- [Training backends](/docs/backends)