QLoRA (Quantized Low-Rank Adaptation)
Fine-tune a 4-bit quantized base model with BF16 LoRA adapters, enabling fine-tuning of a 65B-parameter model on a single 48GB GPU.
Intent & Description
🎯 Intent
Combine the memory savings of 4-bit quantization with the parameter efficiency of LoRA, making it possible to fine-tune very large models on hardware that could not even hold them for FP16 inference.
📋 Context
Standard LoRA still requires the base model in 16-bit precision: at 2 bytes per parameter, a 65B model needs ~130GB of VRAM just for the frozen base weights. QLoRA stores the frozen base in 4-bit NF4 and trains the LoRA adapters in BF16. The 4-bit weights are dequantized to BF16 on the fly whenever a layer is computed, in both the forward and backward pass.
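The memory figures above follow from simple arithmetic, sketched below. This counts base weights only; it ignores quantization constants, adapter weights, activations, and optimizer state, so real usage will be higher. The function name is illustrative, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_65B = 65e9

fp16_gb = weight_memory_gb(PARAMS_65B, 16)  # 2 bytes per parameter
nf4_gb = weight_memory_gb(PARAMS_65B, 4)    # half a byte per parameter

print(f"FP16 base weights: {fp16_gb:.1f} GB")  # 130.0 GB
print(f"NF4 base weights:  {nf4_gb:.1f} GB")   # 32.5 GB
```

At roughly 32.5GB for the weights, the 4-bit base leaves headroom on a 48GB card for adapters, activations, and optimizer state.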
💡 Solution
Quantize the frozen base model to 4-bit NormalFloat (NF4) using bitsandbytes. Attach BF16 LoRA adapters to the target layers. During training, dequantize the NF4 weights to BF16 for each forward and backward pass, compute gradients in BF16, and update only the LoRA adapter parameters; the base weights never change. Apply double quantization (quantizing the quantization constants themselves) to trim the remaining overhead, and use paged optimizers to absorb memory spikes.
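One way to wire these pieces together is with Hugging Face `transformers` and `peft` on top of bitsandbytes. The following is a minimal sketch under several assumptions, not a reference recipe: the function name `load_qlora_model` is hypothetical, the `target_modules` names assume a LLaMA-style architecture, and the rank, alpha, and dropout values are placeholders to tune.

```python
def load_qlora_model(model_id: str, lora_rank: int = 16):
    """Load a 4-bit NF4 base model and attach LoRA adapters (illustrative sketch)."""
    # Heavy dependencies are imported lazily so this module can be imported
    # on a machine without the GPU stack installed.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store base weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat data type
        bnb_4bit_use_double_quant=True,         # quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    # Prepares the quantized model for training (for example, by enabling
    # gradient checkpointing); the base weights stay frozen.
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=lora_rank,
        lora_alpha=2 * lora_rank,  # assumption: a common heuristic, not a rule
        lora_dropout=0.05,         # assumption: placeholder value
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: LLaMA-style names
        bias="none",
        task_type="CAUSAL_LM",
    )
    # Only the adapter parameters are trainable after this call.
    return get_peft_model(model, lora_config)
```

When training with the `transformers` `Trainer`, a paged optimizer can be selected with `TrainingArguments(optim="paged_adamw_32bit")`.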
Real-world Use Case
📌 TL;DR
4-bit base, BF16 adapters. Fits 65B fine-tuning on one 48GB GPU, sharply lowering the hardware barrier to large-model adaptation.
Advantages
- Makes 65B+ model fine-tuning accessible on a single 48GB GPU
- Accuracy close to 16-bit LoRA fine-tuning despite 4-bit base weights
- Paged optimizers absorb memory spikes, such as those from gradient checkpointing on long sequences, by paging optimizer state to CPU memory
Disadvantages
- Slower training than BF16 LoRA due to per-pass NF4 dequantization overhead
- The NF4-quantized base model starts from slightly lower quality than the FP16 original
- More complex setup — requires bitsandbytes and careful memory budgeting
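The dequantization overhead listed above exists because the 4-bit codes must be expanded back to real-valued weights before every matrix multiply. The toy sketch below illustrates the idea for a single output unit using block-wise absmax quantization. It is a simplification under stated assumptions: a uniform 16-level codebook stands in for the real NF4 codebook (whose levels follow normal-distribution quantiles), plain Python lists replace GPU tensors, and all function names are hypothetical.

```python
from typing import List, Tuple

# Assumption: uniform levels on [-1, 1] as a stand-in for the NF4 codebook.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_block(block: List[float]) -> Tuple[List[int], float]:
    """Scale a block by its absolute maximum, then store the nearest 4-bit code."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(CODEBOOK[i] - w / scale))
        for w in block
    ]
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Expand 4-bit codes back to approximate real-valued weights."""
    return [CODEBOOK[c] * scale for c in codes]


def lora_forward(
    x: List[float],
    codes: List[int],
    scale: float,
    lora_a: List[List[float]],  # shape (rank, d), trainable
    lora_b: List[float],        # shape (rank,), trainable
    alpha: float,
) -> float:
    """Compute y = dequant(W) . x + (alpha / rank) * B . (A x) for one output unit."""
    rank = len(lora_b)
    weights = dequantize_block(codes, scale)  # repeated on every pass
    base = sum(w * xi for w, xi in zip(weights, x))
    a_x = [sum(a * xi for a, xi in zip(row, x)) for row in lora_a]
    delta = sum(b * v for b, v in zip(lora_b, a_x))
    return base + (alpha / rank) * delta


weights = [0.5, -1.0, 0.25, 1.0]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
```

Note that `dequantize_block` runs inside every call to `lora_forward`; that repeated expansion is the cost a 16-bit base model does not pay.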