LoRA (Low-Rank Adaptation)
Freeze the base model and inject small trainable rank-decomposition matrices into its attention layers: roughly 100x or more fewer trainable parameters than full fine-tuning, often enough to fit training on a single GPU.
Intent & Description
🎯 Intent
Adapt a large pre-trained model to a new task with a fraction of the trainable parameters and memory cost of full fine-tuning, making fine-tuning feasible on hardware that cannot hold gradients and optimizer state for every weight.
📋 Context
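Full fine-tuning of a 7B-parameter model has to hold, on top of the weights themselves, a gradient and optimizer state for every one of those 7B parameters. With Adam in mixed precision that is on the order of 16 bytes per parameter, or over 100 GB before activations, far beyond a single consumer GPU. LoRA builds on the observation that the weight updates learned during fine-tuning tend to have low intrinsic rank, so each update can be represented as the product of two small matrices.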
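The memory arithmetic behind that claim can be sketched in a few lines. The byte counts below are an assumption (mixed-precision training with Adam, a common but not universal setup), and activations are left out, so treat the result as a rough lower bound rather than a measurement.

```python
# Rough training-memory estimate for full fine-tuning (activations excluded).
# Assumption: mixed-precision Adam; other optimizers or precisions change these numbers.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def full_finetune_bytes(num_params: int) -> int:
    """Bytes needed for weights, gradients, and optimizer state."""
    return num_params * sum(BYTES_PER_PARAM.values())


n = 7_000_000_000
print(f"{full_finetune_bytes(n) / 1e9:.0f} GB")  # prints "112 GB"
```

With LoRA, the gradient and optimizer rows apply only to the adapter parameters; the frozen base weights still have to fit in memory.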
💡 Solution
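For each target weight matrix W of shape (d × k), typically the Q, K, V attention projections, freeze W and add a parallel path: ΔW = A × B, where A is (d × r) and B is (r × k) with rank r << min(d, k), so the adapted layer uses W + ΔW. Only A and B are trained, typically 0.1–1% of the base model's parameters. One of the two factors is usually initialized to zero so that ΔW = 0 and training starts from the unmodified base model. At inference, either merge ΔW into W for zero added latency, or keep the adapters separate to hot-swap between tasks.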
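A minimal, dependency-free sketch of the parallel path and the merge step follows. It is an illustration rather than a reference implementation: matrices are plain lists of rows, inputs are row vectors (y = xW), the helper names are hypothetical, and the `scaling` argument is an assumption added for illustration (left at 1.0 it reproduces ΔW = A × B exactly as written above). No training loop is shown.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def scale(x, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * a for a in row] for row in x]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def lora_forward(x, W, A, B, scaling=1.0):
    """y = x W + scaling * (x A) B. W is frozen; only A and B would be trained."""
    return add(matmul(x, W), scale(matmul(matmul(x, A), B), scaling))


def merge(W, A, B, scaling=1.0):
    """Fold the adapter into the base weight: W' = W + scaling * A B."""
    return add(W, scale(matmul(A, B), scaling))


rng = random.Random(0)
d, k, r = 8, 6, 2            # toy sizes; real projections are in the thousands
W = rand_matrix(d, k, rng)   # frozen base weight, shape (d, k)
A = rand_matrix(d, r, rng)   # adapter factor, shape (d, r)
B = rand_matrix(r, k, rng)   # adapter factor, shape (r, k); both random here so the check is non-trivial
x = rand_matrix(3, d, rng)   # batch of 3 input rows

y_adapter = lora_forward(x, W, A, B)   # adapter kept separate
y_merged = matmul(x, merge(W, A, B))   # adapter merged into W
max_diff = max(abs(a - b) for ra, rb in zip(y_adapter, y_merged) for a, b in zip(ra, rb))
print(max_diff < 1e-12)  # the two paths agree up to floating-point error
```

Keeping A and B separate costs two extra small matrix products per call but lets one base W serve several adapters; merging removes that cost and fixes the model to one task.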
Real-world Use Case
📌 TL;DR
Freeze the big model and train two small matrices per adapted weight matrix. As a rule of thumb, expect 80–90% of full fine-tuning quality (task-dependent) for about 1% of the trainable parameters, often on a single GPU.
Advantages
- 100–1000x fewer trainable parameters (0.1–1% of the base model), so fine-tuning fits on hardware that cannot hold full gradients and optimizer state
- Adapters are small and swappable — multiple tasks on one base model without storing full copies
- Base weights stay frozen, which limits (but does not rule out) catastrophic forgetting of general capabilities; removing an unmerged adapter restores the base model exactly
Disadvantages
- Lower adaptation capacity than full fine-tuning — constrained by rank r
- Rank selection is a hyperparameter — too low limits quality, too high approaches full fine-tuning cost
- Very large distribution shifts may require ranks that eliminate the memory advantage
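The rank trade-off in the list above follows from simple parameter counting: one adapted (d × k) matrix has d·k frozen parameters, while its adapter has r·(d + k) trainable ones. The sketch below computes that ratio and the rank at which the adapter stops being smaller than W. The 4096 dimension is a hypothetical projection size chosen for illustration, not a value from any specific model.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one adapted d x k matrix: A is d x r, B is r x k."""
    return r * (d + k)


def full_params(d: int, k: int) -> int:
    """Parameters in the full d x k weight matrix."""
    return d * k


def break_even_rank(d: int, k: int) -> float:
    """Rank at which A and B together hold as many parameters as W itself."""
    return d * k / (d + k)


d = k = 4096  # hypothetical projection size, for illustration only
for r in (4, 8, 64, 512, 2048):
    frac = lora_params(d, k, r) / full_params(d, k)
    print(f"r={r:5d}  trainable fraction of this matrix = {frac:.4f}")

print(break_even_rank(d, k))  # 2048.0
```

For square matrices the break-even rank is d / 2, so the memory advantage shrinks steadily as r grows and is gone well before r reaches full rank.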