LoRA (Low-Rank Adaptation) | designpattern.fyi

LoRA (Low-Rank Adaptation)

Freeze the base model, inject tiny trainable rank-decomposition matrices into attention layers — 100x fewer trainable parameters than full fine-tuning, fits on one GPU.

Intent & Description

🎯 Intent

Adapt a large pre-trained model to a new task with a fraction of the trainable parameters and memory cost of full fine-tuning, making fine-tuning feasible on hardware that can’t hold full gradients and optimizer states.

📋 Context

Fully fine-tuning a 7B-parameter model requires a gradient and optimizer state for every one of the 7B weights, on top of the weights themselves, which is far beyond a single consumer GPU. LoRA builds on the hypothesis that the weight updates needed for fine-tuning have low intrinsic rank, and decomposes each update into the product of two small matrices.
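
To make the memory gap concrete, here is a back-of-the-envelope sketch. The byte counts are assumptions (16-bit weights and gradients, plus a 32-bit master copy and two 32-bit Adam moments for each trainable weight), the 35M adapter size is an illustrative 0.5% of 7B, and the helper name `training_state_gb` is hypothetical. Activations are ignored, so real usage is higher.

```python
def training_state_gb(n_weights, n_trainable, weight_bytes=2, train_bytes=14):
    """Rough memory for weights, gradients, and optimizer state, in GB (1e9 bytes).

    Assumed accounting: every weight is stored in 16-bit (2 bytes). Each
    trainable weight additionally needs a 16-bit gradient (2), a 32-bit
    master copy (4), and two 32-bit Adam moments (8), i.e. 14 bytes.
    """
    return (n_weights * weight_bytes + n_trainable * train_bytes) / 1e9


BASE = 7_000_000_000      # 7B base model
ADAPTER = 35_000_000      # assumed adapter size, 0.5% of the base model

# Full fine-tuning: every base weight is trainable.
full_gb = training_state_gb(BASE, BASE)

# LoRA: base weights are stored but frozen; only the adapter is trainable.
lora_gb = training_state_gb(BASE + ADAPTER, ADAPTER)

print(f"full fine-tuning: {full_gb:.2f} GB")
print(f"LoRA:             {lora_gb:.2f} GB")
```

Under these assumptions nearly all of the LoRA footprint is the frozen base weights themselves; the gradients and optimizer state shrink to well under 1 GB.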

💡 Solution

For each target weight matrix W of shape (d × k) (typically the Q, K, V attention projections), freeze W and add a parallel path: ΔW = A × B, where A is (d × r) and B is (r × k) with rank r << min(d, k). Only A and B are trained, typically 0.1–1% of base model parameters. At inference, merge ΔW back into W for zero added latency, or keep adapters separate for multi-task hot-swapping.
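
The parallel path and the merge step can be sketched in NumPy as follows. This is a minimal illustration, not a training implementation: the sizes are arbitrary, the `scale` factor (commonly alpha / r in LoRA implementations) and the zero initialization of B are assumptions not stated above, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4            # illustrative sizes, with r << min(d, k)
scale = 2.0                    # assumed alpha / r scaling factor

W = rng.normal(size=(d, k))               # frozen base weight
A = rng.normal(scale=0.01, size=(d, r))   # trainable, small random init
B = np.zeros((r, k))                      # trainable, zero init so delta W starts at 0


def lora_forward(x, W, A, B, scale=1.0):
    """Base path plus low-rank path; delta W = A @ B is never materialized."""
    return x @ W + scale * ((x @ A) @ B)


def merge(W, A, B, scale=1.0):
    """Fold the adapter into the base weight for inference."""
    return W + scale * (A @ B)


x = rng.normal(size=(2, d))    # a batch of two input rows

# With B at zero, the adapted layer matches the frozen layer exactly.
starts_as_noop = np.allclose(lora_forward(x, W, A, B, scale), x @ W)

B = rng.normal(size=(r, k))    # stand-in for values learned during training
W_merged = merge(W, A, B, scale)

# The merged weight reproduces the two-path output with a single matmul.
merge_matches = np.allclose(x @ W_merged, lora_forward(x, W, A, B, scale))

trainable = A.size + B.size    # r * (d + k)
frozen = W.size                # d * k
print(starts_as_noop, merge_matches, trainable, frozen)
```

Keeping the low-rank path as `(x @ A) @ B` avoids ever building the full (d × k) update during training; the merge builds it once, after which inference costs the same as the original layer.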

Real-world Use Case

Adapting large models on consumer or single-GPU hardware. Maintaining multiple task-specific adapters on one shared base model. Rapid fine-tuning iteration before committing to full fine-tuning compute.

📌 TL;DR

Freeze the big model and train two tiny matrices per adapted weight matrix. Quality is often close to full fine-tuning while training about 1% of the parameters or fewer, and the whole job can often fit on one GPU.

Advantages

Roughly 100–1000x fewer trainable parameters (0.1–1% of the base model), so fine-tuning fits on hardware that can’t hold full gradients
Adapters are small and swappable — multiple tasks on one base model without storing full copies
Base weights frozen: the original model is recovered by detaching the adapter, and forgetting of general capabilities is reduced, though not eliminated
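
The swappable-adapter advantage can be illustrated with a small NumPy sketch. The task names, sizes, and the `forward` helper are all hypothetical, and the random adapter values stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 32, 32, 2

W = rng.normal(size=(d, k))    # one shared, frozen base weight
W_before = W.copy()

# One (A, B) pair per task; task names are placeholders.
adapters = {
    "summarize": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "translate": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}


def forward(x, task=None):
    """Apply the base layer, plus the chosen task's low-rank path if any."""
    y = x @ W
    if task is not None:
        A, B = adapters[task]
        y = y + (x @ A) @ B
    return y


x = rng.normal(size=(1, d))
base_out = forward(x)
summarize_out = forward(x, "summarize")
translate_out = forward(x, "translate")

# Swapping adapters never touches the base weight.
base_untouched = np.array_equal(W, W_before)

adapter_params = sum(A.size + B.size for A, B in adapters.values())
full_copy_params = len(adapters) * W.size   # cost of one full copy per task
print(base_untouched, adapter_params, full_copy_params)
```

Here two adapters add 256 parameters in total, versus 2048 for two full copies of this one matrix; the gap widens as d and k grow.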

Disadvantages

Lower adaptation capacity than full fine-tuning — constrained by rank r
Rank selection is a hyperparameter — too low limits quality, too high approaches full fine-tuning cost
Very large distribution shifts may require ranks that eliminate the memory advantage
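
The rank trade-off can be checked with simple arithmetic. The sketch below assumes a single 4096 × 4096 projection (an illustrative size) and hypothetical helper names. It counts trainable parameters only; the frozen weights and activations still occupy memory in either case.

```python
def lora_params(d, k, r):
    """Trainable entries in A (d x r) plus B (r x k)."""
    return r * (d + k)


def breakeven_rank(d, k):
    """Smallest rank at which A and B together hold at least as many entries as W."""
    return (d * k + d + k - 1) // (d + k)   # ceiling of d*k / (d + k)


d = k = 4096                                # assumed projection size
fractions = {r: lora_params(d, k, r) / (d * k) for r in (8, 64, 512, 2048)}

for r, frac in fractions.items():
    print(f"r={r:5d}  trainable={lora_params(d, k, r):>10,d}  fraction of W={frac:.4%}")

print("break-even rank:", breakeven_rank(d, k))
```

For a square matrix the break-even rank is d / 2, so the parameter advantage is large at small ranks and gone entirely once r reaches half the hidden size.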

© 2026 designpattern.fyi. Crafted with ❤️ for modern software engineers by OpenAGI