Full Fine-Tuning
Update every model parameter on task-specific data — maximum adaptation capacity, maximum compute and memory cost.
Intent & Description
🎯 Intent
Fully specialize a pre-trained model to a new domain by updating every weight — the highest-capacity adaptation method, used when lighter approaches fall short.
📋 Context
Pre-trained models encode general knowledge. For significant domain shift (medical imaging reports, legal contracts, financial filings) or deep behavioral change, partial fine-tuning methods may not adapt deep layers enough. Full fine-tuning changes everything.
💡 Solution
Initialize from a pre-trained checkpoint. Run standard supervised training on task-specific data with a small learning rate (1e-5 to 5e-5) and linear warmup. All parameters receive gradient updates. Use gradient checkpointing to manage memory (training requires 3–4x the inference memory footprint). Mix in general data to prevent catastrophic forgetting.
Real-world Use Case
📌 TL;DR
Maximum adaptation, maximum cost. Use when LoRA falls short — and always mix in general data or you’ll silently destroy capabilities you care about.
Advantages
- Maximum adaptation capacity — every parameter can shift to fit the new domain
- No architectural constraints — full expressive power of the model is available
- Produces a fully portable standalone checkpoint that doesn’t depend on a base model
Disadvantages
- Highest compute cost — requires full training infrastructure and significant GPU-hours
- Catastrophic forgetting — general capabilities degrade without careful data mixing
- Training memory is 3–4x inference memory — needs hardware most teams don’t have