Knowledge Distillation
Train a small student model on the big teacher's probability distributions — not just correct answers — so it inherits the teacher's uncertainty structure, not just its outputs.
Intent & Description
🎯 Intent
Compress a large model’s knowledge into a smaller deployable one without training from scratch — soft label distributions carry far richer signal than one-hot targets.
📋 Context
A small model trained from scratch on hard labels often underperforms the same architecture trained by distillation. The teacher’s output probability over all classes (e.g., “90% cat, 8% lynx, 2% dog”) encodes which wrong answers are almost right and how close each one is. Hard labels throw that signal away.
💡 Solution
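Run the teacher on the training data at temperature τ > 1 to soften its output distribution, so near-correct classes receive visible probability mass. Train the student on a weighted combination: α × τ² × KL(teacher soft labels ∥ student soft outputs) + (1-α) × cross-entropy(hard labels, student outputs). The student’s soft outputs use the same τ as the teacher, and the τ² factor keeps the soft-label gradients on a comparable scale as τ changes. The student learns the teacher’s similarity structure. At inference, use τ = 1.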
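A minimal sketch of the combined loss for a single example, using only the Python standard library. The function names, the logits, and the defaults τ = 4 and α = 0.5 are illustrative assumptions, not values from this card. The sketch takes the KL term from the teacher's softened distribution to the student's and scales it by τ², a common convention that is also an assumption here.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau > 1 flattens the distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      tau=4.0, alpha=0.5):
    """alpha * tau^2 * KL(teacher || student) + (1 - alpha) * CE(hard label)."""
    # Soft targets: teacher and student are both softened at the same tau.
    p_teacher = softmax(teacher_logits, tau)
    p_student_soft = softmax(student_logits, tau)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student_soft) if t > 0)

    # Hard-label term: ordinary cross-entropy at tau = 1.
    p_student = softmax(student_logits, 1.0)
    ce = -math.log(p_student[hard_label])

    return alpha * tau ** 2 * kl + (1 - alpha) * ce

# Hypothetical logits for the classes (cat, lynx, dog); the true label is cat.
teacher_logits = [5.0, 2.5, 1.0]
student_logits = [3.0, 0.5, 1.5]
print(distillation_loss(student_logits, teacher_logits, hard_label=0))
```

In a real training loop the same quantity would be averaged over a batch and backpropagated through the student only; the teacher's logits are treated as fixed targets.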
Real-world Use Case
📌 TL;DR
Train the small model to match what the big model almost said, not just what it said. The soft label distribution is the real signal.
Advantages
- Soft labels carry more signal than hard labels: a distilled student often outperforms the same architecture trained on hard labels alone
- Teacher’s learned class-similarity structure is preserved in soft probabilities
- Works across different architectures — no structural coupling between student and teacher
Disadvantages
- Teacher inference over the full training set is required upfront — adds compute before you save any
- Student quality is usually capped by the teacher’s: you’re compressing, not amplifying
- Temperature τ is empirical and task-dependent — needs tuning per domain
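To see what tuning τ actually changes, the sketch below sweeps a few temperatures over one set of teacher logits and prints the softened distribution and its entropy. The logits and the candidate τ values are hypothetical; a real sweep would score each τ by the student's validation accuracy rather than by entropy.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau > 1 flattens the distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical teacher logits for the classes (cat, lynx, dog).
teacher_logits = [5.0, 2.5, 1.0]

for tau in (1.0, 2.0, 4.0, 8.0):
    probs = softmax(teacher_logits, tau)
    rounded = [round(p, 3) for p in probs]
    print(f"tau={tau}: probs={rounded} entropy={entropy(probs):.3f}")
```

Low τ leaves the distribution close to a hard label, while very high τ pushes it toward uniform and washes out the class-similarity signal, which is why the useful range has to be found per task.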