Knowledge Distillation | designpattern.fyi

designpattern.fyi

Software Design Catalog

Language Models

Back to Catalog

Language Models Model Distillation

Knowledge Distillation

Train a small student model on the big teacher's probability distributions — not just correct answers — so it inherits the teacher's uncertainty structure, not just its outputs.

Intent & Description

🎯 Intent

Compress a large model’s knowledge into a smaller deployable one without training from scratch — soft label distributions carry far richer signal than one-hot targets.

📋 Context

Training small from scratch on the same task consistently underperforms a distilled student. The teacher’s output probability over all classes — e.g., “90% cat, 8% lynx, 2% dog” — encodes which wrong answers are almost-right and why. Hard labels throw that signal away.

💡 Solution

Run the teacher on training data at temperature τ > 1 to spread its output distribution (soften near-correct classes). Train the student with a weighted combo: α × KL(student ∥ teacher soft labels) + (1-α) × cross-entropy(student ∥ hard labels). The student learns the teacher’s similarity structure. At inference, use τ = 1.

Real-world Use Case

Edge/mobile deployment of frontier-quality models. Slashing inference cost while keeping quality. Building domain-specialized small models from a general large one where training from scratch is impractical.

📌 TL;DR

Train the small model to match what the big model almost said, not just what it said. The soft label distribution is the real signal.

Advantages

Soft labels beat hard labels — student trained this way outperforms the same architecture trained normally
Teacher’s learned class-similarity structure is preserved in soft probabilities
Works across different architectures — no structural coupling between student and teacher

Disadvantages

Teacher inference over the full training set is required upfront — adds compute before you save any
Student cannot exceed teacher quality — you’re compressing, not amplifying
Temperature τ is empirical and task-dependent — needs tuning per domain

27 of 58

© 2026 designpattern.fyi. Crafted with ❤️ for modern software engineers by OpenAGI