Progressive Distillation
Distill through intermediate sizes — 70B → 13B → 7B → 1B — instead of jumping from the largest teacher to the smallest target in one brutal step.
Intent & Description
🎯 Intent
Direct distillation from a very large teacher to a very small student loses too much in one jump. A chain of intermediate models closes the capacity gap in stages.
📋 Context
Distilling 70B → 1B is a 70x compression in one pass. That gap is too wide: the student lacks the capacity to approximate the teacher’s output distribution faithfully, so quality degrades sharply. Mid-size intermediates narrow the teacher-student capacity gap at each hop, which makes knowledge transfer easier.
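To make the gap concrete, the per-hop ratios can be worked out from the parameter counts alone. The following is a minimal sketch; the sizes come from the chain named above, and `hop_ratios` is a hypothetical helper introduced only for illustration.

```python
# Model sizes in billions of parameters, taken from the chain in the text.
chain = [70, 13, 7, 1]

def hop_ratios(sizes):
    """Return the teacher/student size ratio for each consecutive hop."""
    return [teacher / student for teacher, student in zip(sizes, sizes[1:])]

ratios = hop_ratios(chain)
overall = chain[0] / chain[-1]

# Print each hop next to the single-jump ratio for comparison.
for (teacher, student), ratio in zip(zip(chain, chain[1:]), ratios):
    print(f"{teacher}B -> {student}B: {ratio:.1f}x")
print(f"overall: {overall:.0f}x")
```

No single hop exceeds 7x (7B → 1B), compared with 70x for the direct jump. Parameter count is only a rough proxy for capacity, so treat these ratios as a planning aid rather than a guarantee of transfer quality.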
💡 Solution
Define a distillation chain: Teacher (70B) → 13B → 7B → Target (1B). Each step is a manageable compression ratio where student and teacher are close enough in capacity for effective transfer. Use standard knowledge distillation with soft labels at each stage, with the student from one stage serving as the teacher for the next. Each intermediate checkpoint is a complete model in its own right and can be deployed if it meets your quality bar.
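The chain and the soft-label objective can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a training recipe: the loss is the common temperature-scaled KL divergence between teacher and student distributions (with the usual T² scaling), the default temperature of 2.0 is arbitrary, and `run_chain` and `distill_stage` are hypothetical names standing in for whatever training loop you actually use.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft labels from the teacher
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def run_chain(sizes, distill_stage):
    """Distill through consecutive sizes; each student becomes the next teacher.

    `distill_stage(teacher, student)` is a hypothetical callback that would
    train `student` against `teacher` using `soft_label_loss`.
    """
    stages = []
    teacher = sizes[0]
    for student in sizes[1:]:
        distill_stage(teacher, student)
        stages.append((teacher, student))
        teacher = student  # hand the trained student on as the next teacher
    return stages

# Usage: record the stage order without doing any real training.
plan = run_chain(["70B", "13B", "7B", "1B"], lambda teacher, student: None)
print(plan)
```

In practice the soft-label term is often mixed with an ordinary cross-entropy loss on ground-truth labels; that weighting is a separate design choice not covered here.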
Real-world Use Case
📌 TL;DR
Don’t jump 70B to 1B in one step — use intermediates. Each hop is a shorter fall, and you land better. You get a model family as a byproduct.
Advantages
- Typically better final quality than direct distillation at the same overall compression ratio
- Intermediate checkpoints are independently deployable — 13B and 7B as byproducts
- Smaller capacity gap per hop: each student learns from a teacher at most about 7x its size (7B → 1B), not 70x
Disadvantages
- Each additional stage adds a full training run, so total compute grows with chain length
- Pipeline complexity grows with chain length
- Errors in an intermediate model compound and degrade all downstream stages
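The compute overhead can be estimated roughly. The sketch below rests on assumptions that are not in the original: the common rule of thumb of about 6 FLOPs per parameter per token for training and about 2 for a forward pass, teacher logits computed on the fly, and the same token budget at every stage. `stage_cost` and `chain_cost` are hypothetical helpers, and the results are relative units (billions of parameters per token), not measured numbers.

```python
def stage_cost(teacher_b, student_b):
    """Relative cost per token of one distillation stage.

    Assumed rule of thumb: ~6 FLOPs per parameter for training the student,
    ~2 FLOPs per parameter for the teacher's forward pass.
    """
    return 6 * student_b + 2 * teacher_b

def chain_cost(sizes):
    """Sum the stage costs over consecutive (teacher, student) pairs."""
    return sum(stage_cost(t, s) for t, s in zip(sizes, sizes[1:]))

progressive = chain_cost([70, 13, 7, 1])  # 70B -> 13B -> 7B -> 1B
direct = chain_cost([70, 1])              # 70B -> 1B in one step

print(f"progressive: {progressive}, direct: {direct}, "
      f"ratio: {progressive / direct:.2f}x")
```

Under these assumptions the three-stage chain costs a little over twice the direct run, with most of the extra cost coming from training the 13B intermediate. Different token budgets per stage or cached teacher outputs would change the ratio.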