Progressive Distillation | designpattern.fyi

Progressive Distillation

Distill through intermediate sizes — 70B → 13B → 7B → 1B — instead of jumping from the largest teacher to the smallest target in one brutal step.

Intent & Description

🎯 Intent

Direct distillation from a very large teacher to a very small student loses too much in one jump. A chain of intermediate models closes the capacity gap in stages.

📋 Context

Distilling 70B → 1B is a 70x compression in one pass. That gap is too wide: the student can’t faithfully approximate the teacher’s distribution, and quality degrades sharply. Mid-size intermediates narrow the teacher-student capacity gap at each hop, which makes knowledge transfer easier.
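To make the size of each jump concrete, here is a minimal sketch that computes the per-hop compression ratio for a single jump and for the 70B → 13B → 7B → 1B chain. The helper name `hop_ratios` is hypothetical, and parameter count is used as a rough stand-in for model capacity, which is an assumption rather than something the pattern defines.

```python
def hop_ratios(sizes_b):
    """Return the teacher/student size ratio for each consecutive hop.

    sizes_b: model sizes in billions of parameters, largest first.
    Parameter count is only a proxy for capacity (an assumption).
    """
    return [teacher / student for teacher, student in zip(sizes_b, sizes_b[1:])]


direct = hop_ratios([70, 1])         # one jump from teacher to target
chain = hop_ratios([70, 13, 7, 1])   # the staged chain

print("direct:", direct)
print("chain: ", [round(r, 2) for r in chain])
```

The per-hop ratios always multiply back to the overall 70x, so the chain does not reduce the total compression; it only spreads it out. In this particular chain the last hop (7B → 1B) is still 7x, the widest of the three, which may be one reason to add a further intermediate such as 3B.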

💡 Solution

Define a distillation chain: Teacher (70B) → 13B → 7B → Target (1B). Each step has a manageable compression ratio, so student and teacher are close enough in capacity for effective transfer. Use standard knowledge distillation with soft labels at each stage; the trained student of one stage becomes the teacher of the next. Every intermediate checkpoint is itself a deployable production model.
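The chain can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: `softmax`, `kd_loss`, and `run_chain` are hypothetical names, the loss is the common soft-label formulation (KL divergence between temperature-softened distributions, scaled by T²), and the `distill` callback stands in for a full training run that a real pipeline would implement in a deep learning framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer labels."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is a common convention that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def run_chain(stages, distill):
    """Distill stage by stage; each trained student becomes the next teacher.

    stages:  models ordered largest first, e.g. ["70B", "13B", "7B", "1B"].
    distill: callable(teacher, student) -> trained student (assumed interface).
    Returns every trained checkpoint, intermediates included.
    """
    teacher = stages[0]
    checkpoints = []
    for student in stages[1:]:
        teacher = distill(teacher, student)
        checkpoints.append(teacher)
    return checkpoints


# Usage with a stand-in for training that only records which hop ran.
hops = []

def fake_distill(teacher, student):
    hops.append((teacher, student))
    return student

checkpoints = run_chain(["70B", "13B", "7B", "1B"], fake_distill)
print(hops)
print(checkpoints)
```

Returning every checkpoint rather than only the last one reflects the point that the intermediates are themselves usable models.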

Real-world Use Case

Extreme compression targets where direct large-to-small distillation degrades quality unacceptably. Model family development (70B → 13B → 7B → 3B → 1B) where every tier needs production quality. Research exploring theoretical compression limits.

📌 TL;DR

Don’t jump 70B to 1B in one step — use intermediates. Each hop is a shorter fall, and you land better. You get a model family as a byproduct.

Advantages

Better final quality than direct distillation at the same compression ratio
Intermediate checkpoints are independently deployable — 13B and 7B as byproducts
Smaller capacity gap at every hop: each student learns from a teacher close to its own size, not a 70x-larger one

Disadvantages

Multiple training runs add up: every stage pays for its own student training plus teacher inference, so total compute is the sum over all hops
Pipeline complexity grows with chain length
Errors in an intermediate model compound and degrade all downstream stages
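The compute overhead can be estimated with a back-of-the-envelope sketch. Everything here is an assumption for illustration: the rule of thumb of roughly 6·N FLOPs per token to train an N-parameter student and roughly 2·N FLOPs per token for a teacher forward pass, the same token budget at every stage, and the hypothetical helper names `stage_cost` and `chain_cost`. Only the relative numbers are meaningful.

```python
def stage_cost(teacher_b, student_b, tokens=1.0):
    """Rough relative cost of one distillation stage.

    Assumes ~6*N FLOPs per token to train the student (forward + backward)
    and ~2*N FLOPs per token for the teacher's forward pass.
    Sizes are in billions of parameters; the result is in relative units.
    """
    return (6 * student_b + 2 * teacher_b) * tokens


def chain_cost(sizes_b, tokens=1.0):
    """Sum the stage costs over every consecutive teacher -> student hop."""
    return sum(stage_cost(t, s, tokens) for t, s in zip(sizes_b, sizes_b[1:]))


direct = chain_cost([70, 1])
chained = chain_cost([70, 13, 7, 1])
print("direct: ", direct)
print("chained:", chained)
print("overhead: %.2fx" % (chained / direct))
```

Under these assumptions the three-stage chain costs roughly twice as much as the direct jump, not three times, because the later stages involve much smaller models. A real pipeline may use different token budgets per stage, which would change the picture.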

© 2026 designpattern.fyi. Crafted with ❤️ for modern software engineers by OpenAGI