Structured Pruning
Remove entire attention heads, FFN neurons, or transformer layers based on importance scores — real hardware speedups, no sparse kernel requirements.
Intent & Description
🎯 Intent
Shrink a model by removing entire structural components — not individual weights — so that standard dense hardware acceleration applies directly without needing sparse compute kernels.
📋 Context
Unstructured pruning removes individual weights for fine-grained sparsity, but it needs sparse matmul kernels that most hardware doesn’t accelerate well, so theoretical FLOPs savings often don’t translate into wall-clock speedups. Structured pruning removes entire heads, neurons, or layers. The result is simply a smaller dense model, which runs faster wherever dense models already run.
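A minimal sketch of that difference, in plain Python on a toy weight matrix. The matrix values, the threshold, and the helper names (`prune_unstructured`, `prune_structured`, `dense_macs`) are illustrative assumptions, not part of any pruning library.

```python
# Toy 4x3 weight matrix: 4 output neurons (rows), 3 inputs (columns).
W = [
    [0.90, -0.30, 0.40],
    [0.05, 0.02, -0.03],
    [-0.70, 0.60, 0.20],
    [0.01, -0.80, 0.03],
]

def prune_unstructured(weights, threshold):
    """Zero individual weights with |w| < threshold; the shape is unchanged."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

def prune_structured(weights, keep_rows):
    """Drop whole rows (neurons); the result is a smaller dense matrix."""
    return [list(weights[i]) for i in keep_rows]

def dense_macs(weights):
    """Multiply-accumulates a dense kernel performs: rows * cols, zeros included."""
    return len(weights) * len(weights[0])

sparse_W = prune_unstructured(W, threshold=0.1)
small_W = prune_structured(W, keep_rows=[0, 2, 3])

zeros = sum(w == 0.0 for row in sparse_W for w in row)
print("unstructured:", zeros, "zeros, dense MACs =", dense_macs(sparse_W))
print("structured:  ", len(small_W), "rows, dense MACs =", dense_macs(small_W))
```

The unstructured result still has the original 4x3 shape, so a dense kernel does the same work as before unless it can skip zeros. The structured result is a 3x3 matrix that any dense kernel processes with less work.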
💡 Solution
Run a sensitivity analysis: for each structural component (attention head, FFN neuron group, or transformer layer), measure how much perplexity or task loss increases when that component is removed, and use the increase as its importance score. Rank components by score and remove the least important ones until the target compression ratio is reached. Then run recovery fine-tuning (often with LoRA) on a representative dataset to restore quality; post-pruning distillation with the original model as teacher can recover additional quality. Tools: LLM-Pruner, SlimLLM, Adapt-Pruner. Common targets: 20–50% layer reduction, with 5–15% quality regression before recovery fine-tuning.
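The score-rank-remove loop can be sketched on a toy one-hidden-layer FFN in plain Python. Everything here is an assumption made for illustration: the weights, the calibration inputs, the use of squared error against the unpruned model's outputs as the "task loss", and the function names. It does not reproduce how LLM-Pruner, SlimLLM, or Adapt-Pruner compute importance, and recovery fine-tuning is not shown.

```python
def relu(x):
    return x if x > 0 else 0.0

def forward(x, W1, W2, keep):
    """One-hidden-layer FFN; only hidden neurons listed in `keep` contribute."""
    total = 0.0
    for j in keep:
        h = relu(sum(w * xi for w, xi in zip(W1[j], x)))
        total += W2[j] * h
    return total

def loss(data, W1, W2, keep):
    """Mean squared error over the calibration set."""
    return sum((forward(x, W1, W2, keep) - y) ** 2 for x, y in data) / len(data)

def rank_by_importance(data, W1, W2):
    """Importance of neuron j = loss increase when j alone is ablated."""
    neurons = list(range(len(W1)))
    base = loss(data, W1, W2, neurons)
    scores = {}
    for j in neurons:
        ablated = [k for k in neurons if k != j]
        scores[j] = loss(data, W1, W2, ablated) - base
    order = sorted(neurons, key=lambda j: scores[j])  # least important first
    return order, scores

def prune(data, W1, W2, ratio):
    """One-shot removal of the lowest-scoring `ratio` of hidden neurons."""
    order, _ = rank_by_importance(data, W1, W2)
    n_remove = int(len(W1) * ratio)
    keep = sorted(order[n_remove:])
    return [W1[j] for j in keep], [W2[j] for j in keep], keep

# Hypothetical toy model: 4 hidden neurons, 2 inputs, scalar output.
W1 = [[1.0, 0.0], [0.0, 1.0], [0.01, 0.01], [1.0, 1.0]]
W2 = [1.0, 1.0, 0.01, 0.5]

# Calibration set; targets are the unpruned model's own outputs (an assumption).
inputs = [[1.0, 2.0], [2.0, 1.0], [0.5, 0.5], [3.0, 0.0]]
data = [(x, forward(x, W1, W2, range(len(W1)))) for x in inputs]

order, scores = rank_by_importance(data, W1, W2)
W1_small, W2_small, keep = prune(data, W1, W2, ratio=0.5)
print("least to most important:", order)
print("kept neurons:", keep)
```

This sketch scores each neuron once and prunes in a single pass. Single-component ablation ignores interactions between components, so a real pipeline might re-score after each removal round; that choice is outside what the text above specifies.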
Real-world Use Case
📌 TL;DR
Remove whole heads, neurons, or layers instead of individual weights. The resulting model is a smaller dense model that runs faster everywhere with no sparse kernel requirements.
Advantages
- Real hardware speedups without sparse compute kernels — the pruned model is just a smaller dense model
- Composable with quantization and LoRA — prune first, then quantize and fine-tune for compounding gains
- Layer-level pruning reduces model depth, which lowers latency directly because layers execute one after another
Disadvantages
- Aggressive pruning (>40% of parameters) degrades quality significantly before recovery fine-tuning
- Sensitivity analysis requires calibration data and adds pre-pruning evaluation cost
- Recovery fine-tuning is required to restore quality — adds a training step after pruning
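To check where a pruning plan sits relative to the 40% threshold mentioned above, the removed fraction of parameters can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: each layer has four attention projections (Q, K, V, O) and a two-matrix FFN, and biases, norms, and embeddings are ignored. The configuration values are hypothetical.

```python
def layer_params(d_model, n_heads, head_dim, d_ff):
    """Approximate weights in one transformer layer (no biases or norms)."""
    attn = 4 * d_model * n_heads * head_dim  # Q, K, V, O projections
    ffn = 2 * d_model * d_ff                 # up and down projections
    return attn + ffn

def total_params(cfg):
    per_layer = layer_params(cfg["d_model"], cfg["n_heads"],
                             cfg["head_dim"], cfg["d_ff"])
    return cfg["n_layers"] * per_layer

def pruned_fraction(original, pruned):
    """Fraction of layer parameters removed by the pruning plan."""
    return 1 - total_params(pruned) / total_params(original)

# Hypothetical configuration and pruning plan.
original = {"n_layers": 24, "d_model": 1024, "n_heads": 16,
            "head_dim": 64, "d_ff": 4096}
pruned = {"n_layers": 18, "d_model": 1024, "n_heads": 12,
          "head_dim": 64, "d_ff": 3072}

fraction = pruned_fraction(original, pruned)
print(f"parameters removed: {fraction:.2%}")
```

In this example, cutting layers, heads, and FFN width by 25% each compounds to about 44% of parameters removed, which already falls in the aggressive range.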