Mixed-Precision Quantization
Assign different precision to different layers based on sensitivity: INT8 where accuracy is fragile, INT4 where the layer tolerates it. The goal is accuracy close to uniform INT8 at a memory footprint close to uniform INT4.
Intent & Description
🎯 Intent
Blanket INT4 degrades accuracy unevenly — some layers are sensitive, others don’t care. Mixed precision puts bits where they actually matter.
📋 Context
Sensitivity analysis often shows that early attention projections and final output layers degrade sharply under INT4 quantization, while mid-stack FFN layers tolerate it well. The exact pattern depends on the model, so measure it rather than assume it. Uniform INT8 spends bits on tolerant layers that do not need them; uniform INT4 costs accuracy in the sensitive ones.
💡 Solution
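Run per-layer sensitivity analysis on a calibration dataset: quantize one layer at a time to INT4, leave the rest at full precision, and measure the accuracy impact. Assign INT8 to high-sensitivity layers and INT4 to low-sensitivity ones. When only a small fraction of layers stays at INT8, average compression approaches uniform INT4 while accuracy approaches uniform INT8. Related tooling: SpQR and SqueezeLLM use sensitivity information to keep outlier or sensitive weights at higher precision; AutoGPTQ provides GPTQ quantization at a configured bit width. Check each tool's documentation for how much of the per-layer assignment it automates.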
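The analysis described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not any tool's actual implementation: the "model" is a toy stack of linear layers, sensitivity is measured as output MSE on calibration inputs rather than task accuracy, and `fake_quantize`, `layer_sensitivity`, `assign_precision`, and the `int8_fraction` budget are hypothetical names chosen for this example.

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return w.copy()
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def forward(weights, x):
    """Toy stand-in for a model: linear layers with ReLU between them."""
    for i, w in enumerate(weights):
        x = x @ w
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def layer_sensitivity(weights, calib_x, bits=4):
    """Quantize one layer at a time; score it by output MSE against full precision."""
    reference = forward(weights, calib_x)
    scores = []
    for i in range(len(weights)):
        trial = list(weights)
        trial[i] = fake_quantize(weights[i], bits)
        scores.append(float(np.mean((forward(trial, calib_x) - reference) ** 2)))
    return scores

def assign_precision(scores, int8_fraction=0.25):
    """Give INT8 to the most sensitive layers (assumed budget), INT4 to the rest."""
    n_int8 = max(1, round(len(scores) * int8_fraction))
    most_sensitive = np.argsort(scores)[::-1][:n_int8]
    plan = [4] * len(scores)
    for i in most_sensitive:
        plan[int(i)] = 8
    return plan

# Synthetic weights and calibration inputs, for illustration only.
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, size=(64, 64)) for _ in range(8)]
calib_x = rng.normal(size=(128, 64))

scores = layer_sensitivity(weights, calib_x)
plan = assign_precision(scores)
for i, (s, b) in enumerate(zip(scores, plan)):
    print(f"layer {i}: sensitivity={s:.3e} -> INT{b}")
```

In a real pipeline the score would come from a task metric or perplexity on representative calibration data, and the INT8 budget would be set by the memory target rather than a fixed fraction.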
Real-world Use Case
📌 TL;DR
Not all layers handle precision loss equally, so give bits to the layers that need them and compress the rest. The result is more accurate than uniform INT4 and smaller than uniform INT8.
Advantages
- Better accuracy than uniform INT4 at a comparable average compression ratio
- Compression concentrated in tolerant layers — no wasted precision where it doesn’t matter
- Automated sensitivity analysis removes the need for manual layer inspection
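To make "comparable average compression ratio" concrete, the sketch below computes the parameter-weighted average bit width for a hypothetical plan. The layer count, layer sizes, and the choice of 4 INT8 layers out of 32 are illustrative assumptions, and the arithmetic ignores the storage overhead of scales and zero points.

```python
def average_bits(param_counts, bit_plan):
    """Parameter-weighted average bit width across layers."""
    total_bits = sum(n * b for n, b in zip(param_counts, bit_plan))
    return total_bits / sum(param_counts)

# Hypothetical model: 32 equal-sized layers, 4 kept at INT8, 28 at INT4.
param_counts = [1_000_000] * 32
bit_plan = [8] * 4 + [4] * 28

avg = average_bits(param_counts, bit_plan)
ratio_vs_fp16 = 16 / avg

print(f"average bits per weight: {avg}")            # 4.5
print(f"compression vs FP16: {ratio_vs_fp16:.2f}x")  # 3.56x
```

Under these assumptions the mixed plan lands at 4.5 bits per weight, about 3.56x smaller than FP16, compared with 4x for uniform INT4 and 2x for uniform INT8.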
Disadvantages
- More complex deployment toolchain than uniform quantization — multiple precision levels in play
- Mixed-precision kernels may not be supported on all target hardware
- Sensitivity analysis requires representative calibration data and adds evaluation cost