Mixed-Precision Quantization | designpattern.fyi

Mixed-Precision Quantization

Assign precision per layer based on sensitivity: INT8 where accuracy is fragile, INT4 where it is tolerant. The result is close to INT8 accuracy at close to INT4 memory.

Intent & Description

🎯 Intent

Blanket INT4 degrades accuracy unevenly — some layers are sensitive, others don’t care. Mixed precision puts bits where they actually matter.

📋 Context

Sensitivity analysis often shows that early attention projections and final output layers degrade sharply under INT4 quantization, while mid-stack FFN layers tolerate it well; the exact pattern varies by model. Uniform quantization forces a bad trade either way: uniform INT8 spends precision on layers that do not need it, and uniform INT4 damages accuracy in the layers that do.

💡 Solution

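Run per-layer sensitivity analysis on a calibration dataset: quantize each layer to INT4 on its own, leave the rest at full precision, and measure the accuracy impact. Assign INT8 to high-sensitivity layers and INT4 to low-sensitivity ones. When only a small share of the weights stays at INT8, average compression approaches uniform INT4 while accuracy approaches uniform INT8. Related tooling: SpQR and SqueezeLLM use sensitivity information to keep outlier or highly sensitive weights at higher precision, and AutoGPTQ provides GPTQ quantization at a configurable bit width. Check each tool's documentation for how much of the per-layer assignment it automates.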
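
The analysis loop can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any of the tools above work internally: the "model" is a small stack of random linear layers, sensitivity is the mean squared output error on calibration inputs, and the helper names (`fake_quantize`, `layer_sensitivity`, `assign_precision`) and the `int8_fraction` parameter are hypothetical.

```python
import random

def fake_quantize(weights, bits):
    """Symmetric per-tensor quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [[max(-qmax, min(qmax, round(w / scale))) * scale for w in row]
            for row in weights]

def forward(layers, x):
    """Toy stand-in for a model: linear layers with ReLU between them."""
    for i, w in enumerate(layers):
        x = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def layer_sensitivity(layers, calib_inputs, bits=4):
    """Quantize one layer at a time and record the mean squared output error."""
    reference = [forward(layers, x) for x in calib_inputs]
    scores = []
    for idx in range(len(layers)):
        trial = list(layers)  # shallow copy: only layer idx is swapped out
        trial[idx] = fake_quantize(layers[idx], bits)
        err, count = 0.0, 0
        for x, ref in zip(calib_inputs, reference):
            out = forward(trial, x)
            err += sum((a - b) ** 2 for a, b in zip(out, ref))
            count += len(ref)
        scores.append(err / count)
    return scores

def assign_precision(scores, int8_fraction=0.25):
    """Give INT8 to the most sensitive fraction of layers, INT4 to the rest."""
    n_int8 = max(1, round(len(scores) * int8_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    high = set(ranked[:n_int8])
    return [8 if i in high else 4 for i in range(len(scores))]

random.seed(0)
dim = 8
layers = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
          for _ in range(4)]
calib = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]

scores = layer_sensitivity(layers, calib)
plan = assign_precision(scores)
print(plan)  # one bit width per layer, 8 for the most sensitive one
```

In a real deployment the error metric would more likely be perplexity or task accuracy on the calibration set, and the cutoff between INT8 and INT4 would be tuned rather than fixed at a fraction.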

Real-world Use Case

Maximum compression at a given accuracy target. Production deployments where uniform INT4 kills key capabilities but uniform INT8 is too memory-heavy. Tuning the compression-accuracy frontier without any retraining.
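
One way to read "maximum compression at a given accuracy target" is as a budgeted selection: start with every layer at INT8 and demote the least sensitive layers to INT4 until the estimated accuracy loss reaches the budget. The sketch below assumes per-layer losses simply add up, which is only an approximation. The layer names, the loss figures, and the function `plan_for_accuracy_target` are hypothetical.

```python
def plan_for_accuracy_target(sensitivities, max_total_degradation):
    """Demote the least sensitive layers to INT4 while staying within the budget.

    Assumes the degradation from each INT4 layer is additive (an approximation).
    """
    plan = {name: 8 for name in sensitivities}
    spent = 0.0
    # Visit layers from least to most sensitive.
    for name, cost in sorted(sensitivities.items(), key=lambda kv: kv[1]):
        if spent + cost > max_total_degradation:
            break  # every remaining layer costs at least as much
        plan[name] = 4
        spent += cost
    return plan, spent

# Hypothetical accuracy drop (in points) when only that layer runs at INT4.
sensitivities = {
    "attn.0": 0.90,
    "ffn.0": 0.05,
    "attn.1": 0.30,
    "ffn.1": 0.04,
    "lm_head": 1.20,
}

plan, spent = plan_for_accuracy_target(sensitivities, max_total_degradation=0.5)
print(plan)
# {'attn.0': 8, 'ffn.0': 4, 'attn.1': 4, 'ffn.1': 4, 'lm_head': 8}
```

Because the additivity assumption can be off, the final plan should still be validated end to end on held-out data before deployment.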

📌 TL;DR

Not all layers handle precision loss equally — give bits to the layers that need them, compress the rest. Better than uniform INT4, leaner than uniform INT8.

Advantages

Better accuracy than uniform INT4 at comparable average compression ratio
Compression concentrated in tolerant layers — no wasted precision where it doesn’t matter
Automated sensitivity analysis removes the need for manual layer inspection
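
How close a mixed plan gets to INT4 memory depends on how many parameters sit at each width. A rough calculation, assuming hypothetical parameter counts and ignoring the overhead of scales and zero points:

```python
def average_bits(param_counts, plan):
    """Parameter-weighted average bit width across layer groups."""
    total = sum(param_counts.values())
    return sum(param_counts[n] * plan[n] for n in param_counts) / total

def footprint_gib(param_counts, plan):
    """Weight storage in GiB, excluding quantization metadata."""
    bits = sum(param_counts[n] * plan[n] for n in param_counts)
    return bits / 8 / 2**30

# Hypothetical split: 1B parameters in sensitive layers, 6B in tolerant ones.
param_counts = {"sensitive": 1_000_000_000, "tolerant": 6_000_000_000}
mixed = {"sensitive": 8, "tolerant": 4}
uniform_int4 = {"sensitive": 4, "tolerant": 4}
uniform_int8 = {"sensitive": 8, "tolerant": 8}

print(f"{average_bits(param_counts, mixed):.2f} bits/weight")  # 4.57 bits/weight
print(f"{footprint_gib(param_counts, mixed):.2f} GiB")         # 3.73 GiB
print(f"{footprint_gib(param_counts, uniform_int4):.2f} GiB")  # 3.26 GiB
print(f"{footprint_gib(param_counts, uniform_int8):.2f} GiB")  # 6.52 GiB
```

With this split the mixed plan costs about 14% more memory than uniform INT4 and a little over half of uniform INT8.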

Disadvantages

More complex deployment toolchain than uniform quantization — multiple precision levels in play
Mixed-precision kernels may not be supported on all target hardware
Sensitivity analysis requires representative calibration data and adds evaluation cost

© 2026 designpattern.fyi. Crafted with ❤️ for modern software engineers by OpenAGI