Weight-Only Quantization
Store weights at INT4, keep activations in FP16 — most of the memory win with a fraction of the accuracy cost of quantizing everything.
Intent & Description
🎯 Intent
Weights account for most of a large model's memory footprint, while activations are the part most sensitive to reduced precision. Quantize only the weights to capture most of the memory reduction while keeping arithmetic in FP16.
📋 Context
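Quantizing both weights and activations to INT4 is aggressive and tends to degrade accuracy noticeably, largely because activations contain outliers that a 4-bit grid represents poorly. But at small batch sizes and moderate context lengths, most of a large model's memory is weights, not activations. Dequantizing weights to FP16 just-in-time for each matmul keeps arithmetic clean while storage stays at INT4.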
💡 Solution
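Store weight matrices in INT4 (or INT3). At each layer's forward pass, dequantize the weight matrix INT4 → FP16, run the matmul in FP16, and discard the dequantized copy. Optimized kernels typically fuse the dequantization into the matmul, so the FP16 copy only exists one tile at a time and never as a full matrix in memory. Activations stay FP16 throughout. GPTQ and AWQ use this approach, as do the quantized weight formats stored in GGUF files. Calibration-based methods such as GPTQ and AWQ use a small sample of data to choose how each weight matrix is quantized, typically with a scale and zero point per group of weights.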
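The store-dequantize-matmul-discard loop described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not how GPTQ, AWQ, or GGUF kernels are implemented: it uses plain round-to-nearest with no calibration, an assumed group size of 32, and an FP16 scale plus FP16 minimum per group. The names `quantize_int4`, `dequantize_int4`, and `linear_w4a16` are hypothetical.

```python
import numpy as np

GROUP_SIZE = 32  # assumed; real formats commonly use 32, 64, or 128


def quantize_int4(w, group_size=GROUP_SIZE):
    """Round-to-nearest INT4 with an FP16 scale and minimum per group."""
    rows, cols = w.shape
    assert cols % group_size == 0 and cols % 2 == 0, "unsupported shape"
    g = w.astype(np.float32).reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True).astype(np.float16)
    w_max = g.max(axis=-1, keepdims=True)
    # 4 bits give 16 levels (0..15) spread across each group's range.
    scale = ((w_max - w_min.astype(np.float32)) / 15.0).astype(np.float16)
    scale = np.where(scale == 0, np.float16(1.0), scale)  # constant groups
    q = np.round((g - w_min.astype(np.float32)) / scale.astype(np.float32))
    q = np.clip(q, 0, 15).astype(np.uint8).reshape(rows, cols)
    packed = q[:, 0::2] | (q[:, 1::2] << 4)  # two 4-bit codes per byte
    return packed, scale, w_min


def dequantize_int4(packed, scale, w_min, group_size=GROUP_SIZE):
    """Unpack the 4-bit codes and map them back to FP16 weights."""
    rows, half = packed.shape
    q = np.empty((rows, half * 2), dtype=np.uint8)
    q[:, 0::2] = packed & 0x0F  # low nibble holds the even column
    q[:, 1::2] = packed >> 4    # high nibble holds the odd column
    g = q.reshape(rows, -1, group_size).astype(np.float16)
    return (g * scale + w_min).reshape(rows, half * 2)


def linear_w4a16(x, packed, scale, w_min):
    """FP16 matmul against a temporary dequantized copy of the weights."""
    w = dequantize_int4(packed, scale, w_min)  # lives only for this call
    return x @ w.T


rng = np.random.default_rng(0)
w = (0.05 * rng.standard_normal((16, 64))).astype(np.float32)  # toy weights
x = rng.standard_normal((4, 64)).astype(np.float16)            # FP16 activations
packed, scale, w_min = quantize_int4(w)
y = linear_w4a16(x, packed, scale, w_min)
print(packed.nbytes, w.astype(np.float16).nbytes, y.shape)
```

The per-weight reconstruction error is bounded by roughly half a quantization step (`scale / 2`) in each group. Calibration-based methods aim to do better than this round-to-nearest baseline at the same bit width.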
Real-world Use Case
📌 TL;DR
Weights at INT4, compute in FP16. Most of the memory win, fraction of the accuracy cost.
Advantages
- Up to ~4x smaller weight storage than FP16 (somewhat less once per-group scales and zero points are counted), with minimal accuracy loss on most architectures
- Arithmetic stays in FP16 — avoids INT4 matmul precision issues entirely
- Widely supported: GPTQ, AWQ, and GGUF all have mature, well-maintained tooling
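As a rough check on the memory advantage, the arithmetic below estimates weight storage for a hypothetical 7B-parameter model. The group size of 128 and the 32 bits of metadata per group (one FP16 scale plus one FP16 zero point or minimum) are assumptions chosen for illustration, and `weight_bytes` is a made-up helper; actual formats differ in their per-group overhead.

```python
def weight_bytes(n_params, bits_per_weight, group_size=None, meta_bits_per_group=0):
    """Bytes needed to store n_params weights, including per-group metadata."""
    effective_bits = bits_per_weight
    if group_size:
        # Spread each group's metadata cost across the weights in that group.
        effective_bits += meta_bits_per_group / group_size
    return n_params * effective_bits / 8


n_params = 7_000_000_000  # hypothetical 7B-parameter model

fp16 = weight_bytes(n_params, 16)
int4 = weight_bytes(n_params, 4, group_size=128, meta_bits_per_group=32)
ratio = fp16 / int4

print(f"FP16: {fp16 / 1e9:.2f} GB, INT4: {int4 / 1e9:.2f} GB, ratio: {ratio:.2f}x")
```

Under these assumptions the effective cost is 4.25 bits per weight, so the reduction is about 3.76x instead of a full 4x. Smaller groups improve accuracy but push the ratio lower.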
Disadvantages
- Dequantization overhead on every forward pass reduces raw throughput vs. native INT4 compute
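- Can be slower than unquantized FP16 for large-batch inference, where the workload is compute-bound and memory isn't the bottleneck
- Methods such as GPTQ and AWQ need a calibration pass over sample data to pick accurate quantization parameters for each weight matrix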