Quantization-Aware Training (QAT)
Simulate quantization noise during training so the model learns weights that survive lower precision. At the same bit-width, QAT typically retains more accuracy than post-training quantization (PTQ), especially below INT8.
Intent & Description
🎯 Intent
Train the model to tolerate the precision reduction it’ll face at inference instead of applying quantization as a post-hoc surprise.
📋 Context
PTQ quantizes weights that were trained at full precision, so the model never saw quantization noise during gradient updates. For aggressive targets (INT4, INT2) this mismatch can degrade accuracy sharply. QAT bakes quantization into training so it’s expected, not a shock.
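To make the bit-width effect concrete, here is a small sketch that rounds a handful of weights to a symmetric signed grid and measures the round-trip error. The weight values and the helper names `quantize_dequantize` and `mean_abs_error` are made up for illustration. Per-tensor max-abs scaling with round-to-nearest is assumed as the simplest PTQ scheme, and real toolchains often do something more careful.

```python
def quantize_dequantize(values, num_bits):
    """Round values to a symmetric signed integer grid, then map back to floats."""
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    # Assumption: one scale for the whole tensor, chosen from the largest magnitude.
    scale = max(abs(v) for v in values) / qmax
    restored = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # integer code, clamped to range
        restored.append(q * scale)                  # value the model actually computes with
    return restored


def mean_abs_error(values, num_bits):
    """Average absolute difference between original and quantized values."""
    restored = quantize_dequantize(values, num_bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical full-precision weights, not taken from any real model.
weights = [-0.91, -0.42, -0.13, 0.02, 0.08, 0.27, 0.55, 0.88]

errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"INT{bits}: mean abs error = {err:.4f}")
```

The error grows as the grid gets coarser. A model trained only at full precision has never had the chance to compensate for that error, which is the mismatch described above.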
💡 Solution
During forward passes, insert fake quantization nodes: quantize-dequantize ops that snap weights and activations to the INT4/INT8 grid while keeping them in floating point. Rounding has zero gradient almost everywhere, so gradients flow through fake-quant nodes via the straight-through estimator (STE), which treats rounding as identity for backprop. The model learns params that already cluster near quantization grid points. At deployment, real quantization is applied to a model that already expects it.
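A minimal sketch of this mechanism in plain Python, assuming a symmetric signed grid with a fixed, hand-picked scale and a single weight fitted by least squares. The names `fake_quantize`, `ste_grad`, and `train` are hypothetical and not from any library. Zeroing the gradient outside the clipping range is one common STE variant, added here as an assumption rather than something the description prescribes.

```python
def quant_range(num_bits):
    """Smallest and largest integer codes of a signed num_bits grid."""
    return -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1


def fake_quantize(x, scale, num_bits=4):
    """Forward pass of a fake-quant node: quantize, clamp, dequantize."""
    qmin, qmax = quant_range(num_bits)
    q = round(x / scale)           # Python's round() rounds exact halves to even
    q = max(qmin, min(qmax, q))    # clamp to the representable integer range
    return q * scale               # back to float: the value is on the grid


def ste_grad(x, scale, upstream_grad, num_bits=4):
    """Backward pass: treat rounding as identity (straight-through estimator).

    Assumption: the gradient is zeroed where x lies outside the clip range.
    """
    qmin, qmax = quant_range(num_bits)
    inside = qmin * scale <= x <= qmax * scale
    return upstream_grad if inside else 0.0


def train(xs, ys, steps=200, lr=0.05, scale=0.1, num_bits=4):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # latent full-precision weight, updated by gradient descent
    for _ in range(steps):
        w_q = fake_quantize(w, scale, num_bits)
        # Gradient of the mean squared error with respect to the quantized weight.
        grad_wq = sum(2.0 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # STE: reuse that gradient for the latent weight.
        w -= lr * ste_grad(w, scale, grad_wq, num_bits)
    return w, fake_quantize(w, scale, num_bits)


# Toy data generated from a slope (0.37) that is not on the 0.1-spaced grid.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.37 * x for x in xs]
w_latent, w_deployed = train(xs, ys)
print("latent weight:", w_latent, "deployed weight:", w_deployed)
```

In this one-weight toy the latent weight ends up hovering near a rounding boundary, so the deployed value lands on one of the two grid points adjacent to 0.37. Real QAT setups usually also learn or calibrate the scale and fake-quantize activations, which this sketch leaves out.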
Real-world Use Case
📌 TL;DR
Train knowing you’ll quantize. Fake quantization during training produces weights that survive the precision drop far better than PTQ applied to a model trained only at full precision.
Advantages
- Significantly better accuracy than PTQ at the same bit-width — especially at INT4 and below
- Robust to the shift quantization causes in weight and activation values, because the model was trained expecting that noise
- Final weights are optimized for actual inference precision, not retrofitted to it
Disadvantages
- Requires full access to training pipeline and data — not a post-hoc transformation
- Slower training due to fake quantization ops in every forward pass
- Hyperparameter sensitivity increases with more aggressive quantization targets