Post-Training Quantization (PTQ)
Drop weight precision from FP16 to INT8 or INT4 after training — zero retraining, 2–4x memory reduction, meaningful inference speedup.
Intent & Description
🎯 Intent
Cut inference memory and boost throughput by lowering numerical precision without touching the training pipeline.
📋 Context
A 7B-parameter model at FP16 needs ~14GB of VRAM for its weights alone. INT8 halves that; INT4 quarters it, so models that previously needed datacenter-class GPUs can run on a single consumer card. No retraining is required: it's a post-hoc transformation on any existing checkpoint.
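The arithmetic behind those figures can be sketched as follows. This is a rough estimate for weights only; it ignores activations, the KV cache, and the small overhead of stored scaling factors, and the function name `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Weights-only footprint of a 7B-parameter model at each precision.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

For 7B parameters this gives 14.0 GB at FP16, 7.0 GB at INT8, and 3.5 GB at INT4. Real deployments need headroom on top of these numbers.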
💡 Solution
After training, quantize weights (and optionally activations) from FP16/BF16 to INT8/INT4, using calibration data to find per-layer scaling factors that minimize quantization error. Key tools: bitsandbytes (INT8/INT4, no calibration step), GPTQ (INT4 weight quantization, calibration-based), llama.cpp (GGUF format). Where a method does calibrate, use representative data: random data degrades quality.
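To make "scaling factor" concrete, here is a minimal sketch of symmetric absmax quantization on a toy weight list, in pure Python. It is an illustration under simplifying assumptions (one scale per tensor, round-to-nearest), not how bitsandbytes, GPTQ, or llama.cpp implement it; all function names and the sample weights are hypothetical.

```python
from typing import List


def absmax_scale(values: List[float], bits: int = 8) -> float:
    """Symmetric scale that maps the largest magnitude to the top of the integer range."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float, bits: int = 8) -> List[int]:
    """Round each value to the nearest integer step and clamp to the signed range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate floating-point values."""
    return [x * scale for x in q]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    """Average absolute difference between original and reconstructed values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy "layer" weights (made up for illustration).
weights = [0.42, -1.27, 0.05, 0.89, -0.33, 0.0]

for bits in (8, 4):
    scale = absmax_scale(weights, bits)
    q = quantize(weights, scale, bits)
    err = mean_abs_error(weights, dequantize(q, scale))
    print(f"INT{bits}: scale={scale:.4f} q={q} mean_abs_error={err:.4f}")
```

With fewer bits the step size grows, so the reconstruction error at INT4 is larger than at INT8. For activations, the same kind of scale is estimated from values observed on calibration inputs, which is why unrepresentative data yields poorly fitted scales.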
Real-world Use Case
📌 TL;DR
Shrink the model after training with no retraining needed. INT8 halves memory with minimal quality loss; INT4 quarters it at a larger quality cost. Representative calibration data is the one thing that actually matters.
Advantages
- No retraining — applies to any existing checkpoint in minutes
- 2x memory reduction at INT8 and 4x at INT4 (relative to FP16), with minimal quality regression at INT8
- Inference speedup on hardware with native INT8 support (most modern GPUs and NPUs)
Disadvantages
- Accuracy degrades — small at INT8, larger at INT4, varies by model and task
- Some layers are more sensitive and may need to stay at higher precision (mixed-precision PTQ)
- Calibration data quality matters: poor calibration data leads to worse accuracy