Post-Training Quantization (PTQ) | designpattern.fyi

Post-Training Quantization (PTQ)

Lower weight precision from FP16 to INT8 or INT4 after training: no retraining, roughly 2–4x smaller weights, and faster inference where the hardware and kernels support low-precision math.

Intent & Description

🎯 Intent

Cut inference memory and boost throughput by lowering numerical precision without touching the training pipeline.

📋 Context

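A 7B-parameter model at FP16 needs ~14GB of VRAM for the weights alone (2 bytes per parameter). INT8 halves that to ~7GB; INT4 quarters it to ~3.5GB, so models that previously needed a data-center GPU or several cards can fit on a single consumer card. No retraining is required: it is a post-hoc transformation on any existing checkpoint.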
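The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate under stated assumptions: it counts weight storage only (no KV cache, activations, or per-group scale metadata) and takes 1 GB as 10^9 bytes. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: counts weights only; KV cache, activations, and
    quantization metadata (scales, zero points) are ignored.
    """
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```

Real quantized checkpoints come out slightly larger than this because they also store scaling factors alongside the integer weights.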

💡 Solution

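After training, quantize weights (and optionally activations) from FP16/BF16 to INT8/INT4. Calibration-based methods run a small dataset through the model to find scaling factors (per layer, per channel, or per group) that minimize quantization error. Key tools: bitsandbytes (INT8/INT4, quantizes at load time without calibration data), GPTQ (INT4 weight quantization using calibration data), llama.cpp (GGUF format). When a method uses calibration data, make it representative of the target workload; random data degrades quality.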
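To make the idea of a scaling factor concrete, here is a minimal sketch of symmetric round-to-nearest INT8 quantization with one scale per weight row. It is an illustration only, not the algorithm used by bitsandbytes, GPTQ, or llama.cpp; the function names and the toy weight values are hypothetical.

```python
def quantize_row_int8(row):
    """Symmetric absmax quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate floating-point weights from integers and a scale."""
    return [v * scale for v in q]


# Toy weight matrix (hypothetical values); each row gets its own scale.
weights = [
    [0.02, -0.50, 0.13, 0.31],
    [1.20, -0.70, 0.05, -2.40],
]

for row in weights:
    q, scale = quantize_row_int8(row)
    restored = dequantize_row(q, scale)
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print(f"scale={scale:.5f} max_error={max_err:.5f} ints={q}")
```

With round-to-nearest, the error on each weight is bounded by half a quantization step (`scale / 2`), which is why a finer granularity of scales (per row or per group rather than per layer) tends to reduce error: a single outlier then inflates the step size for fewer weights.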

Real-world Use Case

Deploying large models on memory-constrained hardware. Boosting inference throughput on a fixed GPU budget. Consumer or edge deployment of models too large for available VRAM at full precision.

📌 TL;DR

Shrink the model after training with no retraining needed. INT8 halves memory with minimal quality loss; INT4 quarters it with more loss. For calibration-based methods, representative calibration data is the factor that matters most.

Advantages

No retraining: applies to any existing checkpoint, often within minutes (longer for large models with calibration-based methods)
2–4x memory reduction at INT8/INT4 with minimal quality regression at INT8
Inference speedup on hardware with native INT8 support (most modern GPUs and NPUs)
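The speedup from native INT8 support comes from doing the multiply-accumulate work in integers and rescaling once at the end. The sketch below mimics that pattern in plain Python for a single dot product; it says nothing about real kernel performance, and the helper names and input vectors are assumptions made for illustration.

```python
def quantize_vec(vec):
    """Symmetric absmax quantization of a vector to the INT8 range."""
    max_abs = max(abs(v) for v in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in vec], scale


def int8_dot(qa, scale_a, qb, scale_b):
    """Accumulate integer products, then rescale once to floating point."""
    acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulator (INT32 on real hardware)
    return acc * scale_a * scale_b


a = [0.5, -1.0, 0.25, 0.75]   # hypothetical activations
b = [0.2, 0.4, -0.6, 0.1]     # hypothetical weights

qa, sa = quantize_vec(a)
qb, sb = quantize_vec(b)

exact = sum(x * y for x, y in zip(a, b))
approx = int8_dot(qa, sa, qb, sb)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Only the final multiplication by `scale_a * scale_b` is floating point; everything inside the sum is integer arithmetic, which is the part that INT8-capable hardware accelerates.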

Disadvantages

Accuracy degrades — small at INT8, larger at INT4, varies by model and task
Some layers are more sensitive and may need to stay at higher precision (mixed-precision PTQ)
Calibration data quality matters: poor calibration data leads to worse accuracy
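The calibration point can be illustrated with a small sketch of static activation quantization, where the scale is fixed from calibration samples and then reused at inference time. All values and function names here are hypothetical, and real toolkits use more refined range estimators than a plain maximum.

```python
def calibrate_scale(batches):
    """Pick an INT8 scale from the largest magnitude seen during calibration."""
    max_abs = max(abs(x) for batch in batches for x in batch)
    return max_abs / 127.0


def quant_dequant(x, scale):
    """Quantize to INT8 with clamping, then map back to floating point."""
    q = max(-127, min(127, round(x / scale)))
    return q * scale


def mean_abs_error(values, scale):
    return sum(abs(v - quant_dequant(v, scale)) for v in values) / len(values)


# Hypothetical activations seen in production.
production = [0.1, -0.8, 2.5, -3.9, 1.7, 0.05, -2.2, 3.3]

# Calibration set that covers the production range.
representative = [[0.2, -3.8, 2.9], [4.0, -1.1, 0.3]]
# Calibration set with a much narrower range: production values get clipped.
unrepresentative = [[0.01, -0.2, 0.15], [0.3, -0.05, 0.1]]

good_scale = calibrate_scale(representative)
bad_scale = calibrate_scale(unrepresentative)

print(f"representative:   error={mean_abs_error(production, good_scale):.4f}")
print(f"unrepresentative: error={mean_abs_error(production, bad_scale):.4f}")
```

The unrepresentative set produces a scale that only covers roughly ±0.3, so every larger production value is clamped and the error grows by orders of magnitude. A calibration set with a range far wider than production fails in the opposite direction: nothing clips, but the step size is coarser than it needs to be.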

© 2026 designpattern.fyi. Crafted with ❤️ for modern software engineers by OpenAGI