Weight-Only Quantization | designpattern.fyi

Weight-Only Quantization

Store weights at INT4 and keep activations in FP16: most of the memory savings of quantizing everything, at a fraction of the accuracy cost.

Intent & Description

🎯 Intent

Model weights dominate memory, while activations are the more precision-sensitive part of the computation. Quantizing only the weights captures most of the memory reduction while keeping arithmetic in FP16.

📋 Context

Quantizing both weights and activations to INT4 is aggressive and can degrade accuracy sharply. But most of a large model’s memory is weights, not activations, so dequantizing weights to FP16 just in time for each matmul keeps the arithmetic in FP16 while storage stays at INT4.

💡 Solution

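Store weight matrices in INT4 (or INT3), together with the scale and zero-point metadata needed to decode them. In each layer’s forward pass, dequantize the weights from INT4 to FP16, run the matmul in FP16, and discard the dequantized copy; optimized kernels typically fuse the dequantization into the matmul so that no full FP16 copy of the matrix is ever materialized. Activations stay in FP16 throughout. GPTQ and AWQ quantize weights this way, and the GGUF file format used by llama.cpp stores weights in similar low-bit block formats. For GPTQ and AWQ, calibration on sample data determines the quantization grid (scales and zero points) for each weight matrix, usually per channel or per group of weights.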
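The store, dequantize, multiply, discard cycle described above can be sketched in plain Python. This is an illustrative toy rather than how GPTQ, AWQ, or GGUF are implemented: the names `quantize_row`, `dequantize_row`, and `linear` are hypothetical, the group size of 8 is chosen only for readability, the grid is a simple min/max grid with no calibration, and Python floats stand in for FP16.

```python
import random

GROUP_SIZE = 8  # assumption for readability; must be even and divide the row length
QMAX = 15       # largest unsigned 4-bit code


def quantize_row(row):
    """Quantize one weight row to packed 4-bit codes plus a scale and minimum per group."""
    packed, scales, mins = bytearray(), [], []
    for start in range(0, len(row), GROUP_SIZE):
        group = row[start:start + GROUP_SIZE]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / QMAX or 1.0  # fall back to 1.0 for a constant group
        codes = [min(QMAX, round((w - lo) / scale)) for w in group]
        # Pack two 4-bit codes into each byte, low nibble first.
        for i in range(0, len(codes), 2):
            packed.append(codes[i] | (codes[i + 1] << 4))
        scales.append(scale)
        mins.append(lo)
    return bytes(packed), scales, mins


def dequantize_row(packed, scales, mins):
    """Rebuild approximate float weights from the packed 4-bit codes."""
    row = []
    for i, byte in enumerate(packed):
        group = (2 * i) // GROUP_SIZE  # both nibbles of a byte share one group
        for code in (byte & 0x0F, byte >> 4):
            row.append(code * scales[group] + mins[group])
    return row


def linear(x, quantized_rows):
    """Compute y = W @ x, dequantizing each weight row just before it is used."""
    out = []
    for packed, scales, mins in quantized_rows:
        w = dequantize_row(packed, scales, mins)  # temporary full-precision copy
        out.append(sum(wi * xi for wi, xi in zip(w, x)))
        # The temporary row is dropped here; only the packed codes persist.
    return out


random.seed(0)
weights = [[random.gauss(0.0, 0.02) for _ in range(32)] for _ in range(4)]
x = [random.gauss(0.0, 1.0) for _ in range(32)]

quantized = [quantize_row(row) for row in weights]
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
y_q = linear(x, quantized)

print("bytes per row:", len(quantized[0][0]), "for", len(weights[0]), "weights")
print("max output difference:", max(abs(a - b) for a, b in zip(y_ref, y_q)))
```

Each row of 32 weights is stored in 16 bytes plus four scale and minimum pairs, and the activations `x` are never quantized. Round-to-nearest keeps every reconstructed weight within half a quantization step of the original.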

Real-world Use Case

Memory-constrained inference (consumer GPU, laptop, on-device) where weights are the bottleneck. Running 70B+ models on hardware that can’t hold them at FP16. Prioritizing accuracy over maximum throughput on batch inference.
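A quick back-of-the-envelope calculation shows why this matters for a 70B-parameter model. The sketch below assumes exactly 70e9 parameters, a group size of 128, and a 16-bit scale plus a 4-bit zero point per group; these figures are illustrative assumptions, not properties of any particular format, and the helper `weight_memory_gib` is hypothetical. Only weights are counted, not activations or any cache.

```python
def weight_memory_gib(n_params, bits_per_weight, group_size=None,
                      scale_bits=16, zero_bits=0):
    """Approximate weight storage in GiB, including per-group metadata."""
    bits = n_params * bits_per_weight
    if group_size:
        # Every group of weights carries one scale and, optionally, one zero point.
        bits += (n_params / group_size) * (scale_bits + zero_bits)
    return bits / 8 / 2**30


N_PARAMS = 70e9  # assumed parameter count

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4, group_size=128,
                             scale_bits=16, zero_bits=4)
ratio = fp16_gib / int4_gib

print(f"FP16 weights: {fp16_gib:.1f} GiB")
print(f"INT4 weights: {int4_gib:.1f} GiB")
print(f"reduction:    {ratio:.2f}x")
```

Under these assumptions the weights shrink from roughly 130 GiB to roughly 34 GiB, a reduction of about 3.85x. The gap to a clean 4x comes entirely from the per-group metadata.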

📌 TL;DR

Weights at INT4, compute in FP16. Most of the memory win, fraction of the accuracy cost.

Advantages

Up to ~4x less weight memory than FP16 (slightly less in practice once per-group scales and zero points are stored), with minimal accuracy loss on most architectures
Arithmetic stays in FP16, which avoids INT4 matmul precision issues entirely
Widely supported: GPTQ, AWQ, and GGUF have mature, well-maintained ecosystems

Disadvantages

Dequantization overhead on every forward pass reduces raw throughput vs. native INT4 compute
Can be slower than unquantized FP16 on large-batch inference, where compute rather than memory is the bottleneck
Calibration is required to select an accurate quantization grid for each weight matrix
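To make the calibration step concrete, here is a deliberately simple stand-in: a grid search over clipping ranges that keeps whichever range gives the smallest output error on a handful of calibration inputs. This is not the GPTQ or AWQ algorithm, both of which are considerably more sophisticated. The function names, the candidate ratios, and the synthetic data are all assumptions made for illustration.

```python
import random

QMAX = 7  # symmetric signed 4-bit codes in [-7, 7]


def fake_quantize(weights, clip):
    """Round weights to a symmetric 4-bit grid covering [-clip, clip]."""
    scale = clip / QMAX
    return [max(-QMAX, min(QMAX, round(w / scale))) * scale for w in weights]


def output_error(weights, candidate, calib_inputs):
    """Mean squared difference in the layer output over the calibration inputs."""
    total = 0.0
    for x in calib_inputs:
        ref = sum(w * xi for w, xi in zip(weights, x))
        got = sum(w * xi for w, xi in zip(candidate, x))
        total += (ref - got) ** 2
    return total / len(calib_inputs)


def calibrate_clip(weights, calib_inputs,
                   ratios=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Pick the clipping range with the lowest output error on calibration data.

    Assumes at least one weight is nonzero.
    """
    max_abs = max(abs(w) for w in weights)
    best = min(
        ratios,
        key=lambda r: output_error(
            weights, fake_quantize(weights, r * max_abs), calib_inputs),
    )
    return best * max_abs


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]
weights[0] = 0.3  # a single outlier stretches the naive max-abs grid
calib_inputs = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(16)]

max_abs = max(abs(w) for w in weights)
clip = calibrate_clip(weights, calib_inputs)

naive_err = output_error(weights, fake_quantize(weights, max_abs), calib_inputs)
calib_err = output_error(weights, fake_quantize(weights, clip), calib_inputs)
print("chosen clip / max |w|:", clip / max_abs)
print("naive error:", naive_err, "calibrated error:", calib_err)
```

Because the unclipped range is one of the candidates, the calibrated grid is never worse than the naive one on the calibration set. Whether it also generalizes depends on how representative the calibration inputs are.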

© 2026 designpattern.fyi. Crafted with ❤️ for modern software engineers by OpenAGI