Grouped Query Attention (GQA)
Share K/V heads across groups of Q heads: cuts KV cache memory and bandwidth by the ratio of query heads to K/V heads (up to 16x) with minimal quality loss. Now the default attention recipe in Llama 3, Mistral, and Qwen.
Intent & Description
🎯 Intent
Standard multi-head attention (MHA) gives every query head its own K head and V head, so the KV cache scales with the full number of attention heads. GQA shares each K/V head across a group of Q heads, shrinking the KV cache by the number of Q heads per group.
📋 Context
With MHA, a 70B model with 64 attention heads caches 64 K vectors and 64 V vectors per layer per token. At long contexts or large batches, this KV cache alone can exceed available VRAM. Multi-query attention (MQA) collapses to a single K/V head: maximum efficiency, but noticeable quality degradation. GQA is the intermediate design that gets near-MQA cache savings with near-MHA quality.
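The cache arithmetic behind that claim can be sketched in a few lines. This is a back-of-the-envelope estimate, not a measurement: the 80 layers, head dimension of 128, fp16 storage, and 8,192-token context are illustrative assumptions, and `kv_cache_bytes` is a hypothetical helper name. Only the 64 heads come from the text above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Assumed config: 80 layers, head_dim 128, fp16 (2 bytes), 8,192 tokens, batch 1.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192)  # 20 GiB
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192)   # 2.5 GiB
mqa = kv_cache_bytes(n_layers=80, n_kv_heads=1, head_dim=128, seq_len=8192)   # 0.3125 GiB

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.4f} GiB")
```

Under these assumptions the MHA cache costs 2.5 MiB per token, so batch size and context length multiply into tens of gigabytes quickly. Only the number of K/V heads changes between the three variants.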
💡 Solution
Group the H query heads into G groups (G < H, with G dividing H). All Q heads in a group share one K head and one V head. The KV cache scales with G, not H: a 64-head model with G=8 groups needs one eighth of the KV cache memory of MHA. During attention, each Q head attends to its group’s shared K/V. Adopted by Llama 2 (70B), Llama 3, Mistral, Qwen, Gemma, and most major open-weight models since 2023. Often combined with RoPE and Flash Attention.
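The grouping described above can be sketched in NumPy. This is a minimal single-sequence illustration with a causal mask, not how any particular model or kernel implements it. The function name, the `(heads, tokens, head_dim)` layout, and the convention that query head `h` belongs to group `h // (H // G)` are all assumptions made for the example.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal GQA. q: (H, T, d); k, v: (G, T, d) with H divisible by G. Returns (H, T, d)."""
    n_q_heads, seq_len, head_dim = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "H must be divisible by G"
    group_size = n_q_heads // n_kv_heads

    # Fold query heads into (group, head-within-group) so each group lines up with one K/V head.
    q = q.reshape(n_kv_heads, group_size, seq_len, head_dim)

    # Scaled dot-product scores; K is indexed by group only, so it is shared within a group.
    scores = np.einsum("gqtd,gsd->gqts", q, k) / np.sqrt(head_dim)

    # Causal mask: token t may not attend to positions s > t.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Weighted sum of the group's shared V, then unfold back to H heads.
    out = np.einsum("gqts,gsd->gqtd", weights, v)
    return out.reshape(n_q_heads, seq_len, head_dim)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 4))   # H=8 query heads
k = rng.standard_normal((2, 5, 4))   # G=2 shared K heads
v = rng.standard_normal((2, 5, 4))   # G=2 shared V heads
out = grouped_query_attention(q, k, v)
```

Setting G = H recovers MHA and G = 1 recovers MQA, so the same function covers all three variants. K and V are never copied per query head, which is where the cache and bandwidth savings come from.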
Real-world Use Case
📌 TL;DR
Share K/V heads across query groups: slashes KV cache memory while preserving quality. The default attention mechanism in most modern open-weight decoder architectures. If you’re still using MHA at scale, upgrade.
Advantages
- Reduces KV cache memory by a factor of H/G vs. full MHA (up to 16x), enabling longer context and larger batch sizes
- Near-MHA output quality — minimal degradation vs. full MHA, far better than MQA
- Composable with Flash Attention, RoPE, sliding window attention, and MoE without architectural changes
Disadvantages
- Slightly lower expressiveness than full MHA — each Q head has less individualized K/V context
- The number of groups G is a new hyperparameter to tune: too few groups and quality approaches MQA’s tradeoff
- Converting existing MHA checkpoints to GQA requires mean-pooling of K/V heads and fine-tuning
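The mean-pooling step in the last point can be sketched as follows. This is a rough illustration under stated assumptions: the projection weight is applied as `x @ W` with shape `(d_model, n_heads * head_dim)`, heads are laid out contiguously along the output dimension, and adjacent heads form a group. Real checkpoints may store weights transposed or interleaved, and `mean_pool_kv_heads` is a hypothetical helper name. The pooled weights are only a starting point, since fine-tuning is still needed afterwards.

```python
import numpy as np

def mean_pool_kv_heads(w, n_heads, n_kv_heads):
    """Average groups of MHA K (or V) projection heads into n_kv_heads shared heads.

    w: (d_model, n_heads * head_dim), assumed layout with heads contiguous on the last axis.
    Returns: (d_model, n_kv_heads * head_dim).
    """
    d_model, out_dim = w.shape
    assert out_dim % n_heads == 0 and n_heads % n_kv_heads == 0
    head_dim = out_dim // n_heads
    group_size = n_heads // n_kv_heads

    # Split the output axis into (group, head-within-group, head_dim), then average within each group.
    grouped = w.reshape(d_model, n_kv_heads, group_size, head_dim)
    return grouped.mean(axis=2).reshape(d_model, n_kv_heads * head_dim)

# Toy example: 4 heads of dim 2 pooled into 2 shared heads.
w_k = np.arange(24, dtype=np.float64).reshape(3, 8)
w_k_gqa = mean_pool_kv_heads(w_k, n_heads=4, n_kv_heads=2)
```

The same function would be applied separately to the K and V projections of every layer. The Q and output projections stay unchanged.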