Grouped Query Attention (GQA) | designpattern.fyi

Advantages

Reduces KV cache memory by up to 16x vs. full MHA (the saving equals the ratio of query heads to K/V heads), enabling longer contexts and larger batch sizes
Near-MHA output quality — minimal degradation vs. full MHA, far better than MQA
Composable with Flash Attention, RoPE, sliding window attention, and MoE without architectural changes
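
The memory saving can be checked with simple arithmetic. The sketch below is a rough estimate only: the helper `kv_cache_bytes` and the model shape (32 layers, 32 query heads, head dimension 128, 16-bit cache entries) are assumptions chosen for illustration, not figures from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor of 2 covers the separate K and V tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, used only to make the numbers concrete.
config = dict(num_layers=32, head_dim=128, seq_len=8192, batch_size=1)
num_query_heads = 32

mha = kv_cache_bytes(num_kv_heads=num_query_heads, **config)  # one K/V head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)                # 8 shared K/V heads
mqa = kv_cache_bytes(num_kv_heads=1, **config)                # a single shared K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB ({mha // size}x smaller than MHA)")
```

With these assumed values the cache shrinks by 32 / 8 = 4x. The ratio depends only on query heads divided by K/V heads, so a model with 16 times more query heads than K/V heads would see the 16x figure.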

Disadvantages

Slightly lower expressiveness than full MHA — each Q head has less individualized K/V context
The number of groups G is a new hyperparameter to tune: too few groups approaches MQA’s quality tradeoff (G = 1 is MQA), while G equal to the number of query heads recovers full MHA
Converting existing MHA checkpoints to GQA requires mean-pooling of K/V heads and fine-tuning
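
The mean-pooling step of a checkpoint conversion might look like the following dependency-free sketch. The function names, the nested-list matrix representation, and the choice to group consecutive heads are assumptions made for illustration. The fine-tuning that must follow is not shown.

```python
def mean_pool_kv_heads(heads, num_groups):
    """Average consecutive K (or V) head projection matrices into shared heads.

    heads: list of H matrices (nested lists), all with the same shape.
    Returns a list of num_groups matrices.
    """
    num_heads = len(heads)
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    group_size = num_heads // num_groups
    pooled = []
    for g in range(num_groups):
        # Assumption: heads g*group_size .. (g+1)*group_size - 1 form one group.
        group = heads[g * group_size:(g + 1) * group_size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(h[r][c] for h in group) / group_size for c in range(cols)]
            for r in range(rows)
        ])
    return pooled


def kv_head_for_query_head(query_head, num_heads, num_groups):
    """Index of the shared K/V head that a given query head uses."""
    return query_head // (num_heads // num_groups)


# Toy example: 4 heads whose 2x2 projection matrices hold the head index.
k_heads = [[[float(i)] * 2 for _ in range(2)] for i in range(4)]
k_shared = mean_pool_kv_heads(k_heads, num_groups=2)
print(k_shared[0])  # average of heads 0 and 1
print(k_shared[1])  # average of heads 2 and 3
```

Setting `num_groups=1` pools everything into a single head (the MQA case), and `num_groups` equal to the number of heads leaves the weights unchanged (the MHA case).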