Flash Attention | designpattern.fyi

Advantages

2–4x memory reduction vs. standard attention, because the full N×N score matrix is never written to GPU memory; this enables longer context at the same GPU budget
Significant wall-clock speedup (2–4x on common sequence lengths) from fewer reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM
Exact attention output — no approximation, no quality degradation vs. standard attention
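
The exactness claim above can be checked with a small sketch. The code below is a plain-Python illustration, not the actual kernel: it assumes a single head, tiles only over keys and values (real implementations also tile queries and keep each tile in on-chip SRAM), and uses hypothetical function names `standard_attention` and `tiled_attention`. It shows how a running maximum and running sum let the softmax be computed block by block without ever holding a full row of scores.

```python
import math


def standard_attention(q, k, v):
    """Reference version: builds every full row of the n x n score matrix."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block=4):
    """Tiled version: sees only `block` keys/values at a time (online softmax)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        run_max = -math.inf   # running maximum of the scores seen so far
        run_sum = 0.0         # running softmax denominator
        acc = [0.0] * dv      # running unnormalized output row
        for start in range(0, len(k), block):
            k_blk = k[start:start + block]
            v_blk = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(run_max, max(scores))
            # Rescale earlier accumulators to the new maximum before adding.
            correction = math.exp(run_max - new_max)
            probs = [math.exp(s - new_max) for s in scores]
            run_sum = run_sum * correction + sum(probs)
            acc = [a * correction + sum(p * vj[c] for p, vj in zip(probs, v_blk))
                   for c, a in enumerate(acc)]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out
```

Comparing the two functions on the same inputs should give results that agree up to floating-point rounding, which is the sense in which tiling is exact rather than approximate.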

Disadvantages

Kernel implementations are hardware-specific — AMD ROCm and older GPUs need separate ports
Custom CUDA kernel complicates debugging and gradient inspection
Chunked/tiled computation never materializes the full attention matrix, which makes layer-wise attention patterns harder to visualize
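
One possible workaround for the visualization issue, sketched here under the assumption that the query and key vectors for the layer and head of interest can be captured, is to recompute the attention weights explicitly outside the fused kernel. The helper name `attention_weights` is hypothetical, and this costs the full N×N memory that the fused kernel avoids, so it is only practical for short sequences or individual heads.

```python
import math


def attention_weights(q, k):
    """Explicitly materialize softmax(q k^T / sqrt(d)) for a single head."""
    scale = 1.0 / math.sqrt(len(q[0]))
    rows = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows
```

Each returned row is a probability distribution over keys and can be passed to any heatmap plotting tool.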