KV Cache
Cache attention keys and values from processed tokens so autoregressive generation doesn't recompute the full context on every new token. This is a required optimization, not an optional one.
Intent & Description
Intent
Naive autoregressive generation re-runs attention over all previous tokens at every step. KV caching eliminates that redundant recomputation by storing and reusing the key/value tensors from prior tokens.
Context
Without a KV cache, generating token N means re-running the model over all N-1 prior tokens: K and V are re-projected for every position and the full causal attention pass costs O(N²) per step. With a KV cache, each step computes Q, K, and V for the new token only and attends that single query against the cached K/V, so per-step attention cost drops to O(N), while per-step projection and feed-forward cost drops from O(N) to O(1) in sequence length. It’s a foundational optimization that every production LLM inference framework applies.
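To make the per-step cost concrete, the sketch below counts query-key dot products for a single head in a single layer. The function names are illustrative and the count ignores projections, softmax, and the value-weighted sum, so treat it as a rough cost model rather than a benchmark.

```python
def attention_scores_without_cache(n: int) -> int:
    """Query-key dot products needed for one step when the whole causal
    attention pass is recomputed over a context of n positions."""
    # Position i attends to positions 0..i, so row i costs i + 1 scores.
    return sum(i + 1 for i in range(n))


def attention_scores_with_cache(n: int) -> int:
    """Dot products needed when only the newest query is scored
    against n cached keys."""
    return n


for n in (8, 64, 512):
    print(n, attention_scores_without_cache(n), attention_scores_with_cache(n))
```

The first count grows quadratically with context length and the second linearly, which is the gap the cache closes at every decode step.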
Solution
During the prefill phase (processing the prompt), compute and cache the K and V tensors for every layer. During decode (token-by-token generation), compute Q, K, and V for the new token only, append its K/V to the cache, and attend the new Q against the cached K/V. KV cache memory scales with batch size × sequence length × num_layers × num_heads × head_dim × 2 (K and V) × bytes per element, where num_heads is the number of K/V heads when grouped-query or multi-query attention is used. Manage cache memory carefully: it’s often the primary memory bottleneck for long-context inference at scale. Tools: vLLM uses PagedAttention for KV cache memory management.
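The prefill and decode flow can be sketched with NumPy for a single attention head in a single layer. Everything here (`KVCache`, `prefill`, `decode_step`, the weight names, the random inputs) is a hypothetical illustration, not any framework's API. Real implementations keep one cache per layer and head, preallocate memory instead of concatenating, and feed each generated token back in as the next input.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(x, wq, wk, wv):
    """Reference path: project every token and apply a causal mask."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


class KVCache:
    """Grows by one row per processed token (illustrative, not preallocated)."""

    def __init__(self, head_dim):
        self.k = np.empty((0, head_dim))
        self.v = np.empty((0, head_dim))

    def append(self, k_new, v_new):
        self.k = np.concatenate([self.k, k_new], axis=0)
        self.v = np.concatenate([self.v, v_new], axis=0)


def prefill(x_prompt, wq, wk, wv, cache):
    """Process the whole prompt once and store its K/V."""
    cache.append(x_prompt @ wk, x_prompt @ wv)
    return causal_attention(x_prompt, wq, wk, wv)


def decode_step(x_new, wq, wk, wv, cache):
    """x_new has shape (1, d_model): project only the new token."""
    q = x_new @ wq
    cache.append(x_new @ wk, x_new @ wv)  # the new token also attends to itself
    scores = q @ cache.k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.v


rng = np.random.default_rng(0)
d_model, head_dim = 16, 8
wq, wk, wv = (rng.standard_normal((d_model, head_dim)) for _ in range(3))
x = rng.standard_normal((6, d_model))  # stand-in for 6 token embeddings

cache = KVCache(head_dim)
prefill(x[:4], wq, wk, wv, cache)
decoded = np.vstack([decode_step(x[i:i + 1], wq, wk, wv, cache) for i in (4, 5)])
reference = causal_attention(x, wq, wk, wv)
assert np.allclose(decoded, reference[4:])
```

The final assertion checks that the cached decode path reproduces the last two rows of the full recomputation, so the cache changes the cost of a step without changing its result.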
Real-world Use Case
TL;DR
Cache the K/V tensors from processed tokens so you don’t recompute the full context every token. Non-negotiable baseline for any production LLM inference: everything else builds on top of it.
Advantages
- Reduces per-token attention compute from O(N²) to O(N), and per-token projection and feed-forward compute from O(N) to O(1), which makes long-context generation practical
- Enables efficient multi-turn conversation without reprocessing the full context each turn
- Foundation for prefix caching optimizations (cache system prompts across requests)
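Prefix caching, mentioned in the last item above, can be sketched as a lookup from a token-id prefix to its stored K/V. `PrefixCache` and its methods are hypothetical names for illustration only. The exact-prefix dictionary below is a simplification, and production systems may match and evict at a coarser granularity.

```python
import numpy as np


class PrefixCache:
    """Hypothetical store mapping a token-id prefix to its precomputed K/V."""

    def __init__(self):
        self._store = {}

    def put(self, tokens, k, v):
        self._store[tuple(tokens)] = (k, v)

    def longest_prefix(self, tokens):
        """Return (matched_len, k, v) for the longest stored prefix of tokens."""
        for end in range(len(tokens), 0, -1):
            hit = self._store.get(tuple(tokens[:end]))
            if hit is not None:
                return end, hit[0], hit[1]
        return 0, None, None


head_dim = 8
system_prompt = [101, 7, 42]  # made-up token ids
prefix_cache = PrefixCache()
prefix_cache.put(system_prompt,
                 np.zeros((len(system_prompt), head_dim)),   # placeholder K
                 np.zeros((len(system_prompt), head_dim)))   # placeholder V

request = system_prompt + [9, 13]
matched, k, v = prefix_cache.longest_prefix(request)
remaining = request[matched:]  # only these tokens still need prefill
print(matched, remaining)
```

With a shared system prompt, each new request only pays prefill cost for the tokens after the matched prefix.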
Disadvantages
- KV cache memory grows linearly with sequence length and batch size, which creates major VRAM pressure at scale
- Long sequences or large batches can cause OOM if KV cache isn’t managed carefully
- Allocating, evicting, and reusing cache memory across requests requires careful memory management (PagedAttention addresses this)
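The memory pressure described above can be estimated with a few multiplications. The helper below is a minimal sketch, and the model shape in the example (32 layers, 32 K/V heads, head dimension 128, 2-byte elements) is a hypothetical configuration chosen for round numbers, not a measurement of any specific model.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Bytes needed to hold K and V for every layer, head, and position."""
    # The leading factor of 2 accounts for storing both K and V.
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_elem)


GIB = 1024 ** 3
one_seq = kv_cache_bytes(batch_size=1, seq_len=4096, num_layers=32,
                         num_kv_heads=32, head_dim=128)
batch_of_8 = kv_cache_bytes(batch_size=8, seq_len=4096, num_layers=32,
                            num_kv_heads=32, head_dim=128)
print(one_seq / GIB, batch_of_8 / GIB)  # 2.0 16.0
```

Under these assumptions each token costs 0.5 MiB of cache, so a single 4096-token sequence needs 2 GiB and a batch of eight needs 16 GiB, on top of the model weights.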