Speculative Decoding
Let a tiny draft model write several candidate tokens ahead, then verify them all in parallel with the big model — same output distribution, 2–4x latency reduction.
Intent & Description
🎯 Intent
At small batch sizes, autoregressive LLM decoding is memory-bandwidth-bound, not compute-bound: streaming the weights from memory is the bottleneck, not the matmuls. Speculative decoding exploits the idle compute by verifying several drafted tokens in a single forward pass of the target model.
📋 Context
Autoregressive generation produces one token per forward pass. Each pass loads all model weights from HBM, so memory traffic, not arithmetic, sets the per-token latency. A small draft model can generate several candidate tokens cheaply; the large model can verify them all in one parallel pass.
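The bandwidth argument above can be checked with back-of-the-envelope arithmetic. The sketch below estimates a lower bound on per-token decode latency from weight traffic alone. The parameter counts, the 2-byte (fp16/bf16) weight format, and the 3.35 TB/s bandwidth figure are illustrative assumptions, not measurements; the function name is hypothetical, and the estimate ignores KV-cache reads, multi-GPU sharding, and kernel overheads.

```python
def decode_step_floor_ms(n_params: float, bytes_per_param: float,
                         hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: time to stream every weight once."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / hbm_bytes_per_s * 1e3  # seconds -> milliseconds


# Assumed numbers for illustration only.
HBM_BANDWIDTH = 3.35e12   # bytes per second (assumption)
BYTES_PER_PARAM = 2       # fp16 / bf16 weights (assumption)

target_ms = decode_step_floor_ms(70e9, BYTES_PER_PARAM, HBM_BANDWIDTH)
draft_ms = decode_step_floor_ms(7e9, BYTES_PER_PARAM, HBM_BANDWIDTH)
print(f"70B target floor: {target_ms:.1f} ms per token")
print(f"7B draft floor:   {draft_ms:.1f} ms per token")
```

Under these assumptions the floor scales linearly with parameter count, which is why a draft model ten times smaller is roughly ten times cheaper per token, and why verifying several tokens in one target pass costs about the same weight traffic as generating one.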
💡 Solution
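A lightweight draft model (a smaller model from the same family, such as 7B instead of 70B, or a dedicated speculator head like EAGLE) generates K candidate tokens. The large target model runs one forward pass to score all K candidates in parallel and applies the acceptance criterion from the original speculative sampling paper: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q is the draft distribution. Accepted tokens advance the sequence. At the first rejection, the remaining drafts are discarded and a replacement token is sampled from the residual distribution, which is proportional to max(0, p(x) - q(x)), rather than from the target distribution directly. If all K drafts are accepted, the same pass yields one extra token sampled from the target. Repeat. The output distribution is mathematically identical to pure autoregressive generation from the target model.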
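The verification step can be sketched in a few lines. This is a minimal illustration of the standard speculative sampling rule, not any particular library's implementation: a drafted token x is accepted with probability min(1, p(x)/q(x)), and on the first rejection a replacement is drawn from the residual distribution proportional to max(0, p - q). The function name `verify_draft` and the list-of-probabilities representation are assumptions made for the example; a real serving stack works on batched logits on the accelerator.

```python
import random


def verify_draft(draft_tokens, q_probs, p_probs, rng=random):
    """One verification step of speculative sampling (illustrative sketch).

    draft_tokens: K token ids proposed by the draft model.
    q_probs: K draft distributions (lists over the vocabulary), one per draft.
    p_probs: K + 1 target distributions from the single verification pass.
    Returns the tokens to append: between 1 and K + 1 of them.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: sample a replacement from the residual max(0, p - q)
        # (random.choices normalizes the weights), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # All K drafts accepted: take one bonus token from the target's
    # distribution at position K + 1.
    bonus = p_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out


# Usage: a 2-token vocabulary where draft and target agree exactly,
# so both drafted tokens are accepted and a bonus token is added.
uniform = [0.5, 0.5]
tokens = verify_draft([0, 1], [uniform, uniform], [uniform] * 3,
                      rng=random.Random(0))
```

Sampling the replacement from the residual, instead of from p itself, is what makes the combined accept-or-resample procedure reproduce the target distribution exactly.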
Real-world Use Case
📌 TL;DR
Small model drafts, big model verifies in parallel. Same output distribution, typically 2–4x faster. Standard production inference technique: if you’re not using it, you’re leaving latency on the table.
Advantages
- 2–4x lower latency in typical deployments (EAGLE-3 reports up to 6.5x in its best case) with zero quality loss
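- The output distribution is provably identical to sampling from the target model alone, not an approximation; under greedy decoding the tokens themselves match, up to floating-point effects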
- Standard in production LLM serving since 2024 (Google, vLLM, TGI all support it)
Disadvantages
- Requires a compatible draft model — either a smaller model from the same family or a trained speculator head
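- Draft acceptance rate varies by task: low acceptance on highly creative or stochastic tasks reduces gains, and if acceptance is low enough the drafting overhead can make generation slower than plain decoding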
- Adds complexity to the serving stack: draft model management, verification batching
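The dependence on acceptance rate noted above can be made concrete with a simple model. The sketch below assumes each drafted token is accepted independently with the same probability `alpha`, that a verification pass over K drafts costs about the same as one ordinary target decode step (plausible in the memory-bound regime), and that one draft step costs `draft_cost_ratio` times a target step. Under those assumptions the expected number of tokens produced per verification pass is (1 - alpha^(K+1)) / (1 - alpha). The function names and example values are hypothetical, and real acceptance rates are neither constant nor independent across positions.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with K drafted tokens."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding under the simplifying assumptions above."""
    # Cost of one round, in units of a target decode step:
    # K draft steps plus one target verification pass.
    cost_per_round = k * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, k) / cost_per_round


# Illustrative values only: a well-matched draft versus a poorly matched one.
for alpha in (0.8, 0.3):
    s = estimated_speedup(alpha, k=4, draft_cost_ratio=0.05)
    print(f"alpha={alpha}: estimated speedup {s:.2f}x")
```

In this model a speedup below 1.0 means speculation is a net loss, which is what happens when the acceptance rate is low and the draft model is not cheap enough relative to the target.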