Continuous Batching
Process new incoming requests mid-generation instead of waiting for a full batch to finish — turns GPU idle time between batch completions into throughput.
Intent & Description
🎯 Intent
Static batching leaves GPU capacity idle while the batch waits for its longest sequence to finish. Continuous batching admits new requests as slots free up, which keeps the GPU busy for as long as requests are queued.
📋 Context
Traditional batching groups requests, runs the batch until all sequences complete, then accepts new requests. Sequences finish at different times — early-finishing sequences leave GPU slots empty until the slowest sequence completes. This idle time is pure waste at production serving scale.
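To put a number on that idle time, here is a minimal sketch that computes the share of slot-steps doing useful work in a single static batch. It assumes each sequence occupies one slot and emits one token per decode step; the function name and the example output lengths are illustrative, not measurements.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps that produce a token in one static batch.

    The batch runs for as many decode steps as its longest sequence,
    so every shorter sequence leaves its slot empty for the remainder.
    """
    if not output_lengths:
        raise ValueError("batch must contain at least one sequence")
    steps = max(output_lengths)                 # batch length is set by the slowest sequence
    useful = sum(output_lengths)                # slot-steps that actually emit a token
    total = steps * len(output_lengths)         # slot-steps the batch occupies
    return useful / total


# Hypothetical batch of four requests with very different output lengths.
lengths = [10, 50, 140, 400]
print(f"{static_batch_utilization(lengths):.1%}")  # prints 37.5%
```

In this made-up batch, more than 60% of the slot-steps are spent waiting on the 400-token sequence.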
💡 Solution
Instead of waiting for a full batch to finish, check after every decode step which sequences in the current batch have produced an EOS token or hit max length. As they complete, immediately insert waiting requests into the freed slots. The batch composition can therefore change at every decode step, hence “continuous” or “iteration-level” batching. vLLM’s PagedAttention makes this memory-efficient by storing the KV cache in fixed-size blocks (pages) rather than one contiguous allocation per sequence, so slots can be allocated and released dynamically.
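The scheduling loop described above can be sketched as a small simulation. This is a toy model under stated assumptions, not how vLLM or TGI implement their schedulers: each request is reduced to a known number of decode steps (a stand-in for "until EOS or max length"), prefill cost is ignored, and `Request`, `continuous_batching`, and `static_batching` are hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    tokens_left: int  # stand-in for "decode steps until EOS or max length"


def continuous_batching(lengths, num_slots):
    """Iteration-level scheduling: refill free slots before every decode step."""
    waiting = deque(Request(i, n) for i, n in enumerate(lengths))
    running, finished_at, step = [], {}, 0
    while waiting or running:
        # Admit waiting requests into any slots freed by the previous step.
        while waiting and len(running) < num_slots:
            running.append(waiting.popleft())
        step += 1
        # One decode step: every running sequence emits one token.
        for req in running:
            req.tokens_left -= 1
        # Retire finished sequences so their slots are free for the next step.
        still_running = []
        for req in running:
            if req.tokens_left <= 0:
                finished_at[req.req_id] = step
            else:
                still_running.append(req)
        running = still_running
    return step, finished_at


def static_batching(lengths, num_slots):
    """Baseline: each batch runs until its longest sequence completes."""
    step, finished_at = 0, {}
    for start in range(0, len(lengths), num_slots):
        batch = lengths[start:start + num_slots]
        for offset, n in enumerate(batch):
            finished_at[start + offset] = step + n
        step += max(batch)
    return step, finished_at


lengths = [2, 8, 3, 1, 4]  # hypothetical output lengths, in decode steps
print(continuous_batching(lengths, num_slots=2)[0])  # prints 10
print(static_batching(lengths, num_slots=2)[0])      # prints 15
```

With two slots, the same five requests finish in 10 decode steps instead of 15, and the 1-token request completes at step 6 rather than step 9 because it no longer waits behind the 8-token one.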
Real-world Use Case
📌 TL;DR
Don’t wait for the whole batch to finish: insert new requests as slots free up. This closes the idle gaps of static batching and keeps the GPU busy under load. Standard in vLLM, TGI, and most modern LLM serving frameworks.
Advantages
- Near-continuous GPU utilization — no idle time waiting for slow sequences to finish
- Higher request throughput than static batching on the same hardware, with the largest gains when output lengths vary widely
- Reduced tail latency for short requests that would otherwise wait behind long ones
Disadvantages
- More complex scheduling logic — requires iteration-level batch management
- Memory management complexity increases with dynamic batch composition (PagedAttention addresses this)
- Requires careful integration with KV cache management to avoid memory fragmentation
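The memory-management points above come down to how KV cache space is handed out as sequences grow and finish at different times. The sketch below is a toy block allocator in the spirit of paged KV caching, assuming fixed-size blocks and a per-sequence block table. It is not vLLM's implementation; `BlockAllocator` and its methods are hypothetical names.

```python
class BlockAllocator:
    """Toy paged KV cache: sequences grow one fixed-size block at a time."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of block ids (need not be contiguous)
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; return False if the cache is full."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # The sequence has no block yet, or its last block is full.
            if not self.free_blocks:
                return False
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return True

    def free(self, seq_id):
        """Return every block of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = BlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")       # 3 tokens -> 2 blocks
for _ in range(4):
    cache.append_token("b")       # 4 tokens -> 2 blocks, cache now full
print(cache.append_token("b"))    # prints False
cache.free("a")                   # sequence "a" finishes
print(cache.append_token("b"))    # prints True
```

Because any free block can serve any sequence, a finished sequence's memory is reusable immediately, and the only waste is the unfilled tail of each sequence's last block.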