Latency vs. Throughput
Optimizing for one often degrades the other. High throughput often requires batching, which increases latency. Low latency requires immediate processing, which can underutilize hardware.
Intent & Description
🎯 Intent
Balance system responsiveness (latency) against processing capacity (throughput). Optimizing for one metric often degrades the other, because hardware works most efficiently on grouped work while grouping makes individual requests wait.
📋 Context
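Latency is the time to process a single request. Throughput is the number of requests processed per unit time. High throughput often requires batching requests, which increases the latency of each individual request. Low latency requires processing each request immediately, which can leave hardware underutilized. Little’s Law relates the two in steady state: Throughput = Concurrency / Latency, where concurrency is the average number of requests in flight and latency is the average time per request.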
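As a worked instance of Little’s Law, the sketch below computes the throughput ceiling for a fixed number of in-flight requests. The helper name `maxThroughput` and the numbers are illustrative assumptions, not measurements from a real system.

```javascript
// Little's Law rearranged: throughput = concurrency / latency.
// maxThroughput is a hypothetical helper name; latency is given in milliseconds.
function maxThroughput(concurrency, latencyMs) {
  if (latencyMs <= 0) throw new RangeError('latencyMs must be positive');
  return (concurrency * 1000) / latencyMs; // requests per second
}

// 8 requests in flight at 50 ms each sustain at most 160 requests/second.
console.log(maxThroughput(8, 50)); // 160
// At 200 ms per request, the same 8 in flight sustain only 40 requests/second.
console.log(maxThroughput(8, 200)); // 40
```

Read the other way, holding 160 requests/second at 200 ms per request would need 32 requests in flight, which is why higher latency has to be paid for with more concurrency.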
💡 Solution
Set separate SLOs for P50, P95, and P99 latency, because averages hide the tail latency that users experience. For ML serving, use dynamic batching to raise GPU throughput while bounding the latency added to each request. Profile under realistic concurrency. Apply batching, async processing, caching, CDNs, replication, and connection pooling where each one fits the workload.
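To illustrate why an average can hide the tail, the sketch below computes P50, P95, and P99 with the nearest-rank method and checks them against per-percentile targets. The SLO values and the latency samples are made-up assumptions chosen for illustration.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical SLO targets in milliseconds; real targets depend on the service.
const slo = { p50: 20, p95: 100, p99: 250 };

// Illustrative data: 95 fast requests (10 ms) and 5 slow ones (400 ms).
const latenciesMs = [...Array(95).fill(10), ...Array(5).fill(400)];

const mean = latenciesMs.reduce((sum, x) => sum + x, 0) / latenciesMs.length;
const observed = {
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
};
// Names of the percentiles whose observed value exceeds its target.
const violations = Object.keys(slo).filter((k) => observed[k] > slo[k]);

console.log(mean, observed, violations);
```

With these samples the mean is 29.5 ms and both P50 and P95 are 10 ms, yet P99 is 400 ms and is the only target violated. A single average-based SLO would miss this.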
Real-world Use Case
📌 TL;DR
Latency vs. throughput is a fundamental trade-off. Batching increases throughput but adds latency. Immediate processing reduces latency but lowers throughput. Use dynamic batching for ML serving. Set separate P50/P95/P99 SLOs. Apply Little’s Law: Throughput = Concurrency / Latency.
Advantages
- Clear understanding of fundamental performance constraints
- Different optimization strategies for different goals
- Little’s Law provides a mathematical framework
- Separate SLOs for different latency percentiles
Disadvantages
- Trade-off is fundamental: both can’t be fully optimized at once
- Batching adds complexity and requires tuning
- Latency optimization often reduces throughput
- Throughput optimization often increases latency
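As a rough aid for the batching tuning mentioned above, the following sketch estimates the worst-case wait that batching adds for the first request in a batch. It assumes evenly spaced arrivals, which real traffic rarely is, and `maxBatchWaitMs` is a hypothetical helper name. The batch size of 32 and the 10 ms timeout mirror the defaults used in the code later in this section.

```javascript
// Worst-case wait added by batching for the first request in a batch,
// assuming evenly spaced arrivals (a simplification; real traffic is bursty).
function maxBatchWaitMs(batchSize, timeoutMs, arrivalsPerSecond) {
  if (arrivalsPerSecond <= 0) return timeoutMs; // no traffic: only the timer flushes
  // Time for the remaining batchSize - 1 requests to arrive after the first one.
  const fillMs = ((batchSize - 1) * 1000) / arrivalsPerSecond;
  return Math.min(timeoutMs, fillMs);
}

console.log(maxBatchWaitMs(32, 10, 100));   // 10: light load, the timer flushes first
console.log(maxBatchWaitMs(32, 10, 10000)); // 3.1: heavy load, the batch fills first
```

Under light load the timeout sets the added latency and batches go out mostly empty; under heavy load the batch fills before the timer fires, so a larger batch size costs little extra latency.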
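// Latency vs. Throughput Optimization Strategies
class PerformanceOptimizer {
  // The collaborators below are referenced but never defined in this sketch,
  // so they are assumed to be injected by the caller.
  constructor({ processBatch, processSingle, getCurrentLatency, cache, pool, latencySLO = 100 } = {}) {
    this.processBatch = processBatch;           // async (requests[]) => results[] in the same order
    this.processSingle = processSingle;         // async (request) => result
    this.getCurrentLatency = getCurrentLatency; // () => recently observed latency in ms
    this.cache = cache;                         // object with async get(key)
    this.pool = pool;                           // object with async getConnection()
    this.latencySLO = latencySLO;               // ms; the default is a placeholder
    this.batchSize = 32;     // maximum requests per batch
    this.batchTimeout = 10;  // ms to wait before flushing a partial batch
    this.currentBatch = [];
    this.batchTimer = null;
  }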
  // Batching: high throughput, higher latency
  processWithBatching(request) {
    return new Promise((resolve, reject) => {
      this.currentBatch.push({ request, resolve, reject });
      if (this.currentBatch.length >= this.batchSize) {
        this.flushBatch(); // batch is full: flush now
      } else if (!this.batchTimer) {
        // First request of a new batch: bound its wait to batchTimeout ms
        this.batchTimer = setTimeout(() => this.flushBatch(), this.batchTimeout);
      }
    });
  }

  async flushBatch() {
    if (this.batchTimer) {
      clearTimeout(this.batchTimer);
      this.batchTimer = null;
    }
    if (this.currentBatch.length === 0) return; // nothing queued
    // Swap the batch out before awaiting so new requests start a fresh one
    const batch = this.currentBatch;
    this.currentBatch = [];
    try {
      // Process the entire batch in one call (high throughput)
      const results = await this.processBatch(batch.map(b => b.request));
      batch.forEach((item, index) => item.resolve(results[index]));
    } catch (error) {
      // A failed batch rejects every request in it
      batch.forEach(item => item.reject(error));
    }
  }

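  // Immediate processing: low latency, lower throughput
  async processImmediately(request) {
    return this.processSingle(request);
  }

  // Dynamic batching: balance both for ML serving
  async dynamicBatchMLInference(request) {
    // For GPU inference, trade batch size against latency
    const optimalBatchSize = this.calculateOptimalBatchSize();
    return this.processWithDynamicBatching(request, optimalBatchSize);
  }

  // Assumed helper (left undefined in the original sketch): reuse the batching
  // path, but flush early once the adjusted batch size is reached
  processWithDynamicBatching(request, maxBatchSize) {
    const pending = this.processWithBatching(request);
    if (this.currentBatch.length >= maxBatchSize) this.flushBatch();
    return pending;
  }

  calculateOptimalBatchSize() {
    // Based on current load and the latency SLO (both in ms)
    const currentLatency = this.getCurrentLatency();
    const targetLatency = this.latencySLO;
    if (currentLatency > targetLatency * 0.8) {
      // Within 20% of the SLO: halve the batch to protect latency
      return Math.max(1, Math.floor(this.batchSize / 2));
    }
    return this.batchSize; // headroom available: maximize batch for throughput
  }
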
  // Caching: a hit improves both latency and throughput
  async getCachedResult(key) {
    const cached = await this.cache.get(key);
    // Compare against null/undefined so falsy values (0, '', false) still count as hits
    if (cached !== undefined && cached !== null) {
      return cached; // fast path (low latency)
    }
    return null; // cache miss
  }

  // Connection pooling: avoids per-request connection setup
  async queryWithPooling(sql, params) {
    const connection = await this.pool.getConnection();
    try {
      return await connection.query(sql, params);
    } finally {
      connection.release(); // always return the connection to the pool
    }
  }
}
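
// Little's Law in practice
// currentLatency and targetThroughput must share a time unit,
// e.g. latency in seconds with throughput in requests per second.
function applyLittlesLaw(currentLatency, targetThroughput) {
  // Throughput = Concurrency / Latency
  // Concurrency = Throughput × Latency
  const requiredConcurrency = targetThroughput * currentLatency;
  return Math.ceil(requiredConcurrency); // round up to whole in-flight requests
}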
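
The `getCachedResult` method above stops at the miss. A fuller cache-aside read might look like the following sketch, which uses a plain `Map` as the cache. The names `getOrLoad` and `slowLookup` are hypothetical stand-ins, not part of any library.

```javascript
// Cache-aside read: return a hit immediately, otherwise load and remember the value.
// `cache` is any object with has/get/set (a Map here); `loadFromSource` is a
// hypothetical stand-in for the slow backend call.
async function getOrLoad(cache, key, loadFromSource) {
  if (cache.has(key)) {
    return cache.get(key); // hit: no backend work
  }
  const value = await loadFromSource(key); // miss: pay the full latency once
  cache.set(key, value);
  return value;
}

// Usage: count how often the backend is actually called.
let backendCalls = 0;
const cache = new Map();
const slowLookup = async (key) => {
  backendCalls += 1;
  return `value-for-${key}`;
};

const demo = (async () => {
  const first = await getOrLoad(cache, 'user:1', slowLookup);
  const second = await getOrLoad(cache, 'user:1', slowLookup);
  console.log(first === second, backendCalls);
  return { first, second, backendCalls };
})();
```

The second read is served from the cache, so the backend is called only once. Note that this sketch does not deduplicate concurrent misses for the same key, and it never evicts or expires entries.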