Throughput vs. Latency Optimization
Batch processing maximizes throughput but increases latency. Real-time serving minimizes latency but reduces throughput. Different SLAs call for different infrastructure.
Intent & Description
🎯 Intent
Understand the trade-off between throughput (requests served per second) and latency (time to serve a single request) in AI serving. Optimizing for one often degrades the other, so choose a configuration based on your service level objectives.
📋 Context
You are deploying an ML model for serving. Batch processing (large batch sizes) maximizes GPU utilization and throughput but increases per-request latency, because each request waits for the whole batch to be assembled and processed. Real-time serving (batch size = 1) minimizes latency but reduces GPU utilization. The optimal configuration depends on whether you care more about the total number of requests served per second or about the delay each individual request sees.
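The effect of batch size can be illustrated with a toy cost model. The sketch below assumes that one forward pass costs a fixed overhead plus a per-item cost; both constants and the function name `batchStats` are invented for illustration, not measurements of any real model, and queueing time is ignored.

```javascript
// Toy cost model: one forward pass = fixed overhead + per-item cost.
// Both constants are made-up illustrative values, not measurements.
const FIXED_OVERHEAD_MS = 20;
const PER_ITEM_MS = 2;

function batchStats(batchSize) {
  // Every request in the batch waits for the whole batch to finish.
  const latencyMs = FIXED_OVERHEAD_MS + PER_ITEM_MS * batchSize;
  // Requests completed per second if batches run back to back.
  const throughputRps = (batchSize * 1000) / latencyMs;
  return { batchSize, latencyMs, throughputRps };
}

for (const size of [1, 8, 64]) {
  const { latencyMs, throughputRps } = batchStats(size);
  console.log(
    `batch=${size} latency=${latencyMs}ms throughput=${throughputRps.toFixed(1)} req/s`
  );
}
```

Under these assumed constants, going from batch size 1 to 64 raises throughput from roughly 45 to roughly 432 requests per second, while per-request latency grows from 22 ms to 148 ms. Real models will have different curves, but the shape of the trade-off is the same.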
💡 Solution
Use separate serving endpoints for different SLAs: a high-throughput batch endpoint for offline processing (e.g., nightly scoring jobs) and a low-latency real-time endpoint for interactive applications (e.g., chatbots). Use dynamic batching, which groups requests that arrive within a short time window, for mixed workloads. Implement load balancing to route each request to the appropriate endpoint, and monitor throughput and latency separately for each endpoint.
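Dynamic batching can be sketched in a few lines. The class below is a minimal illustration, not the API of any particular serving framework: `DynamicBatcher`, `runBatch`, `maxBatchSize`, and `maxDelayMs` are hypothetical names. It flushes a batch as soon as it is full, or once the oldest queued request has waited `maxDelayMs`.

```javascript
// Minimal dynamic batcher (illustrative sketch, hypothetical names).
// runBatch is assumed to be an async function mapping an array of inputs
// to an array of outputs in the same order.
class DynamicBatcher {
  constructor(runBatch, { maxBatchSize = 8, maxDelayMs = 10 } = {}) {
    this.runBatch = runBatch;
    this.maxBatchSize = maxBatchSize;
    this.maxDelayMs = maxDelayMs;
    this.pending = []; // entries: { input, resolve, reject }
    this.timer = null;
  }

  predict(input) {
    return new Promise((resolve, reject) => {
      this.pending.push({ input, resolve, reject });
      if (this.pending.length >= this.maxBatchSize) {
        this.flush(); // batch is full: run it now
      } else if (this.timer === null) {
        // first request of a new batch: bound its waiting time
        this.timer = setTimeout(() => this.flush(), this.maxDelayMs);
      }
    });
  }

  async flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    const batch = this.pending.splice(0, this.maxBatchSize);
    if (batch.length === 0) return;
    try {
      const outputs = await this.runBatch(batch.map((entry) => entry.input));
      batch.forEach((entry, i) => entry.resolve(outputs[i]));
    } catch (err) {
      batch.forEach((entry) => entry.reject(err));
    }
  }
}

// Usage: a fake "model" that doubles its inputs.
const demoBatcher = new DynamicBatcher(
  async (inputs) => inputs.map((x) => x * 2),
  { maxBatchSize: 4, maxDelayMs: 5 }
);
Promise.all([1, 2, 3, 4, 5].map((x) => demoBatcher.predict(x)))
  .then((outputs) => console.log(outputs)); // the five inputs, doubled
```

With these settings the first four requests run together as one full batch, and the fifth runs alone after the 5 ms delay expires. Raising `maxDelayMs` favors throughput (fuller batches), while lowering it favors latency.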
📌 TL;DR
Throughput vs. latency: large batches maximize throughput but increase latency. Small batches minimize latency but reduce throughput. Use separate endpoints for different SLAs.
Advantages
- Optimizes infrastructure for specific use cases
- Reduces costs by using appropriate batch sizes
- Improves user experience with latency-optimized endpoints
- Enables clear SLA differentiation
Disadvantages
- More complex deployment and monitoring
- Requires request routing logic
- May need model versioning across endpoints
- Adds operational overhead
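// Throughput vs. Latency: Multi-endpoint serving strategy
// ThroughputOptimizedServer, LatencyOptimizedServer, DynamicBatchingServer, and
// RequestRouter are placeholder classes that the deployment is assumed to provide.
class MLServingStack {
  constructor() {
    this.throughputEndpoint = new ThroughputOptimizedServer({
      batch_size: 64,
      max_batch_delay: 50, // ms
      gpu_utilization_target: 0.95,
      model_precision: 'FP16'
    });
    this.latencyEndpoint = new LatencyOptimizedServer({
      batch_size: 1,
      max_batch_delay: 0, // ms; never wait for additional requests
      gpu_utilization_target: 0.60,
      model_precision: 'INT8'
    });
    // Fallback for requests without a strict SLA in either direction.
    // These option values are illustrative assumptions.
    this.dynamicBatchingServer = new DynamicBatchingServer({
      batch_size: 16,
      max_batch_delay: 10 // ms
    });
    // Placeholder for load balancing across replicas; not used in this sketch.
    this.router = new RequestRouter();
  }

  async serveRequest(request, priority) {
    // Route based on SLA requirements (request.timeout is assumed to be in ms)
    if (priority === 'real-time' || request.timeout < 100) {
      return this.latencyEndpoint.predict(request);
    } else if (priority === 'batch' || request.timeout > 1000) {
      return this.throughputEndpoint.predict(request);
    } else {
      // Dynamic batching for mixed workload
      return this.dynamicBatchingServer.predict(request);
    }
  }
}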

// Throughput-optimized server configuration
const throughputConfig = {
  server: {
    workers: 4,
    threads_per_worker: 8,
    max_concurrent_requests: 256
  },
  model: {
    batch_size: 64,
    tensor_parallel: true,
    pipeline_parallel: false
  },
  monitoring: {
    primary_metric: 'requests_per_second',
    target: 1000, // requests per second
    secondary_metric: 'p95_latency',
    max: 500 // ms
  }
};

// Latency-optimized server configuration
const latencyConfig = {
  server: {
    workers: 8,
    threads_per_worker: 4,
    max_concurrent_requests: 64
  },
  model: {
    batch_size: 1,
    // Splitting each layer across GPUs can shorten a single forward pass.
    tensor_parallel: true,
    // Pipeline stages run one after another for a single request,
    // so pipelining does not reduce that request's latency.
    pipeline_parallel: false
  },
  monitoring: {
    primary_metric: 'p95_latency',
    target: 50, // ms
    secondary_metric: 'requests_per_second',
    min: 50 // requests per second
  }
};
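
The monitoring blocks above name a primary and a secondary metric for each endpoint. As a rough sketch of how the latency endpoint's numbers might be checked, the code below computes p95 latency and requests per second from one window of samples. The function names, the `windowSeconds` input, and the reading of `target: 50` as an upper bound on p95 latency are assumptions made for illustration; a real deployment would pull these values from its metrics system.

```javascript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize one observation window (hypothetical inputs).
function summarize(latenciesMs, windowSeconds) {
  return {
    p95_latency: percentile(latenciesMs, 95),
    requests_per_second: latenciesMs.length / windowSeconds
  };
}

// Assumed reading of the latency endpoint's SLO:
// p95 latency at or below 50 ms and at least 50 requests per second.
function meetsLatencySlo(summary, maxP95Ms = 50, minRps = 50) {
  return summary.p95_latency <= maxP95Ms &&
    summary.requests_per_second >= minRps;
}

// 100 requests in one second with latencies of 1..100 ms.
const samples = Array.from({ length: 100 }, (_, i) => i + 1);
const summary = summarize(samples, 1);
console.log(summary, meetsLatencySlo(summary));
```

In this synthetic window the throughput floor is met (100 requests per second), but the p95 latency is 95 ms, so the check fails. Tracking the two metrics separately is what makes this kind of failure visible.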