GPU vs. TPU vs. CPU for AI Workloads
Hardware choice depends on workload type, framework, scale, and cost. This pattern compares the trade-offs among GPUs (A100/H100), TPUs (v4/v5), CPUs (for small models), and custom ASICs.
Intent & Description
🎯 Intent
Choose the right hardware for AI workloads based on matrix multiply throughput, memory bandwidth, programmability, framework support, cost per FLOP, availability, and specific use case requirements.
📋 Context
GPUs (A100/H100) offer very high matrix multiply throughput, high memory bandwidth, excellent programmability (CUDA), and excellent framework support. TPUs (v4/v5) offer extremely high throughput and very high memory bandwidth, but only medium programmability (XLA, JAX), and they are available only on Google Cloud. CPUs have low matrix multiply throughput but very high programmability and universal availability. Custom ASICs offer the highest task-specific performance but very low programmability.
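One way to relate throughput and memory bandwidth is the roofline ridge point: peak compute divided by memory bandwidth gives the arithmetic intensity (FLOPs per byte moved) above which a kernel is limited by compute rather than memory. The sketch below is a rough illustration, assuming approximate peak half-precision figures per chip; the CPU row is an order-of-magnitude placeholder, and `PEAK_SPECS`, `ridge_point`, and `attainable_tflops` are illustrative names.

```python
# Approximate peak figures (assumptions):
# name -> (half-precision TFLOPS, memory bandwidth in TB/s)
PEAK_SPECS = {
    "A100": (312.0, 2.0),
    "H100": (989.0, 3.35),
    "TPU-v4": (275.0, 1.2),
    "CPU": (1.0, 0.05),  # rough placeholder, varies widely by part
}


def ridge_point(tflops, tb_per_s):
    """FLOPs per byte above which compute, not memory, is the limit."""
    # TFLOPS / (TB/s) cancels to FLOPs per byte.
    return tflops / tb_per_s


def attainable_tflops(name, flops_per_byte):
    """Simple roofline bound: min(compute roof, bandwidth * intensity)."""
    tflops, tb_per_s = PEAK_SPECS[name]
    return min(tflops, tb_per_s * flops_per_byte)


for name, (tflops, tb_per_s) in PEAK_SPECS.items():
    print(f"{name}: ridge point ~{ridge_point(tflops, tb_per_s):.0f} FLOPs/byte")
```

Workloads with low arithmetic intensity (for example small-batch inference) sit on the bandwidth-limited side of this bound, so memory bandwidth can matter more than headline TFLOPS.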
💡 Solution
Use A100 for most current production training workloads. Use H100 when time-to-train is critical or for the largest models. Use TPUs for JAX-based workloads at Google Cloud scale. Use CPU inference for small models (BERT-base, distilled models) to avoid GPU cold-start costs. Use custom ASICs for hyperscale inference when cost is critical.
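Whether time-to-train is critical can be gauged with a back-of-the-envelope estimate. The sketch below uses the common approximation of roughly 6 FLOPs per parameter per training token for dense transformers and divides by sustained throughput. The 40% utilization, the 300B-token budget, and the 64-device cluster are illustrative assumptions, not measurements, and `estimate_training_days` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def estimate_training_days(params, tokens, peak_tflops, num_devices,
                           utilization=0.4):
    """Rough training time in days for a dense transformer.

    Assumes ~6 FLOPs per parameter per token and a fixed fraction of
    peak throughput sustained across all devices (both are assumptions).
    """
    total_flops = 6 * params * tokens
    sustained_flops_per_s = peak_tflops * 1e12 * utilization * num_devices
    return total_flops / sustained_flops_per_s / SECONDS_PER_DAY


# Hypothetical run: 13B parameters, 300B tokens, 64 accelerators.
a100_days = estimate_training_days(13e9, 300e9, peak_tflops=312, num_devices=64)
h100_days = estimate_training_days(13e9, 300e9, peak_tflops=989, num_devices=64)
print(f"A100: ~{a100_days:.0f} days, H100: ~{h100_days:.0f} days")
```

With equal utilization the ratio simply mirrors the ratio of peak throughput; real speedups depend on the utilization actually achieved on each platform.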
Real-world Use Case
📌 TL;DR
Hardware choice: A100 for most production training (good balance). H100 for largest models or time-critical training. TPU for JAX workloads at Google Cloud scale. CPU for small model inference (no cold-start cost). Custom ASIC for hyperscale.
Advantages
- GPUs: excellent general-purpose AI hardware
- TPUs: excellent for JAX workloads at scale
- CPUs: universal availability, no cold-start costs
- Custom ASICs: optimal for hyperscale workloads
Disadvantages
- GPUs: high cost per FLOP, high-end parts available mainly through major cloud providers
- TPUs: Google Cloud only, XLA compilation overhead
- CPUs: poor performance for large models
- Custom ASICs: very low programmability, vendor lock-in
# Hardware Selection for AI Workloads
class HardwareSelector:
    def __init__(self):
        # Approximate per-chip figures. 'flops_fp16' is peak dense
        # half-precision throughput in TFLOPS (bf16 for TPU-v4).
        self.hardware_specs = {
            'A100': {
                'flops_fp16': 312,
                'memory_bandwidth': '2 TB/s',
                'memory': '80GB',
                'cost_per_flop': 'high',
                'programmability': 'high',
                'framework_support': 'excellent'
            },
            'H100': {
                'flops_fp16': 989,
                'memory_bandwidth': '3.35 TB/s',
                'memory': '80GB',
                'cost_per_flop': 'very_high',
                'programmability': 'high',
                'framework_support': 'excellent'
            },
            'TPU-v4': {
                'flops_fp16': 275,
                'memory_bandwidth': '1.2 TB/s',
                'memory': '32GB',  # HBM per chip
                'cost_per_flop': 'medium',
                'programmability': 'medium',
                'framework_support': 'good'
            },
            'CPU': {
                'flops_fp16': 1,  # rough order of magnitude
                'memory_bandwidth': '50 GB/s',
                'memory': 'variable',
                'cost_per_flop': 'low',
                'programmability': 'very_high',
                'framework_support': 'excellent'
            }
        }

    def select_hardware(self, workload):
        """Return a hardware name for a workload dict (first matching rule wins)."""
        model_size = workload.get('model_size', 'unknown')
        framework = workload.get('framework', 'pytorch')
        scale = workload.get('scale', 'small')
        # Used here as a proxy for how time-critical the workload is
        latency_sensitivity = workload.get('latency_sensitivity', 'medium')

        # Decision logic: rules are checked in order
        if framework == 'jax' and scale == 'large':
            return 'TPU-v4'
        elif model_size in ['7B', '13B', '33B']:
            return 'A100'
        elif model_size in ['70B', '175B'] and latency_sensitivity == 'high':
            return 'H100'
        elif model_size in ['BERT-base', 'distilled']:
            return 'CPU'
        elif scale == 'hyperscale' and latency_sensitivity == 'low':
            return 'custom_asic'
        else:
            return 'A100'  # Default choice

    def estimate_cost(self, hardware, hours):
        """Return the estimated cost in USD for running `hardware` for `hours`."""
        costs = {
            'A100': 3.0,    # $3.0/hour (simplified)
            'H100': 9.0,    # $9.0/hour
            'TPU-v4': 2.0,  # $2.0/hour
            'CPU': 0.1,     # $0.1/hour
        }
        # Fail loudly instead of reporting $0 for hardware with no listed
        # rate (e.g. 'custom_asic', which select_hardware can return).
        if hardware not in costs:
            raise ValueError(f"No hourly rate listed for hardware: {hardware!r}")
        return costs[hardware] * hours

# Usage example
selector = HardwareSelector()

# Scenario 1: LLM training with PyTorch
llm_workload = {
    'model_size': '13B',
    'framework': 'pytorch',
    'scale': 'medium',
    'latency_sensitivity': 'low'
}
hardware = selector.select_hardware(llm_workload)  # Returns 'A100'

# Scenario 2: JAX-based large-scale training
jax_workload = {
    'model_size': '175B',
    'framework': 'jax',
    'scale': 'large',
    'latency_sensitivity': 'medium'
}
hardware = selector.select_hardware(jax_workload)  # Returns 'TPU-v4'

# Scenario 3: Small model inference
small_model_workload = {
    'model_size': 'BERT-base',
    'framework': 'pytorch',
    'scale': 'small',
    'latency_sensitivity': 'low'
}
hardware = selector.select_hardware(small_model_workload)  # Returns 'CPU'
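Since cost per FLOP is one of the selection criteria, it can help to turn the simplified hourly rates from `estimate_cost` into a price per unit of peak throughput. The sketch below is only an illustration: the hourly rates and TFLOPS values are the simplified figures used in this listing, not quoted prices, and `cost_per_tflop_hour` is a hypothetical helper.

```python
# Simplified placeholder figures (assumptions, not quoted prices)
HOURLY_RATE_USD = {"A100": 3.0, "H100": 9.0, "TPU-v4": 2.0, "CPU": 0.1}
PEAK_TFLOPS = {"A100": 312, "H100": 989, "TPU-v4": 275, "CPU": 1}


def cost_per_tflop_hour(name):
    """USD paid per hour for each TFLOPS of peak half-precision throughput."""
    return HOURLY_RATE_USD[name] / PEAK_TFLOPS[name]


# Cheapest per peak TFLOP-hour first
ranking = sorted(HOURLY_RATE_USD, key=cost_per_tflop_hour)
for name in ranking:
    print(f"{name}: ${cost_per_tflop_hour(name):.4f} per peak TFLOP-hour")
```

With these placeholder figures the CPU is the cheapest per hour but the most expensive per peak TFLOP, which is why it only suits models small enough that throughput is not the constraint. Actual prices vary by provider, region, and commitment, so the ranking should be recomputed with real quotes.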