Spot vs. On-Demand Instances
Spot instances can cut costs by up to 90%, but the provider can reclaim them on short notice (2 minutes on AWS). You trade reliability for cost, or vice versa.
Intent & Description
🎯 Intent
Balance cost savings against reliability in cloud infrastructure. Spot instances offer discounts of up to 90% compared to on-demand pricing, but the cloud provider can terminate them when it needs the capacity back. AWS gives a 2-minute notice; GCP and Azure give about 30 seconds.
📋 Context
You are running batch jobs, CI/CD pipelines, or stateless services on AWS, GCP, or Azure. On-demand instances are not reclaimed by the provider, but they are expensive. Spot instances are cheap, but they can be interrupted at any time. The decision depends on your fault tolerance, checkpointing strategy, and time sensitivity.
💡 Solution
Use spot instances for fault-tolerant batch workloads (data processing, CI builds, training jobs). Implement checkpointing so that interrupted jobs can resume instead of restarting. Use the capacity-optimized spot allocation strategy to reduce interruption frequency. Mix spot and on-demand capacity for critical services. Where possible, use managed services that handle spot termination for you (ECS on Spot capacity, Fargate Spot). Monitor interruption rates and adjust the strategy accordingly.
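The checkpointing step can be sketched as follows. This is a minimal illustration, not a production implementation: the function names (`load_checkpoint`, `save_checkpoint`, `run_batch`) and the JSON checkpoint format are assumptions made for this example, and the checkpoint path is assumed to live on storage that survives the instance (a network volume or a synced object store), since local disk on a spot instance is lost on termination.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item, or 0 if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint to a temp file, then rename it over the old one.

    The rename is atomic, so an interruption mid-write cannot leave a
    half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def run_batch(items, process, path, checkpoint_every=10):
    """Process items in order, resuming from the last checkpoint if one exists."""
    start = load_checkpoint(path)
    for i in range(start, len(items)):
        process(items[i])
        if (i + 1) % checkpoint_every == 0:
            save_checkpoint(path, i + 1)
    save_checkpoint(path, len(items))
```

Items processed after the last checkpoint are processed again on resume, so `process` should be idempotent. A smaller `checkpoint_every` reduces repeated work at the cost of more checkpoint writes.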
Real-world Use Case
📌 TL;DR
Spot instances = up to 90% savings, but they can be revoked on short notice (2 minutes on AWS). Use them for fault-tolerant batch jobs. On-demand = reliable but expensive. Mix both for the best cost-reliability balance.
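To make the cost side of the mix concrete, here is a small arithmetic sketch. The function name and all prices and discounts used with it are made up for illustration; real spot discounts vary by instance type, region, and time.

```python
def blended_hourly_cost(instances, on_demand_share, on_demand_price, spot_discount):
    """Hourly cost of a fleet in which `on_demand_share` of the instances are on-demand.

    on_demand_share and spot_discount are fractions between 0 and 1,
    for example 0.2 for 20% on-demand and 0.7 for a 70% spot discount.
    """
    if not 0.0 <= on_demand_share <= 1.0:
        raise ValueError("on_demand_share must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_price = on_demand_price * (1.0 - spot_discount)
    on_demand_cost = instances * on_demand_share * on_demand_price
    spot_cost = instances * (1.0 - on_demand_share) * spot_price
    return on_demand_cost + spot_cost


# Hypothetical fleet: 10 instances at $0.10/hour on-demand, 70% spot discount
all_on_demand = blended_hourly_cost(10, 1.0, 0.10, 0.7)
mixed = blended_hourly_cost(10, 0.2, 0.10, 0.7)
print(f"all on-demand: ${all_on_demand:.2f}/h, 20% on-demand: ${mixed:.2f}/h")
```

With these assumed numbers, keeping 20% of the fleet on-demand costs roughly $0.44 per hour against $1.00 for an all-on-demand fleet. The figure ignores the cost of work repeated after interruptions, which depends on how often you checkpoint.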
Advantages
- Massive cost savings (60-90%) for suitable workloads
- Forces architectural improvements (fault tolerance, checkpointing)
- Enables using more powerful instances for same budget
- Cloud providers offer tools to manage spot complexity
Disadvantages
- Adds operational complexity (interruption handling)
- Not suitable for stateful services or latency-sensitive workloads
- Capacity can be unavailable during high-demand periods
- Requires application-level changes to handle interruptions
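As an example of the application-level change involved, the sketch below polls the EC2 instance metadata service for a spot interruption notice. It is AWS-specific and assumes the code runs on the instance itself with IMDSv2 reachable. The function names, the `on_interrupt` callback, and the 5-second polling interval are choices made for this example.

```python
import json
import time
import urllib.error
import urllib.request

METADATA_BASE = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the spot instance-action document as text, or None if none is scheduled."""
    # IMDSv2 requires a short-lived session token for every metadata request.
    token_req = urllib.request.Request(
        f"{METADATA_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{METADATA_BASE}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def wait_for_interruption(on_interrupt, fetch=fetch_instance_action,
                          poll_seconds=5.0, sleep=time.sleep, max_polls=None):
    """Poll until a notice appears, then call on_interrupt with the parsed notice.

    Returns True if a notice was seen, False if max_polls was reached first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = fetch()
        if notice is not None:
            on_interrupt(json.loads(notice))
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

A worker would typically run `wait_for_interruption` in a background thread and have `on_interrupt` save a checkpoint and stop accepting new work. The `fetch` and `sleep` parameters exist so the loop can be exercised off-instance with fakes.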
# Spot vs. On-Demand: handling spot instance interruptions
import signal
import sys

import boto3
from botocore.exceptions import ClientError


class SpotInstanceManager:
    def __init__(self):
        self.ec2 = boto3.client('ec2')
        self.setup_interrupt_handler()

    def setup_interrupt_handler(self):
        """Run handle_interrupt when the process receives SIGTERM."""
        # SIGTERM usually arrives when the OS shuts down or an orchestrator
        # stops the job. The 2-minute notice itself is published through the
        # instance metadata service and EventBridge, not delivered as a signal.
        signal.signal(signal.SIGTERM, self.handle_interrupt)

    def handle_interrupt(self, signum, frame):
        """Save state, then shut down gracefully."""
        print("Spot interruption detected, saving checkpoint...")
        self.save_checkpoint()
        self.graceful_shutdown()

    def save_checkpoint(self):
        """Placeholder: persist job progress to durable storage."""

    def graceful_shutdown(self):
        """Placeholder: release resources, then exit the process."""
        sys.exit(0)

    def request_spot_instances(self, instance_type, count, bid_price):
        """Request spot instances, falling back to on-demand on failure."""
        try:
            response = self.ec2.request_spot_instances(
                # SpotPrice is the maximum hourly price you are willing to pay
                SpotPrice=str(bid_price),
                InstanceCount=count,
                LaunchSpecification={
                    'ImageId': 'ami-12345',  # placeholder AMI ID
                    'InstanceType': instance_type,
                }
            )
            return response
        except ClientError as e:
            print(f"Spot unavailable, falling back to on-demand: {e}")
            return self.request_on_demand_instances(instance_type, count)

    def request_on_demand_instances(self, instance_type, count):
        """Fall back to on-demand instances."""
        response = self.ec2.run_instances(
            ImageId='ami-12345',  # placeholder AMI ID
            InstanceType=instance_type,
            MinCount=count,
            MaxCount=count
        )
        return response
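An alternative to hand-written fallback is to let an Auto Scaling group mix spot and on-demand capacity with the capacity-optimized allocation strategy. The sketch below only builds the `MixedInstancesPolicy` dictionary. The helper name, the launch template name, the instance types, and the default base and percentage values are assumptions for illustration.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_base=1, on_demand_percentage=25):
    """Build a MixedInstancesPolicy for an EC2 Auto Scaling group.

    on_demand_base: number of instances that are always on-demand.
    on_demand_percentage: share of capacity above the base that is on-demand;
    the rest is requested as spot.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Several instance types give the allocator more spot pools to pick from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_percentage,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


# Hypothetical launch template and instance types
policy = mixed_instances_policy("batch-workers", ["m5.large", "m5a.large", "m4.large"])
```

The resulting dictionary can be passed as the `MixedInstancesPolicy` argument of `create_auto_scaling_group` on a boto3 `autoscaling` client, together with the group name, size limits, and subnets for your account.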
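// AWS Fargate Spot: managed spot capacity for containers (AWS SDK for JavaScript v2)
const AWS = require('aws-sdk');
const ecs = new AWS.ECS();

// Shape of the task definition this example assumes is already registered
// (containerDefinitions omitted)
const taskDefinition = {
  family: 'batch-processor',
  requiresCompatibilities: ['FARGATE'],
  cpu: '256',
  memory: '512',
  networkMode: 'awsvpc'
};

const runTask = async () => {
  try {
    await ecs.runTask({
      cluster: 'batch-cluster',
      taskDefinition: 'batch-processor',
      // launchType is omitted: it cannot be combined with capacityProviderStrategy
      capacityProviderStrategy: [{
        capacityProvider: 'FARGATE_SPOT', // Spot capacity
        weight: 1
      }, {
        capacityProvider: 'FARGATE', // On-demand capacity
        weight: 1,
        base: 1 // The first task always runs on on-demand capacity
      }],
      // Weights set the spot/on-demand ratio beyond the base; they are not a fallback order.
      // awsvpc tasks need a network configuration; the subnet ID is a placeholder.
      networkConfiguration: {
        awsvpcConfiguration: { subnets: ['subnet-12345'] }
      }
    }).promise(); // SDK v2 only sends the request once .promise() (or a callback) is used
  } catch (error) {
    console.error('Task failed:', error);
  }
};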