Data Parallelism vs. Model Parallelism vs. Pipeline Parallelism
How do you distribute a model across multiple GPUs or nodes? The right strategy depends on model size and hardware constraints. This pattern covers data, tensor, pipeline, sequence, and expert parallelism.
Intent & Description
🎯 Intent
Choose the right parallelism strategy to distribute model training and inference across multiple GPUs/nodes based on model size, hardware constraints, and communication bandwidth.
📋 Context
Large models do not fit on a single GPU, and each parallelism strategy distributes a different part of the workload:
- Data parallelism splits each batch across GPUs; every GPU holds a full model replica.
- Tensor parallelism splits individual layers across GPUs (high communication volume).
- Pipeline parallelism assigns consecutive groups of layers to different stages (lower communication volume).
- Sequence parallelism splits activations along the sequence dimension to handle long sequences.
- Expert parallelism places the experts of an MoE model on different GPUs and routes tokens to them.
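To make the "does not fit" claim concrete, the sketch below estimates per-GPU memory for model state under plain data parallelism and under fully sharded data parallelism. It assumes mixed-precision training with Adam at roughly 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus the two Adam moments) and ignores activations and buffers. The function name and the example sizes are illustrative, not taken from any library.

```python
# Assumed cost of model state per parameter (mixed precision + Adam):
# 2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 optimizer state
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gib(num_params, num_gpus=1, sharded=False):
    """Estimate per-GPU memory (GiB) for weights, gradients and optimizer state.

    sharded=False models plain data parallelism (full replica on every GPU).
    sharded=True models ZeRO-3 / FSDP style sharding across num_gpus.
    """
    total_bytes = num_params * BYTES_PER_PARAM
    if sharded:
        total_bytes /= num_gpus
    return total_bytes / 2**30


if __name__ == "__main__":
    for params in (1e9, 7e9, 70e9):
        replicated = model_state_gib(params)
        sharded = model_state_gib(params, num_gpus=8, sharded=True)
        print(f"{params / 1e9:.0f}B params: "
              f"{replicated:.0f} GiB replicated, {sharded:.0f} GiB sharded over 8 GPUs")
```

Under these assumptions a replica costs the same on every GPU regardless of how many GPUs are added, which is why data parallelism alone stops working once the model state exceeds the memory of one device.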
💡 Solution
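Start with DDP (data parallelism): it is the simplest option and scales to multiple nodes, but it requires the model to fit on one GPU. Add tensor parallelism within a node, where NVLink provides enough bandwidth for its frequent collectives. Use pipeline parallelism across nodes, since it only passes activations between adjacent stages and tolerates lower bandwidth. Combine all three (3D parallelism: data + tensor + pipeline) for models with 100B+ parameters. Use FSDP (Fully Sharded Data Parallel), PyTorch's ZeRO-3 equivalent, when a full data-parallel replica no longer fits in GPU memory.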
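The placement advice above (tensor parallelism on fast links, pipeline parallelism across slower ones, data parallelism over whatever remains) can be sketched as a mapping from a global rank to three coordinates. The layout below, where the tensor-parallel index varies fastest so that tensor-parallel peers get consecutive ranks, is one possible convention assumed for illustration. `rank_to_coords` is a hypothetical helper, not a framework API.

```python
def rank_to_coords(rank, tp_size, pp_size, world_size):
    """Map a global rank to (dp_rank, pp_rank, tp_rank).

    Assumed layout: tensor-parallel index varies fastest, then pipeline
    stage, then data-parallel replica.
    """
    model_parallel_size = tp_size * pp_size
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")

    tp_rank = rank % tp_size
    pp_rank = (rank // tp_size) % pp_size
    dp_rank = rank // model_parallel_size
    return dp_rank, pp_rank, tp_rank


if __name__ == "__main__":
    # 16 GPUs, tensor parallel 4, pipeline parallel 2 -> 2 data-parallel replicas
    for rank in range(16):
        print(rank, rank_to_coords(rank, tp_size=4, pp_size=2, world_size=16))
```

With this layout, choosing `tp_size` equal to the number of GPUs per node keeps every tensor-parallel group inside one node, so only pipeline and data-parallel traffic crosses the slower inter-node links.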
Real-world Use Case
📌 TL;DR
Parallelism strategies: Data (DDP) for models fitting on 1 GPU. Tensor (split layers) for large layers with NVLink. Pipeline (split stages) for sequential models. 3D (data+tensor+pipeline) for 100B+ models. Start with DDP, add others as needed.
Advantages
- Systematic approach to distributed training
- Each strategy optimized for different scenarios
- Combination strategies (3D parallelism) enable massive models
- Modern frameworks (PyTorch FSDP) simplify implementation
Disadvantages
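- Each strategy has its own communication pattern (gradient all-reduce, per-layer collectives, point-to-point activation transfers), so a mismatch with the interconnect can bottleneck training
- Pipeline parallelism introduces bubble overhead: stages sit idle while the pipeline fills and drains
- Distributed training is complex to debug and monitor
- The optimal strategy depends on the hardware (GPU memory, intra-node and inter-node bandwidth)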
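The pipeline bubble overhead listed above can be estimated for a simple GPipe-style schedule: with p stages and m micro-batches of equal cost, the idle fraction is (p - 1) / (m + p - 1). The helper below is a sketch under that assumption (uniform stage times, no interleaved schedule); the function name is illustrative.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a GPipe-style pipeline with equal-cost stages."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for m in (4, 8, 32):
        print(f"4 stages, {m} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

Raising the number of micro-batches shrinks the bubble: with 4 stages, going from 4 to 32 micro-batches lowers the idle fraction from about 43% to about 9%.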
# Parallelism Strategies
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
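# Pipe is deprecated in newer PyTorch releases in favor of
# torch.distributed.pipelining; this import assumes a release that still ships it
from torch.distributed.pipeline.sync import Pipe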
# Data Parallelism (DDP) - Simplest, model replica on each GPU
def setup_data_parallelism(model, local_rank):
    model = model.to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    return model
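# Tensor Parallelism - Split layers across GPUs (high communication)
class TensorParallelLinear(torch.nn.Module):
    """Column-parallel linear layer: each rank holds a slice of the output features."""

    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        if out_features % world_size != 0:
            raise ValueError("out_features must be divisible by world_size")
        self.world_size = world_size
        self.out_features_per_gpu = out_features // world_size
        # Local weight shard: (out_features / world_size, in_features)
        self.weight = torch.nn.Parameter(torch.randn(
            self.out_features_per_gpu, in_features
        ))

    def forward(self, x):
        # Each rank computes its own slice of the output columns
        local_output = torch.mm(x, self.weight.t())
        # The slices are disjoint, so gather and concatenate them; an
        # all-reduce (sum) would only be correct for a row-parallel split.
        # Forward-only sketch: dist.all_gather does not propagate gradients.
        shards = [torch.empty_like(local_output) for _ in range(self.world_size)]
        dist.all_gather(shards, local_output)
        return torch.cat(shards, dim=-1)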
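# Pipeline Parallelism - Layers across stages (lower communication)
def setup_pipeline_parallelism(model, num_stages=2, chunks=4):
    # Assumes model.features is an nn.Sequential and that
    # torch.distributed.rpc.init_rpc() has been called, which Pipe requires
    layers = list(model.features)
    per_stage = (len(layers) + num_stages - 1) // num_stages
    # Split into contiguous stages (a strided slice would interleave layers
    # and break execution order), placing stage i on GPU i
    stages = [
        torch.nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage]).to(f"cuda:{i}")
        for i in range(num_stages)
    ]
    # chunks is the number of micro-batches each mini-batch is split into
    return Pipe(torch.nn.Sequential(*stages), chunks=chunks)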
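# 3D Parallelism (Megatron-LM style)
class MegatronParallelism:
    def __init__(self, model, tensor_parallel_size, pipeline_parallel_size):
        world_size = dist.get_world_size()
        model_parallel_size = tensor_parallel_size * pipeline_parallel_size
        if world_size % model_parallel_size != 0:
            raise ValueError("world size must be divisible by tensor * pipeline size")
        self.tensor_parallel_size = tensor_parallel_size
        self.pipeline_parallel_size = pipeline_parallel_size
        # The remaining factor of the world size is the number of model replicas
        self.data_parallel_size = world_size // model_parallel_size
        self.model = self.setup_3d_parallelism(model)

    def setup_3d_parallelism(self, model):
        # Combine data + tensor + pipeline parallelism
        model = self.apply_tensor_parallelism(model)
        model = self.apply_pipeline_parallelism(model)
        model = self.apply_data_parallelism(model)
        return model

    # Placeholders: the three steps are not implemented in this sketch
    def apply_tensor_parallelism(self, model):
        raise NotImplementedError

    def apply_pipeline_parallelism(self, model):
        raise NotImplementedError

    def apply_data_parallelism(self, model):
        raise NotImplementedError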
# FSDP (Fully Sharded Data Parallelism) - ZeRO-3 equivalent
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
def setup_fsdp(model):
    model = FSDP(model)
    return model
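# Usage example
def train_with_parallelism(model, strategy='ddp', local_rank=0):
    if strategy == 'ddp':
        model = setup_data_parallelism(model, local_rank)
    elif strategy == 'tensor':
        # No model-level helper is defined above; layers would have to be
        # swapped for TensorParallelLinear by hand
        raise NotImplementedError("tensor-parallel wrapping is not defined here")
    elif strategy == 'pipeline':
        model = setup_pipeline_parallelism(model)
    elif strategy == '3d':
        model = MegatronParallelism(
            model, tensor_parallel_size=4, pipeline_parallel_size=2
        ).model
    elif strategy == 'fsdp':
        model = setup_fsdp(model)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return model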
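
Whether partial results in tensor parallelism must be gathered or summed depends on how the weight is split, and this can be checked on a single process without any GPU. The pure-Python sketch below simulates the ranks with plain lists, with the weight stored as (in_features, out_features): splitting it by output features requires concatenating the partial outputs (an all-gather), while splitting it by input features requires summing them (an all-reduce). All helper names are illustrative and not part of PyTorch.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def column_parallel(x, w, parts):
    # Each simulated rank holds a column block of w and produces a slice of
    # the output features; the slices are concatenated (all-gather).
    partials = [matmul(x, w_shard) for w_shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def row_parallel(x, w, parts):
    # Each simulated rank holds a row block of w and the matching column
    # block of x; the full-size partial outputs are summed (all-reduce).
    partials = [
        matmul(x_shard, w_shard)
        for x_shard, w_shard in zip(split_columns(x, parts), split_rows(w, parts))
    ]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]          # batch of 2, 4 input features
w = [[1, 0], [0, 1], [1, 1], [2, -1]]     # 4 input features, 2 output features
full = matmul(x, w)

if __name__ == "__main__":
    print(full)
    print(column_parallel(x, w, parts=2) == full)
    print(row_parallel(x, w, parts=2) == full)
```

Both splits reproduce the unsharded result, but only when paired with the matching collective; summing the outputs of a column split, or concatenating those of a row split, gives a different answer.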