Learn to scale your deep learning models using DistributedDataParallel (DDP). Master multi-GPU environments, gradient synchronization, and effective data sharding.
Previously in this course, we explored the Project Milestone: Custom Architecture Setup, where we defined the core components of our Transformer. Now that we have a functional model, we need the compute power to train it effectively. This lesson introduces DistributedDataParallel (DDP), the industry-standard approach for scaling training across multiple GPUs.
When training deep learning models, single-GPU throughput is often the primary bottleneck: one device can only process so many samples per second. Data parallelism addresses this by replicating the model on every GPU and having each GPU process a different subset of the data at the same time.
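In a standard DDP setup, the process follows these steps:
1. Each process holds a full replica of the model on its own GPU.
2. Each process receives a different shard of the data and runs the forward pass on it.
3. Each process runs the backward pass to compute its local gradients.
4. The gradients are averaged across all processes (the all-reduce operation).
5. Every process applies the same optimizer step, so the replicas stay identical.

To run DDP, you must manage process groups. PyTorch uses torch.distributed to coordinate communication between GPUs. You don't manage individual threads; you manage individual processes, each of which is assigned to a specific device (GPU).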
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    # Initialize the process group
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def cleanup():
    dist.destroy_process_group()
```
The nccl backend is the gold standard for NVIDIA GPUs, providing highly optimized collective communication primitives.
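When a script is launched with torchrun, each process receives its identity through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables rather than through hand-passed arguments. The sketch below shows one way to read them; the DistEnv class and the read_dist_env helper are hypothetical names used for illustration, and the single-process defaults are an assumption for running the same script without a launcher.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node (selects the GPU)
    world_size: int  # total number of processes in the job


def read_dist_env(env=None):
    """Read the variables torchrun exports; default to a single-process run."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

On a single machine, rank and local_rank coincide, which is why the setup function above can pass rank directly to torch.cuda.set_device. On multi-node jobs, the GPU index should come from local_rank instead.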
If you have a global batch size of 128 and 4 GPUs, each GPU should process 32 samples per step. If you simply load the same data on every GPU, every replica computes the same gradients, so you pay for 4 GPUs but get the training signal of one.
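The batch arithmetic can be made explicit with a small helper. This is a minimal sketch; per_gpu_batch_size is a hypothetical name, and rejecting global batch sizes that do not divide evenly is an assumption made here to keep every GPU's share identical.

```python
def per_gpu_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch evenly across processes (one process per GPU)."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world_size")
    return global_batch_size // world_size


local_batch = per_gpu_batch_size(128, 4)  # 32 samples per GPU per step
```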
You must use DistributedSampler to ensure that each GPU receives a non-overlapping slice of the dataset:
```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assuming 'dataset' is your training data
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
# batch_size is per process: 32 per GPU x 4 GPUs = 128 globally
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

# In the training loop
for epoch in range(epochs):
    sampler.set_epoch(epoch)  # Reseeds the shuffle so each epoch uses a new order
    for batch in dataloader:
        pass  # ... training logic ...
```
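To build intuition for what the sampler does, the sketch below emulates the sharding logic in plain Python. It is an approximation for illustration, not PyTorch's implementation: shard_indices is a hypothetical helper, and the shuffle uses Python's random module, so the permutations will not match the ones DistributedSampler produces. The structure is the same, though: every rank shuffles with a seed derived from the epoch, the index list is padded so it divides evenly, and each rank takes a strided slice.

```python
import math
import random


def shard_indices(dataset_len, num_replicas, rank, epoch=0, seed=0, shuffle=True):
    """Return the dataset indices that one rank would see in a given epoch."""
    indices = list(range(dataset_len))
    if shuffle:
        # All ranks use the same (seed + epoch), so they agree on the order
        random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices until the total divides evenly
    per_rank = math.ceil(dataset_len / num_replicas)
    total = per_rank * num_replicas
    while len(indices) < total:
        indices += indices[: total - len(indices)]
    # Each rank takes every num_replicas-th index, starting at its own rank
    return indices[rank:total:num_replicas]


shards = [shard_indices(10, 4, r, shuffle=False) for r in range(4)]
```

With 10 samples and 4 ranks, two indices are repeated so that every rank gets 3 samples. Because the seed depends on the epoch, leaving the epoch fixed reproduces the same order every time, which is the behavior sampler.set_epoch(epoch) exists to avoid.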
Here is how you wrap your model and execute the synchronization. Note that DistributedDataParallel handles the all-reduce of gradients automatically during the backward() call.
```python
# 1. Initialize model and move to GPU
model = MyTransformer().to(rank)
ddp_model = DDP(model, device_ids=[rank])

# 2. Training step
# 'criterion' is assumed to be a loss function defined elsewhere
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

for inputs, targets in dataloader:
    inputs, targets = inputs.to(rank), targets.to(rank)
    optimizer.zero_grad()
    outputs = ddp_model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()  # Gradients are synchronized here
    optimizer.step()
```
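Why averaging gradients works can be checked without any GPUs. The sketch below is a toy, single-process illustration: it uses a one-parameter linear model with a mean-squared-error loss, and all_reduce_mean is a hypothetical stand-in for the real collective. With equally sized shards, the average of the per-shard gradients equals the gradient of the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per rank."""
    return sum(values) / len(values)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

world_size = 4
shard = len(xs) // world_size

# Each simulated rank computes a gradient on its own slice of the batch
local_grads = [
    mse_grad(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(world_size)
]

synced_grad = all_reduce_mean(local_grads)  # what every rank holds after sync
full_grad = mse_grad(w, xs, ys)             # single-GPU gradient on all 8 samples
```

The equality relies on the shards having the same size, which is one reason the sampler pads the dataset so that every rank receives the same number of samples.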
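1. Update your training script to accept rank and world_size as command-line arguments (when launching with torchrun, they are supplied through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables instead).
2. Replace your DataLoader with one using DistributedSampler.
3. Wrap your model in DDP and ensure you are using the correct device_ids.
4. Launch the script with torchrun (the recommended utility for distributed training):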
```shell
torchrun --nproc_per_node=4 train.py
```

Watch for these common pitfalls:
- Skipping sampler.set_epoch(epoch): If you skip this, your data order will be identical every epoch, which can slow convergence and encourage overfitting.
- Device placement: Move the model and every batch to the process's own GPU (rank).
- BatchNorm: BatchNorm behaves differently in DDP because it calculates statistics per GPU. Use SyncBatchNorm if your batch sizes per GPU are very small.
- Hangs: Ensure dist.init_process_group is called before any model operations and that all GPUs reach the backward() call. If one GPU hangs, the others will wait indefinitely.

We've moved beyond single-GPU training by implementing distributed training with DDP. By sharding our data with DistributedSampler and synchronizing our gradients over the nccl backend, we can now scale our Transformer training across the available GPUs, close to linearly as long as communication overhead stays small. This setup is essential for handling the compute budgets discussed in Scaling Laws and Compute Budgets.
Up next: We will tackle Tensor and Pipeline Parallelism to scale models that are too large to fit on a single GPU.