Data Parallelism Strategies: Scaling PyTorch with DDP

Learn to scale your deep learning models using DistributedDataParallel (DDP). Master multi-GPU environments, gradient synchronization, and effective data sharding.

PyTorchDDPDistributed TrainingMulti-GPUDeep LearningScalingaimachine-learningpython

Previously in this course, we explored the Project Milestone: Custom Architecture Setup, where we defined the core components of our Transformer. Now that we have a functional model, we need the compute power to train it effectively. This lesson introduces DistributedDataParallel (DDP), the industry-standard approach for scaling training across multiple GPUs.

The First Principles of Data Parallelism

When training deep learning models, the primary bottleneck is often the time taken to process a single batch. Data parallelism solves this by replicating the model across multiple GPUs, where each GPU processes a unique subset of the data simultaneously.

In a standard DDP setup, the process follows these steps:

Replication: The entire model is copied to every GPU.
Sharding: The global batch is split into smaller sub-batches (shards), one for each GPU.
Forward Pass: Each GPU performs a forward pass on its data shard.
Gradient Synchronization: After the backward pass, GPUs communicate to average their gradients (the all-reduce operation).
Optimizer Step: Each GPU updates its local model weights using the averaged gradients, ensuring all replicas remain identical.

Configuring Multi-GPU Environments

To run DDP, you must manage process groups. PyTorch uses torch.distributed to coordinate communication between GPUs. You don't manage individual threads; you manage individual processes, where each process is assigned to a specific device (GPU).


PYTHON
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import os

def setup(rank, world_size):
    # Initialize the process group
    os.environ[CE9178">'MASTER_ADDR'] = CE9178">'localhost'
    os.environ[CE9178">'MASTER_PORT'] = CE9178">'12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

The nccl backend is the gold standard for NVIDIA GPUs, providing highly optimized collective communication primitives.

Implementing Data Sharding

If you have a global batch size of 128 and 4 GPUs, each GPU must process exactly 32 samples. If you simply load the same data on every GPU, you are effectively training on the same data 4 times, which is incorrect.

You must use DistributedSampler to ensure that each GPU receives a non-overlapping slice of the dataset:


PYTHON
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader

# Assuming CE9178">'dataset' is your training data
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

# In the training loop
for epoch in range(epochs):
    sampler.set_epoch(epoch) # Essential for shuffling consistency
    for batch in dataloader:
        # ... training logic ...

Worked Example: The DDP Training Loop

Here is how you wrap your model and execute the synchronization. Note that DistributedDataParallel handles the all-reduce of gradients automatically during the backward() call.


PYTHON
# 1. Initialize model and move to GPU
model = MyTransformer().to(rank)
ddp_model = DDP(model, device_ids=[rank])

# 2. Training step
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

for inputs, targets in dataloader:
    inputs, targets = inputs.to(rank), targets.to(rank)
    
    optimizer.zero_grad()
    outputs = ddp_model(inputs)
    loss = criterion(outputs, targets)
    loss.backward() # Gradients are synchronized here
    optimizer.step()

Hands-on Exercise

Modify your training script to accept rank and world_size as command-line arguments.
Replace your standard DataLoader with one using DistributedSampler.
Wrap your model in DDP and ensure you are using the correct device_id.
Run your script using torchrun (the recommended utility for distributed training): torchrun --nproc_per_node=4 train.py

Common Pitfalls

Forgetting sampler.set_epoch(epoch): If you skip this, your data order will be identical every epoch, leading to suboptimal training and potential overfitting.
Mixing CPU and GPU operations: Always ensure your model and inputs are on the same device (rank).
Batch Normalization: Standard BatchNorm behaves differently in DDP because it calculates statistics per GPU. Use SyncBatchNorm if your batch sizes per GPU are very small.
Deadlocks: Ensure dist.init_process_group is called before any model operations and that all GPUs reach the backward() call. If one GPU hangs, the entire system will wait indefinitely.

Recap

We've moved beyond single-GPU training by implementing Distributed Training via DDP. By sharding our data with DistributedSampler and synchronizing our gradients via the nccl backend, we can now scale our Transformer training linearly with the number of available GPUs. This architecture is essential for handling the compute budgets discussed in Scaling Laws and Compute Budgets.

Up next: We will tackle Tensor and Pipeline Parallelism to scale models that are too large to fit on a single GPU.

Back to Blog

Data Parallelism Strategies: Scaling PyTorch with DDP

The First Principles of Data Parallelism

Configuring Multi-GPU Environments

Implementing Data Sharding

Worked Example: The DDP Training Loop

Hands-on Exercise

Common Pitfalls

Recap

Similar Posts

Mixture-of-Experts (MoE) Layers: Scaling Efficiently with Sparsity

Gradient Accumulation and Batch Sizing: Training at Scale

Distributed Optimizer States: Mastering ZeRO for Massive Models