Tensor and Pipeline Parallelism: Scaling Large Model Training

Learn to scale models beyond single-GPU memory limits. Master Tensor Parallelism, Pipeline Parallelism, and activation checkpointing for efficient training.

Previously in this course, we explored Data Parallelism Strategies: Scaling PyTorch with DDP. While Data Parallelism is excellent for scaling throughput by replicating the model, it fails when your model is too large to fit into the memory (VRAM) of a single GPU.

In this lesson, we move beyond data replication to model partitioning. We’ll cover how to split your model across devices using Tensor and Pipeline Parallelism, and how to use activation checkpointing to trade compute for memory.

Understanding Model Parallelism

When a model’s parameters, gradients, and optimizer states exceed the VRAM of a single GPU, you must employ Model Parallelism. Unlike Data Parallelism, where each GPU holds a full copy of the model, Model Parallelism divides the model's structure itself.

Tensor Parallelism (TP)

Tensor Parallelism splits individual layers across multiple GPUs. For a dense layer $Y = XA$, we can partition the weight matrix $A$ along its columns ($A = [A_1, A_2]$). Each GPU computes a partial output ($Y_1 = XA_1$, $Y_2 = XA_2$), and an AllGather operation concatenates the pieces into the full output $Y = [Y_1, Y_2]$. This is highly effective for large Transformer layers, such as the attention projections or MLP blocks.
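To make the column split concrete, here is a minimal single-process sketch that simulates two GPUs on CPU with plain tensor operations. The names `X`, `A`, `A_1`, and `A_2` follow the notation above, while the shapes are arbitrary. Using `torch.cat` in place of a real AllGather is an assumption made for illustration. The row-split case at the end is not part of the lesson's formula; it is included only for contrast, because summing partial results is the job an all-reduce performs.

```python
import torch

torch.manual_seed(0)

X = torch.randn(4, 8)   # activations: (batch, in_features)
A = torch.randn(8, 6)   # full weight: (in_features, out_features)
Y_ref = X @ A           # what a single GPU would compute

# Column-parallel: each simulated "GPU" holds half of A's columns.
A_1, A_2 = torch.chunk(A, chunks=2, dim=1)
Y_1 = X @ A_1           # partial output on simulated GPU 0, shape (4, 3)
Y_2 = X @ A_2           # partial output on simulated GPU 1, shape (4, 3)

# Concatenating along the feature axis stands in for AllGather.
Y_tp = torch.cat([Y_1, Y_2], dim=1)
print("column split matches:", torch.allclose(Y_tp, Y_ref, atol=1e-4))

# Row-parallel (for contrast): split A by rows and X by columns.
# Each device now produces a full-size partial sum, so the results
# must be added together, which is what an all-reduce would do.
X_a, X_b = torch.chunk(X, chunks=2, dim=1)
A_top, A_bot = torch.chunk(A, chunks=2, dim=0)
Y_row = X_a @ A_top + X_b @ A_bot
print("row split matches:", torch.allclose(Y_row, Y_ref, atol=1e-4))
```

In a real multi-GPU run, each shard would live on its own device and the concatenation or sum would be a collective call rather than a local operation.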

Pipeline Parallelism (PP)

Pipeline Parallelism partitions the model by depth, assigning contiguous groups of layers to different GPUs. If you have a 48-layer transformer, you might place layers 1-12 on GPU 0, 13-24 on GPU 1, and so on. This creates a "pipeline" where data flows from one device to the next.
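The layer assignment in that example can be sketched with a small helper. `partition_layers` is a hypothetical function written for this lesson, not part of any library, and it assumes every layer costs about the same, so it simply balances layer counts. Real frameworks may balance by parameter count or measured runtime instead.

```python
def partition_layers(num_layers, num_stages):
    """Split layers 1..num_layers into contiguous, near-equal stages.

    Returns a list of (first_layer, last_layer) tuples, 1-indexed and inclusive.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 1
    for stage in range(num_stages):
        # The first `extra` stages absorb one leftover layer each.
        size = base + (1 if stage < extra else 0)
        stages.append((start, start + size - 1))
        start += size
    return stages


for gpu, (first, last) in enumerate(partition_layers(48, 4)):
    print(f"GPU {gpu}: layers {first}-{last}")
```

For 48 layers and 4 stages this reproduces the split described above: layers 1-12 on GPU 0, 13-24 on GPU 1, 25-36 on GPU 2, and 37-48 on GPU 3.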

Implementing Pipeline Stages

Pipeline Parallelism introduces the "bubble" problem: if GPU 1 is waiting for GPU 0 to finish its forward pass, GPU 1 sits idle. We mitigate this by splitting the batch into smaller "micro-batches," allowing multiple GPUs to work on different parts of the pipeline simultaneously.
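A rough way to see how much micro-batching helps is to estimate the idle share of a simple fill-and-drain schedule. The sketch below assumes every stage takes the same time per micro-batch and ignores communication cost; under those assumptions a pipeline with `p` stages and `m` micro-batches is idle for a fraction of about `(p - 1) / (m + p - 1)` of each step. The function name `bubble_fraction` is made up for this illustration.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Estimated idle share of a fill-and-drain pipeline schedule.

    Assumes equal-cost stages and no communication overhead.
    """
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


for m in (1, 4, 16, 64):
    print(f"stages=4 micro_batches={m:>2} idle={bubble_fraction(4, m):.1%}")
```

With 4 stages and a single micro-batch the estimate is 75% idle time; adding micro-batches shrinks that share, which is why the number of micro-batches should comfortably exceed the number of stages.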

DeepSpeed provides an abstraction for this pattern. Here is how you describe a small model as a list of layers and let DeepSpeed partition it into two pipeline stages:


PYTHON
import torch
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

# Define your model as a list of layers
layers = [
    LayerSpec(nn.Linear, 1024, 1024),
    LayerSpec(nn.ReLU),
    LayerSpec(nn.Linear, 1024, 1024)
]

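# Partition the model across GPUs.
# Assumes the script runs under a distributed launcher (e.g. `deepspeed`)
# and that deepspeed.init_distributed() has been called beforehand.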
model = PipelineModule(
    layers=layers,
    num_stages=2,
    partition_method='parameters'
)

Activation Checkpointing: Trading Compute for Memory

Even with parallelism, storing activations for backpropagation consumes massive amounts of VRAM. Activation Checkpointing (or gradient checkpointing) solves this by discarding intermediate activations during the forward pass and recomputing them on-the-fly during the backward pass.

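When checkpoints are placed roughly every $\sqrt{L}$ layers, this reduces activation memory from $O(L)$ to $O(\sqrt{L})$, where $L$ is the number of layers, at the cost of about one extra forward pass. Checkpointing every block individually, as in the example below, still stores each block's input but frees the activations computed inside the block.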
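The $O(\sqrt{L})$ figure can be checked with a simplified counting model. The sketch below assumes every layer output costs one unit of memory and that the network is checkpointed in equal segments: the checkpoints are kept for the whole step, and one segment's activations are live while it is replayed in the backward pass. The helper `peak_stored_activations` is hypothetical and does not measure real VRAM.

```python
import math


def peak_stored_activations(num_layers, segment_size):
    """Count layer outputs held at peak when checkpointing every `segment_size` layers.

    Simplified model: one unit per layer output, all layers the same size.
    """
    num_checkpoints = math.ceil(num_layers / segment_size)  # kept for the whole step
    recomputed = segment_size                               # live while one segment is replayed
    return num_checkpoints + recomputed


L = 64
no_checkpointing = L  # every layer output is kept until the backward pass
best_k = min(range(1, L + 1), key=lambda k: peak_stored_activations(L, k))
print("best segment size:", best_k)
print("peak units with checkpointing:", peak_stored_activations(L, best_k))
print("peak units without checkpointing:", no_checkpointing)
```

For 64 layers the best segment size in this model is 8, which is $\sqrt{64}$, and the peak drops from 64 units to 16.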


PYTHON
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TransformerBlock(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
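        # Instead of self.layer(x), use checkpoint so the layer's internal
        # activations are recomputed during backward instead of being stored.
        # use_reentrant=False selects the non-reentrant implementation that
        # recent PyTorch releases recommend.
        return checkpoint(self.layer, x, use_reentrant=False)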

Hands-on Exercise

Setup: If you have access to a multi-GPU node, initialize a torch.distributed process group.
Task: Create a simple 4-layer MLP. Use torch.nn.Sequential to wrap your layers.
Challenge: Manually partition the first two layers to cuda:0 and the last two to cuda:1. Implement a forward pass that moves tensors between devices using .to('cuda:x').
Checkpointing: Wrap the middle layers with torch.utils.checkpoint and monitor VRAM usage using torch.cuda.memory_allocated() before and after the change.

Common Pitfalls

Communication Overhead: Tensor Parallelism requires frequent collective operations (all-gather or all-reduce) in every layer. If your interconnect is slow (e.g., PCIe or cross-node Ethernet rather than NVLink), the communication latency will negate the performance gains of splitting the tensors.
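Pipeline Bubbles: Using too few micro-batches leaves pipeline stages idle while they wait for work. Keep the number of micro-batches several times larger than the number of pipeline stages, and check that micro_batch_size * num_micro_batches matches the total batch size each pipeline processes per step.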
Checkpointing Overhead: While checkpointing saves memory, it increases compute time by roughly 20-30% because activations are calculated twice. Only use it when you actually hit OOM (Out of Memory) errors.

Recap

Tensor Parallelism splits individual layers across GPUs; use it for the large matrix multiplications inside a transformer block, ideally over a fast interconnect.
Pipeline Parallelism assigns groups of whole layers to different devices; use it when the model is too deep or too large for one GPU.
Activation Checkpointing reduces the memory footprint of deep networks by discarding intermediate activations and recomputing them during the backward pass.

By combining these techniques with the data-parallel strategies we discussed previously, you can train models far larger than what fits in a single GPU's memory.

Up next: Efficient Dataset Loading and Prefetching — we'll ensure your data pipeline keeps up with your newly scaled model.
