Mahamudul Hasan Rubel
Lesson 13 of the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML · June 27, 2026 · 4 min read

Tensor and Pipeline Parallelism: Scaling Large Model Training

Learn to scale models beyond single-GPU memory limits. Master Tensor Parallelism, Pipeline Parallelism, and activation checkpointing for efficient training.

Deep Learning, LLMs, Distributed Systems, DeepSpeed, PyTorch, ai, machine-learning, python

Previously in this course, we explored Data Parallelism Strategies: Scaling PyTorch with DDP. While Data Parallelism is excellent for scaling throughput by replicating the model, it fails when your model is too large to fit into the memory (VRAM) of a single GPU.

In this lesson, we move beyond data replication to model partitioning. We’ll cover how to split your model across devices using Tensor and Pipeline Parallelism, and how to use activation checkpointing to trade compute for memory.

Understanding Model Parallelism

When a model’s parameters, gradients, and optimizer states exceed the VRAM of a single GPU, you must employ Model Parallelism. Unlike Data Parallelism, where each GPU holds a full copy of the model, Model Parallelism divides the model's structure itself.
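
To get a feel for when this limit is reached, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam, where a common accounting is about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, and 12 for fp32 master weights, momentum, and variance). The 7B-parameter model and the 80 GB GPU are hypothetical values chosen for illustration, and activations are not counted at all.

```python
def training_state_gb(num_params, bytes_per_param=16):
    """Approximate memory for weights, gradients, and Adam states, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


model_gb = training_state_gb(7_000_000_000)  # hypothetical 7B-parameter model
gpu_gb = 80                                  # hypothetical 80 GB accelerator

print(f"training state: {model_gb:.0f} GB, fits on one GPU: {model_gb <= gpu_gb}")
```

Under these assumptions the training state alone exceeds the device, before a single activation is stored.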

Tensor Parallelism (TP)

Tensor Parallelism splits individual layers across multiple GPUs. For a dense layer $Y = XA$, we can partition the weight matrix $A$ along its columns ($A = [A_1, A_2]$). Each GPU computes a partial output ($Y_1 = XA_1$, $Y_2 = XA_2$), and an AllGather operation concatenates the pieces into $Y = [Y_1, Y_2]$. Partitioning $A$ along its rows instead produces partial sums that are combined with an AllReduce. This is highly effective for large Transformer layers, such as the attention projections or MLP blocks.
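
The column split can be checked numerically without any GPUs. The sketch below is only an illustration: plain Python lists stand in for tensors, the two "devices" are just list entries, and the helper names (`matmul`, `split_columns`, `all_gather_columns`) are made up for this example rather than taken from any library.

```python
def matmul(x, a):
    """Multiply matrix x (n x k) by matrix a (k x m), both given as nested lists."""
    return [[sum(xi * aj for xi, aj in zip(row, col)) for col in zip(*a)] for row in x]


def split_columns(a, parts):
    """Split matrix a into `parts` equal-width column blocks, one per device."""
    width = len(a[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in a] for p in range(parts)]


def all_gather_columns(partials):
    """Concatenate per-device outputs along the column dimension."""
    return [sum((part[i] for part in partials), []) for i in range(len(partials[0]))]


X = [[1, 2], [3, 4]]
A = [[1, 0, 2, 1], [0, 1, 1, 3]]

shards = split_columns(A, parts=2)                  # A = [A_1, A_2]
partials = [matmul(X, shard) for shard in shards]   # Y_i = X A_i on "device" i
Y = all_gather_columns(partials)                    # Y = [Y_1, Y_2]

assert Y == matmul(X, A)  # sharded result matches the unsharded layer
```

Each shard only ever needs its own slice of $A$, which is where the memory saving comes from; the price is the gather step at the end.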

Pipeline Parallelism (PP)

Pipeline Parallelism partitions the model by depth, assigning contiguous groups of layers to different GPUs. If you have a 48-layer transformer, you might place layers 1-12 on GPU 0, 13-24 on GPU 1, and so on. This creates a "pipeline" where data flows from one device to the next.
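
As a small sketch of that layer assignment, the helper below splits a layer count into contiguous, near-equal stages. The function name `assign_stages` is hypothetical, and the example assumes four GPUs, which is what twelve layers per device implies for a 48-layer model. Real frameworks may balance by parameter count or profiled time rather than by layer count.

```python
def assign_stages(num_layers, num_stages):
    """Return (first_layer, last_layer) per stage, 1-indexed and inclusive."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 1
    for stage in range(num_stages):
        # The first `extra` stages absorb one leftover layer each.
        size = base + (1 if stage < extra else 0)
        stages.append((start, start + size - 1))
        start += size
    return stages


for gpu, (first, last) in enumerate(assign_stages(48, 4)):
    print(f"GPU {gpu}: layers {first}-{last}")
```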

Implementing Pipeline Stages

Pipeline Parallelism introduces the "bubble" problem: if GPU 1 is waiting for GPU 0 to finish its forward pass, GPU 1 sits idle. We mitigate this by splitting the batch into smaller "micro-batches," allowing multiple GPUs to work on different parts of the pipeline simultaneously.
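
The effect of micro-batching on the bubble can be estimated with a simple model. The sketch below assumes a GPipe-style schedule in which every stage takes the same time per micro-batch, so the idle fraction is $(p - 1) / (m + p - 1)$ for $p$ stages and $m$ micro-batches. The function name is hypothetical, and real schedules with uneven stages or communication delays will deviate from this.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Idle fraction of a GPipe-style pipeline with equal-cost stages."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With a single micro-batch, a 4-stage pipeline is idle three quarters of the time; the idle share shrinks steadily as the number of micro-batches grows.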

In modern distributed systems, DeepSpeed provides a robust abstraction for these patterns. Here is how you define a simple pipeline stage:

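PYTHON
import deepspeed
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

# PipelineModule needs an initialized process group, so run this script
# with a distributed launcher (e.g. the `deepspeed` CLI) on at least 2 GPUs.
deepspeed.init_distributed()

# Describe the model as a flat list of layers. LayerSpec defers construction,
# so each layer is only built on the stage that owns it.
layers = [
    LayerSpec(nn.Linear, 1024, 1024),
    LayerSpec(nn.ReLU),
    LayerSpec(nn.Linear, 1024, 1024),
]

# Split the layers into 2 pipeline stages, balancing by parameter count
model = PipelineModule(
    layers=layers,
    num_stages=2,
    partition_method='parameters',
)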
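
The micro-batch count for such a pipeline is usually set through the DeepSpeed config rather than in the model code. The fragment below is a sketch under stated assumptions: the three keys are standard DeepSpeed batch settings, the numeric values are hypothetical, and it relies on the understanding that the pipeline engine uses `gradient_accumulation_steps` as the number of micro-batches per optimizer step. Check this against the DeepSpeed version you use.

```python
# Hypothetical batch settings for a 2-stage pipeline on 2 GPUs (no data parallelism).
data_parallel_size = 1

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # samples per micro-batch
    "gradient_accumulation_steps": 16,    # micro-batches per optimizer step
    "train_batch_size": 64,               # samples per optimizer step
}

# DeepSpeed expects these three values to be mutually consistent.
expected = (
    ds_config["train_micro_batch_size_per_gpu"]
    * ds_config["gradient_accumulation_steps"]
    * data_parallel_size
)
assert ds_config["train_batch_size"] == expected
```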

Activation Checkpointing: Trading Compute for Memory

Even with parallelism, storing activations for backpropagation consumes massive amounts of VRAM. Activation Checkpointing (or gradient checkpointing) solves this by discarding intermediate activations during the forward pass and recomputing them on-the-fly during the backward pass.

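When checkpoints are placed roughly every $\sqrt{L}$ layers, this reduces activation memory from $O(L)$ to $O(\sqrt{L})$, where $L$ is the number of layers, at the cost of one additional forward computation per segment.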
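
The square-root behaviour falls out of a simple count. The sketch below uses a simplified model: with a checkpoint every `k` layers, we keep about `L / k` checkpointed activations plus up to `k` activations for the one segment being recomputed. The function name and the unit "one activation per layer" are assumptions made for illustration; real layers differ in size.

```python
import math


def peak_stored_activations(num_layers, segment_size):
    """Stored checkpoints plus the activations of one recomputed segment."""
    return math.ceil(num_layers / segment_size) + segment_size


L = 64
best_k = min(range(1, L + 1), key=lambda k: peak_stored_activations(L, k))

print(f"no checkpointing: {L} activations")
print(f"best segment size: {best_k}, peak: {peak_stored_activations(L, best_k)}")
```

For 64 layers the minimum lands at a segment size of 8, which is $\sqrt{64}$, with a peak of 16 stored activations instead of 64.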

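PYTHON
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TransformerBlock(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        # Instead of calling self.layer(x) directly, run it under checkpoint:
        # its intermediate activations are dropped after the forward pass and
        # recomputed during backward. use_reentrant=False selects the
        # non-reentrant implementation recommended by recent PyTorch releases.
        return checkpoint(self.layer, x, use_reentrant=False)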

Hands-on Exercise

  1. Setup: If you have access to a multi-GPU node, initialize a torch.distributed process group.
  2. Task: Create a simple 4-layer MLP. Use torch.nn.Sequential to wrap your layers.
  3. Challenge: Manually partition the first two layers to cuda:0 and the last two to cuda:1. Implement a forward pass that moves tensors between devices using .to('cuda:x').
  4. Checkpointing: Wrap the middle layers with torch.utils.checkpoint.checkpoint and compare peak VRAM usage using torch.cuda.max_memory_allocated() before and after the change, calling torch.cuda.reset_peak_memory_stats() between runs.

Common Pitfalls

  • Communication Overhead: Tensor Parallelism requires collective operations (all-reduce or all-gather) in every sharded layer. If your interconnect is slow (e.g., PCIe or cross-node Ethernet rather than NVLink), the communication latency will negate the performance gains of splitting the tensors.
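  • Pipeline Bubbles: Using too few micro-batches leaves stages idle while the pipeline fills and drains. Splitting a batch into micro-batches does not change the total batch size (micro_batch_size * num_micro_batches); what keeps the pipeline full is having many more micro-batches than stages, and a common rule of thumb is at least 4x the stage count.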
  • Checkpointing Overhead: While checkpointing saves memory, it increases compute time by roughly 20-30% because activations are calculated twice. Only use it when you actually hit OOM (Out of Memory) errors.

Recap

  • Tensor Parallelism splits layers internally; use it for compute-bound operations within a single transformer block.
  • Pipeline Parallelism splits layers across devices; use it when the model is too deep/large for one GPU.
  • Activation Checkpointing is your primary tool for reducing the memory footprint of deep networks by discarding intermediate activations.

By combining these techniques with the data-parallel strategies we discussed previously, you can train models far larger than a single GPU's memory would allow.

Up next: Efficient Dataset Loading and Prefetching — we'll ensure your data pipeline keeps up with your newly scaled model.
