Learn how to implement gradient accumulation to simulate large batch sizes on memory-constrained hardware and maintain training stability with effective LR scaling.
Previously in this course, we explored Distributed Optimizer States: Mastering ZeRO for Massive Models to shard memory across multiple nodes. While that approach is essential for massive models, sometimes you are restricted by the physical VRAM of a single local GPU. This lesson adds Gradient Accumulation, a method to simulate large-batch training on constrained hardware without requiring any distributed infrastructure.
In deep learning, the batch size is a primary hyperparameter that influences both the quality of the gradient estimate and the convergence speed. Larger batches typically provide a lower-variance estimate of the true gradient, leading to smoother loss curves and better utilization of hardware parallelism.
However, the memory footprint of a forward pass—specifically the stored activations required for the backward pass—scales linearly with the batch size. When training large Transformers or deep CNNs, you often find that a batch size of even 1 or 2 exhausts your available VRAM.
Gradient accumulation decouples the hardware batch size (how many samples fit in VRAM) from the effective batch size (how many samples contribute to a single optimizer step).
Instead of updating the model weights after every mini-batch, we perform multiple forward and backward passes, letting each loss.backward() call add its gradients into the .grad buffers of our parameters. We only call optimizer.step() once we have reached the desired effective batch size.
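The relationship between the two batch sizes is simple arithmetic, sketched below. The helper name `accumulation_steps_for` and the example sizes are illustrative assumptions, not part of any library.

```python
import math


def accumulation_steps_for(effective_batch_size: int, micro_batch_size: int) -> int:
    """Number of micro-batches to accumulate before a single optimizer step.

    Hypothetical helper: rounds up, so the realized effective batch size is
    at least the requested one.
    """
    if effective_batch_size <= 0 or micro_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return math.ceil(effective_batch_size / micro_batch_size)


# Example: the GPU fits 8 samples, but we want each update to see 64.
steps = accumulation_steps_for(effective_batch_size=64, micro_batch_size=8)
print(steps)      # 8 forward/backward passes per optimizer step
print(steps * 8)  # 64 samples contribute to each update
```

When the requested effective batch size is not a multiple of the hardware batch size, the rounded-up step count gives a slightly larger effective batch than requested.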
The mathematical intuition is simple: since the gradient of the loss is additive over samples in a batch, the average gradient of $N$ batches of size $M$ is equivalent to the gradient of a single batch of size $N \times M$ (assuming the loss function is the mean of individual losses).
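Written out, with $\ell_i(\theta)$ denoting the per-sample loss and $B_1, \dots, B_N$ denoting $N$ disjoint micro-batches of $M$ samples each (notation introduced here for illustration), the argument is:

$$
L_j(\theta) = \frac{1}{M} \sum_{i \in B_j} \ell_i(\theta)
\quad\Longrightarrow\quad
\frac{1}{N} \sum_{j=1}^{N} \nabla_\theta L_j(\theta)
= \frac{1}{N M} \sum_{j=1}^{N} \sum_{i \in B_j} \nabla_\theta \ell_i(\theta)
= \nabla_\theta \left( \frac{1}{N M} \sum_{i \in B_1 \cup \dots \cup B_N} \ell_i(\theta) \right)
$$

The right-hand side is the gradient of the mean loss over the combined batch of $N \times M$ samples. The equality relies on all micro-batches having the same size and on the model's forward pass treating each sample independently.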
To implement this in PyTorch, you must ensure your loss is normalized correctly. If you perform $K$ accumulation steps, each step's loss should be divided by $K$ so that the final accumulated gradient represents the mean over the total effective batch.
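```python
import torch


def train_step(model, dataloader, criterion, optimizer, accumulation_steps=4):
    """Run one pass over dataloader, stepping every accumulation_steps micro-batches."""
    model.train()
    optimizer.zero_grad()

    for i, (inputs, targets) in enumerate(dataloader):
        outputs = model(inputs)
        # criterion is passed in as an argument; it should use mean reduction
        loss = criterion(outputs, targets)

        # Normalize loss to account for accumulation
        loss = loss / accumulation_steps
        loss.backward()  # adds into each parameter's .grad buffer

        # Update the weights only once per accumulation cycle
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```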
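As a sanity check on the normalization, the sketch below compares the gradient from one full batch against the gradient accumulated over several micro-batches. The tiny linear model, the random data, and the sizes (8 samples split into 4 micro-batches of 2) are assumptions chosen only to keep the example small.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()  # default reduction is the mean
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

# Reference: a single backward pass over the full batch of 8 samples.
model.zero_grad()
criterion(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 2 samples, each loss divided by 4.
accumulation_steps = 4
model.zero_grad()
for x, y in zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    (criterion(model(x), y) / accumulation_steps).backward()
accum_grad = model.weight.grad.clone()

# The two gradients agree up to floating-point rounding.
print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

Removing the division by `accumulation_steps` makes the accumulated gradient four times larger than the reference, which is an easy way to see why the normalization matters.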
When you scale the effective batch size, you alter the noise profile of the gradient descent process. A larger effective batch size reduces the variance of the gradient, which often allows for a higher learning rate.
The most common heuristic is the Linear Scaling Rule: if you increase your effective batch size by a factor of $k$, you can often increase your learning rate by a factor of $k$. A more conservative square-root variant, scaling by $\sqrt{k}$, is often used with adaptive optimizers such as Adam.
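Both variants of the heuristic fit in a few lines. The function name `scale_learning_rate` and the example values below are hypothetical; treat the result as a starting point for tuning rather than a guaranteed setting.

```python
import math


def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Scale a learning rate for a new effective batch size (hypothetical helper)."""
    k = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * math.sqrt(k)
    raise ValueError(f"unknown rule: {rule!r}")


# Going from an effective batch of 32 to 128 gives k = 4.
print(scale_learning_rate(0.1, 32, 128, rule="linear"))  # 0.4
print(scale_learning_rate(0.1, 32, 128, rule="sqrt"))    # 0.2
```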
However, this is not a hard rule. Always monitor your training loss for spikes. If you notice divergence after increasing the batch size, revert to a smaller learning rate and use a warmup phase, as discussed in High-Dimensional Optimization Landscapes: Mastering AdamW and Schedulers.
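One way to add a warmup phase is a linear ramp built with `torch.optim.lr_scheduler.LambdaLR`. This is a minimal sketch: the 5-step warmup length, the AdamW settings, and the placeholder model are assumptions for illustration. With gradient accumulation, the scheduler should advance once per optimizer step, not once per micro-batch.

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup_steps = 5  # hypothetical warmup length, counted in optimizer steps


def warmup_factor(step):
    # Ramp the learning rate linearly up to its target value, then hold it.
    return min(1.0, (step + 1) / warmup_steps)


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

lrs = []
for _ in range(8):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # in real training this follows the accumulated backward passes
    scheduler.step()   # advance the schedule once per weight update

print(lrs)  # rises from 2e-4 to 1e-3 over the first 5 steps, then stays flat
```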
Training stability is sensitive to batch size. Larger batches may converge to "sharp" minima, while smaller batches can act as a form of regularization, helping the model find "flatter" minima.
| Strategy | Memory Usage | Gradient Variance | Training Speed |
|---|---|---|---|
| Small Batch | Low | High | Fast per step, lower throughput |
| Gradient Accumulation | Low | Low | Slower per update (sequential passes) |
| Large Batch (Native) | High | Low | Faster (hardware-bound) |
- Experiment with the accumulation_steps parameter.
- Track memory usage with nvidia-smi or torch.cuda.memory_summary() and compare it to a run where you try to fit 8 samples at once.

- Forgetting optimizer.zero_grad(): If you don't zero out gradients after the step(), you will carry over gradients from previous accumulation cycles, corrupting your updates.
- Skipping loss normalization: Not dividing the loss by accumulation_steps produces gradients that are $K$ times larger than intended. With plain SGD this acts like a learning rate $K$ times larger, which often leads to training divergence.
- Ignoring BatchNorm: If your model uses BatchNorm layers, each micro-batch is normalized with its own statistics and the running statistics are updated on every forward pass. With an accumulation step of 1, this is fine. With larger steps, the statistics are updated more frequently than the weights and reflect the small hardware batch rather than the effective batch, so accumulation no longer matches true large-batch training. SyncBatchNorm only helps when the batch is split across devices; on a single GPU, prefer a batch-independent layer such as RMSNorm, as detailed in Normalization Techniques at Scale: Implementing RMSNorm.

Gradient accumulation is a critical tool for practitioners working with limited hardware. By decoupling the physical batch size from the effective batch size, we can approximate large-batch training that would otherwise require far more GPU memory. Remember: adjust your learning rate when changing effective batch sizes, keep an eye on BatchNorm behavior, and always normalize your loss to keep the optimizer's updates consistent.
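The BatchNorm caveat is easy to observe directly. In this small sketch (the layer size, micro-batch shape, and 4-step cycle are assumptions), the layer's `num_batches_tracked` buffer counts one statistics update per forward pass, even though an accumulation loop would apply only one weight update over the same cycle.

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(3)
bn.train()

accumulation_steps = 4
# 4 micro-batches, each with 2 samples and 3 features
micro_batches = torch.randn(accumulation_steps, 2, 3)

for x in micro_batches:
    bn(x)  # every forward pass in train mode updates running_mean / running_var

# One accumulation cycle: 4 running-statistics updates, 1 optimizer step.
print(int(bn.num_batches_tracked))  # 4
```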
Up next: We will move into multi-modal inputs, looking at how to integrate vision encoders into our existing Transformer architecture.
Gradient Accumulation and Batch Sizing