Stop letting your GPU idle. Learn to implement streaming data loaders and optimize I/O pipelines to eliminate CPU-GPU bottlenecks in LLM training.
Previously in this course, we explored data parallelism strategies for distributing training across multiple devices. However, even the most advanced distributed training setup will underperform if your GPUs spend half their time waiting for the next batch of data.
When training LLMs, the "Data Loading" phase is often the silent killer of performance. If your CPU cannot prepare, tokenize, and augment data faster than the GPU can perform a forward/backward pass, your training throughput will plummet. This lesson focuses on maximizing Data Loading efficiency to ensure your hardware investment isn't wasted on idle cycles.
In an ideal training loop, the GPU is saturated with work. In a bottlenecked loop, the GPU finishes a step and then sits idle until the CPU delivers the next batch, which shows up as a "sawtooth" utilization pattern.
To solve this, we must treat data preparation as a producer-consumer problem. The CPU (producer) must stay ahead of the GPU (consumer). We achieve this through prefetching and asynchronous I/O.
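The producer-consumer idea can be sketched in plain Python with a background thread and a bounded queue. This is only an illustration of the pattern, not how PyTorch implements it; the helper name prefetch and its buffer_size parameter are hypothetical.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread produces ahead.

    The bounded queue lets the producer run at most `buffer_size` items
    ahead of the consumer, so memory use stays capped.
    """
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # unique marker signalling the end of the stream

    def producer():
        try:
            for item in iterable:
                q.put(item)  # blocks while the buffer is full
        finally:
            q.put(sentinel)  # always tell the consumer we are done

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = q.get()  # blocks only if the producer has fallen behind
        if item is sentinel:
            return
        yield item


# Usage: wrap any batch iterator so preparation overlaps with consumption.
batches = (list(range(i, i + 4)) for i in range(0, 12, 4))
consumed = list(prefetch(batches, buffer_size=2))
```

A bounded buffer is the key design choice: an unbounded one would let a fast producer fill RAM, while a buffer of zero would put the consumer back in lockstep with the producer. For brevity this sketch does not forward producer exceptions to the consumer.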
Most beginners load an entire dataset into RAM. With LLM-scale corpora (terabytes of text), this is impossible. We must switch to streaming data loaders, where data is read from disk in chunks, processed, and yielded in a pipelined fashion.
| Strategy | Memory Footprint | Latency | Suitability |
|---|---|---|---|
| In-Memory | High (O(N)) | Low | Small datasets |
| Batch Loading | Medium (O(Batch)) | Medium | Standard CNNs |
| Streaming | Constant (O(1)) | High (initial) | LLMs/Huge Datasets |
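To make the streaming row concrete, here is a minimal sketch in plain Python that reads shards lazily and groups lines into batches, so memory use depends on the batch size rather than the dataset size. The function names stream_lines and in_batches are illustrative, not part of any library.

```python
from itertools import islice


def stream_lines(paths):
    """Lazily yield non-empty, stripped lines from each shard in turn."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:  # the file object reads incrementally
                line = line.strip()
                if line:
                    yield line


def in_batches(iterator, batch_size):
    """Group any iterator into lists of at most `batch_size` items."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because both functions are generators, nothing is read until the training loop asks for the next batch, which is also why the first batch arrives with higher latency than in the in-memory approach.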
For LLM training, we often use IterableDataset in PyTorch to stream data from shards (e.g., JSONL or Parquet files). The key is to decouple the reading of the file from the processing (tokenization/augmentation).
```python
import torch
from torch.utils.data import IterableDataset, DataLoader


class StreamingTextDataset(IterableDataset):
    def __init__(self, file_paths, tokenizer):
        self.file_paths = file_paths
        self.tokenizer = tokenizer

    def __iter__(self):
        # Worker-specific file selection: give each worker a disjoint
        # subset of shards so samples are not duplicated across workers.
        worker_info = torch.utils.data.get_worker_info()
        paths = self.file_paths
        if worker_info is not None:
            paths = paths[worker_info.id::worker_info.num_workers]
        for path in paths:
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    # CPU-heavy tokenization happens here
                    yield self.tokenizer(line, truncation=True, padding="max_length")


# `files` (a list of shard paths) and `tokenizer` are assumed to be defined elsewhere.
# Optimization: num_workers > 0 enables multiprocessing.
# pin_memory=True speeds up the transfer from CPU to GPU.
loader = DataLoader(
    StreamingTextDataset(files, tokenizer),
    batch_size=64,
    num_workers=8,
    pin_memory=True,
    prefetch_factor=2,
)
```
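With an IterableDataset, every DataLoader worker runs its own copy of __iter__, so each worker needs a different subset of shards or the same samples are read several times. When data parallelism is also in play, each training process needs its own slice as well. The sketch below shows one common way to split shards across both levels. The function name shards_for is hypothetical; in a real dataset, worker_id and num_workers would come from torch.utils.data.get_worker_info(), and rank and world_size from torch.distributed.

```python
def shards_for(paths, rank, world_size, worker_id, num_workers):
    """Return the shards owned by one DataLoader worker on one training process.

    Every (rank, worker_id) pair gets a unique global index, and shards
    are dealt out round-robin across all of them.
    """
    global_id = rank * num_workers + worker_id
    total_consumers = world_size * num_workers
    return paths[global_id::total_consumers]


# Example: 16 shards, 2 training processes, 4 workers each.
paths = [f"shard-{i:03d}.jsonl" for i in range(16)]
assignment = {
    (rank, worker): shards_for(paths, rank, 2, worker, 4)
    for rank in range(2)
    for worker in range(4)
}
```

Round-robin slicing only balances well when the shard count is a multiple of world_size * num_workers and shards are of similar size; otherwise some consumers finish early or receive no shards at all.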
- num_workers: A reasonable starting point is the number of available CPU cores; tune from there. If you set it too high, context-switching and memory overhead will degrade performance.
- pin_memory=True: This places batches in page-locked (pinned) host memory, which allows faster, asynchronous transfers to the GPU via DMA (Direct Memory Access).
- prefetch_factor: This determines how many batches each worker loads in advance. Setting it to 2 or 4 is usually sufficient to hide I/O latency.

In our running project, we previously set up the architecture. Now, we must update our training script to ingest the tokenized sequences using the StreamingTextDataset defined above. This helps ensure that, as we move toward full-scale training, I/O throughput keeps pace with compute.
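As a rough check on these settings, note that prefetch_factor applies per worker, so the number of batches held in flight is num_workers * prefetch_factor. The sketch below estimates the host memory this implies for the token IDs alone. The sequence length of 2048 and the 8 bytes per token (int64) are assumptions for illustration, and the helper names are hypothetical.

```python
def inflight_batches(num_workers, prefetch_factor):
    """Total batches prefetched across all workers."""
    return num_workers * prefetch_factor


def prefetch_buffer_bytes(num_workers, prefetch_factor, batch_size,
                          seq_len, bytes_per_token=8):
    """Approximate host memory held by prefetched token IDs.

    Counts only one integer tensor per batch (e.g. input_ids); an
    attention mask of the same dtype would roughly double the figure.
    """
    tokens_per_batch = batch_size * seq_len
    return inflight_batches(num_workers, prefetch_factor) * tokens_per_batch * bytes_per_token


# Settings from the DataLoader example, plus an assumed seq_len of 2048.
total = prefetch_buffer_bytes(num_workers=8, prefetch_factor=2,
                              batch_size=64, seq_len=2048)
print(f"{total / 2**20:.0f} MiB")  # prints "16 MiB"
```

For text this buffer is small, which is why raising prefetch_factor is cheap here; with pin_memory enabled, the pinned copies of these batches count against page-locked memory as well.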
Modify your existing training loop to implement a custom IterableDataset.
1. Write an IterableDataset that reads your dataset file line by line.
2. Use torch.utils.benchmark to measure the time taken to iterate through the dataset with num_workers=0 vs num_workers=4.
3. Watch nvidia-smi while running both versions.

- With num_workers=0, data loading runs in the main training process and will likely be bound by CPU speed.
- If you do heavy processing inside the __iter__ loop, the CPU will become the bottleneck again. Offload as much as possible to the GPU (e.g., using kornia for images or torch-native operations for text).
- Use libraries such as webdataset or Hugging Face datasets to handle sharding efficiently.

Efficient data loading is the foundation of high-performance training. By using IterableDataset for streaming, multi-process worker pools, and pin_memory, you can hide most I/O latency. Remember: your model is only as fast as the data you feed it.
Up next: We will discuss Fine-tuning Methodologies Overview, focusing on selecting the right strategies for domain adaptation when resources are constrained.
Efficient Dataset Loading and Prefetching