
124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level
TL;DR: PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU workloads. We reproduced a real PyTorch issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause. The GPU wasn't slow - it was starving. DataLoader workers generated 200,000 CPU context switches and 300,000 page allocations in 40 seconds, leaving the GPU waiting an average of 301ms per data transfer that should take microseconds.

The Problem

A PyTorch user reported that DataLoader was 7-22x slower than direct tensor indexing for a simple MLP inference workload. Even with num_workers=12, pin_memory=True, and prefetch_factor=12, the gap remained massive. GPU utilization sat at 10-20%.

We reproduced it, and the gap was even worse on our hardware:

| Method | Time | vs Direct |
|---|---|---|
| Direct tensor indexing | 0.39s | 1x |
| DataLoader (shuffle=True) | 48.49s | 124x slower |
| DataLoader (optimized, 4 workers, pin_memory) | 43.29s | 111x slower |

The workload is trivial: 7M samples, 100 feature
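The comparison above can be sketched as a minimal CPU-side benchmark. This is not the article's exact script: the sizes are scaled down from 7M samples, and names like `batch_size` are assumptions. It contrasts slicing contiguous batches straight from an in-memory tensor with iterating a `DataLoader`, which routes every sample through `__getitem__` and a collate step.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Scaled-down stand-in for the article's workload (assumed sizes).
n_samples, n_features, batch_size = 100_000, 100, 1024
data = torch.randn(n_samples, n_features)

# Direct tensor indexing: slice contiguous batches from the tensor.
t0 = time.perf_counter()
n_direct = 0
for i in range(0, n_samples, batch_size):
    batch = data[i:i + batch_size]
    n_direct += 1
direct_s = time.perf_counter() - t0

# DataLoader: each sample passes through __getitem__ and is collated
# into a batch, adding per-sample Python overhead.
loader = DataLoader(TensorDataset(data), batch_size=batch_size, shuffle=True)
t0 = time.perf_counter()
n_loader = 0
for (batch,) in loader:
    n_loader += 1
loader_s = time.perf_counter() - t0

print(f"direct: {direct_s:.4f}s  dataloader: {loader_s:.4f}s  "
      f"ratio: {loader_s / direct_s:.1f}x")
```

Both loops see the same number of batches; only the path each batch takes differs, which is what the timing gap isolates.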