
124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level
TL;DR: PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU workloads. We reproduced a real PyTorch issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause. The GPU wasn't slow - it was starving. DataLoader workers generated 200,000 CPU context switches and 300,000 page allocations in 40 seconds, leaving the GPU waiting an average of 301ms per data transfer that should take microseconds.

The Problem

A PyTorch user reported that DataLoader was 7-22x slower than direct tensor indexing for a simple MLP inference workload. Even with num_workers=12, pin_memory=True, and prefetch_factor=12, the gap remained massive. GPU utilization sat at 10-20%.

We reproduced it, and the gap was even worse on our hardware:

| Method | Time | vs Direct |
|---|---|---|
| Direct tensor indexing | 0.39s | 1x |
| DataLoader (shuffle=True) | 48.49s | 124x slower |
| DataLoader (optimized, 4 workers, pin_memory) | 43.29s | 111x slower |

The workload is trivial: 7M samples, 100 feature
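The comparison above can be sketched as a minimal CPU-side benchmark. This is not the article's exact script: the sizes are scaled down from 7M samples, and names like `batch_size` are assumptions. It contrasts slicing contiguous batches straight from an in-memory tensor with iterating a `DataLoader`, which routes every sample through `__getitem__` and a collate step.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Scaled-down stand-in for the article's workload (assumed sizes).
n_samples, n_features, batch_size = 100_000, 100, 1024
data = torch.randn(n_samples, n_features)

# Direct tensor indexing: slice contiguous batches from the tensor.
t0 = time.perf_counter()
n_direct = 0
for i in range(0, n_samples, batch_size):
    batch = data[i:i + batch_size]
    n_direct += 1
direct_s = time.perf_counter() - t0

# DataLoader: each sample passes through __getitem__ and is collated
# into a batch, adding per-sample Python overhead.
loader = DataLoader(TensorDataset(data), batch_size=batch_size, shuffle=True)
t0 = time.perf_counter()
n_loader = 0
for (batch,) in loader:
    n_loader += 1
loader_s = time.perf_counter() - t0

print(f"direct: {direct_s:.4f}s  dataloader: {loader_s:.4f}s  "
      f"ratio: {loader_s / direct_s:.1f}x")
```

Both loops see the same number of batches; only the path each batch takes differs, which is what the timing gap isolates.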