
Tracing a 13x PyTorch Slowdown to a Hidden NumPy Synchronization
TL;DR: A `.cpu().numpy()` call buried inside a forward pass was forcing a full CPU-GPU synchronization on every batch, every loop iteration. The GPU would finish its work in milliseconds, then sit idle for ~2 seconds waiting for Python and NumPy to catch up. Replacing the NumPy logic with pure PyTorch ops gave a 6.4x speedup on a T4 and 13x on an RTX 5080. The fix is two lines of code.

The bug

Swin-MAE is a masked autoencoder built on Swin Transformers. A user training on 5.2 million images with an RTX 5090 noticed that GPU utilization kept dropping to ~30% during the forward pass. The model would spike, stall, spike, stall. The problem was in `window_masking`, the function that decides which image patches to mask during training. Here is what the code looked like:

```python
# The hot loop: runs once per batch, every forward pass
for i in range(B):
    index_mask[i] = np.setdiff1d(index_all, index_keep.cpu().numpy()[i])
for i in range(B):
    x_masked[i, index_mask.cpu().numpy(
```
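The excerpt above ends before the fix, but the pure-PyTorch replacement the TL;DR describes can be sketched. One common way to avoid a per-row `np.setdiff1d` entirely: if the kept indices come from a random permutation of all patch indices, the masked indices are simply the remaining slice of that same permutation, so both sets can be produced on the GPU with no CPU round-trip. The function name and signature below are illustrative, not taken from the Swin-MAE repository:

```python
import torch

def window_masking_indices(B, L, mask_ratio=0.75, device="cpu"):
    """Return (ids_keep, ids_mask) for B samples of L patches, all on `device`.

    Sketch of the argsort-based masking pattern: a random score per patch,
    argsorted per row, gives a random permutation. Slicing that permutation
    yields the kept indices and their complement without any setdiff1d.
    """
    noise = torch.rand(B, L, device=device)    # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)  # random permutation per row
    len_keep = int(L * (1 - mask_ratio))
    ids_keep = ids_shuffle[:, :len_keep]       # patches to keep
    ids_mask = ids_shuffle[:, len_keep:]       # the complement: patches to mask
    return ids_keep, ids_mask
```

Because every op here runs on the tensor's device and nothing calls `.cpu()`, `.numpy()`, or `.item()`, the GPU never has to wait for Python between batches.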


