
Tracing a 13x PyTorch Slowdown to a Hidden NumPy Synchronization
TL;DR: A `.cpu().numpy()` call buried inside a forward pass was forcing a full CPU-GPU synchronization on every batch, every loop iteration. The GPU would finish its work in milliseconds, then sit idle for ~2 seconds waiting for Python and NumPy to catch up. Replacing the NumPy logic with pure PyTorch ops gave a 6.4x speedup on a T4 and 13x on an RTX 5080. The fix is two lines of code.

The bug

Swin-MAE is a masked autoencoder built on Swin Transformers. A user training on 5.2 million images with an RTX 5090 noticed that GPU utilization kept dropping to ~30% during the forward pass. The model would spike, stall, spike, stall. The problem was in `window_masking`, the function that decides which image patches to mask during training. Here is what the code looked like:

```python
# The hot loop: runs once per batch, every forward pass
for i in range(B):
    index_mask[i] = np.setdiff1d(index_all, index_keep.cpu().numpy()[i])
for i in range(B):
    x_masked[i, index_mask.cpu().numpy(
```
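The excerpt above ends before the fix, but the pure-PyTorch replacement the TL;DR describes can be sketched. One common way to avoid a per-row `np.setdiff1d` entirely: if the kept indices come from a random permutation of all patch indices, the masked indices are simply the remaining slice of that same permutation, so both sets can be produced on the GPU with no CPU round-trip. The function name and signature below are illustrative, not taken from the Swin-MAE repository:

```python
import torch

def window_masking_indices(B, L, mask_ratio=0.75, device="cpu"):
    """Return (ids_keep, ids_mask) for B samples of L patches, all on `device`.

    Sketch of the argsort-based masking pattern: a random score per patch,
    argsorted per row, gives a random permutation. Slicing that permutation
    yields the kept indices and their complement without any setdiff1d.
    """
    noise = torch.rand(B, L, device=device)    # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)  # random permutation per row
    len_keep = int(L * (1 - mask_ratio))
    ids_keep = ids_shuffle[:, :len_keep]       # patches to keep
    ids_mask = ids_shuffle[:, len_keep:]       # the complement: patches to mask
    return ids_keep, ids_mask
```

Because every op here runs on the tensor's device and nothing calls `.cpu()`, `.numpy()`, or `.item()`, the GPU never has to wait for Python between batches.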


