
# I fused 1,500 GPU dispatches into one. Here's what happened.
Every ML framework does GPU computation the same way: send a task to the GPU, wait, send the next one, wait, repeat. For a 1,500-step simulation, that's 22,500 separate GPU commands per generation.

I tried something different. I wrote a WebGPU compute shader that runs the entire 1,500-step simulation in a single GPU dispatch. No round-trips. No waiting. The GPU just loops internally.

## The results (same hardware, no tricks)

On the same Apple M2 Pro:

- WebGPU (Chrome): 46.2 gen/s
- PyTorch MPS: 0.29 gen/s

That's 159x. On embarrassingly parallel workloads (Rastrigin), they're basically tied (1.06x). The advantage is specific to sequential workloads — simulations, RL rollouts, trading strategies — where each step depends on the previous one.

## Why can't PyTorch just do this?

I tested torch.compile with the Inductor backend. It tries to unroll the loop into a single computation graph:

| Timesteps | Result |
| --- | --- |
| 500 | Works, 2x speedup, 25s compile |
| 1,000 | RecursionError |
| 5,000 | OOM killed after 30 min |

The compile
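To make "the GPU just loops internally" concrete, here is a minimal WGSL sketch of a fused kernel. The buffer layout, names, and `step_state` update rule are illustrative assumptions, not the author's actual shader; the point is only that the time loop lives inside the shader, so one dispatch covers all 1,500 steps.

```wgsl
// One dispatch runs the whole simulation: the time loop is inside the shader.
// `state` holds one value per parallel candidate; `step_state` is a
// hypothetical per-step update -- real simulation logic would go here.

@group(0) @binding(0) var<storage, read_write> state : array<f32>;

const NUM_STEPS : u32 = 1500u;

fn step_state(x: f32, t: u32) -> f32 {
    // Placeholder dynamics: each step depends on the previous state,
    // which is exactly why the loop cannot be parallelized across time.
    return x + 0.01 * sin(x + f32(t));
}

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let i = gid.x;
    if (i >= arrayLength(&state)) { return; }

    var x = state[i];
    // All 1,500 steps run here, in one dispatch: no CPU round-trip
    // between steps, no per-step command submission.
    for (var t = 0u; t < NUM_STEPS; t = t + 1u) {
        x = step_state(x, t);
    }
    state[i] = x;
}
```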
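As a back-of-the-envelope illustration of why the dispatch count dominates on sequential workloads, here is a toy cost model in JavaScript. The per-dispatch and per-step costs are made-up assumptions for the sketch, not measurements from the article; only the dispatch counts (22,500 vs. 1) come from the text.

```javascript
// Toy cost model: fixed overhead per GPU dispatch vs. actual compute per step.
// The microsecond figures below are illustrative assumptions, not measurements.
const DISPATCH_OVERHEAD_US = 50; // hypothetical fixed cost per dispatch (submit, sync, ...)
const STEP_COMPUTE_US = 1;       // hypothetical GPU compute time per simulation step
const STEPS = 1500;
const DISPATCHES_PER_STEP = 15;  // 22,500 dispatches / 1,500 steps, per the article

// Step-by-step dispatching: every step pays the round-trip overhead.
const perStepTotalUs =
  STEPS * DISPATCHES_PER_STEP * DISPATCH_OVERHEAD_US + STEPS * STEP_COMPUTE_US;

// Fused: one dispatch up front, then the GPU loops over all steps internally.
const fusedTotalUs = DISPATCH_OVERHEAD_US + STEPS * STEP_COMPUTE_US;

console.log(`per-step dispatch: ${perStepTotalUs} us`); // 1126500 us
console.log(`fused dispatch:    ${fusedTotalUs} us`);   // 1550 us
console.log(`speedup:           ${(perStepTotalUs / fusedTotalUs).toFixed(1)}x`);
```

Under any such model, the fixed overhead scales with the number of dispatches while the useful work stays constant, which is why fusing helps sequential workloads and does nothing for already-parallel ones.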



