
# I fused 1,500 GPU dispatches into one. Here's what happened.
Every ML framework does GPU computation the same way: send a task to the GPU, wait, send the next one, wait, repeat. For a 1,500-step simulation, that's 22,500 separate GPU commands per generation.

I tried something different. I wrote a WebGPU compute shader that runs the entire 1,500-step simulation in a single GPU dispatch. No round-trips. No waiting. The GPU just loops internally.

## The results (same hardware, no tricks)

On the same Apple M2 Pro:

- WebGPU (Chrome): 46.2 gen/s
- PyTorch MPS: 0.29 gen/s

That's 159x. On embarrassingly parallel workloads (Rastrigin), they're basically tied (1.06x). The advantage is specific to sequential workloads — simulations, RL rollouts, trading strategies — where each step depends on the previous one.

## Why can't PyTorch just do this?

I tested torch.compile with the Inductor backend. It tries to unroll the loop into a single computation graph:

| Timesteps | Result |
| --- | --- |
| 500 | Works, 2x speedup, 25s compile |
| 1,000 | RecursionError |
| 5,000 | OOM killed after 30 min |

The compile
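To make "the GPU just loops internally" concrete, here is a minimal WGSL sketch of a fused kernel. The buffer layout, names, and `step_state` update rule are illustrative assumptions, not the author's actual shader; the point is only that the time loop lives inside the shader, so one dispatch covers all 1,500 steps.

```wgsl
// One dispatch runs the whole simulation: the time loop is inside the shader.
// `state` holds one value per parallel candidate; `step_state` is a
// hypothetical per-step update -- real simulation logic would go here.

@group(0) @binding(0) var<storage, read_write> state : array<f32>;

const NUM_STEPS : u32 = 1500u;

fn step_state(x: f32, t: u32) -> f32 {
    // Placeholder dynamics: each step depends on the previous state,
    // which is exactly why the loop cannot be parallelized across time.
    return x + 0.01 * sin(x + f32(t));
}

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let i = gid.x;
    if (i >= arrayLength(&state)) { return; }

    var x = state[i];
    // All 1,500 steps run here, in one dispatch: no CPU round-trip
    // between steps, no per-step command submission.
    for (var t = 0u; t < NUM_STEPS; t = t + 1u) {
        x = step_state(x, t);
    }
    state[i] = x;
}
```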
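As a back-of-the-envelope illustration of why the dispatch count dominates on sequential workloads, here is a toy cost model in JavaScript. The per-dispatch and per-step costs are made-up assumptions for the sketch, not measurements from the article; only the dispatch counts (22,500 vs. 1) come from the text.

```javascript
// Toy cost model: fixed overhead per GPU dispatch vs. actual compute per step.
// The microsecond figures below are illustrative assumptions, not measurements.
const DISPATCH_OVERHEAD_US = 50; // hypothetical fixed cost per dispatch (submit, sync, ...)
const STEP_COMPUTE_US = 1;       // hypothetical GPU compute time per simulation step
const STEPS = 1500;
const DISPATCHES_PER_STEP = 15;  // 22,500 dispatches / 1,500 steps, per the article

// Step-by-step dispatching: every step pays the round-trip overhead.
const perStepTotalUs =
  STEPS * DISPATCHES_PER_STEP * DISPATCH_OVERHEAD_US + STEPS * STEP_COMPUTE_US;

// Fused: one dispatch up front, then the GPU loops over all steps internally.
const fusedTotalUs = DISPATCH_OVERHEAD_US + STEPS * STEP_COMPUTE_US;

console.log(`per-step dispatch: ${perStepTotalUs} us`); // 1126500 us
console.log(`fused dispatch:    ${fusedTotalUs} us`);   // 1550 us
console.log(`speedup:           ${(perStepTotalUs / fusedTotalUs).toFixed(1)}x`);
```

Under any such model, the fixed overhead scales with the number of dispatches while the useful work stays constant, which is why fusing helps sequential workloads and does nothing for already-parallel ones.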



