
Profiling GPU (CUDA) — What Is Actually Limiting Your Kernel?
In my last post I introduced GPU Flight, a lightweight CUDA observability tool that acts like a flight recorder for your GPU. We covered what it collects: system metrics, device capabilities, and per-kernel events. Today I want to talk about one specific metric that GPU Flight captures: occupancy. It is one of the most important numbers for understanding GPU performance, and also one of the most misunderstood.

What Is Occupancy?

A GPU is organized around Streaming Multiprocessors (SMs). Each SM can run many threads simultaneously, not by context-switching like a CPU, but by actually running them in parallel. The unit of scheduling on an SM is a warp: a group of 32 threads that execute the same instruction in lockstep.

An SM has a fixed warp budget, say, 48 warps on a typical Ampere GPU. When you launch a kernel with blocks of 256 threads (8 warps each), the SM can hold up to 6 blocks concurrently to fill those 48 warp slots. If something prevents that, such as too many registers per thread or too much shared memory per block, fewer blocks fit on the SM and occupancy drops.
Continue reading on Dev.to

