
Profiling GPU (CUDA) — What Is Actually Limiting Your Kernel?
In my last post I introduced GPU Flight, a lightweight CUDA observability tool that acts like a flight recorder for your GPU. We covered what it collects: system metrics, device capabilities, and per-kernel events. Today I want to talk about one specific metric that GPU Flight captures: occupancy. It is one of the most important numbers for understanding GPU performance, and also one of the most misunderstood.

What Is Occupancy?

A GPU is organized around Streaming Multiprocessors (SMs). Each SM can run many threads simultaneously, not by context-switching like a CPU, but by actually running them in parallel. The unit of scheduling on an SM is a warp: a group of 32 threads that execute the same instruction in lockstep.

An SM has a fixed warp budget, say, 48 warps on a typical Ampere GPU. When you launch a kernel with blocks of 256 threads (8 warps each), the SM can hold up to 6 blocks concurrently to fill those 48 warp slots. If something prevents that, such as too many registers per thread or too much shared memory per block, fewer blocks fit on the SM and occupancy drops.
Continue reading on Dev.to

