
GCC vs Clang: Same Instructions, Different Performance (AGU Insight)
I noticed something interesting while running a GCC vs Clang benchmark.

Same code. Same machine. Both loops are scalar (no vectorization). Yet GCC consistently used fewer CPU cycles.

At first, this doesn't make sense. If both binaries execute roughly the same instructions and neither loop is vectorized, why is there a performance gap?

🔍 The Missing Piece: It's Not Just Instructions

Most people focus on instruction count and vectorization. But in this case, that's not the full story. What actually matters more is how address computations are structured, how instructions are scheduled, and how well latency is hidden.

⚙️ AGU Pressure (Address Generation Units)

On x86 CPUs, memory instructions rely on AGUs (Address Generation Units).

Complex addressing patterns like:
base + index * scale + offset
👉 increase AGU pressure

Whereas simpler patterns like:
pointer++
👉 are cheaper and easier for the CPU to execute efficiently

🧪 What I Observed

GCC: generates simpler addressing patterns and reduces AGU contention
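To make the two addressing styles above concrete, here is a minimal C sketch of the same scalar loop written both ways. The function names `sum_indexed` and `sum_pointer` are hypothetical and not from the benchmark; they only illustrate the source-level shapes that tend to map to `base + index * scale` versus a plain pointer increment.

```c
#include <stddef.h>
#include <stdint.h>

/* Indexed form (hypothetical example): each load is naturally
 * expressed as base + index * scale, i.e. a + i * 4. */
int64_t sum_indexed(const int32_t *a, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}

/* Pointer-increment form (hypothetical example): each load
 * dereferences a bare pointer, which is then advanced (pointer++). */
int64_t sum_pointer(const int32_t *a, size_t n) {
    int64_t total = 0;
    const int32_t *end = a + n;
    for (const int32_t *p = a; p != end; p++) {
        total += *p;
    }
    return total;
}
```

Both functions return the same result. An optimizing compiler is free to rewrite either form into the other, so the source shape does not guarantee the addressing mode. To see what was actually emitted, compare the assembly from each compiler, for example with `-O2 -S`, adding `-fno-tree-vectorize` for GCC or `-fno-vectorize` for Clang to keep the loops scalar.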
Continue reading on Dev.to




