
I tested speculative decoding on my home GPU cluster. Here's why it didn't help.
I spent Saturday night testing n-gram speculative decoding on consumer GPUs. The claim: speculative decoding can speed up LLM inference by 2-3x by predicting future tokens and verifying them in parallel. I wanted to see if that holds up on real hardware running diverse workloads. For the most part, it doesn't. But the journey was worth it, and I caught a benchmarking pitfall that I think a lot of people are falling into.

The setup

My home lab runs Kubernetes on a machine called Shadowstack. Two NVIDIA RTX 5060 Ti GPUs (16GB VRAM each, 32GB total). I use LLMKube, an open source K8s operator I built, to manage LLM inference workloads with llama.cpp.

For this test I deployed two models:

Gemma 4 26B-A4B: Google's Mixture of Experts model. 26B total params but only ~4B active per token. Runs at 88 tok/s on my setup.

Qwen3-32B: A dense 32B model. All parameters active per token. Runs at 20 tok/s.

Both running Q4_K_M quantization, flash attention enabled, 8K context, split across both GPUs.
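
As background for the claim being tested, here is a minimal sketch of the draft-then-verify idea behind n-gram speculative decoding. It is not llama.cpp's implementation: the function names (`ngram_draft`, `verify`), the parameters `n` and `max_draft`, and the toy `next_token` model are all hypothetical, and verification is simulated with sequential calls, whereas a real engine checks all draft tokens in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple


def ngram_draft(tokens: Sequence[int], n: int = 2, max_draft: int = 4) -> List[int]:
    """Propose draft tokens by copying what followed the most recent
    earlier occurrence of the last n tokens (hypothetical helper)."""
    if len(tokens) <= n:
        return []
    tail = list(tokens[-n:])
    # Scan backwards over earlier positions, skipping the tail itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == tail:
            return list(tokens[start + n:start + n + max_draft])
    return []  # no earlier match, so nothing to speculate on


def verify(
    tokens: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[List[int]], int],
) -> Tuple[List[int], int]:
    """Greedy verification: keep draft tokens while they match the target
    model's own choice. Returns (new tokens, number of drafts accepted).
    At least one new token is always produced."""
    out = list(tokens)
    accepted = 0
    for d in draft:
        t = next_token(out)  # simulated; batched in a real engine
        out.append(t)
        if t != d:
            # First mismatch: keep the model's token, drop the rest of the draft.
            return out[len(tokens):], accepted
        accepted += 1
    # Every draft token matched, so the model's next token comes for free.
    out.append(next_token(out))
    return out[len(tokens):], accepted


if __name__ == "__main__":
    # Toy "model" that cycles 1, 2, 3, 4, 1, 2, ... (an assumption for the demo).
    def toy_model(ctx: List[int]) -> int:
        return (ctx[-1] % 4) + 1

    history = [1, 2, 3, 4, 1, 2]
    draft = ngram_draft(history, n=2, max_draft=4)
    new_tokens, accepted = verify(history, draft, toy_model)
    print(draft, new_tokens, accepted)
```

In this sketch the payoff depends entirely on how many draft tokens are accepted per step: repetitive text yields long accepted runs, while a step with no n-gram match or an early mismatch produces just one token, the same as ordinary decoding.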
Continue reading on Dev.to


