I tested speculative decoding on my home GPU cluster. Here's why it didn't help.

I spent Saturday night testing n-gram speculative decoding on consumer GPUs. The claim: speculative decoding can speed up LLM inference by 2-3x by predicting future tokens and verifying them in parallel. I wanted to see if that holds up on real hardware running diverse workloads. For the most part, it doesn't. But the journey was worth it, and I caught a benchmarking pitfall that I think a lot of people are falling into.

The setup

My home lab runs Kubernetes on a machine called Shadowstack. Two NVIDIA RTX 5060 Ti GPUs (16GB VRAM each, 32GB total). I use LLMKube, an open source K8s operator I built, to manage LLM inference workloads with llama.cpp. For this test I deployed two models:

Gemma 4 26B-A4B: Google's Mixture of Experts model. 26B total params but only ~4B active per token. Runs at 88 tok/s on my setup.

Qwen3-32B: A dense 32B model. All parameters active per token. Runs at 20 tok/s.

Both running Q4_K_M quantization, flash attention enabled, 8K context, split across both GPUs.
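For anyone who hasn't seen n-gram speculative decoding in code, here is a minimal sketch of the draft-and-verify idea, assuming greedy decoding and a toy stand-in for the target model. The names `ngram_draft`, `verify`, and `toy_next_token` are made up for illustration and are not llama.cpp APIs. A real engine scores all draft tokens in one batched forward pass; the loop below only imitates the accept/reject logic.

```python
from typing import Callable, List, Sequence


def ngram_draft(tokens: Sequence[int], n: int = 2, max_draft: int = 4) -> List[int]:
    """Draft tokens by looking up the last n tokens earlier in the context.

    Illustrative only: finds the most recent earlier occurrence of the
    trailing n-gram and proposes whatever followed it.
    """
    if len(tokens) <= n:
        return []
    key = tuple(tokens[-n:])
    # Scan backwards over earlier positions, skipping the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[start:start + n]) == key:
            return list(tokens[start + n:start + n + max_draft])
    return []  # no earlier match, so nothing to speculate on


def verify(
    tokens: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification: keep draft tokens while the target model agrees.

    Returns the tokens emitted this step (between 1 and len(draft) + 1).
    Assumption: next_token is a hypothetical greedy target model.
    """
    ctx = list(tokens)
    emitted: List[int] = []
    for guess in draft:
        actual = next_token(ctx)
        emitted.append(actual)
        ctx.append(actual)
        if actual != guess:
            return emitted  # first mismatch: keep the model's token, drop the rest
    # Every draft token matched, so the same pass also yields one extra token.
    emitted.append(next_token(ctx))
    return emitted


def toy_next_token(ctx: Sequence[int]) -> int:
    """Toy 'model' that just counts 0,1,2,3,4,0,1,... (highly repetitive)."""
    return (ctx[-1] + 1) % 5


context = [0, 1, 2, 3, 4, 0, 1]
draft = ngram_draft(context)
step = verify(context, draft, toy_next_token)
print(draft, step)
```

On this deliberately repetitive toy sequence every draft token is accepted, so one verification step emits five tokens. When the context has no earlier match for the trailing n-gram, `ngram_draft` returns an empty list and the step falls back to a single token, which is ordinary decoding.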

