MoE Beat Dense 27B by 2.4x on 8GB VRAM — The 35B-A3B Benchmark Nobody Expected


via Dev.to / plasmon

## Start with the benchmarks

In a previous article, I compared three Qwen3.5 models on the same hardware. Here are the MoE-relevant numbers.

Test environment: RTX 4060 8GB / Ryzen 7 / 32GB DDR5 / llama.cpp / Q4_K_M

| Model | Speed (t/s) | VRAM | GPU% | CPU% | RAM | ngl |
|---|---|---|---|---|---|---|
| Qwen3.5-9B | 33.0 | 7.1GB | 91% | 32% | 22.6GB | 99 (all layers GPU) |
| Qwen3.5-27B | 3.57 | 7.7GB | 60% | 74% | 28.3GB | 24 (24/58 layers GPU) |
| Qwen3.5-35B-A3B | 8.61 | 7.6GB | 95% | 65% | 30.8GB | 99 (all layers GPU) |

All three models consume nearly the same VRAM (7.1-7.7GB), yet speed varies by nearly 10x: 33.0, 3.57, and 8.61 t/s. The critical comparison is dense 27B vs MoE 35B-A3B: the 35B model is 2.4x faster than the 27B, despite having more parameters.

## Why 35B beats 27B

The answer is in the GPU utilization numbers.

Dense 27B (GPU 60%): its Q4_K_M file is about 16GB, which can't fit in 8GB of VRAM, so only 24 of 58 layers run on the GPU (ngl=24). The remaining 34 layers run on the CPU. The GPU finishes its portion and sits idle waiting for the CPU; 60% GPU utilization means the GPU is wasting roughly 40% of its time.
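The headline ratios follow directly from the measured speeds in the table. A quick sanity check (numbers taken from the table above):

```python
# Measured decode speeds (tokens/s) from the benchmark table.
speeds = {
    "Qwen3.5-9B": 33.0,
    "Qwen3.5-27B": 3.57,
    "Qwen3.5-35B-A3B": 8.61,
}

# MoE 35B-A3B vs dense 27B, same ~7.x GB VRAM footprint.
moe_vs_dense = speeds["Qwen3.5-35B-A3B"] / speeds["Qwen3.5-27B"]
# Spread between the fastest and slowest model on this hardware.
spread = speeds["Qwen3.5-9B"] / speeds["Qwen3.5-27B"]

print(f"MoE 35B-A3B vs dense 27B: {moe_vs_dense:.1f}x")  # -> 2.4x
print(f"9B vs 27B spread:         {spread:.1f}x")        # -> 9.2x
```

The 2.4x figure is why the MoE result is surprising: on paper the 35B model is the biggest of the three, yet it sits between the 9B and 27B in speed, not below them.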
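The partial-offload penalty can be made concrete with a toy per-token cost model. The per-layer times below are illustrative assumptions (not measurements from the article); the point is only that when layers execute sequentially, the slow CPU portion dominates total token time regardless of how fast the GPU portion is:

```python
# Toy cost model for split execution in llama.cpp-style offloading.
# Per-layer latencies are ASSUMED for illustration, not measured.
GPU_MS_PER_LAYER = 0.8  # assumed GPU time per transformer layer
CPU_MS_PER_LAYER = 7.0  # assumed CPU time per layer (much slower)

def token_ms(total_layers: int, gpu_layers: int) -> float:
    """Per-token latency: GPU layers run first, then CPU layers."""
    cpu_layers = total_layers - gpu_layers
    return gpu_layers * GPU_MS_PER_LAYER + cpu_layers * CPU_MS_PER_LAYER

all_gpu = token_ms(58, 58)  # ngl=99: every layer on the GPU
split = token_ms(58, 24)    # ngl=24: 24 GPU layers, 34 CPU layers

print(f"all-GPU: {1000 / all_gpu:.1f} t/s")
print(f"24/58:   {1000 / split:.1f} t/s")
```

Under these made-up numbers the 34 CPU layers account for over 90% of each token's latency, which is exactly the pattern the 27B's 60% GPU / 74% CPU utilization split suggests: the GPU spends a large fraction of every token waiting for the CPU to finish its layers.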

Continue reading on Dev.to
