How Much GPU Memory Does NexusQuant Actually Save?

via Dev.to, by João André Gomes Marques

KV cache compression numbers like "10x" sound impressive in a paper. But what does that mean in practice, for a real GPU serving real users? Let me give you a concrete memory calculator so you can answer this for your own setup.

The KV Cache Formula

For any transformer, the KV cache size is:

KV_bytes = 2 × num_layers × num_heads × head_dim × seq_len × bytes_per_element

The 2 is for keys AND values. num_heads counts KV heads, which in grouped-query attention models is smaller than the number of query heads. bytes_per_element is 2 for FP16, 4 for FP32.

For Mistral-7B (32 layers, 8 KV heads, head_dim=128, FP16):

KV_bytes = 2 × 32 × 8 × 128 × seq_len × 2 = 131,072 × seq_len bytes = 128 KiB per token

At 128K tokens: 128 KiB × 131,072 = 16 GiB (about 17.2 GB) just for the KV cache.

GPU Memory Table

Here's what that means on real hardware:

| GPU | VRAM | Max KV tokens (FP16, no NQ) | With NexusQuant 10x | With NexusQuant 17x | With NexusQuant 33x |
|---|---|---|---|---|---|
| RTX 3090 | 24 GB | ~150K | ~1.5M | ~2.6M | ~5M |
| A10G | 24 GB | ~150K | ~1.5M | ~2.6M | ~5M |
| A100 40GB | 40 GB | ~256K | ~2.6M | ~4.4M | ~8.5M |
| A100 80GB | 80 GB | ~512K | … | … | … |
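The KV cache formula above can be turned into a small calculator. This is a minimal sketch, not NexusQuant's own code: the function names and the `budget_bytes` and `compression` parameters are hypothetical, `budget_bytes` stands for whatever memory you can give to the cache after weights and activations (less than total VRAM), and the compression ratio is treated as ideal, with no metadata overhead.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """Total uncompressed KV cache size for one sequence of seq_len tokens."""
    return seq_len * kv_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_element)


def max_tokens(budget_bytes: int, per_token_bytes: int,
               compression: float = 1.0) -> int:
    """Tokens that fit in budget_bytes, assuming an ideal compression ratio.

    budget_bytes is an assumption of this sketch: the memory left for the
    cache, not the GPU's total VRAM.
    """
    return int(budget_bytes * compression // per_token_bytes)


# Mistral-7B figures from the article: 32 layers, 8 KV heads, head_dim=128, FP16.
MISTRAL_7B = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)

per_token = kv_bytes_per_token(**MISTRAL_7B)
print(per_token)                                      # 131072 bytes = 128 KiB
print(kv_cache_bytes(131_072, **MISTRAL_7B) / 2**30)  # 16.0 GiB at 128K tokens
```

To estimate capacity for your own card, pass the memory you actually have free for the cache, for example `max_tokens(16 * 2**30, per_token, compression=10)` for a hypothetical 16 GiB budget at 10x.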
