
Q4 KV Cache Fit 32K Context into 8GB VRAM — Only Math Broke
The biggest VRAM hog in LLM inference isn't always the model weights. Once the context length grows, KV cache memory consumption overtakes the model itself: Llama-3-8B (Q4_K_M, 4.9GB) at 32K context burns roughly 4GB on KV cache alone. That's about 9GB total, and an RTX 4060 with 8GB can't hold it.

```python
# KV cache memory calculation
def kv_cache_memory(
    n_layers: int,
    n_heads_kv: int,
    head_dim: int,
    context_length: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """KV cache memory usage in GB."""
    # K + V, two tensors per layer
    bytes_total = 2 * n_layers * n_heads_kv * head_dim * context_length * dtype_bytes
    return bytes_total / (1024 ** 3)


# Llama-3-8B (GQA: 8 KV heads)
llama3_8b = kv_cache_memory(
    n_layers=32,
    n_heads_kv=8,          # GQA: 32 attention heads -> 8 KV heads
    head_dim=128,
    context_length=32768,  # 32K
    dtype_bytes=2,         # FP16
)
# -> 4.0 GB

# Qwen2.5-32B (GQA: 8 KV heads)
qwen25_32b = kv_cache_memory(
    n_layers=64,
    n_heads_kv=
```
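The title's claim, that a Q4 KV cache makes 32K context fit into 8GB, can be sanity-checked with the same formula. A minimal sketch, assuming an idealized 4-bit quantization at 0.5 bytes per element (real Q4 formats add a small per-block scale/zero-point overhead); the function name and float `bytes_per_element` parameter are illustrative choices, not from the original:

```python
def kv_cache_memory_gb(
    n_layers: int,
    n_heads_kv: int,
    head_dim: int,
    context_length: int,
    bytes_per_element: float,  # float, so sub-byte quantization can be modeled
) -> float:
    """KV cache size in GiB: K + V tensors across all layers."""
    total = 2 * n_layers * n_heads_kv * head_dim * context_length * bytes_per_element
    return total / (1024 ** 3)


# Llama-3-8B at 32K context
fp16 = kv_cache_memory_gb(32, 8, 128, 32768, 2.0)  # -> 4.0 GiB
q4 = kv_cache_memory_gb(32, 8, 128, 32768, 0.5)    # -> 1.0 GiB (idealized 4-bit)

# Q4_K_M weights (4.9 GiB) + KV cache:
#   FP16 cache: 4.9 + 4.0 = 8.9 GiB  -> does not fit in 8GB VRAM
#   Q4 cache:   4.9 + 1.0 = 5.9 GiB  -> fits with room to spare
```

The quantized cache trades a little attention precision for a 4x memory reduction, which is exactly what moves the total from 8.9 GiB back under the 8GB budget.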
