
Q4 KV Cache Fit 32K Context into 8GB VRAM — Only Math Broke
The biggest VRAM hog in LLM inference isn't always the model weights. Once the context length grows, KV cache memory consumption overtakes the model itself: Llama-3-8B (Q4_K_M, 4.9GB) at 32K context burns roughly 4GB on KV cache alone. That's about 9GB total, and an RTX 4060 with 8GB can't hold it.

```python
# KV cache memory calculation
def kv_cache_memory(
    n_layers: int,
    n_heads_kv: int,
    head_dim: int,
    context_length: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """KV cache memory usage in GB."""
    # K + V, two tensors per layer
    bytes_total = 2 * n_layers * n_heads_kv * head_dim * context_length * dtype_bytes
    return bytes_total / (1024 ** 3)


# Llama-3-8B (GQA: 8 KV heads)
llama3_8b = kv_cache_memory(
    n_layers=32,
    n_heads_kv=8,          # GQA: 32 attention heads -> 8 KV heads
    head_dim=128,
    context_length=32768,  # 32K
    dtype_bytes=2,         # FP16
)
# -> 4.0 GB

# Qwen2.5-32B (GQA: 8 KV heads)
qwen25_32b = kv_cache_memory(
    n_layers=64,
    n_heads_kv=
```
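The title's claim, that a Q4 KV cache makes 32K context fit into 8GB, can be sanity-checked with the same formula. A minimal sketch, assuming an idealized 4-bit quantization at 0.5 bytes per element (real Q4 formats add a small per-block scale/zero-point overhead); the function name and float `bytes_per_element` parameter are illustrative choices, not from the original:

```python
def kv_cache_memory_gb(
    n_layers: int,
    n_heads_kv: int,
    head_dim: int,
    context_length: int,
    bytes_per_element: float,  # float, so sub-byte quantization can be modeled
) -> float:
    """KV cache size in GiB: K + V tensors across all layers."""
    total = 2 * n_layers * n_heads_kv * head_dim * context_length * bytes_per_element
    return total / (1024 ** 3)


# Llama-3-8B at 32K context
fp16 = kv_cache_memory_gb(32, 8, 128, 32768, 2.0)  # -> 4.0 GiB
q4 = kv_cache_memory_gb(32, 8, 128, 32768, 0.5)    # -> 1.0 GiB (idealized 4-bit)

# Q4_K_M weights (4.9 GiB) + KV cache:
#   FP16 cache: 4.9 + 4.0 = 8.9 GiB  -> does not fit in 8GB VRAM
#   Q4 cache:   4.9 + 1.0 = 5.9 GiB  -> fits with room to spare
```

The quantized cache trades a little attention precision for a 4x memory reduction, which is exactly what moves the total from 8.9 GiB back under the 8GB budget.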
