
# Light Just Cut KV Cache Memory Traffic to 1/16th

The bottleneck in long-context LLM inference isn't compute. It's memory bandwidth. Every decode step in a Transformer scans the entire KV cache to generate a single token: O(n) memory reads for context length n, every single step. No matter how fast your GPU's ALUs get, this O(n) memory wall doesn't budge.

A March 2026 arXiv paper (arXiv:2603.21576, Park & Park) proposes PRISM, which offloads KV cache block selection to photonic circuits, making memory access O(1). The reported results: a 16x memory-traffic reduction at 64K tokens, 10,000x better energy efficiency for block selection, and 100% accuracy retention.

## Why Memory Bandwidth Is the LLM Inference Bottleneck

### The Structural Problem with Decoding

```python
import numpy as np

# What happens in one Transformer decode step
def decode_one_token(query, kv_cache):
    # query: the current token's query vector (1 x d)
    # kv_cache: keys and values for all n past tokens (n x d each)

    # Step 1: similarity between the query and the entire KV cache
    scores = query @ kv_cache.keys.T       # O(n) memory reads

    # Step 2: softmax over all n positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Step 3: weighted sum over all n value vectors -- another O(n) scan
    return weights @ kv_cache.values
```
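To see why this scan dominates, a back-of-envelope sketch helps. The model dimensions below are hypothetical (roughly a 7B-class model with fp16 KV entries), chosen only to illustrate the scaling; the 1/16 factor is the paper's reported traffic reduction at 64K tokens, not something this snippet derives.

```python
# Back-of-envelope KV cache traffic per decode step.
# Dimensions are illustrative assumptions, not taken from the paper.

def kv_traffic_bytes(context_len, n_layers=32, n_kv_heads=32,
                     head_dim=128, bytes_per_elem=2):
    # Each decode step reads every key AND value vector once: O(n).
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token

full = kv_traffic_bytes(64 * 1024)   # dense attention scans everything
selected = full // 16                # touching only ~1/16 of the blocks

print(f"full scan  : {full / 2**30:.1f} GiB per token")      # 32.0 GiB
print(f"1/16 blocks: {selected / 2**30:.1f} GiB per token")  # 2.0 GiB
```

At these assumed dimensions a single generated token drags tens of gigabytes across the memory bus, which is why selecting a small subset of cache blocks, rather than scanning all of them, moves the needle so much.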