
# Light Just Cut KV Cache Memory Traffic to 1/16th

The bottleneck in long-context LLM inference isn't compute. It's memory bandwidth. Every decode step in a Transformer scans the entire KV cache to generate a single token: O(n) memory reads for context length n, every single step. No matter how fast your GPU's ALUs get, this O(n) memory wall doesn't budge.

A March 2026 arXiv paper (arXiv:2603.21576, Park & Park) proposes PRISM, which offloads KV cache block selection to photonic circuits, making memory access O(1). The reported results: a 16x memory-traffic reduction at 64K tokens, 10,000x better energy efficiency for block selection, and 100% accuracy retention.

## Why Memory Bandwidth Is the LLM Inference Bottleneck

### The Structural Problem with Decoding

```python
import numpy as np

# What happens in one Transformer decode step
def decode_one_token(query, kv_cache):
    # query: the current token's query vector (1 x d)
    # kv_cache: keys and values for all n past tokens (n x d each)

    # Step 1: similarity between the query and the entire KV cache
    scores = query @ kv_cache.keys.T       # O(n) memory reads

    # Step 2: softmax over all n positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Step 3: weighted sum over all n value vectors -- another O(n) scan
    return weights @ kv_cache.values
```
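To see why this scan dominates, a back-of-envelope sketch helps. The model dimensions below are hypothetical (roughly a 7B-class model with fp16 KV entries), chosen only to illustrate the scaling; the 1/16 factor is the paper's reported traffic reduction at 64K tokens, not something this snippet derives.

```python
# Back-of-envelope KV cache traffic per decode step.
# Dimensions are illustrative assumptions, not taken from the paper.

def kv_traffic_bytes(context_len, n_layers=32, n_kv_heads=32,
                     head_dim=128, bytes_per_elem=2):
    # Each decode step reads every key AND value vector once: O(n).
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token

full = kv_traffic_bytes(64 * 1024)   # dense attention scans everything
selected = full // 16                # touching only ~1/16 of the blocks

print(f"full scan  : {full / 2**30:.1f} GiB per token")      # 32.0 GiB
print(f"1/16 blocks: {selected / 2**30:.1f} GiB per token")  # 2.0 GiB
```

At these assumed dimensions a single generated token drags tens of gigabytes across the memory bus, which is why selecting a small subset of cache blocks, rather than scanning all of them, moves the needle so much.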