
NexusQuant benchmarks: every number, honestly
When you build a KV cache compression system and plan to publish a paper, you face a choice: present the best-looking numbers, or present all of them. We chose all of them. This post is every benchmark result we have, including the ones that did not work.

## The pipeline

Quick context: NexusQuant compresses the KV cache of transformer models at inference time, training-free. The stages, in order:

Prefill → Key-Key Attention Score → Evict → RoPE-remove → Hadamard → 2-bit E8 VQ → Temporal Delta → zstd

The context manager API:

```python
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=200)
```

All numbers below are from an A10G GPU (24 GB). The perplexity (PPL) delta is measured against the uncompressed baseline on the same passages.

## Mistral-7B: the full picture

These are our numbers at different prefix lengths and eviction rates. Every row is real.

| Prefix | Evict% | Compression | PPL Delta | Verdict |
| --- | --- | --- | --- | --- |
| 500 tok | 35% | 10.1x | +0.90% | Usable for most tasks |
| 1664 tok | 35% | 10.4x | +0.14% | Near-lossless |
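The stage ordering above can be sketched as a simple function composition. Everything below is an illustrative stub, not NexusQuant's actual implementation: each stage just records its name, to make the point that scoring and eviction run before quantization, so the 2-bit E8 VQ step only touches surviving entries.

```python
# Hypothetical sketch of the stage composition; stage names mirror the
# pipeline described in the post, but the functions are stubs.
from functools import reduce

def make_stage(name):
    """Return a stub stage that appends its name to a trace list."""
    def stage(state):
        trace, payload = state
        return (trace + [name], payload)
    return stage

# Stage order from the post: eviction precedes quantization, so the
# expensive vector-quantization step sees a smaller cache.
PIPELINE = [
    "prefill",
    "key_key_attention_score",
    "evict",
    "rope_remove",
    "hadamard",
    "e8_vq_2bit",
    "temporal_delta",
    "zstd",
]

def run_pipeline(kv_cache):
    stages = [make_stage(n) for n in PIPELINE]
    return reduce(lambda state, stage: stage(state), stages, ([], kv_cache))

trace, _ = run_pipeline(kv_cache=object())
print(trace)
```

The real system would thread tensors (and zstd-compressed bytes) through each stage instead of a trace list, but the ordering constraint is the part that matters here.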


