
NexusQuant benchmarks: every number, honestly
When you build a KV cache compression system and plan to publish a paper, you face a choice: present the best-looking numbers, or present all of them. We chose all of them. This post is every benchmark result we have, including the ones that did not work.

## The pipeline

Quick context: NexusQuant compresses the KV cache of transformer models at inference time, training-free. The stages, in order:

Prefill → Key-Key Attention Score → Evict → RoPE-remove → Hadamard → 2-bit E8 VQ → Temporal Delta → zstd

The context manager API:

```python
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=200)
```

All numbers below are from an A10G GPU (24 GB). The perplexity (PPL) delta is measured against the uncompressed baseline on the same passages.

## Mistral-7B: the full picture

These are our numbers at different prefix lengths and eviction rates. Every row is real.

| Prefix | Evict% | Compression | PPL Delta | Verdict |
| --- | --- | --- | --- | --- |
| 500 tok | 35% | 10.1x | +0.90% | Usable for most tasks |
| 1664 tok | 35% | 10.4x | +0.14% | Near-lossless |
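The stage ordering above can be sketched as a simple function composition. Everything below is an illustrative stub, not NexusQuant's actual implementation: each stage just records its name, to make the point that scoring and eviction run before quantization, so the 2-bit E8 VQ step only touches surviving entries.

```python
# Hypothetical sketch of the stage composition; stage names mirror the
# pipeline described in the post, but the functions are stubs.
from functools import reduce

def make_stage(name):
    """Return a stub stage that appends its name to a trace list."""
    def stage(state):
        trace, payload = state
        return (trace + [name], payload)
    return stage

# Stage order from the post: eviction precedes quantization, so the
# expensive vector-quantization step sees a smaller cache.
PIPELINE = [
    "prefill",
    "key_key_attention_score",
    "evict",
    "rope_remove",
    "hadamard",
    "e8_vq_2bit",
    "temporal_delta",
    "zstd",
]

def run_pipeline(kv_cache):
    stages = [make_stage(n) for n in PIPELINE]
    return reduce(lambda state, stage: stage(state), stages, ([], kv_cache))

trace, _ = run_pipeline(kv_cache=object())
print(trace)
```

The real system would thread tensors (and zstd-compressed bytes) through each stage instead of a trace list, but the ordering constraint is the part that matters here.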


