From one model to seven — what it took to make TurboQuant model-portable

via Dev.to Python · Alberto Nieto

A KV cache compression plugin that only works on one model is a demo, not a tool. turboquant-vllm v1.0.0 shipped four days ago with one validated architecture: Molmo2. v1.3.0 validates seven: Llama 3.1, Mistral 7B, Qwen2.5, Phi-3-mini, Phi-4, Gemma-2, and Gemma-3. The path between those two points was more interesting than the destination.

What Changed

Fused paged kernels (v1.2.0). The original architecture decompressed the KV cache from TQ4 to FP16 in HBM, then ran standard attention on the result. The new fused kernel reads compressed blocks directly from vLLM's page table, decompresses in SRAM, and computes attention in a single pass. HBM traffic: 1,160 → 136 bytes per token.

```
# One flag. Same as before.
vllm serve meta-llama/Llama-3.1-8B --attention-backend CUSTOM
```

Non-pow2 head dimensions (v1.3.0). Triton's tl.arange requires power-of-two ranges. Phi-3-mini has head_dim=96. Gemma has head_dim=256. All five Triton kernels needed pad-to-next-power-of-two with boundary masking.
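The pad-to-next-power-of-two fix above can be sketched in plain numpy. This is a hypothetical illustration, not the plugin's actual kernel code: `next_power_of_2` mirrors what a Triton kernel must do before calling `tl.arange`, and the boundary mask plays the role of a masked load, so the extra lanes read as zero and drop out of the dot product.

```python
import numpy as np

def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (the range shape tl.arange requires)."""
    return 1 << (n - 1).bit_length()

head_dim = 96                        # Phi-3-mini's head dimension
padded = next_power_of_2(head_dim)   # pad 96 lanes up to 128

# Boundary mask: True for real lanes, False for the padding tail.
offs = np.arange(padded)
mask = offs < head_dim

rng = np.random.default_rng(0)
q = rng.standard_normal(head_dim).astype(np.float32)
k = rng.standard_normal(head_dim).astype(np.float32)

# Masked "load": out-of-bounds lanes are zeroed, like tl.load(..., mask=mask).
q_pad = np.where(mask, np.pad(q, (0, padded - head_dim)), 0.0)
k_pad = np.where(mask, np.pad(k, (0, padded - head_dim)), 0.0)

# The padded dot product matches the unpadded one exactly.
print(np.isclose(q_pad @ k_pad, q @ k))  # True
```

Because 256 is already a power of two, Gemma's head_dim passes through unchanged; the masking only costs anything on shapes like 96.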
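Stepping back to the fused kernel: the reason decompress-in-SRAM pays off is that a 4-bit code stream is a quarter the size of FP16, so far fewer bytes cross HBM per token. The sketch below shows generic 4-bit absmax block quantization with two codes packed per byte; it is an assumption-laden stand-in, not the actual TQ4 wire format, and the 1,160 → 136 figure from the article depends on model details this toy does not model.

```python
import numpy as np

def quantize_4bit(block: np.ndarray):
    """Quantize an FP32/FP16 block to packed 4-bit codes plus one scale."""
    scale = max(float(np.abs(block).max()) / 7.0, 1e-8)  # codes in [-7, 7]
    codes = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    nibbles = codes.astype(np.uint8) & 0xF        # two's-complement nibbles
    packed = nibbles[0::2] | (nibbles[1::2] << 4) # two codes per byte
    return packed, scale

def dequantize_4bit(packed: np.ndarray, scale: float, n: int) -> np.ndarray:
    """The cheap unpack a fused kernel would run in SRAM per block."""
    codes = np.empty(n, dtype=np.int16)
    codes[0::2] = packed & 0xF
    codes[1::2] = packed >> 4
    codes = np.where(codes > 7, codes - 16, codes)  # sign-extend 4 bits
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal(128).astype(np.float32)
packed, scale = quantize_4bit(kv)
restored = dequantize_4bit(packed, scale, kv.size)

print(packed.nbytes, kv.astype(np.float16).nbytes)  # 64 256 (4x fewer bytes)
```

Reading `packed` instead of FP16 is what shrinks HBM traffic; the unpack is a few shifts and a multiply, cheap enough to run inside the attention pass rather than as a separate decompress-to-HBM step.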

Continue reading on Dev.to Python
