From one model to seven — what it took to make TurboQuant model-portable

via Dev.to Python · Alberto Nieto

A KV cache compression plugin that only works on one model is a demo, not a tool. turboquant-vllm v1.0.0 shipped four days ago with one validated architecture: Molmo2. v1.3.0 validates seven: Llama 3.1, Mistral 7B, Qwen2.5, Phi-3-mini, Phi-4, Gemma-2, and Gemma-3. The path between those two points was more interesting than the destination.

What Changed

Fused paged kernels (v1.2.0). The original architecture decompressed the KV cache from TQ4 to FP16 in HBM, then ran standard attention on the result. The new fused kernel reads compressed blocks directly from vLLM's page table, decompresses in SRAM, and computes attention in a single pass. HBM traffic: 1,160 → 136 bytes per token.

```
# One flag. Same as before.
vllm serve meta-llama/Llama-3.1-8B --attention-backend CUSTOM
```

Non-pow2 head dimensions (v1.3.0). Triton's tl.arange requires power-of-two ranges. Phi-3-mini has head_dim=96. Gemma has head_dim=256. All five Triton kernels needed pad-to-next-power-of-two with boundary masking.
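The pad-to-next-power-of-two fix above can be sketched in plain numpy. This is a hypothetical illustration, not the plugin's actual kernel code: `next_power_of_2` mirrors what a Triton kernel must do before calling `tl.arange`, and the boundary mask plays the role of a masked load, so the extra lanes read as zero and drop out of the dot product.

```python
import numpy as np

def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (the range shape tl.arange requires)."""
    return 1 << (n - 1).bit_length()

head_dim = 96                        # Phi-3-mini's head dimension
padded = next_power_of_2(head_dim)   # pad 96 lanes up to 128

# Boundary mask: True for real lanes, False for the padding tail.
offs = np.arange(padded)
mask = offs < head_dim

rng = np.random.default_rng(0)
q = rng.standard_normal(head_dim).astype(np.float32)
k = rng.standard_normal(head_dim).astype(np.float32)

# Masked "load": out-of-bounds lanes are zeroed, like tl.load(..., mask=mask).
q_pad = np.where(mask, np.pad(q, (0, padded - head_dim)), 0.0)
k_pad = np.where(mask, np.pad(k, (0, padded - head_dim)), 0.0)

# The padded dot product matches the unpadded one exactly.
print(np.isclose(q_pad @ k_pad, q @ k))  # True
```

Because 256 is already a power of two, Gemma's head_dim passes through unchanged; the masking only costs anything on shapes like 96.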
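Stepping back to the fused kernel: the reason decompress-in-SRAM pays off is that a 4-bit code stream is a quarter the size of FP16, so far fewer bytes cross HBM per token. The sketch below shows generic 4-bit absmax block quantization with two codes packed per byte; it is an assumption-laden stand-in, not the actual TQ4 wire format, and the 1,160 → 136 figure from the article depends on model details this toy does not model.

```python
import numpy as np

def quantize_4bit(block: np.ndarray):
    """Quantize an FP32/FP16 block to packed 4-bit codes plus one scale."""
    scale = max(float(np.abs(block).max()) / 7.0, 1e-8)  # codes in [-7, 7]
    codes = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    nibbles = codes.astype(np.uint8) & 0xF        # two's-complement nibbles
    packed = nibbles[0::2] | (nibbles[1::2] << 4) # two codes per byte
    return packed, scale

def dequantize_4bit(packed: np.ndarray, scale: float, n: int) -> np.ndarray:
    """The cheap unpack a fused kernel would run in SRAM per block."""
    codes = np.empty(n, dtype=np.int16)
    codes[0::2] = packed & 0xF
    codes[1::2] = packed >> 4
    codes = np.where(codes > 7, codes - 16, codes)  # sign-extend 4 bits
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal(128).astype(np.float32)
packed, scale = quantize_4bit(kv)
restored = dequantize_4bit(packed, scale, kv.size)

print(packed.nbytes, kv.astype(np.float16).nbytes)  # 64 256 (4x fewer bytes)
```

Reading `packed` instead of FP16 is what shrinks HBM traffic; the unpack is a few shifts and a multiply, cheap enough to run inside the attention pass rather than as a separate decompress-to-HBM step.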

Continue reading on Dev.to Python
