I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested


By Alberto Nieto, via Dev.to Python

Google published TurboQuant at ICLR 2026: a technique that compresses transformer KV caches to 4 bits per coordinate with zero accuracy loss. The paper reports a 5-6x memory reduction on H100 GPUs, tested on text models like Gemma and Mistral. I wanted to know: does it work on a vision-language model processing video? On a consumer GPU? 72 hours later, turboquant-vllm is on PyPI.

Quick Start

```shell
pip install "turboquant-vllm[vllm]"
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
```

That's it. The plugin auto-registers via vLLM's entry point system. No code changes, no forking, no monkey-patching.

For HuggingFace users:

```python
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Pass cache (not the wrapper) to model.generate()
```

Why Vision-Language Models Matter

Every other TurboQuant implementation tests on text-only models with hundreds of tokens. But a 12-second v
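To build intuition for the headline claim, 4 bits per coordinate for the K and V tensors, here is a toy uniform quantizer over a fake KV tensor. This is my own sketch, not TurboQuant's actual algorithm (the paper's method is more sophisticated); the array shapes and the per-row affine scheme are assumptions for illustration.

```python
import numpy as np

def quantize_4bit(x, axis=-1):
    """Toy asymmetric affine quantization to 4 bits (16 levels)
    with a per-row scale and zero-point along `axis`."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / 15.0                      # map the row range onto 0..15
    scale = np.where(scale == 0, 1.0, scale)      # guard constant rows
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    """Reconstruct approximate float values from 4-bit codes."""
    return q.astype(np.float32) * scale + lo

# Fake KV tensor: (layers, kv_heads, tokens, head_dim) -- assumed shape
rng = np.random.default_rng(0)
kv = rng.normal(size=(2, 8, 16, 128)).astype(np.float32)

q, scale, lo = quantize_4bit(kv)
recon = dequantize_4bit(q, scale, lo)
err = np.abs(recon - kv).max()                    # bounded by scale / 2 per row
```

Uniform rounding bounds the per-coordinate error by half a quantization step, which is why the reconstruction error stays within `scale / 2` of the original values.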
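As a sanity check on the reported 5-6x memory reduction: the raw bit-width ratio from fp16 to 4 bits is 4x, before accounting for scale/zero-point metadata or whatever baseline overheads the paper's measurement includes. The model dimensions below are assumptions for illustration, not numbers from the paper or from Molmo2-8B.

```python
# Back-of-envelope KV-cache sizing. All dimensions are illustrative
# assumptions (roughly an 8B-class model with grouped-query attention).
layers, kv_heads, head_dim = 32, 8, 128
tokens = 16_000                  # e.g. a long multimodal context

elems = 2 * layers * kv_heads * head_dim * tokens  # K and V tensors
fp16_bytes = elems * 2           # 16 bits per coordinate
q4_bytes = elems // 2            # 4 bits per coordinate, ignoring scale metadata

print(f"fp16 cache: {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {q4_bytes / 2**30:.2f} GiB")
print(f"raw ratio: {fp16_bytes / q4_bytes:.0f}x")
```

Treat these numbers as intuition only; measured end-to-end savings depend on what each side of the comparison counts.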

Continue reading on Dev.to Python
