I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested


By Alberto Nieto, via Dev.to Python

Google published TurboQuant at ICLR 2026: a technique that compresses transformer KV caches to 4 bits per coordinate with zero accuracy loss. The paper reports a 5-6x memory reduction on H100 GPUs, tested on text models like Gemma and Mistral. I wanted to know: does it work on a vision-language model processing video? On a consumer GPU? 72 hours later, turboquant-vllm is on PyPI.

Quick Start

```shell
pip install "turboquant-vllm[vllm]"
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
```

That's it. The plugin auto-registers via vLLM's entry point system. No code changes, no forking, no monkey-patching.

For HuggingFace users:

```python
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Pass cache (not the wrapper) to model.generate()
```

Why Vision-Language Models Matter

Every other TurboQuant implementation tests on text-only models with hundreds of tokens. But a 12-second v
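To build intuition for the headline claim, 4 bits per coordinate for the K and V tensors, here is a toy uniform quantizer over a fake KV tensor. This is my own sketch, not TurboQuant's actual algorithm (the paper's method is more sophisticated); the array shapes and the per-row affine scheme are assumptions for illustration.

```python
import numpy as np

def quantize_4bit(x, axis=-1):
    """Toy asymmetric affine quantization to 4 bits (16 levels)
    with a per-row scale and zero-point along `axis`."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / 15.0                      # map the row range onto 0..15
    scale = np.where(scale == 0, 1.0, scale)      # guard constant rows
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    """Reconstruct approximate float values from 4-bit codes."""
    return q.astype(np.float32) * scale + lo

# Fake KV tensor: (layers, kv_heads, tokens, head_dim) -- assumed shape
rng = np.random.default_rng(0)
kv = rng.normal(size=(2, 8, 16, 128)).astype(np.float32)

q, scale, lo = quantize_4bit(kv)
recon = dequantize_4bit(q, scale, lo)
err = np.abs(recon - kv).max()                    # bounded by scale / 2 per row
```

Uniform rounding bounds the per-coordinate error by half a quantization step, which is why the reconstruction error stays within `scale / 2` of the original values.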
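As a sanity check on the reported 5-6x memory reduction: the raw bit-width ratio from fp16 to 4 bits is 4x, before accounting for scale/zero-point metadata or whatever baseline overheads the paper's measurement includes. The model dimensions below are assumptions for illustration, not numbers from the paper or from Molmo2-8B.

```python
# Back-of-envelope KV-cache sizing. All dimensions are illustrative
# assumptions (roughly an 8B-class model with grouped-query attention).
layers, kv_heads, head_dim = 32, 8, 128
tokens = 16_000                  # e.g. a long multimodal context

elems = 2 * layers * kv_heads * head_dim * tokens  # K and V tensors
fp16_bytes = elems * 2           # 16 bits per coordinate
q4_bytes = elems // 2            # 4 bits per coordinate, ignoring scale metadata

print(f"fp16 cache: {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {q4_bytes / 2**30:.2f} GiB")
print(f"raw ratio: {fp16_bytes / q4_bytes:.0f}x")
```

Treat these numbers as intuition only; measured end-to-end savings depend on what each side of the comparison counts.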

Continue reading on Dev.to Python
