Compressed VLM inference from a single Containerfile — turboquant-vllm v1.1


via Dev.to Python, by Alberto Nieto

The hardest part of GPU inference isn't the model; it's the environment: CUDA versions, driver compatibility, pip dependency conflicts. You can have a working quantization plugin and still spend an hour getting it to run on a fresh machine. turboquant-vllm v1.1.0 ships a Containerfile that eliminates that setup. It extends the official vLLM image, installs the TQ4 compression plugin from PyPI, and verifies the plugin entry point at build time, not at runtime when you're debugging a silent fallback to uncompressed attention.

What Changed in v1.1

Container support. A single Containerfile bakes turboquant-vllm into the official vllm-openai image:

    git clone https://github.com/Alberto-Codes/turboquant-vllm.git
    cd turboquant-vllm
    podman build -t vllm-turboquant -f infra/Containerfile.vllm .

Then serve a vision-language model with a compressed KV cache:

    podman run --rm \
      --device nvidia.com/gpu=all \
      --shm-size=8g \
      -p 8000:8000 \
      vllm-turboquant \
      --model allenai/Molmo2-8B \
      --attention-
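The build-time entry-point check is the interesting part of the Containerfile. As a rough sketch of the idea, a one-line Python probe can fail the image build if the plugin never registered; the entry-point group and name below are hypothetical, not taken from turboquant-vllm's actual packaging metadata:

```python
# Sketch: fail fast at build time if a plugin entry point is missing.
# The group/name values here are placeholders -- the real ones live in
# the plugin's pyproject.toml, not in this article.
from importlib.metadata import entry_points

def plugin_registered(group: str, name: str) -> bool:
    """Return True if an entry point called `name` exists in `group`."""
    return any(ep.name == name for ep in entry_points(group=group))

# In a Containerfile this would run as something like:
#   RUN python -c "import sys; ... or sys.exit(1)"
# so a missing plugin breaks `podman build` instead of silently
# falling back to uncompressed attention when the server starts.
if __name__ == "__main__":
    print(plugin_registered("vllm.general_plugins", "turboquant"))
```

The point is where the failure surfaces: a broken install is a one-line build error rather than a runtime performance mystery.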
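Once the container is serving, the endpoint speaks vLLM's OpenAI-compatible chat API. A minimal client sketch, assuming the default port from the `podman run` above and a placeholder image URL:

```python
# Sketch: query the container's OpenAI-compatible endpoint.
# Host, port, and the example image URL are assumptions; adjust them
# to match your `podman run` flags and your actual input.
import json
from urllib import request

def build_vlm_request(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat payload with one text and one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

def send(payload: dict, base_url: str = "http://localhost:8000") -> bytes:
    """POST the payload to the chat completions route (server must be up)."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.read()

payload = build_vlm_request(
    "allenai/Molmo2-8B",
    "Describe this image.",
    "https://example.com/cat.png",  # placeholder image
)
```

Because the image only adds a plugin on top of vllm-openai, any client that already talks to vLLM should work unchanged.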
