Compressed VLM inference from a single Containerfile — turboquant-vllm v1.1


via Dev.to Python, by Alberto Nieto

The hardest part of GPU inference isn't the model; it's the environment: CUDA versions, driver compatibility, pip dependency conflicts. You can have a working quantization plugin and still spend an hour getting it to run on a fresh machine. turboquant-vllm v1.1.0 ships a Containerfile that eliminates that setup. It extends the official vLLM image, installs the TQ4 compression plugin from PyPI, and verifies the plugin entry point at build time, not at runtime when you're debugging a silent fallback to uncompressed attention.

What Changed in v1.1

Container support. A single Containerfile bakes turboquant-vllm into the official vllm-openai image:

    git clone https://github.com/Alberto-Codes/turboquant-vllm.git
    cd turboquant-vllm
    podman build -t vllm-turboquant -f infra/Containerfile.vllm .

Then serve a vision-language model with a compressed KV cache:

    podman run --rm \
      --device nvidia.com/gpu=all \
      --shm-size=8g \
      -p 8000:8000 \
      vllm-turboquant \
      --model allenai/Molmo2-8B \
      --attention-
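The build-time entry-point check is the interesting part of the Containerfile. As a rough sketch of the idea, a one-line Python probe can fail the image build if the plugin never registered; the entry-point group and name below are hypothetical, not taken from turboquant-vllm's actual packaging metadata:

```python
# Sketch: fail fast at build time if a plugin entry point is missing.
# The group/name values here are placeholders -- the real ones live in
# the plugin's pyproject.toml, not in this article.
from importlib.metadata import entry_points

def plugin_registered(group: str, name: str) -> bool:
    """Return True if an entry point called `name` exists in `group`."""
    return any(ep.name == name for ep in entry_points(group=group))

# In a Containerfile this would run as something like:
#   RUN python -c "import sys; ... or sys.exit(1)"
# so a missing plugin breaks `podman build` instead of silently
# falling back to uncompressed attention when the server starts.
if __name__ == "__main__":
    print(plugin_registered("vllm.general_plugins", "turboquant"))
```

The point is where the failure surfaces: a broken install is a one-line build error rather than a runtime performance mystery.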
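Once the container is serving, the endpoint speaks vLLM's OpenAI-compatible chat API. A minimal client sketch, assuming the default port from the `podman run` above and a placeholder image URL:

```python
# Sketch: query the container's OpenAI-compatible endpoint.
# Host, port, and the example image URL are assumptions; adjust them
# to match your `podman run` flags and your actual input.
import json
from urllib import request

def build_vlm_request(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat payload with one text and one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

def send(payload: dict, base_url: str = "http://localhost:8000") -> bytes:
    """POST the payload to the chat completions route (server must be up)."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.read()

payload = build_vlm_request(
    "allenai/Molmo2-8B",
    "Describe this image.",
    "https://example.com/cat.png",  # placeholder image
)
```

Because the image only adds a plugin on top of vllm-openai, any client that already talks to vLLM should work unchanged.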
