Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts
I've been building ZSE (Z Server Engine) for the past few weeks. It's an open-source LLM inference engine focused on two things nobody has fully solved together: memory efficiency and fast cold starts.

The problem I was trying to solve: running a 32B model normally requires ~64 GB VRAM. Most developers don't have that. And even when quantization helps with memory, cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts, which kills serverless and autoscaling use cases.

What ZSE does differently:

- Fits 32B in 19.3 GB VRAM (70% reduction vs FP16), so it runs on a single A100-40GB
- Fits 7B in 5.2 GB VRAM (63% reduction), so it runs on consumer GPUs
- Native .zse pre-quantized format with memory-mapped weights: 3.9s cold start for 7B, 21.4s for 32B, vs 45s and 120s with bitsandbytes and ~30s for vLLM
- All benchmarks verified on Modal A100-80GB (Feb 2026)

It ships with:

- OpenAI-compatible API server (drop-in replacement)
- Interactive CLI (zse serve, zse chat, zse conve
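
For anyone who wants to sanity-check the percentages quoted above, here is a rough back-of-envelope sketch in Python. It is not part of ZSE. It assumes the FP16 baseline counts weights only at 2 bytes per parameter (with 1 GB = 1e9 bytes), and it assumes the 45s and 120s bitsandbytes figures pair with the 7B and 32B models respectively. The function names are made up for illustration.

```python
# Back-of-envelope check of the figures quoted in the post.
# Assumptions (not from ZSE itself): the FP16 baseline is weights only,
# 2 bytes per parameter, 1 GB = 1e9 bytes; KV cache and activations ignored.

BYTES_PER_PARAM_FP16 = 2


def fp16_weight_gb(params_billion: float) -> float:
    """Weights-only FP16 footprint in GB (billions of params x bytes each)."""
    return params_billion * BYTES_PER_PARAM_FP16


def reduction_pct(baseline_gb: float, actual_gb: float) -> float:
    """Percentage reduction of actual_gb relative to baseline_gb."""
    return 100.0 * (1.0 - actual_gb / baseline_gb)


def speedup(baseline_s: float, actual_s: float) -> float:
    """How many times faster actual_s is than baseline_s."""
    return baseline_s / actual_s


# (params in billions, reported VRAM in GB, reported cold start in s,
#  bitsandbytes cold start in s; the pairing of the last column is assumed)
REPORTED = {
    "7B": (7.0, 5.2, 3.9, 45.0),
    "32B": (32.0, 19.3, 21.4, 120.0),
}

if __name__ == "__main__":
    for name, (params_b, vram_gb, cold_s, bnb_s) in REPORTED.items():
        base_gb = fp16_weight_gb(params_b)
        print(
            f"{name}: FP16 baseline {base_gb:.0f} GB, "
            f"reduction {reduction_pct(base_gb, vram_gb):.0f}%, "
            f"cold start {speedup(bnb_s, cold_s):.1f}x faster than bitsandbytes"
        )
```

Under those assumptions the reductions round to 63% for 7B and 70% for 32B, which matches the numbers in the post.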

