Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

I've been building ZSE (Z Server Engine) for the past few weeks — an open-source LLM inference engine focused on two things nobody has fully solved together: memory efficiency and fast cold starts. The problem I was trying to solve: Running a 32B model normally requires ~64 GB VRAM. Most developers don't have that. And even when quantization helps with memory, cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts — which kills serverless and autoscaling use cases. What ZSE does differently: Fits 32B in 19.3 GB VRAM (70% reduction vs FP16) — runs on a single A100-40GB Fits 7B in 5.2 GB VRAM (63% reduction) — runs on consumer GPUs Native .zse pre-quantized format with memory-mapped weights: 3.9s cold start for 7B, 21.4s for 32B — vs 45s and 120s with bitsandbytes, ~30s for vLLM All benchmarks verified on Modal A100-80GB (Feb 2026) It ships with: OpenAI-compatible API server (drop-in replacement) Interactive CLI (zse serve, zse chat, zse conve

Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

Related Articles

The Difference between `let`, `var` and `const`

Circulation Metrics Framework for Living Systems

Red Rooms makes online poker as thrilling as its serial killer

Don’t Know What Project to Build? Here Are Developer Projects That Actually Make You Better

Why Most Developers Stay Broke

Related Articles

How-To
The Difference between `let`, `var` and `const`
Medium Programming • 2d ago

How-To
Circulation Metrics Framework for Living Systems
Medium Programming • 2d ago

How-To
Red Rooms makes online poker as thrilling as its serial killer
The Verge • 2d ago

How-To
Don’t Know What Project to Build? Here Are Developer Projects That Actually Make You Better
Medium Programming • 2d ago

How-To
Why Most Developers Stay Broke
Medium Programming • 2d ago