
Complete Guide to llm-d CNCF Sandbox — Kubernetes-Native Distributed LLM Inference Framework

At KubeCon Europe 2026 in Amsterdam, IBM Research, Red Hat, and Google Cloud jointly donated llm-d to the CNCF as a Sandbox project. Backed by founding partners including NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, llm-d is a distributed inference framework designed to run large language model (LLM) inference at production scale on Kubernetes.

If you've served models with vLLM or managed inference endpoints with KServe, you've likely felt the gap: vLLM is powerful but hits scaling walls as a single Pod, while KServe provides high-level abstractions but lacks inference-aware routing. llm-d fills exactly this gap as a middleware layer, delivering disaggregated serving, hierarchical KV cache offloading, and prefix-cache-aware routing — all Kubernetes-native.

The Three Bottlenecks llm-d Solves

Running LLM inference in production consistently hits three core bottlenecks
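The excerpt cuts off before listing the bottlenecks, but the prefix-cache-aware routing mentioned above is worth making concrete. The sketch below is a minimal, generic illustration of the idea (route each request to the replica whose KV cache already covers the longest prefix of the incoming token sequence), not llm-d's actual scheduler API; all names here are hypothetical.

```python
from typing import Dict, List


def longest_common_prefix(a: List[int], b: List[int]) -> int:
    """Length of the shared token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(request_tokens: List[int],
                 replica_caches: Dict[str, List[List[int]]]) -> str:
    """Route to the replica whose cached sequences cover the longest
    prefix of the request, so prefill work on those tokens is skipped."""
    best_replica, best_hit = None, -1
    for replica, cached_seqs in replica_caches.items():
        hit = max((longest_common_prefix(request_tokens, seq)
                   for seq in cached_seqs), default=0)
        if hit > best_hit:
            best_replica, best_hit = replica, hit
    return best_replica


# Toy example: replica "b" already served a request sharing a 3-token prefix.
caches = {
    "a": [[1, 9, 9]],
    "b": [[1, 2, 3, 7]],
}
print(pick_replica([1, 2, 3, 4], caches))  # -> b
```

In a real system the "cache contents" would be tracked approximately (e.g. via hashed prefix blocks reported by each vLLM replica) rather than as full token lists, but the routing decision has the same shape.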
Continue reading on Dev.to


