
LLMKube Now Deploys Any Inference Engine, Not Just llama.cpp
LLMKube started as a Kubernetes operator for llama.cpp. You define a Model, define an InferenceService, and the controller handles GPU scheduling, health probes, model downloads, and Prometheus metrics. It works well for GGUF models.

But llama.cpp isn't the only inference engine. vLLM has PagedAttention. TGI has continuous batching. PersonaPlex does real-time voice AI. Triton serves multi-framework models. Locking the operator to one runtime limits what you can deploy. v0.6.0 changes that with pluggable runtime backends.

The Problem

Before v0.6.0, the controller's constructDeployment() was hardcoded to llama.cpp. Container name, image, command-line args, health probes, model provisioning: everything assumed llama.cpp. If you wanted to deploy vLLM, you had to create a Kubernetes Deployment by hand, outside of LLMKube.

The Fix

A RuntimeBackend interface that each inference engine implements:

```go
type RuntimeBackend interface {
	ContainerName() string
	DefaultImage() string
	DefaultPort() int32
	// ... (interface continues; truncated in this excerpt)
}
```
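To make the shape of the abstraction concrete, here is a minimal sketch of how two backends could satisfy that interface and how the controller might select one by runtime name. Only the three methods shown in the excerpt are used; the backend type names, image tags, and the llama.cpp port are illustrative assumptions, not LLMKube's actual defaults (8000 is vLLM's usual serving port).

```go
package main

import "fmt"

// RuntimeBackend mirrors the three methods visible in the excerpt above.
type RuntimeBackend interface {
	ContainerName() string
	DefaultImage() string
	DefaultPort() int32
}

// vllmBackend is a hypothetical vLLM backend; the image tag is an assumption.
type vllmBackend struct{}

func (vllmBackend) ContainerName() string { return "vllm" }
func (vllmBackend) DefaultImage() string  { return "vllm/vllm-openai:latest" } // assumed tag
func (vllmBackend) DefaultPort() int32    { return 8000 }                      // vLLM's usual default

// llamaCppBackend sketches the original runtime behind the same interface;
// image and port here are likewise illustrative.
type llamaCppBackend struct{}

func (llamaCppBackend) ContainerName() string { return "llama-cpp" }
func (llamaCppBackend) DefaultImage() string  { return "ghcr.io/ggml-org/llama.cpp:server" } // assumed tag
func (llamaCppBackend) DefaultPort() int32    { return 8080 }

func main() {
	// A controller could look up the backend for the requested runtime
	// and use it when constructing the Deployment, instead of hardcoding
	// llama.cpp container details.
	backends := map[string]RuntimeBackend{
		"vllm":      vllmBackend{},
		"llama.cpp": llamaCppBackend{},
	}
	b := backends["vllm"]
	fmt.Printf("container=%s image=%s port=%d\n",
		b.ContainerName(), b.DefaultImage(), b.DefaultPort())
}
```

The value of the interface is that constructDeployment() only needs the lookup map: adding a new engine means adding one implementation, not another branch in the controller.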
Continue reading on Dev.to


