llama.cpp on Kubernetes: The Guide I Wish Existed
How-To · DevOps


Christopher Maher, via Dev.to DevOps

It started at my kitchen table. I was spending an evening on my laptop, fascinated by how LLMs actually work under the hood. Not the API calls, not the chat interfaces, but the actual inference process. I installed Ollama on my Mac, pulled a model, and within a few hours I was completely hooked. If you've done this yourself, you know the feeling. A language model running on your own hardware. No API keys, no usage limits, no data leaving your network. Just you and the model.

Ollama made it easy to get started, but I quickly wanted to understand what was happening underneath. That led me to llama.cpp, which Ollama uses under the hood, and that's where things really clicked. I could see exactly how the model was being loaded, how layers were offloaded to the GPU, how the inference loop worked. I went from curious to obsessed pretty quickly.

But then the questions started piling up. How do I serve this to my team? How do I run multiple models? What happens when I want to use the NVIDIA GP…
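As a taste of the mechanics described above, here is a minimal sketch of serving a model with llama.cpp's built-in `llama-server`, offloading transformer layers to the GPU. The model path, layer count, and port are illustrative placeholders, not values from the article.

```shell
# Minimal sketch: serve a GGUF model over HTTP with llama.cpp's llama-server.
# -m      : path to the model file (placeholder)
# -ngl    : number of layers to offload to the GPU (tune to your VRAM)
# --host  : 0.0.0.0 makes the server reachable from other machines
# --port  : where the OpenAI-compatible HTTP API listens
llama-server -m ./models/model.gguf -ngl 35 --host 0.0.0.0 --port 8080
```

Running this exposes a chat-completions-style endpoint on port 8080, which is the building block the article's Kubernetes questions (serving to a team, running multiple models) are about.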

Continue reading on Dev.to DevOps
