llama.cpp on Kubernetes: The Guide I Wish Existed
How-To · DevOps


Christopher Maher, via Dev.to DevOps

It started at my kitchen table. I was spending an evening on my laptop, fascinated by how LLMs actually work under the hood. Not the API calls, not the chat interfaces, but the actual inference process. I installed Ollama on my Mac, pulled a model, and within a few hours I was completely hooked. If you've done this yourself, you know the feeling. A language model running on your own hardware. No API keys, no usage limits, no data leaving your network. Just you and the model.

Ollama made it easy to get started, but I quickly wanted to understand what was happening underneath. That led me to llama.cpp, which Ollama uses under the hood, and that's where things really clicked. I could see exactly how the model was being loaded, how layers were offloaded to the GPU, how the inference loop worked. I went from curious to obsessed pretty quickly.

But then the questions started piling up. How do I serve this to my team? How do I run multiple models? What happens when I want to use the NVIDIA GP…
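As a taste of the mechanics described above, here is a minimal sketch of serving a model with llama.cpp's built-in `llama-server`, offloading transformer layers to the GPU. The model path, layer count, and port are illustrative placeholders, not values from the article.

```shell
# Minimal sketch: serve a GGUF model over HTTP with llama.cpp's llama-server.
# -m      : path to the model file (placeholder)
# -ngl    : number of layers to offload to the GPU (tune to your VRAM)
# --host  : 0.0.0.0 makes the server reachable from other machines
# --port  : where the OpenAI-compatible HTTP API listens
llama-server -m ./models/model.gguf -ngl 35 --host 0.0.0.0 --port 8080
```

Running this exposes a chat-completions-style endpoint on port 8080, which is the building block the article's Kubernetes questions (serving to a team, running multiple models) are about.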

Continue reading on Dev.to DevOps
