Can You Self-Host an Efficient AI at Home or for Your Company?
Introduction

This started with a simple goal: run a genuinely useful local LLM on a home setup with a 12GB GPU. On paper, that sounds like "pick a model and press run." In reality, it turned into a chain of practical engineering trade-offs across hardware, runtime setup, memory limits, and model quality.

This write-up traces the path I took from first boot to a usable daily LLM. It goes through the messy parts first (driver issues, environment friction, runtime decisions), then the model-side experiments (an 8B baseline, a quantized 20B, offloading, and quantization + offloading), plus a bonus test with AirLLM. A minimal sketch of what quantization + offloading looks like in code appears at the end of this excerpt.

The main thread is simple: local LLMs are absolutely workable now, but "can run" and "feels good to use" are not the same thing. The episodes focus on where that gap appears, what improved it the most, and what still costs latency, RAM, or reliability when pushing beyond VRAM limits.

Episode 1 — The rig (hardware gotchas)

The very first step was supposed to be simple: a 12GB GPU was
Continue reading on Dev.to Beginners
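
For readers who want to see what "quantization + offloading" means in practice before the episodes continue, here is a minimal sketch built on the Hugging Face transformers, accelerate, and bitsandbytes stack. The model ID, memory caps, and library choices are illustrative assumptions, not the exact configuration used in these experiments; the idea is simply to load weights in 4-bit and let whatever no longer fits in 12GB of VRAM spill into system RAM.

```python
# Minimal sketch: 4-bit quantization plus CPU offload for a model that does not fit in 12GB of VRAM.
# Assumes the transformers, accelerate, and bitsandbytes packages are installed.
# The model ID and memory caps below are placeholders, not this article's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-20b-model"  # placeholder; substitute the checkpoint you actually use

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # store weights as 4-bit NF4, roughly a quarter of the fp16 footprint
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,    # run the matmuls in bf16
    llm_int8_enable_fp32_cpu_offload=True,    # allow layers that overflow the GPU to sit on the CPU
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                        # accelerate fills the GPU first, then spills to system RAM
    max_memory={0: "11GiB", "cpu": "32GiB"},  # cap GPU use below 12GB to leave room for the KV cache
)

prompt = "Explain the trade-off between quantization and offloading on a 12GB GPU."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The same idea applies to llama.cpp-style runtimes, where a GGUF quantization plus the n_gpu_layers setting decides how many layers stay on the GPU and how many run from system RAM.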