Can You Self-Host an Efficient AI at Home or for Your Company?
Introduction

This started with a simple goal: run a genuinely useful local LLM on a home setup with a 12GB GPU. On paper, that sounds like "pick a model and press run." In reality, it turned into a chain of practical engineering trade-offs across hardware, runtime setup, memory limits, and model quality.

This write-up traces the path I took from first boot to a usable daily LLM. It goes through the messy parts first (driver issues, environment friction, runtime decisions), then the model-side experiments (an 8B baseline, a quantized 20B, offloading, and quantization + offloading), plus a bonus test with AirLLM. A minimal sketch of what quantization + offloading looks like in code appears at the end of this excerpt.

The main thread is simple: local LLMs are absolutely workable now, but "can run" and "feels good to use" are not the same thing. The episodes focus on where that gap appears, what improved it the most, and what still costs latency, RAM, or reliability when pushing beyond VRAM limits.

Episode 1 — The rig (hardware gotchas)

The very first step was supposed to be simple: a 12GB GPU was
Continue reading on Dev.to Beginners
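
For readers who want to see what "quantization + offloading" means in practice before the episodes continue, here is a minimal sketch built on the Hugging Face transformers, accelerate, and bitsandbytes stack. The model ID, memory caps, and library choices are illustrative assumptions, not the exact configuration used in these experiments; the idea is simply to load weights in 4-bit and let whatever no longer fits in 12GB of VRAM spill into system RAM.

```python
# Minimal sketch: 4-bit quantization plus CPU offload for a model that does not fit in 12GB of VRAM.
# Assumes the transformers, accelerate, and bitsandbytes packages are installed.
# The model ID and memory caps below are placeholders, not this article's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-20b-model"  # placeholder; substitute the checkpoint you actually use

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # store weights as 4-bit NF4, roughly a quarter of the fp16 footprint
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,    # run the matmuls in bf16
    llm_int8_enable_fp32_cpu_offload=True,    # allow layers that overflow the GPU to sit on the CPU
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                        # accelerate fills the GPU first, then spills to system RAM
    max_memory={0: "11GiB", "cpu": "32GiB"},  # cap GPU use below 12GB to leave room for the KV cache
)

prompt = "Explain the trade-off between quantization and offloading on a 12GB GPU."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The same idea applies to llama.cpp-style runtimes, where a GGUF quantization plus the n_gpu_layers setting decides how many layers stay on the GPU and how many run from system RAM.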