
The Hitchhiker's Guide to Running Agentic Systems Locally
Engineering a Hybrid LLM Router for Production Agentic Systems

Every agentic system eventually confronts the same wall: intelligence costs latency, and latency destroys experience. The standard prescription — throw more compute at it — is lazy engineering disguised as ambition. After months of iterating on agentic workflows on an Arch Linux rig, I found a third path. Not faster models. Not cheaper models. A smarter layer that decides which model to use — and when.

Small open-weight models are excellent for routine tasks: fast, private, and inexpensive to run. Their limitation appears when prompts require multi-step reasoning, structured output, or strict tool use. Such a model has what I call a reasoning ceiling — a hard limit where multi-step logical deduction collapses into the confidence loop: the model executes the wrong tool with complete conviction, or hallucinates a JSON schema that does not exist. Frontier APIs — DeepSeek V3.2, GPT-4o — solve the ceiling problem. But the
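The routing layer described above can be sketched as a small classifier that keeps routine prompts on the local model and escalates reasoning-heavy ones to a frontier API. This is a minimal illustration, not the author's implementation: the signal words, the token estimate, and the threshold are all assumptions invented for the sketch.

```python
from dataclasses import dataclass

# Hypothetical heuristics for spotting prompts likely to hit the
# reasoning ceiling (multi-step logic, structured output, tool use).
REASONING_SIGNALS = (
    "step by step", "prove", "json", "schema", "tool", "plan", "debug",
)

@dataclass
class Route:
    target: str  # "local" (open-weight model) or "frontier" (API)
    reason: str

def route_prompt(prompt: str, max_local_tokens: int = 300) -> Route:
    """Pick a backend for one prompt.

    Escalate when the prompt is long or mentions structured output /
    multi-step reasoning; otherwise stay on the cheap local model.
    """
    text = prompt.lower()
    # Crude token estimate: roughly 4 characters per token.
    est_tokens = len(prompt) // 4
    if est_tokens > max_local_tokens:
        return Route("frontier", f"long prompt (~{est_tokens} tokens)")
    for signal in REASONING_SIGNALS:
        if signal in text:
            return Route("frontier", f"reasoning signal: {signal!r}")
    return Route("local", "routine prompt")

if __name__ == "__main__":
    print(route_prompt("What is the capital of France?").target)
    print(route_prompt("Return a JSON schema for the invoice tool").target)
```

In practice the heuristic layer would be replaced or augmented by a learned classifier, but even a static rule set like this captures the core idea: the router, not the model, is where latency and cost are traded off.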
Continue reading on Dev.to



