
Running Gemma 4 next to your agent runtime: notes from a small shop
My brother Brandon and I run RapidClaw. Most days it's just the two of us, a handful of customers, and a few agents chugging along in production. A few months ago we started putting small open-weight models on the same box as the agent runtime: mostly Gemma 4, a bit of Phi-4 for comparison, some Qwen.

This is a short write-up of what's actually worked and what hasn't. Nothing revolutionary here. I'm writing it because I searched for "agent + local Gemma" a bunch of times last quarter and mostly found benchmark posts, not lived-experience notes.

The thing we noticed

The newest small models are small enough that they fit on the same machine as the agent loop. That's the whole observation. Gemma 4 4B runs fine on a 24 GB GPU next to a Node process running our agent code. Phi-4 14B is tight but works. A year ago you needed a separate inference box, which meant a network hop, which meant we just paid a hosted API and moved on. Now the tradeoff is different. You can keep the hosted model fo
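To make the "same box" idea concrete, here is a minimal sketch of the kind of split we mean. None of the names, thresholds, or endpoints below are from our actual code; they're illustrative. It assumes a local inference server exposing an OpenAI-compatible endpoint on loopback, so short tasks never leave the machine:

```javascript
// Hypothetical routing sketch: cheap, short completions stay on the local
// Gemma instance; long-context or tool-heavy calls go to a hosted API.
// All names and numbers here are made up for illustration.

const LOCAL_MAX_TOKENS = 2048; // assumed context budget for the local model

function pickBackend(task) {
  // task: { promptTokens: number, needsTools: boolean }
  if (task.promptTokens > LOCAL_MAX_TOKENS || task.needsTools) {
    return "hosted";
  }
  return "local";
}

// The agent loop then calls whichever endpoint the router picked.
// For "local" this is a loopback call, i.e. no network hop at all.
async function complete(task, prompt) {
  const base =
    pickBackend(task) === "local"
      ? "http://127.0.0.1:8080/v1" // assumed local inference server
      : "https://api.example.com/v1"; // placeholder hosted endpoint
  const res = await fetch(`${base}/chat/completions`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      model: "gemma-4-4b", // illustrative model name
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const body = await res.json();
  return body.choices[0].message.content;
}
```

The router is the part worth keeping even if your thresholds differ: once the model shares the box, the interesting decision is no longer "which API" but "does this call need to leave the machine at all."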

