
Running Gemma 4 next to your agent runtime: notes from a small shop
My brother Brandon and I run RapidClaw. Most days it's just the two of us, a handful of customers, and a few agents chugging along in production. A few months ago we started putting small open-weight models on the same box as the agent runtime: mostly Gemma 4, a bit of Phi-4 for comparison, some Qwen.

This is a short write-up of what's actually worked and what hasn't. Nothing revolutionary here. I'm writing it because I searched for "agent + local Gemma" a bunch of times last quarter and mostly found benchmark posts, not lived-experience notes.

The thing we noticed

The newest small models are small enough that they fit on the same machine as the agent loop. That's the whole observation. Gemma 4 4B runs fine on a 24 GB GPU next to a Node process running our agent code. Phi-4 14B is tight but works. A year ago you needed a separate inference box, which meant a network hop, which meant we just paid a hosted API and moved on. Now the tradeoff is different. You can keep the hosted model fo
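To make the "same box" idea concrete, here is a minimal sketch of the kind of split we mean. None of the names, thresholds, or endpoints below are from our actual code; they're illustrative. It assumes a local inference server exposing an OpenAI-compatible endpoint on loopback, so short tasks never leave the machine:

```javascript
// Hypothetical routing sketch: cheap, short completions stay on the local
// Gemma instance; long-context or tool-heavy calls go to a hosted API.
// All names and numbers here are made up for illustration.

const LOCAL_MAX_TOKENS = 2048; // assumed context budget for the local model

function pickBackend(task) {
  // task: { promptTokens: number, needsTools: boolean }
  if (task.promptTokens > LOCAL_MAX_TOKENS || task.needsTools) {
    return "hosted";
  }
  return "local";
}

// The agent loop then calls whichever endpoint the router picked.
// For "local" this is a loopback call, i.e. no network hop at all.
async function complete(task, prompt) {
  const base =
    pickBackend(task) === "local"
      ? "http://127.0.0.1:8080/v1" // assumed local inference server
      : "https://api.example.com/v1"; // placeholder hosted endpoint
  const res = await fetch(`${base}/chat/completions`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      model: "gemma-4-4b", // illustrative model name
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const body = await res.json();
  return body.choices[0].message.content;
}
```

The router is the part worth keeping even if your thresholds differ: once the model shares the box, the interesting decision is no longer "which API" but "does this call need to leave the machine at all."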

