
teaching a cat to use a mouse — literally
I created this post to enter the Gemini Live Agent Challenge, and honestly this was the feature that almost broke us. Our user's feedback was blunt: "Why aren't you using vision to control the mouse directly?" And then, more specifically: "The cursor should glide smoothly, find its target visually, move again, and click — that's the WOW factor."

He was right. Sending keyboard shortcuts and accessibility API calls is reliable, but it looks like a script running. A cursor that glides across the screen, finds its target visually, and clicks — that looks like intelligence. So we built the LOOK → DECIDE → MOVE → CLICK → VERIFY pipeline.

the five-stage pipeline

Here's what happens when VibeCat decides to click something on your screen:

LOOK — VibeCat captures a screenshot via ScreenCaptureKit. This isn't a polling loop; it's triggered when the gateway's proactive companion decides an action is needed. The screenshot goes to Gemini
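The shape of that loop can be sketched in a few lines. This is a minimal illustration, not VibeCat's actual code: the `capture`, `locate`, `click`, and `verify` callables stand in for the real ScreenCaptureKit and Gemini plumbing, and the easing function is one common way to get the "glide" the feedback asked for.

```python
# Hypothetical sketch of a LOOK -> DECIDE -> MOVE -> CLICK -> VERIFY loop.
# All names here are illustrative stand-ins, not VibeCat's real API.
from dataclasses import dataclass


@dataclass
class Point:
    x: float
    y: float


def ease_in_out(t: float) -> float:
    """Smoothstep easing: the cursor accelerates, then settles at the target."""
    return t * t * (3 - 2 * t)


def glide_path(start: Point, target: Point, steps: int = 30) -> list[Point]:
    """Waypoints for a smooth cursor glide (the MOVE stage)."""
    path = []
    for i in range(1, steps + 1):
        t = ease_in_out(i / steps)
        path.append(Point(start.x + (target.x - start.x) * t,
                          start.y + (target.y - start.y) * t))
    return path


def click_pipeline(capture, locate, cursor, click, verify, max_retries=2):
    """One pass per attempt: LOOK, DECIDE, MOVE, CLICK, then VERIFY by re-capturing."""
    for _ in range(max_retries + 1):
        frame = capture()                 # LOOK: grab the current screen
        target = locate(frame)            # DECIDE: vision model picks the target point
        if target is None:
            return False
        for p in glide_path(cursor.position, target):
            cursor.move_to(p)             # MOVE: step along the eased waypoints
        click(target)                     # CLICK
        if verify(capture()):             # VERIFY: confirm the click had an effect
            return True
    return False
```

The VERIFY stage is what makes the retry loop safe: instead of trusting that the click landed, the pipeline re-captures the screen and checks, looping back to LOOK if it didn't.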
Continue reading on Dev.to



