
How I Built Sally, A Voice-First Accessibility Agent Powered by Gemini
I built a desktop app that lets people control any website using only their voice. You talk, it takes a screenshot, sends it to Gemini 2.5 Flash, gets back a structured action, runs it in the browser, and repeats. The whole time it's narrating what it's doing out loud. Here's how it came together for the Gemini Live Agent Challenge.

The Problem

Picture this: you can't use a mouse. Maybe you can't use a keyboard either. You might have a repetitive strain injury, a motor impairment, or honestly you might just have a broken wrist. The web doesn't really care. It expects you to click tiny buttons, scroll precisely, type into fields, drag things around.

There are screen readers and voice control tools out there, but they all seem to expect you to learn their language. Memorize commands. Know what things are called in the DOM. Fight with dictation software that mishears every other word. I wanted something where you could just say what you want: "Go to YouTube and search for lo-fi beats." No
Continue reading on Dev.to
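The screenshot → Gemini → structured action → execute loop the excerpt describes can be sketched roughly as below. This is purely illustrative: the action schema and the function names (`take_screenshot`, `ask_gemini`, `run_action`, `narrate`) are hypothetical stand-ins, and the real browser driver, the Gemini 2.5 Flash call, and the text-to-speech narration are stubbed out with scripted replies.

```python
import json
from collections import deque

# Scripted replies simulating Gemini returning one structured JSON
# action per turn (hypothetical action schema, not the real one).
SCRIPTED = deque([
    '{"action": "navigate", "target": "https://youtube.com", "say": "Opening YouTube"}',
    '{"action": "type", "target": "search box", "text": "lo-fi beats", "say": "Searching"}',
    '{"action": "done", "say": "Here are the results"}',
])

def take_screenshot() -> bytes:
    return b"png-bytes"                    # real version: capture the browser viewport

def ask_gemini(screenshot: bytes, goal: str) -> dict:
    return json.loads(SCRIPTED.popleft())  # real version: call Gemini 2.5 Flash

def run_action(action: dict) -> None:
    pass                                   # real version: drive the browser

def narrate(text: str) -> None:
    print(text)                            # real version: text-to-speech out loud

def agent_loop(goal: str) -> list[str]:
    """Screenshot -> model -> action -> repeat, narrating every step."""
    spoken = []
    while True:
        action = ask_gemini(take_screenshot(), goal)
        narrate(action["say"])
        spoken.append(action["say"])
        if action["action"] == "done":
            return spoken
        run_action(action)

# Usage: agent_loop("Go to YouTube and search for lo-fi beats")
```

The key design choice the article hints at is that narration happens on every iteration, so the user hears what the agent is doing before each browser action lands.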




