
I Built a Voice-First AI Photo & Document Editor with the Gemini Live API — Here's How
This article was created for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge

There's a version of photo editing where you don't touch a single slider. You click the part of the image you want to change, say what you want out loud, and watch it happen. That's Say Edit.

Over the past few weeks I built Say Edit, a voice-first AI workspace that lets you edit images and navigate documents entirely by speaking. It's powered by the Gemini Live API and Gemini image generation, and deployed on Google Cloud Run. This article is the behind-the-scenes story of how I built it, what broke, and what surprised me.

The Core Idea

Most AI tools make you type. You open a chat window, describe what you want, wait for a response, copy it somewhere, repeat. I wanted to eliminate every one of those steps for two use cases I found genuinely painful:

Editing a photo — you know exactly what you want to change, but you have to hunt through menus, masks, and sliders to get
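To make the click-and-speak idea concrete, here is a minimal hypothetical sketch of how a click location and a live voice transcript might be paired into a structured edit request before being handed to an image model. This is not Say Edit's actual code; the names, types, and normalization choices are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class EditRequest:
    """A structured edit: where the user clicked, and what they asked for."""
    x: float          # click position, normalized to the 0..1 range
    y: float
    instruction: str  # the spoken command, as transcribed


def build_edit_request(click_xy: tuple[float, float], transcript: str) -> EditRequest:
    """Pair the most recent click with the spoken instruction.

    Illustrative only: a real pipeline would also track timing, so that a
    click and an utterance are paired only when they occur close together.
    """
    x, y = click_xy
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("click coordinates must be normalized to [0, 1]")
    return EditRequest(x=x, y=y, instruction=transcript.strip())


# Example: the user clicks the sky and says what they want changed.
req = build_edit_request((0.42, 0.15), " make the sky more dramatic ")
print(req.instruction)  # → "make the sky more dramatic"
```

The point of the intermediate structure is that voice and pointer input arrive on different channels; joining them into one explicit request object is what lets a single spoken sentence act on a specific region of the image.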
Continue reading on Dev.to
