
I Built an AI That Sees Your Screen and Speaks Your Answers, Here's How
This post was created to enter the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge

## The Problem With Typing

Every day we spend hours switching between tabs, typing search queries, copying text, and manually reading through pages to find answers. What if you could just look at your screen, ask a question out loud, and get the answer spoken back to you instantly?

That's exactly what I built. Voice UI Navigator is an AI agent that:

- 👁️ **Sees** your browser screen using Gemini multimodal vision
- 🎙️ **Listens** to your voice via the Gemini Live API
- 🔍 **Searches** Google in real time to research answers
- 🔊 **Speaks** results back to you naturally

No typing. No DOM access. No browser extensions. Just pure visual AI understanding, the same way a human would look at a screen.

Live demo: https://voice-navigator-913580598688.us-central1.run.app

GitHub: https://github.com/Kamaumbugua-dev/GEMINI_CO
