
Building Wand: A Voice + Hand Pointer Live Agent with Google ADK and Gemini Live
What if you could control your browser the way you'd direct a person — just point at something and say what you want? That question led us to build Wand, a live AI agent that lets you browse the web entirely through voice and hand gestures. No keyboard. No mouse. Point your finger at a YouTube thumbnail and say "play this" — it clicks. Point at a map and say "zoom in here" — it scrolls. Say "what is this?" — it takes a screenshot, annotates it with your cursor position, and tells you what you're pointing at. Here's how we built it.

The Architecture: Cloud Agent, Local Browser

The first design decision was where things live. The agent — the part that listens, reasons, and decides what to do — runs on Google Cloud Run, powered by Google ADK and Gemini 2.5 Flash Native Audio via the Gemini Live API. This gives us a stable, always-on backend that any client can connect to without needing API keys or local GPU resources. The browser, microphone, speaker, and webcam stay on the local machine.
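The article cuts off before showing any code, but the point-and-click flow it describes — tracking a fingertip on the webcam and turning it into a browser click position — can be sketched as a small pure function. This is an illustrative sketch, not Wand's actual implementation: the function name, parameters, and the mirroring assumption are ours, and it assumes the hand tracker reports a normalized fingertip coordinate in the range 0..1 (as landmark models like MediaPipe typically do).

```python
def pointer_to_viewport(norm_x: float, norm_y: float,
                        viewport_w: int, viewport_h: int,
                        mirror: bool = True) -> tuple[int, int]:
    """Map a normalized fingertip coordinate (0..1, webcam space)
    to a pixel position in the browser viewport.

    Webcam images are usually mirrored relative to the screen the
    user is pointing at, so the x axis is flipped by default.
    """
    if mirror:
        norm_x = 1.0 - norm_x
    # Clamp so a fingertip slightly outside the frame still maps
    # to a valid on-screen position instead of raising or overshooting.
    norm_x = min(max(norm_x, 0.0), 1.0)
    norm_y = min(max(norm_y, 0.0), 1.0)
    return round(norm_x * (viewport_w - 1)), round(norm_y * (viewport_h - 1))
```

A voice command like "play this" would then resolve against whatever element sits under the returned pixel position — the same coordinate the agent can draw onto a screenshot when answering "what is this?".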
Continue reading on Dev.to
