
Running AI in the Browser with Gemma 4 (No API, No Server)
Most “AI apps” today are just API wrappers. That’s fine… until you care about latency, cost, or privacy. I’ve been exploring what it actually takes to run LLMs inside the browser, and Gemma 4 completely changes what’s possible. This is not theory; this is what actually works.

Why Gemma 4 is different

Gemma 4 isn’t just another model release. It’s designed for:

• on-device inference
• agentic workflows
• multimodal tasks (text, audio, vision)

The important part? 👉 The E2B / E4B variants are small enough to run inside a browser tab. No backend required.

⚙️ How it actually runs in the browser

Let’s cut the hype. There are only two real approaches:

1. MediaPipe LLM Inference (recommended)

• WebAssembly + WebGPU under the hood
• Load the model like this:

```js
import { FilesetResolver, LlmInference } from "@mediapipe/tasks-genai";

// Resolve the WASM backend first, then point at the model file.
const genai = await FilesetResolver.forGenAiTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm"
);
const llm = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: "/models/gemma-4-E2B.litertlm" },
});
```

That’s it. You now have:

• streaming responses
• token control
• temperature, top-k, etc.

2. WebGPU (Transformers.js style)

More control, more pain.
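The “streaming responses” bullet above corresponds to MediaPipe’s progress-listener form of `generateResponse(prompt, callback)`. Here is a minimal sketch of collecting those partial results; `makeStreamCollector` is a hypothetical helper, not part of the MediaPipe API, and the `(partialResult, done)` callback shape is assumed from the tasks-genai docs.

```javascript
// Hypothetical helper: accumulates streamed partial results into one string.
// The (partialResult, done) shape matches MediaPipe's progress listener.
function makeStreamCollector(onDone) {
  let text = "";
  return {
    listener(partialResult, done) {
      text += partialResult;
      if (done && onDone) onDone(text);
    },
    result: () => text,
  };
}

// With a real model you would wire it up like:
//   llm.generateResponse("Write a haiku about browsers.", collector.listener);
// Simulated chunks stand in for model output here:
const collector = makeStreamCollector((full) => console.log("done:", full));
["Brow", "sers ", "dream"].forEach((chunk, i, arr) =>
  collector.listener(chunk, i === arr.length - 1)
);
```

In a real UI you would append each `partialResult` to the page as it arrives instead of buffering the whole string.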
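Both approaches lean on WebGPU, with a plain WebAssembly fallback when it isn’t available. A quick sketch of the capability check, assuming only standard browser APIs — `pickBackend` is a made-up helper, and the presence of `navigator.gpu` is the usual WebGPU feature-detection signal, not a guarantee an adapter will actually be granted:

```javascript
// Hypothetical helper: choose an inference backend from browser capabilities.
// `navigator.gpu` existing is the standard WebGPU feature check; a real app
// should also await navigator.gpu.requestAdapter() before committing.
function pickBackend(nav) {
  return nav && "gpu" in nav ? "webgpu" : "wasm";
}

// In the browser you would call pickBackend(navigator).
console.log(pickBackend({ gpu: {} })); // "webgpu"
console.log(pickBackend({}));          // "wasm"
```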



