Self-Hosting a Vision Model on a Datacenter GPU: BAGEL-7B-MoT on a Tesla V100
I have an AI character named Sophia who lives inside a Godot game. She talks, she listens, she plays music, she controls the smart lights. And now she can see. Not "process an image if you upload one" see. Real-time webcam-capture, face-detection, emotion-reading see. She looks through the camera, describes what she sees, reads your mood, and responds accordingly.

The vision model powering all of this is BAGEL-7B-MoT running on a Tesla V100 16GB GPU. Getting it there was not straightforward.

Why We Ditched LLaVA

We were running LLaVA 1.6 (7B) via Ollama for months. It worked, but it had problems:

- Slow -- 8-15 seconds for a basic description on a V100
- Hallucination-heavy -- it would confidently describe objects that weren't there
- No generation capability -- LLaVA is understand-only. No image editing, no generation
- Stale architecture -- the LLaVA project hasn't seen meaningful updates

BAGEL-7B-MoT (Mixture of Transformers) from ByteDance Research offered everything we needed: image understanding, image generation, and image editing in a single model.
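For context, talking to a vision model through Ollama means POSTing to its `/api/generate` endpoint with the image passed as a base64 string in the `images` field. A minimal sketch of building that request body (the model name, prompt, and placeholder frame bytes are illustrative, not taken from the article's actual setup):

```python
import base64
import json

def build_ollama_vision_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build the JSON body for Ollama's /api/generate endpoint.

    Ollama accepts images as a list of base64-encoded strings in the
    "images" field, alongside the usual text prompt.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # wait for the full response instead of streaming tokens
    }
    return json.dumps(payload)

# In the real setup the bytes would come from a webcam frame; this is a stub.
fake_frame = b"\xff\xd8\xff\xe0 placeholder bytes, not a real JPEG"
body = build_ollama_vision_request("llava:7b", "Describe what you see.", fake_frame)
print(body)
```

Sending this body to `http://localhost:11434/api/generate` is all a LLaVA round trip takes, which is part of why Ollama was attractive despite the latency.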
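The excerpt doesn't show how a 7B model was made to fit on the V100's 16 GB, but back-of-the-envelope arithmetic makes the constraint clear: at 2 bytes per parameter (fp16/bf16), the weights alone nearly fill the card, which is why quantization (e.g. 4-bit) is a common route. The figures below are rough arithmetic, not measurements from the article:

```python
def weight_footprint_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate GPU memory needed just to hold the model weights."""
    return n_params * bytes_per_param / 2**30

PARAMS = 7e9  # BAGEL-7B-MoT: roughly 7 billion parameters

fp16_gib = weight_footprint_gib(PARAMS, 2.0)   # fp16/bf16: 2 bytes per parameter
int4_gib = weight_footprint_gib(PARAMS, 0.5)   # 4-bit quantization: 0.5 bytes/param

# fp16 weights barely fit in 16 GiB, leaving little room for the KV cache,
# activations, and the vision encoder; 4-bit leaves comfortable headroom.
print(f"fp16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{int4_gib:.1f} GiB")
```

This is the core tension of "datacenter GPU" self-hosting on older cards: a V100 has server-class reliability but only 16 GB, so memory budgeting decides what is runnable.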




