
How I Designed a Camera Scoring System for VLM-Based Activity Recognition — and Why It Looks Different in the Real World
Part 2 of the "Training-Free Home Robot" series. Part 1 covered why fixed ceiling-mounted nodes ended up as the perception foundation. This post goes deep on one specific algorithm: how the system decides which camera angle to use for each behavioral episode, and what that decision looks like when you leave the simulation.

Once I'd worked through why fixed global cameras made sense — a conclusion I reached the hard way, starting from genuine skepticism about the requirement — the next problem was entirely mine: given twelve candidate viewpoints, which one do you actually use? My advisor specified the input modality. The selection algorithm, the scoring weights, the hard FOV gate, the fallback logic — none of that was given to me. This post is that design work: where each decision came from, what tradeoffs it makes, and what changes when you move from a Unity simulation to a real room.

The Core Problem

My system recognizes what a user is doing — drinking, reading, typing — by sending a
Continue reading on Dev.to
