
"How We Run AI Inference on $0/month (And Still Ship Fast)"
We run real-time multimodal AI inference for biometric emotion detection (audio, video, and text), and our cloud AI bill is $0/month. Not close to zero. Zero. While most teams burn thousands on GPU instances just to prototype, we've architected a system that leans on strategic caching, client-side compute, and model distillation to avoid cloud costs entirely. The key insight? You don't need GPT-4-level infrastructure to ship impactful AI, especially when you shift inference off the server at the right layers.

Our stack uses ONNX Runtime in WebAssembly to run distilled versions of our emotion classification models directly in the browser and in mobile clients. Raw sensor data (microphone, camera) is processed locally, either on-device with PyTorch Mobile or through WebAssembly-bound models via MediaPipe and TensorFlow.js. Only anonymized, low-dimensional embeddings (think 512-d vectors instead of video streams) get sent to our backend. These are cached aggressively with Redis and used for stateless batch retr…
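The caching idea above can be sketched in a few lines. This is a minimal, hypothetical Python sketch, not our production code: an in-process dict stands in for Redis, the classifier is a dummy lambda standing in for the distilled model, and the cache key is a hash of the rounded embedding so near-identical inputs hit the same entry. All names here (`EmbeddingCache`, `get_or_compute`) are illustrative assumptions.

```python
import hashlib
import struct

class EmbeddingCache:
    """Toy stand-in for the Redis layer: maps embedding fingerprints
    to previously computed classification results."""

    def __init__(self, precision: int = 2):
        self.precision = precision  # decimals kept when fingerprinting
        self._store = {}            # stands in for Redis GET/SET
        self.hits = 0
        self.misses = 0

    def _key(self, embedding):
        # Round so near-identical embeddings share a cache key.
        rounded = [round(x, self.precision) for x in embedding]
        raw = struct.pack(f"{len(rounded)}f", *rounded)
        return hashlib.sha256(raw).hexdigest()

    def get_or_compute(self, embedding, classify):
        key = self._key(embedding)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = classify(embedding)   # only runs on a cache miss
        self._store[key] = result
        return result

# Usage: a dummy classifier; real code would invoke the distilled model.
classify = lambda emb: "happy" if sum(emb) > 0 else "neutral"
cache = EmbeddingCache()
print(cache.get_or_compute([0.12, -0.05, 0.33], classify))          # computed
print(cache.get_or_compute([0.121, -0.049, 0.331], classify))       # cache hit
```

The rounding step is the important design choice: because the client only ships low-dimensional embeddings, small jitter between frames collapses onto the same fingerprint, so repeated inputs never touch the model twice.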
Continue reading on Dev.to
