
Building an AI fallback system: when to use GPT-4o, when to fall back to Haiku, when to skip the LLM entirely
Not every query deserves a frontier model. A user asking "what is your cancellation policy?" does not need GPT-4o to generate the answer. A rules engine or a simple database lookup handles it in 5 milliseconds at zero token cost.

We learned this the hard way. Our first production deployment sent everything through GPT-4o. The quality was great. The bill was $7,200/month for a feature that should have cost $2,000. Worse, 60% of those queries were simple enough that a smaller model (or no model at all) would have produced identical output.

This article covers the three-tier fallback system we built: a rules engine for deterministic queries, a cheap model (Claude Haiku) for simple generation, and a frontier model (GPT-4o) for complex reasoning. Stack: Node.js 20, TypeScript.

The three tiers

Here is the routing logic:

```
Incoming query
          ↓
┌─────────────────────┐
│ Tier 0: Rules       │ → deterministic lookup, no LLM
│ (FAQ, status, data) │   cost: $0, latency: <10ms
└─────────┬───────────┘
          ↓ not matched
┌─────────────────────┐
│ Tier 1: Haiku       │ → simple generation
└─────────┬───────────┘
          ↓ too complex
┌─────────────────────┐
│ Tier 2: GPT-4o      │ → complex reasoning
└─────────────────────┘
```
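The tier selection described above can be sketched in TypeScript. This is a minimal illustration, not the authors' actual code: the FAQ table, the `isSimple` heuristic, and the tier names are assumptions standing in for whatever rules engine and complexity check a real deployment would use.

```typescript
// Hypothetical three-tier router. Tier 0 answers directly from a
// deterministic lookup; Tiers 1 and 2 only decide which model to call.
type Tier = "rules" | "haiku" | "gpt4o";

interface RouteResult {
  tier: Tier;
  answer?: string; // populated only when Tier 0 resolves the query itself
}

// Tier 0: deterministic lookup for known FAQ-style queries (illustrative).
const faqTable: Record<string, string> = {
  "what is your cancellation policy?":
    "You can cancel any time from your account settings.",
};

// Placeholder complexity check; a real system might use a classifier
// or token-count heuristics instead of this toy rule.
function isSimple(query: string): boolean {
  return query.split(/\s+/).length < 20 && !query.includes("compare");
}

export function route(query: string): RouteResult {
  const normalized = query.trim().toLowerCase();

  // Tier 0: rules engine — zero token cost, sub-10ms latency.
  const hit = faqTable[normalized];
  if (hit !== undefined) return { tier: "rules", answer: hit };

  // Tier 1: cheap model (Claude Haiku) for simple generation.
  if (isSimple(normalized)) return { tier: "haiku" };

  // Tier 2: frontier model (GPT-4o) for complex reasoning.
  return { tier: "gpt4o" };
}
```

The key design point is that the router itself never touches an LLM: tier selection is pure, synchronous, and testable, and the actual model calls happen only after a tier has been chosen.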



