I Built a Semantic Cache That Cuts LLM API Costs by 72% - What Actually Worked and What Didn't
## The Results First

100 real Anthropic API calls. Three architectures tested. One that actually worked.

**V3 Hybrid Engine, 100-query live benchmark:**

| Metric | Value |
|---|---|
| Cache hit rate | 87.5% |
| Total cost | $0.24 (vs $0.87 without cache) |
| Cost savings | 71.8% |
| Zero-cost direct hits | 54 queries |
| Adapted (cheap model) | 35 queries |
| Full misses | 9 queries |
| Tokens saved | 179,445 |

The warm-up curve is the real story. The cache starts cold, with a 42.9% hit rate on the first 10 queries. By query 20: 90%. By query 31, every single query hits the cache. Queries 31–40 cost $0.00: not approximately zero, literally zero dollars.

The system is called Intent Atoms. It sits between your application and any LLM API, using FAISS vector search and MPNet embeddings to match incoming queries against cached responses. When it finds a match, it returns the cached response in ~97ms instead of waiting 8–25 seconds for a fresh generation.

But the 87.5% number is the end of the story. The beginning was much uglier.

## V1: The Elegant Idea That Cos
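To make the lookup flow concrete, here is a minimal sketch of a semantic cache. This is not the Intent Atoms code: the real system uses MPNet sentence embeddings and FAISS for nearest-neighbor search, which I've swapped for a toy bag-of-words embedder and brute-force cosine similarity so the example runs standalone. The `SemanticCache` class, the `0.8` threshold, and the sample queries are all illustrative assumptions.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words unit vector; stand-in for MPNet sentence embeddings.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {w: v / norm for w, v in counts.items()}

def cosine(a, b):
    # Dot product of two sparse unit vectors.
    return sum(v * b.get(w, 0.0) for w, v in a.items())

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # similarity above this counts as a hit
        self.entries = []           # list of (embedding, cached_response)

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]          # cache hit: skip the LLM call entirely
        return None                 # miss: caller falls through to the API

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("how do I reverse a list in python", "Use list[::-1] or reversed().")
hit = cache.get("how do I reverse a list in python?")  # near-duplicate query
miss = cache.get("completely unrelated text about rust")
```

In production the `entries` list becomes a FAISS index so lookup stays fast at scale, and a miss would trigger the real API call before being written back with `put`.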


