Caching LLM Extractions Without Lying: Conformal Gates + a Reasoning Budget Allocator


By Daniel Romitelli, via Dev.to

The extraction pipeline processed 2,400 documents overnight. Cost: $380. The next morning I diffed the inputs against the previous batch: 87% were near-duplicates with trivial whitespace changes. I'd burned $330 re-extracting answers I already had. Not because the cache missed. Because my cache had no right to hit.

A TTL can tell you when something is old. It cannot tell you when something is wrong. And for an AI extraction pipeline, "wrong" is the only thing that matters.

So I rebuilt the caching layer around a different idea: caching is a statistical validity problem, not an expiry problem. Then I paired it with a second idea that sounds obvious until you implement it: reasoning depth is a budget allocation problem, not a model selection problem.

What I ended up with in production is a two-stage system:

1. Confidence-gated cache: per-selector reuse vs. partial rebuild, using a multi-signal score and conformal thresholds.
2. Reasoning budget allocator: per-span compute decisions under a
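The "near-duplicates with trivial whitespace changes" anecdote suggests keying the cache on normalized content rather than raw bytes, so trivially-reformatted inputs hash to the same entry. A minimal sketch; the collapse-all-whitespace rule and the `cache_key` helper are my illustration, not the author's actual normalization:

```python
import hashlib
import re


def cache_key(doc_text: str) -> str:
    # Hypothetical normalization: collapse runs of whitespace to a single
    # space so documents that differ only in formatting share a cache key.
    normalized = re.sub(r"\s+", " ", doc_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

With this, `cache_key("a  b\n c")` equals `cache_key("a b c")`, so a reformatted document becomes a cache hit instead of a fresh $0.16 extraction.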
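The excerpt cuts off before showing how the conformal thresholds work. Here is a minimal sketch of a standard split-conformal gate under my own assumptions: a per-selector confidence score in [0, 1], nonconformity defined as 1 minus that score, and a calibration set of nonconformity scores from extractions known to be correct. The function names `conformal_threshold` and `should_reuse` are illustrative, not the author's API:

```python
import math


def conformal_threshold(cal_scores: list[float], alpha: float = 0.05) -> float:
    # Split-conformal quantile: with n calibration nonconformity scores,
    # the ceil((n + 1) * (1 - alpha)) / n empirical quantile gives a
    # marginal error rate of at most alpha on exchangeable new inputs.
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too little calibration data to certify this alpha: never reuse.
        return float("-inf")
    return sorted(cal_scores)[k - 1]


def should_reuse(confidence: float, threshold: float) -> bool:
    # Gate the cache: reuse the stored extraction only when the
    # nonconformity (1 - confidence) falls within the calibrated threshold.
    return (1.0 - confidence) <= threshold
```

The appeal of this construction is that the reuse decision carries a distribution-free guarantee: if the calibration set is representative, at most an alpha fraction of gated cache hits serve a wrong answer, which is exactly the "right to hit" the TTL never provided.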

Continue reading on Dev.to
