
Managing LLM context in a real application
Ahnii! This post covers how Claudriel, a Waaseyaa-based AI assistant SaaS, handles LLM context in production: conversation trimming, per-task turn budgets, model degradation on rate limits, prompt caching, and per-turn token telemetry.

The problem with unbounded context

Every message you send to an LLM API costs tokens. Long-running chat sessions accumulate history fast. Left unchecked, a single active session can push input token counts into the tens of thousands per turn, even before the model generates a word.

Claudriel runs multiple agent turns per user request: reading email, checking calendars, querying entities. Each turn sends the full conversation history plus tool definitions. Without guardrails, costs compound and rate limits trigger unpredictably.

Trimming conversation history before it reaches the API

The first line of defense is ChatStreamController::trimConversationHistory(). Before any message goes to the API, the history is trimmed to a cap of 20 messages.
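The post names ChatStreamController::trimConversationHistory() but the teaser cuts off before showing its body. A minimal sketch of the idea follows; the production code is PHP, this is an illustrative Python version, and the function name, message shape, and the choice to always preserve a leading system prompt are assumptions, not the actual implementation.

```python
MAX_MESSAGES = 20  # the cap mentioned in the post


def trim_conversation_history(messages, cap=MAX_MESSAGES):
    """Keep only the most recent `cap` messages before calling the LLM API.

    Assumption: a leading system prompt, if present, is preserved and
    does not count against the cap.
    """
    if messages and messages[0].get("role") == "system":
        system, rest = messages[:1], messages[1:]
    else:
        system, rest = [], messages
    return system + rest[-cap:]


# Hypothetical usage: a system prompt plus 30 accumulated turns
history = [{"role": "system", "content": "You are Claudriel."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(30)]

trimmed = trim_conversation_history(history)
# keeps the system prompt plus the 20 most recent messages
```

Dropping the oldest turns (rather than the newest) keeps the model grounded in the current exchange, at the cost of losing early context; the article's later sections on caching and telemetry presumably address the remaining cost.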
Continue reading on Dev.to


