
Managing LLM context in a real application
Ahnii! This post covers how Claudriel, a Waaseyaa-based AI assistant SaaS, handles LLM context in production: conversation trimming, per-task turn budgets, model degradation on rate limits, prompt caching, and per-turn token telemetry.

The problem with unbounded context

Every message you send to an LLM API costs tokens. Long-running chat sessions accumulate history fast. Left unchecked, a single active session can push input token counts into the tens of thousands per turn, even before the model generates a word.

Claudriel runs multiple agent turns per user request: reading email, checking calendars, querying entities. Each turn sends the full conversation history plus tool definitions. Without guardrails, costs compound and rate limits trigger unpredictably.

Trimming conversation history before it reaches the API

The first line of defense is ChatStreamController::trimConversationHistory(). Before any message goes to the API, the history is trimmed to a cap of 20 messages.
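The post names ChatStreamController::trimConversationHistory() but the teaser cuts off before showing its body. A minimal sketch of the idea follows; the production code is PHP, this is an illustrative Python version, and the function name, message shape, and the choice to always preserve a leading system prompt are assumptions, not the actual implementation.

```python
MAX_MESSAGES = 20  # the cap mentioned in the post


def trim_conversation_history(messages, cap=MAX_MESSAGES):
    """Keep only the most recent `cap` messages before calling the LLM API.

    Assumption: a leading system prompt, if present, is preserved and
    does not count against the cap.
    """
    if messages and messages[0].get("role") == "system":
        system, rest = messages[:1], messages[1:]
    else:
        system, rest = [], messages
    return system + rest[-cap:]


# Hypothetical usage: a system prompt plus 30 accumulated turns
history = [{"role": "system", "content": "You are Claudriel."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(30)]

trimmed = trim_conversation_history(history)
# keeps the system prompt plus the 20 most recent messages
```

Dropping the oldest turns (rather than the newest) keeps the model grounded in the current exchange, at the cost of losing early context; the article's later sections on caching and telemetry presumably address the remaining cost.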
Continue reading on Dev.to


