LLM Evals on Real Traffic — Not Just Test Suites
How-To · DevOps


via Dev.to DevOps (grepture)

The eval gap

Most teams know they should be evaluating their LLM outputs. Few actually do it in production. The typical setup looks like this: you build a test suite with a handful of golden examples, run it in CI before deploys, and hope those examples are representative of what real users actually send. Sometimes they are. Often they're not.

The prompts users write in production are messier, longer, and weirder than anything in your test fixtures. The edge cases that matter most are the ones you didn't think to include. Meanwhile, the interesting data — the actual requests and responses flowing through your AI pipeline every day — sits in logs that nobody looks at until something breaks. We think evals should run where the data already is.

Evals on production traffic

At Grepture, we built an AI gateway that sits in the request path of every LLM call — handling PII redaction, prompt management, cost tracking, and observability. That means every request and response is already logged.
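Concretely, once traffic is already logged at the gateway, running evals on it can be as simple as sampling records and scoring each one. Here is a minimal sketch in Python; the record fields, the sampling rate, and the trivial `evaluate` check are illustrative assumptions (not Grepture's actual schema or API), and a real pipeline would typically swap in an LLM-as-judge or task-specific checks:

```python
import random

# Hypothetical shape of a logged gateway record; the field names
# are assumptions for illustration, not an actual log schema.
LOGS = [
    {"prompt": "Summarize our refund policy", "response": "Refunds are issued within 14 days."},
    {"prompt": "Translate 'hello' to French", "response": "bonjour"},
    {"prompt": "What is 2+2?", "response": ""},
]

def evaluate(record):
    """Trivial stand-in for a real eval (e.g. an LLM-as-judge call):
    flag empty or suspiciously short responses as failures."""
    return len(record["response"].strip()) >= 3

def sample_and_score(logs, rate=1.0, seed=42):
    """Score a random sample of logged records and return the pass rate.

    In production you would sample a small fraction (rate << 1.0) so the
    eval cost stays bounded regardless of traffic volume.
    """
    rng = random.Random(seed)
    sample = [r for r in logs if rng.random() < rate]
    if not sample:
        return None
    passed = sum(evaluate(r) for r in sample)
    return passed / len(sample)

print(sample_and_score(LOGS))  # with rate=1.0, every record is scored
```

The key design point is that the eval runs against data the gateway already captured, so no separate instrumentation or golden test set is needed; sampling keeps the cost proportional to the rate, not to total traffic.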

Continue reading on Dev.to DevOps


