From zero evals to a working multimodal evaluation in 30 minutes using LangWatch Skills


By Manouk Draisma, via Dev.to

How I went from "it works on my machine" to measurable agent quality using LangWatch Skills, Jupyter notebooks, and a path to production on AWS.

The problem nobody talks about

You built an agent. It uses tools, handles multimodal inputs, answers questions from a knowledge base. You demo it to your team and it works great. Ship it. Three days later: the satellite image analysis returns garbage NDVI estimates. The knowledge base tool stops getting called for calibration questions; the LLM just wings it. Nobody noticed because there were no tests. This is the gap between "I have an agent" and "I have a reliable agent." LangWatch fills it.

What I built

The InField Agent is a weather station advisory system built with the Strands Agents SDK. It has three multimodal capabilities:

- Knowledge base: calibration procedures for Davis Instruments weather stations
- Station status: fleet inventory, battery health, reporting gaps
- Satellite imagery: NDVI estimation from satellite images using vision mod…
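For background on the satellite-imagery capability mentioned above: NDVI (Normalized Difference Vegetation Index) is a standard vegetation metric computed from near-infrared and red reflectance. The sketch below is mine, not from the article, and the `ndvi` function is a hypothetical helper for illustration only:

```python
# Illustrative NDVI calculation (not the article's code).
# NDVI = (NIR - Red) / (NIR + Red), ranging from -1 to 1;
# dense, healthy vegetation typically scores above ~0.3.

def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index for one pixel."""
    denom = nir + red
    if denom == 0:
        return 0.0  # avoid division by zero on fully dark pixels
    return (nir - red) / denom

# High NIR reflectance relative to red suggests vegetation.
print(round(ndvi(0.50, 0.08), 3))  # 0.724
```

An evaluation like the one the article describes could then assert that the agent's NDVI estimate for a given image stays within a tolerance of this kind of ground-truth computation.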

Continue reading on Dev.to


