Automated Evals for LLM Agents

Learn how to build automated evals for LLM‑based agents, covering dataset structuring, stability checks, edge‑case handling, and reproducible regression detection.

Overview

In this live demo, I’ll show how we built an automated evaluation system (“evals”) to test LLM-powered AI agents in production. We’ll walk through our setup: how we structured test datasets, automated stability checks, handled edge cases, and implemented reproducible quality benchmarks. I’ll then run an eval live from the terminal to show how we detect regressions and keep our agents’ performance consistent.
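
To make the moving parts concrete, here is a minimal Python sketch of what such an eval loop might look like. It is not the system shown in the demo. Every name in it (`EvalCase`, `pass_rate`, `run_evals`, `find_regressions`, `stub_agent`) is hypothetical, the substring check stands in for whatever grading a real setup would use, and the stub agent replaces a real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    """One row of a hypothetical test dataset."""
    case_id: str
    prompt: str
    expected_substring: str  # assumed grading rule: output must contain this


Agent = Callable[[str], str]


def pass_rate(agent: Agent, case: EvalCase, runs: int = 5) -> float:
    """Stability check: run the same case several times and return the pass fraction."""
    passes = sum(
        case.expected_substring.lower() in agent(case.prompt).lower()
        for _ in range(runs)
    )
    return passes / runs


def run_evals(agent: Agent, cases: List[EvalCase], runs: int = 5) -> Dict[str, float]:
    """Score every case and return a mapping of case_id to pass rate."""
    return {case.case_id: pass_rate(agent, case, runs) for case in cases}


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.1,
) -> List[str]:
    """Return case ids whose pass rate fell more than `tolerance` below the baseline.

    A case missing from `current` is treated as scoring 0.0.
    """
    return sorted(
        case_id
        for case_id, base in baseline.items()
        if current.get(case_id, 0.0) < base - tolerance
    )


def stub_agent(prompt: str) -> str:
    """Deterministic stand-in for a real LLM-backed agent (assumption for this sketch)."""
    return "Refund issued." if "refund" in prompt.lower() else "I am not sure."


CASES = [
    EvalCase("refund-basic", "Customer asks for a refund", "refund"),
    EvalCase("empty-input", "", "not sure"),  # edge case: empty prompt
]

if __name__ == "__main__":
    current = run_evals(stub_agent, CASES, runs=3)
    baseline = {"refund-basic": 1.0, "empty-input": 1.0}  # illustrative stored scores
    print("scores:", current)
    print("regressions:", find_regressions(baseline, current))
```

Repeating each case and comparing pass rates against a stored baseline, rather than checking a single output, is one way to keep results reproducible when the agent itself is non-deterministic. The run count and tolerance here are arbitrary illustrative values.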

Links

https://www.nextapp.co/
Autonomous AI analyzes customer videos for insights, automating workflows.

Tech stack