Choosing The Right Eval Matters More Than Tuning The Judge
Laurie Voss of Arize argues that agentic applications need the same engineering discipline as other production software: instrumentation, inspectable traces, targeted evals, and controlled experiments, not a handful of prompts that “look right.” In a hands-on workshop using a financial analysis agent, Voss shows how teams should read traces before writing evals, classify failures by root cause, and combine deterministic checks, LLM judges, custom rubrics, and human-labeled meta-evaluation. His central warning is that the choice of eval can dominate the result: the same agent scored 0 out of 13 on a correctness eval and 13 out of 13 on a faithfulness eval because the first judge was asking the wrong question.
AI Engineer · May 14, 2026 · 24 min read
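The correctness-versus-faithfulness gap can be made concrete with a small sketch. Here a correctness judge compares the agent's answer against a reference label, while a faithfulness judge only asks whether the answer is supported by the retrieved context; when the label is stale or wrong, the same answer fails one eval and passes the other. All names and data below are invented for illustration, and real judges typically use an LLM rather than string matching.

```python
import re

# Hypothetical scenario: the reference label is out of date, but the
# agent faithfully reports what its retrieved document actually says.
reference_label = "Q2 revenue grew 12%"                       # stale/wrong label
retrieved_context = "The 10-Q states Q2 revenue grew 8% year over year."
agent_answer = "Q2 revenue grew 8%."

def correctness_eval(answer: str, label: str) -> bool:
    """Does the figure in the answer match the figure in the reference label?"""
    return re.findall(r"\d+%", answer) == re.findall(r"\d+%", label)

def faithfulness_eval(answer: str, context: str) -> bool:
    """Is every figure in the answer present in the retrieved context?"""
    figures = re.findall(r"\d+%", answer)
    return all(f in context for f in figures)

print(correctness_eval(agent_answer, reference_label))    # False: wrong per the label
print(faithfulness_eval(agent_answer, retrieved_context)) # True: faithful to the source
```

Run over a suite of such cases, the two judges can produce the 0-of-13 versus 13-of-13 split the talk describes: they are scoring different questions, not different agents.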