Production Agents Need Semantic Observability Beyond Offline Evals
Raindrop’s workshop argues that production agents need a different observability model from conventional software monitoring or offline evals. Zubin Kumar, Danny Gollapalli, and Ben Hylak make the case that teams should track both explicit telemetry, such as tool errors, latency, and cost, and implicit signals, such as user frustration, refusals, task failures, capability gaps, and unusual workarounds. Their framework treats real production behavior as the primary surface for finding regressions, running experiments, and catching failures that never surface as clean exceptions.
AI Engineer · May 7, 2026 · 17 min read
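The explicit/implicit split described above can be sketched as a simple event taxonomy. This is a minimal illustration, not Raindrop's actual schema or API; the event kinds and class names are hypothetical, chosen to mirror the signal categories the workshop lists.

```python
from dataclasses import dataclass, field

# Hypothetical taxonomy: explicit telemetry is emitted directly by the
# system, while implicit signals must be inferred from user and agent
# behavior in production traces.
EXPLICIT = {"tool_error", "latency_ms", "cost_usd"}
IMPLICIT = {"user_frustration", "refusal", "task_failure",
            "capability_gap", "unusual_workaround"}

@dataclass
class AgentEvent:
    kind: str            # one of the signal kinds above
    payload: dict = field(default_factory=dict)

def classify(event: AgentEvent) -> str:
    """Route an event into the explicit or implicit signal bucket."""
    if event.kind in EXPLICIT:
        return "explicit"
    if event.kind in IMPLICIT:
        return "implicit"
    return "unknown"

if __name__ == "__main__":
    events = [
        AgentEvent("tool_error", {"tool": "search", "status": 500}),
        AgentEvent("user_frustration", {"evidence": "user repeats request"}),
    ]
    for e in events:
        print(e.kind, "->", classify(e))
```

The point of the split is operational: explicit telemetry can be alerted on directly, while implicit signals typically require a classifier or heuristic over conversation traces before they can feed dashboards and regression checks.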