Telemetry, Not Code, Audits Nondeterministic AI Agents

Dat NgoAI EngineerSunday, June 7, 202610 min read

Dat Ngo of Arize argues that LLM observability has to account for failures in execution paths, not just broken components, because agents can call tools in different orders, branch, loop, and change behavior across runs. In his account, traces become the audit record for nondeterministic systems, while evaluation must combine model judges, human feedback, golden datasets, deterministic checks, and business metrics at the right scope. Arize’s stated direction is to connect observability, evals, experimentation, and improvement into an increasingly automated loop.

A bad output may be a path failure, not a component failure

Dat Ngo frames AI application development as “software reimagined”: familiar engineering patterns in a new flavor, often presented as magic but ultimately not magic. The difference is that the systems now being debugged are nondeterministic. A harness or agent may take different execution paths across runs, call tools in different orders, loop through different branches, and expose failures that do not appear in the application code itself.

That is why observability comes first. In Ngo’s account, observability answers the basic engineering question: what is happening inside the thing that was built? For LLM systems, the answer is not only logs or metrics. It is traces, spans, sessions, agent paths, distributional views of execution branches, latency, errors, and the relationships among components.

Arize’s approach, he says, is built around OpenTelemetry. The company’s OpenInference project is shown as “a set of conventions and plugins” complementary to OpenTelemetry for tracing AI applications. Ngo describes the operational pattern as simple: add an auto-instrumenter, often “one line of code,” and it observes what is happening around a framework or SDK, creates OpenTelemetry traces and spans, and turns those into views of the application’s behavior.

The core distinction is that traces become the audit record for agents.

If you've ever seen a trace or span, it's basically the audit record of what did my agent do? Because now we know that code doesn't audit agents or harnesses, it's actually the telemetry that does that.

Dat Ngo · Source

That audit record matters because the failure mode can be procedural rather than local. Ngo gives the example of an agent calling tool B before tool A, even though B depends on A. The agent’s output may be poor not because either component is individually broken, but because the trajectory is wrong. The fix may be to add context instructing the system that A must precede B. Without telemetry, the developer may only see a bad result; with telemetry, the developer can see the path that produced it.

Observability has to work at more than one layer

The most granular observability layer is the trace-and-span view: a record of the components the system called, their inputs and outputs, and their behavior during one run. This is the level at which an engineer can inspect a single LLM call, tool call, or component.

Dat Ngo also separates out sessions. Enterprises often need to zoom out from the individual call and inspect state across a back-and-forth interaction. In a chat application, for example, a session can show the human and AI turns over time. That level of visibility is useful when the question is not “did this tool call work?” but “was the user satisfied?” or “were all of the user’s questions answered?”

Ngo also emphasizes distributional views of agents. A single trace shows what happened once. A distributional view asks what happens across many instantiations of an agent: which branches are taken, how often traffic moves down one path versus another, where loops occur, and whether a specific branch contains a latency-heavy component. In the Arize interface, he shows agent graphs and path views intended to expose these patterns across executions rather than only within one run.

This matters because the agent’s shape is not always static. It may branch depending on the model’s decisions. It may loop. It may call one sequence of tools on one run and another sequence on the next. Observability has to make those possible paths legible, not just display a flat log.

The same layered view extends to analytics dashboards. Ngo says “analytics aren’t dead”; enterprise teams still want real-time views of what their agents look like, including metrics such as hallucination evaluation score and latency. The point is not that dashboards alone solve the problem, but that they sit alongside traces, sessions, trajectories, and distributional views as different windows into the same system.

Evaluation is the process of turning behavior into signal

Once a team can see the system, the next question is how to derive signal from it. Dat Ngo organizes evaluation into five “flavors”: LLM-as-judge, human feedback, golden datasets, deterministic or code-based checks, and business metrics. The slide he uses frames them as “LLM as a Judge,” “Code/Logic Based Evals,” “Annotations,” “User Based Feedback,” and “Business Metrics.”

Evaluation signal	What it measures in Ngo’s framing
LLM as judge	A model evaluates outputs or behavior against criteria, with complexity increasing as the judgment task becomes more specific.
Human feedback	End users, product managers, subject-matter experts, or other humans provide signal about whether the AI experience is useful or correct.
Golden datasets	Trusted labeled examples, often produced by domain experts, provide a quality reference for evaluation and for tuning judges.
Code or logic checks	Deterministic tests verify properties such as valid JSON, schema conformance, required fields, or non-null values.
Business metrics	The application is judged by whether it helps make money, save money, or save time.

Ngo’s five categories of evaluation signal for AI applications

LLM-as-judge is the most familiar of the five, but Ngo cautions that it can become complex. A judge may need to approximate the standards of a trusted person or dataset. One use of golden datasets, in his description, is to tune an LLM judge by asking whether it can approximate a domain expert’s labels or a trusted labeled set.

Human signal remains important because humans understand aspects of product experience that may not be reducible to a model call. Ngo includes both end users and internal roles here: product managers, subject-matter experts, and less technical stakeholders who understand what the AI experience should be. That signal may answer questions such as whether the user was satisfied, whether the response met domain expectations, or whether the interaction completed the intended job.

Deterministic evaluation is cheaper and often preferable when the question does not require a model. Ngo’s example is a paragraph-to-JSON workflow: if the system is supposed to produce a JSON payload, an evaluator can check whether the output is valid JSON, whether it conforms to a schema, whether it includes required fields, and whether those fields are non-null. No LLM judge is needed for that class of question.

Business metrics are the outermost signal. Ngo reduces them to three broad purposes: making more money, saving money, or saving time. Those metrics are not separate from technical evaluation. They are part of the same evaluation landscape because production AI systems are built to produce business outcomes, not only technically plausible outputs.

The people who build the system and the people who know the domain need different interfaces

A recurring tension in Ngo’s account is organizational rather than purely technical. Dat Ngo describes two personas that come together in serious AI product development. Technical users — AI engineers, developers, data scientists — are good at building, automating, and creating frameworks. Domain experts — subject-matter experts and AI product managers — understand what the AI experience should be.

The work should be assigned accordingly. People who can code should be writing code and building the system. People who know the domain should define the experience, prompt expectations, and evaluation criteria. In practice, that means evaluation tooling should not require every domain expert to become a programmer.

Ngo shows AX supporting both modes. Technical users can attach and run evaluations programmatically. Nontechnical users can select a model, run an out-of-the-box evaluator template, customize an evaluation prompt, and test it against sample data in the UI. Evaluation cannot be owned only by engineering if the criteria for success are domain-specific.

This also explains why “evals” are more than an engineering checklist. They are a collaboration surface. A product manager or domain expert may define what good behavior means, while an engineer ensures the evaluator is wired into the system and run at the right scope.

The scope of an eval is as important as its type

A team may choose an LLM judge, a deterministic check, human annotation, or a business metric — but it must also decide whether the evaluation applies to a single span, multiple spans, an entire trajectory, or a session. That distinction between evaluation flavor and evaluation scope is one of Dat Ngo’s most concrete technical points.

A span-level evaluation inspects one component’s input and output. Ngo presents this as the simplest and most familiar case: look at one part of an LLM call or component and evaluate whether its output satisfies the requirement.

A multi-span evaluation requires data across several components. Ngo gives the example of evaluating how well agents pass data back and forth to each other. That cannot be judged from one component alone; it requires observing the handoffs among multiple agents or steps.

A trajectory evaluation considers the full sequence of spans. This is the level at which a team can ask whether the system called things in the right order to finish a business process. The tool-B-before-tool-A example belongs here: the individual calls may exist, but the trajectory is wrong because the dependency order is violated.

A session-level evaluation zooms out further, treating the interaction as a state machine. It can ask whether, across a conversation, the user ever became frustrated or whether all questions were answered. This is especially relevant when the application’s value depends on a complete interaction rather than a single generated answer.

Evaluation scope	Question it can answer
Span	Did this individual component or LLM call produce the right output for its input?
Multi-span	Did several components exchange or transform data correctly across a workflow?
Trajectory	Did the agent take the right sequence of actions to complete the business process?
Session	Did the overall conversation or stateful interaction satisfy the user’s needs?

Ngo distinguishes evaluation type from evaluation scope

Ngo warns against treating evaluation as exhaustive. Just because something can be evaluated does not mean it should be. Evals carry cost, whether in model calls, human review, engineering time, or operational complexity. The goal is the minimal set of evaluations that provides enough signal to know whether the application is working as intended.

Experiments are changes that must be tested against regressions

After observability and evaluation, the improvement loop turns on experiments. Dat Ngo describes a workflow that starts with traces if the team has them: find weak signal, missing behavior, or failures, then collect those examples into a dataset. If the team does not have traces, it can upload a dataset directly as input-output pairs.

Once examples are organized into datasets, teams can run experiments. Ngo defines experiments broadly as changes to the system: prompts, models, orchestration, configurations, and other components of the agent or harness. Arize supports running these experiments in the UI or programmatically.

The reason the experiment layer matters is regression. Ngo notes that in a nondeterministic world, a change that appears to fix one issue may produce two or three regressions elsewhere. Datasets built from observed failures or uploaded examples give teams a way to compare changes across runs rather than inspect a bad output once and move on.

Arize wants the observability loop to be automated

The future direction, in Dat Ngo’s telling, is not more dashboards for humans to manually inspect. Arize has exposed the primitives through a CLI and through tools and skills so coding agents can use them. He points to developers’ comfort with Claude Code and Cursor as part of the reason Arize has made those primitives available outside the manual UI.

The enterprise product also includes an AI layer called Alyx. Ngo says Alyx can be asked whether there are issues in the application. Because Arize has the trace data and hooks into the system, Alyx can plan and run tasks. In the interface he shows, Alyx is already working through issues, with visible task output including “No error spans detected” and “High latency outliers.”

Ngo’s stated ambition is stronger than assisted debugging.

Our ultimate goal as a company is actually to automate you out of this process. Observability, evals, experimentation and improvement. We think the whole flywheel is very much automatable.

Dat Ngo

In that future, users should not even have to choose their evals. An AI system should have enough context from traces and application behavior to create evaluations on the fly, decide when a new eval is needed because something changed, and continue monitoring the system. Ngo repeats his earlier distinction: it is not magic, but it should feel like magic.

Phoenix is the open-source path; AX is the enterprise platform

Dat Ngo distinguishes Arize’s two products by deployment model and audience. Arize Phoenix is the open-source option for engineering-first users. He emphasizes that Phoenix runs as a single container, can be deployed locally, and does not require a Kubernetes layer.

Arize AX is the enterprise product, described as generally reserved for large enterprises. Ngo mentions “Uber and Booking and Reddit” while describing the kinds of large enterprises that use AX.

The product split mirrors the broader operating model Ngo lays out. For teams building agents, the basic need is telemetry: a way to see what the system did. For larger organizations, the need expands into evaluation, experimentation, domain-expert workflows, analytics, and automation. Ngo’s position is that these are not separate categories, but connected parts of the same development loop.

AI Application Architecture Evals and Benchmarks Inference and Deployment AI in Operations Agents and Autonomy Open Models