
Production Analytics Finds Agent Failures That Standard Evals Miss

Sam Charrington · Scott Clark · The TWIML AI Podcast · Thursday, May 7, 2026 · 20 min read

Scott Clark, co-founder and chief executive of Distributional, argues that teams running LLM agents need to look beyond pre-production evals and dashboards of known metrics. His case is that the most consequential failures often emerge only in production, where agents interact with users, tools and changing models in ways teams did not know to test. Clark proposes an observability stack in which telemetry records what happened, monitoring tracks known signals, and analytics clusters trace behavior to surface unknown failure modes that can become new evals, guardrails, prompts or system fixes.

Production failures are not just missed benchmark cases

Reliability for LLM agents, in Scott Clark’s account, is not mainly a matter of squeezing another fraction of a percent out of a benchmark. It is a matter of understanding what an AI system is doing once real users, real tools, and real operational constraints are involved.

Distributional began with a narrower pre-production testing thesis: build better statistical and Bayesian distributional tests for AI systems that account for stochasticity, chaos, and non-stationarity. Clark says the company later shifted toward post-production analytics because the bottleneck was not that teams were too afraid to deploy. They were deploying into a fast-moving environment and needed to learn from production behavior quickly enough to keep up with it.

The logic comes from a problem Clark says he repeatedly saw in black-box optimization. SigOpt, his prior company, helped teams optimize objective functions. The optimizer would improve the number it was given. But customers often came back with a familiar complaint: the optimized metric improved while some unstated business concern got worse. The model became more biased, another important number declined, or the system overfit the formal target.

You made the number go up, but some other number that I didn’t tell you went down.

Scott Clark

Clark’s early response was the standard optimizer’s defense: the optimizer did what it was asked to do. The benefit of a black-box optimizer is that it can optimize anything; the danger is that it will blindly optimize anything. Over time, he came to see that as the real problem rather than someone else’s failure to specify the target. “Telling a computer what you actually want,” he says, “is actually an incredibly difficult thing to do.”

Agentic systems amplify that difficulty. A single model call may be easy enough to inspect. A deployed agent may involve multiple sub-agents, multiple LLM calls, external tools, retries, hidden failures, prompt instructions, user context, and model behavior that changes over time. Clark argues that a benchmark or hand-built eval can help, but it does not give the “last mile” of trust or understanding. Teams need to find production anti-patterns they did not know to test for.

One concrete anti-pattern he describes is not a generic hallucination, but a tool-use failure that looks superficially correct. An agent may respond as if it called a tool, and may even include reasoning that implies the tool was called, while the trace shows that no such call occurred. In another version, the tool call returns an error and the agent invents an answer rather than retrying. In a financial research agent, Clark says, this can look like the model behaving as if Nvidia’s stock price is probably unchanged since the last time it saw it, rather than actually looking it up.

That failure is hard to catch with a conventional eval that asks whether the final answer is on topic, whether the response looks plausible, or whether the reasoning appears to match the output. Those checks can all pass while the trace reveals the material failure: the tool call did not happen, or it failed and was papered over.
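To make that concrete, here is a minimal sketch of a trace-level consistency check for this anti-pattern, assuming traces are available as dictionaries containing a list of spans. The field names ("spans", "span_kind", "status", "output") and the textual markers are illustrative, not anything Clark or Distributional specify.

```python
# Flag traces whose answer reads as if a tool was called while the trace
# contains no successful tool call. Field names and markers are illustrative.

def claims_tool_use(answer: str) -> bool:
    """Crude heuristic: does the final answer talk as if data was fetched?"""
    markers = ("according to", "as of", "the latest price", "i looked up")
    return any(m in answer.lower() for m in markers)

def has_successful_tool_call(trace: dict) -> bool:
    """True if at least one tool span completed without an error status."""
    return any(
        span.get("span_kind") == "TOOL" and span.get("status") == "OK"
        for span in trace.get("spans", [])
    )

def flag_hallucinated_tool_use(trace: dict) -> bool:
    """The answer implies tool use that the trace cannot back up."""
    return claims_tool_use(trace.get("output", "")) and not has_successful_tool_call(trace)

# Given `traces`, a list of logged trace dicts:
# flagged = [t for t in traces if flag_hallucinated_tool_use(t)]
```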

For Clark, the lesson is not that evals are useless. It is that evals are downstream of discovery. Teams cannot build tests for every production failure mode by intuition alone, especially when user behavior and model behavior are both changing. They need a way to surface the patterns first, then decide whether those patterns should become evals, guardrails, training data, or product fixes.

Telemetry, monitoring, and analytics answer different questions

Clark describes observability for agents as a “Maslow’s hierarchy.” The base layer is telemetry: logging enough that a team can see what happened. Even dumping traces into S3 and looking at the head or tail is better than having no record. Telemetry is necessary for debugging and for verifying that the system functions at all.

The next layer is monitoring. Monitoring extracts known signals in real time: latency, number of tool calls, profanity, login errors, rejection rates, p95 latency, cost, or any other metric the team already knows it wants to track. Monitoring answers questions such as whether the site is down, whether latency doubled, or whether a known safety indicator spiked.

The top layer is analytics. Analytics is not primarily about whether a known metric crossed a threshold. It is about finding unknown unknowns: sub-patterns in the distribution of behavior that are different from the rest, even if no one had named the signal in advance. Clark gives the example of finding that 5% of agentic tool-call chains have a distinct signature from the other 95%. The interesting point is not necessarily that those traces are the slowest or most extreme. They may simply be “shaped differently.”

Layer | Primary question | Typical signals or outputs | Tradeoff
Telemetry | What happened? | Logs, traces, spans, raw tool-call records | Foundational visibility rather than interpretation
Monitoring | Did a known signal change? | Latency, tool-call count, profanity, rejection rate, cost, login errors | Timely but limited to metrics already named
Analytics | What did we not know to look for? | Behavioral clusters, shifted trace distributions, candidate insights, new evals or guardrails | Richer but less immediate; works over distributions
Clark’s hierarchy separates raw visibility, known-signal tracking, and discovery of unknown behavioral patterns.

That distinction matters because monitoring can tell a team that user happiness fell from 95% to 92%, but that alone is not actionable enough. Analytics might reveal that a subset of financial-research queries are hallucinating stock lookups because the agent is acting as if it called a market-data tool when it did not. The first observation is a symptom. The second is closer to a fixable cause.

Analytics is about trying to find what you don’t already know to look for, these unknown unknowns.

Scott Clark

Distributional focuses on analytics rather than monitoring, and Clark says that involves a tradeoff: richness over timeliness. Monitoring is built to say that a specific trace had a known property within milliseconds. Analytics instead says that a distribution of traces has shifted, or that a subset of traces differs from a larger population in a meaningful way. That requires looking at batches or distributions.

He compares the difference to the distinction between infrastructure monitoring and product analytics. Datadog can tell a team whether the service is down or latency is bad. A tool like Mixpanel or Statsig can reveal where users drop off in an aggregate funnel or how a cohort behaves. Both matter, but they operate at different levels. For agent systems, Clark expects analytics to sit alongside monitoring systems such as Braintrust or Datadog, not replace them.

This also determines when a team is ready for the kind of analytics Clark is describing. Because it sits at the top of the hierarchy, it can provide value only if the foundations exist. A team first needs logs. It then needs enough production usage, and typically some monitoring, before analytics can discover patterns in the distribution. Clark positions it for teams that already have agents in production and want to improve them iteratively, ideally with feedback loops that become increasingly automatic.

The first question is whether behavior is different, not whether it is bad

Sam Charrington presses Clark on how the system identifies a failure such as a missing tool call. Is it tied to reduced user satisfaction, to direct inspection of traces, or something else? Clark’s answer is that the first step is deliberately weaker than judging quality. Without prior information, he says, it is difficult to say whether behavior A is better than behavior B. It is mathematically easier to ask whether A is different from B.
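As an illustration of asking "different" rather than "better," a generic two-sample test can check whether one fingerprint dimension is distributed differently in a candidate subset than in the rest of the population. The sketch below uses SciPy's Kolmogorov-Smirnov test on per-trace tool-call counts; it is a stand-in for this kind of comparison, not a reproduction of Distributional's own statistical tests.

```python
# Ask only whether a subset's distribution differs from the rest of the
# population on one trace feature, before judging whether it is good or bad.
from scipy.stats import ks_2samp

def looks_different(subset_values, rest_values, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test on a single fingerprint dimension."""
    stat, p_value = ks_2samp(subset_values, rest_values)
    return p_value < alpha, stat

# Example: tool-call counts per trace for a suspect cluster vs. everything else.
different, effect = looks_different([0, 0, 1, 0, 0, 1], [3, 4, 2, 5, 3, 4, 2, 3])
```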

That “different first” approach drives the analytics workflow Clark describes. A cluster of traces may represent an anti-pattern: laziness, cheating, hallucinated tool use, or an unwanted effect of a system prompt. But it may also represent a good pattern: an emerging user need, an opportunity to develop a specialized agentic skill, or a behavior the product should encourage. Discovery precedes judgment.

The judgment step can then use LLMs. In Clark’s view, reasoning models can compare a subset of traces against a broader population and explain why they differ. Given the full trace, they may be able to say that the behavior probably was not intended and suggest fixes. The suggested fix might be as simple as adding an instruction to the system prompt requiring a tool call, or as operational as adding a caching layer to reduce 429 errors.
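A rough sketch of that judgment step might pack a few flagged traces and a few typical ones into a single comparison prompt for a reasoning model. The `call_llm` helper here is a hypothetical stand-in for whatever model client a team already uses, and the prompt wording is illustrative.

```python
# Build a prompt asking a reasoning model to explain how a flagged subset of
# traces differs from the broader population and to propose a fix.
import json

def build_comparison_prompt(subset_traces, baseline_traces, k=3):
    subset = json.dumps(subset_traces[:k], indent=2, default=str)
    baseline = json.dumps(baseline_traces[:k], indent=2, default=str)
    return (
        "These traces were flagged as behaving differently from the rest:\n"
        f"{subset}\n\n"
        "These traces are typical of the broader population:\n"
        f"{baseline}\n\n"
        "Explain how the flagged traces differ, whether the difference looks "
        "unintended, and suggest a prompt, guardrail, or system change if so."
    )

# insight = call_llm(build_comparison_prompt(flagged, typical))  # hypothetical client
```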

The analogy Clark uses is older machine learning practice. In fraud detection, a naive team might optimize for accuracy. Then it learns that precision and recall both matter. Then it learns that an F1 score can still miss the business problem if the one fraudulent transaction that gets through is worth a billion dollars. Eventually the training and monitoring objective has to include magnitude, timing, geography, customer impact, and other domain-specific considerations.

Historically, finding those refinements was data science work: humans inspected misclassifications, true positives, false positives, and feature distributions, then adjusted loss functions or added features. Clark argues that this is no longer feasible by hand for complex agent traces. The systems have too many moving parts, the data is too unstructured, and teams do not have time to manually inspect enough examples. He also argues that LLMs are well suited to comparing token-rich traces and pulling out recurring differences.

The result, in his account, is not that a machine automatically knows what the business wants. It is a pipeline for turning production differences into candidate insights: this subset is different, here are examples, here is a likely interpretation, here is a reason it may be bad or good, and here is a possible action.

A trace has to become a fingerprint before it can be clustered

The technical process Clark lays out starts with a constraint: raw traces are not, by themselves, a useful clustering substrate. Even when traces are structured according to conventions named in the interview, such as OpenInference or GenAI, they contain nested JSON, text, tool calls, responses, and metadata. Before analytics can compare them, the system has to map that structure into a vector representation: a fingerprint for a trace or session.

That vector can contain different kinds of dimensions: floats, integers, categories, counts, sequences, and evaluation-derived features. Clark names examples such as the number of tool calls, the sequence of tool calls, and evals on text.

The mapping begins with what can be extracted from the semantic convention in use. Clark says there are fields that can be pulled out of OpenInference or the GenAI semantic convention, along with default evals Distributional has found useful. Users can then extend the representation for their own context. If a team knows what task completion means for its agent, it can add that as another dimension in the vector.

Analytics can also lead to more vectors being added over time. In practical terms, once a recurring pattern is discovered, the team may turn it into a new eval or feature, which can then become part of the trace fingerprint. Clark describes this as the system learning the types of signals the user cares about, not as a fully autonomous substitute for domain judgment.
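A minimal sketch of that fingerprinting step, assuming traces arrive as dictionaries of spans, might look like the following. The specific dimensions and field names are illustrative rather than a description of Distributional's schema.

```python
# Map one trace to a fingerprint mixing counts, floats, sequences, and
# eval-derived features. Field names and dimensions are illustrative.

def fingerprint(trace: dict) -> dict:
    tool_spans = [s for s in trace.get("spans", []) if s.get("span_kind") == "TOOL"]
    return {
        "n_tool_calls": len(tool_spans),
        "n_tool_errors": sum(1 for s in tool_spans if s.get("status") != "OK"),
        "total_latency_ms": trace.get("latency_ms", 0.0),
        "output_chars": len(trace.get("output", "")),
        "tool_sequence": tuple(s.get("name", "") for s in tool_spans),
        # A domain-specific dimension the team adds once "task completion"
        # has a concrete meaning for this agent.
        "task_completed": float(trace.get("evals", {}).get("task_completed", 0.0)),
    }
```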

Once many traces have been mapped into high-dimensional vectors, the system can look for sub-distributions. Clark distinguishes these from simple outliers. The most useful patterns may not be the strangest traces at the tail. They may be pockets inside the broader distribution: recurring shapes that are neither the main path nor obvious anomalies.

From there, Clark describes using stratified sampling, clustering, topic modeling, and LLM-based comparison. He mentions Anthropic’s CLIO paper as an influence and describes it, without going into the acronym, as learning topic representations on LLM data — a way to identify that some people are talking about one thing while others are talking about another. In Distributional’s setting, the goal he lays out is to create a human-readable taxonomy of behavioral pockets: what this group looks like, how it differs from that group, whether the difference seems good or bad, and what might fix it.
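As a rough stand-in for that pipeline, the sketch below clusters the numeric fingerprint dimensions with plain scikit-learn and reports cluster sizes. It omits the stratified sampling, topic modeling, and LLM-based labeling Clark describes, and it is not Distributional's implementation.

```python
# Cluster numeric fingerprint dimensions and report cluster sizes; small
# clusters are candidate "pockets" worth sampling and explaining.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def find_pockets(fingerprints, numeric_keys, n_clusters=8):
    X = np.array([[fp[k] for k in numeric_keys] for fp in fingerprints])
    X = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    sizes = {int(c): int((labels == c).sum()) for c in set(labels)}
    return labels, sizes
```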

The fix could be operational, prompt-level, architectural, or evaluative. A prompt instruction might require calling a tool. A system change might add caching to avoid rate-limit failures. An eval might be created to track the behavior directly. A guardrail might block the behavior. Examples from the subset might be used for fine-tuning, reinforcement learning, or synthetic data generation.

Clark calls this “adaptive analytics”: the analytics discovers a pattern, the team turns the pattern into something measurable or actionable, and the system, in his description, becomes better at finding related patterns later.

Clark’s data flywheel depends on analytics

Sam Charrington characterizes the production-improvement loop as the “elusive data flywheel”: a team bootstraps an agent with intuition, prompts, and initial evals; real users begin interacting with it; and the hope is that production interactions can be transformed into insight about how to improve the system. Clark agrees with that framing but argues that the flywheel needs analytics at its center.

The reason is noise. Production data contains too much unstructured and semi-structured information for teams to turn it directly into better models or prompts. In Clark’s view, analytics is the layer that extracts the right signals from that noise. Once those signals exist, they can feed downstream processes such as eval design, guardrails, fine-tuning examples, reinforcement-learning data, or synthetic datasets designed to exaggerate and study a small signal.

Clark contrasts that with other ways teams try to solve the same problem. Some rely on intuition and hand-written evals. Some generate synthetic data without a strong signal from production. Some use traditional data-science workflows, digging through traces in pandas, Splunk, or SQL. Clark’s view is that agent data does not fit cleanly into the old paradigm. It has structure, but it also contains large amounts of text and nested behavior. LLM problems, in his phrasing, require LLM solutions.

That does not mean generic evals are worthless. Toxicity checks, ROUGE, BLEU, or other broad metrics can help bootstrap a system. But Clark argues they cannot represent what a specific business truly cares about once agents become longer-running and more complex. Every domain has gotchas. Some are in an expert’s head. Some are institutional knowledge. Some are only recognizable when a human sees the example and says, “That’s bad,” or “That’s fine.”

Clark draws on examples outside LLMs to make the point. In metagenomic assembly during graduate school, he tried to design what would now be called an eval. Each clever objective ran into wet-lab realities. A biologist would explain that, every so often, a bubble in an Illumina machine warped the data. That physical artifact had to be incorporated into the objective. Fraud detection has the same property: the naive metric becomes more complicated once a bank explains customer pain, false-positive costs, transaction size, and risk.

His conclusion is that useful evals come from recursive refinement. Teams observe production behavior, discover patterns, consult domain judgment, map high-signal production data down into a smaller set of meaningful metrics, and then decide what should go up, what should go down, and what must never cross a threshold. Analytics is the mechanism he sees for making that loop tractable.

The operator path starts with broad, structured logs

The practical guidance is staged: do not start with analytics if the system cannot yet tell you what happened. Clark’s path is to log broadly, structure the logs early, monitor the obvious signals, and then use analytics once there is enough production behavior to discover less obvious ones.

The first operator obligation is instrumentation. Clark recommends OpenTelemetry as the open framework for traces and says the ecosystem is consolidating around it. He also recommends using a semantic convention, especially the GenAI semantic convention, so that LLM responses and tool calls are recorded in a structure downstream tools can recognize.

Sam Charrington clarifies the distinction: OpenTelemetry is a general logging and tracing standard used for many kinds of systems, including web servers and infrastructure. Because it is general, teams need a way to structure generative-AI-specific events. Clark describes the GenAI semantic convention as a schema. Rather than placing arbitrary fields into a JSON blob, it gives common meaning to user messages, LLM responses, tool calls, and increasingly eval-related fields.

That structure matters because downstream tools do not have to guess which nested JSON field contains the user message or tool result. It also makes enrichment easier: the analytics layer can automatically extract known dimensions for the trace fingerprint.
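A hand-instrumented version of that idea, using the OpenTelemetry Python API with GenAI-style span attributes, might look like the sketch below. In practice most teams would rely on an auto-instrumentation library; the attribute keys follow the GenAI semantic conventions but exact names vary by spec version, and `get_stock_price` and `call_model` are hypothetical placeholders for the agent's own tool and model client.

```python
# Emit a parent span per agent turn and a child span per tool call, with
# GenAI-convention attributes, so missing or failed tool calls show up in traces.
from opentelemetry import trace

tracer = trace.get_tracer("finance-research-agent")

def answer_question(question: str) -> str:
    with tracer.start_as_current_span("chat") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "gpt-4o")  # model name is illustrative
        with tracer.start_as_current_span("execute_tool get_stock_price") as tool_span:
            tool_span.set_attribute("gen_ai.tool.name", "get_stock_price")
            price = get_stock_price("NVDA")   # hypothetical: the agent's own tool
        answer = call_model(question, price)  # hypothetical: whatever LLM client is in use
        span.set_attribute("gen_ai.usage.output_tokens", len(answer.split()))  # crude proxy
        return answer
```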

A team following Clark’s approach would begin by making sure the trace contains the raw material needed for later diagnosis: the user request, model responses, reasoning or intermediate steps where available, tool-call names, tool inputs and outputs, tool errors, retries, latency, token or cost signals, and whatever application-specific outcome the team already understands. If task completion has a clear meaning in the product, it should be logged. If user satisfaction, escalation, refund, abandonment, or another business signal exists, it should be connected to the session or trace rather than left in a separate system that no one can join later.

The second obligation is to monitor the obvious things quickly. Some failures do not require adaptive analytics. If latency doubles, a tool starts returning errors, the agent begins calling 40 tools where one was expected, login errors rise, or rejection rates spike, a conventional monitoring dashboard should catch it. Clark expects low-hanging fruit to emerge as soon as teams inspect traces. They may decide to track the distribution of tool calls, profanity, cost, repeated failures, rate-limit responses, or other known signals.

The third obligation is to turn early inspection into explicit checks. If a team discovers that an agent sometimes calls far too many tools, that becomes a monitored metric. If a specific tool returns frequent 429 errors, that becomes an operational signal. If the agent sometimes responds as though it called a tool when the trace shows it did not, that can become an eval, a guardrail, or a trace-level consistency check. The point is to avoid leaving discovered failure modes as folklore.
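One way to keep those discoveries from becoming folklore is to encode each one as a check that runs on every trace, as in the sketch below. The thresholds and field names are invented for illustration, and the last check reuses the consistency check sketched earlier.

```python
# Turn early discoveries into explicit per-trace checks. Thresholds and field
# names are illustrative; each check mirrors a failure mode named in the text.

def check_trace(trace: dict) -> list[str]:
    alerts = []
    tool_spans = [s for s in trace.get("spans", []) if s.get("span_kind") == "TOOL"]
    if len(tool_spans) > 10:                        # agent calling far too many tools
        alerts.append("excessive_tool_calls")
    if any(s.get("http_status") == 429 for s in tool_spans):
        alerts.append("rate_limited_tool")          # candidate for a caching layer
    if flag_hallucinated_tool_use(trace):           # consistency check sketched earlier
        alerts.append("claimed_tool_use_without_call")
    return alerts
```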

Only after those basics are in place does Clark’s analytics layer become the main lever. Clark does not name a particular scale threshold in the interview; the trigger is the point at which the agent is in production, the team has enough traces to compare distributions, and the known dashboards are no longer enough to explain what users or business owners are seeing. Analytics is meant to answer the next question: what recurring pockets of behavior are present that the team did not know to name?

The output of that step should become an artifact. Clark describes insights that include the size of the affected subset, examples, an explanation of why the pattern differs, a judgment about whether it appears bad, and a possible fix. In operator terms, that can produce several concrete actions: add a new eval, add a new guardrail, revise the system prompt, change the agent’s tool policy, add a caching layer, collect examples for fine-tuning or reinforcement learning, seed a synthetic dataset, open an experiment, or add a newly discovered signal to monitoring.

Clark describes many teams as starting in a high-temperature annealing state: shipping quickly, learning what works, accepting roughness. Charrington jokes about three nines or even one nine of uptime not mattering and about dangerously ignoring permissions; Clark answers with the phrase “dangerously ship slop.” Clark is not endorsing that as an end state. His point is that many teams begin with rapid experimentation, then need to “lower the temperature” as the product starts making money and requires reliability.

At that point, telemetry and monitoring are no longer enough. Teams need to improve the monitoring itself, decide what new spans or fields to log, discover patterns too subtle for manual inspection, and align the system to the use case over time. Clark says no one should have to read through 100,000 traces to invent a pattern by hand. The system should surface candidate patterns, and those patterns should become part of a more automatic improvement loop.

A compact version of Clark’s decision path looks like this: if there are no traces, start with telemetry; if known operational signals are not tracked, add monitoring; if the agent is in production and the known dashboards are no longer enough to explain user or business behavior, use analytics to discover what should become the next metric, eval, guardrail, prompt change, or code change.

Non-stationary models make one-time evals brittle

The need for online analytics is strongest, in Clark’s view, because the underlying models are non-stationary. Sam Charrington frames the issue by referring to recent public discussions he associates with model providers: an OpenAI post about GPT-4o talking about goblins, and an Anthropic discussion of why models seemed worse in some cases. The source does not examine those posts in detail; Charrington uses them to make a narrower point. Builders often depend on black boxes they do not control, and providers may change knobs even when a model version number appears stable.

Scott Clark says this is precisely why evals and guardrails cannot be treated as fixed. Something that worked yesterday may fail tomorrow because the model shifted around it. He describes the system as being boxed into a high-dimensional space; eventually it may find a dimension the developer did not realize was open, acquire a new skill, or produce a new side effect. A reinforcement-learning change intended to stop one behavior may produce another.

He and Charrington agree that this is not merely an anomaly. It is part of why these systems work. They are dynamic systems that evolve over time. Anthropic and OpenAI can choose stimulus and reward, but the resulting evolution can have side effects. Clark invokes the “raptors testing the fence” image from Jurassic Park: a fence that works in one situation may not work in another. He adds that this need not involve an adversarial model intentionally trying to escape. The system can simply change such that yesterday’s chicken-wire fence no longer fits today’s elephant.

Charrington asks whether analytics could provide something like a “model weather report”: a way to know how the model is doing today for the systems built on top of it. Clark extends the weather analogy by separating known measurements from newly discovered phenomena. Temperature, humidity, rejection rate, and number of tool calls are monitoring signals. They can be plotted over time and interpreted when they spike. A heat dome or atmospheric river is different: a higher-level phenomenon that may explain why lower-level metrics behave strangely, even though the team did not know to name or track it beforehand. Once discovered, it can become part of monitoring.

The tool-call example illustrates the point. A team may add a system-prompt instruction to reduce unnecessary tool use because token costs are too high. The monitoring dashboard looks good in Clark’s hypothetical: costs fall by 20%, known evals remain stable, and the system appears to be following instructions. But in a small fraction of sessions, the agent may start “cheating,” relying on memorized knowledge or pretending to have looked something up. If the team was tracking only cost, it might miss the second-order effect.


Charrington questions whether that is really analytics rather than just monitoring number of tool calls. Clark’s response is that analytics can reveal that number of tool calls is a metric worth monitoring in the first place. The eventual signal may be simple. The discovery of which simple signal matters is not.

Clark returns to the fraud example to make this concrete. A confusion matrix may show a classifier with apparently acceptable false-negative rates. But if the false negatives are weighted by transaction amount, the business picture can change completely: the small percentage of missed fraud may contain the massive transactions. Transaction value is orthogonal to correctness, but it determines business impact. In Clark’s view, analytics is how a team learns to combine those dimensions.
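A small invented example makes the weighting point concrete: with 1,000 fraudulent transactions, a count-based false-negative rate of 0.4% can coexist with roughly 87% of fraud value slipping through, because the one enormous transaction is among the misses.

```python
# Count-based vs. dollar-weighted false-negative rate. Transactions are made up.
fraud = [(True, 150.0)] * 996 + [(False, 120.0), (False, 80.0), (False, 45.0), (False, 1_000_000.0)]

missed = [amt for caught, amt in fraud if not caught]
fn_rate_by_count = len(missed) / len(fraud)                    # 4 / 1000 = 0.4%
fn_rate_by_value = sum(missed) / sum(amt for _, amt in fraud)  # ~87% of fraud value missed
print(f"{fn_rate_by_count:.1%} of fraudulent transactions missed, "
      f"{fn_rate_by_value:.1%} of fraud value missed")
```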

The output of analytics is an insight, not a single score

When Sam Charrington asks how to measure a metric-discovery system, Clark says the evaluation is inherently meta. The question is whether the system finds things the team would not have found, or would have taken much longer to find. The value can be measured by speed of discovery, by the business impact of the discovered issue, or by the downstream artifacts produced: new metrics, pull requests, features, evals, guardrails, or experiments.

The artifact in Distributional’s system, as Scott Clark describes it, is an “insight.” After the pipeline runs, it surfaces something like: 2% of the data has this pattern; here are examples; the system believes it is bad for this reason; here is how it could be fixed; do you want to track it or run an experiment? Then the loop continues.

That output is closer to descriptive data science than to a single predictive metric. In Clark’s framing, the value is not only that a dashboard number improved. It is that the team discovered an explanation it otherwise lacked, then changed the system accordingly. If the insight affects a specific business metric, the impact may become measurable in the usual way. But the discovery itself is part of the value he attributes to the analytics layer.

Charrington asks whether there is an academic benchmark for this kind of problem: a dataset of logs with hidden signals where different analytics approaches compete to identify useful insights. Clark says such a benchmark could exist, though he does not present one as already established. Security, he says, is an interesting adjacent use case because strange sub-distributions of behavior could indicate either an agent failure or malicious activity.

He notes that enterprise customers are increasingly attuned to security, and mentions MITRE among the frameworks and efforts being discussed in the space. “Everybody’s gotten superpowers, including the Black Hat hackers,” he says. But he also says security is difficult for Distributional to pursue deeply without better access to security data. Because the company focuses on on-prem deployment and data provenance, it does not freely collect customer logs. A serious security push would likely require a deeper partnership or acquisition.

Clark describes Distributional’s product as not open source but “open distribution”: free to use, with both SaaS and on-prem options. Many customers, he says, prefer to keep logs to themselves because traces can contain sensitive information. The system can deploy on-prem, use whatever local or third-party LLM the customer already has, and run alongside monitoring or nightly ETL jobs. He later says it is packaged as a Helm chart and can run air-gapped.
