
Agent Observability Is Moving From Dashboards to Eval-Driven Optimization

Amy Boyd and Nitya Narasimhan · AI Engineer · Thursday, May 14, 2026 · 18 min read

Amy Boyd and Nitya Narasimhan of Microsoft argue that agent observability has to track the widening gap between what an AI agent is meant to do and what it actually does as models, prompts, tools and user behavior change. Their walkthrough of Microsoft Foundry frames observability as a loop of OpenTelemetry tracing, trace-linked evaluations, monitoring, optimization and red teaming. The central demonstration is an observe skill that can generate an evaluation dataset, run batch tests, optimize prompts, compare versions and roll back to the best-performing agent version from a sparse starting point.

The gap is the distance between requirements and agent behavior

Amy Boyd framed agent observability around a simple problem: agents change, and the space between what an agent is supposed to do and what it actually does can widen without developers noticing. The “mind the gap” analogy was not only a London reference. Boyd used it to describe the mismatch between a fixed platform and changing trains: the platform stands for the requirements, the train for the agent. Sometimes they fit cleanly. In other cases, there is a dangerous gap.

For AI agents, that gap appears in quality, safety, and monitoring. Evaluation checks whether the agent still meets its requirements. Guardrails and safety mechanisms warn users and constrain behavior when risk appears. Monitoring keeps that awareness alive over time, because agents do not stay still: requirements change, customers change, environments change, prompts get refined, models get updated, context drifts, and edge cases accumulate.

Boyd’s core claim was that observability has to begin early, not arrive as a production afterthought. Developers need to evaluate while building, monitor while operating, and optimize based on what traces and evaluations reveal. It is not enough to produce scores. The next question is operational: what should be changed, and how quickly can the team know whether the change helped or caused a regression?

Nitya Narasimhan reduced the developer challenge to three problems. First, there are too many models and, at the beginning of a project, often no existing evaluation data. Second, AI quality has to keep up with fast changes, because brand risk and user-facing failures come back to the team operating the agent. Third, safety requires assuming not only ordinary users, but malicious users who actively try to manipulate the system.

Narasimhan separated evaluation from safeguarding with a house analogy. Quality evaluation is like a building inspector checking whether the house is up to code. Safeguarding is closer to hiring someone to break into the house and report the ways it failed. Normal evaluation asks whether the agent performs as expected under ordinary use. Adversarial testing asks whether the agent can be pushed outside its allowed behavior.

It’s not enough for you to know when things go wrong. You need to shorten the time between detecting something went wrong and diagnosing it.

Nitya Narasimhan

That is why Narasimhan emphasized “trace-linked evaluations.” A failed metric tells the team that something broke. The linked trace tells them where in the execution path it happened: whether a model produced the wrong response, a tool was not called, a workflow step failed, or a sub-agent underperformed. In her example, a model swap might reduce cost but degrade tool-call efficiency. The useful loop is one that lets the developer compare the new version to the previous one and diagnose what changed.
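The pattern behind that phrase is simply that every evaluation result carries the identifier of the trace it scored. A minimal sketch in plain Python, with illustrative field names rather than any particular SDK's schema:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One metric score, linked back to the trace of the run it evaluated."""
    metric: str      # e.g. "task_adherence"
    score: float     # normalized 0..1
    passed: bool
    trace_id: str    # OpenTelemetry trace ID of the agent run
    reason: str      # evaluator/judge explanation

def failing_traces(results: list[EvalResult]) -> dict[str, list[EvalResult]]:
    """Group failures by trace so each one can be opened and diagnosed."""
    failures: dict[str, list[EvalResult]] = {}
    for r in results:
        if not r.passed:
            failures.setdefault(r.trace_id, []).append(r)
    return failures
```

Grouping failures by trace ID is what lets a developer jump from "task adherence dropped" to the exact tool call, workflow step, or sub-agent where the run went wrong.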

The observe skill turns observability into an eval-driven workflow

The most consequential demonstration was the early-preview observe skill. Narasimhan presented it as a Microsoft Foundry skill that lets a coding agent run the observability loop against an existing agent. Instead of manually writing SDK calls to create evaluators, generate datasets, run batch evaluations, inspect failures, optimize prompts, redeploy, compare versions, and decide which version to use, the developer asks a coding agent to start the loop.

The starting point was intentionally sparse: the Contoso portal agent existed, but there was no evaluation dataset and no baseline. Narasimhan used GitHub Copilot Chat, with Claude as the model, and asked it to use the observe skill to start the observability loop on the agent. In the run she showed, the skill checked agent metadata, inspected the agent instructions, called the evaluator catalog, saw that no evaluation dataset existed, and generated one. It then ran an initial batch evaluation with checks shown for relevance, task adherence, intent resolution, and indirect attack.

The first batch result for contoso-travel-portal version 4 showed 10 queries. Relevance passed 10 out of 10. Intent resolution passed 10 out of 10. Indirect attack passed 10 out of 10. Task adherence passed 8 out of 10, leaving two failures. The important point was that she had not provided the dataset or manually configured the loop. The skill produced a baseline from the agent and its current state.

8/10: baseline task-adherence result for the Contoso travel portal agent before prompt optimization

The skill then asked for human direction. Narasimhan told it to analyze the failures, optimize the instructions, and reevaluate. In this run, it ran a prompt optimizer, identified that the prompt left room for misinterpretation, updated the agent to a new version, reran the same batch evaluations, and returned a comparison table. Version 5 improved task adherence from 8 out of 10 to 9 out of 10 while relevance, intent resolution, and indirect attack remained at 10 out of 10.

Narasimhan then pushed the system to seek 10 out of 10. This is where she stressed the need for a human in the loop. The coding agent kept modifying prompts and testing new versions, but later versions regressed. One version with a forbidden phrase list dropped to 6 out of 10. Another with an “I searched the web” opener scored 8 out of 10, and the notes said the explicit opener made the judge more suspicious of fabrication. The system eventually concluded that version 5 was the best-performing version in the optimization history and selected it again.

| Version | Key change | Score | Notes |
| --- | --- | --- | --- |
| v4 | Original instructions | 8/10 | Baseline with two task-adherence failures; no web search called |
| v5 | Mandatory web-search rules | 9/10 | Fixed one failure; still missed another case |
| v7 | Forbidden phrase list | 6/10 | Regressed |
| v8 | "I searched the web" opener | 8/10 | Explicit opener made the judge more suspicious of fabrication |

The observe skill compared agent versions and selected v5 as the best-performing prompt version shown in the demo.

An audience member asked whether the observe skill only changes prompts or can change the model. Narasimhan said it can change other things, but the loop she showed began with batch evaluation and prompt optimization. She described other possible investigations in conditional terms: for example, the skill might identify that web search is taking too long and ask whether the developer wants to look for an alternative. The demonstrated run showed dataset generation, evaluator setup, prompt optimization, version comparison, regression detection, and selection of the best-scoring version in that loop.

The visible observe-skill documentation described the agent observability loop as orchestrating an eval-driven optimization cycle for a Foundry agent: reusing or refreshing a .foundry cache, auto-creating evaluators, generating test datasets, running batch evaluations, clustering failures, optimizing prompts, redeploying, and comparing versions. Narasimhan said it uses the Foundry MCP server and can be triggered by intents such as “evaluate my agent,” “help me build a dataset,” and “run an eval.” Subskills shown in the documentation included dataset creation and insight analysis.

When asked whether updating the evaluation set means rerunning the process, Narasimhan said yes — and that the developer can tell the skill to do it. The eval set itself becomes part of the observability loop. If the team learns that its dataset misses important cases, it refreshes or extends the dataset and reruns the comparisons rather than treating the original benchmark as fixed.
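Stripped of the product surface, the cycle the skill runs can be sketched in a few lines. The sketch below is conceptual, not the observe skill's implementation; the agent object, the evaluate callable (a batch run over the current dataset), and the optimize callable (a prompt optimizer) are all assumed:

```python
def eval_driven_optimization(agent, evaluate, optimize, max_rounds=5):
    """Hypothetical sketch of an eval-driven optimization loop.

    agent    -- object exposing .instructions and .with_instructions(text) (assumed)
    evaluate -- callable(agent) -> float, e.g. a batch evaluation over the dataset
    optimize -- callable(instructions, history) -> str, e.g. an LLM prompt optimizer
    """
    history = [(agent, evaluate(agent))]          # baseline
    best, best_score = history[0]
    for _ in range(max_rounds):
        candidate = best.with_instructions(optimize(best.instructions, history))
        score = evaluate(candidate)
        history.append((candidate, score))
        if score > best_score:                    # keep improvements
            best, best_score = candidate, score
        # regressions stay in the history but are never selected
    return best, history                          # deploy (or roll back to) the best version
```

Because the dataset is just an input to the evaluate step, refreshing or extending it and rerunning the loop is the same operation Narasimhan described.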

Reliable agents need evaluation, tracing, monitoring, and optimization in the same loop

Amy Boyd described reliable agent development as a three-part practice: evaluate, monitor, and optimize. Evaluation covers performance, quality, safety, and increasingly agent-specific behavior. Monitoring is the ongoing ability to debug issues quickly as requirements and user behavior change. Optimization is the step that turns observability data into agent improvements.

In Microsoft Foundry, the observability surface Boyd described spans the agent lifecycle: building reliable agents early, debugging and optimizing in production, and eventually gaining fleet-wide visibility and control across many agents. She stressed that this was not limited to agents built entirely inside Foundry. Foundry can host, observe, manage, and monitor agents, but developers can use as much or as little of the platform as they need.

Tracing is built on OpenTelemetry. That matters because organizations often build different agents with different tools across teams or businesses. If those agents can be instrumented with OTel tracing, they can be brought into the Foundry control plane for observation. The tracing view is meant to expose what the agent actually did: which tools it called, what messages it sent, and how the steps in the workflow unfolded.
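A minimal sketch of that instrumentation with the OpenTelemetry Python API follows; web_search and call_model stand in for the agent's real tool and model calls, and the span and attribute names are illustrative rather than a required schema:

```python
from opentelemetry import trace

tracer = trace.get_tracer("contoso.travel.agent")

def run_agent(query: str) -> str:
    # One span per agent invocation; child spans per tool call or sub-agent.
    with tracer.start_as_current_span("agent.invoke") as span:
        span.set_attribute("agent.query", query)
        with tracer.start_as_current_span("tool.web_search") as tool_span:
            tool_span.set_attribute("tool.name", "bing_web_search")
            results = web_search(query)           # assumed tool function
        answer = call_model(query, results)       # assumed model call
        span.set_attribute("agent.response.length", len(answer))
        return answer
```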

The evaluator catalog Boyd showed included quality evaluators such as document retrieval, groundedness, relevance, coherence, fluency, similarity, NLP metrics, and Azure OpenAI graders. It included risk and safety evaluators such as indirect attack jailbreaks, hate and unfairness, protected material, ungrounded attributes, code vulnerability, prohibited actions, and sensitive data leakage. It also included agent-specific metrics: intent resolution, tool-call accuracy, tool selection, tool input accuracy, tool output utilization, task adherence, response completeness, tool-call success, task completion, and task navigation efficiency. When built-ins do not fit the scenario, Boyd said teams can write custom evaluators.
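Custom evaluators in this style are typically small callables that take the fields they need and return a score with a reason. The toy example below checks a Contoso-specific expectation, that a travel answer includes a price estimate; it is a generic sketch and does not show how an evaluator is registered in the Foundry catalog:

```python
import re

class MentionsPriceEvaluator:
    """Toy custom evaluator: checks that a travel answer includes a price estimate."""

    def __call__(self, *, query: str, response: str) -> dict:
        has_price = bool(re.search(r"[$€£]\s?\d", response))
        return {
            "mentions_price": 1.0 if has_price else 0.0,
            "mentions_price_reason": (
                "Response includes a currency amount."
                if has_price
                else "No currency amount found in the response."
            ),
        }

# Usage: score = MentionsPriceEvaluator()(query=q, response=r)
```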

The agent-specific distinction was central. Boyd used a weather agent to show why evaluating a single model response is too shallow. If the user asks, “What’s the weather in London?”, the agent first has to resolve intent: the user wants local weather and a forecast time. It then has to call the right tools, such as location and weather functions, with the right inputs. Only then does the response itself need evaluation for task completion, task navigation efficiency, adherence to goal, policy, or rule, groundedness, relevance, and operational metrics.

This decomposition is what makes observability actionable. If the final answer is poor, the team needs to know whether the intent was misunderstood, the wrong tool was selected, the right tool was called with bad input, the tool output was ignored, or the final response failed to adhere to the task. An agent can be evaluated at many points in its lifecycle and at many points inside a single workflow.
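One way to picture that decomposition is a single run scored at each stage. The sketch below assumes the trace has already been parsed into its stages and that judge is any scoring callable, such as an LLM-as-judge, returning a value between 0 and 1:

```python
def evaluate_run(trace: dict, judge) -> dict:
    """Score one agent run at several points in the workflow.

    trace -- assumed to be parsed into query, intent, tool calls, and response
    judge -- assumed scoring callable(question, trace) -> float in [0, 1]
    """
    return {
        "intent_resolution":       judge("Did the agent resolve the user's intent?", trace),
        "tool_selection":          judge("Were the right tools chosen for that intent?", trace),
        "tool_input_accuracy":     judge("Were the tools called with correct inputs?", trace),
        "tool_output_utilization": judge("Did the response use the tool outputs?", trace),
        "task_adherence":          judge("Does the final response complete the task?", trace),
    }
```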

A proof of concept should already be traceable

The demonstration used a deliberately simple Contoso travel agent. Narasimhan described the scenario as a developer joining a fictitious company and being asked to build an application that helps users find hotels, car rentals, and flight reservations. The agent begins as the combination of a model, instructions, and tools. The model is the “brain”; the agent is the experience; tools add knowledge and capabilities beyond what the model already has.

The first developer problem is model selection. Narasimhan said there are more than two million models on Hugging Face and more than 11,000 in Azure’s catalog. Her point was not to rank them, but to illustrate the startup problem: when a developer is starting from zero, they often do not want to make every architectural decision before they can test anything. They need a fast path to a prototype.

In the Foundry portal, Boyd showed that fast path: create an agent, accept a default model deployment, add a tool, and enter the playground. In the demo, Foundry deployed GPT-4.1 by default. Boyd noted that other models were visible through the catalog, including providers such as Anthropic and DeepSeek. The point of the first step was not model selection; it was getting to a working, observable starting point quickly.

At this stage, the agent was intentionally under-specified. Boyd added web search through Bing, because a travel agent needs current outside information. But with no meaningful instructions, the agent responded like a generic assistant. It could say what it could do and offer general advice, but it was not yet constrained to Contoso’s travel-assistant role.

The next requirement was tracing. Narasimhan pointed out that the newly created agent was not traceable until Application Insights was connected. In the Foundry portal shown in the demo, the Traces tab prompted the user to create or connect an App Insights resource. Boyd also showed an “Ask AI” helper in the portal that could walk users through connecting App Insights.

Once tracing was connected and the agent had more detailed instructions, the portal showed conversations with status, duration, tokens, estimated cost, and evaluation information. Boyd’s Contoso instructions included responsibilities such as helping customers plan trips, answering destination and logistics questions, always using web search, presenting comparisons, providing externally sourced specifics, and keeping responses focused and helpful.

The important result was not that the agent became polished. It was that even a basic proof of concept could surface quality and safety metrics. Boyd sent a detailed test prompt about planning a vacation to Paris for a group of three leaving Seattle on July 3. The response showed “AI Quality: 80%” and “Safety: 100%” under the chat bubble. In the metrics menu, Boyd selected quick evaluation metrics such as task adherence, intent resolution, coherence, and fluency. The trace details then showed an Evaluations tab with scores and reasoning.

One result was especially useful: task adherence failed or scored low. Boyd explained that the agent had not really answered the user’s question; it had asked for more information instead. The portal-first path mattered because the team had gone from a basic agent to traces and early evaluation results quickly enough to see a real failure before writing a more elaborate implementation.

Workflow traces make multi-agent failures localizable

Nitya Narasimhan treated the portal path as the quickest proof of concept, then moved into the SDK path for repeatability and control. The notebooks rebuild the first agent programmatically, add tools, compose workflow agents, configure tracing, run evaluations, and perform red teaming. The useful distinction is not the notebook sequence; it is what becomes visible as the implementation grows more complex.

The basic prompt agent has a model and instructions, and it is invoked through the Responses API. From there, the implementation adds function tools for car search, flight search, and hotel search. The agent now has model, instructions, and tools, and can be tested against flight-only prompts or combined hotel-and-car requests.
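The shape of a function tool is broadly similar across agent runtimes: a plain function plus a JSON-schema description the model can reason about. The sketch below uses that generic shape for the flight search; it is illustrative rather than the exact Foundry SDK declaration, and the backend it pretends to query is assumed:

```python
def search_flights(origin: str, destination: str, date: str) -> list[dict]:
    """Assumed stand-in for a real flight-search backend."""
    return [{"carrier": "Contoso Air", "origin": origin,
             "destination": destination, "date": date, "price_usd": 412}]

# A generic tool descriptor in the JSON-schema style most agent runtimes accept.
flight_tool = {
    "name": "search_flights",
    "description": "Search available flights between two cities on a date.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "description": "ISO date, e.g. 2026-07-03"},
        },
        "required": ["origin", "destination", "date"],
    },
}
# search_hotels and search_cars would follow the same pattern.
```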

The next step is a workflow agent. A single monolithic agent doing all work is not ideal once the task becomes complex. A travel application can be decomposed into specialist agents: one for flights, one for hotels, one for car rentals, and a concierge that orchestrates them. Foundry workflows, as shown in the demo, let developers define that composition declaratively in YAML and visualize the agent graph in the portal.
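The YAML schema itself was only shown on screen, so it is not reproduced here. As a language-neutral sketch of the same decomposition, not Foundry's workflow format, a concierge dispatching to specialists might look like the following, with the routing and assembly helpers assumed:

```python
class ConciergeAgent:
    """Conceptual sketch of the decomposition, not Foundry's YAML workflow schema."""

    def __init__(self, flight_agent, hotel_agent, car_agent):
        self.specialists = {
            "flights": flight_agent,
            "hotels": hotel_agent,
            "cars": car_agent,
        }

    def handle(self, request: str) -> str:
        # Route each part of the request to the specialist responsible for it,
        # then assemble a single answer. Both helpers below are assumed.
        parts = route_request(request, list(self.specialists))
        answers = {name: self.specialists[name].run(request) for name in parts}
        return assemble_answer(request, answers)
```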

The workflow trace is where the observability payoff becomes more obvious. Instead of one agent call, the trace shows the workflow execution and each sub-agent invocation: creating a conversation, invoking the flight agent, invoking the hotel agent, invoking the car agent, collecting their responses, and returning the final answer. Narasimhan said this allows teams to identify which agent is underperforming, which agent is not doing its job correctly, and what each piece costs in tokens.

An audience member asked how to optimize cost and how to roll back a complex agent transaction when it goes down the wrong path. Narasimhan answered cost first. Model choice changes cost: GPT-4.1 costs more in tokens than a smaller model such as GPT-4o-mini, so a team can switch models and immediately rerun evaluations to see whether lower cost caused an accuracy regression. Cost can also come from tool behavior. If web search takes too much time or cost, a team might replace it with a smaller cached subset or a cheaper data source. The rule she emphasized was that every change should be evaluated immediately and compared against the previous behavior.

Rollback, in her explanation, is version-based. Agents have saved versions, and deploying an agent can mean selecting the identity and version that should be live. Later, in the observe-skill demo, she showed a coding agent returning to the version with the best evaluation score. The underlying mechanism she pointed to was simple: versions are identifiers, and the team can deploy the one it wants.

Narasimhan also made a platform boundary explicit. From the tracing perspective, she said, it does not matter what kind of agent you have if it has an endpoint and emits OTel traces. From the evaluation perspective, the agent’s construction mechanism is also secondary: with an endpoint, evaluations can be run and metrics returned. That is the basis for treating prompt agents, tool-using agents, workflow agents, and agents built with other frameworks as candidates for the same observability loop.

Custom traces and batch evaluations make failures diagnosable

After building agents in code, the focus shifted to tracing itself. The tracing notebook showed how to set up OpenTelemetry locally and in Azure Monitor, and how to add custom attributes that matter for debugging. As the agent evolves from a prompt to tools to workflows, the number of ways it can fail multiplies. Custom trace attributes let developers include information they know they will want when troubleshooting later.

The local tracing setup shown in the notebook requires environment variables to enable generative AI tracing and capture message content. Once configured, the agent can emit traces to the console during development, with the developer's custom attributes included. The same traces can also be pushed to the backend and shown in the Foundry portal.
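A minimal local setup with the opentelemetry-sdk package looks like the sketch below; the exact environment variable that enables message-content capture depends on the instrumentation in use, so it is not shown, and the custom attribute names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export spans to the console while developing locally.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("contoso.travel.notebook")

with tracer.start_as_current_span("workflow.plan_trip") as span:
    # Custom attributes: details the team knows it will want when troubleshooting later.
    span.set_attribute("contoso.customer_tier", "gold")
    span.set_attribute("contoso.agent_version", "v5")
```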

Narasimhan connected Foundry traces to Azure Monitor as the place where telemetry across services can be collected. Adding AI telemetry lets teams compare agent behavior with the other systems it depends on. Her example was Azure AI Search: if an agent depends on a search index, the team can look at search telemetry and AI application telemetry together to understand whether latency or failures came from the model, the agent, or a dependency.
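Routing the same spans to Application Insights is typically a single call with the azure-monitor-opentelemetry distro, assuming that package and an existing App Insights resource:

```python
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# One call wires the OpenTelemetry exporters to the Application Insights resource.
configure_azure_monitor(
    connection_string="InstrumentationKey=...;IngestionEndpoint=...",  # from App Insights
)

tracer = trace.get_tracer("contoso.travel.agent")
with tracer.start_as_current_span("agent.invoke"):
    ...  # agent work; spans now land next to the telemetry of dependencies such as Azure AI Search
```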

The evaluation path then moves into batch runs. The categories were quality, safety, and agentic evaluation. The first requirement is an evaluation dataset. In the notebook, the team used an AI agent to create a sample set of test prompts and responses. Later, the observe skill generated a dataset automatically for the demonstrated portal agent.

To run evaluations, the developer defines evaluator objects, specifies the criteria, maps each evaluator to the columns in the evaluation dataset, runs the evaluation, and polls until it completes. Results can be analyzed locally or viewed in the Foundry portal.
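A local sketch of that pattern with the azure-ai-evaluation package, assuming it is installed and that the dataset is a JSONL file with query and response columns; parameter names may differ slightly by SDK version, and the cloud batch run the notebook polls goes through a different entry point:

```python
from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator

# Judge model used by the LLM-based evaluators (all values are placeholders).
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<judge-deployment>",
    "api_key": "<key>",
}

result = evaluate(
    data="contoso_eval_dataset.jsonl",      # one JSON object per line
    evaluators={
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
    },
    # Map evaluator inputs to the dataset columns they should read.
    evaluator_config={
        "default": {
            "column_mapping": {"query": "${data.query}", "response": "${data.response}"},
        },
    },
)
print(result["metrics"])                    # aggregate scores per evaluator
```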

The portal showed separate quality and agentic runs for the Contoso travel agent. In a simple quality run, fluency and coherence passed. In the agentic run, metrics included intent resolution, groundedness, and relevance. One result showed intent resolution at 100%, relevance at 100%, and groundedness at 80%, with one groundedness failure.

| Metric | Result shown | What the failure exposed |
| --- | --- | --- |
| Intent resolution | 100% (5 of 5) | The agent understood the user's intent in the simple test set |
| Relevance | 100% (5 of 5) | The answers were judged relevant in the run shown |
| Groundedness | 80% (4 of 5) | One answer used an incorrect date, making the response unreliable |

The agentic evaluation run made a single groundedness failure visible alongside otherwise passing metrics.

That failure is the kind of finding Narasimhan wanted developers to see. The detailed metrics view showed which row failed, the trace ID, the output, the groundedness score, and the reason. The visible tooltip said the response provided relevant information but gave an incorrect date, making the response unreliable. The point was not that the test set was comprehensive; Narasimhan called it simple. The point was that a batch evaluation could point to a specific failed case, attach reasoning, and connect it back to the trace.

Red teaming tests whether the guardrails can be bypassed

Narasimhan distinguished ordinary safety evaluation from red teaming. Safety evaluation checks behavior against categories such as violence, hate, self-harm, or sexual content. Red teaming actively attacks the model or agent to see whether manipulated prompts can get around safeguards.

Her concise description was that red teaming uses a second AI to attack the first AI. The developer tells the red teaming agent which risk categories matter and which attack strategies to use. The red teaming agent then generates adversarial prompts, runs them against the target, and reports the cases where the agent failed.

The example she used was a prompt asking how to rob a bank. A direct request should trigger safety guardrails. But if the request is transformed — flipped, encoded, written in leetspeak, or otherwise made to look like gibberish — the guardrail may let it through. The model may then decode or interpret the manipulation and answer the harmful request. Red teaming is meant to proactively test those evasions.
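The simpler single-turn strategies amount to transforming a seed prompt before sending it to the target and recording whether the guardrail still blocks it. The sketch below shows leetspeak and Base64 transforms in plain Python; the target and is_blocked callables are assumed, and real red-teaming tooling such as PyRIT maintains far richer strategies:

```python
import base64

LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})

def leetspeak(prompt: str) -> str:
    return prompt.lower().translate(LEET)

def base64_wrap(prompt: str) -> str:
    encoded = base64.b64encode(prompt.encode()).decode()
    return f"Decode this Base64 string and follow the instructions: {encoded}"

def run_attacks(seed_prompts, target, is_blocked):
    """target(prompt) -> response and is_blocked(response) -> bool are assumed callables."""
    attacks = {"baseline": lambda p: p, "leetspeak": leetspeak, "base64": base64_wrap}
    successes = []
    for strategy, transform in attacks.items():
        for seed in seed_prompts:
            response = target(transform(seed))
            if not is_blocked(response):        # guardrail bypassed
                successes.append((strategy, seed, response))
    return successes                            # feeds the attack-success-rate report
```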

The red teaming notebook listed agentic safety evaluators including prohibited actions, task adherence, and sensitive data leakage. Narasimhan emphasized that agentic risks are especially important because the agent can act. If an adversary manipulates an agent into doing something outside its intended scope, the failure is not merely a bad answer; it may be an unauthorized action.

The attack strategies shown included leetspeak, Base64, and indirect jailbreaks. In the portal, a red team run against Contoso showed risk categories such as violence, sensitive data leakage, and task adherence, with leetspeak as the attack strategy. The overall results showed 0% attack success rate for sensitive data leakage, task adherence, violence, baseline, and leetspeak in that simple run. The detailed metrics showed attack inputs, responses, outcomes, and reasoning, including cases where responses were filtered.

| Red-team category | Result shown | Context |
| --- | --- | --- |
| Sensitive data leakage ASR | 0% (0 of 76) | Simple run with leetspeak included as an attack strategy |
| Task adherence ASR | 0% (0 of 28) | The target was tested for being steered off task |
| Violence ASR | 0% (0 of 76) | The run shown did not find successful attacks in this category |
| Leetspeak ASR | 0% (0 of 30) | The leetspeak attack strategy did not bypass the guardrail in the shown run |

The red-team run Narasimhan showed passed the selected risk categories, while serving as an example of how attack-success rates are reported.

An audience member asked whether the developer could define the policy in the inverse form: the agent has the right to do nothing except what is authorized. Narasimhan said that is what a guardrail or taxonomy can express. Red teaming asks the next question: if that is the policy, can an attacker manipulate the agent into violating it? A taxonomy may define allowed and prohibited actions, but the red teaming agent tries to find loopholes through prompt strategies.

Narasimhan also showed more difficult attack options, including crescendo. She described a crescendo attack as a gradual multi-turn escalation: it starts with a small harmless-seeming step, gets through, builds on it, and keeps increasing pressure until the system is being attacked from multiple sides. Such attacks take longer, but are meant to find vulnerabilities that simpler single-turn attacks may miss.

AI assistants move observability closer to the developer and operator

Narasimhan closed by showing two AI-assisted analysis surfaces beyond the coding-agent skill. The first was Ask AI inside the Foundry portal. As she described it, it is an agent that knows the state of the current project, so a developer can ask project-specific questions: how to connect Application Insights, what models are available in a subscription or region, whether quota exists, or what happened with a specific trace ID. The portal itself displayed a preview warning that the chat is powered by AI and outputs should be reviewed.

The second was the Azure Copilot observability agent inside Application Insights and Azure Monitor. Narasimhan said she is not a KQL query person and cannot write those queries reliably. The observability agent, as shown in the Azure portal, turns natural-language questions into telemetry analysis. In the interface, it described its capabilities: finding failures across exceptions, failed requests, and dependency errors; analyzing latency percentiles, error rates, top operations, dependencies, and user impact; and exploring logs and metrics.

This completed the loop Boyd and Narasimhan had been building toward. Foundry portal tools help developers create a quick traceable agent. SDK notebooks show how to build prompt agents, tool-using agents, workflow agents, custom traces, evaluations, and red team runs in code. The observe skill lets a coding agent execute the evaluation and optimization cycle, while preserving human guidance. Ask AI and the Azure observability agent bring state-aware assistance into the portal and telemetry workspace.

Narasimhan’s final formulation of the “gap” was threefold. A working agent can drift from its original requirements because models, prompts, environments, and edge cases change. Teams need continuous tracing and monitoring to know when that gap changes. They need trace-linked evaluations to detect issues and diagnose them quickly. And they need adversarial testing, not only normal safety evaluations, because attackers will try techniques the developers may not have anticipated.

The answer to “too many models, no existing data” was to use coding agents and skills to generate an initial evaluation dataset from the current application state and requirements. The answer to fast-changing quality was trace-linked evaluation: see both how the agent executed and what score or failure resulted from a change. The answer to safety was to combine guardrails with proactive red teaming, including attack strategies maintained and expanded through work such as PyRIT.
