Applied AI Shifts From Model Quality To Quality Loops

Applied AISunday, May 17, 20261h 2m to watch13 min read

As agentic systems move across tools, codebases, policies, and customer context, quality is becoming a property of the surrounding system rather than a single model response. Richard Ngo, Eugene Yan, Marlene Mhangami, Chris Lovejoy, and Stephen Chin each point to versions of the same operating pattern: define success outside the model call, observe the steps, constrain risky actions, and feed failures back into tests, memory, or product changes.

1. The new applied-AI problem: quality is a system property

Agentic AI changes what “model quality” means. When a model only answers a question, teams can mostly evaluate whether the answer is accurate, useful, safe, or on-brand. When a model acts across tools, workflows, codebases, policies, APIs, user interfaces, and customer-specific context, quality depends on the entire system around the model: how the task is specified, which context is retrieved, which tools are available, what the agent is allowed to do, how each step is traced, and how failures become future tests.

Richard Ngo’s distinction between tool AI and agent AI is the cleanest starting point. In the tool paradigm, a model helps a human decide what to do. The action horizon is short, and the human remains the bottleneck. In the agent paradigm, systems pursue goals over longer horizons, interact with tools, manage dependencies, and correct course. Ngo’s concern was frontier-scale: long-horizon, situationally aware systems whose goals cannot be inspected directly. But the practical engineering implication is already visible in applied AI. Once systems act, not just answer, the model call is no longer the unit of trust.

The operational answer running through the AI Engineer Singapore material was consistent: define what good means, observe the system’s steps, constrain irreversible actions, and turn failures into repeatable improvements.

The conference material turned that broad claim into an engineering stack. Eugene Yan treated evaluation as a development primitive rather than a post-hoc check. His framing was familiar to software teams: a test set of inputs, assertions over outputs, and metrics that show whether a change improves or regresses behavior. The important shift is that assertions have to reflect product requirements, not generic benchmark preferences. Logical tasks should use deterministic checks where possible; generative tasks may need LLM judges, but with known failure modes such as position bias and length bias.

20–50

automated examples Yan said can outperform repeated manual vibe checks when run continuously

That same discipline applies inside retrieval-augmented generation. Yan’s warning was that a final answer can look correct while the retrieval component is broken, because a capable model may guess from parametric memory. The fix is component-level evaluation: test query rewriting, retrieval, synthesis, and final answer quality separately. Amr Ahmed’s structured-extraction architecture made the same point for long documents: schema, question generation, extraction, grounding, and verification should be separable stages, not one large prompt whose failure is hard to diagnose.

Traces are the observability half of the same system. Arize’s Phoenix was presented as a way to inspect full LLM application traces and spans: retriever calls, prompts, generations, tools, latency, token usage, and user-rated failures. The diagnostic claim matters because a hallucination is not always “the model’s fault.” It may come from bad chunking, irrelevant retrieval, a broken tool call, or state passed incorrectly between steps.

The other side of quality is constraint. Adaption’s wholesale quote workflow showed the pattern: the LLM extracts messy buyer requests and matches them to catalog items, but deterministic code queries SQL and calculates the final price. Haotian Zhang’s tool-abstraction talk made the same case for APIs: function calling needs schemas, validation, authentication, retries, and provider-specific translation. Code enforces boundaries that model-generated strings cannot reliably enforce.

Fine-tuning, retrieval, long context, and test-time search then become tools for placing behavior in the right part of the system. Fine-tuning can move repeated, latency-sensitive formats or routing behavior into parameters. Retrieval can provide current state and external knowledge. Long context can reduce some context-management work, but latency and distraction keep architecture relevant. Test-time search can let agents explore, backtrack, and score possible action paths before committing. None of these is a universal answer; all are ways to make the surrounding system more observable, controllable, or adaptive.

Safety, productivity, and enterprise explainability are not the same problem. Ngo’s sleeper-agent framing is not the same as a coding agent producing brittle tests, or a bank needing to justify a credit recommendation. But they increasingly share an operational shape: quality has to be defined outside the model’s immediate output, the system’s intermediate steps have to be visible, high-risk actions need deterministic boundaries, and production failures need to feed back into evals, tests, memory, or training data.

Agentic AI Is Turning Model Quality Into a Systems ProblemAI Engineer

2. Coding agents expose the testing problem first

Coding agents make the quality-loop problem easy to see because they can increase output while weakening the evidence that the output is correct. Marlene Mhangami’s Playwright argument starts from that tension: AI-assisted teams may ship more code, but code volume is not the same as product progress. In her reading of software-engineering evidence, clean engineering environments let AI contribute more usefully, while messy codebases let AI amplify entropy.

The testing failure she focused on is not a lack of tests. It is tests that validate the implementation rather than the behavior users care about. If an agent writes a feature and then writes tests after the fact, it can produce self-affirming tests that confirm its own assumptions. Mhangami’s simple tax example captures the risk: a bad test can assert that add_tax(100) equals add_tax(100), while never encoding the actual requirement that UK VAT is 20%.

The stronger workflow reverses the order. The feature request becomes the test oracle. The agent extracts expected behavior from the request, writes failing Playwright tests against visible user behavior, and only then implements code to make the tests pass. The developer’s work does not disappear; it moves toward review, refactoring, design cleanup, and maintaining the codebase conditions that make agentic development tolerable.

Workflow	Where the test oracle comes from	Main risk
Weak agent workflow	The implementation the agent just wrote	Tests can ratify the agent’s assumptions instead of checking the requirement
Behavior-first workflow	The feature request and user-visible behavior	The test may still be incomplete, but it is external to the implementation

Mhangami’s testing distinction is about whether quality is defined before or after the agent writes code

Playwright matters in this story because it moves evaluation closer to product behavior. A browser test that searches for a toy and expects the right result is not merely checking that a method returns a value. It is checking that a user can perform the intended action in the application. Mhangami’s Tailspin Toys demo used a product-management request as the source of truth: simple text search, Azure AI Search with fallback, category filters, and price filters. The agent inspected the project, generated failing Playwright tests, implemented the features, and then exercised the product in a visible browser.

The toy-store specifics are less important than the pattern. Behavioral tests define expected outcomes before an agent acts. They create an external standard that the generated code has to satisfy. That directly echoes Yan’s evaluation discipline from AI Engineer Singapore: turn implicit checks into repeatable tests, and make the test reflect the product requirement.

Mhangami’s warning about green test suites also broadens beyond coding. Coverage can look healthy while the system remains wrong. In agentic AI, the equivalent failure appears in many forms: a retrieval pipeline that produces a plausible answer from the wrong documents, a tool-using agent that reaches the right result by an unsafe path, or an enterprise assistant that gives a recommendation without preserving the chain of evidence. In all cases, a local success signal can hide a broken quality loop.

The practical tension for software teams is that AI increases throughput before it increases trust. If the codebase has weak tests, poor type coverage, stale documentation, tight coupling, or unclear module boundaries, faster code generation can create more review burden and more rework. Playwright is not the universal solution to that problem. Mhangami herself noted that API tests may be better where an API is available. But her workflow shows the broader applied-AI lesson: the agent should not be allowed to define and validate its own success entirely from inside the artifacts it just produced.

Playwright Lets Agents Test Feature Requests Before They Write CodeAI Engineer

3. Vertical AI needs people who own the loop, not just reviewers on call

Vertical AI adds the organizational layer to the same quality problem. Chris Lovejoy’s argument is that many vertical products do not fail because the frontier model is too weak. They fail because the organization has not operationalized the domain judgment required to define, measure, and improve quality.

That distinction matters because “evaluation” is not always a purely technical system. In meeting notes, medical documentation, prior authorization, and customer-specific workflows, quality may depend on taste, professional judgment, local policy, workflow nuance, or customer context. Generic benchmarks cannot fully determine whether the product output is good. Someone or something has to define what “good” means for that domain and make that definition change the product.

Lovejoy’s Oracle, Evaluator, and Architect framework is useful because it asks where domain judgment sits in the quality loop.

Model	Domain expert’s role	When it fits
Oracle	Judges outputs and directly changes prompts, context, tools, or product behavior	Quality is subjective or taste-driven, and direct review can still cover the product
Evaluator	Defines metrics, review systems, and failure modes for engineers to act on	Quality is measurable and manual engineering iteration is fast enough
Architect	Designs automated systems that improve from usage with less direct human intervention	Variation is too high for manual loops, and improvement can be automated

Lovejoy’s roles describe where domain judgment enters the AI quality loop

Granola is the Oracle case because the core output — meeting notes — depends heavily on taste. Lovejoy described Jo Barrow, a writer and journalist, as the person who researched what makes good notes, reviewed outputs, and directly changed prompts. The relevant expertise was not a credentialed profession; it was judgment about the product’s core output.

Tandem started similarly but needed decentralized Oracles. Medical notes vary by specialty, geography, note type, and customer expectation. One doctor could not own the full quality surface as the product scaled, so the operating model had to distribute domain ownership across more contexts. The lesson is not simply “hire more experts.” It is that quality ownership has to match the product’s variation.

Anterior shows the progression from Oracle to Evaluator to Architect. Prior authorization is more measurable than meeting-note quality because the decision can be assessed against medical evidence and policy. Lovejoy first reviewed outputs and changed prompts and code himself. As volume grew, he built metrics, review dashboards, and clinician review systems. Then customer-specific policy variation made manual fixing too slow, pushing the system toward automated improvement.

The pattern connects directly to the broader applied-AI stack. Evals and traces are necessary, but the quality signal may not exist until a domain expert defines it. A review dashboard is only useful if the right failure modes are represented. A prompt library is only useful if someone with authority can decide which behavior is acceptable. An automated improvement loop is only safe if the objective it optimizes reflects real domain quality.

Lovejoy’s strongest organizational warning is that advisory experts do not change the system. A clinician, writer, or workflow expert who reviews outputs but lacks authority over product decisions is not the same as a quality-loop owner. The missing component may be less a model, benchmark, or tool than an accountable person who can turn domain judgment into metrics, prompts, context, review workflows, and eventually automated adaptation.

That does not mean every AI company needs a credentialed expert in the formal sense. Lovejoy explicitly treats relevant judgment as broader than professional licensing. The point is operational: the organization must identify the judgment its product depends on and give that judgment a place in the loop.

Vertical AI Teams Need Domain Experts Who Own Quality LoopsAI Engineer

4. Decision memory is becoming part of the product architecture

Testing and domain review define quality before or during action. Stephen Chin’s context-graph argument extends the loop after action: enterprise agents need memory structures that preserve what the system knew, which path it took, what policy or precedent mattered, and what happened afterward.

Chin’s core distinction is between isolated retrieval and connected context. Enterprise knowledge is often scattered across Slack, CRM records, Jira, PagerDuty, Zendesk, Zoom calls, documents, customer threads, and people’s heads. A vector search may retrieve relevant text, but still miss the relationship that makes the information usable. A context graph, in Chin’s framing, stores not only entities and documents but also decisions, evidence, policies, exceptions, approvers, tool calls, causal chains, and outcomes.

Context graph is Chin’s proposed architecture for making that decision trail queryable. It should be read as one proposed answer to the problem, not as proof that graphs are the only viable architecture. The underlying need is broader: agents that make or recommend consequential decisions need inspectable state and reusable precedent.

The healthcare example shows the retrieval gap. Asked what was in Andrea Jenkins’s emphysema care plan, a direct LLM response gave generic care guidance. A baseline RAG system returned respiratory therapy, deep breathing, and coughing exercises. Chin’s GraphRAG example added medication management, smoking cessation counseling, and pulmonary rehabilitation. His interpretation was that smoking cessation depended on connected patient history that similarity search did not retrieve as the nearest relevant chunk.

The important claim is not that vector retrieval is useless. It is that grounded retrieval can still be incomplete when the missing fact is one or more relationship hops away.

Record type	What it preserves	What it misses or adds
Traditional audit log	That an action happened at a time	Usually misses causal chains, relationships, and reasoning
Context graph	Entities, policies, decisions, tool calls, evidence, exceptions, and outcomes	Adds a connected structure for asking why the action happened

Chin’s audit-log contrast is a shift from recorded action to queryable decision context

The financial-services demo makes the enterprise version concrete. A user asks whether to approve a $25,000 credit-limit increase for Jessica Norris. The system searches the customer, retrieves prior decisions, finds precedent, checks policies, detects fraud patterns, and exposes the graph to a human approver. The visible prior rejection included a credit score of 620 below a required 650, debt-to-income ratio of 48% above a 45% maximum, and one late payment in the last 12 months. The point is not the particular credit decision. It is that the recommendation is tied to prior decisions, policy constraints, account relationships, risk factors, tool calls, and visible reasoning.

This is a more structured version of the trace discipline from the AI Engineer Singapore material. A trace tells engineers what happened inside a request. A context graph tries to preserve those traces as reusable organizational memory. The next similar case should not be handled from scratch if the organization already has relevant precedent, policy interpretation, exception history, or observed outcome.

There is also a human-accountability dimension. In regulated or high-stakes workflows, the system’s value depends partly on whether a person can inspect and stand behind the recommendation. Chin’s context graph does not remove the human approver from the loop. It gives that approver a connected view of the evidence and reasoning the agent used.

Context Graphs Make AI Decision Trails QueryableAI Engineer

5. Synthesis: the applied-AI stack is reorganizing around closed loops

The four arguments point to the same operating pattern from different angles. Applied AI is moving from isolated model calls toward closed quality loops: systems that define success externally, observe intermediate behavior, constrain risky actions, preserve decision evidence, and feed failures back into improvement.

Coding agents show the pattern at the feature level. Behavioral tests written from a feature request give the agent a target it did not invent. The test oracle lives outside the generated implementation. That does not guarantee correctness, but it reduces the risk that the agent validates its own assumptions.

Enterprise and production AI show the pattern at the system level. Evals, traces, component-level RAG testing, deterministic execution boundaries, routing layers, and tool schemas make failures observable and actions bounded. A model may interpret a messy request, but prices should come from databases, API calls should be schema-validated, generated code should compile in a sandbox, and retrieval should be tested as a component rather than inferred from final answer quality.

Vertical AI shows the pattern at the organizational level. When quality depends on judgment, the loop needs an accountable expert or expert system that can define what good means and make that definition operational. Sometimes that means an Oracle directly editing prompts and reviewing outputs. Sometimes it means an Evaluator building metrics and review systems. Sometimes it means an Architect designing automated improvement. The shared requirement is that judgment cannot remain advisory if the product depends on it.

Context graphs show the pattern at the memory layer. If agents are going to recommend or make decisions, the product architecture has to preserve more than a final answer. It has to keep relationships, tool calls, policies, exceptions, decisions, and outcomes available for later inspection and reuse. A flat log records that something happened; a richer decision memory tries to preserve why it happened and how similar cases were handled.

These pieces do not imply that one stack has won. The answer is not that every team should adopt Playwright, hire an Oracle, and build a graph database. The common direction is more general: agents need external quality systems because they cannot be trusted to define, observe, and correct themselves entirely from inside the model call.

Before action
Feature requests, domain standards, policies, and expected behaviors define what quality means.
During action
Tests, traces, schemas, retrieval checks, tool boundaries, and human review make the agent’s steps observable and constrained.
After action
Failures, decisions, outcomes, and precedents feed back into evals, memory, fine-tuning, retrieval, or product changes.

Fine-tuning, retrieval, long context, and test-time search fit inside this map rather than replacing it. Fine-tuning can move repeated behavior into parameters. Retrieval can provide current knowledge. Long context can let models see more of the relevant state. Test-time search can spend more inference compute exploring and correcting possible paths. But each only helps if the team can tell whether behavior improved, where failures occurred, and what evidence supports a consequential action.