Senior Engineers Overfit AI Agent Tools to Context Models Cannot See

Philipp SchmidAI EngineerSaturday, May 30, 20267 min read

Philipp Schmid of Google DeepMind argues that senior engineers often struggle with AI agents because they design tools around context they personally understand but the model cannot see. In his account, agent-ready systems need explicit tool schemas, semantic state, recoverable errors, eval-based reliability measures and disposable harnesses, because engineers are managing probabilistic behavior rather than controlling a deterministic flow.

Senior engineers overfit tools to context the agent cannot see

Philipp Schmid argues that engineers struggle with AI agents because the control model has changed, and experienced engineers may be most exposed to the trap. They know the systems too well. A backend endpoint such as delete_item(id) may feel obvious to the developer who built the product API: they know what item means, what shape the ID takes, and what should happen when the item is not found. The agent has none of that accumulated context.

An agent sees the function schema, docstring, and tool definition. If the tool name is vague and the parameter is just id, it may guess. Schmid’s proposed alternative is more explicit and semantic: delete_item_by_uuid(uuid: str), with a docstring saying it deletes an item by UUID and returns an error if not found.

That is the practical meaning of building “agent-ready” APIs: semantic names, verbose docstrings, typed parameters, and clear failure behavior. Tool definitions become part of the model’s operating environment, not internal implementation details.

Schmid frames the broader shift as a move from “traffic controller” to “dispatcher.” Traditional software starts with a specification, then code, tests, deployment, and user interaction. The engineer controls the roads, lights, speed limits, and route. Agent development starts with instructions, then observation, adaptation, prompt and tool changes, and repeated runs. The engineer states the destination and constraints; the model may take a path the engineer did not specify.

The slide behind this metaphor makes the tension explicit: the “traffic controller” mindset treats software as deterministic, where the engineer owns the roads, signals, and laws; the “dispatcher” mindset treats agent engineering as probabilistic, where the engineer gives instructions to a driver navigating ambiguity. The trap, as the slide puts it, is that better engineers may fight the model harder, trying to “code away” nondeterminism with if-statements and rigid schemas.

You cannot code away the probability. You must manage it through evals and self-correction.

Philipp Schmid

The claim is not that control disappears. Schmid’s summary phrase is “trust, but verify.” Engineers still define goals, constraints, tools, and evaluation standards. What changes is the level at which control is applied.

Structured state can strip away the meaning the model needs

One of Schmid’s first concrete shifts is that “text is the new state.” Traditional systems tend to turn the world into booleans, enums, flags, and status fields because structured values are safe to route and easy to test. In agent systems, that structure can remove the semantic detail the model needs later.

His example is a deep research agent that proposes a plan before doing the work. A conventional interface might offer “accept plan” or “deny plan.” But a user’s actual response may be: approved, but focus on the U.S. market and ignore California. If the system only records approval status, it loses the instruction embedded in the approval. A traditional flow might force the user to reject the plan, enter more information, wait for a revised plan, and continue. An agent can instead preserve the raw semantic instruction and use it downstream.

The slide contrasts the two designs directly. The software-engineering version stores { plan_id: "123", status: "APPROVED" }; the agent-engineering version stores { plan_id: "123", text: "Approved, but focus on the US market and ignore CA." }. Schmid describes the first approach as “lobotomizing intent” and the second as preserving semantic meaning.

The same applies to memory and personalization. Schmid uses temperature preferences as a small example. A profile flag such as is_celsius: true cannot express “I prefer Celsius for weather, Fahrenheit for cooking.” The latter is not just a value; it is a task-aware preference. His broader point is that personalization, memory, and intent often cannot be faithfully represented as a single field in a user profile.

This does not mean every system abandons structure. It means agent-facing state should preserve meaning when meaning matters. Schmid says the relevant context may be text, images, video, or audio, but the important shift is away from assuming that all application state can be reduced to clean structured fields without loss.

Hard-coded flows break when intent changes midstream

Philipp Schmid says agent systems require handing over more control than many software engineers are used to. In microservice-style systems, a user intent often maps to a route: classify the request, choose a workflow, execute the workflow. That works when the interaction is stable. It works less well when the user’s intent changes during the interaction.

His customer-support example starts with a user saying, “I want to cancel my subscription.” A conventional system may classify the intent as churn and enter a cancellation flow. But an agent might respond with a retention offer: 50% off for the next three months. If the user accepts, the intent is no longer churn; it is retention. A hard-coded churn flow may have lost a customer by continuing down the path selected at the first turn.

Real interactions loop, backtrack, and pivot. Engineers should describe the outcome and constraints rather than enumerate the whole path. The agent needs enough latitude to react to the conversation as it changes.

Failures should be fed back, not treated as reasons to restart

Philipp Schmid argues that errors in agent systems should be treated as inputs. If a tool call fails, a schema mismatch occurs, or a downstream service returns an error, the agent should often receive that information and try a different approach rather than crash the whole run.

The reason is partly economic and partly contextual. In older HTTP-style request flows, retrying from scratch was often cheap. If a product search failed, the system could rerun the request. In agent systems, a run may take five or fifteen minutes, consume meaningful compute, and accumulate context across multiple steps. Restarting at step four of five wastes the earlier work and may lose the context that made the later step possible.

Schmid compares the right posture to error handling in Go, where a function can return a value or an error and the caller handles both as part of normal control flow. For agents, the failed tool execution can be returned to the model as text: the error happened, here is what it was, try another approach. The traditional path raises a runtime error and crashes; the agent path returns the error string into the loop so the model can recover.

This is one of the places where “trust, but verify” becomes operational. Recovery has to be designed. The system should expect long-running agents to encounter strange cases and should provide ways to continue forward rather than assuming that any exception invalidates the run.

Evals replace assertions where nondeterminism matters

Traditional tests assume that for input A and code B, output C should result. Philipp Schmid says that assumption no longer holds cleanly for agents. The same input may lead to different steps and sometimes different outputs because the system is nondeterministic.

The shift is from relying only on unit-test-style assertions to using evals. The question changes from “did this exact assertion pass?” to “how often does this work, and at what quality?” Engineers should run multiple trials, measure pass percentages, and use qualitative grading where the output is subjective. For tasks such as research reports or customer-support responses, correctness may not be captured by a binary assertion. An LLM-as-judge or a human expert can be part of the evaluation loop.

47/50

passes in Schmid’s eval-readiness example

Schmid also emphasizes negative cases: an agent should be tested on whether it ignores requests outside its scope. Triggering blindly degrades the system. And while tracing remains necessary for debugging, grading should focus on outcomes rather than the exact path. If one user’s request requires four more research steps and another consumes more tokens, the relevant question is whether the result is correct and useful.

This turns production readiness into risk management. An agent that succeeds one out of ten times on the same prompt is too flaky for a customer-facing setting. The goal is not to eliminate all variance. It is to understand how often the system succeeds, where it fails, and whether the quality threshold is high enough to ship.

Build the harness as if it will be replaced

Philipp Schmid ends with a constraint on engineering style: build to delete. As models and agents improve, teams will rebuild the same systems many times. Software in this layer is disposable.

That does not mean engineering discipline matters less. It means the discipline is applied differently. The system should preserve semantic context instead of flattening it too early, expose tools in language the model can use, recover from errors inside a long-running loop, and measure reliability through evals rather than assuming deterministic assertions will cover the behavior.

The right response, in Schmid’s framing, is not to overfit every workflow and harness as if it will be permanent. It is to make modules that can be replaced as the underlying model capability changes.

AI Application Architecture Evals and Benchmarks Agents and Autonomy