Agent Safety Requires Specs, Not Just Larger Eval Sets

Steven WillmottAI EngineerSunday, May 31, 20267 min read

Steven Willmott of SafeIntelligence argues that larger models are not automatically safer agents: the same capability that lets them handle more tasks can also help them understand adversarial instructions and misuse broader infrastructure access. His proposed answer is spec-driven validation, in which an agent is tested against an implementation-independent behavioral spec covering rules, domain boundaries, rights and roles, ground truth, domain knowledge and robustness requirements. The point is to make security and reliability testing follow from what the agent is allowed to do, not just from a dataset of expected answers.

Bigger models can widen the attack surface they are meant to control

Steven Willmott frames agent evaluation around a problem that is easy to miss if “smarter” is treated as a synonym for “better.” A larger model may be more capable at the intended task, but that same capability can make some failures more feasible, not less.

His example is a jailbreak wrapped in a poem. A smaller model may simply fail to understand the poem. A larger model can understand the wrapper, extract the hidden instruction, and execute the malicious intent. In that case, the larger model is not safer because it is more intelligent; the intelligence is part of what makes the attack work.

It's not obvious that bigger is safer, and it's not obvious that bigger is better.

Steven Willmott · Source

The same trade-off appears in agent scope. An agent with a broad remit has more ways to be useful, but also more places where it can be manipulated. It has more domains it is willing to discuss, more actions it may be authorized to take, and more behavior that must be tested. A support bot that answers questions is one class of risk. An agent able to wire millions of dollars is another.

Willmott’s target is not model capability in the abstract. It is deployed, often automated agents. In that setting, the goal is narrower: an agent should be “good enough to perform” without being “capable of arbitrary harm.” Harm has two dimensions. One is linguistic and behavioral: what instructions the agent can receive, how flexibly it interprets prompts, and whether adversarial phrasings change its behavior. The other is infrastructural: what tools, rights, and tasks the agent can actually exercise.

Cost and latency also matter. Using a large model for a simple task, such as basic math, may be slower and more expensive than a more specialized implementation. The point is not that small models are always preferable. It is that deployment requires a balance among capability, safety, cost, speed, and the size of the exposed surface.

A test dataset is not enough to define what an agent should do

In Steven Willmott’s framing, machine-learning validation has often treated datasets as the practical definition of desired behavior: examples go in, outputs are compared, and metrics such as F1 or accuracy summarize performance. For agents, that is too thin.

A dataset can show examples of “good” behavior, but it does not by itself define the boundaries of allowed behavior, the domain universe, the permissions model, or the amount of variation the agent must tolerate. Nor does it always define harm. An agent may simply fail at a task, or it may do exactly the wrong thing in response to a malicious request. Those are not the same failure.

Willmott calls the alternative spec-driven validation or spec-driven testing. The spec is written around the role or task, independent of the agent implementation. It asks what would be needed to design a task benchmark by itself before deciding which model, framework, or stack will run it.

That benchmark needs more than input-output examples. Willmott names six ingredients: ground truth, ontologies or dictionaries, rights and roles, rules, domain knowledge, and robustness requirements. Ground-truth examples still matter; they are the “golden” test cases. Rules are another. A customer-support agent may be forbidden to offer more than a stated discount threshold, or to issue refunds more than 30 days after purchase. Willmott gives one spoken example of “more than 10%”; his product-support slide gives “No discounts > 20%.” The shared point is not the specific number, but the existence of a policy boundary that must be tested.

Ontologies and dictionaries define the relevant universe. An airline chatbot should know about the destinations that airline actually flies to, not the entire space of plausible destinations. A company may also have internal terminology for policies, products, and processes that outside users would not know. If that terminology affects behavior, it belongs in the spec and in the test generator.

Domain knowledge is separate from generic language understanding. Willmott gives the example of business terms such as gross profit and gross sales: a general model might blur them, but in business contexts they are meaningfully different. For finance, science, product support, or other specialized agents, the spec has to capture which substitutions are valid and which are not.

Rights and roles define how behavior changes depending on who is asking. An agent may respond differently when a user is logged in, logged out, has account privileges, or has access to particular data. Robustness requirements define how stable the agent must be under stress: typos, mobile-style communication, rephrasing, parametric changes, and context shifts.

Spec component	What it defines	Example from Willmott
Ground truth	Known good examples	Support samples, refund scenarios, store-location responses
Rules	Behavioral limits and policies	Discount caps such as 10% or 20%; no refunds after 30 days
Ontologies / dictionaries	The valid universe of terms and entities	Store locations, opening times, product names, airline destinations
Rights and roles	Permissions and behavior by user state	Logged-in versus logged-out access; account privileges
Domain knowledge	Context-specific distinctions	Gross profit versus gross sales
Robustness requirements	Allowed variation before failure	Typos, rephrasing, mobile communications, context change

Willmott’s agent spec extends evaluation beyond input-output examples.

The spec becomes the input to security and robustness testing

The practical value of an agent spec, Steven Willmott says, is that it can drive two kinds of validation: red-team security checks and robustness checks.

Security testing benefits from knowing the agent’s purpose. If an agent is designed to operate in a domain, it will be more willing to discuss that domain, and it may have more authority there. That is where attacks are likely to matter. A banking agent’s risky edge is not a random topic outside its remit; it is the area where it can act inside banking infrastructure. The spec helps generate attacks that are relevant to what the agent is actually empowered to do, and it supplies criteria for whether the attack succeeded.

Robustness testing asks a different question: whether the agent can still do its job when the input changes. Willmott compares this to Safe Intelligence’s earlier work on vision models, where validation might ask whether a system can still detect a runway at sunset, at sunrise, in fog, or with camera shake. For agents, the analogous stressors are language and context variation: typos, rephrasings, parameter changes, mobile-style messages, and shifts in the surrounding conversation.

His product-support example shows the workflow. The spec provides the raw material: support samples, refund scenarios, personalization cases, store locations, product codes, inventory systems, permissions, data-access rules, approval boundaries, refund and discount limits, politeness requirements, domain knowledge, and robustness categories. A validation system can then generate variants inside those boundaries, cache stress tests, run them across agents, and compare outcomes against the rules and context the spec supplies.

Willmott connects this to current agent-evaluation practice. In LLM work, “eval” often means the test set. He accepts that usage, but argues that serious agent testing has to capture the task and context around the test. Agent cards, including those in the A2A spec, describe what an agent does, and prompt-management platforms increasingly allow richer metadata about why a test exists. Those are useful starts, especially for generating variants.

But an agent card describing a skill such as scheduling a meeting still does not answer enough evaluation questions by itself. A tester still needs to know the valid envelope: whose meetings can be booked, what kinds of changes are acceptable, which permissions apply, and where the agent should stop.

Behavior specs should outlive the current stack

Steven Willmott wants the behavioral spec to be independent of the implementation. A company may start with one agent framework and later move to another; his examples include LangChain, Vertex agents, and other infrastructure choices. If behavioral tests are tied too closely to the current stack, they will not survive those changes.

The better model is closer to integration tests, unit tests, and penetration tests that can be run independently against whatever agent implementation exists at the time. The spec becomes a reusable behavioral artifact: versioned, maintained, and applied across multiple agents and stacks.

That also creates a feedback loop. If the spec defines expected behavior, rights, rules, domain boundaries, and robustness requirements, then test results can show where the agent’s behavior is weak. Willmott describes this as a kind of improvised or “backyard” reinforcement learning: not training the base model directly, but running the agent, collecting failures, filling robustness gaps, and iterating around the outside.

His longer-term direction is toward formats and tooling for these specs. He points to his own background in API infrastructure and says he helped write the OpenAPI spec, joking that he partly apologizes for it. The analogy is that agent behavior specs should be expressible in portable files, kept in a GitHub repo, pulled into different tools, and heavily versioned.

The closing instruction is simple: start thinking in agent specs. Specify behavior. Keep it separate from implementation. Make implicit assumptions explicit. Use the spec not only to check whether a model answered a test case correctly, but to define the task, the role, the permissions, the acceptable variations, and the boundaries of harm.

AI Application Architecture Evals and Benchmarks AI Security AI Safety and Alignment Agents and Autonomy

Bigger models can widen the attack surface they are meant to control

A test dataset is not enough to define what an agent should do

The spec becomes the input to security and robustness testing

Behavior specs should outlive the current stack

The frontier, in your inbox tomorrow at 08:00.