
Interwhen Verifies AI Agent Actions Before They Become Irreversible

Amit Sharma · Yash Lara · Microsoft Research · Thursday, May 14, 2026 · 6 min read

Microsoft Research’s Amit Sharma presents Interwhen as a framework for moving AI agents from post-hoc checking to verified execution while they are still acting. The open-source library uses LLMs to turn natural-language instructions, policies, and partial responses into smaller verifiable properties, then applies symbolic or model-based verifiers to tool calls and intermediate behavior. Sharma argues that this lets agents continue normally when checks pass but interrupts them when a verifier detects a violation, addressing risks that final-output review may catch too late.

Verification has to move inside the agent’s work

Yash Lara frames Interwhen as a response to a specific reliability problem in agentic AI: agents increasingly do work that cannot be treated as harmless text generation. The issue is not only whether a final answer looks correct. It is whether the agent’s intermediate behavior remains within bounds while it is acting.

The motivating examples from Amit Sharma are agents operating inboxes, executing complex organizational workflows, or interacting in the physical world. In those settings, tool calls can make irreversible changes. An agent might send an email, write to an organizational database, or move a nearby robot. That changes what verification must cover. A final-output check can arrive too late if the risky action has already happened.

The question Sharma poses is how to move from “agentic execution” to “verified agentic execution.” Interwhen is Microsoft Research’s proposed framework for that transition: a real-time verification system that checks partial model responses and intermediate actions, extracts properties from natural-language instructions and policies, and uses verifiers to return pass/fail results, feedback, and diagnostics while the agent is still running.

Current checks are either too broad or too narrow

Two common verification approaches leave opposite gaps, in Amit Sharma’s account.

The first is using an LLM as a judge. In that setup, the verifier receives the task instruction, any domain policies, and the model response at a given time. The appeal is generality. The limitation, in Sharma’s framing, is that this becomes a monolithic verifier that may miss nuanced errors. He calls it unreliable.

The second approach is formal verification. Sharma treats formal verifiers as reliable, but says they are usually restricted to math and code. That leaves a gap for agentic workflows where the relevant constraints are written in natural language, spread across policies, user context, tool-call arguments, and evolving model responses.

Interwhen is meant to occupy that gap. Its premise is that formal verifiers can be useful outside narrow math-and-code settings if the system can first project messy natural-language material into smaller verifiable properties.

The key innovation we have is an LLM-based projection step that automatically breaks down outputs into a list of verifiable properties.
Amit Sharma

The architecture Sharma presents has five main parts. The inputs are the task instruction, optional domain policies, and the model response at time t. A parsing engine projects those inputs into an intermediate representation. A formal specification is created with “holes” filled by LLMs at runtime. Formal methods verify parts of the response against the inferred specs. The output is a verification result: pass or fail, with feedback and diagnostics that can be used to steer the agent.

Sharma emphasizes three design properties: Interwhen breaks inputs into verifiable properties rather than relying on a single all-purpose judge; verification runs asynchronously so model execution is not stopped unless an error is found; and the system is plug-and-play, allowing practitioners to bring existing verifiers.
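A minimal sketch of the verification side of that pipeline helps make the pieces concrete. It assumes each atomic property produced by the LLM projection step carries its own checker; the type and function names below are illustrative, not Interwhen's actual API, and the projection step itself is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Property:
    """One atomic, checkable property extracted from the instruction or policy."""
    name: str
    check: Callable[[dict, dict], Optional[str]]  # (response, context) -> error message or None

@dataclass
class VerificationResult:
    passed: bool
    feedback: List[str] = field(default_factory=list)  # diagnostics used to steer the agent

def verify(response: dict, context: dict, properties: List[Property]) -> VerificationResult:
    """Run every property checker against the model response at time t
    and collect pass/fail plus feedback for the agent's next turn."""
    errors = [err for p in properties if (err := p.check(response, context)) is not None]
    return VerificationResult(passed=not errors, feedback=errors)
```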

Natural-language policy becomes atomic verifier logic

The retail-agent example on Tau2-bench shows how Amit Sharma wants the abstraction to work in practice. The policy text says the agent can help users cancel or modify pending orders, return or exchange delivered orders, modify a default address, and provide information about profiles, orders, and related products. It also includes procedural and policy constraints: at the beginning of a conversation, the agent must authenticate the user by asking for an ID and PIN (unless the user has already provided a user ID), and refunds must go to the original payment method or to a gift card.

Interwhen’s first step is to extract “structured, atomic properties” from that kind of text. A policy sentence such as “Refund must go to original payment method or a gift card” becomes a property that can be checked. Sharma shows it being converted into Python verifier logic: if the tool is not a refund tool, the rule does not apply; if it is, the verifier checks whether the requested payment method is either the original payment method for the item or a gift card. If not, it returns an error such as “Refund to payment method not allowed.”
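The article describes that verifier logic but does not show the generated code. A sketch of what it could look like follows; the tool names and argument fields (return_delivered_order_items, payment_method_id, and so on) are assumptions for illustration, not Interwhen's output.

```python
def verify_refund_payment_method(tool_call: dict, context: dict):
    """Policy: 'Refund must go to original payment method or a gift card.'
    Returns None when the rule is satisfied or does not apply,
    otherwise an error message that is fed back to the agent."""
    # The rule only applies to refund/return tools.
    if tool_call.get("name") not in {"return_delivered_order_items", "refund_order"}:
        return None

    requested = tool_call["arguments"].get("payment_method_id")
    original = context["order"]["payment_method_id"]

    if requested == original or requested == "gift_card":
        return None
    return f"Refund to {requested} not allowed."
```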

That conversion is described as a one-time operation that can be assessed and checked. The LLM-assisted step transforms policies into explicit properties and specifications that verifiers can enforce.

At runtime, the system fills in the variables needed by the verifier using the user context and the model’s current response. Sharma’s example has a user saying, “I want to return everything I ordered.” The agent retrieves order details and receives an order ID. It then attempts a return tool call using credit_card as the payment method. The conversation state includes a user ID, the order ID, and an original payment method of paypal.

The verifier detects the mismatch and returns: “Refund to credit_card not allowed.” Sharma says the model then takes this feedback in the next turn and corrects the mistake.
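Continuing the hypothetical verifier above, the runtime step would bind its inputs from the conversation state and the attempted tool call, and surface the error as feedback (values below are invented to mirror Sharma's example):

```python
context = {
    "user_id": "user_123",                                            # illustrative values
    "order": {"order_id": "#W001", "payment_method_id": "paypal"},    # original payment method
}
tool_call = {
    "name": "return_delivered_order_items",
    "arguments": {"order_id": "#W001", "payment_method_id": "credit_card"},
}

error = verify_refund_payment_method(tool_call, context)
print(error)  # -> "Refund to credit_card not allowed."
# The error is returned as feedback, and the agent corrects the tool call in its next turn.
```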

The monitor only interrupts when verification fails

Interwhen’s efficiency claim depends on what Amit Sharma calls a “fork-and-verify” design. Verification should not only improve correctness; it has to be practical enough to use. The framework therefore includes a separate monitor system that tracks the agent’s output in real time and calls verifiers asynchronously.

For reversible tool calls, the system creates a new thread of model execution, extracts the inputs needed by the verifiers, performs verification, and interrupts inference only when an error is found. If no error is found, the agent continues as designed.

The property here is that the agent's execution is stopped only if an error is found. Otherwise, it goes as designed.
Amit Sharma

The enforcement point is failure: verification runs alongside model execution, but it becomes blocking when a verifier reports a violation.
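One way to sketch that fork-and-verify pattern is with a worker thread per reversible tool call; the agent hooks (continue_generation, interrupt) and the run_checks helper below are hypothetical names, not part of the Interwhen API.

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def fork_and_verify(agent, tool_call, context, verifiers):
    """Fork-and-verify for a reversible tool call: run verifiers on a worker
    thread while generation continues, and interrupt the agent only on failure."""

    def run_checks():
        # Each verifier returns an error string, or None when it passes.
        return [err for v in verifiers if (err := v(tool_call, context)) is not None]

    future = _executor.submit(run_checks)  # fork: verification runs asynchronously
    agent.continue_generation()            # hypothetical non-blocking hook; generation proceeds

    errors = future.result()               # join before the effect becomes irreversible
    if errors:
        agent.interrupt(feedback=errors)   # enforcement point: only failures interrupt
```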

The benchmark claim is tied to verifier enforcement

Amit Sharma says Interwhen has produced state-of-the-art results on agentic benchmarks such as Tau2-bench, and that even small models can rival the accuracy of frontier models when using the framework. He also says the framework improves output quality across customer service, travel booking, logical puzzles, and agent-safety tasks.

The benchmark slide lists four task families and the corresponding verifiable properties and verifier languages:

Task | Verifiable property | Verifier language
Maze | Valid moves | Python
ZebraLogic | Assignments satisfy clues | Z3
Tau2-bench | Tool calls match policy | Python, Lean
AgentSafetyBench | Tool calls are safe | Python
Interwhen examples pair each benchmark task with a property to verify and a verifier implementation language.
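For the ZebraLogic row, "assignments satisfy clues" maps naturally onto an SMT check. A toy sketch with Z3 (the puzzle, clue, and variable names are invented for illustration and are not taken from the benchmark):

```python
from z3 import Int, Solver, Distinct, sat

# Toy puzzle: three people in distinct houses 1-3,
# with the clue "Alice lives immediately left of Bob".
alice, bob, carol = Int("alice"), Int("bob"), Int("carol")

s = Solver()
s.add([v >= 1 for v in (alice, bob, carol)])
s.add([v <= 3 for v in (alice, bob, carol)])
s.add(Distinct(alice, bob, carol))
s.add(bob == alice + 1)  # encoded clue

# Verify a proposed assignment from the model instead of solving from scratch.
s.add(alice == 1, bob == 2, carol == 3)
print("assignment satisfies clues" if s.check() == sat else "violates clues")
```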
100% verifier enforcement: no answer if verifiers do not pass

Sharma presents “100% verifier enforcement” as a key benefit alongside the framework’s plug-and-play nature and its compatibility with proprietary and open-weight models. In the slide’s wording, that means “no answer if verifiers do not pass.”

Microsoft Research has open-sourced Interwhen on GitHub at github.com/microsoft/interwhen. Sharma positions the release as an invitation to build and test verifiers for real-world use cases, not just as a finished benchmark artifact. The collaboration goal he states is to support a community focused on AI verification and safety, with the broader aim of “verified reasoning for any LLM” and, eventually, verified agentic execution.
