Agents Move From Chat Prompts To Engineered Work Surfaces

Applied AISunday, May 24, 202647 min to watch14 min read

Rachel Nabors, Lou Bichard, and Google’s AI Studio examples point to the same applied-AI shift: agents need interfaces, context, and coordination layers around the model. The work is moving toward graphical surfaces, callable browser and backend capabilities, explicit state and gates, and reviewable pipelines for generated applications.

1. Agents are moving out of the chat box

Rachel Nabors gives the cleanest user-facing version of the shift: chat is becoming an on-ramp, not the destination. She treats the chat box as a transitional interface for agents, comparable to a command line. It is universal, flexible, and comfortable for technical users, but it forces everyone else to know what to ask before the product has shown what it can do.

Chat is the lowest common denominator of the user experience.

? rachel-nabors

That is not merely an aesthetic critique. Interface shape changes what users can discover, trust, and complete. Nabors calls the inert conversational surface “Starfish design”: it sits there waiting for the user to supply intent. In a familiar product such as Linear, that may be tolerable because the user already understands the domain. As a general agent interface, it is a weak default. A blank box makes users carry the product model in their heads. The agent can retrieve, reason, and select tasks, but the work surface often needs visible affordances: navigation, state, previews, controls, and modes.

Her rebuilt Rachel the Great comic archive gives the argument a concrete architecture. One canonical source of truth — a manifest.json plus characters.json — feeds three surfaces: a conventional static website for human readers, a remote MCP server for agents such as Claude, and WebMCP tools exposed directly in the browser for browser agents. The goal is not to add a chatbot to a website. It is to let the same durable content support humans, assistant-hosted tools, and browser-native agents without collapsing all interaction into prose.

Surface	User or agent	What it proves
Static site	Human reader in a browser	The web remains the richest native surface for reading, navigation, and media.
Remote MCP server	Assistant such as Claude	Agent tools can be installed like web services and retrieve structured archive data.
MCP app	Assistant-hosted embedded interface	A tool response can become an HTML/CSS/JS work surface, not just text or JSON.
WebMCP tools	Browser agent	A page can expose its capabilities directly instead of making an agent infer them from screenshots or DOM traversal.

Nabors’ archive shows the same content moving across web, agent, and browser-agent surfaces

The most important bridge is the MCP app. Nabors shows a get_page tool whose response is associated with a ui.resourceUri, allowing Claude to render a comic reader inside the conversation. The agent still helps find the right story and page, but the reading experience becomes graphical and navigable: panels, commentary, comments, next and previous controls, and a transcript mode. The embedded surface looks and behaves like a small website because it uses the website’s assets, styles, and resources, subject to the sandboxing constraints of the host.

That pattern is stronger than “the agent returns richer text.” A tool call can return an interface. The model can remain involved in task selection and retrieval while the user does the actual work in a purpose-built surface. For applied AI, that is a different product primitive from chat completion.

Nabors’ WebMCP point is the mirror image. Instead of putting a miniature website inside an agent, a website can expose its functions to an agent in the browser. Current browser agents often rely on screenshots or DOM traversal, which makes them guess at capabilities a site already understands. WebMCP lets a page register a capability such as “next page” with a name, description, input schema, and execution callback. A browser agent does not need to identify the right arrow visually or parse the page tree; it can call the named function.

Nabors presents MCP as useful but unevenly implemented. MCP apps are sandboxed iframe islands. They lack normal localStorage, ordinary network access, and unrestricted external links. Assets and fonts require content-security permissions. WebMCP is also not identical to MCP; Nabors describes it as “to MCP as JavaScript is to Java,” related in spirit but not one-to-one compliant.

Those caveats strengthen rather than weaken the practical point. Agent products are becoming real enough that their problems are no longer only model problems. They need installable tool surfaces, context primitives, content-security rules, state management, and browser-level capabilities.

Agent Interfaces Are Moving From Chat to Web-Native SurfacesAI Engineer

2. The backend problem is coordination, not another agent runtime

Lou Bichard’s infrastructure frame is the backend counterpart to Nabors’ interface frame. If Nabors asks how agents should meet users, Bichard asks how agents should meet each other and the software-development lifecycle.

His central distinction is between runtime, orchestration, triggers, and coordination. Runtimes are where agents execute: threads, worktrees, containers, microVMs, or cloud development environments. Orchestration starts and manages workflow runs. Triggers start sessions from pull-request events, cron jobs, webhooks, monitoring alerts, or manual invocation. Bichard argues that those layers are increasingly available. The unresolved primitive is coordination: the system that tracks state, enforces gates, manages handoffs, determines ownership, and tells an agent whether it can move from one lifecycle stage to the next.

agent-infrastructure primitives Bichard separates: runtime, orchestration, triggers, and coordination

The distinction matters because many teams are trying to solve coordination with artifacts built for humans. GitHub can hold pull requests, review comments, labels, merge conflicts, and CI failures. Linear can hold tasks. CI can report whether a build or test suite passed. Bichard’s claim is that none of these is an agent-native coordination layer. A swarm of coding agents can flood GitHub with activity, but the result may be less legible for the supervising human, not more.

The software-development lifecycle exposes the gap. A human reads “plan, develop, test, review, ship” and infers many hidden micro-steps. “Plan” may include breaking down a ticket, identifying dependencies, estimating scope, checking acceptance criteria, and deciding whether the work fits in one pull request. A coding agent does not reliably infer those transitions. If the organization wants agents to move through the lifecycle, the lifecycle has to be decomposed into explicit states, gates, retries, approvals, and ownership rules.

Layer	Typical question	Bichard’s assessment
Runtime	Where does the agent execute, and with what isolation?	Mostly solved through worktrees, containers, microVMs, and cloud development environments.
Orchestration	How are workflows kicked off and managed?	Effectively solved for many teams.
Triggers	What starts a session?	Available through PR events, schedules, webhooks, alerts, and manual actions.
Coordination	What state is valid, who owns the next step, and can the agent proceed?	The missing primitive for agent swarms and software factories.

Bichard separates execution infrastructure from the coordination layer agents need to traverse work

The public examples in his material suggest that large companies are already rebuilding parts of this stack internally. Stripe’s Minions system was described as producing roughly 1,300 pull requests per week with human-reviewed but “zero human-written code,” using isolated devboxes, deterministic steps such as linters and tests, and more than 400 MCP tools. Ramp’s Inspect system was described as a background agent responsible for roughly 30% of merged pull requests, wired into services such as Sentry, Datadog, LaunchDarkly, Buildkite, GitHub, and Slack. Bichard’s Bank of Ona experiment used product, implementation, review, triage, QA, and security agents to produce 575 pull requests for a banking application.

Those examples do not add up to one settled pattern. Stripe, Ramp, and Ona are building different systems. The common signal is that companies are independently creating agent-factory infrastructure because one background coding agent is not the end state. Once agents run in fleets, the operational question becomes who owns the next action, what context each agent receives, how work is handed off, how failures are retried, and when a human must intervene.

Bichard’s proposed form factor makes the abstraction concrete: a CLI gateway. A local or remote coding agent should be able to call a system and ask, in effect, “Have I satisfied this gate, and what transition is valid next?” That is different from CI failing after the fact. It is a callable coordination surface during the work, backed by state machines, durable execution, policy enforcement, and human approvals.

The design pressure matches Nabors’ interface argument. Agents need surfaces and protocols, not only smarter models. In the user layer, that means graphical, web-native work areas instead of blank chat boxes. In the backend layer, it means state, gates, ownership, and transitions instead of repurposed issue trackers and pull-request noise.

Agent Swarms Need a Coordination Layer, Not Another RuntimeAI Engineer

3. Google is packaging the agent stack as an application pipeline

Paige Bailey and Guillaume Vernade show how Google is packaging the same pieces into developer workflows. Multimodal prompting, model selection, tool use, code generation, media generation, auth, storage, app scaffolding, and local inference are being pulled into a more continuous builder loop.

AI Studio is positioned less as a prompt playground and more as a reproducible pipeline builder. Bailey’s dinosaur-video demo shows the pattern: load a five-minute YouTube segment, select a Gemini model, turn on Google Search grounding, ask for sightings by timestamp, and then export code that preserves the model name, file URI, video offset, prompt, and tool configuration. The model call is inspectable rather than trapped in a demo transcript.

30,901

tokens in Bailey’s five-minute dinosaur video slice loaded into AI Studio

Her LEGO example makes the same point for tool use and cost visibility. With code execution enabled, Gemini analyzed an image of LEGO bricks, wrote and ran Python, and returned bounding boxes around green bricks. AI Studio surfaced the estimated API cost for the Flash-Lite run as $0.000555. Bailey emphasized that similar patterns could count objects, determine orientation, find items across video frames, or return timestamps. The working unit is multimodal input plus executable analysis plus visible cost plus exportable code.

AI Studio Build extends that flow into generated applications. Bailey prompted it to create “ShelfScan AI,” an app that lets a user sign in with Google, upload a bookshelf photo, identify visible books from spines, use grounding to fill in missing details, and save the collection to Firestore. The generated app had a landing page, Google sign-in, upload flow, Firestore connection, and account-linked persistence after debugging.

The debugging is the useful part. The app initially failed with insufficient Firestore permissions. The generated environment investigated user creation and security rules, identified a length mismatch around a stored image URL, and proposed a rule change. Bailey’s point was not that generated apps are production-ready. It was that the system produced a file structure and reasoning path a developer could inspect.

Vernade’s instructions for “vibecoding” turn that into engineering discipline. He asks generated apps to use a file per feature or related feature, add docstrings, start each file with an explanatory comment, maintain a root Design.md, centralize configurable model names, and include a way to test setup without altering data. His rationale is reviewability: if a model edits an unrelated file, the developer should see that as a warning. Logs are similarly basic, because a final error message often omits the sequence that caused the failure.

Stage	What Google’s workflow exposes	Why it matters
Prototype	Multimodal input, tools, model choices, prompts, offsets, and cost estimates in AI Studio	The working interaction is inspectable before it becomes code.
Export	Code with model configuration and input references preserved	Developers do not have to reverse-engineer a playground result into an API call.
Build	Generated apps with auth, Firestore, uploads, grounding, and persistence	Model calls are being wrapped in ordinary application infrastructure.
Review	Logs, file structure, tests, design documents, and debuggable rules	Generated apps still need the same control surfaces as human-written systems.
Deploy closer to the user	Gemma on phones, laptops, local endpoints, and single-GPU cloud deployments	Smaller agentic loops can move to cheaper, private, or lower-latency environments.

Google’s stack is presented as a pipeline from prompt to app infrastructure and deployment target

The local-model material belongs to the same pipeline story rather than a separate model roundup. Ian Valentine’s Gemma 4 examples showed agentic loops running on phones, laptops, and local endpoints. On a Pixel, Gemma called local JavaScript skills and loaded small web apps. In LM Studio, a Gemma model served an OpenAI-compatible endpoint on localhost, enabling apps and coding tools to point at a local model. Valentine also ran an orchestrator and 10 sub-agents locally to generate SVGs, then used OpenCode against the local endpoint for an edit-run-debug loop.

That changes deployment targets. Agentic behavior is not confined to cloud chat surfaces. It can live in a browser-hosted skill, an Android intent, a local coding loop, a laptop endpoint, or a single-GPU Cloud Run deployment. But the same seams remain: syntax errors, missing game elements, model-specific configuration, context limits, and the need for feedback loops.

Vernade’s media-generation pipeline makes the orchestration point clear. Gemini interpreted The Wind in the Willows and produced structured prompts; Nano Banana generated character and chapter images; Veo animated images; Lyria generated music; text-to-speech produced narration. The pipeline worked by handing structured outputs from one model to another. It also exposed seams: Veo cannot directly ingest a Lyria audio file, video prompts written for still images may be under-specified, and character consistency depends on careful reference management.

Google’s packaging does not remove the need for engineering. It moves the engineering into the workflow: inspectable prompts, portable tool calls, reviewable generated applications, orchestrated media handoffs, and local models where latency, privacy, or cost matter.

Google’s GenAI Stack Turns Multimodal Prompts Into Application PipelinesAI Engineer

4. The common bottleneck is engineered context

The limiting factor is no longer only model capability. It is whether the surrounding system gives the model the right context, at the right time, in a form that can be inspected, constrained, and corrected.

Nabors draws the first distinction: resources are for durable context; tools are for action. Her comic archive can expose tools for listing comics, searching transcripts, finding characters, and retrieving pages. But full transcripts and commentary are not best treated as actions the model must discover one call at a time. They are durable context. MCP resources should let the harness pre-prime relevant documentation, transcripts, or commentary instead of forcing the model to read tool descriptions, decide to fetch documentation, and spend tokens dragging it into context.

Her frustration is that resources are loosely defined and weakly surfaced in clients. The resource may exist on the server, but if the agent harness does not make it visible or usable, the practical effect is limited. That gap matters for every documentation site that thinks adding MCP tools will make coding agents understand its docs. In Nabors’ view, an agent should not have to decide to call documentation as a tool when the harness already knows the user is working in a relevant mode.

Bichard reaches a similar conclusion from the software-factory side. Harness engineering encodes missing knowledge into the repository and its feedback systems: AGENTS.md, architecture decision records, runbooks, API contracts, setup instructions, ownership files, tests, linters, type checks, pre-commit hooks, CI checks, preview environments, logs, traces, feature flags, and reproducible builds. Those artifacts are not paperwork. They are the operating environment for agents.

The hard failure mode is context rot. Long sessions degrade. Agents lose track, skip steps, declare completion too early, or optimize for approval rather than correctness. Bigger prompts alone do not fix that. Bichard points to smaller tasks, explicit gates, antagonistic agents, sub-agents, and state machines. The system has to decompose the work and supply feedback, not merely hope a broad instruction remains salient.

Google’s workshop provides the media and app-building version of the same rule. Vernade’s character-consistency problem in The Wind in the Willows was not solved by telling the image model to “be consistent” and passing everything into every prompt. He structured chapter outputs to include the chapter name, prompt, and list of characters appearing in the scene. The pipeline then passed only the relevant character reference images into Nano Banana. With five characters, brute force may work. With 40 characters and multiple reference views, selective context becomes architecture.

Bailey’s generated-app debugging also depends on engineered context. The model could reason through Firestore permissions because the environment had files, rules, logs, generated code, and a debuggable workspace. Vernade’s generated-code hygiene pushes that further: feature-level file organization, design documents, central configuration, logs, and test paths all make the system legible to humans and models.

Context architecture is the more useful term than prompt craft for this stage of applied AI. Prompting still matters, but the work is shifting toward deciding what belongs in a resource, what belongs in a tool, what belongs in a repository file, what belongs in a state machine, what belongs in a generated app’s logs, and what should be withheld to keep the model from drowning.

That is the common bottleneck beneath very different applications. A comic archive, a coding-agent swarm, and a multimodal app builder all depend on the same engineering move: turn implicit human knowledge into explicit, durable, callable, reviewable context.

5. What builders should watch next

The practical question is which missing pieces become standards, which become product features, and which remain custom infrastructure inside teams.

First, MCP resources need to become visible and usable in mainstream clients. Nabors’ distinction between tools and resources is important only if clients let builders use resources as durable context. If resources remain server-side declarations that users and harnesses cannot easily see, documentation and archive use cases will keep falling back to awkward tool calls.

Second, WebMCP or related browser-agent standards need real adoption. Browser agents should not have to infer every capability from pixels or scrape the DOM when a site can expose “search,” “next page,” “submit,” “filter,” or “export” as named functions with schemas. The open question is whether websites, browser vendors, and agent developers converge on a common pattern or produce a fragmented set of browser-agent affordances.

Third, agent coordination has to take a form builders can actually call. Bichard’s CLI gateway is one plausible shape because it fits local coding agents, remote agents, CI, and version-controlled workflows. But the coordination layer could also appear as a protocol, a workflow product, a dev-platform feature, or an extension of existing issue and CI systems. The test is not branding. The test is whether an agent can ask what state it is in, whether a gate is satisfied, who owns the next step, and what transition is valid.

Fourth, generated full-stack apps need standard reviewability patterns. Bailey’s Firestore bug was useful because it was inspectable. Vernade’s hygiene rules point toward a likely baseline: logs by default, file-per-feature structures, design documents, centralized configuration, safe test fixtures, and rollback paths. Generated code that cannot be reviewed or debugged will remain a demo artifact, even if it looks complete at first run.

Fifth, local models such as Gemma may change where small agentic loops run. On-device and laptop agents are not replacements for frontier cloud systems, but they can make some automations cheaper, more private, lower-latency, and available without network access. The relevant watch item is not benchmark parity alone. It is whether local models can reliably call tools, edit files, use app-specific functions, and participate in feedback loops.

Finally, builders should watch how much of today’s scaffolding gets absorbed by base models and platforms. Bailey argued that when everyone sprints to build the same workaround — vector databases for small context windows, language-specific fine tunes, agent frameworks, MCP servers — it may indicate that the workaround is temporary. The audience pushback was also important: domain-specific workflows may remain durable because they encode customer knowledge, reproducibility requirements, integration constraints, and review obligations that a general model provider will not fully absorb.