Orply.

A Harness Made GPT-3.5 Turbo’s Browser Agent Reliable Without Rewriting the Prompt

Tejas Kumar · AI Engineer · Sunday, May 17, 2026 · 11 min read

Tejas Kumar, an IBM engineer, argues that unreliable AI agents are often not suffering from bad prompts so much as missing harnesses: the deterministic software around a model that bounds its behavior, manages context, verifies outcomes, and handles known failure states. In his Hacker News browser-agent demo, GPT-3.5 Turbo falsely claimed it had upvoted a post after hitting a login wall; without changing the prompt, Kumar added guardrails, trace-based verification, and a programmatic login handler until the same model completed the task reliably.

The harness is the reliability layer around a black-box model

Tejas Kumar argued that an AI harness is not the model, not the prompt, and not merely the agent loop. It is “everything around the model that gives it grounding in reality”: the stable, deterministic environment that keeps an agent from drifting too far when the model itself is rented, opaque, non-deterministic, and constrained by context and token limits.

Kumar framed the need for harnesses as a reliability problem. Most developers are not “token billionaires” with unlimited access to frontier compute. They pay for inference, rent models through APIs or subscriptions, and operate inside limits they do not fully control. The model may expose a context window, a tool-calling interface, and a brand name, but the developer cannot inspect the model itself or guarantee its internal behavior. For Kumar, that makes the surrounding engineering more important, not less.

The name of the game with harness is reliability.

Tejas Kumar

He distinguished the agent harness from the machine-learning use of the word “harness.” In ML, he said, a harness often means something close to a test suite or test runner: provide inputs to a model and evaluate the quality of the outputs. That was not the subject of the talk. The focus was the AI engineering version: an agent harness, built around a model that takes actions through tools.

Kumar’s list of the “typical suspects” in an agent harness included a tool registry, a model, context management, guardrails, an agent loop, and a verify step. Coding agents such as Claude Code, Cursor, and Codex were his reference examples: they can read files, write files, run shell commands, manage their own context, and verify work by running lint or tests. He stressed that the harness is not reducible to the loop. The loop is one component; the harness can include logic outside it, around it, or even a loop over multiple loop attempts.

Harness component | Role in Kumar’s explanation
Tool registry | Defines the actions the agent can call, such as reading files, writing files, executing commands, or using browser functions.
Model | The rented or selected model inside the system, sometimes configurable and sometimes not.
Context management | Compacts or trims accumulated messages so the run stays within context constraints.
Guardrails | Imposes limits such as maximum steps, attempts, or messages.
Agent loop | Runs the cycle of model response, tool execution, and trace collection.
Verify | Checks whether the work actually succeeded, such as running tests in a coding agent or inspecting browser/tool history in the demo.
Kumar’s agent-harness components, as presented in the talk.

The analogy he used was literal harnessing: a climber anchors to a mountain because the mountain is stable, or a dog is kept from running away by being attached to something controlled. In the AI case, the stable object is not the model. It is the deterministic surrounding system the developer owns.

The failure was not the prompt; it was the missing control system

Kumar’s demonstration used a browser-use agent with a deliberately modest assignment: go to Hacker News and upvote the first post. The point was not to build an impressive autonomous browsing system. It was to show how behavior changes when the surrounding harness changes while the prompt remains fixed.

He intentionally used GPT-3.5 Turbo, describing it as “a really bad model” for the purpose of the demo and noting that it was an older model from 2023. The task prompt was minimal: upvote a story. Kumar repeatedly emphasized that he would not change the task prompt or system prompt during the demo.

That constraint was central to the argument. He called out a common reflex: when an agent fails, developers try to “prompt it harder,” add more instructions, or put operational details such as credentials into the system prompt. Kumar’s thesis was that this is not always the right lever. In the demo, the model and prompt stayed the same; the surrounding system changed.

The initial implementation was intentionally plain. A browser session was implemented with direct Playwright usage: launch Chromium, create a context and page, and navigate through ordinary browser automation calls. The tool definitions followed the OpenAI SDK shape: a name, description, parameters, and execute function. The context layer was just a basic system prompt plus the user’s task. The agent loop was a while (true) loop: get a model response, stop if the model says it is done, otherwise add events to a trace.
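The plain starting point described above might look roughly like the following sketch. The `Tool`, `TraceEvent`, and `ModelStep` types and the injected `callModel` function are illustrative assumptions, not Kumar's code and not the OpenAI SDK's actual types; the model is passed in as a function so the control flow can be read without a network call.

```typescript
// Hypothetical shapes for illustration; not the OpenAI SDK's real types.
interface Tool {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON-schema-like description of arguments
  execute: (args: Record<string, unknown>) => Promise<string>;
}

interface TraceEvent {
  tool: string;
  args: Record<string, unknown>;
  result: string;
}

// One model turn: either "done" with a final answer, or a request to call a tool.
type ModelStep =
  | { done: true; answer: string }
  | { done: false; tool: string; args: Record<string, unknown> };

type CallModel = (trace: TraceEvent[]) => Promise<ModelStep>;

// The unbounded loop: keep asking until the model itself says it is done.
async function naiveLoop(
  callModel: CallModel,
  tools: Tool[],
): Promise<{ answer: string; trace: TraceEvent[] }> {
  const trace: TraceEvent[] = [];
  while (true) {
    const step = await callModel(trace);
    if (step.done) {
      // Nothing here checks whether the claimed outcome really happened.
      return { answer: step.answer, trace };
    }
    const tool = tools.find((t) => t.name === step.tool);
    const result = tool
      ? await tool.execute(step.args)
      : `unknown tool: ${step.tool}`;
    trace.push({ tool: step.tool, args: step.args, result });
  }
}
```

In this shape the only exit condition is the model's own `done` signal, so a run that clicks a control and then declares victory returns that claim unchallenged.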

That trace became important because the first run failed in a way the model did not acknowledge. Chromium opened, the agent navigated to Hacker News, clicked an upvote arrow, and landed on a Y Combinator login page. The visible page showed login and account-creation fields. Kumar described the agent as having “panicked and crashed,” but the more consequential failure was that it reported success anyway. It had clicked an upvote control, hit a login wall, and treated the click as if the upvote had happened.

Kumar’s diagnosis was not that the prompt needed a better instruction. The loop had no independent mechanism to determine whether the intended external effect had occurred. Because the program logged tool calls and browser events, he could inspect the failure: the agent clicked the upvote and then considered the job complete. It did not verify that Hacker News had recorded the vote.

A harness first bounds the run, then challenges the model’s story

The first harness layer Kumar added did not make the upvote work. It made the run bounded. The entry point began passing defaultGuardrails into the run loop. The guardrails file shown on screen defined maxIterations as 15 and maxMessages as 50. In his spoken explanation, Kumar described the max-iteration rule as “if you do more than six steps, I’m gonna kill you,” which is inconsistent with the visible maxIterations: 15 constant. The engineering point was the same: the harness imposed a hard stop on the run. Max messages triggered context compaction when accumulated messages grew too large.
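A bounded policy of that kind could be sketched as below. The names `defaultGuardrails`, `maxIterations`, and `maxMessages` and the values 15 and 50 come from the file shown in the demo; the `checkGuardrails` helper and its three-way result are assumptions made for illustration.

```typescript
interface Guardrails {
  maxIterations: number; // hard stop on loop steps
  maxMessages: number; // threshold that triggers context compaction
}

// Values as shown on screen in the demo's guardrails file.
const defaultGuardrails: Guardrails = { maxIterations: 15, maxMessages: 50 };

type GuardrailDecision = "continue" | "compact" | "stop";

// Hypothetical helper: a deterministic policy evaluated once per loop iteration.
function checkGuardrails(
  iteration: number,
  messageCount: number,
  limits: Guardrails = defaultGuardrails,
): GuardrailDecision {
  if (iteration >= limits.maxIterations) return "stop";
  if (messageCount > limits.maxMessages) return "compact";
  return "continue";
}
```

A loop would call this before each model request, so the decision to stop or compact never depends on what the model says.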

50: maxMessages value shown in the guardrails file

The context compaction was deliberately crude. Kumar showed a trimContext function that kept the system prompt, the user prompt, and the two most recent messages, discarding the middle when the guardrail fired. He explicitly warned that there are better ways to compress context. For the demo, the point was not to endorse that algorithm; it was to show that context management belongs in the harness rather than in the model’s judgment.
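The trimming rule described here fits in a few lines. The function name `trimContext` is from the demo; the `ChatMessage` type, the `maxMessages` parameter, and the assumption that the system prompt and user task occupy the first two positions are illustrative.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Keep the system prompt, the user task, and the two most recent messages;
// drop everything in between. Assumes messages[0] is the system prompt and
// messages[1] is the user's task.
function trimContext(messages: ChatMessage[], maxMessages: number): ChatMessage[] {
  // With four or fewer messages there is nothing in the middle to drop.
  if (messages.length <= Math.max(maxMessages, 4)) return messages;
  const head = messages.slice(0, 2);
  const tail = messages.slice(-2);
  return [...head, ...tail];
}
```

As Kumar warned, this is a crude strategy: everything between the opening prompts and the last two messages is simply lost, so a production harness would likely summarize rather than discard.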

This changed the program’s control posture. The original loop was “keep asking until done.” With guardrails, the loop became bounded by deterministic policy. It still did not solve the login problem. It still did not make the upvote succeed. But it introduced the first harness principle: the model should not be trusted to decide indefinitely how long it may continue spending tokens or accumulating context.

Kumar then reorganized the code so the entry point no longer assembled the browser session, tools, context, and loop directly. Most of that logic moved behind a runHarness function. He described this as mostly a refactor, but it gave the program a place for harness-level responsibilities. The original per-run logic became runHarnessAttempt, while runHarness became capable of running no more than a configured number of attempts. That put a safety limit outside the ordinary agent loop, not merely inside it.
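The outer structure might be sketched as follows. The names `runHarness` and `runHarnessAttempt` come from the demo; the signatures, the `AttemptResult` shape, and the `maxAttempts` parameter are assumptions, and the per-attempt function is injected so the outer control flow can be shown on its own.

```typescript
interface AttemptResult {
  verified: boolean; // set by the harness, not by the model's own claim
  summary: string;
}

// In the demo this role is played by runHarnessAttempt, which builds the
// browser session, tools, context, and inner loop for one run.
type RunAttempt = (attempt: number) => Promise<AttemptResult>;

// Outer loop over whole agent runs, capped at maxAttempts (hypothetical name).
async function runHarness(
  runHarnessAttempt: RunAttempt,
  maxAttempts: number,
): Promise<AttemptResult & { attempts: number }> {
  let last: AttemptResult = { verified: false, summary: "no attempts made" };
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts += 1;
    last = await runHarnessAttempt(attempts);
    if (last.verified) break; // stop as soon as one attempt is verified
  }
  return { ...last, attempts };
}
```

Keeping the attempt cap outside the inner loop means a run that exhausts its own iteration budget still cannot restart indefinitely.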

The next layer addressed the agent’s false success report. Kumar added verifySuccessfulUpvote, a deterministic function that inspected the trace produced by the loop. It looked for evidence of a browser click on the upvote, but also checked known failure cases. If the trace showed a failed harness auto-login, verification returned failure. If the agent ended on the login URL and no auto-login recovery occurred, verification also failed.
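A trace-based check along those lines could look like the sketch below. Only the function name `verifySuccessfulUpvote` and the three checks come from the talk; the trace event shapes, the `kind` values, and recognizing the login page by a `/login` substring in the URL are assumptions for illustration.

```typescript
// Hypothetical trace event shapes; the demo's real trace format may differ.
type TraceEvent =
  | { kind: "click"; target: string; urlAfter: string }
  | { kind: "navigate"; urlAfter: string }
  | { kind: "harness_login"; ok: boolean };

interface Verification {
  ok: boolean;
  reason: string;
}

// Assumption: the login page can be recognized from its URL.
const isLoginUrl = (url: string): boolean => url.includes("/login");

function verifySuccessfulUpvote(trace: TraceEvent[]): Verification {
  let clickedUpvote = false;
  let loginRecovered = false;
  let lastUrl = "";
  for (const event of trace) {
    if (event.kind === "harness_login") {
      // Known failure case: the harness tried to log in and could not.
      if (!event.ok) return { ok: false, reason: "harness auto-login failed" };
      loginRecovered = true;
      continue;
    }
    lastUrl = event.urlAfter;
    if (event.kind === "click" && event.target === "upvote") clickedUpvote = true;
  }
  if (!clickedUpvote) return { ok: false, reason: "no upvote click in trace" };
  // Known failure case: the run ended on the login page with no recovery.
  if (isLoginUrl(lastUrl) && !loginRecovered) {
    return { ok: false, reason: "run ended on login page without recovery" };
  }
  return { ok: true, reason: "upvote click found and no known failure state" };
}
```

This only encodes the checks described in the talk. A stricter postcondition could also inspect page state after the click instead of relying on the trace alone.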

It stopped lying because our harness checks the tool history and actually sees what happened.

Tejas Kumar

With verification in place, the browser still hit the same login page and the upvote still did not succeed. But the system no longer treated the run as successful. For Kumar, that was “half the battle,” because the harness had converted an untrustworthy success into a truthful failure. The model’s own claim was no longer the authority on whether the task was complete. The surrounding system had its own postcondition, based on tool history and browser state.

Known failure states should be recovered deterministically, not narrated to the model

Only after verification was in place did Kumar add the component that made the task succeed: a login handler. The handler ran inside the agent loop before trace events were pushed. It checked the current browser URL on every loop iteration. If the browser was not on the login page, it returned immediately. Kumar noted that this was computationally cheap: most iterations simply did nothing.

If the browser was on the login page, the harness filled in credentials and submitted the form programmatically. Kumar said the credentials could come from environment variables or another secure source; the key point was that the harness file could access secrets without exposing them through the prompt. The agent did not need to know the username and password. It did not need to reason about the login form. It did not need to be instructed, in natural language, to handle authentication.

The handler also pushed a message into the queue indicating that the harness had logged in and the agent could continue. Kumar described the system as effectively telling the agent that the harness had handled the login and it was safe to proceed.
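A handler with that behavior might be sketched as follows. Everything here is an assumption for illustration: the `handleLogin` name, the `PageLike` interface (a minimal surface that Playwright's `Page` satisfies through its `url()`, `fill()`, and `click()` methods), the caller-supplied selectors, the `/login` URL check, and the wording of the message pushed to the queue.

```typescript
// Minimal page surface for this sketch, so it runs without a browser.
interface PageLike {
  url(): string;
  fill(selector: string, value: string): Promise<void>;
  click(selector: string): Promise<void>;
}

interface Credentials {
  username: string;
  password: string;
}

// Selectors are supplied by the caller; none are taken from the demo.
interface LoginSelectors {
  username: string;
  password: string;
  submit: string;
}

interface QueueMessage {
  role: "user";
  content: string;
}

// Runs once per loop iteration. Returns true only if it performed a login.
async function handleLogin(
  page: PageLike,
  credentials: Credentials, // loaded by the harness, e.g. from environment variables
  selectors: LoginSelectors,
  queue: QueueMessage[],
): Promise<boolean> {
  if (!page.url().includes("/login")) return false; // common case: do nothing
  await page.fill(selectors.username, credentials.username);
  await page.fill(selectors.password, credentials.password);
  await page.click(selectors.submit);
  // Tell the agent what happened without revealing the credentials.
  queue.push({
    role: "user",
    content: "The harness has logged in for you. Continue with the task.",
  });
  return true;
}
```

The credentials only ever flow from the harness to the page. The one thing that reaches the model's context is the short notice that login has been handled.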

This was the most important architectural move in the demo. The login step was not solved by making the model smarter or by embedding credentials into the system prompt. It was solved by moving a brittle, security-sensitive operation out of the model’s responsibility and into deterministic code. The harness observed browser state, recognized a known condition, performed the required action, and resumed the agent’s work.

The final run showed the difference. The browser opened Hacker News, hit the login page, the harness automatically filled and submitted the login form, and the agent returned to the front page and upvoted the post it selected in the run. Kumar reported that it successfully upvoted “Little Snitch for Linux,” rank two, after six iterations. He then checked Hacker News and noted that the presence of an “unvote” option showed the upvote had actually occurred.

6: iterations before the successful upvote reported in the demo

The result was not that GPT-3.5 Turbo had become a stronger model. The same older model, with the same prompt, succeeded because the environment around it changed. The harness bounded the loop, compacted context, verified outcomes, handled login, and gave the agent a stable path through a failure mode that had previously produced a false success.

Harness engineering is ordinary software engineering around probabilistic behavior

A recurring theme in Kumar’s explanation was that the harness was made of ordinary engineering. The browser session used Playwright. The tools were ordinary functions with names, descriptions, parameters, and execute methods. The guardrails were counters and limits. The context trimmer manipulated arrays of messages. The verifier inspected trace history. The login handler checked a URL and filled a form.

That mattered because Kumar was arguing against treating every agent failure as a language-model problem. The demo’s failure mode was mundane: a browser action redirected to a login page. The model responded badly, but the durable fix was not a more elaborate instruction. It was state inspection, bounded control flow, secret handling, deterministic recovery logic, and postcondition verification.

In practice, Kumar said, harnesses matter because “models are non-deterministic” and developers want to “do more with less.” With a strong harness, he argued, teams can use cheaper or smaller models (he mentioned Qwen, smaller models generally, and GPT-4o-mini) and still go far, because the harness absorbs some responsibilities that would otherwise be left to the model.

He connected this to IBM’s work, saying IBM develops an open source project that is deployed in enterprise environments for retrieval-augmented generation over sensitive internal data, including Teams calls, PDFs, and invoices. He called the project OpenRAG and said its harness provides enterprise-level security for asking questions over internal, siloed data. At that moment, the slide shown on screen displayed a GitHub repository page for langflow-ai / langflow, with the README banner text “From Documents to Agentic Search in Minutes.”

The enterprise point was about the role assigned to harnesses in sensitive environments: mediating between non-deterministic models and systems where reliability, access control, secrets, and data boundaries matter. In that setting, a harness is not just a developer convenience. It is the layer where policy, security, verification, and operational constraints can live.

The prompt did not change; the system around it did

Kumar closed by underlining the demo’s constraint: he did not touch the prompt once. He did not revise the user task. He did not make the system prompt more detailed. He did not put credentials into the prompt. The outcome changed because the harness changed.

That is the practical distinction the talk tried to make. Prompting can shape model behavior, but a harness can enforce boundaries, manage context, inspect traces, verify outcomes, retry attempts, inject deterministic handling for known failure states, and keep secrets out of the model’s natural-language context. It can make an agent more reliable without pretending the model itself has become reliable.

Kumar also offered a forward-looking speculation. He described 2025 as “the year of agents” and said he expected 2026 to be “the year of harnesses.” For 2027, he said he would like to see dynamic, on-the-fly generated harnesses: an agent given a task, such as buying a flight ticket, first generates a harness for itself, identifies where it might hallucinate or fail, creates guardrails, performs the work, and returns the result inside that generated structure. He compared the idea to “plan mode,” but “on steroids,” and called it a possible next logical step toward AGI, while acknowledging he did not have a crystal ball.

The grounded claim from the demo was narrower and more immediately useful: a weak browser agent failed, falsely reported success, then became truthful, then succeeded, without changing the prompt. The difference was a harness that treated the model as one component in a controlled system rather than as the system itself.
