Every Addition to an AI Agent Can Make It Worse

Ara KhanAI EngineerTuesday, May 19, 202613 min read

Ara Khan of Cline argues that agent maturity is less about adding autonomy than about knowing what not to add. In a talk structured around four levels of agent building — from frameworks to state machines, Kanban-managed workflows and cloud deployment — Khan says frontier models increasingly reward simpler prompts, deliberate architecture and visible human control. His central warning is that every extra instruction, abstraction or automation layer can make an agent worse.

Agent maturity starts with resisting the panic to automate everything

Ara Khan frames the current agent-building environment as a “mass psychosis problem”: engineers are surrounded by tools that appear to promise parallel autonomous work, while still leaving open the basic question of how much control a human should retain. His version of the question is blunt: should you have “15 agents ripping through all the time” while you “vibe,” or should you “read through every line of code”?

Khan’s answer is not to reject agents, but to slow down the rush to indiscriminate automation. The goal, as he presents it, is to distinguish between fast experimentation and production-grade engineering. Some problems are worth prototyping quickly with existing frameworks. Others require deliberately designed state machines, careful architecture, visible workflows, and eventually cloud infrastructure.

He organizes that progression into four levels of agent maturity:

Level	What you do	What you learn
1	Use a framework	How agents work at a surface level
2	Build it yourself as a state machine	Architecture, modularity, model-independence
3	Kanban the flow	Observability, debugging, systems thinking
4	Ship to the cloud	Parallelization, scale, production operations

Khan’s four levels of AI agent maturity

The structure is partly a practical ladder and partly a warning. Moving up the ladder does not mean adding more complexity for its own sake. Khan repeatedly returns to the opposite rule: every addition to an agent can make it worse.

Frameworks are useful until the work becomes serious

At the first level, Khan’s advice is intentionally simple: if the question is whether an agentic approach can solve a problem at all, “just use a framework.” He names LangChain, LangGraph, CrewAI, AutoGen, and LlamaIndex as examples shown in the slide, while noting that he does not use them in his own work at Cline because his job is to build agentic experiences directly.

For early product discovery, however, he sees frameworks as legitimate. If the task is rudimentary — aggregating emails, testing whether an agent can help with a workflow, getting something working without caring deeply about the best model or highest-quality implementation — an agent framework can produce a working prototype quickly. In his phrasing, “any of the agent frameworks can give you something that just works in like half an hour.”

That usefulness has a ceiling. Khan argues that once the work moves toward production, frameworks become limiting. The missing properties are the ones that matter when an agent becomes core infrastructure: customizability, modularity, and enough flexibility to accommodate where models and APIs are going next. He acknowledges that some people would disagree, then dismisses that disagreement with the line: “a lot of those people are wrong.”

The practical distinction is not that frameworks are bad. It is that they belong at the stage where the main thing to learn is whether agents work for the problem at all. Once the goal is to build a serious agent, Khan says the builder needs to own the architecture.

A serious agent is a state machine, not magic

The first rule for building agents directly is that every agent should be understood as a state machine. Ara Khan argues that the apparent complexity of agentic systems reduces to a recursive loop: a while loop with conditions and end states.

It is a while loop with a few conditions and a few end states.

Ara Khan · Source

The state-machine model matters because it gives the builder a way to reason about what the agent is doing at any moment. Khan describes a task in Cline where the user asks the agent to read a few files and explain them. The task begins with the user request, moves into an action tool that reads the files, returns the result into context, and eventually reaches an attempt-completion or task-complete state. More complex agents may run through the same state machine for eight to ten minutes, or even hours, but the core representation is still a loop of request, model response, tool call, context update, user input, or completion.

The slide shown for this rule makes the point visually. “User Task” enters an “Execution Loop,” which sends an “API Request,” receives an “LLM Response,” chooses among tool calls and continuation states, and eventually reaches states such as “Task Complete,” “Wait for User Input,” “Ask User,” “Create New Task,” or “Provide Feedback.” Khan’s emphasis is not on a particular diagram but on maintaining a mental model of where the system is in that diagram.

That mental model is what prevents agent building from becoming a collection of ad hoc prompts and tool calls. If the agent is a state machine, then architecture means deciding the states, transitions, termination conditions, user-interaction points, and how results are added back to context.

The bitter lesson is that more instructions can degrade the agent

Ara Khan’s second rule is the one he treats as hardest won: “Every single thing you add to an agent risks making it worse.”

Every single thing you add to an agent risks making it worse.

Ara Khan

His target is not only long prompts. It is the whole instinct to improve an agent by layering on more instructions, more edge-case handling, and more elaborate if-else logic. For frontier models, he says, that often backfires. The lesson from building agents at the current model frontier is to “get out of the way of the model.”

Khan gives a concrete example from what he describes as the CodeX repository: the prompt for GPT-5.3 is, he says, one-third the size of the prompt written for GPT-5. His explanation is that newer models are capable enough that too many instructions create “sensory overload.” The model receives so much guidance that it becomes harder, not easier, for it to identify the right action.

That leads to a different engineering habit: pruning. Rather than treating every failure as a reason to append another rule to the system prompt, Khan says builders should ask whether the new rule is degrading the system. He says Cline took this far enough to rewrite the entire product because older versions had accumulated too much “junk.” He also says Claude Code has, he thinks, been rewritten from scratch at least seven times, while caveating that people on that team would know better.

The rule, in Khan’s account, is that the agent wrapper should become simpler as the underlying model becomes more capable. Builders should not assume that additional prompt text, special cases, or control logic improves the result.

Build agents so other agents can build and test them

The third rule is more recursive: make an agent “trivially easy to be a part of a pseudo RL pipeline.” By this, Ara Khan means that the agent should be easy for other coding agents to build, test, modify, and evaluate.

The practical requirement is straightforward. The project should have a clean command-line interface and reliable build and test pathways. If an agent can run the system, change it, and test it end to end, then a long-running coding agent can work on the agent itself. Khan points to artifacts such as a properly written agents.md, appropriate skills, and CI/CD setups as part of this environment.

He describes a shift in the human-AI relationship. Previously, humans guided AI by issuing instructions. Now, in his view, humans are increasingly being guided by AI, and should structure software projects so that AI systems can operate through them easily. That means not only writing code for human maintainers, but making the repository legible and executable for other agents.

The payoff is parallelism. If the project is easy to build and test, a developer can send a long-running agent down a separate thread to make changes, validate them, and return a working result. If the build and test loop is fragile, then agentic development becomes harder for the same reason human development becomes harder: the feedback loop is unclear.

Khan calls this “super meta.” His practical point is that testability now has another consumer: not only the human engineer, but also the model-driven agent that may be asked to modify the agent itself.

Do not outsource architecture to throughput

The fourth rule, and the title of the talk, is “Don’t build slop.” Ara Khan directs the warning at the temptation created by high-throughput generation. If tokens can move quickly and agents can produce large amounts of code, it becomes easy to let them “rip through” work without enough architectural thought.

His counterweight is human architectural judgment. He says the architecture, design, and outline of what the agent is supposed to do are worth thinking through carefully. Even when engineers use agents to write code, they should read the code and understand the design. In Cline’s experience, he says, that discipline has been critical.

The strongest version of the claim is that the architecture of an agent “has to be done by a human” and “has to be done like very thoughtfully.” Khan allows for using an agent as a collaborator in that process — having a conversation about architecture, exploring designs, thinking through the state machine — but he warns against letting models simply generate the code without that prior design work.

That rule is in tension with the rest of his argument in a useful way. Khan wants builders to use agents aggressively: to run them in parallel, to let them modify systems, to build cloud workflows where many agents can operate at once. But he does not treat that throughput as a substitute for deciding the architecture.

Frontier-model APIs can create subtle lock-in

The fifth rule is that “frontier labs kinda wanna lock you down.” Ara Khan presents this as contested — “a lot of people will disagree” — but argues that frontier-lab APIs can make model interchangeability harder in subtle ways.

The example he gives is reasoning traces in newer models he names as Opus 4.6, Gemini 3.1 Pro, and 5.3 CodeX. Khan describes reasoning traces as part of both the cache and the test-time compute loop. In multi-turn conversations with those models, he says, the developer needs to send the traces back in the precise format expected by the API. If they do not, the API may still return a response, but performance can degrade silently.

That silent degradation is the dangerous part in Khan’s account. The system does not necessarily fail in a visible way; it simply performs worse. He says many people are missing “massive performance gains” from newer models because they are not using the APIs in exactly the intended way.

He also argues that routing through something like OpenRouter does not fully solve the problem. In his view, the builder still needs to understand the asymmetries among frontier-lab APIs and test carefully that each integration is working correctly. Model-independence, in this framing, is not just a matter of swapping endpoints; it requires knowing where vendor-specific API behavior affects performance.

This connects back to the second level of maturity: when building an agent yourself, modularity and model-independence become part of the architecture. Khan’s warning implies that independence is harder to preserve when model vendors expose features that improve performance only when used in vendor-specific formats.

Kanban is Khan’s interface for inference-bound work

The third maturity level is not about the internals of an agent but about the user interface for working with many of them. Ara Khan’s claim is that Kanban boards are the right form factor for multi-agent orchestration.

He says he made this claim publicly on March 26 in a post from @arafatkatze. The post, shown on screen, argued that multi-agent workflows suffer from two problems. First, they are inference-bound: the user often waits five to ten minutes while an agent runs in the background. Second, parallel agents may mutate the same source code, creating isolation problems and merge conflicts. The Kanban board, in that post’s wording, lets the user act as an “engineering manager” for individual-agent ICs, watching tasks resolve with a headline-level view.

Khan expands that argument in the talk. If an agent is working for eight to ten minutes, the user has idle time. After a point, “doom scroll” stops being useful, so the natural move is to start another agent. That means two or three agents may be running in parallel, each inference-bound, and each potentially touching the same codebase. The interface has to expose what each agent is doing and help isolate the work.

A Kanban board gives the user a task-level view rather than forcing them to context-switch into each terminal. Khan’s slide describes the board as providing “a clear headline level outlook of every parallel agent: what it’s working on, where it is, whether it’s stuck”; the ability to watch agents without switching into each terminal; dependency chains so work can flow in the right order; and diff review on click when an agent completes a task.

The visual example shown on screen uses columns such as Backlog, In Progress, and Review, with tasks represented as cards assigned to agents. One card shows an agent creating a loading skeleton component and offers actions such as commit, open PR, and inspect the worktree. The important feature is not the specific product chrome; it is that agent work is visible as a set of concurrent, reviewable units instead of as scattered terminal sessions.

Khan says the mental model turns the user into an engineering manager and the agents into ICs. Cards represent tasks; columns represent states such as backlog, in progress, review, and done; and the user supervises, reviews, chains, and redirects work. He notes that Claude Code shipped what he describes as the same pattern about ten hours before the talk, which he takes as confirmation of his March 26 prediction. He also says the approach can work through Claude Code, Cline, or other coding agents.

The board does not remove planning. When an audience member asks how requirements gathering works inside Kanban, Khan answers that a task can be opened to show the entire trace. The interface he shows is the actual CLI of CodeX embedded in that task context. His usual pattern is to have a conversation there until the agent has enough clarity, then decide that it can “go out on your own” and continue while he shifts attention elsewhere.

A follow-up asks whether the task transitions state when it needs review or asks a question. Khan says yes. A task begins in an “In Progress” state; when it needs human input, it moves to “Review.” That clarifies the role of Kanban in his design: it is not a replacement for interactive prompting or planning, but a state and observability layer over that interaction.

Cloud agents are how Khan would operationalize parallelism

The fourth level is shipping agents to the cloud. Once an agent works, has been tested, and has a usable interface, the next question is scale: how to support many tasks, many users, or an organization with thousands of people.

Ara Khan’s answer is to move the complexity out of local machines. Rather than requiring each user to install dependencies and manage complex local workflows, he argues for doing the hard setup once in the cloud. The cloud agent can then provide parallelism, shared infrastructure, and programmatic orchestration.

The slide lists four benefits: complete parallelizability, no local dependencies, programmatic orchestration, and horizontal scaling. Khan’s preferred scaling model is not one larger agent, but more agents: ten, fifty, a hundred instances, each working on a different task or branch of a task.

He says cloud agents are underused relative to their value, especially for long-running work. He personally uses them from his phone for tasks that may run 15 to 20 minutes. In his example, he might ask an agent to build a VS Code extension, sign in, click settings, change a theme, and test the result in the terminal. He says cloud agents can manually perform those QA clicks, decide whether the workflow worked, and keep iterating if it did not. A task like that can take 50 to 60 minutes.

The advantage is that many such tasks can run in parallel without occupying the developer’s local environment. A user can launch them from a phone or laptop, let them run in separate cloud machines, and later return to pull down the resulting PRs.

For teams, Khan adds, cloud agents provide a common setup that many people can share and mutate. That matters when the goal is not merely personal productivity but organizational deployment. The local-machine version of agents may be powerful, but it is hard to standardize. The cloud version creates a shared execution environment.

Khan’s forecast is Kanban for control and cloud for compute

Ara Khan’s forecast is that most of the UX for working with agents will become Kanban-like, while most of the compute will move to the cloud. The board provides observability, dependency management, review, and state transitions. The cloud provides parallel execution, long-running tasks, shared environments, and horizontal scale.

He does not present the four levels as a mandatory sequence for every project. They are heuristics. A team may start with the “bare minimum easy thing” and move up or down depending on how much effort the problem deserves. Frameworks are useful for surface-level validation. Hand-built state machines teach architecture and model independence. Kanban makes agent flow visible. Cloud deployment turns agent work into scalable operations.

The unifying constraint is that agent maturity is not the same as maximal automation. Khan’s operational recommendations all point back to “don’t build slop”: use frameworks when speed matters, own the state machine when seriousness matters, remove unnecessary instructions, make the system testable by agents, keep humans responsible for architecture, assume model APIs will have sharp edges, supervise parallel agents through a board, and move execution to the cloud when scale demands it.

AI Application Architecture Inference and Deployment Agents and Autonomy Human-AI Interaction Coding Assistants