Orply.

Code Agents Need Context Engineering, Not Larger Prompts

Nupur SharmaAI EngineerMonday, June 8, 202612 min read

Nupur Sharma of Qodo argues that larger context windows have not solved a core agent failure: models still tend to use the beginning and end of an input while losing important material in the middle. Her case is that agent quality depends less on giving a model more context than on engineering how context is retrieved, ranked, constrained and checked. She describes Qodo’s approach as a mix of iterative retrieval, specialist agents, judge nodes and bounded orchestration that reserves high-reasoning models for discovery while using stricter, lighter steps for validation.

Longer context does not make the agent read everything

Nupur Sharma describes a shift many AI engineering teams have already lived through: moving from deterministic DevSecOps-style pipelines, where a failure crashes visibly and can be fixed, into agentic systems where the failure mode is often silent. The agent does not necessarily crash. It appears to reason, calls tools, produces output, and still may have dropped the part of the input that mattered.

Her central warning is that context size is not the same thing as context use. Early LLM systems depended on static prompts and short context windows, which forced developers to manually decide what mattered. If the wrong chunks were included, the model had little chance of producing the right answer. Agentic workflows were meant to improve on that by giving models tools: search the documentation, inspect files, act, search again. But those workflows introduced a different failure: agents can loop because the tool-using system does not know where to stop. Multi-agent systems add specialization, but they also add coordination problems, because more agents and more tools can mean more conflicting interpretations.

The practical result, in Sharma’s framing, is that “context” itself is no longer the only bottleneck. Models increasingly allow teams to “dump a lot of context, a lot of data.” The unresolved question is whether the model is smart enough to decide what is important inside that data.

Qodo sees a recurring pattern in its own benchmarking, according to Sharma: current LLMs tend to use the beginning of the input and the end of the input, while the middle is treated as less salient or effectively discarded. The visual evidence she uses labels this “The Context Trap” and “The ‘Lost in the Middle’ Phenomenon”: high recall for information at the beginning of a prompt, high recall at the end, and frequent failure when critical data or instructions are buried in the middle. The displayed chart compares answer accuracy by the position of the document containing the answer across ten retrieved documents, showing the U-shaped pattern for Claude 1.3, Claude 1.3-100k, and GPT-3.5.

Agents look at the starting point, end point, and try to provide you the results. This is like a U curve where some of the things from the start, some of the things from the end makes sense, but whatever you are providing in between, that is not taken up.

Nupur Sharma · Source

When asked how she knows this, Sharma grounds the claim in Qodo’s own work on agentic code review. The team gives agents context and checks whether the resulting review actually reflects that context. When they attempt to provide a whole codebase, the initial goal tends to stay in focus and information appended at the end may also remain visible. But intermediate context — such as Jira information, MCPs, or other supporting materials — may be purged by the LLM as it tries to “make sense” of the task on its own.

The failure is particularly consequential for code review because teams are tempted to give the agent everything: the pull request, the repo, the ticket, the tooling outputs, the conventions, the historical decisions. Sharma’s position is that this is exactly the wrong default. The engineering task is not to maximize context volume; it is to optimize what reaches the model and where it reaches it.

Context optimization is an engineering layer, not a larger prompt

Nupur Sharma treats “context engines” as one answer to the problem, but not a universal one. A context engine, in her analogy, is like a bouncer for a high-speed car: it imposes a search pattern and ranking logic so the agent receives what is likely to matter for the task. For large, messy codebases, that can be useful. But building and scaling such a system can become its own product.

Indexing takes effort. At the scale of hundreds of repositories — she mentions 600 or 700 — mapping and indexing start to slow down. The system can again become unpredictable unless the organization is explicitly investing in context-engine infrastructure.

Several context-optimization strategies appear in Sharma’s taxonomy, each with different costs and failure modes. Hierarchical summarization creates summaries for files and folders so agents can inspect summaries first and decide whether to go deeper. It can help agents understand project structure, but requires ongoing LLM processing: each created or changed file may need to be summarized or remapped. Knowledge graphs are more complex but valuable when the codebase has logical dependencies across files or repositories. If one file impacts another, which impacts another, a graph can represent that relationship — but the developer effort and parsing infrastructure are high.

For many internal agent-building use cases, Sharma favors iterative retrieval. Instead of expecting the agent to absorb a complete corpus, the system gives it something closer to a library card: a topic index and a way to read deeper when something is relevant. This still has cost implications because recursive API calls can add up, but she argues that it requires relatively low developer input and tends to produce better results than indiscriminate dumping.

Self-correction is a recovery mechanism for cases where context is lost. A critic node evaluates whether the agent’s result is relevant to the original goal. If not, the system can ask the agent to try again. The tradeoff is latency and additional token cost, because the agent may run repeatedly. But it avoids requiring developers to build extensive upfront structures.

SolutionBest forDeveloper effortCost impact
Context engineLarge, messy codebasesHigh: needs search and ranking logicModerate indexing overhead
Hierarchical summarizationUnderstanding project structuresMedium: needs background LLM processingHigh upfront pre-summarization
Knowledge graphComplex logic and deep dependenciesVery high: needs a code-parsing engineVariable graph DB hosting
Iterative retrievalProactive agents using toolsLow: needs a read toolHigh recursive API calls
Self-correctionHigh-stakes tasksMedium: adds a critic nodeLower upfront cost; added latency/token cost
Sharma’s comparison of context-optimization strategies and their tradeoffs.

The important distinction is that none of these techniques assumes the model will reliably identify the right evidence just because it is somewhere in the context window. They all impose structure before or around the model: ranking, summaries, graph relationships, retrieval loops, or criticism. The model is still reasoning, but the system is narrowing and checking what it is allowed to reason over.

More capable models can waste tokens deciding how to work

Nupur Sharma identifies a second failure mode: the “orchestration paradox.” As models become more capable, she says, they can spend more effort deciding how to solve the problem than solving it. Given tools and freedom, the agent may keep asking which method to use, whether there is a better method, whether it should research further, and whether another path might be superior. Sharma says this can become a loop in which API tokens are consumed on tool selection and planning instead of task execution.

Her example is a high-reasoning model such as Opus. Asked to solve a task, it may challenge its own approach repeatedly: maybe not this way, maybe another way, maybe research more. The result is “research mode” as a trap. The failure pattern is described in the presentation as “Reasoning Drift: Dynamic Loop Degradation”: recursive reasoning without tool execution, and resource exhaustion from thinking about tool selection rather than solving the task.

Sharma describes an “80/20 hybrid approach” used by her teams as a way to constrain that drift. The system permits flexible, high-reasoning exploration for roughly 80% of the work: discovery, planning, tool use, and deciding what to inspect next. But the final 20% is constrained by hard gates: validation, summarization, security checks, formatting, committing, or other deterministic operations where the desired behavior is closer to “if X, then Y.”

This does not eliminate the possibility that the 80% exploration phase loops. Sharma says organizations use counters and timeouts to bound it. Some stop after four or five iterations and force the agent to work with the latest result. Others impose a time limit, such as five minutes, after which the system proceeds with the last decision and revisits the task if the result is poor.

The model choice also changes. Sharma argues that the exploratory 80% benefits from “latest and greatest” high-reasoning models. The rigid 20% does not necessarily need them. A critic node, for instance, does not need to research the best thing to do. It needs to compare the original goal, the attempted result, and the desired form of validation or summary.

If you are using anything like discovery or you're trying to see which tool to use, you're trying to plan, those 80% research models are really good. But if you are again trying to create a summarization... the 20% works really well.

Nupur Sharma

The argument is not that deterministic software replaces agents. It is that capable agents need bounded orchestration. Without it, the system can convert model intelligence into indecision, and convert orchestration into token burn.

One generalist agent becomes another version of the context problem

Nupur Sharma treats the “one big agent” pattern as a third failure mode. Because context windows have grown, teams may assume that a single agent can handle testing, code review, security analysis, Jira interpretation, style feedback, and architectural reasoning in one pass. Sharma says this overwhelms the model. If four tasks are assigned, the agent may deliver strong results for two while the other two disappear into the same middle-of-context failure pattern.

The proposed fix is a mixture of agents: smaller specialist agents, each focused on a narrow task. In code review, that can mean a security agent looking for security flaws, another agent inspecting code differences, another checking Jira issues, and others focused on specific review dimensions. The benefit is context isolation: each agent receives domain-specific information and instructions rather than a diluted all-purpose prompt.

But specialization creates another problem. Multiple agents can each produce plausible outputs that do not fit together. Sharma uses a travel analogy: one agent finds the best hotel, another finds the best location, another finds the best flight; the hotel may be in Greece while the flight goes from Amsterdam to Portugal. Each answer can look locally optimized and globally incoherent.

The judge agent is the reconciliation layer. It receives the outputs from specialists and decides whether they make sense together. In Qodo’s code-review architecture, as Sharma describes it, a context collector first gathers context from the PR, the context engine, and tools. It then bifurcates that context and passes relevant slices to specialist agents. The judge receives their findings, rechecks relevance against the pull request and context engine, and filters the suggestions down to what makes sense for the developer.

The architecture slide describes three properties: parallel execution, context isolation, and “signal over noise.” Multiple expert agents analyze simultaneously, each operates in a dedicated context with domain-specific knowledge, and the judge layer curates findings based on team priorities, resolves conflicts, and deduplicates insights. The displayed diagram shows a context engine, PR tools, context tools, local/shared tools, static analysis, tests, Git, a context collector, Agent A and Agent B, a judge, and a Git interface for PR review interaction. A separate mixture-of-agents slide claims that central synthesis of specialized reports can increase accuracy by 30%.

30%
accuracy increase claimed for the mixture-of-agents judge pattern in the presentation material

This architecture is also Sharma’s answer to the lost-middle problem in practice. Instead of one model receiving every available document, each specialist receives the subset judged most relevant to its narrow role. The judge is then responsible for synthesis rather than raw discovery across an unbounded prompt.

Rules must override habits when history teaches the wrong lesson

Nupur Sharma was pressed on two implementation questions after the talk: how the agents communicate, and how the system is calibrated to an organization’s standards. On communication, her answer is direct: Qodo uses LangChain underneath to build the infrastructure for different agents. Results are collected and turned into prompts for subsequent agents; where multiple outputs need to be joined, another agent may collect those results and create a refined prompt for the next step.

The harder question is whether context isolation deprives agents of the full picture needed for architectural judgment. An audience member argued that specialist agents may work for simple checks — linting, test presence, narrow code quality issues — but architecture requires balancing security, framework choices, and system-wide tradeoffs. If each agent runs autonomously on a slice of context, how does the system make decisions that require a broader view?

Sharma answers by comparing the architecture to older human code review processes. A senior engineer might know the codebase, packages, and team conventions. A security person might inspect hardcoded APIs, SQL injection risks, and other security-specific concerns. An auditor might ask about logging and compliance requirements such as ISO or SOC2. These people did not all have the same expertise; they applied specialized knowledge to the same change.

Her claim is that agents can mirror that pattern if the right context is collected centrally and then distributed appropriately. The context collector “knows everything,” in her phrasing, and provides relevant context to each agent. Architects can provide guidelines, compliance teams can provide guidelines, and specialized agents can validate against those instructions. The judge then resolves conflicts and weighs outputs based on team priorities and historical behavior.

Calibration comes partly from the context supplied to agents. A model does not inherently know what matters to a specific organization: healthcare, retail, and finance teams may use the same Java framework in different ways and care about different risks. Qodo therefore gives customers ways to guide agent behavior, according to Sharma. One source is PR history. The system indexes prior pull requests, looks for similar issues, and compares how reviewers and developers responded in the past. Sharma says this history is supplied twice: first to the sub-agents when they are identifying findings, and then to the judge agent when it decides which of perhaps 15 recommendations are worth showing.

But PR history is only one signal, and the audience member challenges it as a source of truth. Prior merged code may reflect bad habits, not good standards. Sharma accepts the limitation: PR history “can be one” source, alongside organization-provided resources and explicit rules, including guidelines supplied through a web portal by architects or compliance teams.

The most important distinction is between preferences inferred from history and rules declared by the organization. In Sharma’s description of Qodo’s weighting, compliance guidance can mark something as important, or distinguish an error from a recommendation, and that affects whether a finding is surfaced. Every time a developer accepts a suggestion, that suggestion type is weighted more heavily for the next run; if a developer does not accept it, it receives less weight. Similar historical issues and whether developers implemented prior feedback also affect weighting.

History, however, does not get the final word where an explicit rule exists. The hard case raised in the room is hardcoded API keys: if developers have repeatedly done it in the past, history alone might suggest that practice is normal. Sharma’s answer is that rules and bugs are treated differently. If a rule says not to hardcode keys, the system highlights it regardless of whether developers want that feedback. For bug-like findings, repeated reviewer disagreement may reduce weight over time. For explicit rules, organizational guidance overrides preference.

That boundary also defines what an organization should expect from an out-of-the-box agent. Some customers do not want to provide rules or regulations and still want a review. In that case, Sharma says they should not expect agents to find something highly specific to how the organization works, except where prior PR history gives the system enough signal about what has mattered before.

The architecture, then, is not presented as a promise that multi-agent systems automatically understand organizations. Sharma describes a system that narrows each agent’s task, preserves organizational context through retrieval and rules, and makes the judge responsible for converting many local judgments into one useful review.

The frontier, in your inbox tomorrow at 08:00.

Sign up free. Pick the industry Briefs you want. Tomorrow morning, they land. No credit card.

Sign up free