Orply.

Context Engines Make Coding Agents Mergeable, Not Just Functional

Brandon WaselnukAI EngineerTuesday, May 26, 202613 min read

Brandon Waselnuk of Unblocked argues that coding agents are failing less because they lack access to tools than because they lack organizational context. In his account, MCP connections, larger context windows and naive RAG give agents more material, but not the judgment to know which code patterns, Slack decisions, ownership signals or backwards-compatibility rules matter. His proposed answer is a runtime context engine that reasons across code, PRs, documents, conversations and social structure before the agent writes code, so its output is closer to something a long-tenured engineer could merge.

Agents do not lack access; they lack tenure

Brandon Waselnuk frames the current problem with coding agents as a context problem, not an intelligence problem. When an engineer starts an agent from a CLI or an agentic IDE, it is as if a capable software engineer has appeared with no knowledge of the organization, the codebase, the team’s conventions, or the history behind prior decisions. The human then has to supply that missing context by steering, correcting, pointing at files, explaining patterns, and rejecting plausible but wrong work.

It's like a brilliant software engineer has just spawned. And it knows nothing about what it needs to do. And it knows nothing about your org.
Brandon Waselnuk · Source

Many engineering organizations, in his account, are still using people as the context engine. Earlier AI-assisted development looked like tab completion; the human remained responsible for the judgment. Agentic IDEs changed the interface, but not the burden: the engineer still triggers jobs, watches the run, and intervenes when the agent goes down the wrong path. That “care and feeding” is what makes agent work feel like babysitting.

Waselnuk’s analogy is the career path of a software engineer. On day one at a company, a new engineer may be smart but has little usable context. Over time, they accumulate it through mentorship, planning, code reviews, architecture decisions, incidents, experiments, rollouts, and a large number of pull requests. They become effective not because they have memorized everything, but because they know how to ask good questions, gather accurate context, and discard information that is not useful.

Agents need an equivalent mechanism. Static instruction files and curated repositories are a step beyond pure human prompting, but Waselnuk argues they are insufficient. Files such as claude.md or agents.md can store key company conventions, but they are static. Someone must maintain them, and they do not include the runtime and organizational information engineers actually rely on: recent Slack decisions, PR history, incident learnings, team ownership, rejected approaches, migration plans, or the fact that a running implementation in main may be contradicted by a later discussion from a senior technical leader.

A context engine, as he defines it, is not merely a larger prompt or a search index. It is a system that reasons across the company’s knowledge corpus at runtime, pulls from static and live sources, analyzes relationships among those sources, and returns a small, token-optimized research packet to the agent before the agent starts planning or writing code. The aim is to make subsequent agent actions better because the first packet gives the agent the team’s patterns, constraints, and relevant history.

Waselnuk quotes Andrej Karpathy’s line, “The gap is not intelligence. It’s context,” and treats it as the central diagnosis. An agent with access to code, logs, docs, tickets, and tools may produce output that compiles. But in his experience, that output often fails human review because it does not match the organization’s real expectations.

Three common fixes give the model more material, not more judgment

The three fixes Waselnuk rejects are naive RAG over documents, connecting enough MCPs, and relying on a bigger context window. Each gives the model more material without giving it the organizational understanding needed to decide what matters.

Naive retrieval-augmented generation falls down because it tends to stop too early. Waselnuk borrows the term “satisfaction of search” from radiology: when a radiologist finds one plausible explanation on an X-ray, they may stop looking, even though other findings could matter. Agents behave similarly. If asked to build a Zendesk integration, an agent may call a tool, find the first source that looks like a pattern, and treat it as sufficient. If that first result is incomplete, obsolete, or from the wrong part of the system, the agent can produce a plausible implementation that misses the actual root cause or the team’s preferred approach.

That failure mode pushes the engineer back into the loop. The engineer reviews the output, says “no,” points the agent to the correct file or pattern, and starts another round of correction. The work again becomes babysitting.

MCPs solve a different problem. Waselnuk describes them as useful pipes to data, not as understanding. A new engineer with access to every repository, ticket, and documentation system still does not know where to look, what matters, which service already exists, or which convention is binding. In his example, an agent may write an entire service from scratch, only for a senior engineer to reject it because the organization already has the relevant service elsewhere.

The larger context window has similar limits. Unblocked once hoped that a million-token context window would make the problem go away. In Waselnuk’s view, it does not. Filling a huge window with raw material does not mean the model can reason effectively over all of it. He describes the extra space as useful mainly for “needle in haystack” problems, not for understanding entities, relationships, conventions, conflicts, and authority across a real engineering organization. Even a much larger window, he argues, would not solve the core problem if the material is not organized and reasoned over.

The visible part of the problem is code that compiles. In the iceberg slide Waselnuk uses, what sits below the waterline is original intent, team conventions, architecture rationale, incident learnings, migration plans, testing standards, Slack decisions, design-doc tradeoffs, and rejected approaches. His point is that the hidden context is what decides whether code is mergeable. Without it, an agent can satisfy the compiler and still violate the organization’s real constraints.

A context engine has to reason about truth, not just retrieve facts

A real context engine needs six capabilities: unified system context, targeted retrieval, token optimization, conflict resolution, data governance, and personalized relevance. The engine is not only retrieving information; it is trying to decide which information is relevant, who it is relevant to, and how to present contradictions.

Unified system context means the engine can reason across systems of record: code and pull requests, planning tools, documents, conversations, and other corporate knowledge sources. The system has to connect those sources rather than treat them as independent silos.

Targeted retrieval depends heavily on graph structure. Waselnuk emphasizes the social graph as a practical technique. If an engineer asks how to implement a Zendesk integration, the engine should know which codebases that engineer works in, whose code they review, who reviews theirs, and what “this integration” likely means in their organizational context. At large customers, he says, with as many as 20,000 members, the same natural-language request can mean very different things depending on the asker, their team, and their history.

20,000
members at some Unblocked customer organizations, according to Waselnuk

Personalized relevance follows from that graph. A useful packet for one engineer is not necessarily the right packet for another. The engine should score material based on the engineer’s repository, teammates, work history, and the surrounding technical area.

Conflict resolution is one of the harder parts. Waselnuk gives the example of running source code in main that appears to be the source of truth, while a Slack thread contains a CTO saying that implementation was wrong. A search system might return both. A simpler agent might pick one. A context engine, in his account, has to use authority signals to help the downstream agent understand the contradiction. If the CTO says the implementation is wrong, the engine should tell the agent that, while still providing the code and surrounding evidence.

He does not present this as solved. He says Unblocked has techniques for handling “truthiness,” but that it remains difficult and is “not fully solved.” The engine must avoid both hiding conflicts and flattening them into a single answer without explanation.

Data governance is also not optional. The system must respect permissions and policies across sources, especially when ingesting Slack or Microsoft Teams conversations. If the requester is a participant in a private chat, the engine may return information from that chat to them. If someone else asks, it must not expose private messages they cannot access. Waselnuk also ties this to MCP delivery because OAuth models can carry through access controls.

Token optimization is not just a cost issue. Without an engine, an agent may spend many tokens grepping a codebase to rediscover factory patterns or fallback infrastructure, only to lose that work when the session ends. A context engine amortizes that work by reasoning across the relevant corpus and returning a compressed answer: ranked, small, and sufficient for the next step.

The comparison was not whether the code compiled

The comparison Waselnuk presents turns on a simple distinction: code can compile and still be wrong for the system it enters. He describes two runs of the same coding task: same prompt, same model, different context. One run had MCP access to the required SaaS tools but no Unblocked context engine. The other used the context engine. The task was to add adaptive thinking mode for Claude 4.6 models to a Kotlin SDK organized as a monorepo with three modules: direct API, Bedrock, and batch processing, while preserving backwards compatibility for callers using the old budget_token method.

The run without the context engine passed code checks and compiled. But the slide showed a GitHub pull request review with “changes requested.” The senior engineer’s comment said the implementation auto-detected adaptive mode for 4.6 models, breaking existing callers using budget_token; the team convention was opt-in new features with backwards-compatible defaults. Waselnuk says the code “would have broken our entire system if we had shipped it.”

He is explicit that the numbers in the comparison were generated by Claude. His claim is not that he is presenting an independently audited benchmark; it is that the presentation’s reported comparison captured the key differences between the two runs.

CriterionWithout context engineWith context engine
Validation and integration tests2.5/10 — unit tests only; no API integration, Bedrock, or batch coverage; bugs in 2 of 3 modules would ship undetected9.5/10 — 19 unit tests plus API integration tests across all 3 modules; regressions caught before merge
DRY code and quality bar5.5/10 — custom token logic and a redesigned file batch API; substantial review rework required8.5/10 — reused file batch details and existing serialization patterns; single clean pass
Doesn't break existing implementation2/10 — auto-detect silently changed behavior across existing callers8/10 — opt-in default preserved existing behavior
Respects team conventions3/10 — broke a factory-method pattern, added 12 imports in one file, and broke linting tools9/10 — followed factory-method patterns in every module with no compilation errors or team feedback
Waselnuk’s reported comparison from the presentation, with scores he says Claude generated from the two runs.

“All checks passed” was not a good proxy for readiness. The naive implementation looked like a prototype: tests existed, code compiled, and the agent had access to the necessary systems. But it missed fallback behavior, compatibility expectations, module coverage, and local conventions. By contrast, Waselnuk says the context-engine run produced a pull request that a senior engineer approved with only a nitpick.

His broader claim is that at organizational scale — he names “20 plus” people as the point where this begins to matter — agents will often mock, invent, or simplify unless they arrive with the team’s context. The issue is not whether an agent can generate code. It is whether the code is mergeable under the standards of the team that owns the system.

Unblocked’s own failures shaped the design

Unblocked’s design changed after three failures: optimizing for access rather than understanding, hiding conflicts rather than surfacing them, and caching answers rather than computing them.

First, more tools did not create understanding. Waselnuk says the team initially assumed that wiring in more tools would let the model figure things out. In his view, years of agent development have not made that assumption safe. The context has to be brought into the harness deliberately.

Second, hiding conflicts made the system worse. Early versions found disagreements among sources and let the agent choose one. If documents, code, tickets, and conversations disagree, the disagreement itself is important context. A system that silently selects one version may give the agent confidence in the wrong answer. The engine has to surface authority signals and explain the contradiction well enough for downstream planning or review.

Third, caching correct answers created stale answers. Unblocked initially cached good responses for latency and token savings. Waselnuk compares this to writing documentation: the moment it is written, it begins to drift from the system it describes. A cached answer that was correct today may be false tomorrow because code, ownership, requirements, or deployment state changed. He acknowledges that some questions are stable, but says caching answers caused enough correctness problems that he strongly recommends against it as a general strategy.

These lessons clarify what he means by computing context at runtime. The engine should not merely retrieve a canned answer to “How do I add Zendesk?” It should inspect the current state of code, conversations, ownership, and conventions, then assemble a response for the present request.

The same engine serves agents, code review, and the rest of the company

The context engine is infrastructure for multiple workflows, not just coding agents. In Unblocked’s architecture, data flows from code and PRs, planning tools, docs, and conversations into the context engine, then back out to coding agents, code review, messaging apps, a CLI, and an API.

For human engineers, one surface is Slack. Waselnuk says Unblocked’s context engine sits in customers’ Ask Engineering channels, detects questions, scores its confidence, and responds automatically when appropriate. Support, sales, or other teams can ask what is running in production or how a product behavior works, and the engine answers from the same underlying context it would provide to an agent.

He lists several use cases for teams that are further along in AI adoption: hydrating agent plans before code generation, enriching tickets with cross-system context, triaging issues, validating bug reports, supporting incident response and root-cause analysis, and giving customer success or sales teams relevant product, ticket, and customer context.

Teams customize the engine through skills, workflows, and specialized agents. Waselnuk says Unblocked provides a cookbook of skills that customers fork and adapt to their own standard operating procedures. A workflow might prepare an incident timeline, draft a support response, or assess a ticket’s priority. Specialized agents can call the same engine rather than each building a separate understanding of the organization.

The strongest requirement appears in his statement that AI-generated code should feel like it was written by someone who has been on the team for years. That does not mean the agent needs a personality or a larger prompt full of slogans. It means it should know the factory patterns, the fallback path, the integration conventions, the relevant people, the old migration plan, and the Slack decision that changes how the code should be interpreted.

A social graph gives the engine a place to start

One component of the approach is engineering-social-graph, a tool shown in a GitHub repository attributed on screen to github.com/jtms1/engineering-social-graph. The repository description says it builds a developer social graph from GitHub PR history, identifies domain experts, detects team clusters, and generates interactive visualizations of engineering collaboration from data already in GitHub repositories. Waselnuk says the tool was built for a workshop and would be open-sourced after the event.

The demo is visually important because it shows the kind of organizational structure the engine can use. In the “Social Comment Network” view, nodes represent developers and edges represent collaboration in pull request comments. The interface shows who reviews whose PRs, weighted degree, communities, heat maps, peer tables, teams, and experts. Waselnuk points to one engineer’s node and explains that the graph can show whom that person works with, whose code they review, and what areas they are active in. Labels for areas of expertise are generated with an Anthropic API key.

The graph becomes a pivot for retrieval. If a query comes in from a particular engineer, the engine can use the graph to infer the relevant people, repositories, and areas of the codebase. If the engineer writes a vague prompt — “there’s a bug, gotta fix it” — the engine has a better chance of finding the right bug because it knows the asker’s context and collaboration history.

In a second demo, Waselnuk shows an agent named Ghosty being asked: “How do I make a new first class integration to Zendesk?” He explicitly tells it to use the MCP. The agent chooses a research task tool, constructs a query, runs high-effort reasoning, receives a research packet, and then sends exploration agents into the right places. The resulting plan identifies that Zendesk should be added as a new data source provider, that the organization uses the same factory-based integration pattern as Jira, Confluence, Linear, GitHub, and others, and that the provider is not currently in the enum or integration factories.

Waselnuk says the plan is not perfect and would likely need a couple more prompts, but he treats it as “one hell of a plan” because it found the important structure: provider registration, factory patterns, library modules, client work, and the relevant implementation path. He describes the practical loop as: use the engine for planning, let the agent execute, then use the engine again during code review. In his view, the engine is especially strong at planning and review because those phases depend most heavily on context rather than raw code generation.

The frontier, in your inbox tomorrow at 08:00.

Sign up free. Pick the industry Briefs you want. Tomorrow morning, they land. No credit card.

Sign up free