Agent Failure Should Drive Enterprise AI Knowledge Base Curation
Raj Navakoti argues that enterprise AI agents fail less because of model limits or retrieval plumbing than because companies have not made institutional knowledge legible. In his Demand-Driven Context workshop, he proposes building agent-ready knowledge bases from the bottom up: give agents real tickets or incidents, observe where they fail, and turn those failures into structured, validated context blocks. The method, shown through smaller-scope examples and prototypes including work from IKEA Digital, is presented as an incremental curation loop rather than a proven enterprise-scale system.

The missing layer is not the model or the retrieval system
Raj Navakoti frames the enterprise-agent problem as a memory problem. Current agents are strong at reasoning, computation, code generation, and general knowledge, but they start with effectively no knowledge of a specific company’s domain. His opening analogy is Memento: a capable person who cannot retain the context needed to act coherently over time.
The gap he focuses on is not whether an agent can write code, review pull requests, generate tests, or summarize documents. It is whether it can move real enterprise work through systems like Jira without a human quietly supplying all the hidden assumptions. His blunt version of the question was: if agents are so capable, why are enterprise Jira epics not moving?
He separates the knowledge required to complete work into three categories. Green knowledge is general knowledge the model already has: REST conventions, broad deployment practices, standard test-writing patterns. Orange knowledge can be taught through skills, rules, instructions, or other agent extensions: “do this in our preferred way.” Red knowledge is institutional: undocumented, scattered, outdated, or sitting in people’s heads.
| Tier | Knowledge type | How the agent gets it | Examples from Navakoti’s task breakdown |
|---|---|---|---|
| Green | General knowledge | Already present in the model | REST standards, zero-downtime deployment strategy, infra-as-code platform conventions, ordinary unit tests |
| Orange | Company-specific but teachable | Skills, rules, specs, instructions, or other extensions | Capacity template validation rules that must be preserved |
| Red | Institutional knowledge | Must be discovered, documented, and validated | Partner access tiers, market overrides, undocumented retry behavior, upstream provider throttle limits, retired analyst configuration |
In his example Jira tickets, the green and orange parts are not the main blocker. A story to publish API contracts includes normal OpenAPI and REST expectations, but also partner access tiers and deprecation rules. A platform migration includes zero-downtime deployment and infrastructure-as-code conventions, but also custom market overrides. An incident about settlement batches includes ordinary root-cause and monitoring work, but also an undocumented retry mechanism and an upstream provider throttle limit. A customer bug includes ordinary unit tests, but also state-specific regulation and a retired analyst’s configuration priority.
Agents can look impressive while failing on the last category. In enterprise delivery, the red items often decide whether the work is actually done.
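As an illustration only (not tooling from the workshop), the triage could be made explicit by tagging each piece of required knowledge with its tier; the field names and the orange example below are invented, while the red items echo the settlement-batch incident above.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    GREEN = "general knowledge the model already has"
    ORANGE = "company-specific but teachable through skills, rules, or instructions"
    RED = "institutional knowledge that must be discovered and validated"

@dataclass
class KnowledgeItem:
    description: str
    tier: Tier
    source: str = ""  # where the knowledge lives today, if anywhere

# Triage of the settlement-batch incident from the example tickets above
items = [
    KnowledgeItem("Root-cause analysis and monitoring work", Tier.GREEN),
    KnowledgeItem("Team's preferred incident write-up format", Tier.ORANGE, "team wiki"),
    KnowledgeItem("Undocumented retry mechanism", Tier.RED),
    KnowledgeItem("Upstream provider throttle limit", Tier.RED),
]

red_gaps = [i.description for i in items if i.tier is Tier.RED]
print("Institutional gaps to curate:", red_gaps)
```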
He described the typical enterprise knowledge base as a monolith: some useful information, stale information, unreliable information, duplicate material in different places, and a large share of tribal knowledge. In his account, plugging many retrieval tools into that monolith does not make the knowledge coherent. It only gives the agent more ways to retrieve noise.
Retrieval only works on knowledge that exists
The current industry answer, as Raj Navakoti describes it, is incomplete. Enterprises connect agents to Confluence, Jira, SharePoint, GitHub, Slack, and similar systems through RAG, Graph RAG, MCP servers, APIs, skills, rules, specs, and instructions. That architecture can work for documented knowledge. It cannot retrieve what was never written down.
He described his own path through the common attempts. First, dump everything into context: the agent drowns in irrelevant material. Then build RAG: it chunks, embeds, and retrieves, but still cannot recover missing knowledge. Then build a knowledge graph: useful in principle, but stale if nobody maintains it. Then manually curate the knowledge base: the curator burns out because they are guessing what to document.
Navakoti said he had personally built many MCP servers and tried to prove that agents could semi-autonomously complete Jira work once connected to institutional systems. His experience was that the data coming back was often nondeterministic, unreliable, and untested. He also argued that engineers often check whether a retrieval tool returns output, not whether that output helps solve the actual task.
That distinction matters because the problem is not just access. It is quality, freshness, and absence. If a page exists but is a year old, the agent may not know whether to trust it. If several documents say similar things, the agent may have to decide which one is authoritative. If the real rule lives only in a senior engineer’s memory, retrieval cannot surface it.
Navakoti’s answer is to put a curation step before retrieval. Demand-Driven Context, as he presents it, sits between the messy enterprise context and the retrieval layer. It turns the monolith into smaller, structured, agent-ready context blocks before RAG, MCP, or other retrieval mechanisms consume them.
Don’t push knowledge. Let agents tell you what they need.
The contrast is important. The conventional approach pushes available knowledge into the agent and hopes the relevant part is there. Demand-Driven Context pulls from actual work: give the agent a real problem, let it fail, and treat the failure as evidence of exactly what context is missing.
Failure becomes the discovery mechanism
Demand-Driven Context is presented as an analogue to both microservice decomposition and test-driven development. The enterprise knowledge base begins as a monolith. The goal is not to document everything in advance. The goal is to break the monolith into reusable context blocks by observing where agents fail on real problems.
Raj Navakoti also pointed to an arXiv preprint, shown in the workshop as “Demand-Driven Context: A Methodology for Building Enterprise Knowledge Bases Through Agent Failure,” authored by Raj Navakoti and Saideep Navakoti. The abstract shown on screen describes the same premise: LLMs fail on enterprise-specific tasks because domain procedures, system intricacies, and context-dependent rationales are often not captured in formal documentation.
His onboarding analogy is a new team member. A company does not usually ask a new hire to study a 500-page corpus and return only when fully trained. It gives the person some orientation, points them to resources, assigns a real work item, and lets questions reveal the missing context. If the person is disciplined, they also improve the documentation as they learn. DDC applies that loop to agents.
The cycle has six steps:
| Step | Action | Purpose |
|---|---|---|
| 1 | Give the agent a real problem | Use work that resembles production tasks rather than abstract documentation prompts |
| 2 | Watch it fail | Let the agent generate a demand checklist of what it does not know |
| 3 | Fill only the identified gaps | Avoid trying to curate the whole knowledge base up front |
| 4 | Curate into structured reusable entities | Turn answers into context blocks, not one-off chat responses |
| 5 | Re-run the agent | Check whether it can now reason with domain context |
| 6 | Repeat across problems | Let knowledge compound as more failures reveal more gaps |
Navakoti compares this to TDD: “You don’t write all the tests upfront. You don’t curate all the context upfront.” In TDD, a failing test creates a concrete demand for code. In DDC, a failing enterprise task creates a concrete demand for context. The team curates only enough knowledge to make the agent succeed on that class of problem, then validates and tightens the entities before moving to the next problem.
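The cycle can be compressed into a few lines of code. The sketch below is a toy simulation of the loop under stated assumptions: the problem set, the confidence scoring, and the dictionary-backed context base are stand-ins, not APIs from Navakoti's repository.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    name: str
    needed_facts: set          # what solving it actually requires

@dataclass
class AgentResult:
    confidence: float          # out of 5, mirroring the workshop's scoring
    demand_checklist: set      # facts the agent could not find

def run_agent(problem, context_base):
    """Stand-in for a real agent run: confidence rises as needed facts are covered."""
    missing = {f for f in problem.needed_facts if f not in context_base}
    covered = len(problem.needed_facts) - len(missing)
    return AgentResult(confidence=5 * covered / len(problem.needed_facts),
                       demand_checklist=missing)

def ddc_loop(problems, expert_answers):
    context_base = {}
    for problem in problems:                          # 1. give the agent a real problem
        result = run_agent(problem, context_base)     # 2. watch it fail
        for gap in result.demand_checklist:           # 3. fill only the identified gaps
            context_base[gap] = expert_answers[gap]   # 4. curate a reusable entity
        result = run_agent(problem, context_base)     # 5. re-run the agent
        print(f"{problem.name}: confidence {result.confidence:.1f}/5")
    return context_base                               # 6. knowledge compounds across problems

# Toy run: the second incident reuses context curated during the first
problems = [
    Problem("settlement batch incident", {"retry mechanism", "provider throttle limit"}),
    Problem("follow-up incident", {"retry mechanism", "escalation path"}),
]
answers = {"retry mechanism": "expert answer", "provider throttle limit": "expert answer",
           "escalation path": "expert answer"}
ddc_loop(problems, answers)
```

In a real deployment, the expert answers come from interviews or review rather than a prepared dictionary, and each curated entity goes through human validation before it is accepted.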
The intended output is not a chat history. It is a structured knowledge base of entities: systems, APIs, processes, terminology, data models, decisions, and related objects. The knowledge is reusable across agents and roles. A platform operations agent, a support agent, and a code review agent might reason over different parts of the same curated context base.
The methodology is explicitly incremental. Navakoti is not saying one pass produces a complete institutional brain. He is saying repeated failures expose the high-value missing knowledge faster than blank-page documentation projects do.
The evidence is early, and the scope matters
Raj Navakoti does not present DDC as a proven enterprise-wide system. In the Q&A, when asked whether he had used the approach at scale, he said he had not used it at enterprise scale. He had tried it at smaller scopes because a full enterprise contains multiple domains, many systems, and no single person with all the needed expertise.
His recommended unit is narrower: a team, its recent Jira tickets, its incidents, and its own documentation. At that level, he said, the process is faster and more useful. If the scope is too broad, curation becomes a meeting with five or six experts trying to reconstruct a whole domain.
The evidence he emphasized was also limited. One of his limitation slides said “15 cycles on one domain is evidence not proof.” He also said small teams may not need this at all; if people can simply talk to each other and the documentation is already good, the method may not add much.
The other major limitation is human labor. Manual curation is painful. Automation helps, but Navakoti called it early. Getting domain experts to curate is the hardest part, and he described that as a people problem rather than a technology problem.
That caveat sits near the center of the method. DDC is a way to generate demand signals for missing knowledge. It does not remove the need for people who can validate, correct, and maintain that knowledge.
The manual loop showed both the promise and the pain
The live loop used Claude Code, a mock monolithic knowledge base, and a context base organized as files. Raj Navakoti emphasized that the implementation was not specific to Claude Code. At work, he said, he used Copilot; for the workshop, he used Claude Code because he expected the audience to be familiar with it. The core was the approach itself, implemented with skills, rules, agents, hooks, and a persistence layer for curated knowledge.
The sample task was an incident root-cause analysis. The incident affected roughly 32,000 customers who had not received order confirmation emails over two hours, while customer support ticket volume had spiked 400%. Orders themselves were not affected; notifications were.
The agent first searched the available mock enterprise knowledge base. Navakoti stressed that this first step is already what retrieval systems do. The missing part is what happens after retrieval fails. A human employee who cannot find the answer in Confluence does not stop; they ask questions, learn the missing terms and business logic, and ideally document what they learned. The agent was instructed to follow that pattern: identify undocumented terminology, missing flows, unclear systems, and business logic it could not infer.
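As an illustration only, a demand checklist from such a failed attempt might look like the following; the categories come from the instruction described above, while the specific entries are invented for the sketch.

```python
# Hypothetical demand checklist emitted after a failed root-cause attempt
demand_checklist = {
    "undocumented_terminology": ["customer notification service", "notification fan-out"],
    "missing_flows": ["order confirmation email path"],
    "unclear_systems": ["which team owns the email relay"],
    "uninferable_business_logic": ["why silent email drops do not page on-call"],
    "confidence": 1.5,   # out of 5, before any gaps are filled
}
```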
Navakoti said the first attempt produced a very low confidence score because the agent could not find the institutional context. He then supplied a prepared high-level answer representing what a domain expert might provide. The agent used that answer to solve the incident and create new structured entities in the knowledge base. According to Navakoti, one problem surfaced six never-documented entities, and after he filled the gaps, the agent discovered and curated another five or six entities.
He then applied the loop across a set of incidents. In the first cycle, the agent had almost no reusable context and a confidence level around 1.5 out of 5. Across roughly 14 to 15 cycles, the curated context accumulated. By later incidents, confidence scores moved above 4 out of 5.
The lesson from the manual loop is not that teams should sit with agents all day answering questions. Navakoti said the opposite. After about 15 manual conversations, he was “questioning my life choices.” Manual DDC was evidence that the loop could work, but it did not scale as an operating model.
The important shift, in his framing, is that the agent moves from being a passive consumer of institutional knowledge to participating in knowledge management. It does not merely ask for context; it helps identify, classify, and store what was missing.
Historical work already contains the demand signal
Enterprises already have examples of demand: Jira tickets, incidents, pull requests, resolved support cases, and other completed work. Those artifacts show what people actually needed to know to solve problems.
Instead of waiting for an agent to fail during live operations, Raj Navakoti proposes scanning historical work against the existing knowledge base. The scanner asks, for each prior task, whether the documentation needed to solve it is clean, stale, incomplete, missing, or tribal. It then consolidates those findings into a prioritized curation backlog.
In the workshop, the Context Gap Scanner loaded a preset SRE / Platform Operations domain with files such as a knowledge base, runbooks, recent incidents, an escalation matrix, resolved tickets, and recent pull requests. The scanner was shown at context-gap-scanner.web.app. Navakoti said it performs three steps: generate probes, run probes, and analyze gaps.
| Scanner stage | What the system does | What it produces |
|---|---|---|
| Generate probes | Turns historical incidents, tickets, or PRs into tests of the knowledge base | Questions a new joiner or agent would need answered |
| Run probes | Searches the connected knowledge base for the needed systems, terms, flows, and rules | Found, stale, incomplete, missing, or tribal classifications |
| Analyze gaps | Consolidates repeated failures across work items | A prioritized curation backlog and category-level readiness scores |
A probe is essentially a test of whether a knowledge base contains what a new joiner would need. If an incident mentions a customer notification service and the connected documentation has no such service, that becomes a gap. If the documentation exists but is old, the scanner can flag it as potentially stale. Navakoti said Confluence extraction can include metadata such as date, creator, and last updated time, which can be used for thresholds. He also cautioned that an old document is not automatically useless; the tool should surface the issue, while people decide.
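A minimal sketch of the run-probes step under those assumptions is shown below; the metadata fields, the one-year staleness threshold, and the example knowledge base entry are illustrative choices, not values from the scanner.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=365)   # threshold is a policy choice, not a fixed rule

def classify_probe(term, kb):
    """Check whether the knowledge base can answer one probe from a past work item.

    kb maps a term to page metadata, e.g. {"last_updated": datetime, "complete": bool}.
    Returns one of: found, stale, incomplete, missing. Tribal knowledge is what remains
    missing even after the documented sources are exhausted.
    """
    page = kb.get(term)
    if page is None:
        return "missing"
    if datetime.now() - page["last_updated"] > STALE_AFTER:
        return "stale"               # flagged for people to judge, not auto-discarded
    if not page.get("complete", True):
        return "incomplete"
    return "found"

# Probes derived from the notification incident; the KB content is invented
kb = {"customer notification service": {"last_updated": datetime(2023, 1, 10), "complete": True}}
for probe in ["customer notification service", "silent email drop alerting"]:
    print(probe, "->", classify_probe(probe, kb))
```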
The scanner produced an AI-readiness score and category coverage. In the example, overall readiness was 58 out of 100. Coverage varied: systems and services and domain terminology scored higher, while business processes and tribal knowledge scored lower.
| Category | Coverage |
|---|---|
| Systems & services | 75% |
| Business processes | 48% |
| Domain terminology | 88% |
| Data models & schemas | 65% |
| System integrations | 78% |
| Tribal knowledge | 35% |
The scanner also produced prioritized curation items. Examples included identifying the people who know manual split-brain resolution procedures, documenting step-by-step split-brain resolution commands for ledger database access, and explaining why silent email drops do not generate alerts. These are not generic documentation chores. They are specific pieces of missing operational knowledge tied to real incidents.
Navakoti’s automation pipeline has three broad functions: extract demand from work items, consolidate the missing knowledge into entity categories, and curate the results into a context lake. The scanner can classify knowledge health across systems, APIs, processes, terminology, data models, and decisions.
| Entity area | Clean | Stale | Incomplete | Missing | Tribal |
|---|---|---|---|---|---|
| Systems | 34 | 12 | 8 | 3 | 7 |
| APIs | 89 | 47 | 23 | 11 | 2 |
| Processes | 8 | 5 | 12 | 15 | 29 |
| Terminology | 22 | 3 | 6 | 28 | 31 |
| Data models | 45 | 18 | 9 | 14 | 5 |
| Decisions | 3 | 3 | 4 | 21 | 14 |
The output is a curation workspace, effectively a Kanban board for institutional knowledge. Items move from backlog to elicitation, review, and then into the context lake. The example backlog included “refund override” business definition, payment gateway fallback logic, order processing service entity, vendor API contract format, and region-specific pricing rules.
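A sketch of how those board stages could be modeled is below; the stage names follow the workspace described above, and everything else, including the enforced forward-only transitions, is an assumption for illustration.

```python
from enum import Enum

class CurationStatus(Enum):
    BACKLOG = "backlog"
    ELICITATION = "elicitation"     # a domain expert supplies the missing knowledge
    REVIEW = "review"               # a human validates the drafted context block
    CONTEXT_LAKE = "context_lake"   # merged into the curated knowledge base

# Only forward moves through the board are allowed in this sketch
NEXT = {
    CurationStatus.BACKLOG: CurationStatus.ELICITATION,
    CurationStatus.ELICITATION: CurationStatus.REVIEW,
    CurationStatus.REVIEW: CurationStatus.CONTEXT_LAKE,
}

def advance(item):
    item["status"] = NEXT[item["status"]]
    return item

item = {"title": "refund override business definition", "status": CurationStatus.BACKLOG}
for _ in range(3):
    advance(item)                   # elicitation, then review, then merge
print(item["title"], "->", item["status"].value)
```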
Navakoti’s preferred operating model is to do this before retrieval and before live agent execution. In Q&A, he said he first tried doing the loop operationally, at the moment a work item arrived, but it took too much time and patience. His later preference is to scope a team, scan its recent tickets and incidents, and improve the context base in advance.
The context lake is a Git repository, plus a map
The curated context needs somewhere to live. Raj Navakoti is opinionated about the first version: use GitHub, or a Git-based repository. He said a SaaS product will probably appear for this eventually, but his preference is to start with Git because it already has the workflows engineering teams need.
His reasons are practical. Git gives version control, so every change is tracked. Pull requests create human review, so agents do not add unchecked knowledge directly to the base. Engineers already know the workflow. Branches allow safe experimentation. The files can be machine-readable and human-readable. If needed, the repository can later publish into Confluence, Slack, or another system.
The repository shown for the framework was github.com/ea-toolkit/ddc. The visible README described DDC as “TDD for knowledge bases — failing agents drive curation, not failing tests.”
This repository is the “context lake.” In the example structure, it contained areas such as systems, APIs, processes, terminology, data models, and decisions. Individual context files included metadata like status, verification state, and freshness.
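The files themselves are plain Markdown, but the shape of the metadata can be pictured as a typed record. The sketch below is an assumption about what one entity might carry; the field names, status values, and example content are illustrative, not the repository's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ContextBlock:
    """One curated entity in the context lake; field names are illustrative."""
    entity_type: str              # e.g. "system", "api", "process", "terminology"
    name: str
    body: str                     # human- and machine-readable content
    status: str = "draft"         # draft -> in review -> accepted, via pull request
    verified_by: str = ""         # the expert or team that validated the block
    last_verified: Optional[date] = None

block = ContextBlock(
    entity_type="process",
    name="settlement-batch-retry",
    body="Retries run on a fixed schedule; the upstream provider throttles bulk calls.",
    status="accepted",
    verified_by="payments platform team",
    last_verified=date.today(),
)
```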
| Context lake feature | Why Navakoti wants it in Git |
|---|---|
| Version control | Every change to the knowledge base is tracked |
| Pull requests | Human review can sit between agent-generated updates and accepted context |
| Branching | Teams can experiment safely before merging changes |
| Engineer familiarity | The workflow already matches how many engineering teams review code |
| Portable output | The same repository can later publish to Confluence, Slack, or another destination |
Navakoti also recommends a meta model: a domain map that defines entity types and their relationships so agents can navigate knowledge rather than search a pile of files. He likens it to the map at the entrance of an IKEA store: not the merchandise itself, but the navigation structure. The agent needs to understand how a domain is organized.
The main DDC meta-model slide listed 13 entity types:
| Group | Entity types from the 13-type slide |
|---|---|
| What the organization delivers | Offerings, capabilities |
| Who is involved | Teams, personas |
| How work happens | Processes, business events |
| What technology supports it | Systems, APIs, platforms, data models, data products |
| Domain language | Business jargon, tech jargon |
A later diagram also grouped architecture decisions with contextual domain knowledge. The point was the same: without a map, an agent receives a large set of files and has to infer which documents matter. With a model, it can reason about relationships: if it changes a system, which business process might be affected; which APIs are involved; which domain terms are relevant; which data products or decisions constrain the work.
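A toy illustration of why the map matters: with even a small typed relationship graph, an agent (or a plain script) can ask which processes, APIs, and terms a change to one system touches. The entities and edge names below are invented for the sketch.

```python
# Toy domain map: typed relationships instead of a flat pile of files
domain_map = {
    "order-service":    {"exposes": ["orders-api"], "supports": ["order fulfilment"]},
    "orders-api":       {"consumed_by": ["partner integrations"]},
    "order fulfilment": {"uses_terms": ["settlement batch", "refund override"]},
}

def impact_of_change(entity, graph):
    """Collect everything reachable from the changed entity."""
    seen, frontier = set(), [entity]
    while frontier:
        node = frontier.pop()
        for related in graph.get(node, {}).values():
            for other in related:
                if other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return seen

print(sorted(impact_of_change("order-service", domain_map)))
```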
The result Navakoti wants is a cache-like set of context blocks. Instead of boiling the ocean on every task, the agent first uses the curated 20% of documentation that solves most common problems. The remaining monolith can still exist behind links or retrieval tools for less common cases.
This is where DDC sits in the broader enterprise AI stack. The context layer remains Confluence, Jira, SharePoint, GitHub, Slack, and similar stores. DDC extracts, consolidates, and curates that messy context into structured blocks. Retrieval systems then operate over those blocks. Agents then use skills, hooks, rules, and specs to produce domain-specific work.
The value is in discovering missing context
The most valuable outcome, in Raj Navakoti’s account, is “knowing the unknown.” The approach surfaces knowledge that nobody had realized was undocumented. Without a demand signal, teams can spend months debating what to document. With real tickets and incidents, the missing knowledge becomes concrete.
The second value is shifting knowledge management work away from humans as the sole curators. Navakoti does not suggest removing human experts. Human validation remains central. But he wants agents to help identify gaps, structure entities, and maintain the context base rather than merely consuming answers.
He described the value across several categories. For the agent, the shift is from generic advice to domain-expert reasoning: naming specific systems, tracing flows, recognizing patterns across problems, and becoming useful on real Jira tickets. For the knowledge base, the value is structured entities that did not exist before and can be reused across agents and roles. For the team, the same KB can help both human new joiners and AI agents ramp up.
In the numbers he presented, the manual work completed 15 cycles, the demand checklist shrank each cycle, entity reuse moved from 0% to 42%, and agent confidence moved from 1/5 to 4.5/5.
| Metric shown | Value |
|---|---|
| Cycles completed | 15 |
| Entity reuse | 0% → 42% |
| Agent confidence | 1/5 → 4.5/5 |
| Structured entities that did not exist before | 46 |
He also presented prototype performance metrics: legacy KB accuracy at 12%, accuracy after migration at 54%, and accuracy after tribal curation at 87%, with 83% KB migration progress, “6 weeks vs 6 months manual,” 4.2x agent task accuracy, and 68% reduced onboarding time.
Navakoti’s own caveats are narrower than a broad enterprise claim: small teams may not need this, manual curation is painful, automation is early, domain-expert participation is the hard part, and 15 cycles in one domain are evidence rather than proof.
The method is therefore not a promise that agents will independently fix enterprise documentation. It is a way to make missing context observable, scoped, and actionable.
The real constraints are scale, humans, truth, maintenance, cost, and permissions
The Q&A tested the method against enterprise conditions. The strongest objections were not about whether agents can write Markdown. They were about whether organizations can operate the loop without overwhelming people or creating another stale knowledge system.
On scale, an audience member noted that the examples were scoped and asked whether the approach had been used broadly. Raj Navakoti said no: he had not used it at enterprise scale. Broad enterprise scope quickly becomes unwieldy because multiple domains and systems require multiple experts. His answer was to start at the smallest useful scope: one team’s tickets, incidents, and documentation.
On human burden, an audience member worried that the approach could “denial of service” team members by making agents ask endless follow-up questions. Navakoti agreed the risk is real. He said he does not expect leaders today to easily allocate engineers to sit and fix context, but he believes the priority will rise as organizations move toward managing semi-autonomous agents. His view is that companies are over-focused on model quality, agent harnesses, and retrieval, while under-investing in context quality.
On source of truth, another audience member argued that in large enterprises the real source of truth is often code, not documentation. Navakoti said he had tried applying the method to codebases and got mixed results. Code alone worked well, and textual documentation alone worked well, but combining GitHub and Confluence created conflicts. The agent might infer one theory from the repository while the documentation says something else. His current response is to add skills or rules that rank sources: for example, prefer GitHub as source of truth when relevant, fall back to Confluence otherwise. He described that as ongoing work.
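A minimal sketch of the kind of source-ranking rule he described adding as a skill is below; the ranking order and the record shape are assumptions for illustration, not his actual skill definition.

```python
# When the same fact exists in both the repository and the wiki, prefer the code.
SOURCE_PRIORITY = {"github": 0, "confluence": 1}   # lower number wins

def resolve_conflict(candidates):
    """Pick one answer when multiple sources disagree."""
    return min(candidates, key=lambda c: SOURCE_PRIORITY.get(c["source"], 99))

candidates = [
    {"source": "confluence", "claim": "retries run every 5 minutes"},
    {"source": "github", "claim": "retries run every 2 minutes"},
]
print(resolve_conflict(candidates)["claim"])   # the repository's answer wins
```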
On skills, an audience member asked whether the same failure loop should update agent skills, not just the knowledge base. Navakoti said his current skill was static and that he had not tried evolving skills through the loop, though he agreed that a failing skill could in principle evolve. He returned to his preference for fixing context before retrieval rather than doing the whole process at live operational time.
On maintenance, the question was whether today’s correct context becomes tomorrow’s wrong context. Navakoti said the scanner can flag duplication and staleness based on metadata such as creation and update dates, and can take the latest changed document as source of truth when there is only one version. But if the latest document itself is wrong, he argued, that is not specifically an agent problem. A human following wrong documentation would also implement the wrong thing. The audience pressed on cadence and cost; Navakoti’s answer was that scanned domains in his experiments stayed below about 100,000 tokens, but he treated larger enterprise cases as use-case dependent.
On cost and retrieval size, one audience member worried that if the context lake sits in GitHub, tasks might require reading several documents, consuming time and context. Navakoti said that after large context windows became available, he had less concern. In his experiments across domains, consolidated context averaged around 96,000 tokens, which he said fit easily. He had tried Graph RAG and intent-based retrieval, but found that putting the whole scoped context into the window often produced better results than RAG, unless the domain approached a million tokens.
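The decision he described can be sketched as a simple budget check; the ceiling and the character-per-token heuristic below are illustrative assumptions, not numbers from the workshop beyond the roughly 96,000-token average he cited.

```python
FULL_CONTEXT_CEILING = 200_000   # policy choice; his scoped domains averaged ~96k tokens

def estimate_tokens(text):
    return len(text) // 4        # rough heuristic: about four characters per token

def choose_strategy(scoped_context):
    if estimate_tokens(scoped_context) <= FULL_CONTEXT_CEILING:
        return "load the whole curated context into the window"
    return "retrieve selectively (RAG / Graph RAG) over the context blocks"

print(choose_strategy("x" * 400_000))   # ~100k tokens -> the full context fits
```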
On permissions, Navakoti said the approach itself does not solve access control. If implemented as a GitHub repository, permissions can use GitHub’s read, write, and merge controls. If implemented as SaaS, access control becomes the product’s responsibility. He noted that the public scanner used presets because he did not want workshop participants uploading proprietary data.
The method may also apply beyond documentation. An audience member asked whether agent failures could reveal gaps in internal tooling: for example, an internal CLI that lacks a bulk operation agents need. Navakoti said the approach could be extended. Since DDC can document business processes and how applications operate, it could also surface gaps in business processes or tooling.


