Orply.

BDD and ADRs Give AI Coding Agents Enforceable Project Memory

Michal Cichra · AI Engineer · Wednesday, June 3, 2026 · 7 min read

Michal Cichra of Safe Intelligence argues that AI-assisted development does not fail for lack of prompts so much as for lack of enforceable memory. In his talk, he makes the case for keeping ADRs, PRDs, BDD scenarios and design-system rules close to the code, so product intent and architectural decisions can be found by humans, retrieved by agents and enforced by Git hooks and CI. His most specific claim is that Cucumber-style executable specifications have become useful again because they connect human-readable product behavior to tests that prove the software still does what the spec says.

The missing part of spec-driven development is enforcement

Michal Cichra frames the consistency problem in AI-assisted software development as an old product problem that now arrives faster. Teams forget why a flow exists, why a feature was built, why code has a certain shape, or where new work belongs. People leave. LLMs compact context and have no durable memory between sessions. The product keeps accumulating inherited behavior while the reason for that behavior disappears.

Cichra uses the five-monkeys story as the metaphor: after enough turnover, a team may still enforce a rule but no longer know why the rule exists. His answer is not more prompting. It is to capture product and architecture decisions in documents close enough to the codebase to be found, reviewed, linked, and enforced.

The stack is familiar: ADRs for architecture decisions, PRDs for product goals, BDD scenarios for executable behavior, and design-system rules for UI. Cichra treats these artifacts less as planning paperwork than as a control surface for both humans and agents. The documents explain intent; the harness makes violations fail.

ADRs explain the rule, not just the prohibition

An Architecture Decision Record, in Cichra’s formulation, records what decision was made, why it was made, and how it is enforced. It can point to reference documentation and code examples, but it does not require a single canonical format. The useful format is text that a person or agent can locate when a tool rejects a change.

There is not a single format that you need to use, it's just a concept.

Michal Cichra

One ADR example bans ORM queries in templates. It carries an ID, accepted status, the enforcing tool, file patterns, and the decision itself: templates had previously called .filter() and .count() directly, which made N+1 queries invisible until production. The decision was to pass only pre-fetched typed data to templates and ban ORM models from the template layer. The enforcement mechanism was import linting.
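To make the shape of such a record concrete, here is a minimal sketch of an ADR held as data that a hook or an agent could look up by file path. The field names, the ID `ADR-0001`, and the glob patterns are assumptions for illustration; the talk does not prescribe a format.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ADR:
    """One architecture decision, kept as data so tools can look it up."""
    adr_id: str                      # hypothetical ID scheme
    status: str                      # e.g. "accepted"
    enforced_by: str                 # the tool that fails the build
    file_patterns: tuple[str, ...]   # which files the rule governs
    decision: str
    rationale: str


# Record modeled on the talk's example; the ID and globs are invented.
NO_ORM_IN_TEMPLATES = ADR(
    adr_id="ADR-0001",
    status="accepted",
    enforced_by="import linting",
    file_patterns=("templates/*", "*/templates/*"),
    decision="Templates receive only pre-fetched typed data; "
             "ORM models are banned from the template layer.",
    rationale="Templates calling .filter() and .count() hid N+1 queries "
              "until production.",
)


def adrs_for(path: str, records: list[ADR]) -> list[ADR]:
    """Return the accepted ADRs whose file patterns match the given path."""
    return [
        adr for adr in records
        if adr.status == "accepted"
        and any(fnmatchcase(path, pattern) for pattern in adr.file_patterns)
    ]
```

With this sketch, `adrs_for("app/templates/dashboard.html", [NO_ORM_IN_TEMPLATES])` returns the record, while a path such as `app/models.py` returns an empty list.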

Artifact | Source example | Enforcement or use
ADR | No ORM queries in templates; templates receive pre-fetched typed data | Import-linter blocks ORM imports in template-related files
PRD | Scheduled eval runs catch regressions automatically | Connect model, set schedule, receive alert on regression
The talk’s ADR and PRD examples show the same pattern: record intent in text and connect it to a concrete path or enforcement mechanism.

That is the pattern Cichra wants: the tool should not merely say “this is forbidden.” It should tell the developer or agent what rule was violated, why the rule exists, and how to fix it. An agent rejected at commit time can be linked back to the ADR, read the rationale, and iterate.
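A rejection message of that kind might be produced as follows. This is a hand-rolled sketch built on Python's standard `ast` module, not the import-linting tool used in the talk; the rule dictionary, the module name `app.models`, and the document path are hypothetical.

```python
import ast

# Hypothetical rule data; in practice it could be read from the ADR file itself.
RULE = {
    "id": "ADR-0001",
    "forbidden_prefix": "app.models",
    "why": "ORM calls in templates hide N+1 queries until production.",
    "fix": "Fetch the data in the view layer and pass typed values to the template.",
    "doc": "docs/adr/0001-no-orm-in-templates.md",
}


def check_imports(source: str, filename: str, rule: dict) -> list[str]:
    """Return one explanatory message per forbidden import found in source."""
    prefix = rule["forbidden_prefix"]
    messages = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            if name == prefix or name.startswith(prefix + "."):
                messages.append(
                    f"{filename}:{node.lineno}: imports '{name}' "
                    f"(violates {rule['id']})\n"
                    f"  why: {rule['why']}\n"
                    f"  fix: {rule['fix']}\n"
                    f"  see: {rule['doc']}"
                )
    return messages
```

The point of the design is that the message carries the rule ID, the rationale, and the remedy together, so an agent that hits the failure has what it needs to iterate without asking.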

The same architecture-enforcement idea extends beyond templates. Cichra says his team splits code into layers to prevent N+1 queries, enforces that split by linting module imports, and requires database reads to return plain shapes rather than ORM objects. He describes having “another like 50 ADRs” defining the product architecture.
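As an illustration of "plain shapes," the sketch below reads from an in-memory SQLite database and returns frozen dataclasses, with the per-project count fetched in the same query. The table names and the `ProjectSummary` type are assumptions; the talk does not show its read layer.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectSummary:
    """Plain shape handed to the rendering layer; it has no query methods."""
    name: str
    run_count: int


def list_project_summaries(conn: sqlite3.Connection) -> list[ProjectSummary]:
    """Read layer: one aggregated query, returned as plain dataclasses."""
    rows = conn.execute(
        """
        SELECT p.name, COUNT(r.id)
        FROM project AS p
        LEFT JOIN run AS r ON r.project_id = p.id
        GROUP BY p.id, p.name
        ORDER BY p.name
        """
    ).fetchall()
    return [ProjectSummary(name=name, run_count=count) for name, count in rows]


# Tiny in-memory database so the sketch runs as-is.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE run (id INTEGER PRIMARY KEY, project_id INTEGER NOT NULL);
    INSERT INTO project (id, name) VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO run (project_id) VALUES (1), (1);
    """
)
summaries = list_project_summaries(conn)
```

Because the caller receives only `name` and `run_count`, there is no object left on which a template could trigger a further query.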

PRDs keep the feature’s purpose available after the build

A PRD, as Cichra uses it, is deliberately lighter than a large requirements document. It captures why a feature exists, what problem it solves, what outcome is expected, and the journey a user takes through the application. It preserves the connection between problem, goal, and user path.

A scheduled-evals PRD captures the pattern. Its “why” is that regressions should be caught automatically. The problem is that evals only run when someone remembers to trigger them. The goal is to let users schedule evals and receive alerts on failure. The journey is summarized as: connect model, set schedule, receive alert on regression.
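A record this small can also be checked mechanically. The following sketch assumes a hypothetical plain-text layout of `Key: value` lines and an assumed set of required fields; neither comes from the talk, which only describes what the PRD captures.

```python
REQUIRED_FIELDS = ("why", "problem", "goal", "journey")  # assumed field names

PRD_TEXT = """\
Title: Scheduled evals
Why: Regressions should be caught automatically.
Problem: Evals only run when someone remembers to trigger them.
Goal: Users can schedule evals and receive alerts on failure.
Journey: connect model -> set schedule -> receive alert on regression
"""


def parse_prd(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dict with lower-cased keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


def missing_fields(text: str) -> list[str]:
    """Return the required PRD fields that are absent or empty."""
    fields = parse_prd(text)
    return [name for name in REQUIRED_FIELDS if name not in fields]
```

A document lint step could call `missing_fields` on every PRD and fail the commit when the list is non-empty.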

For Cichra, that compact record is useful not only to agents but also to the same human team six weeks later. It lets the team ask whether the feature still matters, whether it should be kept, and whether it can be deleted. A PRD is part of the system’s memory, not just the ticket that justified initial implementation.

Cucumber connects product intent to executable behavior

Cichra’s most specific answer to current spec-driven development is readable executable scenarios that connect PRDs, critical user journeys, and tests. His complaint is that a markdown spec can describe how a product is supposed to work without proving that the product actually behaves that way. The gap is especially painful with AI-generated code and tests.

“One thing harder than reading an AI code is reading AI tests,” he says. BDD provides an intermediate layer: human-readable scenarios that can also execute.

Cichra describes Cucumber as “almost forgotten” technology that has become useful again. A Cucumber feature can describe behavior in a language reviewers can understand, then map those statements to steps that execute as code. The minimal scenario he uses is a feature called “Executable spec”: given an executable specification exists, when Cucumber runs, then the scenario passes.
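The mapping from readable sentences to code can be shown with a toy runner. This is a self-contained stand-in written for illustration, not the API of Cucumber or any BDD library; the `step` decorator, `run_feature`, and the scenario title are invented names, and only the feature name and the three step sentences follow the talk's example.

```python
import re

FEATURE = """\
Feature: Executable spec
  Scenario: The spec runs
    Given an executable specification exists
    When Cucumber runs
    Then the scenario passes
"""

STEPS = []  # (compiled pattern, function) pairs


def step(pattern: str):
    """Register a function as the implementation of a step sentence."""
    def register(func):
        STEPS.append((re.compile(pattern + r"\Z"), func))
        return func
    return register


@step(r"an executable specification exists")
def spec_exists(ctx):
    ctx["spec"] = FEATURE


@step(r"Cucumber runs")
def runner_runs(ctx):
    ctx["ran"] = "spec" in ctx


@step(r"the scenario passes")
def scenario_passes(ctx):
    assert ctx["ran"], "the spec was never executed"


def run_feature(text: str) -> list[str]:
    """Execute each step line; raise if a step is undefined or fails."""
    ctx, executed = {}, []
    for line in text.splitlines():
        match = re.match(r"\s*(Given|When|Then|And|But)\s+(.*)", line)
        if not match:
            continue  # Feature:/Scenario: headers and blank lines
        sentence = match.group(2).strip()
        for pattern, func in STEPS:
            if pattern.match(sentence):
                func(ctx)
                executed.append(sentence)
                break
        else:
            raise LookupError(f"no step definition for: {sentence}")
    return executed
```

Calling `run_feature(FEATURE)` executes the three steps in order. A sentence with no registered step raises `LookupError`, which is the property that keeps the readable text and the executable code from drifting apart.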

The production value is not the toy scenario. It is the linkage. Scenarios can connect directly to PRDs and critical user journeys. They are readable enough to review and executable enough to test. Intended behavior, the reason for that behavior, and the test that verifies it can point back to one another.
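One way to keep that linkage honest is to check it. The sketch below assumes a convention, not stated in the talk, that each scenario carries a Gherkin tag naming its PRD; the ID `PRD-012` and the feature text are invented, and only scenario-level tags are handled.

```python
import re

KNOWN_PRDS = {"PRD-012"}  # hypothetical IDs, e.g. collected from a PRD folder

FEATURE = """\
Feature: Scheduled evals

  @PRD-012
  Scenario: Alert on regression
    Given a connected model with a daily schedule
    When a scheduled eval regresses
    Then the user receives an alert

  Scenario: Untracked behavior
    Given a scenario with no PRD tag
"""


def unlinked_scenarios(feature_text: str, known_prds: set[str]) -> list[str]:
    """Return scenario names that lack a tag naming an existing PRD."""
    problems, pending_tags = [], []
    for raw in feature_text.splitlines():
        line = raw.strip()
        if line.startswith("@"):
            # Tag lines sit directly above the scenario they annotate.
            pending_tags.extend(re.findall(r"@(\S+)", line))
        elif line.startswith("Scenario:"):
            name = line[len("Scenario:"):].strip()
            if not any(tag in known_prds for tag in pending_tags):
                problems.append(name)
            pending_tags = []
    return problems
```

Run against the sample, the check reports only "Untracked behavior", the scenario that cannot be traced back to a product goal.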

Cichra does not present BDD as new. His claim is narrower: in an environment where AI may generate both implementation and tests, the old value of readable executable specifications becomes newly practical.

UI consistency needs rules agents can see and reuse

For UI, the same discipline becomes a design system and pattern library. Consistent interfaces were difficult before AI and remain difficult with agents; the answer, in Cichra’s view, is still to define reusable components, usage rules, and previews.

The design system records the language of the interface: what a primary button is, its shape, color, size, states, and usage rules. His example rule is that only one primary button should be visible on a page at any time. Components and patterns should be defined, previewed, and composed from smaller pieces into larger ones.
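A rule like the single primary button can be approximated in a static check. The sketch below counts elements carrying an assumed class name, `btn-primary`, in a page's markup. It inspects markup only, so it cannot tell whether a button is actually visible; treat it as a rough proxy for the rule rather than the talk's implementation.

```python
from html.parser import HTMLParser
from typing import Optional


class PrimaryButtonCounter(HTMLParser):
    """Counts elements carrying the (assumed) primary-button class."""

    def __init__(self, primary_class: str = "btn-primary"):
        super().__init__()
        self.primary_class = primary_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.primary_class in classes:
            self.count += 1


def primary_button_violation(html: str) -> Optional[str]:
    """Return a message if the page has more than one primary button."""
    counter = PrimaryButtonCounter()
    counter.feed(html)
    if counter.count > 1:
        return (f"found {counter.count} primary buttons; "
                "the design system allows one per page")
    return None
```

A page with one `btn-primary` element passes; adding a second makes the function return a message that names the rule being broken.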

The additional AI-specific point is that previews are not only for humans. Cichra says agents can see them too. That makes it possible to review whether generated UI adheres to the visual system and to reuse existing components rather than producing ad hoc variations. Without that, he says, UI work becomes “chaos, like with code.”

The harness matters more than the prompt

Cichra describes an intentionally mundane enforcement loop: Git hooks, CI, linters, type checks, formatting, duplication checks, architecture checks, document linting, and tests. The agent’s goal is to deliver a pull request, and to do that it must use Git. That makes commit, push, and PR checks natural enforcement points.

The same commands run locally in hooks and later in CI. If an agent skips or avoids a local check, CI catches it. Cichra argues that style, formatting, and similar issues should no longer consume review attention. They are rules, not discussion topics, and should be automated so code review can focus on higher-level concerns.
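One way to guarantee that hooks and CI run identical commands is to give both a single entry point. The sketch below is an assumption about how that could look; the check names are invented and the commands are placeholders so the example runs anywhere.

```python
import subprocess
import sys

# Placeholder commands; a real project would list its own linter,
# type checker, formatter, and test commands here.
CHECKS = [
    ("format", [sys.executable, "-c", "print('formatting ok')"]),
    ("lint", [sys.executable, "-c", "print('lint ok')"]),
    ("tests", [sys.executable, "-c", "print('tests ok')"]),
]


def run_checks(checks) -> list[str]:
    """Run every check and return the names of the ones that failed."""
    failed = []
    for name, command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(name)
            print(f"[{name}] failed:\n{result.stdout}{result.stderr}")
    return failed


def main() -> int:
    """Entry point shared by the Git hook and the CI job."""
    return 1 if run_checks(CHECKS) else 0
```

A pre-commit hook and a CI step would both invoke this module and exit with `main()`'s return value, so a check skipped locally still fails the pull request.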

What you cannot find, you cannot enforce.

Michal Cichra

The concrete architecture checks are stricter than ordinary linting. Templates cannot import ORM code. End-to-end BDD tests cannot import any module that could access the database. Those import boundaries force tests to interact through browser-facing application behavior and keep rendering templates from talking to the database. The goal is not to keep finding N+1 queries; it is to prevent that class of query by construction.
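"Any module that could access the database" implies a transitive check: a test may not reach the database through an intermediate import either. A minimal sketch of that reachability test follows, using a hand-written import graph. The module names are hypothetical, and a real setup would derive the graph from the code rather than declare it.

```python
from collections import deque

# Hypothetical import graph: module -> modules it imports directly.
IMPORT_GRAPH = {
    "tests.e2e.test_schedule": ["tests.e2e.browser"],
    "tests.e2e.test_shortcut": ["app.services.evals"],
    "tests.e2e.browser": [],
    "app.services.evals": ["app.db.queries"],
    "app.db.queries": [],
}
DB_MODULES = {"app.db.queries"}  # assumed set of modules that touch the database


def path_to_forbidden(start, graph, forbidden):
    """Breadth-first search; return the import chain to a forbidden module, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        chain = queue.popleft()
        for dep in graph.get(chain[-1], []):
            if dep in forbidden:
                return chain + [dep]
            if dep not in seen:
                seen.add(dep)
                queue.append(chain + [dep])
    return None
```

Here `tests.e2e.test_schedule` is clean, while `tests.e2e.test_shortcut` is rejected with the full chain through `app.services.evals`, which shows the author of the change where the boundary was crossed.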

Cichra also separates the generic loop from task-specific focus. The loop stays the same: work, check, receive feedback, iterate. Skills determine what the agent attends to. An ADR skill looks up rules governing the change. A PRD skill asks which goals the work serves. A UI loop skips some checks in order to iterate quickly in a browser and reconcile afterward. A test skill selects focused tests based on coverage and file changes. Goal execution records decisions and pending work so they can be reviewed later.
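The test-selection skill, for instance, could be as simple as intersecting a coverage map with the changed files. The sketch below assumes such a map is available from an earlier full run; the test IDs and file paths are invented, and the talk does not describe its selection logic in this detail.

```python
# Hypothetical coverage map: test id -> source files it executed on the last full run.
COVERAGE = {
    "tests/test_schedule.py::test_daily": {"app/schedule.py", "app/alerts.py"},
    "tests/test_alerts.py::test_email": {"app/alerts.py"},
    "tests/test_ui.py::test_button": {"app/ui/button.py"},
}


def select_tests(changed_files, coverage):
    """Pick tests that executed a changed file, plus tests in changed test files."""
    changed = set(changed_files)
    selected = {test for test, files in coverage.items() if files & changed}
    selected |= {test for test in coverage if test.split("::")[0] in changed}
    return sorted(selected)
```

A change to `app/alerts.py` selects the two tests that exercised it and skips the UI test, which keeps the inner loop short while the full suite still runs in CI.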

Long-running agent sessions depend on recoverable context

Michal Cichra acknowledges the cost: the system is context-heavy and the feedback loop can be slow. Research alone can consume a large share of context. But he says he is no longer afraid of context compaction. In his sessions, he has seen 20 to 50 context compacts, and he says that has been acceptable because the important facts survive by being recoverable. The agent can look them up again.

20–50 context compacts Cichra says his sessions can run through

The intended operating model is a multi-hour autonomous session with a clear goal and rules that do not depend on a prompt remaining intact. ADRs record decisions, PRDs capture product goals, BDD runs the specs, the design system rules the UI, and the harness ties those artifacts into the commit and CI loop.
