OpenClaw’s 3,000-Commit Day Shows Code Review Becoming the Bottleneck

Vincent Koc · AI Engineer · Friday, June 5, 2026 · 11 min read

Vincent Koc uses OpenClaw’s high-velocity refactor to argue that agentic software development is becoming an industrial management problem, not a prompting trick. In his account, a project that briefly touched 82% of its core codebase and produced thousands of commits exposed a new bottleneck: the human ability to supervise parallel agents, trust the test harness, reject bloat, and stop sessions that have lost the plot.

The bottleneck moved from hands to taste

Vincent Koc frames OpenClaw’s recent velocity as something more than luck or “row flipping to the max” (the transcript’s garbled rendering of a phrase for simply improvising at speed). The project’s commit volume is extreme enough that he says GitHub rate-limits him hourly, but his claim is not simply that more code is being generated. His claim is that software production is shifting to a different mode, closer in his telling to the move from hand looms to centralized mills.

In the older model, engineers wrote code in editors. In the emerging one, agent swarms work across repositories, and engineers become factory managers. The bottleneck is no longer “the weaver’s hands,” as his manufacturing analogy put it. It becomes taste: the ability to decide what should exist, what should not, which agent work is worth preserving, and when scale is turning into bloat.

Koc ties this to a tension in his own work. In his day job, he works around evals: structured systems, telemetry, and controlled measurement. In OpenClaw, by contrast, he says he has had to operate with “blind faith in the harness.” Those worlds are beginning to converge. The open-source project’s speed depends on trust in tests, workflow, and agent-management process, even when the humans cannot manually inspect everything at the pace it is produced.

His analogy is deliberately industrial. Britain has been through a production shock before, he says, with mills and cotton produced at extreme volume. The relevant question now is “how do we build at scale?” The old ways of working, in his view, do not hold when autonomous coding agents can produce changes faster than conventional review habits can absorb.

OpenClaw’s velocity made review itself the problem

Koc situates OpenClaw among other examples of large-scale agentic coding. The comparison set he presented included Anthropic using “16 parallel Claudes” on a “100K-line C compiler” over two weeks; Spotify producing “650 AI PR’s/m” and having “No Hand Code Since Dec 25”; and Steve Yegge pushing “50 PRs/day” solo as a “vibe maintainer.” Koc glossed the Spotify line as the company supposedly no longer writing code by hand, and said he could relate to Yegge’s phrase.

OpenClaw’s own number was “800 commits/day peak,” across what Koc described as roughly 10 to 15 core maintainers, all with day jobs.

His personal contribution graph showed 2,886 contributions on March 15, 2026. Koc rounded this to “close to 3,000 commits per day” and said the commit history visibly stops when he sleeps and resumes when he wakes.

2,886 contributions shown on Koc’s GitHub graph for March 15, 2026

His point was not that this is a stunt. “This is gonna become the norm everywhere else,” he warned. At that level of output, trying to review pull requests in the usual way “may not work.” But he resisted the idea that this is just chaos. “Somewhere in the mix is engineering,” he said. The hard part is identifying what that engineering is when the visible artifact is a flood of diffs.

The first phase, as he describes it, was “commitmaxxing”: pushing as many commits as possible and letting agents burn tokens for long stretches. He compared the passive version of that workflow to waiting around and hoping something happens. The next version, which he jokingly called “Bart looping,” asks for a more opinionated loop: yes, run agents, but do not treat token volume itself as the reward mechanism.

That distinction matters because OpenClaw’s scale creates a selection problem as much as an implementation problem. Koc repeated a point he attributed to Peter Steinberger: the challenge becomes deciding whom to say no to. If tokens are cheap, maintainers can say yes to every feature request and merge everything. Koc’s view is that this would turn the codebase into “an absolute fire dump.” The speed only becomes useful if paired with taste and refusal.

The Great Refactor was a refusal to let features become bloat

The “Great Refactor” began in the middle of several concurrent pressures. Koc and Peter Steinberger were at Nvidia, where they were helping with “NeMo Claw.” Steinberger, connected to a Mac Studio at home over VPN, was running about 15 Codex sessions. Koc was running another 10 or 15. Including sub-agents, he estimated they were collectively running as many as 60 to 70 agents, with perhaps 15 foreground “swim lanes.”

At the same time, another maintainer moved folders around in the codebase, including what Koc described as entire channels: “all our conversations with like MS Teams and Slack” ended up moving to another location in the codebase. That change became the catalyst for something larger.

OpenClaw had many contributors raising PRs because they wanted to build features. The maintainers did not want every feature in core. Their answer was to cut the codebase apart around a plugin architecture. Koc presented the rationale this way: if OpenAI, Mistral, or Anthropic had a provider-specific area, that piece of provider code could be separated from everything else rather than buried inside a growing monolith.
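
A minimal sketch of what putting provider code behind a plugin boundary can look like. The names here (ProviderPlugin, PluginRegistry, EchoProvider) are hypothetical and are not taken from OpenClaw; the sketch only illustrates the idea that core looks providers up by name instead of importing each one directly.

```python
from typing import Protocol


class ProviderPlugin(Protocol):
    """Hypothetical interface that every provider plugin implements."""

    name: str

    def complete(self, prompt: str) -> str: ...


class PluginRegistry:
    """Core only knows this registry, never a concrete provider module."""

    def __init__(self) -> None:
        self._plugins: dict[str, ProviderPlugin] = {}

    def register(self, plugin: ProviderPlugin) -> None:
        if plugin.name in self._plugins:
            raise ValueError(f"plugin already registered: {plugin.name}")
        self._plugins[plugin.name] = plugin

    def get(self, name: str) -> ProviderPlugin:
        try:
            return self._plugins[name]
        except KeyError:
            raise KeyError(f"no plugin named {name!r}") from None

    def names(self) -> list[str]:
        return sorted(self._plugins)


class EchoProvider:
    """Stand-in provider used only for illustration; it calls no real API."""

    name = "echo"

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


registry = PluginRegistry()
registry.register(EchoProvider())
```

With this shape, a feature request for a new provider becomes a new plugin rather than a change to core, which is one way to say no to core bloat without saying no to the contributor.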

The decision arrived under less-than-ideal conditions. “It was 2:00 in the morning, we’re tired,” Koc said. “We thought why not refactor the entire codebase.” The scale was substantial: nine days, 2,700 commits, 638,615 insertions, 271,117 deletions, and 82% of core lines touched. The outcome line was blunt: “Plugins Were Launched!”

Measure	Value
Duration	9 days
Commits	2,700
Insertions	638,615
Deletions	271,117
Core lines touched	82%
Outcome	Plugins launched

The Great Refactor, as presented by Koc
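
The figures above imply a few derived numbers. Koc did not present these; they are simple averages computed from his totals and say nothing about how the work was distributed across the nine days.

```python
# Figures as presented by Koc for the Great Refactor.
days = 9
commits = 2_700
insertions = 638_615
deletions = 271_117

commits_per_day = commits / days            # average over the nine days
net_lines_added = insertions - deletions    # insertions minus deletions
lines_changed = insertions + deletions      # total churn
churn_per_commit = lines_changed / commits  # average lines changed per commit

print(commits_per_day)          # 300.0
print(net_lines_added)          # 367498
print(lines_changed)            # 909732
print(round(churn_per_commit))  # 337
```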

The near-failure point came the night before launch. Koc said he was trying to sleep around 1 a.m. while the tests were not passing. He wondered whether he had “vibed too hard,” his phrasing for pushing agentic development past the point of control. The team did bring the codebase back together, and the unlikely safety net was the kind of unit test that AI “loves to generate.” Those tests were “awful” in one sense, because they overfit to the existing code. But after the team ripped the system apart, those same overfit tests became useful. If they went green, the team knew it was at least “somewhat close.”

That is the practical meaning of Koc’s “blind faith in the harness.” At this speed, the tests are not a ceremonial last step. They are part of how humans reestablish orientation after a codebase has been changed faster than a person can fully read.
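
Tests that overfit to existing behavior resemble what is often called a characterization (or golden-master) test: they pin whatever the code currently does rather than checking a specification. The sketch below illustrates that general technique, not OpenClaw’s actual suite; format_channel_id and the recorded values are hypothetical.

```python
import unittest


def format_channel_id(provider: str, channel: str) -> str:
    """Hypothetical function standing in for code that survived a refactor."""
    return f"{provider.strip().lower()}:{channel.strip().lower()}"


class CharacterizationTest(unittest.TestCase):
    """Pins current outputs exactly, rather than testing a specification."""

    # Recorded from the pre-refactor behavior (illustrative values).
    GOLDEN = {
        ("Slack", "General"): "slack:general",
        (" MSTeams ", "Ops"): "msteams:ops",
    }

    def test_outputs_unchanged(self) -> None:
        for (provider, channel), expected in self.GOLDEN.items():
            with self.subTest(provider=provider, channel=channel):
                self.assertEqual(format_channel_id(provider, channel), expected)


def run_suite() -> bool:
    """Run the pinned tests and report whether they all went green."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CharacterizationTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Such tests are brittle during ordinary feature work, since any intended behavior change breaks them. After a structural refactor that is supposed to preserve behavior, that brittleness is the point: green means the moved code still does what it did before.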

The factory is built from swim lanes, not magic prompts

When people ask Koc for the “magic sauce,” his answer is deliberately plain: many Codex sessions arranged into swim lanes. The number can be five, ten, or twenty, but the organizing principle is that different classes of work occupy different lanes.

His working surface becomes a kind of factory floor: many coding sessions visible at once, mentally divided into lanes by risk, urgency, and supervision load. The layout is less important than the segmentation. The human operator needs a way to distinguish work that can run with minimal attention from work that needs active conversation.

In one example, CI might sit in one lane, features in another, bugs in a third. If the codebase is stable and Koc wants to refactor tests, lanes one and two may get that work. They do not need close babysitting; he can tell them to take their time, make the tests pass, commit, and push through. Lanes three and four might handle more specific feature or issue work, such as Docker or one of the messaging channels, where he remains in conversation with the agents as they investigate and report back. A fifth lane might watch new P0 and P1 issues, use GitHub or other data, or ingest activity from Discord during a release: “what’s happened in the last two hours that I need to be paying attention to?”
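
The lane example can be written down as a small data structure. This is a sketch under assumptions: Koc describes the lanes as a mental division of his screen, not as code, and the Supervision levels and Lane type below are hypothetical labels for the distinctions he draws.

```python
from dataclasses import dataclass
from enum import Enum


class Supervision(Enum):
    UNATTENDED = "unattended"          # can run with minimal attention
    CONVERSATIONAL = "conversational"  # operator stays in the loop
    WATCH = "watch"                    # monitors incoming signals


@dataclass(frozen=True)
class Lane:
    number: int
    work: str
    supervision: Supervision


# Lane assignments follow the example in the text; the structure is assumed.
LANES = [
    Lane(1, "refactor tests", Supervision.UNATTENDED),
    Lane(2, "refactor tests", Supervision.UNATTENDED),
    Lane(3, "Docker issue", Supervision.CONVERSATIONAL),
    Lane(4, "messaging channel issue", Supervision.CONVERSATIONAL),
    Lane(5, "new P0/P1 issues and Discord activity", Supervision.WATCH),
]


def needing_attention(lanes: list[Lane]) -> list[int]:
    """Return lane numbers the operator should actively converse with."""
    return [
        lane.number
        for lane in lanes
        if lane.supervision is Supervision.CONVERSATIONAL
    ]
```

Here needing_attention(LANES) returns [3, 4], which matches the text: the operator’s attention goes to the conversational lanes while the others run or watch.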

The limiting factor in this setup is not always token supply. Koc says tokens are “no longer the problem,” depending on whom one asks. The constraints become raw compute and his own brain space: the capacity to monitor many semi-autonomous sessions without losing track of which ones are healthy, confused, dangerous, or simply not worth continuing.

His tooling is pragmatic and not always ideal. He uses git worktrees heavily, and says he “kind of” wishes he had not. Every PR he touches can become a new worktree, leaving him with 70 or 80 active git worktrees on his machine in a day. That became “hell” when combined with a heavy test harness. He built additional support around his Codex sessions so they understand git worktrees, can self-heal if he hits escape or a process crashes, and can recover sparse-checkout state.

But he does not present this as a universal best practice. He says he probably should have followed the simpler approach used by Steinberger and others: clone the repository ten times and point ten Codex sessions at those clones. The broader claim is that his system is not based on exotic prompting modes. “I don’t use plan mode or spec mode,” he said. “I have a conversation with the agent, and we work through it.”
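
The two layouts differ mainly in what each agent session is pointed at. The sketch below only builds the git command lines for each approach and does not run them; the branch names, directory names, and helper functions are hypothetical, while git worktree add -b and git clone are standard git commands.

```python
from pathlib import PurePosixPath


def worktree_commands(
    pr_numbers: list[int], root: PurePosixPath, base: str = "main"
) -> list[list[str]]:
    """One `git worktree add` per PR: a new branch in its own directory."""
    return [
        ["git", "worktree", "add", "-b", f"pr-{n}", str(root / f"pr-{n}"), base]
        for n in pr_numbers
    ]


def clone_commands(
    repo_url: str, root: PurePosixPath, sessions: int
) -> list[list[str]]:
    """The simpler layout: N independent clones, one per agent session."""
    return [
        ["git", "clone", repo_url, str(root / f"clone-{i}")]
        for i in range(1, sessions + 1)
    ]
```

Worktrees share one object store, so they are cheap to create but accumulate quickly when every PR gets its own. Independent clones cost more disk and fetch time, but each session owns a fixed directory and there is no shared worktree state to repair.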

Reading reasoning tokens is a management skill

One of the central skills in Koc’s account is not writing prompts. It is knowing when an agent is “bullshitting you.”

He compares his monitoring posture to the scene in The Matrix where experienced operators read cascading code as meaningful reality. His version is less cinematic but similar in spirit: after enough exposure, he says, he can “feel the reasoning tokens.” He recognizes when a session sounds off not necessarily because of the code it is editing, but because of how it explains itself. It starts waffling. It fails to make sense. It no longer seems to know what it is doing.

It doesn’t sound off because of what it’s doing, it sounds off because of how it’s explaining itself to me. It’s waffling. It’s not making sense.

Vincent Koc

Koc sees this as analogous to managing people. If an employee began clearly bluffing, a manager would stop and ask what was going on. With agents, his response may be to kill the session, leave that part of the code to another maintainer, or return to it days later. The judgment is intuitive, but he says the intuition was built through the sheer volume of “tokenmaxxing” he did in the previous year.

That is why he pushes back on the question “How do you manage 10+ agents?” by rewriting it as “How do you manage 10+ staff?” The questioner, he says, had no answer. Koc had worked in large organizations, including airlines, and had managed AI teams of 30 or 40 people. For him, the management problem was not conceptually new. For engineers scaling coding agents, however, the missing skill may be less about compilers or prompts and more about soft management: how to ask what is happening, how to detect evasion or confusion, and how to run the factory without being consumed by it.

Process turns agentic coding from noise into engineering

Koc calls his loop an “agent development environment,” or ADE. It includes a registry of reusable “skills,” which he likens to dotfiles and calls .skills. He says both his dotfiles and dot-skills are available on GitHub, though some skills remain private. One public example is a skill for writing technical documentation, co-created with developer-experience and other engineers.

The loop is not just writing a skill and using it. Koc describes a “skills gym,” including tools such as Geppetto, which he contributes to, and a process of sending Codex through prior session logs to improve a skill based on recent use. He can then deploy that improved skill into OpenClaw or his personal environment, using something he referred to as “vercel skills.sh” as part of the loop. He has added testing and other elements on top, but the important point is that these skills are maintained like engineering assets, not treated as one-off prompts.

The same applies to pull request management. Koc showed a graph representation of a PR with 73 nodes and 106 edges, produced from semantic graphing and vector embeddings over GitHub data. This was his attempt to address a common maintainer problem: there are too many PRs and issues to inspect directly. A person joins the project and decides to cluster everything. Another person sends a different flavor of the same PR or issue. The result can become “utter noise.”

The useful signal, for Koc, is pressure. If enough related issues cluster around the same underlying problem, that may mean many other “clankers” have decided it is important. That can tell him where to focus. OpenClaw may not call this a roadmap, but he says there is a process for deduplicating, consuming, and prioritizing the incoming work.
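
One way to turn embeddings into a pressure signal is to link items whose vectors are similar and count the size of each connected group. This is a sketch under assumptions, not Koc’s pipeline: he did not describe his method in detail, and the cosine threshold, the single-link grouping, and the toy 2-D vectors below are all illustrative.

```python
import math
from collections import defaultdict


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster(
    embeddings: dict[str, list[float]], threshold: float = 0.9
) -> list[list[str]]:
    """Group items whose embeddings are linked by similarity >= threshold."""
    parent = {key: key for key in embeddings}

    def find(key: str) -> str:
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    keys = list(embeddings)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        groups[find(key)].append(key)
    # Largest cluster first: size is the "pressure" signal.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Toy 2-D vectors standing in for real embeddings of issue or PR text.
ITEMS = {
    "issue-1": [1.0, 0.0],
    "issue-2": [0.98, 0.05],
    "pr-3": [0.95, 0.1],
    "issue-4": [0.0, 1.0],
}
```

For the toy data, cluster(ITEMS) puts issue-1, issue-2, and pr-3 in one group and leaves issue-4 alone. The pairwise comparison is quadratic in the number of items, so a real backlog would likely need an approximate nearest-neighbor index instead.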

Evals also enter the system after the refactor. Koc says the team built a “fake Slack of sorts,” using both synthetic models and real models, so they could run evaluation loops to check that providers and channels still worked. This is where his two worlds — structured evals and high-velocity agentic open source — become more visibly connected. The factory needs instrumentation, even if it does not look like a conventional enterprise software process.
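
A minimal sketch of an eval loop against a fake channel. Everything here is assumed for illustration: FakeChannel, run_eval, and synthetic_model are hypothetical names, the substring check is the simplest possible scorer, and the real setup Koc describes also runs real models.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeChannel:
    """In-memory stand-in for a chat channel; no network calls."""

    name: str
    sent: list[str] = field(default_factory=list)

    def send(self, text: str) -> None:
        self.sent.append(text)


def run_eval(
    channel: FakeChannel,
    model: Callable[[str], str],
    cases: list[tuple[str, str]],
) -> float:
    """Feed each prompt to the model, post the reply, score substring matches."""
    passed = 0
    for prompt, must_contain in cases:
        reply = model(prompt)
        channel.send(reply)
        if must_contain in reply:
            passed += 1
    return passed / len(cases) if cases else 0.0


def synthetic_model(prompt: str) -> str:
    """Deterministic stub playing the role of a synthetic model."""
    return f"ack: {prompt}"
```

Because the channel records everything sent to it, the same run can check both the score and the exact traffic a provider produced, without touching a real workspace.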

The next constraint is token efficiency

Koc’s closing claim is that the locus of advantage has shifted. It is “not the model,” and “not the agent.” It is the process. His shorthand for the transition is that 2025 was about “tokenmaxxing”: running lots of tokens through agents and learning what raw volume could do. In 2026, he argues, the challenge is to avoid wasting them.

2025 was about tokenmaxxing. 2026 is about not wasting them.

Vincent Koc

That claim does not make agents less central. It makes the process around them more central. Koc’s OpenClaw example is full of brute-force scale: dozens of agents, thousands of commits, hundreds of thousands of insertions, most of the core codebase touched. But his engineering argument is narrower and more practical. The value comes from swim lanes, harnesses, tests, work isolation, reusable skills, PR clustering, eval loops, and the managerial judgment to stop a session when it starts to drift.

The dark factory is not a place where humans disappear. In Koc’s account, humans move up a level and become responsible for taste, triage, supervision, and refusal. The output may be too large to read line by line, but the work is still shaped by process. Without that process, cheap tokens produce bloat. With it, they can support changes at a scale that conventional review habits are not built to absorb.
