Orply.

Tool-Call Repairs Let DeepSeek v4 Beat Opus 4.7 in Internal Evals

Shawn Wang · Ahmad Awais · Latent Space · Saturday, June 6, 2026 · 14 min read

Ahmad Awais, founder of CommandCode.ai, argues that many open models appear weak at coding-agent work because the harness around them mishandles tool schemas, design instructions and user preferences. Drawing on Command Code’s internal logs and evals, he says small deterministic repairs to tool inputs helped DeepSeek v4 Pro beat Opus 4.7 in six of ten internal comparisons. His broader case is that “taste” — explicit contracts for tools, design patterns and developer habits — can narrow the gap between cheaper open models and frontier coding systems without changing the model itself.

The open-model tool-calling problem may be a harness problem

Ahmad Awais says Command Code’s recent gains with DeepSeek came from a narrow, practical diagnosis: many open models that look bad at tool calling may instead be running into harness and schema failures that a coding agent can repair. The headline result, DeepSeek v4 Pro beating Opus 4.7 six out of ten times, came from Command Code’s internal evals, by Awais’s account. He also made a point of saying the comparison was to Opus 4.7, not Opus 4.6, which he described as “a better model.”

The claim came out of Command Code’s own usage. Awais said the system was handling “a couple billion tokens a day” when he began comparing DeepSeek v4 Pro against Opus 4.7. Later, he described the current volume only roughly, saying Command Code was doing “anywhere from 600 billion tokens right now.” In those logs, he found repeated tool-call failures that made models appear slow, stubborn, or unusable in coding-agent settings.

The basic case is familiar to anyone building an agent harness. A user asks how authentication works in a repository. The agent needs to list directories, read files, inspect code, and answer. Those operations are mediated through tools with schemas: shell commands, file reads, file writes, edits. When the model emits a malformed argument, the tool validator rejects it. With DeepSeek v4 Pro, Awais said, returning a raw Zod error often did not help. The model would send the same wrong tool call repeatedly.

He described this as “tool confusion.” In one example, a tool accepted optional parameters, but the model sent an empty object or null where omission was expected. Zod rejected it. The model retried the same call. Across large runs, Awais said this could mean “50 plus” tool-call failures in a session, hidden from users by some coding-agent interfaces.
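A minimal sketch of that loop, assuming a hypothetical `listDir` tool whose `depth` field is optional. The hand-rolled `validateListDir` function stands in for a schema library such as Zod and is not Command Code's code. It only illustrates why `null` fails where omission passes, and why resending the same payload cannot succeed.

```typescript
// Hypothetical stand-in for a schema validator such as Zod.
type Issue = { path: string[]; message: string };

// Assumed schema: `path` is required; `depth` is optional (omit it, do not send null).
function validateListDir(input: Record<string, unknown>): Issue[] {
  const issues: Issue[] = [];
  if (typeof input.path !== "string") {
    issues.push({ path: ["path"], message: "expected string" });
  }
  if (input.depth !== undefined && typeof input.depth !== "number") {
    const received = input.depth === null ? "null" : typeof input.depth;
    issues.push({ path: ["depth"], message: `expected number, received ${received}` });
  }
  return issues;
}

const omitted = validateListDir({ path: "src" }); // passes: field left out
const sentNull = validateListDir({ path: "src", depth: null }); // fails on "depth"
const retried = validateListDir({ path: "src", depth: null }); // same call, same failure
```

Because validation is deterministic, a model that resends the identical arguments gets the identical issue every time, which is the repeated-failure pattern described above.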

“DeepSeek v4 Pro has this weird alpha male energy where whatever it sends you, it thinks that that is the right thing to do.”

Ahmad Awais

Awais was careful to mark part of his explanation as intuition. His “hot take” was that some open models may be trained in a way that teaches them to treat their own outputs as correct because they learn from data presented as high quality, sometimes from stronger models. But the engineering fix he described did not depend on that theory. It depended on looking at failures and repairing them deterministically.

In the post Awais showed, he framed the underlying thesis plainly: “why open model bad at tool calling” is “almost always a harness problem, not a model problem.” After building a tool-input repair layer, he wrote, “deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals.”

Four small repairs covered most of the failures

The failure modes, in Awais’s telling, were not random. In his post and explanation, he reduced them to a small compositional set that repeated across DeepSeek Flash, DeepSeek v4 Pro, GLM, Qwen, Kimi, and MiniMax.

| Failure mode | Example from the source | Repair approach Awais described |
| --- | --- | --- |
| Optional field sent as null | `null` sent where the field should be omitted | Repair the invalid value only after the validator identifies the issue path |
| Array encoded as a JSON string | `["a","b"]` emitted as a string instead of an array | Parse the stringified array, with this repair ordered before bare-string wrapping |
| Placeholder wrapper used incorrectly | A single argument wrapped in `[{ }]` where the schema expected an array | Apply one of the small ordered repairs when the validator localizes the mismatch |
| Bare string sent where an array was expected | `"foo"` sent instead of `["foo"]` | Use the repair layer to make the input match the expected array shape |

Awais’s reported catalogue of recurring tool-input failures in open models

The repair layer began as “30 to 100 lines” per repair, organized like database migrations: one small repair file per known pattern. The important detail was not just that Command Code fixed the input. It also returned the tool result along with a note explaining what had been repaired and what the model should have sent.

Awais’s analogy was teaching someone to drive. If they are about to hit another car, first you prevent the crash; afterward, you explain what they should have done. In tool-calling terms, the agent still gets the file contents, command output, or write result it needed, but it also receives a model-readable correction.

He said this often changed behavior quickly after the model received a repaired result. Instead of seeing dozens of repeated validator failures, the model sees progress, learns the expected contract, and continues.
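One way to picture "result plus correction" is a tool response that carries the normal output followed by a short note. This is a sketch only: the `AppliedRepair` shape, the `formatToolResult` name, and the wording of the note are assumptions, since the source does not show the actual message format.

```typescript
// Hypothetical shapes; the real note format is not shown in the source.
type AppliedRepair = { path: string; received: string; expected: string };

// Returns the tool output unchanged when nothing was repaired,
// otherwise appends a model-readable explanation of each repair.
function formatToolResult(output: string, repairs: AppliedRepair[]): string {
  if (repairs.length === 0) return output;
  const notes = repairs.map(
    (r) => `- ${r.path}: received ${r.received}; next time send ${r.expected}`,
  );
  return [output, "", "[tool input was repaired]", ...notes].join("\n");
}

const message = formatToolResult("src/\nlib/\nREADME.md", [
  { path: "depth", received: "null", expected: "a number, or omit the field" },
]);
const untouched = formatToolResult("ok", []);
```

The model still receives the directory listing it asked for, so the session makes progress, and the trailing note states the contract it missed.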

One example was readFile. If the model asked to read a file but failed to specify an offset, Command Code might infer that the first read should return the first 100 lines. If the file was actually a log and the model needed the last 100 lines, Awais said the model could adapt once it had a successful result rather than being trapped in a loop of schema errors.
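A rough sketch of that default, assuming a hypothetical `readWindow` helper with `offset` and `limit` arguments. The 100-line default follows the description above. Returning `totalLines` and a `defaulted` flag is an added assumption, included so the model has enough information to ask for a different window.

```typescript
// Hypothetical argument and result shapes for a readFile-style tool.
type ReadArgs = { offset?: number; limit?: number };
type ReadResult = { lines: string[]; offset: number; totalLines: number; defaulted: boolean };

function readWindow(allLines: string[], args: ReadArgs): ReadResult {
  const defaulted = args.offset === undefined; // the harness filled in the offset
  const offset = args.offset ?? 0;
  const limit = args.limit ?? 100;
  return {
    lines: allLines.slice(offset, offset + limit),
    offset,
    totalLines: allLines.length,
    defaulted,
  };
}

// A 250-line log: the first call omits the offset, the second asks for the tail.
const log = Array.from({ length: 250 }, (_, i) => `line ${i + 1}`);
const first = readWindow(log, {});
const tail = readWindow(log, { offset: first.totalLines - 100 });
```

The first call succeeds with lines 1 to 100 instead of failing validation. With a successful result in hand, the model can see the file length and request the last 100 lines on its next call.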

Another example was stranger. Awais showed DeepSeek Flash sometimes emitting a file path as a Markdown autolink:

filePath: [Users/x/proj/notes.md](http://notes.md)

The writeFile tool then created files literally named like Markdown links until the pattern was caught. Awais argued this was not a hallucination in the usual sense, but “the post-training chat distribution leaking through” into a tool argument. The fix, as he described it, was a small deterministic cleanup.
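A deterministic cleanup for this case could look like the following. The regular expression and the `stripMarkdownLink` name are illustrative assumptions rather than Command Code's actual patterns. The sketch only rewrites a value when the entire string is a single Markdown link, and is meant for path-typed arguments only.

```typescript
// Matches a whole-string Markdown link such as [some/path.md](http://path.md).
// Hypothetical pattern; the source only says the fix was a small cleanup.
const MARKDOWN_LINK = /^\[([^\]]+)\]\(https?:\/\/[^)\s]*\)$/;

// Returns the link text when the argument is a Markdown link, else the input unchanged.
function stripMarkdownLink(filePath: string): string {
  const match = MARKDOWN_LINK.exec(filePath.trim());
  return match?.[1] ?? filePath;
}

const cleaned = stripMarkdownLink("[Users/x/proj/notes.md](http://notes.md)");
const plain = stripMarkdownLink("Users/x/proj/notes.md");
```

Anchoring the pattern to the start and end of the string keeps ordinary paths that merely contain brackets from being rewritten.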

Validate first, repair only where the schema complains

The design choice Awais emphasized was the order of operations. His first attempt was preprocessing: normalize inputs before validation by stripping nulls, parsing stringified arrays, and applying similar transformations. That broke quickly.

The reason was silent corruption. If writeFile content happened to be a JSON string, a greedy preprocessor might rewrite the file contents before they ever reached disk. A smoke test could miss that because the tool call would appear to succeed.
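The failure is easy to reproduce with a deliberately naive preprocessor. The `greedyNormalize` function below is a hypothetical illustration of the rejected approach, not code from Command Code. It strips nulls and parses anything that looks like a JSON array, without knowing which field it is touching.

```typescript
// Hypothetical greedy preprocessor: normalizes every field before validation.
function greedyNormalize(input: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(input)) {
    if (value === null) continue; // strip nulls everywhere
    if (typeof value === "string" && value.trim().startsWith("[")) {
      try {
        out[key] = JSON.parse(value); // parse anything that looks like an array
        continue;
      } catch {
        // not valid JSON: fall through and keep the original string
      }
    }
    out[key] = value;
  }
  return out;
}

// A legitimate writeFile call whose content happens to be JSON text.
const call = { filePath: "data.json", content: '["a","b"]' };
const normalized = greedyNormalize(call);
```

Here `content` was valid as sent, yet it comes out as an array instead of the text the model asked to write. Nothing throws, so a test that only checks whether the call succeeded would not notice.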

The fix was to invert the flow: validate first, then repair only after failure, and only at the paths identified by the validator. Awais described the approach as:

  • Parse the input as-is.
  • If validation succeeds, ship it untouched.
  • If validation fails, walk the validator’s issue list.
  • For each issue path, try the known repairs in order.
  • If repair succeeds, log tool_input_repaired:${toolName} and proceed.
  • If repair fails, log tool_input_invalid:${toolName} and return a model-readable retry message.

That ordering matters because valid inputs are never touched. The schema, not the repair layer, defines what is wrong. Awais described the structural insight this way: preprocessing encodes a prior about what is broken; letting the validator complain first makes “the schema the prior,” and spends repair budget only where expectation and reality misalign.
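The flow above can be sketched end to end. Everything here is an assumption made for illustration: the tiny `validate` function stands in for a real schema library, the `grep` tool and its fields are hypothetical, and only two repairs are included. The log strings follow the `tool_input_repaired` and `tool_input_invalid` names quoted above.

```typescript
type FieldType = "string" | "number" | "string[]";
type Schema = Record<string, { type: FieldType; optional?: boolean }>;
type Input = Record<string, unknown>;
type Issue = { path: string; expected: FieldType; optional: boolean };

function fits(value: unknown, type: FieldType): boolean {
  if (type === "string[]") return Array.isArray(value) && value.every((v) => typeof v === "string");
  return typeof value === type;
}

// Stand-in validator: returns one issue per field that is missing or mistyped.
function validate(schema: Schema, input: Input): Issue[] {
  const issues: Issue[] = [];
  for (const [path, field] of Object.entries(schema)) {
    const optional = field.optional ?? false;
    const value = input[path];
    const bad = value === undefined ? !optional : !fits(value, field.type);
    if (bad) issues.push({ path, expected: field.type, optional });
  }
  return issues;
}

// A repair returns a corrected copy, or undefined when it does not apply.
type Repair = (input: Input, issue: Issue) => Input | undefined;

const omitNull: Repair = (input, issue) => {
  if (!issue.optional || input[issue.path] !== null) return undefined;
  const copy = { ...input };
  delete copy[issue.path];
  return copy;
};

const parseStringifiedArray: Repair = (input, issue) => {
  const value = input[issue.path];
  if (issue.expected !== "string[]" || typeof value !== "string") return undefined;
  try {
    const parsed: unknown = JSON.parse(value);
    return Array.isArray(parsed) ? { ...input, [issue.path]: parsed } : undefined;
  } catch {
    return undefined;
  }
};

const repairs: Repair[] = [omitNull, parseStringifiedArray];

type Outcome = { ok: true; input: Input } | { ok: false; retryMessage: string };

function prepareToolInput(toolName: string, schema: Schema, raw: Input, log: string[]): Outcome {
  const issues = validate(schema, raw);
  if (issues.length === 0) return { ok: true, input: raw }; // valid input ships untouched

  let current = raw;
  for (const issue of issues) {
    let fixed: Input | undefined;
    for (const repair of repairs) {
      const candidate = repair(current, issue);
      // Accept a repair only if the validator no longer complains at this path.
      if (candidate && !validate(schema, candidate).some((i) => i.path === issue.path)) {
        fixed = candidate;
        break;
      }
    }
    if (fixed === undefined) {
      log.push(`tool_input_invalid:${toolName}`);
      const hint = issue.optional ? `${issue.expected} or omitted` : issue.expected;
      return { ok: false, retryMessage: `${toolName}: "${issue.path}" must be ${hint}` };
    }
    current = fixed;
  }
  log.push(`tool_input_repaired:${toolName}`);
  return { ok: true, input: current };
}

const schema: Schema = {
  pattern: { type: "string" },
  include: { type: "string[]", optional: true },
  maxResults: { type: "number", optional: true },
};
const log: string[] = [];
const repaired = prepareToolInput("grep", schema, { pattern: "auth", include: '["src","lib"]', maxResults: null }, log);
const rejected = prepareToolInput("grep", schema, { pattern: 42 }, log);
```

Repairs run only for the paths the validator names, so a field that already validates is never rewritten. That is the property the preprocessing approach lacked.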

The order of repairs also mattered. His post noted that JSON-array parsing must run before bare-string wrapping; otherwise the stringified array `"[\"a\",\"b\"]"` would be wrapped whole into `["[\"a\",\"b\"]"]` rather than parsed into `["a","b"]`. Even a tiny repair catalogue can introduce bugs if the transformations are not sequenced carefully.
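The sequencing problem can be shown with two small value-level repairs. The function names and the `firstRepair` helper are hypothetical; the sketch applies the first repair in the list that produces a result, once in each order.

```typescript
// A value-level repair returns an array, or undefined when it does not apply.
type Repair = (value: unknown) => unknown[] | undefined;

const parseStringifiedArray: Repair = (value) => {
  if (typeof value !== "string") return undefined;
  try {
    const parsed: unknown = JSON.parse(value);
    return Array.isArray(parsed) ? parsed : undefined;
  } catch {
    return undefined; // not JSON at all, e.g. a bare word
  }
};

const wrapBareString: Repair = (value) => (typeof value === "string" ? [value] : undefined);

// Applies the first repair in the list that returns a result.
function firstRepair(value: unknown, ordered: Repair[]): unknown[] | undefined {
  for (const repair of ordered) {
    const result = repair(value);
    if (result !== undefined) return result;
  }
  return undefined;
}

const modelSent = '["a","b"]'; // an array encoded as a JSON string
const correct = firstRepair(modelSent, [parseStringifiedArray, wrapBareString]);
const wrong = firstRepair(modelSent, [wrapBareString, parseStringifiedArray]);
const bare = firstRepair("foo", [parseStringifiedArray, wrapBareString]);
```

With parsing first, `correct` is a two-element array and `bare` still falls through to wrapping. With wrapping first, `wrong` is a one-element array holding the raw JSON text, because the wrapper accepts any string.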

The same architecture also produced telemetry. Command Code can log per-tool repair rates and watch them change over time. Awais gave an example of repair rates dropping from 50% to 10% on /notes, and of detecting when a model regresses on a specific tool before users feel the full failure.
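Per-tool repair rates need little more than counters. The sketch below assumes a hypothetical `RepairStats` class and a simple `hasRegressed` threshold check. How Command Code actually stores or alerts on these numbers is not described in the source.

```typescript
type Outcome = "valid" | "repaired" | "invalid";
type Counts = { calls: number; repaired: number; invalid: number };

// Hypothetical in-memory counter keyed by tool name.
class RepairStats {
  private readonly byTool = new Map<string, Counts>();

  record(toolName: string, outcome: Outcome): void {
    const counts = this.byTool.get(toolName) ?? { calls: 0, repaired: 0, invalid: 0 };
    counts.calls += 1;
    if (outcome === "repaired") counts.repaired += 1;
    if (outcome === "invalid") counts.invalid += 1;
    this.byTool.set(toolName, counts);
  }

  // Fraction of calls to this tool that needed a repair (0 when never called).
  repairRate(toolName: string): number {
    const counts = this.byTool.get(toolName);
    return counts && counts.calls > 0 ? counts.repaired / counts.calls : 0;
  }
}

// Flags a tool whose repair rate rose by more than `margin` since the baseline.
function hasRegressed(baselineRate: number, currentRate: number, margin = 0.1): boolean {
  return currentRate - baselineRate > margin;
}

const stats = new RepairStats();
for (let i = 0; i < 10; i++) stats.record("notes", i < 5 ? "repaired" : "valid");
const rate = stats.repairRate("notes");
```

Comparing the rate for one model and one tool across time windows is what would surface a drop from 50% to 10%, or a regression in the other direction.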

The broader claim was not that the model improved. It did not. The contract changed. Awais wrote that “the harness is where you mediate between distributions”: four small repairs, a couple of regexes for autolinks, one relational default, and one prefix change made the interface more forgiving exactly where it needed to be.

When errors disappear, the model’s behavior changes

Awais’s reported result was not merely fewer exceptions. He said the “vibe of the model completely changes” when tool-call errors drop. Models become less blocked, more creative, and able to explore longer.

He gave a related observation about permissions. In his view, coding agents often perform worse when permission prompts interrupt the flow. Even if a user keeps accepting each request, the slowdown and repeated intervention can steer the model badly. He sees a similar effect with tool errors: repeated failures change the model’s trajectory.

One Command Code user, according to Awais, ran DeepSeek for 12-hour-plus sessions and had used about 70 billion tokens, enough to break Command Code’s usage page. Awais said he had not personally run sessions like that; he presented the anecdote as part of why reducing tool confusion made open-model coding sessions feel more viable.

Shawn Wang pressed on whether this was specific to DeepSeek or a broader open-model pattern. Awais said he first assumed it was DeepSeek-specific, then reviewed 30 days of logs and found similar behavior in Kimi. Command Code then fixed Kimi models and MiniMax models, and Awais said the team now had “16,000 different repair variations” across hundreds of billions of tokens.

He also argued that the perception gap around DeepSeek — some users saying it was excellent, others saying it was slow or bad — often comes from the harness. If a developer swaps the base URL and API key in a Claude-oriented coding agent, they may be using a harness built around Claude’s forgiving tool behavior. Claude, Awais said, is “really, really lenient with tool calls” and can often infer what to do after a malformed error. Open models may not. A harness that works for Claude may therefore make DeepSeek look worse than it is.

Awais said Command Code made some of the repair logic “completely open” so it could be implemented in any coding harness. His closing view on this thread was that “skill issue” applies to the harness more often than to the model.

The same repair idea extends to design slop

Awais then extended the same pattern beyond tool calls: many AI-generated designs look bad not because models cannot write CSS, but because they lack a design contract.

Command Code’s /design work, as he described it, is an attempt to encode design taste into the harness. Awais said the system had 16 modes, 24 reference documents, and more than 4,500 lines of encoded design taste from designers. It reads a codebase, identifies what is broken, and edits real files — “no figma,” “no markdown mockups.”

The specific target was what Awais called AI design slop: the familiar blue-violet or indigo gradient, feature-card grids, glassy blur, oversized stats, centered stacks, and default template-like layouts. He was precise that he likes purple; in his phrasing, the problem is “indigo slop.”

Awais said Command Code talked with designers and asked them to label AI-generated interfaces. The tells, in his account, were surprisingly small in number and accounted for roughly 90% of the “this looks AI-generated” signal.

| AI-design tell | Description from the source |
| --- | --- |
| tech gradient | Blue-violet glossy treatment applied broadly |
| generic tech hue | Indigo used because the surface is “software” |
| feature tile grid | Icon, heading, and sentence repeated with equal weight |
| accent rail | A colored stripe on a card edge used as decoration pretending to be organization |
| unearned blur | Glassmorphism without a depth system |
| stat monument | Oversized numbers filling space where a product story belongs |
| icon topper | Rounded-square icon above every heading as template filler |
| bounce everywhere | Elastic easing used because the API exposes it, not because it has purpose |
| default type | Whatever font the training distribution currently favors |
| center stack | Everything centered because no composition decision was made |

The design smells Awais said designers identified in AI-generated UIs

The deeper issue, in Awais’s framing, is compositional rather than cosmetic. Models often choose layout before they choose purpose. A dashboard and a landing page have different jobs, but an LLM may reach for the same centered hero and cards because that is the mode of the training distribution.

Command Code’s response was “work-pattern-first composition.” Before touching visual properties, the agent has to identify what kind of surface it is designing.

| Pattern | Purpose |
| --- | --- |
| Monitor | Status boards, alerts, metrics, live priority |
| Operate | Command bars, canvases, inspectors, direct manipulation |
| Compare | Tables, matrices, split views, ranked lists |
| Configure | Grouped settings, forms, previews, commit areas |
| Learn | Article flow, walkthrough rhythm, progressive sections |
| Decide | Focused pitch, proof, risk reduction, one dominant action |
| Explore | Search, filters, maps, galleries, reversible discovery |

The seven surface patterns Command Code’s design skill asks the agent to choose from before designing

When Wang asked where the seven patterns came from, Awais said they came from conversations with designers rather than from a single book or formal canon. A dashboard, for example, is a Monitor surface: its purpose is status, alerts, metrics, and live priority. A landing page is a Decide surface: proof, risk reduction, and one dominant action.

Awais also described a concrete CSS preference: forcing LLMs to use oklch() for color. He said he personally struggled with the CSS color function, but LLMs “understand it super well.” In his observation, models do not control lightness as well in HSL, while oklch() lets them manage palettes more coherently. Wang added that this was part of the reason oklch() exists: color theory advances, and CSS functions change with it.
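To make the lightness point concrete, here is a small sketch that builds a tonal scale by stepping only the lightness channel of `oklch()` while holding chroma and hue fixed. The `oklchScale` helper, its 95% to 25% range, and the sample hue are illustrative assumptions, not values from Command Code.

```typescript
// Builds `steps` CSS colors from light to dark at a fixed chroma and hue.
// The lightness endpoints are arbitrary choices for this example.
function oklchScale(hue: number, chroma: number, steps: number): string[] {
  const lightest = 95;
  const darkest = 25;
  return Array.from({ length: steps }, (_, i) => {
    const t = steps === 1 ? 0 : i / (steps - 1); // 0 at the lightest step, 1 at the darkest
    const lightness = Math.round(lightest + (darkest - lightest) * t);
    return `oklch(${lightness}% ${chroma} ${hue})`;
  });
}

const scale = oklchScale(250, 0.12, 5);
```

Because OKLCH is designed so that its lightness channel tracks perceived lightness, a palette can be adjusted by changing one number per swatch. That is harder in HSL, where the same `l` value can look noticeably lighter or darker depending on hue.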

The design point mirrored the tool-calling point. The problem is often not raw capability. It is a contract gap between a vague user instruction like “make it prettier” and the structured taste a good designer would apply before touching the implementation.

Taste is a claim about learned micro-decisions, not just rules

The broader product idea behind Command Code is “taste”: a repository-local, continuously updated record of preferences learned from how a person or team actually works. Awais framed it as a way to avoid repeatedly correcting an agent on the same small choices.

He traced this to his own background. He said he has been coding for roughly 27 years and has published more than 300 open-source repositories. Much of his work is on cutting-edge projects where there may be no documentation for an agent to retrieve. In that setting, he said, his own opinions can matter more than anything an LLM can find through RAG.

Taste began as a way to encode those opinions. If Command Code sees Awais consistently use pnpm for package management but npm for globally linking a local CLI, it should learn that distinction. If he repeatedly uses tsup rather than another bundler, or vitest rather than another test framework, the agent should stop asking or guessing.

A Taste.md file Awais showed included examples under # cli:

  • use pnpm as the package manager for CLI projects, confidence 1.00
  • use TypeScript for CLI projects, confidence 0.95
  • use tsup for bundling CLI applications, confidence 0.95
  • use vitest for writing and running tests, confidence 0.95
  • use clack for interactive user input in CLI projects, confidence 0.90
  • prefer clack prompts over raw command-line arguments, confidence 0.85
  • default to lowercase -v for the version command, confidence 0.85

Awais said this CLI taste file was generated after building many CLIs with Command Code. It can be pulled into a repository with npx taste pull, then used by any coding agent: “follow my taste of building CLIs,” build the requested tool, and show taste compliance at the end.
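A harness that consumes such a file might keep only high-confidence entries when building a prompt. The sketch below assumes each entry is a single line ending in `, confidence <number>`, mirroring how the entries are listed above. The real Taste.md syntax is not shown in the source, so the pattern, the `parseTaste` name, and the 0.9 cutoff are all hypothetical.

```typescript
type TasteEntry = { preference: string; confidence: number };

// Assumed line format: "<preference>, confidence <0..1>", with an optional list marker.
const ENTRY = /^\s*(?:[-*]\s+)?(.+?),\s*confidence\s+(\d(?:\.\d+)?)\s*$/i;

function parseTaste(text: string): TasteEntry[] {
  const entries: TasteEntry[] = [];
  for (const line of text.split("\n")) {
    const match = ENTRY.exec(line);
    const preference = match?.[1];
    const confidence = match?.[2];
    if (preference !== undefined && confidence !== undefined) {
      entries.push({ preference, confidence: Number(confidence) });
    }
  }
  return entries; // headings and other non-matching lines are skipped
}

const sample = [
  "# cli",
  "use pnpm as the package manager for CLI projects, confidence 1.00",
  "use TypeScript for CLI projects, confidence 0.95",
  "prefer clack prompts over raw command-line arguments, confidence 0.85",
].join("\n");

// Keep only entries at or above a hypothetical confidence cutoff.
const strong = parseTaste(sample).filter((entry) => entry.confidence >= 0.9);
```

Filtering by confidence is one plausible use of the score: stronger preferences become firm instructions, while weaker ones could be offered as suggestions or left out to save context.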

Command Code’s own framing contrasts taste with hand-written rules. In that framing, rules are what a developer remembers to write down; they tend to be broad and become stale. Taste is meant to capture micro-decisions: the exact PR workflow a developer repeats, the local debug flag that should not appear in help output, the branch checkout habit after sending a PR. Awais said he would never manually write many of these small preferences into a skill file, but they improve workflow when the agent learns them.

| Dimension | Rules | Taste |
| --- | --- | --- |
| Source | What you write down | Continuously learned from you |
| Updates | When you remember | Every session |
| Granularity | Broad guidelines | Micro-decisions |
| Trajectory | Decays | Compounds |
| Over time | Drifts from reality | Compounds accuracy |

Command Code’s own comparison of hand-written rules and learned taste

Awais called taste a “meta-neuro-symbolic model” and described it as the highest-order layer managing skills and rules. If an LLM already knows something, he said, it should not end up in a taste or skill file because that is wasted context. The useful layer is what differs from the model’s default distribution and what recurs in the developer’s actual work.

Transparency became necessary because merging taste is a human decision. Awais said the team initially hid the learned preferences and simply compared Command Code against Claude Code, producing a “wow” moment for developers. But when many engineers work across many branches, taste files diverge. Command Code cannot know which learned preferences should survive a merge. So the files live in the repository, appear in pull requests, and can be edited by humans.

Awais also described an emerging workflow among users: build the first version of a project with a high-quality model — he named Opus and “GPT 5.5” as examples in that user pattern — generate a taste file, then continue development with much cheaper models guided by that taste.

Command Code is being positioned as hackable, but curated

Command Code comes out of a longer arc in Awais’s work. He said that after receiving early GPT-3 access from Greg Brockman and Sam Altman in July 2020, his stated use case was suggesting the next line of code or a code snippet — more than a year before GitHub Copilot. He started building a CLI called CLI AI, which eventually became part of Langbase, an AI cloud that Awais said reached 1.2 billion agent runs a month. The team later pivoted toward the view that “there is only one type of agent and that is a coding agent.”

Command Code is now the product expression of that view: a full coding agent supporting commercial and open models, with particular traction around open models because their tool-call brittleness creates more need for harness-level mediation.

The company’s site describes a $5 million raise to build “the first coding agent that continuously learns your coding taste.” Awais said the team plans to open source Command Code soon, possibly around the AI engineering conference in San Francisco, if he can work through the quirks of a six-year-old repository. His stated goal is to make the tool “completely hackable,” so developers can modify any part of it regardless of the company’s business model.

He located Command Code among three philosophies for model access and coding agents. One approach is “Windows,” where every game works; he compared OpenRouter to that model-access posture. Another is “Linux,” where users build their own drivers; he compared LiteLLM to that. Command Code, he said, is aiming for something more like Apple: not every model, but a curated set of the best open and closed models, with enough hackability for users to add a local model if they want.

That curation is part of Awais’s product argument: Command Code is not trying to become a list of 1,500 models and make users decide. He said Qwen 2.5 Max had become the second most used model on Command Code within two or three days of release, while DeepSeek remained central to the current usage story. Wang noted that DeepSeek had announced plans around a DeepSeek coding agent. Awais connected that to commenters tagging DeepSeek researchers and asking why DeepSeek was not building one, then said a hiring announcement followed later; he did not establish that the comments caused the announcement.

Awais’s final point was deliberately portable. The repair ideas do not require Command Code. The same tool-input repair logic, design-contract framing, and taste-file approach can be used in other coding harnesses.
