Fast Coding Models Require Smaller Tasks and Continuous Validation

Sarah ChiengAI EngineerFriday, May 22, 202613 min read

Sarah Chieng of Cerebras argues that fast coding models such as Codex Spark, which she says can generate code at roughly 1,200 tokens per second, require more disciplined developer workflows rather than looser ones. In her account, a 20x speedup over models such as Sonnet and Opus makes old habits — large prompts, unattended agents, delayed validation, and sprawling context — produce technical debt faster than developers can inspect it. Her playbook is to use speed for bounded execution, continuous testing and linting, variant generation, stricter permissions, and external memory that keeps short sessions from losing the plan.

Fast inference turns old coding habits into faster technical debt

Sarah Chieng argues that fast coding models do not simply make existing developer workflows more convenient. They change the risk profile of those workflows. Habits that formed around slow generation — giant prompts, one-shot attempts, huge commits, unattended agent swarms — become more dangerous when the model can produce code far faster than a human can read it.

Chieng framed Codex Spark, built by Cerebras with OpenAI, as the first example of a broader regime developers should expect: coding models that generate at roughly 1,200 tokens per second. She contrasted that with the Sonnet and Opus families, which she said typically generate code at about 40 to 60 tokens per second. The issue is not only that Spark is faster; it is that a 20x speedup removes the waiting time that used to constrain bad process.

1,200 tokens/s

Codex Spark generation speed cited by Chieng

Recent coding-model speeds, in Chieng’s account, had remained relatively stable even as models became larger, smarter, and capable of handling more context. She placed many familiar models in a broad range of roughly 50 to 150 tokens per second. Codex Spark forced a different scale: around 1,200 tokens per second, far outside the range developers had built habits around.

The same prompt discipline, review discipline, and context discipline that were already necessary become non-negotiable when output accelerates. As Chieng put it, developers who previously produced “50 tokens per second of bad code” risk producing “1,200 tokens per second of bad code” unless they change how they work.

Unless we change our habits, we are not gonna have good code in the future.

Sarah Chieng · Source

The practical implication is counterintuitive: fast models require slower developers. Not slower in throughput, but slower in the sense of being more deliberate, more bounded, and more willing to steer, validate, prune, and restart sessions before the model drifts.

The speedup is coming from the whole inference stack, not a single trick

Sarah Chieng attributed the jump in coding speed to optimization across the entire AI inference stack. Hardware, model architecture, and runtime inference techniques are all being reworked at once, which is why she argued developers should treat Codex Spark less as an isolated product and more as an early signal of a category shift.

At the hardware layer, she focused on the “memory wall.” In her explanation, a large share of inference latency comes from memory movement: weights and KV-cache values being moved between memory and the chip. She said hardware and memory movement can account for 50% to 80% of inference latency. On a traditional Nvidia GPU setup, memory is stored off-chip in HBM, creating a memory-bandwidth bottleneck.

Cerebras’s approach, as Chieng described it, is to move memory closer to compute. The Cerebras WSE-3 wafer was presented as having 900,000 cores, with memory distributed across the chip in SRAM so each core has direct access to the values it needs. Chieng also named Groq as another company thinking about reducing the cost of memory movement.

Chieng then described disaggregated inference, which she said has become commercialized in recent months. Traditional inference runs both prefill and decode on the same hardware. Prefill processes the user’s input tokens, embeds them, and adds them to the KV cache; because that work can happen in parallel, Chieng described it as compute-bound. Decode generates output token by token; because it is sequential, she described it as memory-bound. Disaggregated inference splits these phases across different systems: compute-optimized hardware for prefill and memory-optimized hardware for decode. As examples of the commercial importance of this pattern, Chieng cited Nvidia buying Groq for $20 billion and Cerebras partnering with AWS to serve the Cerebras wafer and AWS Trainium together.

At the model-architecture layer, she pointed to mixture-of-experts models. Rather than activating the entire model for every token, a mixture-of-experts architecture activates only a subset of experts. Chieng described this as a way to get the intelligence of a larger model while keeping per-token compute closer to that of a smaller one. She also highlighted Router-weighted Expert Activation Pruning, or REAP, as a way of pruning experts that are not being activated for a given use case, reducing model size and memory overhead while preserving generative performance.

At the runtime layer, she cited inference optimization work from companies including Together, Baseten, Modal, and Fireworks. The example she used was KV-cache reuse: storing and reusing previously computed token representations so the system does not need to recompute attention over the sequence at every step.

Output speed, in this framing, is not merely a UX enhancement. It is the visible result of coordinated changes across hardware, architecture, and serving systems. Developers are the next layer that has to adapt.

Agent swarms look productive until nobody can explain the code

Sarah Chieng used examples of AI coding setups circulating on social platforms to illustrate the current temptation: six Claude Code terminals running in parallel, 11 concurrent Codespaces, hundreds of agents, eight agents across five screens, multiple MVPs being built at once. She acknowledged the appeal. If developers spend time on Twitter or LinkedIn, she said, the ambient message is that anyone not working this way is “living in the Stone Age.”

Her objection was not to parallelism in itself. It was to parallelism without verification. In the setups she described, the apparent productivity comes from generating large amounts of code at once, often faster than a developer can inspect or understand. The result, she argued, is “massive amounts of code that nobody is verifying.”

That problem gets worse as inference speeds rise. Slow generation at least imposed friction: developers had time to wait, notice, interrupt, or reconsider. With fast models, unattended workflows can produce technical debt at a scale Chieng said developers have not dealt with before.

The alternative she proposed is not to stop using agents, but to reduce the number of concurrent threads and increase the amount of steering. Instead of ten agents running while the developer context-switches away, she recommended two to three sessions at most, and only for genuinely independent tracks such as frontend and backend work or separate services. The developer should be “sitting next to” the code, reviewing diffs, understanding what is changing, and keeping context fresh.

Fast generation changes the ideal interaction from batch job to pair programming. Chieng argued that developers should no longer treat a coding model as something they prompt, leave, and return to after a delay. With real-time output, the model can answer questions, collect repo context, explain complexity, and propose alternatives while the developer remains in control.

The AI should always be helping you make decisions, not the other way around.

Sarah Chieng · Source

That distinction matters in her playbook. The developer is not meant to become a passive approver of whatever the model produces. The developer supplies the goal, narrows the scope, asks why something fails, decides when an implementation is wrong, and tells the model what to change next.

Model choice becomes a three-variable decision: intelligence, cost, and speed

Sarah Chieng said developers have historically tended to choose coding models based on intelligence, and sometimes cost. With a 20x spread in generation speed, speed becomes a third axis. The practical recommendation is to use larger, slower models for planning and longer-horizon reasoning, then use faster models for execution.

Chieng distinguished GPT-5.3-Codex from GPT-5.3-Codex-Spark by task fit, speed, and context. GPT-5.3-Codex was described as better for long-horizon workflows, running at about 80 tokens per second with a 400K-token context window. GPT-5.3-Codex-Spark was described as better for real-time collaboration, running at about 1,200 tokens per second with a 128K-token context window.

Model	Best for	Speed	Context
GPT-5.3-Codex	Long-horizon workflows	~80 tokens/s	400K tokens
GPT-5.3-Codex-Spark	Real-time collaboration	~1,200 tokens/s	128K tokens

Chieng’s model-selection framework distinguishes planning work from fast execution work.

The workflow she recommended is to ask a larger Codex model to produce the plan, then spawn Codex Spark sub-agents to execute individual steps. In her example, the larger model receives a broad implementation request and generates a plan. Spark agents then handle bounded tasks such as updating documentation, researching, or cleaning up.

Chieng also recommended converting successful sessions into “skills.” The idea is to capture trajectories that produced good results and turn them into repeatable workflows. A larger model can handle the initial difficult task; once the pattern works, that path can be saved as a skill and reused by a faster model in the background. She emphasized three properties of this approach: skills are repeatable workflows, predefined workflows are more consistent, and they can happen in the background.

The point is orchestration. A faster model is not always the right model for every part of the job, just as the largest model is not always the right model for every action. Planning, execution, review, cleanup, and repetition each have different speed and reasoning requirements.

Instant validation raises the standard for every step

One of Sarah Chieng’s strongest claims was that fast inference makes validation “basically free.” At 1,200 tokens per second, she argued, developers no longer have a good excuse for waiting until the end of a task to run tests, lint, review diffs, or verify UI behavior.

Her examples included pre-commit hooks, test suites, readiness reports, diff reviews, linting, type systems, and browser-based QA automation. The tools are not new. The change is that a fast coding model can configure and run them continuously without turning the workflow into a long wait. Chieng’s recommendation was to add validation to every step rather than treat it as a final gate before pushing code.

The same logic applies to refactoring. Chieng said LLMs “love to produce new code,” so every line that does not prove its worth should be cut. With a fast model, developers can ask for cleanup after each checklist item rather than postponing cleanup until just before commit. Examples included deleting unused imports, removing unnecessary lines, and making function structure consistent.

The purpose of refactoring in her framing is not only prettier code. It also helps the developer maintain a mental model of what changed and where responsibilities belong. Small refactors after small tasks are easier to understand than a large cleanup pass after a sprawling generation session.

Generating many options lets the developer select for taste

Sarah Chieng’s most concrete example of speed creating a new workflow was “cherry picking.” With slower models, asking for one implementation of a navigation sidebar might take five minutes and produce one acceptable result. With Codex Spark, she suggested asking for many variants in the same time budget and choosing the best.

Her example prompt asked for a navigation sidebar with four tabs and a Tokyo Night theme. Instead of settling for one version, a developer could ask Spark to generate 15 versions. More aggressively, the developer could spawn five sub-agents, each generating 15 variants, producing 75 options to inspect.

75 versions

example output from five sub-agents generating 15 variants each

Chieng argued that this is especially useful for domains where variety matters: research directions, architectural options, and graphic design. The value is not that the model suddenly has taste. In fact, she said the opposite: models themselves do not have taste, and AI-written UI or prose is often easy to recognize.

Her claim was that quantity can partially compensate for that weakness. Instead of manually designing an example, hunting for examples, or writing such a detailed prompt that the developer might as well do the task, the developer can generate enough candidates to choose from. Chieng described this as a way to “artificially induce taste” into the output: the taste comes from selection, not from the model’s intrinsic judgment.

That makes the developer’s role sharper. The model explores; the developer curates. The faster the model, the more practical it becomes to use generation as a search process rather than a one-shot production process.

Guardrails should get tighter as models get faster

Sarah Chieng repeatedly tied speed to smaller scopes and stricter permissions. In the slow-model era, developers often asked for more per session because each interaction was expensive. In the fast-model era, she argued, the better pattern is to ask for less, watch more closely, and correct the model as it works.

Her examples of permissions and restrictions were deliberately concrete: ban file deletion, including “no rm -rf”; allow the model only to read and edit rather than create new files; set a maximum diff size; require tests and types to pass before commit; require a commit message. Her slide suggested max diff limits such as no more than 150 lines of code, or even around 30 lines unless explicitly asked.

Guardrail	Purpose
Ban deletion commands such as no rm -rf	Prevent destructive actions
Read and edit only; no new files	Limit scope expansion
Max diff size such as ≤150 LOC or smaller unless asked	Keep review manageable
No commit until tests pass, types are correct, and a commit message is included	Make verification a precondition

Chieng’s examples of tighter permissions for fast coding agents.

She also recommended using steering instructions in the middle of a session: “Only change ____,” “Don’t touch types yet,” “Show me the diff before continuing,” or “Research ____.” The model can generate tests, implement against those tests, integrate and run them, fix failures, and produce a final implementation, but the developer should sequence those actions.

The underlying premise is that fast interaction makes micro-management viable. When each turn is slow, developers are tempted to bundle instructions and hope. When each turn is fast, the developer can make the task smaller, inspect the intermediate result, and adjust immediately.

Context management becomes urgent when compaction arrives in 30 seconds

Sarah Chieng closed the technical playbook with context management because speed compresses the time it takes to hit context limits. If a slow workflow used to take ten minutes to fill a context window, she said, dividing that by 20 means developers can encounter compaction in roughly 30 seconds. The risk is that earlier constraints degrade or get lost just as the model is producing code at high speed.

Her high-level rule was to break large tasks into small, bounded goals. She offered a mental model for context usage: from 0% to 30%, the model is sharp and follows instructions precisely; from 30% to 60%, it remains good but may show minor drift on complex tasks; from 60% to 80%, compaction is likely and earlier constraints begin to degrade; from 80% to 100%, compaction is guaranteed and the developer should start a new conversation.

Context usage	Model behavior
0–30%	Sharp and follows instructions precisely
30–60%	Good, with minor drift on complex tasks
60–80%	Compaction likely; earlier constraints begin to degrade
80–100%	Compaction guaranteed; start a new conversation

Chieng’s context-usage framework for deciding when to keep working and when to restart.

To make small sessions workable, Chieng proposed a four-file external memory system: AGENTS.md, PLAN.md, PROGRESS.md, and VERIFY.md.

File	Role in the external memory system
AGENTS.md	Role-specific context, commands, and guides
PLAN.md	Checklist-style definition of done
PROGRESS.md	Current work, changes, failures, and next steps
VERIFY.md	Exact commands to run before moving on or opening a PR

The four-file memory system Chieng used to keep fast sessions bounded without losing continuity.

AGENTS.md stores role-specific context, commands, and guides for agents and sub-agents. PLAN.md contains the checklist-style definition of done created at the beginning of the work. PROGRESS.md records what is being worked on, what has changed, what failed, and what steps remain. VERIFY.md contains the exact commands the model should run to confirm the work before a pull request or before moving to the next item.

The benefit is continuity without stuffing every previous turn into the active context. A new session can read PROGRESS.md, determine what has already happened, and pick up the next bounded task. It can read VERIFY.md to know how to prove the task is complete. The session starts over, but not from scratch.

Her example combined the earlier model-orchestration pattern with this memory system. First, use GPT-5.3-Codex to plan a website that shows calendar invites across Google, Outlook, and Zoom. Then switch to GPT-5.3-Codex Spark and instruct it to read PLAN.md and VERIFY.md, implement only checklist item one, keep changes minimal and localized, show the git diff, run verification commands, and update PROGRESS.md with results and the next step.

That pattern captures the broader argument: fast execution works best when surrounded by persistent plans, explicit verification, narrow scopes, and clean handoffs.

The better developer experience is conditional

Sarah Chieng did not present fast coding models as a reason to relax engineering discipline. She presented them as a chance to make disciplined workflows less painful. If validation, refactoring, explanations, variant generation, and review all become fast enough to do continuously, then developers can improve code quality without staring at a screen for 30 minutes between turns.

The Codex commands she highlighted support that workflow: /permissions to choose what Codex can do, /experimental to toggle experimental features, /skills to improve performance on specific tasks, /review to find issues in current changes, /rename to rename a thread, /new to start a new chat, /resume to resume a saved chat, and /fork to fork the current chat.

The commands were secondary to the discipline around them. Fast models make real-time collaboration possible, but they also make sloppy interaction more consequential. The developer has to choose the right model for the phase, create repeatable skills from good trajectories, validate continuously, generate alternatives when selection matters, limit concurrency, tighten permissions, refactor as a habit, and externalize memory before context decay becomes invisible.

Inference and Deployment Agents and Autonomy AI Infrastructure and Compute Human-AI Interaction Model Releases Coding Assistants