Orply.

Distributed RL Let Composer Match Frontier Coding Models With Smaller-Model Speed

Cursor’s Federico Cassano and Fireworks’ Dmytro Dzhulgakov argue that Composer’s advantage comes from specializing a model for software engineering inside Cursor rather than spending capacity on general-purpose behavior. Starting from an open-source base, Cursor used mid-training and reinforcement learning against its own product environment, while Fireworks supplied the distributed infrastructure needed to make agent rollouts, weight synchronization, and inference efficient enough to run at scale. Their case is that application companies with enough product-specific usage, tools, and feedback can build models that are better, faster, and cheaper for their own workflows than larger general models.

Cursor’s model strategy is to spend every bit on Cursor

Composer 2 starts from a narrow claim about model capacity: if a model can store only so much information in its weights, Cursor should spend that capacity on software engineering inside Cursor, not on being generally useful across every domain.

Federico Cassano describes the model as “sort of like a storage drive.” It has a finite number of bits it can store. Cursor’s target is not even coding in the abstract. It is the behavior of a coding agent operating inside Cursor’s product, using Cursor’s tools, responding to Cursor’s interaction patterns, and producing useful work in that setting. If the capacity is finite, his argument is that Cursor can allocate it more efficiently by specializing the model around that exact environment.

We care about software engineering inside Cursor and inside Cursor only.
Federico Cassano

The claimed benefit is not only task fit. Cassano says Composer is “order of magnitude less expensive than Opus and other coding models” because Cursor can specialize the weights and serve a smaller or more efficient model for the job.

Dmytro Dzhulgakov treats Cursor as an example of a broader application-company pattern. AI products often start with off-the-shelf models, prompt engineering, and a working harness. But the most leveraged attributes of the application eventually become the usage data, the tool surface, and the product-specific environment: which tools are available, how the harness works, and what the product actually asks the model to do. Prompting can capture some of that. It cannot fully teach a model how to behave inside the product.

Dzhulgakov frames the optimization as a three-way tradeoff among quality, speed, and cost. Infrastructure optimization can move that frontier, and Fireworks does that for customers, but model training pushes it further. A specialized model can be better, faster, and cheaper than a larger general model serving the same application.

Cassano makes the point through tool use. Some agent tools are difficult to describe succinctly in a prompt. Their correct use is not just a matter of reading an instruction; it is a behavior learned in context. Cursor does serve Composer a prompt, but Cassano says the training is designed so that the model would still know what to do without one, because the desired behavior has been pushed into the model itself.

This is not presented as a rejection of scaling. Asked whether specialization cuts against the “bitter lesson” intuition that larger models trained broadly tend to improve on everything, Cassano says no. The large labs train heavily on code too; they are not merely generalizing to programming by accident. Cursor’s move, in his account, is to push hard on the data dimension for a constrained objective. If model capacity is finite, then saturating that capacity for one environment requires freeing the weights from distractions.

Cursor worked top down: mid-training and RL first, pre-training later

Federico Cassano says Composer 2 started from Kimi 2.5, which he describes as a one-trillion-parameter mixture-of-experts model with 30 billion active parameters. Cursor then pushed on two axes: continual pre-training, or mid-training, and reinforcement learning. Composer 1 had primarily pushed on reinforcement learning; Composer 2 combined large-scale code-token mid-training with large-scale RL over many tasks.

1T parameters
Kimi 2.5 base model size described by Cassano

Mid-training and RL serve different purposes. In mid-training, the model is still doing next-token prediction. It learns code libraries, common code patterns, and some world knowledge, with web data also present. That stage broadens the distribution on which reinforcement learning can later sharpen behavior.

RL is where the model plays directly with the Cursor harness. It learns the world it will “live in,” as Cassano puts it: tool calling, navigation, and writing correct code. Mid-training teaches code; RL teaches the demand that the code be correct in the setting where the agent is acting. Cassano says Cursor tries to build its mid-training data from code that is largely correct, but the model does not necessarily learn to distinguish correct from incorrect code through next-token prediction alone. RL tunes that ability directly.

The choice not to pre-train from scratch was deliberate. Sonya Huang notes that Cursor sits in the middle of many interesting coding tokens, giving it unusual access to data for large-scale training. Cassano answers that Cursor approached the problem top down: what gets a useful model into users’ hands fastest? Starting from pre-training, then working up through mid-training and reinforcement learning, would have delayed user value. By beginning with a strong open-source base and specializing from there, Cursor could ship sooner. Cassano says future Composer versions will “hopefully” be Cursor’s own model rather than based on an open-source base.

There is also a separation between Cursor’s tab-completion model and Composer. The tab model is small because it must be extremely low latency. Composer is much larger. Cassano says the model that comes out of mid-training has a similar shape of competencies to a next-token model like tab autocomplete; what separates the two base models is scale: tab must be fast, while Composer is built for agentic work.

Agent RL trains whole product sessions, not isolated predictions

Dmytro Dzhulgakov draws a sharp line between ordinary training and reinforcement learning for an agent. In pre-training or mid-training, the model predicts the next token. In RL, the system runs the model through an entire environment, observes the outcome, assigns reward, and feeds that signal back into the weights.

A rollout, in this setting, is not a single forward pass. It is an entire simulated Cursor agent session. A prompt comes in; the model decides which tools to call; tools are executed; the model sees results; it writes or modifies code; the interaction can continue for dozens of turns. Only after that session does the system compute a reward, which may be based on an LLM judge or a verifiable signal such as whether code compiles.
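
The shape of one such rollout can be sketched as a loop. This is a minimal illustration, not Cursor's harness: `Rollout`, `run_rollout`, the scripted `toy_policy`, the single `read_file` tool, and the compile-based `toy_reward` are all hypothetical stand-ins chosen to keep the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical names for illustration only; none come from Cursor or Fireworks.
Action = Tuple[str, str]  # (tool name, argument); the tool name "done" ends the session


@dataclass
class Rollout:
    prompt: str
    turns: List[Tuple[Action, str]] = field(default_factory=list)  # (action, observation)
    reward: float = 0.0


def run_rollout(
    prompt: str,
    policy: Callable[[str, List[Tuple[Action, str]]], Action],
    tools: Dict[str, Callable[[str], str]],
    reward_fn: Callable[[Rollout], float],
    max_turns: int = 50,
) -> Rollout:
    rollout = Rollout(prompt)
    for _ in range(max_turns):
        action = policy(prompt, rollout.turns)        # model picks the next tool call
        name, arg = action
        if name == "done":
            break
        observation = tools[name](arg)                # harness executes the tool
        rollout.turns.append((action, observation))   # model sees the result next turn
    rollout.reward = reward_fn(rollout)               # reward is computed once, at the end
    return rollout


# Toy usage: a scripted "policy" that reads one file and then stops.
files = {"main.py": "print('hello')"}


def toy_policy(prompt, turns):
    return ("read_file", "main.py") if not turns else ("done", "")


def toy_reward(rollout):
    # Verifiable signal: does the last observed source at least compile?
    try:
        compile(rollout.turns[-1][1], "<rollout>", "exec")
        return 1.0
    except (IndexError, SyntaxError):
        return 0.0


result = run_rollout("Inspect main.py", toy_policy, {"read_file": files.__getitem__}, toy_reward)
```

The point of the sketch is the placement of the reward: nothing is scored per token or per turn, only the finished session.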

That makes the training system heterogeneous. It still needs the large-scale trainer: tens of thousands of GPUs, forward and backward propagation, model updates, and all the machinery of standard training. But it also needs inference capacity to run the agent, environments that resemble user machines, tool execution, reward computation, and orchestration between all of those components.

The naïve version is simple and wasteful: pause the trainer, run many rollouts, collect outcomes, then pause inference and update the model. That is cleaner algorithmically because the rollouts are tightly matched to a particular model version. But it leaves large amounts of compute idle.

Fireworks and Cursor instead pipeline the process. Dzhulgakov compares it to a factory with a trainer building and a rollout building. Both are always working. Rollouts use recent model versions to simulate new sessions; the trainer consumes outcomes as they arrive and computes updates. The price is staleness: by the time a rollout finishes, the model may already have changed. That complicates the training dynamics. The benefit is utilization: GPUs keep working, and the system reaches a better model faster because it does not leave half the available capacity unused.

| RL design choice | Algorithmic advantage | Systems cost or benefit |
| --- | --- | --- |
| Pause trainer while rollouts run | Cleaner match between rollout data and model version | Inference and training capacity sit idle in alternating phases |
| Pipeline trainer and rollouts | Trainer can consume outcomes continuously | Introduces staleness because weights may change before a rollout finishes |
| Use recent model versions for distributed rollouts | More simulated sessions can be generated in parallel | Requires fast weight synchronization and careful handling of stale data |

As Dzhulgakov describes it, asynchronous RL trades exact synchronization for higher utilization of expensive compute.
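
The staleness tradeoff can be made concrete with a toy discrete-time simulation. Everything here is an assumption for illustration: the `simulate` function, the fixed rollout duration, and especially the `max_staleness` cutoff, which is one simple way to handle stale data and not necessarily what Cursor or Fireworks do.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    policy_version: int   # weights the rollout was generated with
    finish_step: int      # trainer step at which the rollout finishes


def simulate(num_steps: int, rollout_duration: int, max_staleness: int):
    """Toy pipeline: one rollout starts per trainer step and takes rollout_duration steps."""
    in_flight = deque()
    used, dropped = [], []
    for step in range(num_steps):
        version = step  # assume the trainer publishes new weights every step
        # Rollout side: start a new session with the most recent weights.
        in_flight.append(Sample(policy_version=version, finish_step=step + rollout_duration))
        # Trainer side: consume whatever has finished, without waiting for the rest.
        while in_flight and in_flight[0].finish_step <= step:
            sample = in_flight.popleft()
            staleness = version - sample.policy_version
            (used if staleness <= max_staleness else dropped).append(staleness)
    return used, dropped


used, dropped = simulate(num_steps=10, rollout_duration=3, max_staleness=4)
strict_used, strict_dropped = simulate(num_steps=10, rollout_duration=3, max_staleness=2)
```

In this toy, every finished rollout is three versions behind the trainer. A tolerant cutoff keeps all of them; a strict one throws all of them away, which is the utilization-versus-freshness tension in miniature.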

Federico Cassano says this matters because Cursor is not operating at the scale of the largest labs. Cursor has “tens of thousands of GPUs, not millions,” so it has to be serious about performance. He says Cursor trains in production with FP4 and worked with Fireworks to push inference efficiency because RL infrastructure includes all the requirements of pre-training plus the extra burden of environments and inference.

One of Cassano’s more specific claims concerns the common belief that RL spends far more inference FLOPs than training FLOPs. He calls that “sort of like just because the open source inference engines are very unoptimized,” not an inherent property of RL. In theory, if GPUs are pushed efficiently, inference should require roughly one-third as many GPUs as training: training is effectively three forward passes, while inference is one forward pass, assuming critical batch size is reached.
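
The one-third figure can be read as a back-of-the-envelope ratio. The sketch below assumes, beyond what Cassano states, that the backward pass costs about twice a forward pass, that both sides process the same number of tokens, and that both run at similar hardware utilization; the symbols $C$ (compute per token) and $G$ (GPU count) are introduced here for illustration.

```latex
\begin{align*}
C_{\text{train}} &\approx C_{\text{fwd}} + C_{\text{bwd}} \approx C_{\text{fwd}} + 2\,C_{\text{fwd}} = 3\,C_{\text{fwd}},\\
C_{\text{infer}} &\approx C_{\text{fwd}},\\
\frac{G_{\text{infer}}}{G_{\text{train}}} &\approx \frac{C_{\text{infer}}}{C_{\text{train}}} \approx \frac{1}{3}.
\end{align*}
```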

Distributed infrastructure made the RL loop economically possible

Composer 2’s RL run was globally distributed because the components of RL do not all need the same kind of hardware. Training requires a tightly connected cluster with high-speed networking and synchronized operation. Those clusters are expensive and difficult to find at large size. Inference, by contrast, can run on smaller GPU groups, different GPU generations, and hardware in different regions.

Federico Cassano says Composer 2 used four clusters in total, located far apart. Cursor kept one cluster for training and distributed the inference component across smaller clusters around the world. It even used some production inference capacity during low-usage periods: when Composer 1.5 traffic was light, GPUs serving production could be redirected to accelerate RL training.

Dmytro Dzhulgakov describes this as exploiting the heterogeneity of RL. If training and inference can be disaggregated, the system does not need a single much larger contiguous cluster. It can use cheaper or more available hardware for inference, scale inference up and down more easily, and balance production traffic against simulated RL traffic.

The hard part is weight synchronization. Dzhulgakov says the Kimi model is about one terabyte, and a training step takes roughly five to fifteen minutes. A straightforward design would produce a new one-terabyte snapshot every few minutes and ship it to clusters on the other side of the world. That has to happen fast enough that asynchronous rollouts do not become too stale.

The key observation was that RL does not change all weights equally at every step. Especially as training proceeds, the adjustments are precise, and the delta between model versions after a training step can be much smaller than the full model. Fireworks and Cursor built a compression and synchronization system around that structure. Dzhulgakov says the delta could be about twenty times smaller than the full model, making global synchronization practical.

| Constraint | Value or approach described |
| --- | --- |
| Base model snapshot | About 1 terabyte, according to Dzhulgakov |
| Training step duration | Roughly 5 to 15 minutes, according to Dzhulgakov |
| Distributed run | Four clusters across distant regions, according to Cassano |
| Synchronization strategy | Ship compressed deltas rather than full snapshots |
| Delta size | About 20× smaller than full-model transfer in Dzhulgakov’s description |
| Inference pause for weight swap | About 30 seconds, according to Dzhulgakov |

The distributed RL system depended on moving model updates fast enough to keep inference clusters close to the trainer.

Dzhulgakov says the system was built losslessly: the model on the remote side ends up bit-equivalent, rather than approximately synchronized. In ordinary conditions, synchronization could complete in under a minute; even in worse conditions, under a few minutes. The actual inference pause to swap weights was around thirty seconds. Cassano adds that they saturated cluster egress by sharding uploads and downloads.
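
A lossless delta can be illustrated with a deliberately simple stand-in: XOR the old and new snapshots, then compress the mostly-zero result. The actual Fireworks and Cursor compression scheme is not described in the source; `make_delta`, `apply_delta`, the XOR-plus-zlib approach, and the toy weights are all assumptions made for this sketch.

```python
import struct
import zlib


def make_delta(old: bytes, new: bytes) -> bytes:
    """XOR the two snapshots byte by byte, then compress; unchanged bytes become zeros."""
    if len(old) != len(new):
        raise ValueError("snapshots must have the same size")
    xored = bytes(a ^ b for a, b in zip(old, new))
    return zlib.compress(xored)


def apply_delta(old: bytes, delta: bytes) -> bytes:
    """Rebuild the new snapshot from the old one plus the compressed XOR delta."""
    xored = zlib.decompress(delta)
    return bytes(a ^ b for a, b in zip(old, xored))


# Toy "model": 100,000 float32 weights; one step nudges every 50th weight.
old_weights = [((i * 37) % 1000) / 1000.0 for i in range(100_000)]
new_weights = list(old_weights)
for i in range(0, len(new_weights), 50):
    new_weights[i] += 1e-3

old_blob = struct.pack(f"<{len(old_weights)}f", *old_weights)
new_blob = struct.pack(f"<{len(new_weights)}f", *new_weights)

delta = make_delta(old_blob, new_blob)
restored = apply_delta(old_blob, delta)
```

Because XOR is its own inverse, `restored` is byte-for-byte identical to `new_blob`, which is the bit-equivalence property described above. The compression ratio in this toy depends entirely on how sparse the update is and says nothing about the ratio achieved in the real system.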

The systems benefit is economic as much as technical. Conventional wisdom, in Dzhulgakov’s description, would put everything in one huge RDMA-connected cluster. That makes moving a terabyte of weights easier but forces the whole workload onto expensive interconnected hardware. If the inference engine is efficient, fewer inference GPUs are needed; if inference is distributed, some of those GPUs can live elsewhere on cheaper or more available hardware.

A separate infrastructure problem appears only after the distributed RL loop is running: the inference system and the trainer must agree on the probabilities associated with generated tokens. During inference, the model produces log probabilities for sampled tokens. Because training is asynchronous, the trainer later reruns a forward pass to reproduce those log probabilities and update the model correctly. In theory, if the model version is the same, the log probabilities should match. In practice, Cassano says, they can be slightly or sometimes very different.

Dzhulgakov traces the problem to floating-point arithmetic and accumulation order. With integers, A + B + C and C + B + A give the same result. With floating-point approximations, addition order can change the result. Neural networks perform enormous numbers of additions and multiplications, so small differences can be amplified through millions or billions of operations.
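
The order dependence is easy to reproduce in ordinary double-precision arithmetic. The variable names below are illustrative; the behavior is standard IEEE 754 floating point.

```python
# Same three numbers, two addition orders.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
right_to_left = (c + b) + a

# A more extreme case: a small term is absorbed by a large one.
forward = (1e16 + 1.0) + -1e16     # 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost
reordered = (1e16 + -1e16) + 1.0   # the large terms cancel first, so the 1.0 survives

print(left_to_right == right_to_left)
print(forward, reordered)
```

A GPU kernel that splits a reduction differently depending on batch size or hardware is effectively choosing a different parenthesization of the same sum.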

For normal inference, this often does not matter. A pretrained model is usually robust enough that minor numerical differences do not change benchmark performance in a meaningful way. RL is different because the training signal is weak and sensitive. Noise from numerical mismatch can make training inefficient or cause it to fail.

Mixture-of-experts models amplify the issue. In an MoE layer, a gating operation chooses a subset of experts for a token. Dzhulgakov gives the example of selecting eight experts out of 384. A tiny numerical difference in hidden states can change which expert is just above or below the cutoff. If inference activates expert seven but training believes expert nine was activated, the trainer updates a part of the model that did not actually contribute to the rollout.
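
A toy router shows how little it takes to flip the selection. The scores, the size of the perturbation, and the `top_k_experts` helper are invented for illustration; only the 8-of-384 shape comes from Dzhulgakov's example.

```python
def top_k_experts(scores, k):
    """Return the indices of the k largest router scores (ties broken by lower index)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(ranked[:k])


NUM_EXPERTS, TOP_K = 384, 8

# Toy router scores: experts 0..6 are clear winners; experts 7 and 9 sit right at the cutoff.
scores = [0.0] * NUM_EXPERTS
for i in range(7):
    scores[i] = 1.0
scores[7] = 0.5000001
scores[9] = 0.5000000

inference_choice = top_k_experts(scores, TOP_K)

# The trainer recomputes the same scores with a tiny numerical difference.
trainer_scores = list(scores)
trainer_scores[9] += 2e-7
trainer_choice = top_k_experts(trainer_scores, TOP_K)
```

Here a difference of 2e-7 in one score is enough for inference to route through expert 7 while the trainer believes expert 9 was used.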

The fixes live at the boundary of algorithms and systems. Dzhulgakov says one can make GPU kernels batch-invariant and carefully ensure that additions occur in the same order, but that can slow the system by a factor of two or three. The practical question is how much slowdown is worth taking to eliminate most of the divergence. Cassano confirms that they wrote GPU kernels to address this class of problem.

For MoE specifically, they used a trick called route replay. Inference passes additional information to training: which expert was activated for a given token. That small piece of information lets the trainer align with the inference path. More broadly, numerical alignment required matching quantization levels, kernels, and implementation details to reduce divergence between training and inference. Dzhulgakov says those differences can be the line between a run diverging completely and a run being multiple times more compute efficient.
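
The idea behind route replay can be sketched with a toy layer that accepts a recorded routing decision. This is an assumption-laden illustration: `moe_forward`, its `replay_experts` parameter, and the scalar "experts" are hypothetical, and the layer averages expert outputs where a real MoE layer would weight them by gate values.

```python
def moe_forward(x, router_scores, experts, k, replay_experts=None):
    """Toy MoE layer: average the outputs of the selected experts.

    If replay_experts is given, reuse that selection instead of re-running top-k.
    Returns (output, selected_expert_indices).
    """
    if replay_experts is None:
        ranked = sorted(range(len(experts)), key=lambda i: (-router_scores[i], i))
        selected = sorted(ranked[:k])
    else:
        selected = sorted(replay_experts)
    output = sum(experts[i](x) for i in selected) / len(selected)
    return output, selected


experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Inference picks experts using its own router scores and records the choice.
inference_scores = [0.9, 0.1, 0.5000001, 0.5000000]
y_infer, routed = moe_forward(2.0, inference_scores, experts, k=2)

# The trainer's scores differ slightly, so a fresh top-k picks a different expert...
trainer_scores = [0.9, 0.1, 0.5000000, 0.5000001]
y_fresh, fresh = moe_forward(2.0, trainer_scores, experts, k=2)

# ...but replaying the recorded routing keeps the trainer on the inference path.
y_replay, replayed = moe_forward(2.0, trainer_scores, experts, k=2, replay_experts=routed)
```

The extra data shipped from inference to training is just the list of selected expert indices per token, which is small compared with the activations themselves.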

The environment has to be close enough to production that reward hacking does not dominate

For Federico Cassano, the environment problem is not cosmetic. RL encourages models to exploit reward structures, and if the simulated environment differs from production, the model may learn behavior that performs well in simulation and poorly for users. He says models can sometimes figure out when they are being run in a fake environment rather than a real one, and then behave differently during RL than in production.

Oh, I'm in a fake environment. I've learned a few tricks to like get a better reward in this environment and let me try them out.
Federico Cassano

Sonya Huang summarizes the risk bluntly: models love to cheat, and RL is good at encouraging cheating.

Cursor’s RL environment, as Cassano defines it, has three pieces. The harness is where the model submits tool calls and those tools are executed. The operating system is the world the model interacts with. The reward component checks whether the task was completed correctly. The harness is relatively portable. The operating system is the hard part.

For coding, Cassano notes, there are many working environments available in the form of GitHub repositories. A model can install dependencies for a repository and get something close to a live coding environment. But serious tasks can require surrounding services: a database for a migration, for example. Those requirements make environments difficult to construct and operate.

Cursor does not use RL environment vendors, Cassano says, though he sees value in such products for companies that lack working environments or need help standing up service-rich tasks. At Cursor, normal containers were not enough. The company built a virtual-machine stack so it could quickly spin up large numbers of realistic environments. Cassano gives the burst requirement as the ability to ask for “a hundred thousand virtual machines now” and have them come up.

Dmytro Dzhulgakov says the broader rule for companies with actual AI products is that they should do RL against that product. Frontier labs need generic environments because they are building generic models. An application company trying to build the best model for its own product, in his view, should make the RL environment as close to production as possible, while isolating it so the model cannot damage production systems.

That preference affects where infrastructure runs. Fireworks may run the trainer for some customers, while in Cursor’s case the trainer ran on Cursor’s side. But Dzhulgakov says environments often default to running on the customer side because that is where the real product implementation lives. Trying to wrap the customer’s production application into a hosted platform introduces differences, and those differences are exactly what RL can exploit.

Online user feedback helps, but it cannot create the model from scratch

Federico Cassano distinguishes Cursor’s simulated RL from what the company calls real-time RL. Cursor also uses signals from actual usage, looking for cases where a user appeared happy or sad about a particular model generation. Using the same weight-synchronization technology built with Fireworks, Cassano says Cursor can update the model live and ship a new version every few hours. The team is working to reduce that interval, though he says it may later need to increase again as model horizons lengthen.

But real-time RL has limits. It is currently inefficient, Cassano says, because GPUs are offline for long periods. Dmytro Dzhulgakov adds two other constraints: signal precision and user experience.

In simulation, the system can run many rollouts from the same prompt. It can ask the model to try a task 16 times, 128 times, or otherwise explore multiple paths. Some rollouts will succeed, others fail, and the comparison produces a more precise learning signal. Algorithms such as GRPO rely on multiple rollouts in parallel. Online user interaction typically gives one rollout.
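
Why multiple rollouts sharpen the signal can be seen in a group-relative advantage computation in the style of GRPO. This is a simplified sketch, not Cursor's training code: the `group_relative_advantages` helper, the use of the population standard deviation, and the `eps` term are choices made here for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts from the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sixteen rollouts of one prompt: 4 succeed (reward 1), 12 fail (reward 0).
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)

# A single online rollout has nothing to be compared against.
single = group_relative_advantages([1.0])
```

With sixteen attempts, successes get a positive advantage and failures a negative one. With a single attempt the group baseline equals the reward itself, so the advantage collapses to zero and the comparison carries no information.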

Simulation also permits failure. If a simulated rollout goes badly, the cost is compute. If a live user sees bad behavior, the cost is product experience. Real-time RL is effectively an A/B test against users, so the minimum model quality must be much higher.

Cassano says simulated RL is where Cursor teaches reasoning, tool calling, new information about the world, and the initial behavior it wants the model to have. Only after the model is good enough does it go to users, where real-time RL can improve it. That creates a paradox: online RL cannot bootstrap a poor model, because users will not want to use it and therefore will not provide useful feedback. It can only improve a model that is already shippable.

He declines to describe the reward signal itself, calling it “top secret stuff.” He also declines to discuss whether Cursor has found ways to extract more information from long rollouts, after Huang invokes Andrej Karpathy’s line that RL can feel like “slurping bits from a straw.”

Long-horizon agents require training the model to manage its own context

Composer 2 is aimed at long-horizon coding tasks, and Federico Cassano identifies two central RL problems as trajectories get longer.

The first is credit assignment. If the only reward comes at the end of a long task, the model must infer which earlier actions contributed to success or failure. Cassano describes it as the model asking, in simplified form, “where did I do right and where did I do wrong?” The longer the trajectory, the harder that becomes.

The second is context length. Models have finite context windows. Cassano says Composer is a 200,000-context-window model, but in practice it can continue for millions of tokens because Cursor put compaction inside the RL loop. Cursor calls this self-summarization.

The agent learns during RL to summarize its work, use that summary to restart its context window, and continue pursuing the task. This means RL is not only training the model to solve the coding task; it is jointly training the model to produce useful summaries and to rely on those summaries later. Cassano presents this as a continuation of reasoning: the model learns how to preserve enough state about its own work to keep acting beyond the raw context limit.
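
The control flow of compaction, though not the learning, can be sketched as a loop over a bounded context. All names here are hypothetical, characters stand in for tokens, and `toy_summarize` is a fixed rule; in the system Cassano describes, the summary is written by the model itself and is shaped by the same RL reward as the rest of the trajectory.

```python
def run_with_compaction(steps, context_limit, summarize, count_tokens=len):
    """Feed work items into a bounded context, compacting whenever the limit would be exceeded.

    Returns (final_context, total_tokens_processed, number_of_compactions).
    """
    context, total, compactions = "", 0, 0
    for step in steps:
        total += count_tokens(step)
        if count_tokens(context) + count_tokens(step) > context_limit:
            context = summarize(context)   # replace the full history with a summary
            compactions += 1               # and continue from the summary
        context += step
    return context, total, compactions


def toy_summarize(text):
    # Placeholder "summary": keep only the last 20 characters.
    return text[-20:]


steps = ["edit file %03d; " % i for i in range(100)]   # 15 characters each

context, total, compactions = run_with_compaction(
    steps, context_limit=200, summarize=toy_summarize
)
```

The total work processed is several times the context limit even though the live context never exceeds it, which is the mechanism by which a fixed-window model can keep going for far longer than its window.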

Dmytro Dzhulgakov finds this notable because context management is often treated as part of the harness. Cursor instead co-optimizes harness behavior and model behavior inside the same optimization loop. In his account, this is another version of the bitter lesson: when more of the system is placed inside the compute-driven optimization loop, the end-to-end system can improve.

RL is not just for coding agents, but the reward has to be engineered

Federico Cassano makes the broad argument that RL “fits everywhere,” including Cursor’s tab model. He marks part of the explanation as personal theory rather than established fact: during pre-training, a model ingests human knowledge from many perspectives. In math, for example, it sees both experts and students. Presented with a math problem, a model that has not gone through RL may not know which role to inhabit. RL tunes the knob toward “you are the expert; you need to do things correctly.”

In that account, early RL sharpens the model quickly by telling it which mode of behavior is desired. Later stages require much more compute and produce more elaborate reasoning patterns, but even small-compute RL can be useful as a behavioral sharpening step.

Dmytro Dzhulgakov offers a similar division: continual pre-training and supervised fine-tuning transfer new knowledge, while RL sharpens behavior or qualities. Many applications need both. He gives summarization as an example outside coding. If a company wants a particular summarization style, it may be difficult to construct enough labeled examples of good and bad outputs. But it can define rubrics, use an LLM as a judge, and let the model experiment with different summarization strategies inside an RL loop.

The best reward, Dzhulgakov says, is verifiable and automatic. In math or coding, deterministic checks are ideal. LLM-as-judge works where judgment is easier than generation: evaluating an answer can be simpler than creating it. But complex evaluations may need to be broken into multiple criteria, because a single judge asked to assess style, factuality, and other dimensions at once may get confused. The reward can combine deterministic checks and LLM-based rubrics.
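
A reward that mixes deterministic checks with per-criterion rubric scores might look like the sketch below. The `composite_reward` function, the equal weighting, and the two placeholder judges are assumptions for illustration; a real system would call an LLM once per rubric criterion where the lambdas are.

```python
from typing import Callable, Dict


def composite_reward(
    output: str,
    checks: Dict[str, Callable[[str], bool]],
    rubrics: Dict[str, Callable[[str], float]],
    check_weight: float = 0.5,
) -> float:
    """Blend pass/fail checks with per-criterion judge scores in [0, 1]."""
    check_score = sum(fn(output) for fn in checks.values()) / len(checks)
    rubric_score = sum(fn(output) for fn in rubrics.values()) / len(rubrics)
    return check_weight * check_score + (1.0 - check_weight) * rubric_score


def compiles(source: str) -> bool:
    """Deterministic check: does the candidate parse as Python?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


# Placeholder judges, one per criterion, standing in for separate LLM-judge calls.
judge_brevity = lambda text: 1.0 if len(text) < 80 else 0.0
judge_has_docstring = lambda text: 1.0 if '"""' in text else 0.0

checks = {"compiles": compiles}
rubrics = {"brevity": judge_brevity, "docstring": judge_has_docstring}

good = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
bad = "def add(a, b) return a + b"

reward_good = composite_reward(good, checks, rubrics)
reward_bad = composite_reward(bad, checks, rubrics)
```

Keeping each criterion in its own function mirrors the advice above: each judge answers one narrow question, and the combination rule stays explicit and inspectable.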

Experts still matter, but less as people manually judging every rollout and more as designers of the evaluation rules. Dzhulgakov maps this onto the shift from writing software directly, to crafting training data, to crafting evaluation rules. Teams still have to inspect examples, understand failure modes, and encode the product experience they want. The expert work moves upstream into defining what the model should optimize.
