Orply.

Zed Uses Student Models to Filter Production Traces for Zeta 2

Ben Kunkle | AI Engineer | Saturday, May 30, 2026 | 6 min read

Ben Kunkle, Zed’s edit predictions lead, explains how the company built Zeta 2 as a small production model for one latency-sensitive task: predicting a user’s next code edit on every keystroke. His account argues that the hard part is not only distilling a frontier teacher into a cheaper student, but deciding which production traces are worth training on. Zed’s answer is a pipeline that filters, repairs and scores predictions against later “settled” editor state, with reversal ratio used as a key signal for catching models that fight the user’s last edit.

Zeta 2 is trained for one narrow job: predict the next edit before the user makes it

Ben Kunkle describes edit prediction as a constrained model problem rather than a general coding-assistant problem. The model receives code around the cursor, the cursor position, recent edits, nearby type and variable definitions, and diagnostics or errors. From that context, it predicts the next edit the user is likely to make.
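To make that input list concrete, here is a minimal sketch of a request record. The class and field names are hypothetical and are not Zed's schema; they simply mirror the inputs listed above.

```python
from dataclasses import dataclass, field


@dataclass
class EditPredictionRequest:
    """Hypothetical container for the inputs described above (not Zed's schema)."""

    excerpt: str        # code around the cursor
    cursor_offset: int  # cursor position as a character offset into `excerpt`
    recent_edits: list[str] = field(default_factory=list)  # e.g. diffs of recent edits
    definitions: list[str] = field(default_factory=list)   # nearby type/variable definitions
    diagnostics: list[str] = field(default_factory=list)   # diagnostics or error messages

    def __post_init__(self) -> None:
        # Reject a cursor that does not point inside the excerpt.
        if not 0 <= self.cursor_offset <= len(self.excerpt):
            raise ValueError("cursor_offset must fall inside excerpt")
```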

The constraint that shapes the system is latency. Edit prediction runs on every keystroke, with a budget under 300 milliseconds. That makes the task suited to a small, specialized, fine-tuned model: not a broad assistant, but a model trained to do this one thing quickly.

Zed’s training pipeline draws on opt-in production snapshots. The pipeline also includes synthetic data from git commits, but Kunkle focused on the production snapshots: Zed can capture the editor state around a prediction request, then turn that snapshot into a training example after it passes through a distillation pipeline.

The teacher model creates the target, but it has to be filtered and repaired

Zed uses a frontier model as a teacher. It gives the teacher the same inputs the edit-prediction model would receive and asks what prediction it would make. Kunkle’s point was not that the frontier model is unreliable in the abstract, but that at production scale its outputs are varied enough to require careful handling: ask it 100,000 times, he said, and “they’re gonna give you a hundred thousand one answers.”

That forced work on the prompt sent to the frontier model, and it also led to a quality-assurance stage before the examples are used to train Zeta 2. Zed runs static offline evaluations against teacher predictions, including checks for whether a prediction simply undoes what the user just typed or ignores the editable-region boundary supplied to the model.
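The two checks named above can be sketched as plain string predicates. This is an illustration under assumptions, not Zed's implementation: the function names are hypothetical, the editable region is assumed to be a character range, and a reversal is taken to mean restoring the exact text from before the user's last edit.

```python
def is_reversal(before_last_edit: str, current: str, predicted: str) -> bool:
    """True when the prediction restores the text as it was before the user's last edit."""
    return current != before_last_edit and predicted == before_last_edit


def stays_in_region(current: str, predicted: str, start: int, end: int) -> bool:
    """True when the text outside current[start:end] is unchanged by the prediction."""
    prefix, suffix = current[:start], current[end:]
    return (
        # The length guard stops prefix and suffix from overlapping in `predicted`.
        len(predicted) >= len(prefix) + len(suffix)
        and predicted.startswith(prefix)
        and predicted.endswith(suffix)
    )
```

A prediction that fails either predicate would carry the name of the failed check forward, which is the kind of failure mode a repair prompt needs.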

Predictions that fail those checks are routed into a repair step rather than discarded immediately. Another frontier-model call receives a similar prompt plus the failure mode — for example, that the prediction crossed a boundary or reversed the prior edit — and is asked to fix it.

Once repaired, the teacher output is converted into the expected output format for the student model. Kunkle emphasized that everything through this point is reusable across experiments: Zed can cache the distilled and repaired examples, then vary the student-facing format for different runs. Those experiment-specific choices include whether to include diagnostics and how much edit history to provide.

Each stage reads JSONL, enriches an example by adding or moving fields, and writes JSONL back out. A typical full training run uses about 100,000 examples; smaller experiments use roughly 10,000 to 50,000.
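A stage of that shape is small enough to sketch. The field names `teacher_output` and `student_output` are assumptions made for illustration; the source only says that stages add or move fields.

```python
import json
from typing import Callable, Iterable, Iterator


def run_stage(lines: Iterable[str], enrich: Callable[[dict], dict]) -> Iterator[str]:
    """Read JSONL lines, enrich each example, and yield JSONL lines back out."""
    for line in lines:
        if line.strip():  # skip blank lines
            yield json.dumps(enrich(json.loads(line)))


def add_student_format(example: dict) -> dict:
    # Hypothetical enrichment: derive a student-facing field from the teacher output.
    return {**example, "student_output": example["teacher_output"].strip()}


# Usage with files (paths are placeholders):
#   with open("in.jsonl") as src, open("out.jsonl", "w") as dst:
#       for out in run_stage(src, add_student_format):
#           dst.write(out + "\n")
```

Because every stage has the same input and output format, the output of an expensive stage can be cached on disk and reused by several downstream experiments.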

~100k examples in a peak Zeta 2 training run

The strongest production examples are near the user’s eventual edit, but not identical to it

The more novel part of the pipeline is what Zed calls “settled data.” The premise is simple: after a prediction request, the user eventually writes the code they wanted. Because Zed is the editor, it can wait until the editable region stops changing, snapshot that later state, and use it as a signal about what the model should have predicted.

Kunkle was careful about the noise in this signal. The user may change their mind. An agent may rewrite the region completely. The final code may be unrelated to what was reasonable at the moment the prediction was requested. So Zed does not treat the settled state as a clean label.

Instead, Zed compares possible model predictions against the settled state using a Levenshtein-like distance. One version of the method is to generate 10 teacher predictions for each example and check whether any are close to what the user eventually wrote. If at least one is close, that tells Zed two things: the example was predictable, and the settled state was not just unrelated noise.
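The closeness test can be sketched with a textbook Levenshtein distance. The source only says the measure is "Levenshtein-like", so the normalization and the 0.2 threshold below are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using a rolling row of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def any_close(candidates: list[str], settled: str, threshold: float = 0.2) -> bool:
    """True if at least one candidate lands near the settled text."""
    return any(normalized_distance(c, settled) <= threshold for c in candidates)
```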

The cost problem is immediate. At 100,000 training examples, 10 teacher completions per example becomes one million frontier-model requests. Kunkle called that prohibitively expensive.

The workaround depends on Zeta 2’s student model approaching teacher quality. Instead of generating 10 frontier completions, Zed can run a student checkpoint 50 times at negligible cost, then compare those predictions to the settled region. The same filtering idea remains: find examples where at least one plausible prediction lands close to what the user eventually wrote.
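The request counts behind that trade-off are simple to work out. The sketch below uses only the figures quoted above; the source gives no per-request prices, so none are assumed.

```python
EXAMPLES = 100_000      # examples in a full training run
TEACHER_SAMPLES = 10    # frontier completions per example
STUDENT_SAMPLES = 50    # student-checkpoint completions per example

frontier_requests = EXAMPLES * TEACHER_SAMPLES  # 1,000,000
student_runs = EXAMPLES * STUDENT_SAMPLES       # 5,000,000

print(f"{frontier_requests:,} frontier requests vs {student_runs:,} student runs")
```

The student approach issues five times as many generations, so it only pays off if a student run is far cheaper than a frontier request, which is the premise Kunkle describes.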

The useful examples are not the ones at either extreme of the distance distribution. If the prediction is very far from the settled state, Kunkle treats that as likely noise. If it is extremely close, the edit may be too obvious: his example was a user writing a function named add who has just typed “a +”, where “b” is the predictable continuation. The valuable region is in the middle: close enough to show the model could have predicted what the user wanted, but not so trivial that it teaches little.

That middle region is also where Zed expects fresh information to appear, including code past the student model’s training-data cutoff: new functions or patterns the model has not seen before, but which users actually wanted in the editor. Zed generally does not train directly on the final settled state, because it remains noisy. It trains on the candidate prediction closest to the settled state.
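Putting the band filter and the choice of training target together, a minimal sketch might look like this. The `low` and `high` cutoffs are invented for illustration, and `difflib.SequenceMatcher` stands in for whatever distance Zed actually uses.

```python
from difflib import SequenceMatcher
from typing import Optional


def distance(a: str, b: str) -> float:
    """Stand-in distance in [0, 1]: 0 means identical, 1 means nothing in common."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def select_target(candidates: list[str], settled: str,
                  low: float = 0.05, high: float = 0.5) -> Optional[str]:
    """Return the candidate closest to the settled text, or None if it falls
    outside the middle band (too trivial below `low`, likely noise above `high`)."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: distance(c, settled))
    return best if low <= distance(best, settled) <= high else None
```

Note that the function returns the candidate, not `settled`: the settled text only decides whether an example is kept and which prediction becomes the label.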

Reversal ratio is tracked as a failure signal because there is no single correct answer

Offline evaluation uses a held-out test set, so the model is not evaluated on examples it trained on. Zed scores predictions with deltaChrF, exact lines, reversal ratio, and kept rate. Kunkle described deltaChrF as essentially their Levenshtein-style measure: an n-gram comparison across multiple values of n.
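As a rough illustration of "an n-gram comparison across multiple values of n", here is a simplified character n-gram F-score in the spirit of chrF. It is not Zed's deltaChrF: the source does not say how the delta is formed, and the `max_n` and `beta` defaults below are assumptions.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Multiset of all character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average F-beta over character n-gram orders 1..max_n (simplified chrF)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * precision * recall
                          / (beta**2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0
```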

The reversal ratio is more behavioral. Kunkle defined reversals as “undoing exactly what you just typed.” That failure can look superficially like an edit, but it is a bad experience in an editor because the model is fighting the user’s last action.

Zed also evaluates against three teacher completions per example, because many edit-prediction cases do not have one correct answer. If a student prediction is close to one of three distinct frontier-model completions, Zed treats that as evidence the prediction is good.
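Scoring against several acceptable answers reduces to taking the best match. In this sketch `difflib.SequenceMatcher` is a stand-in similarity and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_reference_score(prediction: str, references: list[str]) -> float:
    """Score a prediction by its closest reference, since any one of them may be right."""
    return max((similarity(prediction, r) for r in references), default=0.0)
```

Taking the maximum rather than the mean means a student is not penalized for agreeing with one valid completion while differing from the other two.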

Kunkle also stressed that offline evaluations do not necessarily correlate with what users want in the editor. That is why Zed tracks model behavior after deployment through structured logs and dashboards, including latency, kept rate, token counts, acceptance rate, and A/B performance across model versions.

For production rollout, Zed built an internal edit-prediction experiments page. Kunkle showed a Zed Admin Panel view with experiment controls and said one experiment was sampled at 15% while another received the rest of production traffic. The page lets the team move a model from partial traffic to larger exposure or to the live running version. The model labeled v0211seedcoder, he said, was released as Zeta 2 the prior week.
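The source does not describe how the 15% sample is drawn. One common approach, shown here purely as an assumption, is deterministic hash bucketing so that a given user always sees the same model; the identifiers are hypothetical.

```python
import hashlib


def assign_model(user_id: str, experiment: str, control: str,
                 fraction: float = 0.15) -> str:
    """Route roughly `fraction` of users to `experiment`, everyone else to `control`."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return experiment if bucket < fraction * 2**64 else control
```

Raising `fraction` moves a model toward wider exposure without reshuffling users who were already assigned to it.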

Zed is also testing newer production diagnostics. Kept rate compares the text after a prediction with the later settled state and measures how many characters from the prediction survived. Diagnostic error counts snapshot how many errors existed before and after a prediction, then use that as another quality signal.
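Kept rate as described can be approximated with two character-level diffs: one to find what the prediction inserted, and one to see which of those characters are still present in the settled text. This is a sketch under that interpretation, with a hypothetical function name, not Zed's exact metric.

```python
from difflib import SequenceMatcher
from typing import Optional


def kept_rate(before: str, after_prediction: str, settled: str) -> Optional[float]:
    """Fraction of characters inserted by the prediction that survive in `settled`.
    Returns None when the prediction inserted nothing."""
    # Indices in `after_prediction` that the prediction added or rewrote.
    inserted = set()
    diff = SequenceMatcher(None, before, after_prediction, autojunk=False)
    for tag, _, _, j1, j2 in diff.get_opcodes():
        if tag in ("insert", "replace"):
            inserted.update(range(j1, j2))
    if not inserted:
        return None
    # Indices in `after_prediction` that still match text in `settled`.
    surviving = set()
    later = SequenceMatcher(None, after_prediction, settled, autojunk=False)
    for block in later.get_matching_blocks():
        surviving.update(range(block.a, block.a + block.size))
    return len(inserted & surviving) / len(inserted)
```

For example, if the prediction inserts `compute(a, b)` and the user later changes only `b` to `c`, 12 of the 13 inserted characters survive.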

Settled state is currently a ten-second editor heuristic, not a repository-level signal

Asked whether Zed uses signals such as git commits to decide when a user is satisfied with a block of code, Kunkle said it does not currently look at git commits. It could, but the present heuristic is editor-local: if the user stops editing that area for 10 seconds, Zed snapshots it as settled.

That means Zed will not snapshot cases where a user keeps editing the same location for longer without pausing for 10 seconds. The heuristic is deliberately rough, which is why settled state functions as a filtering signal rather than a direct training label.
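The heuristic behaves like a debounce timer. The sketch below is an assumption about its shape, with a hypothetical class name and explicit timestamps so it can be tested without sleeping; only the 10-second window comes from the source.

```python
from typing import Optional

SETTLE_SECONDS = 10.0  # quiet period quoted in the talk


class SettleTracker:
    """Tracks one editable region and reports its text once edits go quiet."""

    def __init__(self, settle_after: float = SETTLE_SECONDS) -> None:
        self.settle_after = settle_after
        self.text: Optional[str] = None
        self.last_edit_at: Optional[float] = None

    def on_edit(self, text: str, now: float) -> None:
        # Every edit restarts the quiet-period clock.
        self.text, self.last_edit_at = text, now

    def poll(self, now: float) -> Optional[str]:
        """Return the settled text once, after `settle_after` seconds without edits."""
        if self.last_edit_at is None or now - self.last_edit_at < self.settle_after:
            return None
        settled, self.last_edit_at = self.text, None
        return settled
```

A user who keeps typing with gaps shorter than the window keeps resetting the clock, so no snapshot is produced until they pause.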
