DeepSeek Uses Visual Primitives to Make Image Reasoning Cheaper

Károly Zsolnai-FehérTwo Minute PapersFriday, May 22, 20266 min read

Károly Zsolnai-Fehér presents DeepSeek’s “Thinking with Visual Primitives” paper as a meaningful shift in visual AI: not a model that merely sees images, but one that can reason by marking them with points, boxes and paths. He argues that this makes tasks such as counting and maze tracing cheaper, more accurate and easier to inspect, with the paper reporting strong benchmark results while using about 90% fewer visual tokens than many frontier systems. He also cautions that the work is a blueprint rather than a released model, and still depends on triggers and may struggle with fine visual detail or unfamiliar topology problems.

The claim is not that DeepSeek can see images, but that it can point while thinking

Károly Zsolnai-Fehér frames DeepSeek’s “Thinking with Visual Primitives” as a departure from the usual way multimodal systems reason over images. Vision-capable AI is not new: many systems already accept images, and some accept video, including open models. The claimed advance is narrower and more consequential. Instead of forcing the model to describe an image in words as it reasons, the method lets the model use visual primitives such as points and bounding boxes inside its reasoning process.

The motivating example is counting people in a rugby-team photo. A previous technique, in Zsolnai-Fehér’s description, might reason verbally: people in the upper left, striped players in two rows, perhaps three rows, some standing, some sitting. That kind of verbal description is both error-prone and unnecessarily expensive. A human, by contrast, would often use a finger: point at each person and count.

The paper’s method translates that intuition into model behavior. The system can generate coordinate-like visual references — for example, XML-style bounding-box tags such as <box>[x1,y1,x2,y2]</box> — while reasoning. In the rugby example from the paper, the “before” image is the original team photo, and the “after” image has orange boxes drawn around each person’s head. The point is not merely that the system returns a number. It can mark what it is counting.

Don’t describe images like a poet. Point like a human.

Károly Zsolnai-Fehér

The central claim is that this makes visual reasoning more accurate and faster. Zsolnai-Fehér ties that to the cost of modern AI use: hardware and tokens are expensive, so any approach that uses fewer visual tokens while preserving performance matters.

The efficiency result matters only because the benchmark result survives it

The most concrete performance claim is visual-token efficiency. Zsolnai-Fehér says the method needs “about 90% fewer visual tokens than most frontier models.” The paper’s “Token Efficiency” chart compares KV cache entries across several systems and highlights “Ours-284B-A13B” as using fewer entries. Another chart summarizes the same point as fewer KV cache entries for “Visual Primitives” than for “Old Models.”

~90%

fewer visual tokens than most frontier models, according to Zsolnai-Fehér’s summary of the paper

But he immediately adds the relevant caveat: thinking less does not matter if the answer is wrong. The paper’s “Average Score on Selected Benchmarks” chart reports 77.2% for the highlighted model.

77.2%

average score reported for Ours-284B-A13B on selected benchmarks

Zsolnai-Fehér characterizes that result as a free system matching or beating almost everything in the comparison, including “billion dollar systems.” He also later says the paper does not have a model attached that he knows of. He calls the research free and open, and describes the paper as a blueprint for the technique rather than as a packaged model release.

The benchmark discussion is not presented as unqualified trust. Zsolnai-Fehér explicitly raises the concern that benchmarks are being gamed. A referenced UC Berkeley article is titled “How We Broke Top AI Agent Benchmarks: And What Comes Next,” by Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, and Dawn Song. The DeepSeek result is framed as more credible because the paper reports an average over seven benchmarks with “in-house benchmarks excluded,” a phrase highlighted from the paper.

That exclusion matters because, in his words, creating a benchmark that fits one’s own method is “one of the oldest tricks in the book.” He illustrates the point with a joke benchmark for “You-ness,” where the subject of the benchmark is guaranteed to rank first. The claim is not that public benchmarks are immune to problems. It is that the reported average highlighted here is not built from the paper’s own in-house benchmarks.

The method also makes parts of the reasoning inspectable

Visual primitives are presented as an interpretability gain as well as a speed and accuracy tool. In maze-navigation examples from the paper, the system does not only answer a question about reaching a target; it produces a visual path that can be inspected. A hexagonal maze has a start point and a red target, with a path traced through it. Another example uses tangled connections among icons such as a crown, fish, pizza, and octopus; when asked where the crown connects, the highlighted path leads to the octopus.

For Károly Zsolnai-Fehér, this is a step toward AI systems whose reasoning can be understood in a more concrete way than a final answer or “a soup of numbers.” He is careful not to overstate the examples: they are simple. But he argues that, when something goes wrong, having intermediate visual primitives may make errors easier to find and fix.

In the counting task, the boxes show what the model treated as countable objects. In the topology task, the traced points or path show the route used to reach the conclusion. The practical value is that the model leaves behind something visual to inspect when evaluating the answer.

The training recipe is distillation from specialized visual teachers

The implementation explanation centers on what the paper calls an on-policy distillation objective. Zsolnai-Fehér simplifies the setup as a group of expert AI models, each good at a different form of visual reasoning. One “expert” might be especially good at drawing boxes. Another might be good at tracing mazes with points. The goal is not to keep separate experts, but to train one student model that can perform multiple kinds of visual thinking.

In his description, the student model proposes what it would do, and the teacher models respond with what they would have done. Repeating that process distills the teachers’ specialized knowledge into the student. This is why he describes the method as “distilling the knowledge of a bunch of expert teachers into a student.”

The paper is described as a blueprint, not a released model. Zsolnai-Fehér says it does not have an attached model that he knows of; it lays out the concept in detail. Because he describes the research as free and open, he argues that the technique could potentially be added to many existing systems, including free ones.

That distinction defines the practical claim. The immediate contribution is a method for adding visual-primitive reasoning to models, rather than a packaged system that users can simply run as demonstrated.

Less visual input is not the same as solving vision

The broader claim Károly Zsolnai-Fehér draws is that progress in multimodal intelligence may not come only from showing models higher-resolution images. A highlighted excerpt from the paper states that “the future of multimodal intelligence lies not just in seeing more pixels,” and he restates the point as “less is more”: DeepSeek cut visual tokens sharply and, in the reported benchmark comparison, still performed strongly against frontier systems.

He also gives three limitations.

First, the model does not automatically use this kind of pointing-based reasoning. It needs an explicit cue. The paper excerpt says the approach “relies on explicit trigger words.”

Second, visual primitives such as bounding boxes are not equally suitable for every visual problem. Boxes may work well for people in a team photo, but they are less adequate for very fine structures such as blades of grass or strands of hair. Zsolnai-Fehér connects this to a recurring theme in graphics and vision problems: thin structures remain difficult.

Third, the topological reasoning demonstrated in the examples may not generalize as robustly as desired. It can work on the mazes and connection diagrams used in the paper, but he cautions that it may be less reliable when presented with something completely new.

His conclusion therefore holds two claims together. He calls the work a possible breakthrough and warns against misleading headlines and hype. The result is significant because it changes the cost and inspectability profile of visual reasoning; it is incomplete because it still depends on triggers, can struggle with fine-detail visual tasks, and may not generalize cleanly across unfamiliar topology problems.

Data and Training Evals and Benchmarks AI Research Methods Multimodal AI

The claim is not that DeepSeek can see images, but that it can point while thinking

The efficiency result matters only because the benchmark result survives it

The method also makes parts of the reasoning inspectable

The training recipe is distillation from specialized visual teachers

Less visual input is not the same as solving vision

The frontier, in your inbox tomorrow at 08:00.