RecursiveMAS Lets AI Agents Collaborate Without Translating Through English

Károly Zsolnai-Fehér · Two Minute Papers · Friday, June 19, 2026 · 6 min read

Károly Zsolnai-Fehér presents RecursiveMAS, a paper by Xiyuan Yang, Jiaru Zou and coauthors, as an attempt to fix a coordination cost in multi-agent AI systems: agents repeatedly translating internal work into English for one another. The paper’s claim is that agents can instead pass latent numerical representations directly, improving collaboration while cutting token use. Zsolnai-Fehér says the reported gains are substantial on small models, including better math results and far fewer tokens, but frames the work as early research rather than a deployable agent product.

RecursiveMAS attacks the translation cost between agents

RecursiveMAS is presented as a way to make multiple AI agents collaborate without forcing every handoff through English. The diagnosis from Károly Zsolnai-Fehér is that many agent failures are not only failures of individual model quality, but failures of coordination: one model does some work, writes it out as text, another reads it, re-encodes it internally, and the process repeats.

That matters because agents are being asked to do increasingly consequential work: booking the cheapest flight, running continuously to manage a calendar, submitting insurance claims, or scanning a codebase for vulnerabilities and patching them. Zsolnai-Fehér frames the boom as unusually fast even by AI standards, with a star-history.com chart depicting OpenClaw’s GitHub stars rising almost vertically past long-running projects such as Linux and React.

The same autonomy creates new failure modes. An arXiv paper titled “AI Agents May Always Fall for Prompt Injections” appears alongside headlines about AI “turbocharging” text-message scams and an OpenClaw agent being tricked into phishing attacks with user data compromised. The point is not only that agents can be insecure or unreliable; it is that those risks compound when agents depend on one another.

His example is a holiday-planning system split across agents. A flight agent hallucinates a cheaper airport 400 miles from the intended destination. A hotel agent then optimizes locally by booking a cheap, non-refundable room nearby. Each step can look reasonable inside a narrow role. Together, they produce a non-refundable room the user will never see.

“Recursive Multi-Agent Systems,” by Xiyuan Yang, Jiaru Zou, and coauthors, starts with a familiar multi-agent pattern: a planner writes a plan, a critic critiques it, and a solver produces the answer. Zsolnai-Fehér initially says he saw “nothing interesting” there, because this planner-critic-solver structure is common. The important change is not the assignment of roles. It is the communication channel between them.

The agents stop translating their work into English

The central architectural question is blunt: why should agents talk to each other in plain English at all?

A brain-to-text neural-interface example from Willett et al. 2020 makes the point. In that demonstration, a person thinks of letters and text appears on a screen. But an ordinary alphabet is a human-facing writing system, not necessarily an optimal representation for thought. The demonstration moves from recognizable letter-like traces to an “Optimized Alphabet” made of abstract strokes.

RecursiveMAS applies the same logic to agent communication. In a text-based multi-agent system, an agent has to decode its internal state into tokens and full sentences. The next agent then has to read that text and re-encode it into its own internal representation. RecursiveMAS instead passes raw, undecoded numerical representations between agents. The paper’s phrase is “cross-agent latent state transfer.”

Instead of using English words, they pass raw, undecoded numbers directly to the next agent. Send raw brain signals if you will. Call it cross-agent latent state transfer.

Károly Zsolnai-Fehér
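The difference between the two channels can be illustrated with a deliberately tiny toy. This is a sketch under loud assumptions, not the paper's implementation: the three-word vocabulary, the `Handoff` record, and every function name below are hypothetical, and a real system would pass last-layer embeddings rather than four floats. It only shows why a decode-then-re-encode round trip costs generated tokens and can lose detail, while a direct handoff does neither.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical toy vocabulary: each "word" stands for one coarse value.
VOCAB = {"low": 0.0, "mid": 0.5, "high": 1.0}


def decode_to_text(state: list[float]) -> list[str]:
    """Sender side of a text channel: snap each number to the nearest word."""
    return [min(VOCAB, key=lambda w: abs(VOCAB[w] - x)) for x in state]


def encode_from_text(tokens: list[str]) -> list[float]:
    """Receiver side of a text channel: turn words back into numbers."""
    return [VOCAB[t] for t in tokens]


@dataclass
class Handoff:
    received: list[float]  # what the next agent ends up holding
    tokens_generated: int  # tokens the sender had to write out


def text_handoff(state: list[float]) -> Handoff:
    """Decode to words, then re-encode on the other side."""
    tokens = decode_to_text(state)
    return Handoff(encode_from_text(tokens), len(tokens))


def latent_handoff(state: list[float]) -> Handoff:
    """Pass the numbers through untouched; nothing is written as text."""
    return Handoff(list(state), 0)


def max_error(sent: list[float], received: list[float]) -> float:
    """Largest per-coordinate difference between what was sent and received."""
    return max(abs(a - b) for a, b in zip(sent, received))


state = [0.12, 0.48, 0.91, 0.30]  # made-up internal state
via_text = text_handoff(state)
via_latent = latent_handoff(state)

print("text:  ", via_text.tokens_generated, "tokens, max error", max_error(state, via_text.received))
print("latent:", via_latent.tokens_generated, "tokens, max error", max_error(state, via_latent.received))
```

In this toy, the text route spends one token per value and rounds each value to the nearest word, while the latent route spends none and delivers the state unchanged. Real language is far richer than three words, so the toy exaggerates the loss; the token cost is the part that maps most directly onto the paper's claim.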

Yang and Zou et al.’s architecture diagram describes “Latent Thoughts” as last-layer embeddings moving through “Inner Link” connections between agents, while “Outer Link” outputs remain part of the broader loop. In the depicted setup, agents collaborate across recursion rounds: round one, round two, round three. The theory is that the agents can refine an answer across rounds more cheaply than if every intermediate step has to be expressed in text.

The mechanism therefore targets two effects at once: better answers through repeated collaboration, and lower token usage because most of the collaboration happens in latent space rather than generated language.
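The control flow of such a loop can be sketched in a few lines. Everything here is an assumption made for illustration: the agents are simple functions that nudge a vector toward a hypothetical `TARGET`, and feeding the last agent's state back to the first agent is only one plausible reading of the loop in the diagram. The sketch shows the shape of the recursion, not how the actual models are wired or trained.

```python
from typing import Callable, List

Latent = List[float]
Agent = Callable[[Latent], Latent]

# Hypothetical "ideal" state, used only so the toy agents have something to refine toward.
TARGET = [1.0, -2.0, 0.5]


def make_agent(step: float) -> Agent:
    """Toy stand-in for a model: moves the state a fraction `step` toward TARGET."""
    def agent(state: Latent) -> Latent:
        return [x + step * (t - x) for x, t in zip(state, TARGET)]
    return agent


def run_recursive(agents: List[Agent], state: Latent, rounds: int) -> Latent:
    """Pass the latent state through every agent in order, once per round."""
    for _ in range(rounds):
        for agent in agents:  # agent-to-agent handoff, no text in between
            state = agent(state)
        # Assumption: the final agent's state is looped back as the next round's input.
    return state


def distance(state: Latent) -> float:
    """Euclidean distance from the toy TARGET."""
    return sum((x - t) ** 2 for x, t in zip(state, TARGET)) ** 0.5


planner, critic, solver = make_agent(0.2), make_agent(0.1), make_agent(0.3)
start = [0.0, 0.0, 0.0]

# Distance to TARGET after 0, 1, 2, and 3 rounds.
errors = [distance(run_recursive([planner, critic, solver], start, r)) for r in range(4)]
print(errors)
```

The toy agents are built to improve the state on every pass, so the shrinking error here is a property of the toy and says nothing about the paper's results. What carries over is the structure: the state is handed from role to role and round to round without being written out as text along the way.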

The reported gains are large for small models

The reported benchmark story is compact: better math results, far fewer tokens, and very low training cost. On competition-level math questions, RecursiveMAS improves performance from 73% to 86%. The tested systems are “free, sub-10 billion parameter models,” not expensive frontier systems.

75.6%: average token reduction reported for RecursiveMAS versus Recursive-TextMAS

Claim	Reported figure	Context
Competition-level math performance	73% → 86%	RecursiveMAS result described in the source
Average token reduction	75.6% fewer tokens	Versus Recursive-TextMAS across listed benchmarks
Training cost	About $4	Zsolnai-Fehér’s description of the training expense
Latent-thought plateau	~80 steps	Additional latent length gives little added value

The main quantitative claims reported for RecursiveMAS

The strongest efficiency claim is token reduction. A Yang and Zou et al. chart compares Recursive-TextMAS with RecursiveMAS across MATH500, AIME25, AIME26, GPQA-D, MedQA, and code generation at recursion round three. The chart’s headline says RecursiveMAS uses an average of 75.6% fewer tokens, with reductions ranging from 3.4x to 5.1x across the listed benchmarks. Zsolnai-Fehér’s shorthand is that the tokens “evaporated into the latent space”: the system avoids paying the full token cost of writing intermediate agent communication in natural language.
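The percentage and the multipliers describe the same thing in two units, and the conversion is simple arithmetic: a k-fold reduction leaves 1/k of the tokens, which is (1 - 1/k) fewer. The short check below uses only the figures quoted above; the function names are illustrative.

```python
def percent_fewer(factor: float) -> float:
    """A k-fold reduction leaves 1/k of the tokens, i.e. (1 - 1/k) * 100 percent fewer."""
    return (1.0 - 1.0 / factor) * 100.0


def reduction_factor(percent: float) -> float:
    """Inverse conversion: p percent fewer tokens is a 1 / (1 - p/100) fold reduction."""
    return 1.0 / (1.0 - percent / 100.0)


low = percent_fewer(3.4)             # smallest reported reduction, as a percentage
high = percent_fewer(5.1)            # largest reported reduction, as a percentage
avg_factor = reduction_factor(75.6)  # the reported average, as a multiplier

print(f"3.4x -> {low:.1f}% fewer, 5.1x -> {high:.1f}% fewer, 75.6% fewer -> {avg_factor:.1f}x")
```

So the 3.4x to 5.1x range corresponds to roughly 70.6% to 80.4% fewer tokens, and the 75.6% average sits inside that range at about 4.1x. An average of per-benchmark percentages need not equal the percentage implied by an average multiplier, so this is a consistency check on the quoted figures rather than a reconstruction of how the chart's average was computed.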

Training is described as costing about four dollars. In Zsolnai-Fehér’s framing, the method can push smaller systems closer to much larger, more expensive models on difficult math problems without making coordination itself prohibitively expensive.

A heatmap from Yang and Zou et al. suggests another possible scaling behavior: more recursion rounds improve results. On AIME2025, the displayed values rise as the training recursion round increases, reaching 35.3 at r=4. Zsolnai-Fehér avoids the stock AI phrase “unlock,” but says the result “might give us a new scaling law”: more rounds, better results.

The control asks whether the gain is architecture or distillation

Károly Zsolnai-Fehér identifies one subtle possible flaw. The training data for each agent’s role is written by a giant AI model. If RecursiveMAS performs well, the improvement might come not from latent-state transfer but from strong distillation: smaller agents imitating an excellent teacher.

So which one is it? A good teacher or a good architecture?

Károly Zsolnai-Fehér

The distinction matters because the architectural claim is weaker if other multi-agent designs improve just as much when trained from the same strong teacher. If a large model writes strong planner, critic, and solver traces, then the teacher signal itself could explain much of the gain.

According to Zsolnai-Fehér’s description, the authors tested this directly. They gave the same teacher to comparison architectures and to RecursiveMAS. In that controlled comparison, RecursiveMAS still outperformed. His conclusion is that the “brain linking” itself works.

The limits are small-model scale, latent-thought length, and research maturity

Károly Zsolnai-Fehér closes the technical argument with explicit limits. First, the reported tests are on smaller models. He does not claim the same gains will transfer to larger systems. If the effect does not scale up, he says, it still matters because it puts small models “on steroids.” If it does scale, he presents that as far more consequential.

Second, RecursiveMAS has an apparent optimal latent-thought length. A Yang and Zou et al. chart plots accuracy against latent-thought length for MATH500, GPQA-D, and LiveCodeBench. The curves rise and then plateau around a length of 80. Zsolnai-Fehér describes this as “somewhat of a limit on how much thinking an agent can do per round,” while noting that after 80 steps the method does not gain much value anyway.
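One way to make "rises and then plateaus" concrete is to look for the point after which each additional step of latent length buys less than some small gain. The sketch below does that on a synthetic saturating curve. The curve, the sampling grid, and the 0.02 tolerance are all made up for illustration and are not the paper's data; the helper name `plateau_start` is hypothetical.

```python
import math
from typing import List, Optional


def plateau_start(xs: List[float], ys: List[float], tol: float) -> Optional[float]:
    """Return the first x after which every further gain in y is below tol."""
    gains = [b - a for a, b in zip(ys, ys[1:])]  # gains[i] is the gain from xs[i] to xs[i+1]
    for i in range(len(gains)):
        if all(g < tol for g in gains[i:]):
            return xs[i]
    return None  # no plateau found within the sampled range


# Synthetic stand-in curve (NOT the paper's measurements): a smooth saturating shape.
lengths = [0, 20, 40, 60, 80, 100, 120]
accuracy = [1.0 - math.exp(-x / 20.0) for x in lengths]

print(plateau_start(lengths, accuracy, tol=0.02))
```

The answer depends entirely on the chosen tolerance and sampling grid, which is worth keeping in mind when reading any "optimal length" off a chart: a plateau is a judgment about when further gains stop being worth their cost, not a sharp boundary.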

The code and models are described as freely available. A GitHub README lists supported features including collaboration patterns such as sequential, mixture, deliberation, and distillation, plus demo inference code and setup commands.

The practical warning is explicit: this is rough, early research. Zsolnai-Fehér says readers should not expect to “just plug this in and everything will fly immediately.” The narrower claim is also the more important one: multi-agent systems may not need to spend tokens translating internal reasoning into English for other agents to consume. In the reported experiments, passing latent states directly between agents improves smaller models while substantially reducing token usage.
