Heterogeneous Model Routing Beats Frontier Baselines on Visual Web Tasks

Adrian BertagnoliAI EngineerSunday, May 24, 202610 min read

Adrian Bertagnoli of Callosum argues that AI scaling is moving away from monolithic models running on uniform GPU clusters and toward heterogeneous systems that route subtasks across different models, chips and workflows. He points to Callosum results in visual web navigation and recursive long-context reasoning, where mixed model-and-hardware systems reportedly matched or beat frontier baselines while cutting cost and latency, as evidence that agentic workloads should be decomposed rather than sent wholesale to the most capable model.

A mixed visual-navigation system beat frontier baselines while running cheaper and faster

Adrian Bertagnoli’s most concrete claim is that visual web navigation improves when the system stops treating every action as a frontier-model problem. Callosum used a mixture of open and closed video-action-language models on Video Web Arena and reported beating GPT-5.2 and Gemini 2.5 by 18% and 25%, respectively. The system is described as using “active perception for modality selection,” treating the web itself as a heterogeneous environment rather than a single stream of text to be passed through one model.

18% and 25%

Callosum-reported Video Web Arena gains over GPT-5.2 and Gemini 2.5

The example task shown for this benchmark is deliberately multi-step and multimodal: find a robot toy on Amazon UK that looks like one image, is playing the ball as in another image, is from the same team as the player in that image, compare the price of the same product on OnBuy, and if Amazon is cheaper, purchase it with a gift option and ship it to the offices. Bertagnoli uses this kind of task to illustrate why homogeneous treatment is wasteful. The work decomposes into visual reasoning, textual reasoning, browsing, comparison, and action. Each component does not require the same model.

He says the results show a “fundamental shift of the Pareto frontier”: single models such as Kimi K2.5 and GPT-5.2 are outperformed by heterogeneous mixtures. In one comparison, using Qwen 3 VL 8B Instruct and Kimi K2.5 was 1.3 times faster than using Kimi alone and 18 times cheaper than using GPT-5.2 alone. In another, he says “Qwen 3 plus GPT” was three times faster and 3.7 times cheaper than GPT-5.2 alone. The on-screen label for that configuration reads “Qwen & GPT-2,” so the source preserves a mismatch between the spoken model label and the visible model label. Bertagnoli’s conclusion for this benchmark is blunt: “there’s no downside in constructing this in a heterogeneous manner.”

Comparison	Reported gain
Spoken: Qwen 3 VL 8B Instruct + Kimi K2.5 vs Kimi K2.5 alone	1.3x faster
Spoken: Qwen 3 VL 8B Instruct + Kimi K2.5 vs GPT-5.2 alone	18x cheaper
Spoken: “Qwen 3 plus GPT” vs GPT-5.2 alone; visible label says “Qwen & GPT-2”	3x faster and 3.7x cheaper
Qwen-VL-Instruct on zoom subtasks vs ChatGPT / GPT-5.2 comparison shown in the source	11x faster and 43x cheaper

Callosum-reported cost and latency gains from heterogeneous visual web navigation

The most concrete decomposition is zooming. Bertagnoli says one differentiating factor was mapping subtasks such as zoom and visual reasoning for the agent to less intelligent models. His phrase is that “you don’t need GPT to zoom for you.” On those subtasks alone, Callosum reports being 11 times faster and 43 times cheaper than using ChatGPT. Those local gains, he says, accumulate into the overall result of being 3.7 times cheaper and three times faster.

This is the operational center of the argument. The system does not win by assuming open models are always enough or frontier models are always wasteful. It wins, in Bertagnoli’s account, by recognizing that web navigation contains many small actions whose required intelligence is far below the intelligence available from the most expensive model. If those actions are routed to a cheaper, faster model while preserving task performance, the cost-latency frontier changes.

The scaling problem has moved from training one model to orchestrating many kinds of intelligence

Adrian Bertagnoli frames the prevailing AI scaling paradigm as “homogeneous intelligence”: scaling single models across fleets of identical chips. That approach, he says, was largely enabled by neural scaling laws, which showed that more data and more parameters produce better models. But he treats those laws as primarily rooted in the training domain. As AI systems move further into inference-heavy, agentic workloads, the old assumption that the answer is one larger model on one kind of hardware becomes less relevant.

The shift has already begun. At the model-architecture level, mixture-of-experts systems are replacing large dense models. At the workflow level, single LLM calls are being replaced by multi-agent systems. At the hardware level, single-chip inference is giving way to systems that disaggregate prefill and decode. Bertagnoli calls this “mild heterogeneity”: the system is still largely running on homogeneous clusters, but some variation has entered through prompts, sub-agents, model selection, and execution patterns.

His proposed next step is more explicit: treat model architectures, chips, and workflows as variables to optimize together. In a more heterogeneous system, different LLMs may run on different GPUs or other chips; state-space models, diffusion models, and language models may interact; and the hardware is selected for the computational demands of the particular model and task. The most developed form, he says, is a co-evolution of systems, hardware, and software — “a complete vertical integration of intelligence and hardware.”

The reason for doing this is not just operational efficiency. Bertagnoli’s central claim is that real-world problems are themselves heterogeneous. Problems such as autonomous payments, computer use, coding, or open-ended web navigation decompose into subtasks with different demands: visual reasoning, textual reasoning, planning, search, comparison, and execution. Scaling one kind of intelligence to cover all of those subtasks is inefficient and often suboptimal.

Real-world problems are complex, multi-step, and open-ended. They decompose into sub-problems which require vastly different types of intelligences.

Adrian Bertagnoli

That claim extends to hardware. Bertagnoli points to emerging generations of silicon — next-generation GPUs, in-memory compute, custom ASICs, and other chip types including thermodynamic chips, HPC neuromorphic systems, and quantum — as sources of specialized performance. The bottleneck, in his framing, is not the existence of specialized hardware; it is the lack of an interface that can unify those chip types into the current compute stack and make them useful to AI systems without bespoke engineering for every case.

Heterogeneity is presented as a systems-level principle, not a preference for smaller models

Callosum has formalized what it calls the “principle of maximum heterogeneity,” according to Adrian Bertagnoli. The framework represents heterogeneous agents as distributions over a skill space: different colors indicate different skill profiles, and communication among agents produces what he calls a production function. A given production function may match one problem demand well and another poorly.

His contrast with homogeneous systems is simple. If the system scales one narrow capability, it can only scale one peak. If it tries to match a broad demand function using homogeneous generalists, the result is broad but shallow: a “short cylinder” that does not meet the production function well. Bertagnoli says Callosum formalized this across domains including AI, neuroscience, economics, and ecology, and found that under “any reasonable amount of constraints,” heterogeneous systems outperform homogeneous ones.

The stronger version of the claim is that, under real-world constraints, heterogeneous resources shift the Pareto front of what systems can achieve. This Pareto-front language is the bridge between theory and engineering. The argument is not that any cheaper model should replace a frontier model. It is that a system can improve its frontier of cost, latency, and capability when it has enough diversity — of models, chips, and workflows — and a way to route work to the right part of that system.

Callosum’s practical implementation is organized across three layers. At the hardware layer, agents are assigned to different hardware depending on their computational demands. At the agent layer, systems can combine frontier and open models, use multimodal video-action-language models, manage KV cache, and use latent communication. At the workflow layer, Callosum works on automated workload decomposition and a primitive Bertagnoli calls heterogeneous recursion.

This is why he repeatedly describes intelligence as a systems-level problem. The claimed gains do not come from a single substitution. They come from decomposing the work, identifying what each subtask actually requires, and then choosing a model-chip pairing and workflow structure that can satisfy that requirement with less cost or latency.

Recursive long-context work can be split across models and chips without giving up frontier-level accuracy

Adrian Bertagnoli’s first long-context example is heterogeneous recursion. He introduces it by referring to recursive language models, which he says were described in a paper from MIT the previous October. In his account, that work showed that models can suffer “context rot” even when only a small percentage of the context window is occupied, depending on the information complexity required by the prompt.

He distinguishes several kinds of information requirements. A needle-in-a-haystack task has constant information demand: the system needs one answer regardless of how large the prompt becomes. Adding up rows in a table has linear demand: as the prompt grows, the required information grows with it. Constant-demand tasks scale well across the full context window, but when the information requirement becomes linear or quadratic, performance degrades at roughly 60% to 30% of the context window.

Recursive language models address this, in Bertagnoli’s account, by treating the context as an environment rather than stuffing all of it into a prompt. Instead of passing the entire context to the model, the context sits in a file. A coding agent interacts with it programmatically through a Python REPL, using keyword searches, regex, and other methods to extract relevant sub-contexts. That sub-context can then be passed to an identical recursive agent, which can either answer the question or spawn another recursive agent.

Callosum’s extension is to make that recursion heterogeneous. Rather than using a single model on a single chip at every recursion depth, the system maps sub-contexts to different models and chips. Different context chunks at different depths can therefore be run on different model-chip pairings to save cost and time while aiming to emulate frontier-model performance.

Bertagnoli reports Callosum’s results on the ULong benchmark, which he says was the benchmark used in the recursive language model paper. He describes a GPT-5 or GPT-5.2 comparison point as taking around 2,000 seconds and costing around $3.75 for one task at the time Callosum produced the work. Callosum’s reported framing is “12x cheaper and 5x faster on long contexts” while “matching GPT-5 level accuracy,” with a second visible line saying “All models matched with GPT-4.”

Against the frontier-model comparison point, Bertagnoli says Callosum’s system running on Cerebras was seven times cheaper and five times faster. That configuration is labeled “Cerebras [Qwen-2-235B].” He says a SambaNova configuration pushed cost down further, with some latency tradeoff relative to Cerebras: 12 times cheaper and three times faster. That configuration is labeled “SambaNova [GPT-08S-120B].”

Callosum-reported comparison	Reported result on ULong
Cerebras configuration, labeled “Qwen-2-235B”	7x cheaper and 5x faster than the GPT-5 / GPT-5.2 comparison point
SambaNova configuration, labeled “GPT-08S-120B”	12x cheaper and 3x faster, with Bertagnoli noting lower cost at the expense of some latency
GPT-5 / GPT-5.2 comparison point described by Bertagnoli	Around 2,000 seconds and around $3.75 per task
Accuracy framing in the source	“Matching GPT-5 level accuracy” and “All models matched with GPT-4” both appear

Callosum-reported long-context results from heterogeneous recursion

The point is that these are architectural decisions rather than simple hardware substitutions. Different recursion depths and context chunks have different computational needs. By routing them accordingly, Bertagnoli says, the system can “emulate” the intelligence of frontier models while making large differences in price and latency.

Routing has to become automated if heterogeneity is to scale beyond bespoke choices

The question after the presentation goes directly at the implementation burden. An audience member asks how Callosum decides which tasks to run on faster, cheaper models. Is zoom hardcoded as a simple rule — “if you have to zoom, use this model” — or is there a smarter model that decides what to delegate?

Adrian Bertagnoli says Callosum began with bespoke decisions, mapping certain simple subtasks to simple models. But the company has since built an automation layer that detects task complexity and automatically predicts the best-suited model and hardware.

That answer narrows the infrastructure claim. A system can save money by manually assigning zoom to a smaller visual model, but Bertagnoli’s broader argument requires more than a library of hand-coded exceptions. It requires an interface that can unify multiple model types and chip types, then make routing decisions based on the task rather than requiring engineers to predefine every subtask.

The compute-history framing is broad but consistent with that implementation goal. Bertagnoli describes the first compute era as dominated by CPUs, which “made compute quicker.” The second, dominated by NVIDIA, made compute massively parallel. He casts the third paradigm as heterogeneous compute: mapping multi-agentic workloads onto different chips and optimizing those mappings.

He also says Callosum is working with ARIA, the UK institute, with a £3 million grant to operate what he describes as the first heterogeneous co-located cluster in the UK. The stated aim is to unlock “a new frontier of AI capabilities and cost for large-scale agentic AI.”

His closing formulation is that homogeneous scaling delivered extraordinary progress and should be treated with gratitude, not dismissed. But the next stage, he argues, is one in which models, workflows, and silicon co-evolve, and each new source of diversity can make the system “smarter, faster, and cheaper.” He ends with a line that captures the infrastructure argument: “This is the worst our infrastructure will ever be.”

AI Application Architecture Evals and Benchmarks Inference and Deployment Agents and Autonomy Multimodal AI AI Infrastructure and Compute