Macrocosmos Targets 70B-Parameter Training on 5,000 Distributed Nodes

Steffen Cruz · Eye on AI · Monday, May 25, 2026 · 14 min read

Steffen Cruz, co-founder and CTO of Macrocosmos, argues that frontier AI training is approaching an economic ceiling as larger models require multi-billion-dollar, centralized GPU build-outs. Macrocosmos’s alternative, built inside the BitTensor ecosystem, is IOTA: a distributed training network that uses blockchain for identity, coordination, auditability, and payment while training happens off-chain across idle or underused machines. Cruz says the system has reproduced baseline benchmark performance and now needs to prove it can train enterprise-relevant models, starting with a target of 5,000 nodes and a model of roughly 70 billion parameters.

The hard ceiling is not only compute, but who can afford to assemble it

Steffen Cruz frames Macrocosmos around a simple constraint: frontier model training has become tied to enormous, centralized GPU build-outs, and those build-outs are beginning to resemble national-scale science infrastructure. The largest systems, he says, are warehouses packed with hundreds of thousands of GPUs connected with extremely high-speed networking so they can behave like a single computer. That architecture has worked because more compute and more data have predictably improved models under scaling laws.

His concern is that this paradigm has an economic ceiling. Projects such as Stargate and Colossus, in Cruz’s telling, are multi-billion-dollar GPU build-outs. He compares the trajectory to fundamental physics: at some point, building a larger collider requires the budget of a nation-state, and the field has to consider different experiments because the next brute-force scale-up becomes “unpalatable.”

Macrocosmos is pursuing one of those different experiments: distributed training. Instead of putting the whole training job inside one massive facility, the company is trying to train large language models across compute nodes distributed around the world. The promise is not merely lower capital expenditure. Cruz argues that a distributed network can arbitrage energy and compute in ways a fixed data center cannot. If there is surplus cheap energy in Iceland for 12 hours, the system can target that pocket of compute for the duration. If capacity is idle elsewhere, it can be folded into the training run.
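The arbitrage idea can be made concrete with a small sketch. It is illustrative only: the `ComputePocket` type, the prices, and the greedy cheapest-first rule are assumptions made for this example, not a description of how IOTA actually schedules work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputePocket:
    """A temporary pool of compute. All fields are hypothetical."""
    region: str
    price_per_node_hour: float  # assumed price in dollars
    nodes: int                  # nodes available in this pocket
    hours: int                  # how long the pocket stays available


def cheapest_allocation(pockets, node_hours_needed):
    """Greedily cover the requirement from the cheapest pockets first."""
    plan = []
    remaining = node_hours_needed
    for pocket in sorted(pockets, key=lambda p: p.price_per_node_hour):
        if remaining <= 0:
            break
        capacity = pocket.nodes * pocket.hours
        take = min(capacity, remaining)
        plan.append((pocket.region, take, take * pocket.price_per_node_hour))
        remaining -= take
    if remaining > 0:
        raise ValueError("not enough capacity for the requested node-hours")
    return plan


pockets = [
    ComputePocket("fixed-datacenter", 1.00, 500, 24),  # 12,000 node-hours
    ComputePocket("iceland-surplus", 0.25, 100, 12),   # 1,200 node-hours
]
plan = cheapest_allocation(pockets, 2000)
total_cost = sum(cost for _, _, cost in plan)
for region, node_hours, cost in plan:
    print(f"{region}: {node_hours} node-hours for ${cost:.2f}")
```

With these made-up numbers, the 12-hour surplus pocket is used up first and only the remainder falls back to the more expensive fixed capacity.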

Today we are training models, we're training multiple models all at once using pockets of cheap energy, which translates into cheap compute.

Steffen Cruz · Source

Cruz expects the mainstream view of model training to shift by 2028 toward methods that arbitrage cost and energy more efficiently. Macrocosmos is trying to do the early work before that shift becomes unavoidable.

The strategic claim underneath the infrastructure work is broader than cost. Cruz says the people who train and own models will have “an enormous advantage.” In that context, BitTensor and Macrocosmos are presented as a way to create, train, and deploy models outside closed “walled gardens,” so that users are not limited to paying for privileged access to systems controlled by a small number of organizations.

IOTA has reproduced baseline metrics, but customer training is still ahead

Macrocosmos’s flagship project inside BitTensor is IOTA, short for Incentivized Orchestrated Training Architecture. It is focused on distributed pre-training of large language models, not primarily on fine-tuning.

That distinction matters because pre-training is the expensive part Cruz is trying to attack. He describes it as the phase where models “systematically inhale the entire internet,” often over months and across tens of thousands of GPUs. Fine-tuning, by comparison, is a shorter computational task. Macrocosmos is therefore aiming at the long-horizon workload that makes model creation expensive in the first place.

IOTA is online as a project, and Cruz directs interested users to iota.macrocosmos.ai and macrocosmos.ai. But he says Macrocosmos does not yet have customers training through the network. The company is emerging from research mode and is still calibrating and tightening the system.

Cruz says Macrocosmos has spent about nine months on IOTA and has reproduced important baseline benchmark performance metrics using its distributed approach. He compares the input compute to “wonky vegetables” — ugly produce no one wants to buy — that can be turned into “premium soup” if treated correctly. The metaphor carries his technical and commercial argument: the compute may be heterogeneous, interruptible, or underutilized, but with the right orchestration Macrocosmos believes it can become useful training capacity.

The commercial validation Cruz points toward is in the next set of deployments. He says startups are already ready and committed to working with Macrocosmos on model training in the second half of the year, with collaboration results and early partnership results expected by the end of summer.

BitTensor is a reward and coordination layer, not a place where the model lives

Host Craig Smith presses the question that often gets obscured in blockchain-AI projects: what is actually on the chain? Smith states the issue plainly. A blockchain, in effect, is a registry of addresses pointing toward registered assets. In AI projects, he asks, is it merely a registry pointing to an AI project, or is something more happening?

Cruz’s answer is pragmatic. In BitTensor, the blockchain is an immutable, globally shared database. It can serve as a store of value, a reward mechanism, a coordination layer, and a shared synchronization clock. The clock matters because the notion of blocks gives participants a common “drumbeat” to anchor work to. But Cruz does not describe the chain as the place where training data or compute lives. The chain is part of the control, identity, audit, and incentive substrate.

Smith summarizes the point directly: there is no training data on the chain and no compute on the chain. The compute might reside in Iceland, on someone’s machine at home, or on spare GPUs somewhere else. The deeper question is the link between that off-chain compute and the blockchain.

For IOTA, the chain supplies several functions. It registers who is in the network, exposes available workers, provides unambiguous identities and addresses, and supports payout. An off-chain layer tracks the contribution of each node during a training job. Once work is computed, that contribution record is sent back down to the blockchain layer, where it triggers payment in the IOTA token.
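As a rough illustration of that flow, the sketch below keeps an off-chain record of work per registered node and splits a reward pool in proportion to it. The `ContributionLedger` class, the notion of "work units," and the proportional rule are assumptions for the example; the interview does not specify how IOTA measures or prices contributions.

```python
from collections import defaultdict
from fractions import Fraction


class ContributionLedger:
    """Hypothetical off-chain record of work done by registered nodes."""

    def __init__(self, registered_nodes):
        # Stands in for the on-chain registry of who is in the network.
        self.registered = set(registered_nodes)
        self.work = defaultdict(int)

    def record(self, node_id, work_units):
        """Credit a node with completed work during a training job."""
        if node_id not in self.registered:
            raise KeyError(f"unregistered node: {node_id}")
        if work_units < 0:
            raise ValueError("work_units must be non-negative")
        self.work[node_id] += work_units

    def payouts(self, reward_pool):
        """Split the pool in proportion to recorded work (assumed rule)."""
        total = sum(self.work.values())
        if total == 0:
            return {}
        return {
            node: Fraction(reward_pool) * units / total
            for node, units in self.work.items()
        }


ledger = ContributionLedger(["node-a", "node-b", "node-c"])
ledger.record("node-a", 30)
ledger.record("node-b", 20)
ledger.record("node-a", 30)
rewards = ledger.payouts(1000)
print({node: float(amount) for node, amount in rewards.items()})
```

Exact fractions are used so the individual payouts always sum to the pool; the resulting totals are the kind of compact record that could be written back to the chain to trigger payment.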

Cruz says this could also be articulated in dollar terms rather than token terms. The necessary primitives are the same: know who is assigned to the training experiment, know where to find them, know how to connect them, and know how much work they performed.

The blockchain’s auditability is valuable, but Cruz also warns against putting too much on-chain. Writing billions of transactions to a shared database makes the database large and unwieldy, which can undercut decentralization if only a few parties can maintain full copies. His preferred architecture is a “pragmatic middle ground”: store exactly what is needed for transparency, audit, review, and verification, and keep the rest elsewhere.

I think that there's a pragmatic middle ground where you store exactly what you need, but nothing more, to ensure the core parts of the work that you care about are captured and transparent for everyone to audit, review and verify, whereas the rest of it, you can store elsewhere.

Steffen Cruz · Source

That distinction is central to Macrocosmos’s claim. The chain is not performing training. It is making it possible to coordinate and compensate untrusted, distributed participants in a way the participants can inspect.
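One common way to get that kind of middle ground, offered here as an assumption and not as Macrocosmos's documented design, is to keep detailed records off-chain and publish only a cryptographic digest of them. Anyone holding the off-chain records can recompute the digest and check it against the published value.

```python
import hashlib
import json


def commitment(records):
    """Return a SHA-256 digest of the records in a canonical JSON form."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(records, published_digest):
    """Check off-chain records against the digest that was published."""
    return commitment(records) == published_digest


# Hypothetical contribution records that would live off-chain.
records = [
    {"node": "node-a", "work_units": 60},
    {"node": "node-b", "work_units": 20},
]
digest = commitment(records)  # only this short string would go on-chain

tampered = [dict(r) for r in records]
tampered[1]["work_units"] = 200

print(verify(records, digest))   # the original records match
print(verify(tampered, digest))  # altered records do not
```

The published value stays at 64 hexadecimal characters regardless of how many records sit behind it, which is the property that keeps the shared database small.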

The core engineering problem is making unreliable machines behave like one training system

Macrocosmos operates three subnets within BitTensor, each of which Cruz describes as a separate project or service. IOTA is the one devoted to distributed training. Its technical challenge is not simply sending work to remote machines. It is creating stable training runs out of unstable supply.

Cruz describes the underlying problem bluntly: the network consists of unreliable compute. People join, people leave, devices churn, and the available hardware is heterogeneous. Macrocosmos wants to create something persistent and stable out of something noisy and unstable. The immediate use case is training models, but Cruz presents the work as a more general infrastructure problem. A similar decentralized network, he says, could eventually be useful for university high-performance computing jobs in bioinformatics, physics, or other fields where the question is how to obtain thousands of nodes for a limited period at the lowest cost.
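A toy version of the churn problem: when nodes leave, the pieces of work they held have to land somewhere else. The sketch below moves orphaned shards to the least-loaded surviving nodes. The function name, the shard numbering, and the least-loaded rule are all assumptions for illustration; the interview does not describe IOTA's actual recovery logic.

```python
def reassign_orphans(assignments, departed):
    """Move shards held by departed nodes to the least-loaded survivors.

    assignments: dict mapping node id -> list of shard ids (hypothetical)
    departed:    set of node ids that have left the network
    """
    survivors = {
        node: list(shards)
        for node, shards in assignments.items()
        if node not in departed
    }
    if not survivors:
        raise RuntimeError("no nodes left to host the work")
    orphans = [
        shard
        for node in sorted(departed)
        for shard in assignments.get(node, [])
    ]
    for shard in orphans:
        # Ties are broken by node id so the result is deterministic.
        target = min(survivors, key=lambda n: (len(survivors[n]), n))
        survivors[target].append(shard)
    return survivors


assignments = {"a": [0, 1], "b": [2, 3], "c": [4, 5]}
after_churn = reassign_orphans(assignments, {"b"})
print(after_churn)
```

A real system would also have to move the state behind each shard and decide what to do with work in flight, which is where most of the difficulty lies.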

The orchestration layer, which Macrocosmos maintains, is described as reminiscent of Kubernetes, but for heterogeneous compute nodes around the world. A user dispatches containerized code from a laptop or similar environment. Macrocosmos sends it securely to backend nodes, wherever they are located. The user can monitor logs, trace data, and normal dashboarding outputs. Underneath, code has been cloned and replicated across many devices.

The important architectural point is that each node is not running a full copy of the model. Each hosts a section of it. Cruz calls this model parallelism: large models are split into smaller pieces, and information must be routed across those pieces during training so the distributed architecture can behave as though it were one coherent model.

We don't actually have each of these compute nodes running a full copy of the model. They're actually running a small sliver of the model. It's called model parallelism.

Steffen Cruz · Source

That routing requirement is why Macrocosmos needed nine months of research before Cruz felt it was emerging from research mode. The “tunnel” among machines is what turns random pieces of compute into one larger system: it must let sections of the model communicate and be knitted together during training.
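The idea of each node holding only a sliver of the model can be shown in miniature. The sketch below splits a stack of layers into contiguous stages, one per node, and passes the intermediate result from stage to stage. Splitting by layer is one form of model parallelism chosen here for simplicity; the interview does not say how IOTA partitions models, and the toy "layers" are plain arithmetic functions, not neural network layers.

```python
def split_into_stages(layers, num_nodes):
    """Assign contiguous slices of layers to nodes, as evenly as possible."""
    base, extra = divmod(len(layers), num_nodes)
    stages, start = [], 0
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages


def run_stage(stage, activation):
    """Run the layers held by a single node."""
    for layer in stage:
        activation = layer(activation)
    return activation


def forward(stages, x):
    """Pass the activation through every stage in order.

    Each hand-off between stages stands in for a network hop between nodes.
    """
    for stage in stages:
        x = run_stage(stage, x)
    return x


# Ten toy layers; layer k computes v * 2 + k.
layers = [lambda v, k=k: v * 2 + k for k in range(10)]
stages = split_into_stages(layers, 4)

print([len(stage) for stage in stages])
print(forward(stages, 1) == run_stage(layers, 1))
```

The final comparison checks the property Cruz describes: the split model should behave as though it were one coherent model.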

Cruz’s account also clarifies what the user should not have to manage. The demand-side user should not be manually choosing where every GPU comes from or specifying that a particular job should use a device in a particular country for a particular number of minutes. Macrocosmos wants the system to handle the arbitrage and orchestration while leaving the user with recognizable controls over the training objective and the parameters that matter for reproducibility and determinism.

The commercial bet is a two-sided market for idle GPUs and cheaper model training

Macrocosmos sees IOTA becoming a two-sided market. On the supply side are owners of surplus compute: neo-clouds, hyperscalers, people with GPUs, and consumer-device owners with supported hardware. Cruz gives the example of a provider with 10,000 GPUs that can rent out only 9,000 or 9,500 at a given time. Plugging the unused capacity into IOTA could raise utilization and flow directly to the provider’s bottom line.

He is especially interested in short gaps. A GPU may be rented for four hours, idle for two hours, then rented again. Macrocosmos wants to turn those interruptible bursts into a continuous stream of distributed training compute. Cruz argues that if suppliers otherwise rent those GPUs at “cents on the dollar” for inference tokens, training may offer a better margin because it is a “higher order commodity” than inference.
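The short-gap pattern can be sketched as a simple calculation over a rental schedule. The `idle_gaps` helper and the 12-hour window are assumptions for the example; bookings are taken to be `(start_hour, end_hour)` pairs that fall inside the window.

```python
def idle_gaps(bookings, window_start, window_end):
    """Return the idle intervals between bookings inside a time window."""
    gaps = []
    cursor = window_start
    for start, end in sorted(bookings):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Rented for four hours, idle for two, then rented again for four.
bookings = [(0, 4), (6, 10)]
gaps = idle_gaps(bookings, 0, 12)
idle_hours = sum(end - start for start, end in gaps)

print(gaps)
print(idle_hours)
```

Each gap is a candidate burst of training compute; the open problem Cruz describes is stitching many such bursts, across many providers, into something continuous.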

On the demand side are researchers, startups, cash-constrained companies, and academics who want to train more models but cannot afford the conventional cost structure. Cruz expects the world to train more models over the next decade, not fewer. He cites possible enterprise demand for sovereign or in-house specialist models: legal models, medical models, and other domain-specific systems that organizations may have wanted but deferred because of the price tag.

The interface Macrocosmos wants to offer these users should not require them to reason about where each GPU is. Cruz says demand-side users should be able to work in familiar abstractions, including common libraries such as PyTorch and TensorFlow. They should specify the objective, the parameters they need to control for reproducibility or determinism, and let Macrocosmos arbitrage the rest.

No one, in Cruz’s words, should have to specify that they want “a Nigerian GPU for 12 minutes.” The system should procure the cheapest appropriate compute while preserving the controls the training job requires.

Cruz’s cost target is clear: Macrocosmos wants to be able to point to a portfolio of models and say they are as good as models trained in a centralized context, but at 10% or 20% of the cost. That is the demonstration he says would make the proposition difficult for customers to ignore.

The near-term scale target is 5,000 nodes and a 70 billion parameter proof point

The scale Macrocosmos has today is not comparable to the largest centralized AI data centers. Cruz says the largest data centers coming online are measured in hundreds of thousands of GPUs and are multi-billion-dollar projects. Macrocosmos is not there. Its “nodes” are also not presented as a one-for-one equivalent to those centralized GPU counts; Cruz distinguishes Macrocosmos’s current distributed network from the scale of those facilities.

Its argument is that the scaling mechanism is different. In a centralized build-out, doubling capacity means more physical infrastructure. In Macrocosmos’s model, Cruz says going from 10,000 nodes to 20,000 nodes does not require a fundamental change in the technology. It requires expanding the network’s distribution surface.

BitTensor itself originally imposed limits. Cruz says that in November or December of the previous year, Macrocosmos wanted to go beyond the scale natively supported by BitTensor. IOTA could have only 256 participants in experiments, which Cruz calls tiny if considered as a data center. Macrocosmos worked around that by tracking participants off-chain rather than treating every unique entity as a unique blockchain address. That allowed an arbitrary-size participant registry while preserving transparent payouts.

He says 2,500 people downloaded the macOS app in the first two weeks after launch, and that around 500 were running at the time of the interview. Macrocosmos does not pay for idle supply continuously; it pays for experiments and production runs when there is work to assign.

5,000 nodes: Macrocosmos's stated target for the middle of the year

Cruz’s stated milestone is 5,000 compute nodes by the middle of the year. He calls that a respectable cluster size, one capable of training a model that will get attention and demonstrate the utility of the technology. The accompanying model-size target is around 70 billion parameters. In his view, that is where the work moves beyond proof-of-concept small models into a category enterprise customers take more seriously.
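For a sense of proportion, the back-of-envelope calculation below divides a 70-billion-parameter model evenly across 5,000 nodes. Everything beyond those two figures is an assumption: an even split with no redundancy, 2 bytes per parameter for weights alone, and a commonly used rule of thumb of about 16 bytes per parameter for mixed-precision training with an Adam-style optimizer. None of these numbers comes from the interview.

```python
TOTAL_PARAMS = 70_000_000_000  # roughly 70B, the stated model-size target
NODES = 5_000                  # the stated node target

BYTES_WEIGHTS_ONLY = 2     # assumption: 16-bit weights
BYTES_TRAINING_STATE = 16  # assumption: weights, gradients, optimizer state


def per_node_share(total_params, nodes):
    """Parameters per node under an even split with no redundancy."""
    return total_params // nodes


def megabytes(params, bytes_per_param):
    """Memory footprint in megabytes (1 MB = 1,000,000 bytes)."""
    return params * bytes_per_param / 1_000_000


share = per_node_share(TOTAL_PARAMS, NODES)
weights_mb = megabytes(share, BYTES_WEIGHTS_ONLY)
training_mb = megabytes(share, BYTES_TRAINING_STATE)

print(f"{share:,} parameters per node")
print(f"{weights_mb:.0f} MB for weights alone")
print(f"{training_mb:.0f} MB including training state")
```

Under these assumptions each node would hold 14 million parameters. A real deployment would need more per node to cover activations and the redundancy required to survive churn, so the figures are a lower bound on the idea, not an estimate of IOTA's requirements.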

Within a year to a year and a half, Cruz wants to go beyond 100 billion parameters. Parameter count is the number Macrocosmos uses internally as a proxy for whether the technology will be taken seriously. The company wants a set of models that can support its claim that distributed training can produce serious outputs at materially lower cost than centralized alternatives.

Milestone	Cruz's stated significance
256 participants	The original BitTensor-native participant limit for IOTA experiments, described as tiny for a data-center comparison.
2,500 macOS downloads	The number of people Cruz says downloaded the app in the first two weeks after launch.
About 500 running	The approximate number of active devices Cruz cites at the time of the interview.
5,000 compute nodes	The middle-of-year target Cruz says would be a respectable cluster for attention-getting training.
70B parameters	The model size Cruz sees as the graduation point from proof of concept to enterprise-relevant models.
100B+ parameters	The scale Cruz wants to exceed in roughly a year to a year and a half.

Macrocosmos's stated scale markers for IOTA

Consumer hardware is part of the thesis, but not CPUs

One of Cruz’s more concrete near-term examples is the rise of personal AI agents running on dedicated machines. He says he has seen people stockpiling Mac minis as agents become economically useful. The pattern he describes is a private machine running around the clock: checking email, shopping online, doing work or hobbies, and handling personal tasks.

Macrocosmos sees those machines as a source of unused compute. The company has created “Train at Home,” which lets people connect unused devices — MacBooks, Mac minis, and consumer GPUs — to the network. The device then becomes part of what Cruz calls a global supercomputer, available for model training when not otherwise needed. In return, the owner can earn passive income.

The analogy Cruz uses is property rental: if you own a property, you can Airbnb it when you do not need it. If your agent only needs four productive hours per day, the rest of the device’s capacity can be rented into the network. He speculates that an agent could eventually decide on its own that it has finished its work and can use the machine to earn money until the owner returns.

This is also where Cruz broadens the purpose of BitTensor. He describes it as a set of places where a machine can go to earn income, whether by providing human intelligence, training models, detecting whether an image is real or fake, or monetizing raw compute. He likens it to a mechanical Turk for agents and humans alike.

Cruz is specific about the current hardware constraint. Macrocosmos is not expecting CPUs to perform today’s model-training work. It supports CUDA devices and Apple silicon devices, including newer Mac minis and MacBooks. CPUs may matter for future workloads outside model training, including scientific tasks that are CPU-limited, but for current AI training, the relevant devices are GPUs and Apple silicon.

The consumer-compute path Cruz describes is therefore narrower than “any idle machine can train frontier models.” His claim is that supported devices can be stitched into larger training jobs when idle, and that the same orchestration approach may have value beyond AI training if the workload fits the hardware.

Macrocosmos also treats data as a separate decentralized layer

Smith asks whether the same general approach could be used not only for compute but for data — for example, registering or making data available so a model could train across different blocks of data.

Cruz answers by distinguishing another Macrocosmos project from IOTA. Macrocosmos has a separate web-scale data scraping service called Data Universe, which distributes the scraping of social media data across hundreds of miners. Cruz says that data can be used for journalism, marketing, brand analysis, and AI model training.

He presents data, compute, and “innovation networks” as parts of a virtuous cycle, though he does not elaborate on the innovation networks in detail. His answer does not make IOTA a private client-data federation system; it points to a dedicated, highly scalable scraping system that Macrocosmos already operates separately.

That distinction also separates Macrocosmos’s work from federated learning as Smith describes it. Cruz says federated learning is built around anonymizing client data while allowing models to train continuously. Macrocosmos’s IOTA work, by contrast, is about edge devices providing training compute rather than training data. Data Universe addresses data through a different subnet and system.

The replication question is how much of the stack others will combine

Smith asks whether this model will be replicated by others, given rising demand for pre-training compute, growing model counts, expensive compute, and underutilized hardware. Cruz says he already sees other teams working on similar projects. Some are focused on model training; others are focused on orchestration technology. He believes Macrocosmos is among the few currently combining the two.

His broader view is that devices will become more connected over time. He describes this almost as a thermodynamic tendency: “things get more connected over time.” As devices become more intelligent and autonomous, their usefulness will increase if they can be combined like Lego bricks into larger constellations. Many problems, not just AI training, exceed the resources of any one device. Macrocosmos’s bet is that orchestration across many devices will become valuable wherever a workload needs a particular topology of compute.

Cruz does not claim Macrocosmos will necessarily be the company that takes the whole category “to the top of the mountain.” He says the work is valuable even if Macrocosmos gets only part of the way there.

The open question is whether Macrocosmos can convert unreliable, interruptible supply into training runs that customers will adopt on cost, reproducibility, and performance. In Cruz’s account, the blockchain handles identity, auditability, coordination, and payment; the off-chain orchestration layer has to make the machines behave like a training system; and the market only works if the result is cheaper without asking customers to accept an unfamiliar or fragile workflow.
