Applied AI’s Acceleration Meets Its Conditions
OpenAI, Anthropic, and SpaceX are trying to finance larger AI bets as losses, infrastructure needs, and public tolerance become harder to separate from the growth story. Across Macrocosmos, Kaggle, OLIVER, Braintrust, Google, and DeepMind, the same pressure shows up in different forms: cost, evaluation, deployment fit, organizational ownership, and proof in the physical world.

1. The AI boom is trying to get financed before the story gets harder
The market-facing story around applied AI is still one of acceleration: bigger demand, bigger models, bigger infrastructure, bigger addressable markets. Alex Kantrowitz and Ranjan Roy’s discussion on Big Technology puts a sharper frame on that acceleration. OpenAI, Anthropic, and SpaceX are not just building into demand; they are trying to sell public investors on future AI demand before the current growth narrative becomes more difficult to defend.
OpenAI’s possible IPO is the cleanest example because the reported numbers cut both ways. The Information reported OpenAI generated about $5.7 billion in first-quarter revenue, nearly $1 billion more than Anthropic in the same period, while also posting a negative 122% adjusted operating income margin. Kantrowitz translates that as OpenAI losing $1.22 for every dollar of revenue, even after excluding large items such as stock-based compensation.
Roy’s argument is not that those losses are surprising. He says everyone already knew the AI labs were burning money. The change is that public-company filings would expose the burn in standardized form. Operating income still includes core business costs — cost of goods sold, research and development, sales and general administration, people and salaries — so the figure matters. But Roy treats the deeper question as structural: AI economics may not resemble software economics. Compute is not merely customer acquisition spend that can be turned down later. It scales with use.
That is why the OpenAI-Anthropic comparison is ambiguous rather than obviously favorable to either company. Anthropic’s reported first-quarter sales were $4.8 billion, with The Wall Street Journal reporting an expected jump to $10.9 billion in the June quarter and a projected $559 million operating profit. To Wall Street, Roy says, profitability would be a powerful signal. To Kantrowitz, profitability can also imply underbuilding if demand is truly exploding and the company should be investing aggressively in infrastructure.
| Company | Reported signal | Why it is hard to interpret |
|---|---|---|
| OpenAI | $5.7B in Q1 revenue; -122% adjusted operating income margin | Losses can be read as reckless burn or as the cost of building enough infrastructure to meet future demand. |
| Anthropic | $4.8B in Q1 sales; projected $10.9B in Q2; projected operating profit | Profitability can be read as discipline, temporary timing advantage, or evidence the company has underbuilt relative to demand. |
| SpaceX | $4.28B Q1 net loss on $4.69B revenue; $28.5T total addressable market claim | A company with real space and satellite assets is asking investors to value a vast, still-speculative AI opportunity. |
The IPO race, in Kantrowitz’s reading, is partly a race to define the market before a rival does. If Anthropic reaches public investors first with faster growth and a profitable quarter, OpenAI may have to explain why its deeper losses are evidence of future strength rather than current weakness. If OpenAI moves first, it can ask the market to underwrite the view that infrastructure buildout is the bottleneck and that current losses are the price of owning future demand.
Roy adds a more skeptical reading: if companies can raise enormous sums privately, an IPO may be less about financing than liquidity and narrative transfer. But he also gives a governance version of the same point. If AI labs claim to be building systems with major public consequences, public filings would make it harder to rely on selective disclosures.
SpaceX widens the issue beyond AI labs. Bloomberg reported SpaceX showed a $4.28 billion first-quarter net loss on $4.69 billion of revenue, and Kantrowitz focused on the company’s $28.5 trillion total addressable market claim, with 93% of that opportunity described as AI. In his reading, SpaceX is pitching not just rockets and Starlink but an AI infrastructure future that includes possibilities such as data centers in space. Roy says Starlink appears to be a real and profitable business in the numbers he cites, but not the center of the IPO story as discussed. The center is the scale of future AI demand.
The practical risk is that financing AI at this scale now requires more than technical progress. It requires public capital, credible economics, infrastructure permission, and a public story that does not provoke political resistance. Kantrowitz points to university commencement audiences booing AI references, including at speeches by former Google CEO Eric Schmidt and other speakers. Roy connects the backlash to a failure by AI leaders to describe a positive social future, rather than job displacement, surveillance, and elite indifference.
The public mood matters because the buildout itself depends on democratic tolerance. Data centers need sites, power, water, permits, and local acceptance. AI companies may want public-market capital, but Kantrowitz and Roy’s discussion points to a second dependency: infrastructure also needs public permission.
2. Compute abundance is becoming a cost problem, not just an engineering problem
Macrocosmos offers the inverse of the centralized infrastructure story. Steffen Cruz, the company’s co-founder and CTO, argues that frontier training is approaching an economic ceiling because the dominant route to better models has become multi-billion-dollar, centralized GPU buildout. In his analogy, the field starts to resemble fundamental physics: eventually, the next larger collider requires nation-state-scale capital, and alternative experiments become necessary.
Macrocosmos’s alternative is IOTA, an Incentivized Orchestrated Training Architecture built inside the BitTensor ecosystem. The claim is not that a blockchain trains models. Cruz is explicit that training data and compute are not on-chain. The chain supplies identity, coordination, auditability, a shared clock, incentives, and payment. The training happens off-chain, across distributed machines that Macrocosmos is trying to make behave like one coherent training system.
Model parallelism is the core technical mechanism. Cruz says each node runs a sliver of the model, with information routed across the distributed system during training. That is also the hard part. The network consists of unreliable, heterogeneous machines: devices join, leave, vary in hardware, and may be idle only for short windows. Macrocosmos has to turn that unstable supply into a training run customers can understand, reproduce, and buy.
The near-term proof point is concrete: 5,000 nodes and a roughly 70 billion-parameter model. Cruz says that would move IOTA beyond small proof-of-concept models into a category enterprise customers take more seriously. Within a year to a year and a half, he wants Macrocosmos to exceed 100 billion parameters.
The commercial logic follows directly from the financing pressure in the IPO discussion. If OpenAI, Anthropic, and SpaceX are asking investors to believe in enormous AI infrastructure demand, then any credible way to lower, redistribute, or arbitrage training cost becomes strategically important. Cruz says Macrocosmos wants to show models trained at 10% or 20% of the centralized cost, while reaching comparable quality. That is not yet proven commercially; he says customer training is still ahead, with startups expected to work with Macrocosmos in the second half of the year.
The demand side is plausible because more organizations may want custom, sovereign, or domain-specific models without paying frontier-lab cost structures. Cruz names legal, medical, and other specialist models as examples. The supply side is a utilization story: neo-clouds, hyperscalers, GPU owners, and supported consumer hardware may have idle or interruptible capacity. Macrocosmos wants to turn those gaps into training compute.
The consumer-hardware claim is narrower than “any idle laptop can train frontier models.” Cruz says current AI training relies on CUDA devices and Apple silicon, not CPUs. Macrocosmos’s “Train at Home” can connect MacBooks, Mac minis, and consumer GPUs to the network when they are idle. The more general idea is that devices are becoming more connected and more autonomous, and that orchestration across many machines may matter wherever a workload can be distributed.
This is a cost problem expressed through an architecture problem. The technical challenge is making unreliable machines act like a single training fabric. The business challenge is proving that the cheaper fabric is good enough for customers who care about quality, reproducibility, and workflow familiarity.
3. Agents are no longer mainly a capability story; they are an evaluation and design story
The technical core of applied AI is shifting from “what can the model do?” to “what does the system around the model make possible, measurable, and reliable?” Kaggle’s Nicholas Kang and Michael Aaron describe the measurement side of that shift. Angus McLean of OLIVER describes the builder side. Together, they make the same practical point from different directions: agent performance is not a single model number.
Kang says today’s AI evaluations are “kinda broken” not because benchmarks do not exist, but because results are hard to reproduce, setups are often opaque, and benchmark authorship is too narrow. Agent benchmarks make this worse. A result may be measuring the model, the scaffold, the harness, the tools, the execution environment, or the grading process. Aaron points to a SWE-bench Pro-related claim that six frontier models were within a couple of percentage points of one another, while the harness around the model could move performance by 22 percentage points. He says he had not independently verified that specific claim, but that it did not seem unlikely.
That means “agent quality” can be a property of the whole system. Hidden choices — compaction, API settings, prompts, tool access, trace handling, execution constraints — can swing results. Kaggle’s answer is to make evaluation more like shared infrastructure: public benchmark creation, transparent runs, standardized agent exams, model-versus-model game arenas, and community participation.
The wastewater treatment benchmark is the most useful example because it shows what benchmark coverage misses when only AI researchers write the tests. Kang describes a wastewater plant engineer in Turkey with 20 years of field experience who built a benchmark covering material selection, root-cause analysis, safety protocols, and process-chart reasoning. The benchmark reflected field knowledge not likely to exist in standard training data and not likely to be invented by an AI lab. In Kang’s view, if AI systems are expected to work across the long tail of real jobs, evaluation has to come from people who know those jobs.
McLean’s advice from OLIVER starts where the benchmark problem becomes a design problem. OLIVER uses AI in production advertising workflows, generating around 4,000 assets a day for more than 200 brands, with real media spend behind some of those assets. In that setting, the useful agent is not the one with the most autonomy. It is the one with the right constraints.
His strongest recommendations are deliberately unglamorous. Do not give open web access when curated documentation is better. Ask how little context the task needs, not how much the context window can hold. Choose simple representations — HTML, Markdown, tables, timelines, folders — before elaborate agent architectures. Do not automate work you do not understand.
The context point matters because long-context models can encourage lazy design. McLean says the challenge is no longer only getting context into the model; it is keeping noise out. In advertising research, a web-connected model may absorb promotional or SEO-optimized material from competitors rather than the consumer insight the strategist needs. Tool access becomes a source of constraint, not just empowerment.
His CV example is a compact warning against over-engineering. He built an elaborate application involving provenance ledgers, event logs, graph layers, recursive interpretation, LangGraph patterns, and other machinery. Then he found the model did better when asked to write a CV in HTML. The representation solved more of the problem than the architecture.
| Problem surface | Kaggle emphasis | OLIVER emphasis |
|---|---|---|
| What is being measured? | The model, harness, scaffold, tools, traces, and grading process can blur together. | The product outcome depends on context, representation, workflow, and human judgment. |
| What improves reliability? | Transparent execution, reproducible setups, community-created benchmarks, and domain-authored evals. | Smaller useful contexts, curated sources, simple formats, and bounded autonomy. |
| What changes the unit of progress? | From leaderboard numbers to evaluated systems. | From autonomous agents to reliable task performance. |
The unit of progress is shifting from the model to the system around the model. That does not make model capability irrelevant. It means applied AI work increasingly depends on what the model is allowed to see, how it is asked to act, how its work is represented, how success is graded, and whether domain experts can recognize failure.
4. Enterprises are still organizing around the wrong abstraction
Phil Hetzel of Braintrust brings the agent-design problem into enterprise org charts. His claim is not that data scientists and machine-learning engineers are obsolete. It is that many enterprises assigned generative AI work to traditional ML teams because the word “AI” was in the mandate, even though much of the work no longer looks like traditional model development.
In the conventional ML frame, a team gathers data, labels it, trains a model, tests it, deploys it, and monitors it. In many LLM applications, the base model has already been trained by OpenAI, Anthropic, Mistral, or another provider. The application team inherits a model endpoint and then has to build the product system around it: prompts, context, tools, evals, observability, user feedback, and production workflows.
That connects directly to the Kaggle and OLIVER material. Enterprises misassign agent work when they still think the model is the main object. In applied AI, the object is increasingly the workflow or product system.
Hetzel’s critique of traditional metrics is especially relevant. Data scientists may know how to evaluate a model, but agent quality has a broader surface area than precision, recall, and F1. Those metrics can still matter, particularly when validating an LLM-as-judge against labeled data. But they do not tell a team whether an agent performs the function a product requires, handles user interactions appropriately, fits the workflow, or behaves safely in production.
The same division applies between evals and observability. Evals help during experimentation, as teams change prompts, context, tools, and agent behavior before deployment. Observability matters after deployment, when real users and production data reveal behavior that offline tests missed. Model providers may test the underlying model. Application teams still have to evaluate the agent they actually built.
Hetzel’s team model is cross-functional by necessity. Product and systems engineers treat LLMs as APIs and build the distributed systems around them. Domain experts inspect traces, annotate outputs, define what good work looks like, and adjust prompts or context in natural language. Data scientists explain the underlying technology, keep teams honest about model limits, validate evaluators, and fine-tune when needed.
| Role | What it contributes to GenAI applications |
|---|---|
| Data scientists | Model literacy, measurement rigor, evaluator validation, and fine-tuning when necessary. |
| Product / application / systems engineers | Production architecture, product implementation, integrations, eval pipelines, and observability. |
| Subject-matter experts | Domain judgment, trace annotation, user feedback, prompt and context changes, and definitions of success. |
This is a governance point as much as an org-design point. If agent quality depends on product context, domain judgment, traces, and feedback loops, then assigning exclusive ownership to a central ML team can make the work narrower than the problem. Hetzel’s stronger position is not that GenAI belongs to a different silo. It is that exclusive ownership is the wrong abstraction. The problem should determine the team.
5. The frontier is also moving downward: local models, specialized science platforms, and deployment trust
Applied AI is expanding in two directions at once. One direction moves downward into local devices, developer workflows, and constrained environments. The other moves upward into specialized scientific platforms where the payoff could be enormous but the proof burden is higher. Google’s Gemma work and Demis Hassabis’s drug-discovery optimism look very different, but both reinforce the same condition: intelligence has to fit the environment where it will be used.
Omar Sanseviero describes Gemma as the open, local, on-device expression of the Gemini research stream, not a replacement for Gemini. Gemma 4 is built for efficiency, developer integration, and increasingly agentic local use. Gemini remains Google’s route for flagship capability, broad factual knowledge, and long-running frontier tasks.
The boundary matters. Sanseviero says local models can provide agentic capabilities, function calling, system instructions, and conversation, but broad knowledge and factuality remain harder. That makes Gemma useful where the constraint is privacy, latency, offline operation, local control, or developer integration. It does not imply that open local models replace the cloud frontier for all tasks.
Gemma’s Android Studio integration illustrates the applied shape of this. Developers can use Gemma 4 locally through offline model support via llama.cpp, vLLM, or OpenAI-compatible endpoints. The reason to choose local Gemma over Gemini is practical: keeping code on-device, working offline, or avoiding an external API. Sanseviero also describes the launch as an ecosystem problem, involving nearly 50 external partners and internal Google surfaces such as Cloud, Vertex, ADK, and Android.
On-device AI therefore becomes a deployment strategy, not merely a model-size achievement. The product question is what constraints matter most: privacy, cost, battery, latency, knowledge, factuality, or task duration.
Hassabis’s account of AI drug discovery is the science-platform version of the same argument. He is optimistic that AI could help produce cures for most diseases over a 10- to 20-year horizon, but he frames that as a platform problem rather than a countdown. AlphaFold is one component. The broader project, across DeepMind and Isomorphic Labs, requires “half a dozen to a dozen” additional AlphaFold-level models for other parts of drug discovery: protein interactions, molecule design, binding, absorption, toxicity, side effects, manufacturability, and disease-specific testing.
The bottlenecks are not only computational. Hassabis separates drug discovery from clinical validation. A model may generate or optimize a candidate, but humans still need evidence that it is safe and effective. Regulation, in his words, sits partly outside the scope of AI. He argues that regulators might eventually accelerate parts of the process after enough AI-assisted or AI-designed drugs pass through traditional trials and the models’ predictions can be backtested. The sequence is evidence first, process change second.
His discussion of automated labs sharpens the same point. In coding and math, closed-loop discovery is easier because verification can be fast and clear. In biology, chemistry, physics, and materials science, the verifier may be physical. DeepMind, he says, has 200,000 new material designs without enough testing capacity. In those domains, the bottleneck may shift from hypothesis generation to physical validation.
- AlphaFold stageDeepMind’s protein-structure work supplies one component of the drug-discovery platform Hassabis described.
- Isomorphic and DeepMind platform stageHassabis says the next step is connecting additional specialized models for targets, candidates, interactions, toxicity, side effects, and disease profiles.
- Evidence and institution stageClinical trials, physical labs, and regulatory review determine which AI-generated or AI-assisted predictions can be trusted.
Gemma and AI drug discovery sit at opposite ends of applied AI, but neither is a pure capability story. Gemma’s value depends on local fit and developer deployment. AI medicine depends on specialized models, labs, clinical trials, and regulatory trust. In both cases, the model is only one part of the system that has to work.













