Applied AI Shifts From Model Choice To System Design

Applied AI · Wednesday, May 20, 2026 · 1h 38m to watch · 16 min read

Michael I. Jordan’s argument that prediction is not the system runs through the day’s applied-AI examples: evaluation fragments by use case, data becomes a rights-and-operations pipeline, and agents need economic and institutional rules around them. Parallel’s Index, Google and Blackstone’s TPU venture, and Serval’s enterprise controls all point to a market where capability matters only after access, incentives, infrastructure, and boundaries are defined.

Prediction is not the system

The useful reframing for applied AI is not that models have stopped mattering. It is that the model is increasingly the wrong unit of analysis.

Michael I. Jordan’s argument starts there. He treats “AGI” as a distracting public frame because it centers attention on an isolated intelligent machine, while modern machine-learning systems already operate inside economic, institutional, and social systems. In his account, recent models are impressive predictors and language generators. They are not, by themselves, the systems that determine whether healthcare, logistics, science, finance, education, or software work better.

That distinction matters because many of the hardest applied questions sit outside the neural network. A model output is not the same thing as institutional action. A probability is not the same thing as an inference with an error bar. A chatbot answer is not the same thing as a regulated decision. A capable agent is not the same thing as a deployable employee-facing workflow. Jordan’s examples keep returning to that gap.

In science, he points to AlphaFold as a powerful predictor that still needs statistical machinery around it. AlphaFold predictions can expand the data available for biological inquiry, but Jordan argues that scientific claims require query-specific uncertainty, not just high average predictive accuracy. His example of using roughly 200 million AlphaFold-predicted proteins to test a biological association is not a story about the model “understanding” biology. It is a story about combining model predictions with ground-truth data so inference can be corrected for the specific hypothesis being tested.

200 million

AlphaFold-predicted proteins Jordan cites as useful for inference only with query-specific uncertainty
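One way to picture the correction Jordan describes is a mean estimate that leans on model predictions for a large unlabeled pool, then subtracts the predictor's average error as measured on a small sample with ground truth. The sketch below is only an illustration under assumptions: the function name, the toy numbers, and the choice of a simple mean are hypothetical and are not taken from the talk.

```python
from statistics import mean

def corrected_mean(preds_unlabeled, preds_labeled, truth_labeled):
    """Mean of predictions on the large pool, shifted by the average
    prediction error observed on the small labeled sample."""
    if len(preds_labeled) != len(truth_labeled):
        raise ValueError("labeled predictions and truths must align")
    # Average bias of the predictor, estimated where ground truth exists.
    bias = mean(p - y for p, y in zip(preds_labeled, truth_labeled))
    return mean(preds_unlabeled) - bias

# Toy numbers (hypothetical): the predictor runs about 0.1 too high.
pool = [0.6, 0.7, 0.8, 0.7]
labeled_preds = [0.5, 0.9]
labeled_truth = [0.4, 0.8]
estimate = corrected_mean(pool, labeled_preds, labeled_truth)
```

The design point is that the labeled sample is used to measure the predictor's error for this particular quantity, not to retrain the predictor.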

The same move appears in his drug-approval example. A regulator can design a statistical test with known false-positive and false-negative rates, but if drug companies decide strategically which candidates to submit, the system’s outcome changes with incentives. A bad drug with a 5% approval chance is not worth pushing through if approval produces $200 million against a $20 million trial cost, because the expected payoff of $10 million is less than the cost of the trial. It may become worth pushing through if approval produces $2 billion, since the expected payoff rises to $100 million. The statistical test is identical; the equilibrium is not.
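The incentive shift can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a risk-neutral firm that pays the trial cost whether or not the drug is approved; the function and variable names are illustrative.

```python
def expected_profit(p_approval, payoff, trial_cost):
    """Expected profit of submitting: the trial cost is paid either way."""
    return p_approval * payoff - trial_cost

P_BAD_DRUG_APPROVED = 0.05   # approval chance for a bad drug, from the text
TRIAL_COST = 20_000_000      # $20 million

small_market = expected_profit(P_BAD_DRUG_APPROVED, 200_000_000, TRIAL_COST)
large_market = expected_profit(P_BAD_DRUG_APPROVED, 2_000_000_000, TRIAL_COST)

# A risk-neutral firm (an assumption) submits only when expected profit > 0.
submit_small = small_market > 0
submit_large = large_market > 0
```

Nothing about the test changes between the two cases; only the payoff does, and with it the decision to submit.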

Applied AI question	Narrow model framing	System framing Jordan emphasizes
Scientific discovery	Does the predictor get the answer right?	What inference is valid for this hypothesis, with uncertainty and bias correction?
Drug approval	What is the test’s error rate?	What do strategic firms submit under the incentive structure?
Data markets	How much data can be optimized over?	Who owns data, who sells it, who loses privacy, and what equilibrium results?
Safety	Is the model dangerous or safe?	What full human-machine-institution system is being deployed, monitored, and constrained?

Jordan’s argument shifts applied AI from model behavior to institutional system design.

Jordan’s broader claim is that applied AI needs computer science, statistics, and economics together: computational thinking for systems and algorithms; inferential thinking for uncertainty, data collection, and valid claims; and economic thinking for incentives, markets, privacy, contracts, and strategic behavior. In that frame, “AI safety” is too vague unless it names the actual system being made safe. “Understanding” is too vague unless it explains what action the explanation enables. “AGI” is too vague unless it helps decide who does what, under what incentives, with what evidence and what constraints.

That framing connects the sharper applied-AI stories. Evaluation is splintering because different systems require different measurements. Data has become a legal and operational supply chain rather than a passive internet scrape. Publishers are looking for compensation mechanisms because agents may use content without generating pageviews. Compute is being reorganized through financing structures, power, land, cooling, and chip access. Enterprise AI is becoming practical only when agents are bounded by permissions, approvals, and audits.

The common thread is not anti-model skepticism. It is model realism. A fluent or capable predictor becomes valuable only when the surrounding system defines what counts as success, what data may be used, who gets paid, what infrastructure is available, and what actions the model is allowed to take.

Modern AI Needs Inference and Incentives, Not AGI Framing · Machine Learning Street Talk

Measurement is fragmenting, not converging

Stanford’s evaluation lecture makes Jordan’s conceptual point operational: there is no single scoreboard for “good AI.” Evaluation is always a decision about what object is being measured and for what purpose.

Percy Liang’s framing treats evaluation as the step that turns an abstract goal into a metric. “Good at conversation,” “good at reasoning,” “safe,” “useful for professionals,” and “capable as an agent” do not naturally collapse into one number. A benchmark can measure a base model, an instruction-tuned model, a chat product, an agent scaffold, a safety policy, a tool-using workflow, or a real occupational task. The unit matters because the score follows the unit.

Perplexity is the native measurement for a language model as a probability distribution over token sequences. It is useful for pretraining, scaling laws, and controlled comparisons. But Liang treats it as incomplete for many applied purposes. Predicting every token well is not the same as performing a professional task well. Conditional perplexity can focus on answer tokens, but still does not solve the practical question of whether a system works for a user, a company, or a regulator.
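For readers who want the definition in concrete form, perplexity is the exponential of the average negative log-probability per token, and the conditional variant averages over the answer tokens only. The sketch below assumes natural-log token probabilities are already available from some model; the sample values are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def conditional_perplexity(token_logprobs, answer_start):
    """Perplexity over answer tokens only; earlier tokens are context."""
    return perplexity(token_logprobs[answer_start:])

# Hypothetical log-probabilities: 2 prompt tokens, then 2 answer tokens.
logprobs = [math.log(0.5), math.log(0.5), math.log(0.25), math.log(0.25)]
full = perplexity(logprobs)
answer_only = conditional_perplexity(logprobs, answer_start=2)
```

Both numbers describe how well the model predicts text. Neither says whether the answer was correct or useful, which is the gap Liang points to.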

Exam benchmarks such as MMLU, MMLU-Pro, GPQA, and Humanity’s Last Exam answer a different question. They test knowledge and reasoning under controlled, gradeable conditions. They are valuable partly because they can be made hard and scored consistently. But they also tend to saturate, invite contamination questions, and often do not resemble ordinary use. Liang’s examples show a repeated pattern: a benchmark becomes a sign of progress, frontier models climb it, and a harder benchmark is introduced.

Chat evaluation changes the problem again. Open-ended answers often have no single ground truth, so systems such as Chatbot Arena use human preferences, while AlpacaEval uses LLM judges and WildBench adds checklists. Those methods surface real weaknesses: preference data can reward style, sycophancy, or verbosity; LLM judges can be biased; and rubrics are needed because “better” is otherwise underspecified.
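To see why preference data ranks systems without certifying correctness, consider how a pairwise vote might be turned into a rating. The lecture summary does not say how any particular leaderboard aggregates votes, so the Elo update below is an assumption used as a stand-in, with a conventional but arbitrary `k` factor.

```python
def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Standard Elo update for one pairwise comparison.

    Only the fact that A or B was preferred enters the update; the
    reason for the preference (accuracy, style, length) is invisible.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Hypothetical vote: a rater prefers model A's answer over model B's.
new_a, new_b = elo_update(1000.0, 1000.0, a_won=True)
```

A longer or more flattering answer that wins the vote moves the rating exactly as a more accurate one would, which is why rubrics and checklists are layered on top.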

Agent benchmarks move still further away from pure model evaluation. SWE-bench, TerminalBench, CyBench, and MLEBench measure systems that combine a model with tools, memory, prompting, orchestration, files, environments, and iterative action. A high score may reflect the base model, but it may also reflect scaffolding, planning, context management, test design, and tool integration.

Agent scaffold evaluation is therefore system evaluation. The same model can perform differently when wrapped in a different agent loop. For applied buyers and builders, that is not a nuisance variable. It is often the product.

Evaluation type	What it mainly asks	Main applied caveat
Perplexity	Does the model assign high probability to text from a distribution?	Useful for language modeling, incomplete for task success.
Exam benchmarks	Can the model answer controlled knowledge or reasoning questions?	May saturate, differ from real use, and face contamination risk.
Chat preference tests	Which open-ended response do humans or judges prefer?	Preferences can reward style, length, or pleasing behavior over correctness.
Agent benchmarks	Can a model-plus-scaffold act successfully in an environment?	The score measures the whole system, not just the base model.
Safety evaluations	Does the system refuse, avoid harm, or comply with policy?	Safety changes by context, law, politics, and use case.
Professional-use evals	Does the system handle realistic work tasks?	More realistic tasks are harder to build, audit, and keep private.

Benchmark families answer different questions rather than forming one universal leaderboard.

Safety evaluation shows the same fragmentation. HarmBench tests refusal of harmful requests. AIR-Bench expands the taxonomy to legal, social, operational, and rights-related risks. Jailbreak tests examine whether refusals can be bypassed. Medical, legal, financial, and cybersecurity contexts introduce different harms and different tradeoffs. A model that is “safer” by one refusal metric may not be safer in a professional workflow that needs calibrated action under supervision.

Realism adds another axis. GDPVal uses tasks written by experienced professionals in major GDP sectors. MedHELM asks clinicians for real clinical tasks rather than relying only on medical exams. Anthropic’s Clio project uses model-assisted analysis of actual user-query patterns while trying to preserve privacy. These efforts move evaluation closer to applied use, but they also become more expensive, less public, and harder to reduce to a tidy leaderboard.

Benchmarks are instruments. MMLU, GPQA, SWE-bench, Chatbot Arena, HarmBench, GDPVal, MedHELM, private evals, and perplexity all answer different questions. As AI systems move into work, regulation, and customer-facing operations, the first evaluation question becomes: what exactly is being evaluated?

That question leads directly into data. If evaluation has to declare the object being measured, training pipelines have to declare what made that object possible: what data was reachable, lawful, selected, cleaned, repeated, excluded, and transformed.

AI Evaluation Benchmarks Measure Different Questions, Not One Scoreboard · Stanford Online

Data is becoming a legal and operational supply chain

Stanford’s data lecture extends the same argument upstream. Models are not trained on “the internet” in any simple sense. They are trained on corpora assembled through crawlers, dumps, licenses, filters, converters, deduplication systems, classifiers, and legal judgments.

Tatsunori Hashimoto’s starting point is that data is both the most consequential and least transparent part of modern language models. Open-weight releases may disclose architectures and training procedures while describing pretraining data only as “a variety of data sources.” That opacity has two explanations in the lecture: competitive secrecy and legal risk. The composition and processing of the data affects model behavior; naming the data may also invite copyright scrutiny.

The web itself is not a dataset. A crawler starts from URLs, downloads pages, extracts links, and repeats. But many web services are dynamic, authenticated, paywalled, rate-limited, or structured around sessions, forms, and application state. Facebook, X, LinkedIn, The New York Times, Discord, and enterprise tools are not simply open text files waiting to be fetched. Even technically accessible content may be limited by robots.txt, terms of service, anti-bot systems, CAPTCHAs, IP blocks, country blocks, or contractual restrictions.
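The robots.txt constraint is the easiest of these limits to show in code. The sketch below uses Python's standard `urllib.robotparser` on an inline rules string so that no network access is needed; the bot name, the rules, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules: everything is crawlable except /private/.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

public_ok = allowed_by_robots(
    ROBOTS, "example-bot", "https://example.com/articles/1")
private_ok = allowed_by_robots(
    ROBOTS, "example-bot", "https://example.com/private/report")
```

Passing this check only means the site's robots.txt does not object. Terms of service, paywalls, rate limits, and copyright are separate questions that a crawler cannot resolve from this file.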

Hashimoto’s point about Common Crawl is especially useful. Common Crawl is a major substrate, but it is raw material, not a finished training set. The April 2026 crawl cited in the lecture contained 2.19 billion pages and 372.2 terabytes of data. Those pages still require HTML-to-text conversion, filtering, quality selection, deduplication, language identification, boilerplate removal, and legal triage.

2.19B

pages in the April 2026 Common Crawl dump cited in Stanford’s data lecture

That transformation is lossy and consequential. Different text extraction tools produce different downstream performance. C4’s manual rules filtered out pages with certain structures, bad words, boilerplate, and code-like syntax. CCNet filtered for documents that resembled Wikipedia. GPT-2’s WebText used Reddit links as a social proxy for quality. DCLM used model-based filtering and kept only a small slice of its original pool. CommonPile restricted itself to permissively licensed or public-domain material and accepted a different performance frontier.

Copyright makes the pipeline still less mechanical. Hashimoto emphasizes that copyright is not reducible to verbatim memorization. Copying exact text can matter, but so can expressive elements, market substitution, licensing, fair use, terms of service, and whether data was obtained lawfully. A work can be accessible online and still copyrighted. A dataset can have a permissive collection license while containing individual items with different rights. A model output can be legally permitted under a model license while still raising questions if the generating model was trained on unlicensed material.

Pipeline stage	Applied consequence
Access	Authentication, paywalls, dynamic pages, robots.txt, rate limits, anti-bot systems, and terms of service shape what can be collected.
Legal status	Copyright, licenses, fair-use arguments, piracy risk, and platform terms determine what can be used or disclosed.
Conversion	HTML, PDFs, LaTeX, repositories, issues, and pull requests must be turned into model-readable text, often losing structure.
Filtering	Quality proxies such as Reddit links, Wikipedia similarity, manual rules, classifiers, or license filters define what the model sees.
Deduplication	Copies, forks, boilerplate, benchmark overlap, and repeated text affect training balance and evaluation validity.
Synthesis and workflow data	Synthetic transformations and structured metadata can change the model’s capabilities beyond raw document content.

Training data behaves less like passive fuel than like an engineered supply chain.
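As a small illustration of one pipeline stage, the sketch below removes exact duplicates after normalizing whitespace and case. It is a deliberately simple stand-in: the normalization rules are an assumption, and detecting near-duplicates, forks, or benchmark overlap would need fuzzier matching than a hash comparison.

```python
import hashlib
import re

def normalize(text):
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(docs):
    """Keep the first occurrence of each document, by normalized SHA-256."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

# Hypothetical pages: the second is the first with different spacing and case.
pages = ["Hello  world", "hello world ", "Other page"]
unique_pages = deduplicate(pages)
```

Even this toy version embeds a judgment call: deciding that case and spacing do not matter changes which documents survive.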

Code data shows why this is a supply-chain problem rather than a simple scraping problem. GitHub training material is not merely source files; it may include issues, pull requests, comments, commit histories, documentation, review states, and workflow metadata. To train an autoregressive model, that structured software-development process must be linearized into sequences, and the dataset builder decides how much context to include.
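What linearization involves can be sketched with a toy issue thread. The tag format, the field names, and the `max_comments` cutoff below are all hypothetical; they stand in for the choices a dataset builder would have to make about how much context to keep.

```python
def linearize_issue(issue, max_comments=2):
    """Flatten an issue and its first few comments into one text sequence."""
    parts = [f"<issue> {issue['title']}", issue["body"]]
    # The cutoff is a context decision: later comments are simply dropped.
    for comment in issue["comments"][:max_comments]:
        parts.append(f"<comment by={comment['author']}> {comment['text']}")
    return "\n".join(parts)

# Hypothetical issue thread.
issue = {
    "title": "Crash on empty input",
    "body": "Parser raises IndexError.",
    "comments": [
        {"author": "dev1", "text": "Can reproduce."},
        {"author": "dev2", "text": "Fix in PR."},
        {"author": "dev3", "text": "Thanks."},
    ],
}
sequence = linearize_issue(issue, max_comments=2)
```

Whatever falls outside the cutoff never reaches the model, so the builder's truncation rule is part of the training data.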

This is where data connects to evaluation and economics. If benchmark questions, source documents, or close variants are in the training mix, scores become harder to interpret. If lawful access narrows, licensed and proprietary data become strategic assets. If permissively licensed datasets can perform “reasonably” but struggle to match token-rich frontier mixes, rights decisions change capability tradeoffs. If code models learn from issues, pull requests, and comments, the object being learned is not just syntax but process.

Data is no longer background. It is infrastructure with rights, costs, bottlenecks, and quality controls. Architectures may converge, but corpora still differ in what they include, exclude, license, repeat, clean, synthesize, and disclose.

Models Are Trained on Curated Corpora, Not the Internet · Stanford Online

The web’s business model is being renegotiated for agents

Parallel’s Index launch is a concrete version of the data-market problem. Parag Agrawal’s argument is that AI agents break the web’s inherited compensation system because they can use content to perform valuable work without producing the human attention signals that supported advertising and subscriptions.

The older web economy priced human behavior: clicks, impressions, subscriptions, search referrals, pageviews, and attention. Agentic use changes the accounting unit. An AI lawyer, AI scientist, finance agent, or enterprise research agent may use content to complete a task without sending a human reader to the publisher. The value may be real, but it may not show up as a pageview, ad impression, subscription conversion, or visible citation.

Agrawal’s proposed answer is not another flat licensing deal. He contrasted Index with fixed agreements between large AI labs or hyperscalers and large publishers. His criticism is that flat fees may pay for access but do not let content owners participate dynamically as agent work becomes more valuable.

Parallel wants to pay according to marginal contribution. Agrawal describes the model using Shapley values, a game-theoretic method for allocating the value of a joint outcome among contributors. In Parallel’s version, higher-quality content should earn more, content used in higher-value work should earn more, and content owners should grow economically as agents complete more valuable tasks.

Shapley-value-style attribution is an ambitious promise because the thing being priced is not merely access to an article or whether a citation appears in an answer. Parallel’s dashboard, as shown, scores domains on impressions, citations, value, and uniqueness. A domain can be useful because an agent considered it, cited it, used it for a high-value task, or found it hard to substitute.
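A toy calculation shows why substitutability matters under Shapley-style attribution. The code below computes exact Shapley values by averaging each contributor's marginal contribution over all orderings. The value function, the source names, and the numbers are hypothetical and say nothing about how Parallel actually computes payments.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

def task_value(sources):
    """Hypothetical task: one fact comes only from 'unique'; another can
    come from either of two interchangeable sources."""
    unique_part = 50.0 if "unique" in sources else 0.0
    overlap_part = 50.0 if sources & {"dup1", "dup2"} else 0.0
    return unique_part + overlap_part

shares = shapley_values(["unique", "dup1", "dup2"], task_value)
```

In this setup the hard-to-substitute source receives half the task's value, while the two interchangeable sources split the other half. Exact enumeration grows factorially with the number of contributors, so any real system would need an approximation.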

81.7%

citation share shown for nih.gov among the compared domains in Parallel’s displayed dashboard

The significance is not whether Parallel’s marketplace becomes the standard. It is that content compensation is being rethought as an incentive-design problem. Publisher lawsuits, robots.txt restrictions, flat licenses, citations, and crawler blocks are all responses to the same breakdown: agents may derive economic value from web content in ways the web’s human-attention economy does not reward.

Parallel’s launch partners illustrate the breadth of what “content” now means. Agrawal named publishers such as The Atlantic and Fortune; factual and business data providers such as PR Newswire, PitchBook, Tracxn, and ZoomInfo; and independent creators including Alex Heath and Azeem Azhar. The displayed partner list also included Enigma and Fiscal.ai. That mix matters because enterprise agents need more than news articles. They need government data, financial data, research data, business intelligence, specialized analysis, and hard-to-substitute domain knowledge.

There is a tension in Agrawal’s framing. He argues for broad access: “everyone’s agents,” not only the largest labs, should be able to use high-quality content. But broad access without compensation is what content owners are resisting. Index is Parallel’s attempt to make openness and payment coexist: not walling content off from agents, but changing the economic link between machine use and creator revenue.

Parallel’s bet is that the web must be repriced for machine use. Content owners need a reason to make material available; agent builders need reliable access; users need systems that can cite, price, and justify the data they use. The next bottleneck is not only what data exists, but what economic mechanism persuades its owners to participate.

Parallel Launches Marketplace to Pay Publishers for AI Agent Work · Bloomberg Technology

Compute is being reorganized around scarcity and control

The Google-Blackstone TPU venture shows the infrastructure counterpart to the data-market story. AI capacity is not just a question of which model exists or which chip is fastest. Bloomberg Technology and Bloomberg Intelligence’s Mandeep Singh framed it as a question of who can finance, build, power, cool, and allocate scarce compute.

Bloomberg Technology described the venture as Google turning its Tensor Processing Unit capacity into a neocloud business outside the ordinary Google Cloud structure. Blackstone is contributing an initial $5 billion equity commitment. The target is 500 megawatts of computing capacity by 2027. Caroline Hyde said the structure could be leveraged with debt financing up to $25 billion.

$5B

Blackstone’s initial equity commitment to the Google TPU neocloud venture

Singh interpreted the structure as Google’s route to meeting external TPU demand without relying only on its own cloud data-center buildout. Nvidia has an ecosystem of neocloud providers built around its GPUs. Google has comparatively few external TPU providers. If customers such as Anthropic want very large Google capacity, Singh argued, Google needs a deployment structure that can scale beyond conventional internal facilities.

The business design matters because Blackstone is expected to shoulder more of the data-center burden: power, cooling, land, and related infrastructure. Google supplies the silicon platform. That separates chip demand from the question of whether all capacity must sit inside Google Cloud’s traditional capex and operating model.

Infrastructure element	Why it matters for applied AI
Chips	TPUs give Google a path to compete with Nvidia-centered AI capacity.
Financing	Blackstone equity and potential debt leverage expand the capital structure beyond internal cloud buildout.
Power and cooling	Data-center capacity depends on physical constraints, not only chip availability.
Land and deployment	Site acquisition and construction timelines determine when capacity becomes usable.
Customer access	A neocloud structure can expose Google silicon to external demand outside the usual Google Cloud frame.
Competitive structure	TPU-based capacity pressures Nvidia-centered neoclouds such as CoreWeave, IREN, and Nebius.

The Google-Blackstone deal treats compute as a financed deployment system, not just a chip supply story.

This broadens what “AI infrastructure” means. The visible competition may be GPU versus TPU, or Nvidia versus Google, but the applied bottleneck includes capital markets, power markets, land, cooling, interconnect, customer contracts, and cloud distribution. The model layer depends on those systems before inference happens.

The timing also matters. Singh and Bloomberg framed the structure as a way for Google to ramp external TPU capacity faster than if it built everything inside Google Cloud. That resembles the Nvidia neocloud pattern in which specialized providers aggregate chips, capital, and customer demand. Google’s version differs because the chip supplier itself is pushing a neocloud-like capacity vehicle with a major financial partner.

For applied AI companies, compute access is becoming a market-design problem as much as a hardware problem. Customers need capacity with predictable terms. Chip makers need demand channels. Cloud providers need financing structures that do not leave every watt and rack on the same balance sheet. Investors need confidence that demand turns into utilization and earnings. The scarce resource is not only the accelerator. It is the whole system that turns accelerators into usable inference and training capacity.

Google Turns TPU Capacity Into a Blackstone-Backed Neocloud · Bloomberg Technology

Autonomy only ships when bounded

Serval’s story brings the infrastructure-and-governance thesis down to the level of enterprise deployment. Jake Stauch’s central claim is that enterprise AI is not won by giving models unlimited autonomy. It is won by surrounding them with permissions, approvals, audits, logs, scoped tools, workflows, and admin control.

Serval works in employee support and Enterprise Service Management: access requests, approvals, fixes, resets, information requests, and internal help. In the abstract, a frontier model might reason through many of those tasks. In a company, reasoning is not enough. The agent must know what it is allowed to access, which actions require approval, which tools are authorized, which systems are in scope, and how every action will be logged and audited.

The product is the boundaries. The product is the controls.

Jake Stauch

Stauch’s phrase is not a compliance afterthought. It is the reason an enterprise can deploy the system at all.

Serval’s architecture embodies the point. Customers interact with two agents. An admin agent helps create tools, skills, workflows, and knowledge. A help desk agent interacts with end users. The help desk agent can reason conversationally, but it can act only through tools and workflows that IT has explicitly built, approved, published, and permissioned. The organization constrains the action space before autonomy reaches production.

Enterprise agents become deployable not by having maximum freedom, but by having an action space narrow enough for the organization to trust.
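The idea of a constrained action space can be sketched as a registry of published tools that the agent must go through, with every attempt logged. All class names, roles, and tool names below are hypothetical; this is not Serval's API, only a minimal illustration of the pattern described.

```python
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    allowed_roles: frozenset
    needs_approval: bool = False

@dataclass
class HelpDeskAgent:
    published_tools: dict
    audit_log: list = field(default_factory=list)

    def act(self, tool_name, user_role, approved=False):
        """Attempt an action; only published, permitted, approved tools run."""
        tool = self.published_tools.get(tool_name)
        if tool is None:
            outcome = "denied: tool not published"
        elif user_role not in tool.allowed_roles:
            outcome = "denied: role not permitted"
        elif tool.needs_approval and not approved:
            outcome = "pending: approval required"
        else:
            outcome = "executed"
        # Every attempt is recorded, including denials.
        self.audit_log.append((tool_name, user_role, outcome))
        return outcome

# Hypothetical tools that IT has built and published.
tools = {
    "reset_password": Tool("reset_password", frozenset({"employee", "admin"})),
    "grant_admin": Tool("grant_admin", frozenset({"admin"}), needs_approval=True),
}
agent = HelpDeskAgent(published_tools=tools)
first = agent.act("reset_password", "employee")
second = agent.act("grant_admin", "admin")
```

However capable the underlying model is, a request for an unpublished tool is refused and logged, so the organization's registry defines what the agent can do.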

That pattern also changes the role of workflow software. Pat Grady describes the ServiceNow abstraction as workflows on top of a database; Stauch agrees that the abstraction still works. What changes is the build process. If a workflow takes weeks or months to create, it may be obsolete by launch. Serval uses AI to let administrators describe workflows, permissions, approvals, and data-fetching logic in natural language, then generate the code.

Ease creates another governance problem: “slop automation.” If every admin can generate another password-reset workflow, the company may end up with duplicated, inconsistent internal automation. Serval’s answer is another layer of agentic oversight that can identify similar workflows, suggest consolidation, delete duplicates, and add approval steps. Even inside an AI automation product, the work shifts from raw generation to coherence, control, and maintainability.

Model upgrades are handled the same way. Serval uses models from OpenAI and Anthropic for different product areas, but Stauch says new releases are not automatically better. A model may improve in one behavior and regress in another; the surrounding prompts and guardrails may have been tuned around the prior model’s quirks. Serval runs evals, adjusts the system, rolls out slowly, and sometimes downgrades after deciding a newer model is less predictable for customers.

That is a mature applied-AI posture: capability is absorbed into a controlled product environment rather than treated as a drop-in miracle. The eval is not merely “is the new model smarter?” It is “does the new model behave reliably inside this enterprise system, with these tools, permissions, workflows, and customer expectations?”
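A regression gate of the kind this posture implies might look like the following sketch. The eval names, the scores, and the 0.02 tolerance are hypothetical, and the source does not describe how Serval implements its evals; the point is only that a candidate model is judged per behavior, not on a single overall score.

```python
def rollout_decision(current_scores, candidate_scores, max_regression=0.02):
    """Hold the rollout if any eval regresses by more than the tolerance."""
    regressions = {}
    for name, old in current_scores.items():
        new = candidate_scores.get(name, 0.0)  # a missing eval counts as failing
        if old - new > max_regression:
            regressions[name] = old - new
    return ("hold" if regressions else "promote"), regressions

# Hypothetical eval results: the candidate improves one behavior
# and regresses another.
current = {"password_reset": 0.95, "access_request": 0.90}
candidate = {"password_reset": 0.90, "access_request": 0.93}
decision, regressed = rollout_decision(current, candidate)
```

Here the candidate is better on one task and still held back, because the gate asks whether each behavior stays reliable, not whether the average went up.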

Serval’s operating model points to the same conclusion. Stauch says the company tries to stay small, dense, and willing to rebuild quickly. It uses its own product for internal operations, questions whether departments need to exist in traditional form, and hires for “fewer better.” But even that aggressive AI-native posture is organized around practical constraints: customer insight, implementation coordination, trust, and the ability to change the product as models shift.

The day’s examples point away from a simple model leaderboard and toward the systems that make models useful or risky. Evaluation decides what “better” means. Data pipelines decide what the model can know and what legal exposure it carries. Marketplaces decide who gets paid when agents use the web. Neocloud structures decide who gets scarce compute. Enterprise platforms decide what an agent is allowed to do.

Serval Bets Boring IT Controls Will Unlock Enterprise AI · Sequoia Capital