Scientific AI Is Moving From Single Models to Agent-Native Environments

James Zou | Stanford HAI | Tuesday, June 30, 2026 | 14 min read

At Stanford’s 2026 Conference on Physics and AI, James Zou argued that scientific AI is moving beyond stronger individual models toward systems of many agents and the environments that let them work together. He presented Stanford projects including the Virtual Lab, Virtual Biotech, EinsteinArena, Paper2Agent, and Paperclip as evidence that progress may depend on organizing agents, verifiers, incentives, and agent-readable knowledge structures as much as on improving model capability itself.

Environment design may outlast agent-harness design

James Zou’s most consequential claim was not simply that AI agents can help scientists. It was that, as base models keep improving, the durable design problem may move upward: from building individual models, to building agent workflows, to building the environments in which many agents collaborate, compete, verify one another’s work, and accumulate scientific progress.

That claim came into sharper focus during the Q&A, when an audience member raised the obvious objection: if OpenAI, Anthropic, and other labs keep releasing stronger models every few months, will many agentic workflows become obsolete as those capabilities are absorbed into the model itself? Zou partly agreed. His response was that the more promising next step may be less about optimizing a single agent harness and more about building agent-native environments.

EinsteinArena, his example, does not require one specific agent or prompting framework. Participants can bring Claude, Codex, their own agents, or other systems. What the arena supplies is the scientific world around them: curated problems, exact verifiers, discussion forums, leaderboards, API access, incentives, and constraints. Depending on how that world is designed, Zou said, it can facilitate or stifle agent innovation.

That framing also explains the progression across the systems he presented. The Virtual Lab organizes a small number of agents like a research group. The Virtual Biotech scales that organizational structure to tens of thousands of specialist agents. EinsteinArena opens the system so agents from anywhere can collaborate and compete on verified open problems. Paper2Agent and Paperclip push the idea into scientific knowledge itself: papers become active agents or agent-native file systems, rather than static artifacts written only for human readers.

Zou framed the broader shift as a move from “AI as a tool” toward “AI as a scientist by itself.” He did not dismiss earlier scientific AI tools, citing AlphaFold and Cellpose as examples of powerful systems. The limitation, in his account, is that such tools usually operate inside a problem humans have already scoped: the research question is defined, the dataset is selected, and the scientist still decides how to interpret the output.

The newer pattern is broader. AI agents built on large language models can participate across the research process: generating hypotheses, curating literature, designing experiments, creating tools, analyzing data, and helping interpret results. Zou described an agent as a system in which the language model is the “brain,” but the system is also given “arms and legs”: access to tools, databases, knowledge bases, and executable workflows.

“We’re really scaling up not the size of individual model anymore but really scaling up the number of agents that work together to tackle scientific problems.” (James Zou)

The central scaling question, then, is different from the usual one. Zou was not primarily asking what happens as one model becomes larger. He was asking what happens as scientific work is organized around populations of agents: first a handful, then tens of thousands, then open-ended communities, and eventually millions of agents corresponding to scientific artifacts themselves.

The Virtual Lab is organized like a research group

The Virtual Lab was Zou’s first example of scaling by agent population. It is designed to emulate a physical Stanford lab with AI scientist agents. An AI professor agent runs the lab. Beneath it are specialist agents with different expertise, including data science, machine learning, computational biology, immunology, and protein design. The agents hold group meetings, receive a budget to run experiments, and can attend a “school” created for them, where they teach themselves to become better researchers in their domains.

The first project he described assigned the Virtual Lab an open-ended protein-design task: develop proteins that bind to newer SARS-CoV-2 variants. The agents began with a group meeting. A principal-investigator agent stated the objective: design binders for the spike protein. An immunologist agent recommended modifying existing nanobodies, a class of small proteins. A machine-learning specialist agreed and argued that their smaller size meant fewer degrees of freedom, simplifying model complexity. A scientific critic warned against over-reliance on computational predictions without cross-validation, noting the limited public data for nanobodies.

From those discussions, Zou said, the agents produced a computational workflow for modeling and optimizing nanobodies that was “different from what people have done before.” The output was then tested experimentally. His group made the designed nanobodies in a wet lab and showed, according to Zou, that two new agent-designed proteins bound more strongly to different COVID variants than the best previously human-designed nanobodies. The slide attributed to Swanson et al. in Nature 2025 stated that Virtual Lab-designed nanobodies were experimentally validated and that AI-designed nanobodies showed better binding than human-designed nanobodies.

A few days: the time Zou said the Virtual Lab took to tackle the open-ended nanobody-design project.

For Zou, the result was a proof of concept: a small team of agents, organized like a lab rather than a single model call, could take on a challenging open-ended life-science project, propose an approach, execute a computational design plan, and produce candidates that survived wet-lab validation.

The Virtual Biotech applies the same idea at pharma scale

James Zou scaled the organizational metaphor from a five-to-ten-agent lab to a “Virtual Biotech” designed to resemble a human pharma company. The system has a chief scientific officer agent, an orchestrator, an office of the CSO, scientific reviewers, and specialized divisions for target identification, target safety, modality selection, and clinical study design. Under those divisions are specialist agents for human statistical genetics, functional genomics and perturbation, single-cell atlas analysis, biological pathways and protein-protein interactions, FDA safety, target biology, pharmacology, and clinical trials.

The project used to illustrate the larger structure was clinical-trial curation. Existing clinical-trial information, Zou said, is scattered across research papers, news reports, company websites, and press releases. Since each trial can cost many millions of dollars, a unified database of past outcomes would be valuable, but manually creating it would be hugely time-consuming.

The Virtual Biotech automatically created about 37,000 clinical-trial agents. Each agent was responsible for one trial: reading relevant sources, extracting metadata, curating disease and target information, identifying primary and secondary endpoint results, recording reasons a trial stopped, and linking adverse events or other findings back to primary sources. The result, Zou said, was a unified database of about 56,000 trials.

Measure	Value shown or described
Clinical-trial agents created	about 37,000
Trials curated	about 56,000
Completed trials on the slide	44,895
Terminated trials on the slide	6,138
Positive outcomes on the slide	21,997
Negative outcomes on the slide	18,028

The Virtual Biotech trial-curation scale shown in Zou’s presentation

After building the database, the agents were asked which features of drug targets predict whether a trial is likely to succeed. They focused on properties of target proteins across different single cells, which Zou described as underexplored.

The agents constructed two features. The first, a tau score, measures how cell-type-specific a drug target is. If a target appears in all cells, Zou suggested, it may be less attractive than a target concentrated in specific cells. The second, a bimodality score, measures whether the target has more switch-like behavior. Zou said both features were highly predictive of trial outcomes, including progression to later phases, termination status, and clinical efficacy endpoints. The slide stated that drugs targeting cell-type-specific genes are 48% more likely to reach market.
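As a rough illustration of what such features can look like, here is a minimal sketch. It assumes the widely used tau specificity index, tau = sum(1 - x_i / x_max) / (n - 1), and a simple moment-based bimodality coefficient (squared skewness plus one, divided by kurtosis). The talk did not say which formulas the agents used, so both definitions, the function names, and the toy inputs are assumptions.

```python
from statistics import fmean


def tau_score(expression):
    """Tau specificity index over per-cell-type mean expression.

    Returns 0.0 when expression is identical in every cell type and
    1.0 when it is confined to a single cell type.
    """
    n = len(expression)
    peak = max(expression)
    if n < 2 or peak <= 0:
        raise ValueError("need at least two cell types and positive expression")
    return sum(1 - x / peak for x in expression) / (n - 1)


def bimodality_coefficient(values):
    """Moment-based coefficient (skewness^2 + 1) / kurtosis.

    Values near 1 suggest switch-like (two-state) behavior; a uniform
    distribution gives 5/9, and unimodal bell shapes fall below that.
    """
    values = list(values)
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    if m2 == 0:
        raise ValueError("values have zero variance")
    m3 = fmean([(v - mean) ** 3 for v in values])
    m4 = fmean([(v - mean) ** 4 for v in values])
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis
    return (skewness ** 2 + 1) / kurtosis


# Hypothetical per-cell-type expression profiles for two targets.
broad_target = [5, 5, 5, 5]
specific_target = [0, 0, 8, 0]

# Hypothetical per-cell expression values.
switch_like = [0, 0, 0, 0, 1, 1, 1, 1]
bell_shaped = [1, 2, 2, 3, 3, 3, 4, 4, 5]
```

In this toy setup the broad target scores 0 and the specific target scores 1 on tau, while the switch-like sample scores higher than the bell-shaped one on the bimodality coefficient.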

48%: the higher likelihood, according to the slide, that drugs targeting cell-type-specific genes reach market.

Zou treated this as an example of agent-scale analysis producing a new prioritization signal for drug discovery: by reading tens of thousands of trials holistically, agents extracted target features that may help predict which drugs are more likely to succeed.

EinsteinArena makes the scientific world the thing to optimize

Both the Virtual Lab and Virtual Biotech have an important limitation in Zou’s account: the agents and frameworks were created inside his lab. EinsteinArena was designed to remove that boundary. It is an “agent-native” scientific arena where agents from anywhere in the world can enter, collaborate, compete, and submit solutions to open scientific problems.

The arena was intentionally designed for agents rather than humans. Agents can read a skills.md file to learn how to interact with the system through API calls. To filter out humans, Zou said, the arena requires solving a small computational puzzle for API access; he characterized this as easy for agents and hard for humans. The point was to observe how agents behave organically among themselves.

EinsteinArena’s problems are curated according to two criteria. Each should matter to human researchers, with an existing literature and research community. Each should also have a clear, mathematically exact verifier. The verifier is central because it gives agents near-real-time feedback on submitted solutions.

A problem page includes a precise mathematical statement, a discussion forum, and a verified leaderboard. Agents can ask one another questions, request help, share intermediate results, and submit candidate solutions. The leaderboard creates a competitive dynamic: agents can see how their results compare with previous human best-known solutions, previous AI solutions, and other agents’ submissions.
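The structure described above (a statement, an exact verifier, a forum, and a verified leaderboard) can be sketched in a few lines. This is not EinsteinArena's code or API: the `Problem` class, its method names, and the toy divisibility problem are all hypothetical, chosen only to show how a verifier can gate what reaches a leaderboard.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Problem:
    statement: str
    # The verifier returns a score for a valid candidate, or None if invalid.
    verifier: Callable[[object], Optional[int]]
    leaderboard: List[Tuple[int, str]] = field(default_factory=list)
    forum: List[Tuple[str, str]] = field(default_factory=list)

    def post(self, agent: str, message: str) -> None:
        """Record a forum message so other agents can build on it."""
        self.forum.append((agent, message))

    def submit(self, agent: str, candidate: object) -> bool:
        """Add a candidate to the leaderboard only if the verifier accepts it."""
        score = self.verifier(candidate)
        if score is None:
            return False
        self.leaderboard.append((score, agent))
        self.leaderboard.sort(key=lambda entry: entry[0], reverse=True)
        return True

    def best(self) -> Optional[Tuple[int, str]]:
        return self.leaderboard[0] if self.leaderboard else None


def verify_antichain(candidate) -> Optional[int]:
    """Toy exact verifier: integers in 1..20, none dividing another."""
    items = sorted(set(candidate))
    if not items or any(x < 1 or x > 20 for x in items):
        return None
    for i, a in enumerate(items):
        if any(b % a == 0 for b in items[i + 1:]):
            return None
    return len(items)


problem = Problem("Largest divisibility-free subset of 1..20", verify_antichain)
problem.post("agent-a", "Trying small primes first.")
accepted_a = problem.submit("agent-a", [2, 3, 5, 7])
rejected = problem.submit("agent-c", [2, 4, 9])  # 2 divides 4, so invalid
accepted_b = problem.submit("agent-b", range(11, 21))
```

The design point is that agents never write to the leaderboard directly; only the verifier's verdict does, which is what makes the feedback trustworthy regardless of which agent or framework submitted.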

Within a few weeks of opening the arena, Zou said, agents discovered new best solutions to 12 open research problems. The slide made the same claim directly: “Agents discovered 12 new best solutions to open problems.” The problems visible on the arena included circle packing in a square, difference bases, Erdős-Kleinman overlap, flat polynomials, multiple kissing-number instances, the Heilbronn problem for triangles, and the Tammes problem.

12: the number of new best solutions to open problems that Zou said agents discovered within a few weeks on EinsteinArena.

Zou added two qualifications that matter for how he interprets the result. First, the solutions were better than previous human solutions and better than previous customized AI-tool solutions. Second, he said none of these results was feasible for a single AI agent or single model by itself; they required groups of agents working together.

This is why Zou described a progression in AI design. In “Era 1,” researchers designed models: hand-crafted objectives and optimization procedures. In “Era 2,” they designed agents: objectives plus interaction or data-driven behavior. In “Era 3,” which he associated with EinsteinArena, they design environments: incentives, information, constraints, verifiers, discussion channels, and leaderboards.

The kissing-number result is Zou’s strongest evidence for multi-agent discovery

James Zou used the kissing-number problem in 11 dimensions as the clearest example of EinsteinArena’s collective-intelligence claim. The problem asks for the maximum number of unit spheres that can touch a central unit sphere without overlapping. In one dimension, the answer is two; in two dimensions, it is six. In higher dimensions, Zou said, the problem becomes difficult, and the 11-dimensional case has attracted attention from mathematicians and computer scientists for decades.
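To make the problem statement concrete, here is a minimal sketch of an exact verifier for a candidate kissing configuration. It is an illustration, not the arena's verifier. It assumes integer coordinates so that all arithmetic is exact: if every center has the same squared norm N, the spheres have radius sqrt(N)/2, and they do not overlap exactly when every pairwise squared distance is at least N. The function name is hypothetical.

```python
from itertools import combinations, permutations, product


def verify_kissing(centers):
    """Return the number of spheres if the configuration is valid, else None.

    centers: integer coordinate tuples, all in the same dimension.
    Valid means all centers share one squared norm N > 0 (each sphere
    touches the central one) and every pairwise squared distance is
    at least N (no two outer spheres overlap).
    """
    centers = [tuple(c) for c in centers]
    if not centers:
        return 0
    dim = len(centers[0])
    if any(len(c) != dim for c in centers):
        return None
    norm_sq = sum(x * x for x in centers[0])
    if norm_sq == 0 or any(sum(x * x for x in c) != norm_sq for c in centers):
        return None
    for p, q in combinations(centers, 2):
        if sum((a - b) ** 2 for a, b in zip(p, q)) < norm_sq:
            return None
    return len(centers)


# One dimension: two spheres, one on each side.
line = [(1,), (-1,)]

# Three dimensions: the 12 permutations of (+-1, +-1, 0).
cuboctahedron = sorted(
    {perm for s1, s2 in product((1, -1), repeat=2)
     for perm in permutations((s1, s2, 0))}
)
```

A check like this certifies only a lower bound: it confirms that a submitted configuration of a given size exists, not that no larger one does.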

Zou’s account and the slides described the historical progression with some internal inconsistency in labels and dates, but the argumentative structure was clear. He described a long-standing construction at 582, a 2022 construction raising the number to 592, and a specialized Google DeepMind system, AlphaEvolve, improving it slightly to 593. The chart shown during this portion also highlighted a roughly 40-year plateau at 582 and an EinsteinArena improvement.

On EinsteinArena, Zou said, agents collectively constructed a solution with 604 non-overlapping spheres in 11 dimensions. He characterized the improvement as substantially better than what people had previously thought possible for the problem.

Milestone	Best-known lower bound as presented
Long-standing plateau described by Zou and highlighted on the slide	582
2022 construction in Zou’s account	592
AlphaEvolve result in Zou’s account	593
EinsteinArena agent construction	604

The 11-dimensional kissing-number progression as presented in Zou’s talk

The important point for Zou was not only the number but the lineage. The arena logs preserve the discussion and submission traces. He said one agent produced a partial solution with overlaps, while other agents contributed optimization techniques, including gradient-descent-like approaches, to remove the overlaps. He also said agents asked for help, shared intermediate results, and requested verification from one another.
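A toy version of that repair step can be sketched as follows. This is not the agents' actual method, which the talk did not detail. It assumes unit vectors as sphere centers (two outer spheres overlap when their inner product exceeds 0.5), a squared-hinge penalty on overlapping pairs, and plain gradient descent with renormalization; the function names, step size, target margin, and starting configuration are all illustrative assumptions.

```python
import math


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def max_cosine(points):
    """Largest inner product between distinct unit vectors (overlap if > 0.5)."""
    return max(dot(p, q) for i, p in enumerate(points) for q in points[i + 1:])


def repair_overlaps(points, target=0.45, lr=0.1, steps=2000):
    """Push overlapping unit vectors apart by gradient descent.

    Minimizes the sum over pairs of max(0, <p, q> - target)^2, projecting
    each point back onto the unit sphere after every step. The target is
    set slightly below 0.5 to leave a safety margin.
    """
    pts = [list(p) for p in points]
    dim = len(pts[0])
    for _ in range(steps):
        grads = [[0.0] * dim for _ in pts]
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                c = dot(pts[i], pts[j])
                if c > target:
                    g = 2.0 * (c - target)
                    for k in range(dim):
                        grads[i][k] += g * pts[j][k]
                        grads[j][k] += g * pts[i][k]
        for p, grad in zip(pts, grads):
            for k in range(dim):
                p[k] -= lr * grad[k]
            norm = math.sqrt(dot(p, p))
            for k in range(dim):
                p[k] /= norm
    return pts


# Five points on a circle; the first two start too close together.
angles = [0.0, 0.5, 2.0, 3.5, 5.0]
start = [(math.cos(a), math.sin(a)) for a in angles]
repaired = repair_overlaps(start)
```

Here one contributor could supply the flawed starting configuration and another the repair routine, which mirrors the division of labor Zou described, though at a vastly smaller scale than the 11-dimensional case.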

This was his answer to the claim that the result may be only more compute or a larger accumulated context window. No single model, even with a lot of compute, had produced the same solution on its own, he said. The result depended on the multi-agent trajectory rather than one agent independently solving the problem.

Asked why 11 dimensions were historically special, Zou did not give a full answer in the session, saying he would be happy to discuss offline. He did note that “very nice mathematical properties” appear in 11 dimensions, and said the agent-developed construction had a property in which the centers of the spheres live in a space he described as involving “Z” and “radical square root of two.”

Multi-agent work introduces social dynamics, not just parallel search

The Q&A pressed on whether Zou had evidence of true synergy among agents rather than merely parallelized search. His answer was cautious but affirmative. EinsteinArena’s discussion traces are logged as public records, he said, and those traces show agents asking for help, sharing intermediate results, asking others to verify candidate solutions, and building on prior submissions.

He also argued that multi-agent systems introduce social dynamics that affect scientific work. Some agents talk more than others. Some have different default “personalities.” Their interactions include disagreement, persuasion, and prioritization of ideas. Zou compared this to human collaboration, where personalities and debate affect which ideas advance.

Asked whether agents collaborate simultaneously or build asynchronously on one another’s prior solutions, Zou said both occur. Agents may come from different parts of the world, and the system does not require the humans behind those agents to disclose themselves. Some agents work at different times; others interact through the discussion forum. The arena is meant to mirror human scientific communities by combining collaboration with “friendly competition” through a leaderboard.

Another audience member asked what capabilities might emerge when the same model is instantiated many times, rather than simply being run through a clever harness. Zou emphasized that this remains an emerging area. His intuition was that, in teams, agents generate different ideas, disagree, and try to convince one another. That process, he said, can elicit more creative and robust reasoning than asking a single model to solve the problem from scratch. He compared it to a strong human researcher working alone versus researchers with different perspectives debating and refining ideas together.

Paper2Agent turns papers into active maintainers of knowledge

James Zou shifted from agent teams to scientific knowledge itself. For hundreds of years, he said, researchers have represented knowledge as static artifacts: papers. Those papers compress years or decades of work into words, figures, tables, and supplements. Zou argued that this is often an ineffective way to transmit the operational knowledge behind a dataset, method, or tool. Readers may struggle to reproduce results or apply a tool to their own problem from the paper alone.

Paper2Agent is his group’s attempt to convert passive papers into dynamic agents. The idea is to turn each research artifact into a “living” knowledge agent responsible for the knowledge in that paper. Such an agent should help reproduce experiments, maintain the knowledge as it changes, and assist readers in applying the paper’s insights or tools to new problems.

Under the hood, Paper2Agent uses multiple worker agents. One identifies the codebase and environment. Others extract tools, configure execution, test and refine workflows, and build a robust server around the paper. The output is a paper agent connected to a downstream language model through a Model Context Protocol (MCP) server. Zou described the result as something like a virtual corresponding author: an agent that knows how to use the paper’s code, data, and methods, and can generate reproducible workflows for new users.

One application is reproducibility. Zou showed a case in which a paper agent performed a standard single-cell preprocessing and clustering pipeline on a data file and produced outputs matching a human researcher’s results, including highly variable genes, UMAP, and cell-type annotation. He said these paper agents can do such reproduction at much larger scale and that each reproduction can cost less than a dollar.

The more speculative application is agent-to-agent collaboration between pieces of knowledge. Zou said his group is working to “agentify” human scientific knowledge, including papers on arXiv and bioRxiv, creating millions of agents, each responsible for one knowledge artifact. Those agents can then talk to each other.

His example involved an AlphaGenome agent and an ADHD genome-wide association study paper agent. The AlphaGenome paper introduced a tool for interpreting effects of human genomic mutations; Paper2Agent created an agent that knew how to use it. A separate academic study on ADHD risk became an agent that knew the dataset. Zou said the two agents identified a synergy, applied AlphaGenome to the ADHD genetic data, and produced a candidate finding about a mutation associated with ADHD risk. The slide described the finding as a “new splicing error associated with ADHD risk,” with rs1628703 altering MPHOSPH9 splicing and expression in glutamatergic neurons among 209 candidate variants in a locus.

Zou’s point was not just that one analysis produced one candidate result. It was that human collaborations are bandwidth-limited: authors may not find each other, may not respond, or may not have time. If papers become agents, he argued, knowledge can collaborate at much larger combinatorial scale.

Paperclip makes scientific knowledge look like a file system because agents are good at code

The final technical proposal asked what research artifacts should look like from an agent’s perspective. Papers, PDFs, repositories, supplements, and figures were designed mainly for human consumption. If future knowledge consumption is performed as much by AI agents as by humans, Zou argued, research artifacts should be represented in forms agents can access more effectively.

Paperclip is the group’s proposal: represent knowledge as file systems. A paper becomes a directory with sections as Markdown files, tables as CSV files, images as files, and supplements as structured files. An agent can inspect the paper using shell-like commands, read tables with Python, and navigate scientific content through coding interfaces. A hidden virtual file-system layer translates shell commands into parallel queries across an indexed corpus and assembles the results as ordinary files and directories.
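The paper-as-directory idea can be illustrated with a small local sketch. It does not use Paperclip itself: the directory layout, file names, and table contents below are hypothetical placeholders, and the code simply shows how an agent with ordinary file and CSV tools could navigate such a structure.

```python
import csv
import tempfile
from pathlib import Path


def write_paper(root: Path) -> Path:
    """Create a toy paper directory: sections as Markdown, tables as CSV."""
    paper = root / "papers" / "example-paper"
    (paper / "sections").mkdir(parents=True)
    (paper / "tables").mkdir()
    (paper / "sections" / "01_introduction.md").write_text(
        "# Introduction\n\nPlaceholder text.\n", encoding="utf-8")
    (paper / "sections" / "02_methods.md").write_text(
        "# Methods\n\nPlaceholder text.\n", encoding="utf-8")
    with (paper / "tables" / "table1.csv").open(
            "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([
            ["system", "accuracy_percent"],  # placeholder data
            ["A", "90"],
            ["B", "75"],
        ])
    return paper


def list_files(paper: Path):
    """Equivalent of a recursive directory listing, as relative POSIX paths."""
    return sorted(p.relative_to(paper).as_posix()
                  for p in paper.rglob("*") if p.is_file())


def read_table(path: Path):
    """Load a CSV table as a list of dictionaries keyed by column name."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    paper = write_paper(Path(tmp))
    files = list_files(paper)
    rows = read_table(paper / "tables" / "table1.csv")
    best = max(rows, key=lambda row: float(row["accuracy_percent"]))
```

Because every step is an ordinary file operation, an agent needs no paper-specific parser: listing, reading, and filtering are skills it already has from coding tasks.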

The rationale is practical. Current AI systems are strong at coding, so converting knowledge into pseudo-code-like, file-system-like structures lets agents use capabilities they already have. Zou said figures and tables can also be converted into this representation.

System	Accuracy	Average time	Average cost
Paperclip	100%	1m 6s	$0.21
Claude Code with web search	86%	3m 42s	$1.07
Edison Kosmos	4%	9m 29s	$1.00

The Paperclip comparison shown for agent-native indexing

Zou said the group has made Paperclip available as a free community resource. The slide stated that Paperclip includes 7.5 million PMC papers, 388,000 bioRxiv papers, 82,000 medRxiv papers, 3.0 million arXiv papers, and 150 million OpenAlex abstracts. Zou emphasized that the resource was not limited to life sciences; it includes physics and astronomy papers as well. With one line of code, he said, a user can give an agent access to this cloud-hosted virtual file system.

Corpus shown on the Paperclip slide	Scale shown
PMC	7.5M
bioRxiv	388K
medRxiv	82K
arXiv	3.0M
OpenAlex abstracts	150M

The scientific corpora Zou said Paperclip has indexed as agent-accessible file systems

When asked whether other labs could use the tools, Zou answered that “all of these tools are actually all open source and freely available.” He specifically pointed to Paperclip’s indexing of arXiv and other corpora so that communities beyond biomedical research could use it.
