DeepMind’s AI Co-Scientist Turns LLMs Into Debate-Driven Research Agents

Karan Singh | Stanford Online | Wednesday, May 27, 2026 | 19 min read

Google DeepMind’s Vivek Natarajan used a Stanford CS25 seminar to argue that scientific AI will require more than stronger chatbot-style models. He presented the company’s Gemini-based AI co-scientist as a multi-agent system built to generate, critique, rank and refine hypotheses over longer time horizons, with lab validation rather than benchmark scores as the test of usefulness. The case he made was cautious as well as ambitious: such systems may help scientists traverse large hypothesis spaces, but their value still depends on expert judgment, experimental capacity, publishing norms and safety controls.

The target is not an AI that answers science questions, but one that can keep thinking

Vivek Natarajan framed Google DeepMind’s work in science and medicine as an attempt to build general-purpose AI systems that act as collaborators for experts. The technical account centered on the AI co-scientist: a Gemini-based multi-agent system intended to generate, refine, rank, and explain scientific hypotheses over longer time horizons than ordinary chatbot interaction. AMIE, the medical co-physician effort, was signposted as part of the broader goal of making medical expertise universally accessible, but the system-level argument he developed in detail concerned the co-scientist.

The tension running through Natarajan’s account is that the system’s promise rests on debate-and-ranking loops that can produce more plausible hypotheses, while its bottlenecks remain external: experimental validation, expert attention, publication infrastructure, and safety. A system that can generate many candidate ideas is not automatically useful. It has to decide what is worth a scientist’s time, explain what it does not know, and survive contact with the lab.

The starting claim was that ordinary large language models are useful but mismatched to the way difficult science is done. Systems such as Gemini, ChatGPT, and Claude, Natarajan argued, mostly produce “system one” responses: fast, intuitive answers often driven by surface correlations and pattern matching. Scientific discovery, by contrast, often requires “system two” thinking: slower, more deliberate, more rigorous work in which ideas are held, revised, challenged, and tested over weeks, months, or years.

That gap matters because the ambition is not literature review or medical question answering. Natarajan traced the co-scientist project back to earlier work on Med-PaLM, Med-PaLM 2, and Med-Gemini. Med-PaLM, introduced in a 2023 Nature paper, was described as one of the first medically tuned large language models and one of the first systems to reach passing-level performance on US Medical Licensing Exam-style questions; a later version reached expert-level scores. Those results showed, in his account, that state-of-the-art foundation models were necessary for medical and scientific work, but not sufficient for valid, novel hypothesis generation.

The hypothesis-generation direction came from Stanford. After a 2023 Stanford talk on Med-PaLM, Natarajan said Stanford’s Gary Peltz suggested using large language models to identify causative genetic factors for rare diseases. At the time, the team was still working with PaLM-class models, the precursor to Gemini, and was struggling with reliability and hallucination. The idea that an LLM system could generate novel scientific hypotheses “felt wild,” because that activity belonged squarely to expert scientists.

The team nevertheless built an early agentic scaffold around the PaLM-based system. It had access to databases and tools, could retrieve and read literature, and could be asked to predict likely genes associated with rare disease variants. Natarajan said one hypothesis from that system was later validated in mouse experiments using CRISPR and published in Advanced Science. A slide described a related mouse hearing-loss study and said Med-PaLM 2 hypothesized a Crym mutation later confirmed in CRISPR knockout mice experiments. The result was promising, but it also clarified the central limitation: a naïve LLM prompt was not enough. The system needed an architecture for sustained scientific reasoning.

Natarajan used a task-versus-timescale chart to define what “progress” would mean. On one axis were scientific tasks arranged by complexity and impact: literature review; experimentation; data analysis and hypothesis generation; writing a research paper; a PhD dissertation; developing a major theory; and paradigm-shifting breakthroughs. On the other axis was the time human scientists might take: minutes, hours, days, months, years, decades, or lifetimes.

Reference point	Where Natarajan placed it	Why it mattered
AlphaFold	High impact and long human-timescale work, but outside the desired general-purpose category	Before AlphaFold, determining one protein structure could take three to five years of PhD-level work; AlphaFold produced predictions for millions of proteins, but only takes protein sequences as input and produces protein structures as output.
Gemini, GPT, Claude	Low-complexity, short-timescale scientific assistance such as literature review and evidence synthesis	They have broad natural-language generality, but Natarajan said there was little evidence, at least until recently, that base LLM systems could handle more complex scientific tasks such as hypothesis generation or writing research papers.
AI co-scientist	Aimed up and to the right: more complex tasks over longer time horizons	The system is meant to use natural-language goals, constraints, debate, ranking, memory, and tool use to sustain scientific reasoning beyond a single model response.

Natarajan used a task-versus-timescale chart to distinguish specialized scientific breakthroughs, general LLM assistance, and the intended direction of the co-scientist.

AlphaFold occupied an instructive place in that framing. Natarajan described it as having done the equivalent of millions of human-scientist years of work, because before AlphaFold, determining the structure of a single protein could take three to five years of PhD-level effort, while AlphaFold produced predictions for millions of proteins. But he placed AlphaFold outside the desired category of scientific superintelligence because it lacks generality: it takes protein sequences as input and produces protein structures as output. It is extremely capable, but specialized.

The co-scientist effort is aimed at a different property. Generality, as Natarajan defined it, is the ability to take a wide range of problems, understand them, break them down, and make reasonable progress, even if it cannot solve them outright. He argued that natural language is the key interface for that kind of generality. A scientific collaborator that accepts natural-language goals, constraints, preferences, seed ideas, literature, and experimental data can in principle be applied to many scientific domains rather than one specialized task.

The goal, then, is to move general-purpose LLM systems “up and to the right” on the scientific task chart: from short-timescale literature review toward more complex scientific tasks over longer horizons.

Self-play becomes scientific debate

The architectural borrowing from DeepMind’s history was self-play. Natarajan described AlphaGo and AlphaZero as systems in which agents improve by playing against each other in an environment, receiving reward signals, reinforcing winning strategies, and scaling with compute. In simplified form, he presented the achievement as a system that could be started from scratch, run for a period of time with the right reward structure, and become superhuman at a task.

The question was how to adapt that idea beyond games. Scientific discovery does not have the clean rules and terminal rewards of Go or chess. Natarajan’s answer was to generalize self-play into scientific debate and self-debate. Instead of agents playing legal moves on a board, the co-scientist has agents generating hypotheses, critiquing them, debating them, ranking them, and evolving them over time.

The system he described is intentionally collaborative rather than autonomous. The human scientist remains in the driver’s seat. The scientist supplies a research goal in natural language, along with constraints, preferences, ranking rubrics, seed directions, papers, multimodal materials, and experimental data. The system then computes over that context for minutes, hours, days, or even weeks, depending on the problem. Its output is a research report: hypotheses, supporting rationale, limitations, uncertainty, and a summarized view of the most promising directions.

The architecture slide showed the scientist’s inputs flowing into a multi-agent system with generation, reflection, evaluation, proximity, ranking, and evolution agents. The same diagram emphasized memory and tool use, including search and additional tools, and described the output as top-ranked hypotheses and research proposals summarized back to the scientist.

Agent or function	Role in the system
Generation	Produces scientific ideas and hypotheses, using strategies such as literature exploration or simulated expert debate.
Reflection and evaluation	Reviews, critiques, and verifies hypotheses, including with web search and deeper verification.
Ranking	Runs scientific debate tournaments among hypotheses and prioritizes them against the scientist’s criteria.
Evolution	Improves, simplifies, extends, or recombines ideas after feedback from reviews and rankings.
Memory	Stores summaries of debates, limitations, and win-loss patterns so later agents can use them.
Tool use	Allows the system to search and use additional tools, with Gemini models providing long-context, multimodal, and agentic capabilities.

The co-scientist architecture was presented as a multi-agent loop around generation, critique, ranking, memory, and evolution.

At the center is a multi-agent loop. Natarajan reduced the architecture to four recurring functions: generate ideas, review and critique them, rank and prioritize them, and improve or evolve them. These functions run asynchronously and continuously, implemented with Gemini models configured through different system prompts and strategies.
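The four recurring functions can be sketched as a small loop. This is a minimal illustration only, not DeepMind's implementation: the talk described the agents as Gemini models running asynchronously, whereas this sketch runs sequentially with plain-function stand-ins. Every name here (`Hypothesis`, `run_loop`, the `toy_*` functions, the `rounds` and `keep` parameters) is a hypothetical chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    critiques: List[str] = field(default_factory=list)
    score: float = 0.0


def run_loop(
    goal: str,
    generate: Callable[[str], List[Hypothesis]],
    review: Callable[[str, Hypothesis], str],
    rank: Callable[[str, List[Hypothesis]], List[Hypothesis]],
    evolve: Callable[[str, Hypothesis], Hypothesis],
    rounds: int = 2,
    keep: int = 2,
) -> List[Hypothesis]:
    """Sequential stand-in for the generate / review / rank / evolve cycle."""
    pool = generate(goal)
    for _ in range(rounds):
        # Review: attach a critique to every hypothesis in the pool.
        for h in pool:
            h.critiques.append(review(goal, h))
        # Rank: keep only the top candidates.
        survivors = rank(goal, pool)[:keep]
        # Evolve: add a refined variant of each survivor.
        pool = survivors + [evolve(goal, h) for h in survivors]
    return rank(goal, pool)


# Toy stand-ins. In the described system each role would be an LLM agent.
def toy_generate(goal: str) -> List[Hypothesis]:
    return [Hypothesis(f"{goal}: idea {i}") for i in range(4)]


def toy_review(goal: str, h: Hypothesis) -> str:
    return f"needs evidence for '{h.text}'"


def toy_rank(goal: str, pool: List[Hypothesis]) -> List[Hypothesis]:
    # Placeholder scoring: count how many refinement passes an idea has had.
    for h in pool:
        h.score = float(h.text.count("refined"))
    return sorted(pool, key=lambda h: h.score, reverse=True)


def toy_evolve(goal: str, h: Hypothesis) -> Hypothesis:
    return Hypothesis(h.text + " [refined]")


ranked = run_loop("toy goal", toy_generate, toy_review, toy_rank, toy_evolve)
for h in ranked:
    print(h.score, h.text)
```

Passing the four roles in as callables keeps the loop independent of how each role is implemented, which loosely mirrors the description of one underlying model configured through different prompts and strategies.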

The generation agents use a library of strategies. One strategy is straightforward literature exploration: read relevant papers, form a knowledge base, and propose ideas. Another is simulated expert debate: instantiate a back-and-forth conversation between experts for several turns, then synthesize the resulting idea. Natarajan said team members also study how prominent thinkers reason in podcasts or other public materials, abstract their approaches, and encode them as prompts in the strategy library. The library is meant to evolve.

The ranking agent matters because idea generation alone does not solve the scientist’s problem. Expert scientists are not short on ideas; they are short on time and resources. A useful co-scientist must respect expert attention by surfacing ideas worth pursuing, explaining confidence, and identifying uncertainties.

“If you’re truly building out a collaborative partner for these scientists, then you have to respect their time and their attention and only surface ideas that are really worth their time and their attention.”

Vivek Natarajan

The ranking agent organizes debate tournaments between hypotheses. Each debate evaluates candidate ideas against the criteria the scientist supplied. The system can then compute Elo-style scores, analogous to ranking chess or Go players, to prioritize hypotheses. Because the debates occur in natural language, their summaries can be written back into system memory. Later agents can read those summaries and use them to produce better hypotheses, better critiques, and better rankings. Natarajan described this as a self-improving loop.
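To make the Elo analogy concrete, here is a minimal sketch of a round-robin tournament scored with the standard chess Elo update. The talk said only that the scores are "Elo-style", so the starting rating of 1200, the K-factor of 32, the round-robin pairing, and the `judge` callable (a stand-in for a natural-language debate verdict) are all assumptions made for illustration.

```python
import itertools
from typing import Callable, Dict, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> Tuple[float, float]:
    """Return the new ratings for A and B after one debate."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def tournament(
    names: List[str],
    judge: Callable[[str, str], bool],  # hypothetical: True if the first hypothesis wins
    start: float = 1200.0,
    k: float = 32.0,
) -> List[Tuple[str, float]]:
    """Round-robin: every pair of hypotheses debates once."""
    ratings: Dict[str, float] = {n: start for n in names}
    for a, b in itertools.combinations(names, 2):
        ratings[a], ratings[b] = update(ratings[a], ratings[b], judge(a, b), k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Toy judge that always prefers the alphabetically earlier label.
ranked = tournament(["H1", "H2", "H3"], judge=lambda a, b: a < b)
for name, rating in ranked:
    print(name, round(rating, 1))
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating mass stays constant, so scores are meaningful only relative to other hypotheses in the same tournament.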

The output is not just the top idea. Once computation reaches an end state — for example, a target number of ideas, or a point at which the system cannot make further progress — the system clusters the ideas it explored and produces a summary document for the scientist. Natarajan described the architecture as an “agentic in silico implementation” of the kind of thought process a scientist goes through when developing new ideas.
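The two end states mentioned, a target number of ideas or a stall in progress, could be checked with something like the following. This is a sketch under stated assumptions: the talk did not say how "cannot make further progress" is measured, so the history of best scores, the `patience` window, and the `min_gain` tolerance are hypothetical choices.

```python
from typing import Sequence


def should_stop(
    n_ideas: int,
    best_scores: Sequence[float],  # best score observed after each round
    target_ideas: int = 100,       # assumed budget, not a figure from the talk
    patience: int = 3,
    min_gain: float = 1e-6,
) -> bool:
    """Stop when enough ideas exist or the best score has stalled."""
    if n_ideas >= target_ideas:
        return True
    if len(best_scores) > patience:
        recent_best = max(best_scores[-patience:])
        earlier_best = max(best_scores[:-patience])
        # Stalled: the last `patience` rounds did not beat the earlier best.
        return recent_best - earlier_best < min_gain
    return False


print(should_stop(10, [1.0, 2.0, 3.0, 4.0, 5.0]))  # still improving
print(should_stop(10, [1.0, 5.0, 5.0, 5.0, 5.0]))  # stalled
```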

The limits of test-time compute became an early tension. Asked whether the system saturates as more compute is applied, Natarajan said it depends on the class of problem. If the search space is large and the problem has a useful fitness function, additional compute can continue to improve the result. If the problem is trivial, extra compute is wasteful. If the problem is beyond current knowledge — his example was asking the system to build a time machine — no amount of compute will solve it. He placed many useful scientific problems in the middle: enough knowledge exists, the search space is large, and more computation plus more information can yield better hypotheses over time.

Another audience question pressed whether the system might generate incremental “gap-filling” research rather than breakthroughs. Natarajan answered that the goal is to produce ideas compelling enough that a human expert would drop their own current ideas and work on the AI-generated one. If that happens often enough, he argued, the system is working. He also acknowledged that the likely bottleneck is moving from hypothesis generation to verification and validation: there may soon be, or already are, more plausible hypotheses than science has capacity to test.

The strongest evidence is not a benchmark score, but lab validation

Validation was the recurring standard. If the system is claimed to support scientific discovery, Natarajan said, the proof must come from the lab: hypotheses should work experimentally, or at least recapitulate discoveries that were not available to the model in ordinary published form.

The first validation example was a recapitulation in antimicrobial resistance. During Thanksgiving 2024, the DeepMind team contacted researchers at Imperial College London who had been working for years on a horizontal gene-transfer mechanism in bacteria related to antimicrobial resistance. According to Natarajan, the Imperial researchers had made an important discovery but had not yet published it. They proposed a test: give the co-scientist the same research question that had driven their work for roughly eight to ten years and see what it produced.

The slide compared a conventional experimental pipeline from 2013 to 2025 with an AI-assisted hypothesis-development run over two days. The AI co-scientist was shown as generating hypotheses for the role and importance of cf-PICIs across bacterial species, ranking them by novelty, feasibility, and testability, and recapitulating top experimental findings “without previous insights.” The associated citation on the slide was “AI mirrors experimental science to uncover a mechanism of gene transfer crucial to bacterial evolution,” Cell, 2025.

Path	Timescale shown	What the slide said happened
Conventional experimental pipeline	2013–2025	Researchers proposed cf-PICIs, expanded the proposal across bacterial species, asked why they were conserved, and developed hypotheses through in vitro experimental work.
AI co-scientist assisted hypothesis development	2 days	The same research question was posed to the AI co-scientist, which generated and ranked hypotheses for the role and importance of cf-PICIs and recapitulated the experimental findings without the unpublished insights.

The antimicrobial-resistance slide framed the co-scientist as recapitulating a multi-year mechanistic discovery in two days.

Natarajan said that when the team sent the results to Jose and Tiago, the key investigators, Jose replied within about 30 minutes asking to talk immediately. On the call, Jose asked whether the team had access to his email, and then whether it had access to his ChatGPT, because the hypothesis matched what he had been writing up. Natarajan presented that reaction as the moment the team began to believe the system might be onto something.

He presented the antimicrobial-resistance case as a recapitulation of unpublished human work, not as a fully prospective discovery. The example was meant to show that the architecture could traverse a large scientific hypothesis space and arrive at a mechanistic result that researchers had developed through years of experimental work.

Other collaborations supplied different kinds of evidence, with different degrees of maturity. Physician-scientists at Houston Methodist Hospital used the co-scientist to identify repurposed drugs and combination therapies for acute myeloid leukemia. A slide showed IC50 curves for drug activity across cell lines and highlighted KIRA6 as a suggested AML repurposing target. The system’s own novelty review was deliberately qualified: targeting IRE1-alpha in AML had been explored, but not with that specific drug and treatment framing; the idea’s strength lay in its mechanistic hypothesis and patient population, especially FLT3-ITD-positive disease. The slide also noted that two other novel drugs suggested by the AI co-scientist did not work.

That caveat was important to Natarajan’s argument. He presented calibration, not perfection, as the desired behavior. The system should explain why a hypothesis is plausible, what has already been studied, what is genuinely new, and what must be validated before stronger claims can be made. A useful scientific system may still be wrong often; the claim is that it should be wrong in ways that are visible, qualified, and experimentally triageable.

A second example came from Gary Peltz’s Stanford lab and liver fibrosis. Natarajan said Peltz used the system to identify new epigenomic targets, then asked it for an experimental protocol for validation in human liver organoids and for drugs that could validate the target. The slide reported that four suggested drugs showed promising anti-fibrotic activity in hepatic human organoids. One was Vorinostat, described on the slide as an anti-cancer drug that aided regeneration and cut TGFB-induced chromatin damage by 91%.

Natarajan used the Vorinostat example to explain the complementarity he sees between human scientists and the AI system. A liver fibrosis expert may not closely track cancer therapeutics, while the AI can search broadly across domains and surface unexpected connections. The human scientist then applies deep expertise and judgment. In his framing, the value comes from combining breadth and depth rather than replacing either one.

A third example involved researchers at the Sainsbury Lab in the UK. They had AlphaFold structure predictions and wanted to identify anomalies worth investigating. The co-scientist helped design a Structural Novelty Index, a set of features used to reanalyze those structures. Natarajan said the work uncovered a previously unknown, massive potato immunoprotein: an 11-mer plant resistosome, rather than the hexagonal form plant biologists typically assumed. The slide described implications for plant immunity and global food security and cited “AI-guided discovery of atypical protein assemblies,” bioRxiv 2025.

The co-scientist was also used with AlphaFold in the loop for protein design. In the Yamanaka factor example, the task was to design improved OCT4 proteins. Natarajan described OCT4 as one of the Yamanaka transcription factors that can rejuvenate cells but can also create tumor risk, motivating safer variants with similar activity. The co-scientist proposed sequence modifications, used AlphaFold to evaluate structural stability, and refined the designs iteratively in silico. The slide compared original OCT4 with a modified OCT4 variant and showed improved AlphaFold metrics including pTM, ipTM, and pLDDT values. Natarajan said the team was doing a lot of this work, taking and testing de novo protein designs in the lab, and that the results were looking extremely promising. He did not present the OCT4 work as a completed therapeutic result.

A related rejuvenation-factor example was explicitly marked on the slide as “Unpublished and needs peer review.” The system was asked to read literature and data, then nominate genetic factors or secreted proteins that might reduce cellular senescence. The slide, attributed to AbuGoot Lab at Harvard, compared KLOTHO as a positive control with redacted novel AI-discovered factors in 74-year-old donor cells. Natarajan said tested AI-nominated factors produced fold reductions in the percentage of senescent cells comparable to the positive control, while emphasizing that more experiments were needed.

The agent scaffold matters when the mechanism has many steps

The Alzheimer’s case study was Natarajan’s most detailed example of why he believes the multi-agent harness can outperform base LLMs on scientific reasoning.

The collaboration involved researchers at Mass General Hospital studying a clinical paradox: ACE inhibitors, widely used for hypertension and heart disease, were said to increase Alzheimer’s disease risk in APOE4 carriers. The missing question was mechanism. Natarajan said the human researchers had spent years building a complex nine-step experimental cascade connecting drug exposure to disease biology.

The co-scientist was given the same problem: why ACE inhibitors might lead to increased Alzheimer’s risk. According to Natarajan, it recapitulated the nine-step mechanistic cascade and also predicted a key step the scientists had missed. In simplified form, he described ACE inhibitors as modulating bradykinin, which interacts with a B2R receptor on brain cells and can trigger neurodegeneration. The AI identified a missing link between bradykinin and the B2R receptor.

A slide described the prospective validation in three parts: the AI identified APOE4-mediated impairment of Bradykinin B2 receptor desensitization as the pivotal link; it proposed an antagonist chase assay to isolate endosomal signaling from surface receptor activity in APOE4 models; and researchers experimentally confirmed the mechanism. The slide further stated that co-administering an NHE inhibitor rescued endosomal re-acidification. Natarajan described the lab test as a protein stability chase assay that validated the missing step. The case was also introduced on a slide as unpublished data needing peer review.

The comparison with base LLMs was presented as a benchmark on the same problem. The slide said the AI co-scientist predicted initial bradykinin accumulation, APOE4-specific B2R desensitization failure, neuronal energetic crisis and hypoxia, stress MAPKs including JNK/p38 as tau kinases, and a full-stack nine-step causal chain. Claude 4 and GPT-5, by contrast, were shown as getting the initial bradykinin accumulation but missing the B2R desensitization failure, missing the energetic crisis and hypoxia, identifying generic GSK-3B as an incorrect pathway, and plateauing at generic downstream effects.

Mechanistic feature	AI co-scientist	Claude 4 / GPT-5
Initial bradykinin accumulation	Yes	Yes
APOE4-specific B2R desensitization failure	Yes, described as a novel discovery	Missed entirely
Neuronal energetic crisis and hypoxia	Yes	Missed
Specific tau kinase identification	Stress MAPKs, including JNK/p38	Generic GSK-3B, described as incorrect pathway
Final output complexity	Full-stack nine-step causal chain	Generic downstream effects

The Alzheimer’s case study benchmark presented the agentic scaffold as materially more specific than base LLMs.

Natarajan’s conclusion from this case was that the scaffold — debate, verification, ranking, and iteration over time — is not a cosmetic wrapper. The time spent generating, testing, critiquing, and refining hypotheses is what allows the system to arrive at mechanistic detail rather than a plausible high-level story.

The Alzheimer’s example combined recapitulation, a claimed prospective lab validation of a missing node, and a comparison against base LLMs. The slide marked the case as unpublished and needing peer review.

Some outputs are hypotheses, not discoveries, and the distinction remains important

Several examples were explicitly marked as unpublished or needing peer review. Natarajan did not present all co-scientist outputs as completed discoveries. In the inverse-comorbidity case, he said he had personally been interested in the relationship between neurodegenerative disease and cancer, including the observation that Alzheimer’s patients have lower cancer risk. He asked the system to read the literature and suggest understudied pathways common in neurodegenerative disease that might also be implicated in cancer.

The system proposed DHX9 and SRRM4 as “neurogenes” relevant to small cell lung cancer. It also suggested contacting Dr. Filippo Bellegia in Germany; the visuals variously rendered the affiliation as University of Kohl or University of Koln. Natarajan said he sent a cold email with the hypothesis, and Bellegia checked whole-genome CRISPR screen and transcriptomic data. In small cell lung cancer, the two genes were overexpressed compared with other cancers. The validation slide showed SCLC at the 98th percentile for DHX9/SRRM4 expression, versus other lung cancers at the 35th percentile, breast cancer at the 28th percentile, and prostate cancer at the 22nd percentile.

Cancer type	DHX9/SRRM4 expression percentile shown
Small cell lung cancer	98th percentile
Other lung cancers	35th percentile
Breast cancer	28th percentile
Prostate cancer	22nd percentile

In the inverse-comorbidity case study, the validation slide presented SCLC as an outlier for DHX9/SRRM4 expression.

The expression-data check supported the system’s prediction about where the genes were unusually active. Natarajan still described the example as unvalidated: a system-generated connection, supported by an external data check, that might lead to new insights or discoveries.

The biological story shown on the slides was that SCLC generates atypical nucleic acids such as double-stranded RNA during rapid growth, which would normally trigger viral-mimicry death sensing through PKR. The AI hypothesis was that SCLC overexpresses DHX9 and SRRM4 to resolve toxic RNAs, allowing the cancer to evade self-destruction. In brain disease, defects in those same genes could allow toxic RNA tangles to accumulate, contributing to premature neuron death.

The system’s full reports, Natarajan said, can run more than 100 pages and include detailed candidate ideas, diagrams, unexpected connections, and suggested research contacts. In the inverse-comorbidity report, one candidate idea proposed “Z-Glues,” small molecules designed to stabilize interactions among ZBP1, Zα domains, and Z-RNA, reframing Alzheimer’s disease as a failure of RNA conformational sensing rather than only protein misfolding. Another slide listed potential experts to contact for specific ideas.

That form of output creates a practical problem: how scientists should consume large AI-generated research reports. Natarajan said the system tries to preserve detail while directing scientists’ attention to the most compelling hypotheses first. It may still include lower-ranked ideas because the system can be too conservative, and ideas it judges nonviable may contain clues useful to a human expert. Conversely, the system may report that none of its current ideas are likely to work and recommend reframing the problem or focusing on subproblems. He said this happens often in mathematical problems: the system struggles with a full proof but can make progress when the human breaks the problem into steps and works iteratively with it.

The broader unresolved issue is verification capacity. Natarajan accepted the premise that AI systems could soon produce a flood of plausible scientific hypotheses, more than can all be tested. The co-scientist’s ranking and uncertainty machinery is intended to triage that flood, but the bottleneck then shifts to experimental resources, prioritization, and judgment.

Safety and scientific publishing become systems problems

The audience questions pushed beyond the mechanics of hypothesis generation.

One question asked about knowledge cutoffs: whether the system could be tested on historical corpora, such as pre-1931 text, and asked to predict later discoveries. Natarajan said the Imperial antimicrobial-resistance example and the Alzheimer’s recapitulation were similar in spirit, because the relevant results were not publicly available in ordinary form. But he said isolating corpora cleanly is difficult because of leakage. The team has also been setting up tasks that predict future events, such as clinical trial outcomes, where he said the system has had reasonable success so far. He cautioned that many relevant details are proprietary, and the value depends on having the right knowledge without revealing the outcome.

Another question concerned peer review in a world where AI-generated or AI-augmented papers proliferate. Natarajan said he did not have a good answer. He mentioned that arXiv had introduced a policy under which papers containing hallucinated references could lead to a one-year block from using arXiv, but said he did not think that approach was very productive. AI-assisted peer review can be helpful, he said, but if used injudiciously it could favor only papers that pass through a particular AI filter, leaving other topics underexplored.

New infrastructure may be needed: formats and protocols for agents to share discoveries, scientific data, and context. Natarajan floated the idea of a “science context protocol” and standardized ways to describe how data were generated and shared. But he also emphasized that the social and epistemic problems are thorny, especially because humans will not be able to keep up with the volume of papers and hypotheses.

The final audience question concerned safety: whether a system capable of generating biological hypotheses could be used to design new pathogens or otherwise unsafe research. Natarajan said the co-scientist uses multiple safety layers. The first checks the scientist’s prompt or goal for nefarious intent. Because unsafe goals can be phrased indirectly, the system also monitors the ideas produced during exploration. If a threshold fraction of ideas enters unsafe territory — he said the current setting is 10% — the computation is halted and the scientist is told the research goal is unsafe. The system also inherits safety properties from the underlying Gemini models, including CBRN and text testing.
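The first two layers described, a check on the goal and a running check on generated ideas, can be sketched as follows. Only the 10% figure comes from the talk. The classifier callables, the status strings, and the choice to halt at "greater than or equal to" the threshold are assumptions for illustration, and the keyword checks are toy stand-ins for real safety classifiers.

```python
from typing import Callable, Iterable

UNSAFE_FRACTION_THRESHOLD = 0.10  # the setting Natarajan cited


def unsafe_fraction(ideas: Iterable[str], idea_is_unsafe: Callable[[str], bool]) -> float:
    """Fraction of ideas flagged by the (hypothetical) classifier."""
    ideas = list(ideas)
    if not ideas:
        return 0.0
    return sum(1 for idea in ideas if idea_is_unsafe(idea)) / len(ideas)


def check_run(
    goal: str,
    ideas: Iterable[str],
    goal_is_unsafe: Callable[[str], bool],
    idea_is_unsafe: Callable[[str], bool],
    threshold: float = UNSAFE_FRACTION_THRESHOLD,
) -> str:
    # Layer 1: screen the stated research goal.
    if goal_is_unsafe(goal):
        return "rejected_goal"
    # Layer 2: monitor ideas produced during exploration.
    if unsafe_fraction(ideas, idea_is_unsafe) >= threshold:
        return "halted"
    return "continue"


# Toy keyword classifiers, for demonstration only.
def flag(text: str) -> bool:
    return "UNSAFE" in text


mostly_safe = ["idea"] * 19 + ["UNSAFE idea"]      # 5% flagged
drifting = ["idea"] * 17 + ["UNSAFE idea"] * 3     # 15% flagged
print(check_run("benign goal", mostly_safe, flag, flag))
print(check_run("benign goal", drifting, flag, flag))
```

Monitoring the fraction of flagged ideas, rather than any single one, is what lets an indirectly phrased goal be caught after it passes the input check.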

Multi-agent systems expand the surface area for misuse. The answer, in Natarajan’s framing, is not one guardrail but a layered approach: input checks, ongoing monitoring of generated ideas, and base-model safety.
