AI Tools Target Labeling, Simulation, and Scaling Bottlenecks in Research
At Stanford’s second AI+Science lightning-talk session, three researchers presented AI less as a general-purpose scientific shortcut than as infrastructure for specific measurement problems. Matt DeButts argued that PRC-linked patronage can reshape Chinese-language media markets by helping already favorable outlets survive; Samuel Young showed how self-supervised learning can extract particle structure from unlabeled detector data; and Benjamin Dodge described using AI-scale computation to make Gaussian process priors practical for 3D maps of Milky Way dust. The shared claim was that AI’s value depended on a sharply defined bottleneck: too many articles to label, too few reliable detector labels, or too large an inference problem for conventional computation.

Influence can work by changing which media outlets survive
Political actors do not have to directly censor, co-opt, or repress news organizations to shape a media market. Matt DeButts, a Stanford communication PhD candidate, described a different mechanism: patronage that supports already aligned outlets, increasing their odds of survival and changing the composition of available news over time.
DeButts framed the problem around authoritarian influence. Direct pressure on media can provoke backlash from media organizations, governments, or audiences. The alternative he studied is less direct: support outlets whose coverage is already favorable, allow those organizations to persist, and let market selection do part of the political work.
His case was the Chinese government and Chinese-language media organizations outside China, including outlets in places such as San Francisco, London, and Lima. To build the dataset, DeButts said the researchers began with Cloudflare's top one million domains, then narrowed to Chinese-language sites producing news. They supplemented that list with additional sources, including news aggregators such as Literature City, archives, and investigative reports. The resulting dataset covered 193 Chinese-language media organizations globally, nearly 14 million articles, more than 100,000 homepage snapshots, historical web-traffic estimates, and social media mentions spanning roughly 25 years.
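As a rough illustration of the narrowing step, a language filter over candidate homepages might look like the sketch below. The langdetect package, the example URL, and the fetching logic are placeholders for illustration, not the study's actual pipeline.

```python
import requests
from langdetect import detect

def is_chinese_site(url):
    """Keep domains whose homepages are detected as Chinese-language."""
    try:
        html = requests.get(url, timeout=10).text
        return detect(html) in ("zh-cn", "zh-tw")
    except Exception:
        return False

candidates = ["https://example-news-site.com"]  # e.g. from a top-domains list
chinese_sites = [u for u in candidates if is_chinese_site(u)]
```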
The AI component scaled a labeling task. DeButts said the team used bidirectional encoder models (BERT-style classifiers) to replicate human labels for each article's stance toward the Chinese government, then applied the trained model across the full article corpus. The resulting comparison distinguished unaffiliated outlets from outlets that later received PRC-linked patronage.
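The talk did not specify the exact model, so the sketch below is a minimal version of this kind of stance labeler, assuming a BERT-family checkpoint such as bert-base-chinese and a hypothetical three-way stance scheme; the study's actual setup may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical stance scheme: 0 = critical, 1 = neutral, 2 = favorable.
tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=3
)

# In practice the model would first be fine-tuned on the human-labeled
# articles; this shows only inference over a batch of article texts.
articles = ["...article text...", "...another article..."]
batch = tok(articles, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
stances = logits.argmax(dim=-1)  # one predicted stance label per article
```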
The central finding was not simply that patronage-linked outlets published more favorable coverage. DeButts emphasized timing: outlets that had not yet received PRC money were already highly positive toward China compared with unaffiliated outlets. On censored topics in China, including Falun Gong and Tiananmen, the same pattern appeared. Patronage-linked outlets “almost completely” avoided those topics before receiving funding.
That mattered for interpretation. DeButts said the evidence suggested selection more than conversion: these organizations were being selected for pre-existing coverage rather than being made more favorable after funding.
The second test concerned survival. DeButts said the researchers compiled a dataset of more than 200 news organizations in the United States and compared publication status over time. A heatmap shown during the talk marked patronage-linked outlets, unaffiliated outlets, and defunct outlets; DeButts said the visual pattern was confirmed with a Cox proportional hazards test. Patronage, in his account, was associated with greater survival.
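For readers unfamiliar with the test, a Cox proportional hazards fit on outlet survival data looks roughly like the sketch below, with invented numbers and the open-source lifelines package standing in for whatever the authors used.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Toy survival table; the numbers are invented, not the study's data.
# `patronage` = 1 flags outlets that received PRC-linked support.
df = pd.DataFrame({
    "years_observed": [5, 12, 20, 7, 25, 3, 18, 9],
    "defunct":        [1,  0,  1, 1,  0, 1,  0, 0],  # 1 = stopped publishing
    "patronage":      [0,  1,  1, 0,  1, 0,  1, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years_observed", event_col="defunct")
print(cph.summary)  # a hazard ratio below 1 for `patronage` would indicate
                    # patronage-linked outlets cease publishing at a lower rate
```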
By 2023, he said, more than 60 percent of organizations in the global dataset were linked to PRC patronage. The linkages were concentrated in developing economies, including parts of South America and Eastern Europe. His conclusion was that authoritarian influence should be understood not only through coercion or co-optation but also through market mechanisms: support affects survival, and survival changes the media environment.
Detector data can teach particle structure before labels exist
Samuel Young, a Stanford physics PhD student based at SLAC, described a different bottleneck: in high-energy physics, researchers increasingly have rich detector images but imperfect simulated labels. The goal is to identify particles in high-resolution 3D detector data — effectively turning an unlabeled image of energy deposits into a particle-separated representation.
Today, Young said, that task is often handled by neural networks trained on simulated data because real particle labels are inaccessible. But in neutrino physics, the simulations are themselves a weak point. Researchers do not know enough about how neutrinos interact with argon nuclei, the material used in the detectors he was discussing. That simulation gap, Young argued, is becoming a bottleneck for next-generation experiments.
His proposed route is self-supervised learning directly from raw detector measurements. Instead of requiring labels, the model learns representations by exploiting physical symmetries in the data. An electron on one side of the detector remains an electron if it appears on another side. A useful encoder should therefore map physically equivalent views to similar places in latent space.
Young described the training procedure as generating multiple physically plausible augmented views of a detector image — by rotating, translating, masking parts out, and similar transformations. Each view passes through the same neural network encoder. The model is trained so the encoded views agree.
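Young did not name a specific loss, but a common way to train encoded views to agree is a contrastive objective in the SimCLR style. The sketch below assumes a hypothetical `encoder` network and an `augment` function implementing the rotations, translations, and maskings he described.

```python
import torch
import torch.nn.functional as F

def agreement_loss(z1, z2, temperature=0.1):
    """SimCLR-style contrastive loss: embeddings of two augmented views of
    the same event attract; embeddings from different events repel."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D) stacked views
    sim = z @ z.T / temperature                    # cosine similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-pairs
    n = z1.shape[0]
    # the positive partner of view i is view i + n (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage with the hypothetical encoder and augmentation pipeline:
#   z1 = encoder(augment(batch))   # one rotated / translated / masked view
#   z2 = encoder(augment(batch))   # an independent augmentation
#   loss = agreement_loss(z1, z2)
```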
The important point is that this pre-training does not require labels. Young showed a map of the model’s internal representation space: each point was a piece of detector data encoded by the model. Although no particle labels were provided during pre-training, coloring the points afterward by particle type revealed that the representation had organized around physically meaningful categories. He noted that overlap between electron- and photon-initiated electromagnetic showers was not a failure of the method; that ambiguity is expected, and even physicists can struggle to distinguish them.
The practical result was a comparison against a current state-of-the-art semantic segmentation model that Young's group uses in several neutrino detectors. The chart compared performance, measured as macro F1, as a function of relative dataset size. Across the training-data range, pre-training followed by fine-tuning outperformed training from scratch. Young highlighted the data-constrained case: with 0.1 percent of the original dataset size (about a thousand images rather than a million), the pre-trained approach performed roughly as well as the state of the art.
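For reference, macro F1 averages per-class F1 scores with equal weight, so rare particle classes count as much as common ones; with scikit-learn it is a one-liner (toy labels below).

```python
from sklearn.metrics import f1_score

# Invented labels for four particle classes, purely to show the call.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 1]
print(f1_score(y_true, y_pred, average="macro"))
```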
The claim was not that labels are unnecessary for every downstream task. It was narrower and more useful: raw detector data can be used to learn representations that already reflect particle structure, reducing dependence on imperfect simulation and scarce labels when training task-specific models.
Gaussian process priors need AI-scale computation to map the Milky Way
Benjamin Dodge, a Stanford physics PhD student working with Susan Clark, focused on a computational scaling problem in astrophysics: how to represent physically useful Gaussian process priors at the scale required for 3D maps of interstellar dust.
The scientific object is the Milky Way's dust structure. Dodge described the dark patches visible in galaxy images as little grains of carbon and silicon. Researchers want the three-dimensional structure of that dust, but they cannot measure it directly. Instead, they rely on indirect line-of-sight integral constraints: each star's measured extinction fixes the integral of dust density along the sightline out to that star, which turns recovering the full 3D density field into a large, ill-posed inverse problem.
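A toy version of that forward model makes the inverse problem concrete; the grid size, voxel width, and star placement below are invented for illustration.

```python
import numpy as np

# Toy forward model: the extinction measured toward a star is the integral
# of dust density along the sightline out to that star's distance.
rng = np.random.default_rng(0)
density = rng.lognormal(size=100)          # unknown dust density per voxel
dr = 10.0                                  # voxel width in parsecs (assumed)
star_voxels = np.array([25, 60, 90])       # star distances in voxel units
extinctions = np.array([density[:s].sum() * dr for s in star_voxels])
# Each star yields one such integral constraint; recovering `density`
# from many of them is the large, ill-posed inverse problem Dodge described.
```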
Dodge contrasted two recent approaches. One used a 3D Gaussian process prior, where any two dust voxels are correlated a priori according to their spatial distance. Another used more exact inference over a much larger volume but lacked that kind of prior, producing what Dodge called unphysical streak artifacts. The motivation for his work was to make Gaussian process priors viable at much larger scale.
The approximation he described was Vecchia’s approximation. In the full problem, a voxel would ideally be correlated with the entire map generated so far. Vecchia’s idea is that, with an appropriate ordering, the model can condition only on nearest neighbors. Dodge stressed that this is not equivalent to truncating the covariance matrix: long-range correlation can still propagate through the conditionals. The better way to view it is that the precision matrix — the inverse covariance — is sparse.
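A minimal sequential sampler makes the idea concrete: each new voxel is drawn from a Gaussian conditioned only on its k nearest already-generated neighbors. The sketch below uses a unit-variance RBF kernel and a naive neighbor search; it illustrates the approximation, not Dodge's implementation.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Unit-variance squared-exponential kernel between point sets."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def vecchia_sample(pts, k=8, seed=0):
    """Draw one GP sample, conditioning each point only on its k nearest
    previously generated neighbors rather than on the full covariance."""
    rng = np.random.default_rng(seed)
    n = len(pts)
    f = np.zeros(n)
    f[0] = rng.normal()                          # first point: marginal N(0, 1)
    for i in range(1, n):
        d = ((pts[:i] - pts[i]) ** 2).sum(-1)
        nb = np.argsort(d)[:k]                   # nearest prior points
        K = rbf(pts[nb], pts[nb]) + 1e-8 * np.eye(len(nb))
        kv = rbf(pts[i:i + 1], pts[nb])[0]
        w = np.linalg.solve(K, kv)               # conditional weights
        mu, var = w @ f[nb], 1.0 - w @ kv        # GP conditional moments
        f[i] = rng.normal(mu, np.sqrt(max(var, 1e-12)))
    return f

pts = np.random.default_rng(1).uniform(size=(500, 3))
sample = vecchia_sample(pts)
```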
The established range for such spatial-statistics methods, Dodge said, has been around 100,000 to a million parameters. His application needs to approach a billion parameters or more.
His group’s contributions were mostly algorithmic and systems-oriented. They changed the ordering to match a k-d tree order, making the necessary neighbor searches fast. They also modified the order to make dependency graphs shallow, enabling large parts of the map to be generated in parallel on a GPU. Dodge explicitly connected this to the “technologies of AI”: the work is not an AI model replacing a scientific model, but an effort to use modern AI hardware and software patterns for scientific inference.
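The ordering idea can be sketched as a recursive median split, in which each split point precedes the points in its two half-spaces, so early points cover the volume coarsely and later points refine locally. The actual package's ordering and parallel scheduling are more involved than this illustration.

```python
import numpy as np

def kd_order(pts, idx=None, axis=0):
    """k-d-tree-style ordering: each median point precedes its subtrees,
    cycling the split axis through the dimensions."""
    if idx is None:
        idx = np.arange(len(pts))
    if len(idx) == 0:
        return []
    order = idx[np.argsort(pts[idx, axis])]   # sort this subset on one axis
    mid = len(order) // 2
    nxt = (axis + 1) % pts.shape[1]           # next split dimension
    return ([int(order[mid])]
            + kd_order(pts, order[:mid], nxt)
            + kd_order(pts, order[mid + 1:], nxt))

pts = np.random.default_rng(0).uniform(size=(1000, 3))
ordering = kd_order(pts)                      # a permutation of 0..999
```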
Implementation was split between JAX and CUDA. JAX gives automatic differentiation, which is important for using the model in a larger inference workflow. But Dodge said JAX did not permit all the performance optimizations the group wanted, so they built a CUDA implementation that was more than an order of magnitude faster and used less than a tenth the memory. The critical step was not merely writing fast CUDA code; it was implementing derivatives for that high-performance software and plugging them back into JAX so the function could be used inside the JAX ecosystem like any other differentiable component.
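The mechanism JAX provides for this is a custom vector-Jacobian product. In the sketch below, `fast_fwd` and `fast_grad` are pure-Python stand-ins for what would be bindings to hand-written CUDA kernels; the function itself is a placeholder.

```python
import jax
import jax.numpy as jnp

def fast_fwd(x):
    return jnp.sum(x ** 2)          # pretend this ran on a custom CUDA kernel

def fast_grad(x, g):
    return 2.0 * x * g              # hand-derived gradient of that kernel

@jax.custom_vjp
def fast_op(x):
    return fast_fwd(x)

def fast_op_fwd(x):
    return fast_fwd(x), x           # save inputs as residuals for backward

def fast_op_bwd(x, g):
    return (fast_grad(x, g),)       # one cotangent per primal input

fast_op.defvjp(fast_op_fwd, fast_op_bwd)

# fast_op now composes with jax.grad, jax.jit, etc. like any JAX function:
print(jax.grad(fast_op)(jnp.arange(3.0)))   # -> [0., 2., 4.]
```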
The summary slide listed the package’s scope: stationary kernels, arbitrary point distributions, large dynamic range, JAX plus CUDA, exact inverse and determinant, and 3D mapping of the Milky Way. Dodge’s immediate target is dust mapping, but he presented the method as having broader applications wherever scalable Gaussian process priors are needed.
Across the talks, AI was infrastructure for evidence rather than a single technique
The three projects used AI and modern computation in different roles. DeButts used language models to scale human judgment across millions of news articles, making it possible to study political selection and survival at market scale. Young used self-supervised representation learning to reduce dependence on simulated labels in particle physics. Dodge used GPU-oriented computation and differentiable software infrastructure to make a classical probabilistic prior feasible at astrophysical scale.
The common thread was not a generic claim that AI accelerates science. Each talk identified a specific constraint: too many articles for manual stance labeling, too few trustworthy labels for detector data, or too many correlated spatial variables for standard Gaussian process inference. The proposed value of AI was in changing the feasible unit of analysis — from small hand-labeled samples to millions of articles, from simulation-dependent particle labels to raw detector representations, and from Gaussian process models at modest scale to priors approaching the size of real Milky Way mapping problems.