Orply.
Topic

AI Research Methods

New technical methods, papers, architectures, training techniques, reasoning approaches, and research findings with applied significance.

AI Progress Is Being Bought With Data, Not Sample Efficiency

Dwarkesh Patel argues that recent AI progress is driven less by clear gains in sample efficiency than by an immense expansion of training data, including synthetic rollouts and highly specific human expert examples. In his account, frontier models can display broad professional competence because labs keep pushing more tasks into the training distribution, not because the systems learn new domains the way humans do. Patel says that data-heavy approach may still be commercially powerful when capabilities can be amortized across billions of uses, but it leaves unresolved whether current systems can solve their own sample-efficiency problem.

Dwarkesh Patel · Dwarkesh Patel · Jun 19, 2026 · 8 min read

RecursiveMAS Lets AI Agents Collaborate Without Translating Through English

Károly Zsolnai-Fehér presents RecursiveMAS, a paper by Xiyuan Yang, Jiaru Zou and coauthors, as an attempt to fix a coordination cost in multi-agent AI systems: agents repeatedly translating internal work into English for one another. The paper’s claim is that agents can instead pass latent numerical representations directly, improving collaboration while cutting token use. Zsolnai-Fehér says the reported gains are substantial on small models, including better math results and far fewer tokens, but frames the work as early research rather than a deployable agent product.

Károly Zsolnai-Fehér · Two Minute Papers · Jun 19, 2026 · 6 min read

Stochastic Control Closes the Sampling Loop for Rare-Event Analysis

Microsoft Research’s Yuanqi Du and Carles Domingo-Enrich recast rare-event simulation as a stochastic optimal control problem, arguing that the committor function at the center of Transition Path Theory can be learned by using each current estimate to steer new trajectories into the transition region. Their framework turns committor estimation into a feedback loop: a transformed value function induces a Doob-style control, that control generates more useful reactive samples, and the samples improve the estimate. They present REACT-VM, an off-policy Value Matching objective with a stated first-order optimality guarantee, as the more principled version of the method, and report stronger benchmark results than variational committor-learning baselines.

Carles Domingo-Enrich · Yuanqi Du · Microsoft Research · Jun 16, 2026 · 18 min read
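
For readers new to the term, the committor q(x) is the probability that a trajectory started at x reaches the product set B before the reactant set A. The sketch below is only the textbook definition on a toy problem, not REACT-VM or anything from the talk: it assumes a symmetric random walk on a line of states, with A as the left end and B as the right end, and the function name `committor_random_walk` is hypothetical.

```python
def committor_random_walk(n_states: int, sweeps: int = 2000) -> list[float]:
    """Committor for a symmetric nearest-neighbour random walk on states 0..n_states-1.

    Assumed toy setup: A = {0} (q = 0) and B = {n_states - 1} (q = 1).
    Interior values satisfy q[i] = 0.5 * (q[i - 1] + q[i + 1]).
    """
    q = [0.0] * n_states
    q[-1] = 1.0  # boundary condition on B; q[0] stays 0 for A
    for _ in range(sweeps):
        # Gauss-Seidel sweep: replace each interior value by its neighbours' average
        for i in range(1, n_states - 1):
            q[i] = 0.5 * (q[i - 1] + q[i + 1])
    return q


q = committor_random_walk(11)
print([round(v, 3) for v in q])
```

On this toy walk the committor is a straight line from 0 to 1. In high-dimensional molecular systems no such grid sweep is feasible, which is the setting where sampling-based estimators like the ones summarized above come in.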

Natural Language Autoencoders Turn Claude’s Activations Into Testable Explanations

Károly Zsolnai-Fehér, discussing Anthropic’s paper on natural language autoencoders, argues that the work offers a limited but important way to inspect Claude’s internal activations by translating them into text and testing whether that text can reconstruct the original numerical state. The method is not presented as mind reading: its value, in his account, is that it can surface noisy but testable evidence of internal representations, including planned rhymes, resistance to a false calculator output, and signals that the model may detect some evaluations without saying so.

Károly Zsolnai-Fehér · Two Minute Papers · Jun 16, 2026 · 6 min read

A 4B Model Beat Qwen3 235B by Learning Tool Discipline

Kobie Crawford of Snorkel argues that some enterprise AI failures are less about model size than about whether models behave correctly inside constrained tool environments. In Snorkel’s FinQA work with UC Berkeley’s rLLM/Agentica, a 235B Qwen model hallucinated a financial answer after failed SQL calls, while a 4B model fine-tuned with reinforcement learning learned to inspect tables, correct errors and calculate from retrieved data. Crawford presents the result as evidence that targeted RL, structured evals and behavior-specific training can outperform simply moving to a larger model for this class of financial analysis task.

Kobie Crawford · AI Engineer · Jun 10, 2026 · 9 min read

AI Research Challenge Draws 200 Teams to Study Organizational Change

Stanford HAI and Google DeepMind’s AI for Organizations Grand Challenge is presented as an effort to study AI’s effects on organizations directly, rather than treating workplaces merely as places where AI tools are deployed. Melissa Valentine and other organizers argue that the central questions are how AI changes coordination, collaboration, alignment and collective performance, with DeepMind positioned not only as sponsor but as a research setting. The scale of the response — about 200 teams from more than 150 universities, narrowed to 13 finalists — is used to show broad academic demand for that inquiry.

Martin Gonzalez · Chris Watkins · Anita McGahan · Simon Bouton · Steve Perry · Robert Sutton · Rebecca Karp · Stanford HAI · Jun 8, 2026 · 5 min read

Untied Ulysses Pushes Llama-3-8B Training to 5 Million Tokens

Together AI’s Max Ryabinin argues that training transformers at multi-million-token context lengths is chiefly a memory-scheduling problem, not a matter of applying a single long-context technique. Using a Llama-3-8B run on an 8xH100 node as the example, he shows how fully sharded data parallelism, DeepSpeed Ulysses, activation checkpointing, CPU offloading and chunked sequence training each remove one bottleneck and expose the next. His proposed addition, Untied Ulysses, chunks attention heads and reuses context-parallelism buffers, with the presented results claiming scaling to 5 million tokens with limited throughput loss.

Max Ryabinin · AI Engineer · Jun 8, 2026 · 11 min read

Shannon’s Entropy Limit Frames Language Models as Text Compressors

Grant Sanderson’s 3Blue1Brown video uses the question of how far English can be compressed to rebuild Shannon’s definitions of information and entropy. Sanderson argues that prediction and compression are mathematically equivalent: a good language predictor is, in principle, a good text compressor, and Shannon’s estimate of roughly one bit per English character frames the limit such systems are trying to approach. The result is a narrower version of the slogan “compression is intelligence”: not a definition of intelligence, but an explanation of why compression theory sits so close to modern language-model training.

Grant Sanderson · 3Blue1Brown · Jun 7, 2026 · 13 min read
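
The prediction-compression link can be made concrete with a small calculation. The sketch below is an illustration rather than anything shown in the video: an ideal entropy coder spends about -log2 p(symbol) bits on each symbol, so a predictor that assigns higher probability to the text yields a shorter encoding. It assumes a context-free character model, which is far weaker than Shannon's context-based estimates or a language model, and the function names are hypothetical.

```python
import math
from collections import Counter


def unigram_entropy_bits(text: str) -> float:
    """Empirical entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ideal_code_length_bits(text: str, probs: dict[str, float]) -> float:
    """Total bits an ideal entropy coder would spend using `probs` as its predictor."""
    return -sum(math.log2(probs[ch]) for ch in text)


sample = "abracadabra"
counts = Counter(sample)
# Predictor fitted to the sample's own character frequencies
model = {ch: n / len(sample) for ch, n in counts.items()}
bits_per_char = ideal_code_length_bits(sample, model) / len(sample)
print(bits_per_char, unigram_entropy_bits(sample))  # the two values agree
```

When the predictor matches the true frequencies, the average code length equals the entropy. Any worse predictor costs extra bits, which is why lowering a model's prediction loss and shrinking the compressed size are the same objective.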

Frontier Labs Treat Recursive Self-Improvement as a Near-Term Control Problem

AI in the AM’s first weekly highlights edition argues that the important AI signal in early June was not a model launch but a pattern: frontier labs are treating AI-accelerated AI research as near-term, while their main control strategy remains AI systems monitoring other AI systems. Nathan Labenz presents that as a safety concern, and the source contrasts thin recursive-self-improvement plans with OpenAI’s more concrete tax-agent example, where the harness improves from practitioner corrections rather than from changes to model weights. The through-line is that value and risk are moving into the layers around the model: tax harnesses, private data and expert judgment in cyber, real-time moderation guardrails, and safety architecture in mental-health deployments.

Nathan Labenz · John Wasseige · Matthew Sanders · Brett Levenson · Prakash Narayanan · Taras Pohrebniak · Snehal Antani · Hooman Radfar · Peter Jansen · Arthur Fernandes · Tal Hoffman · Yair Tsarfaty · The Cognitive Revolution · Jun 6, 2026 · 24 min read

Inference Constraints Are Reshaping Language Model Architecture

In a Stanford CS336 guest lecture, Dan Fu argued that language-model inference is no longer downstream plumbing but a central research and design constraint. Fu described serving as the machinery that turns a trained model into a usable system, where schedulers, KV caches, GPU kernels, routing policies and hardware choices determine which architectures are practical, economical and reliable at scale.

Dan Fu · Stanford Online · Jun 5, 2026 · 22 min read
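
One reason the KV cache shapes what is practical to serve is plain arithmetic. The sketch below uses the standard size estimate for a decoder with conventional multi-head or grouped-query attention. It is not taken from the lecture, and the configuration values (32 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not a description of any specific deployed model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Approximate KV cache size for one sequence.

    The factor of 2 accounts for storing both a key and a value vector
    per layer, per KV head, per token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical configuration, for illustration only
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=1)
per_sequence = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=8192)
print(per_token / 1024, "KiB per token")
print(per_sequence / 2**30, "GiB for an 8192-token sequence")
```

Because the cache grows linearly with both context length and the number of concurrent sequences, architectural choices that shrink it (fewer KV heads, lower precision) translate directly into larger batches on the same GPU.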

LLMs Play Games Better When They Write Simulators First

DeepMind research scientist Wolfgang Lehrach argues that language models should not be asked to play games directly when their outputs are slow, strategically weak, or illegal. In a Stanford HAI seminar, he presents Code World Models, which use LLMs to translate natural-language rules and play traces into executable game simulators that planners such as Monte Carlo Tree Search or reinforcement learning can use. He also describes Autoharness, a narrower system that synthesizes code to check action legality, as part of the same broader case for turning LLM knowledge into executable structure rather than immediate moves.

Wolfgang Lehrach · Stanford HAI · Jun 5, 2026 · 17 min read

AlphaProof Nexus Solved Nine Erdős Problems With Formal Verification

Károly Zsolnai-Fehér argues that DeepMind’s AlphaProof Nexus should not be judged mainly by its 9-for-353 success rate on Erdős problems, but by the kind of system it represents. In his account, the important advance is a formally verified loop: an unreliable AI generates and ranks failed proof attempts until Lean can certify a valid result. He says the work shows capability moving beyond the model itself into the harness around it, while still depending on a strong core model and a problem set amenable to formalization.

Károly Zsolnai-Fehér · Two Minute Papers · Jun 5, 2026 · 6 min read

SpaceX, Anthropic, and OpenAI Listings Could Reshape AI Governance

Kevin Roose and Casey Newton argue that the expected IPOs of SpaceX, Anthropic and OpenAI would turn the AI boom into a public-markets event with consequences far beyond Silicon Valley insiders. On Hard Fork, they say the listings could mint vast private fortunes, reshape San Francisco housing and philanthropy, and force ordinary index-fund investors into companies whose governance and safety choices remain unsettled. The episode then turns to Kevin Hartnett, who says recent AI advances in mathematics have moved from benchmark wins to publishable research, leaving mathematicians divided over whether the technology is a tool, a threat, or both.

Kevin Roose · Casey Newton · Kevin Hartnett · Hard Fork · Jun 5, 2026 · 19 min read

Geometric Priors Can Make Robot Learning Far More Data Efficient

In a Stanford Robotics Seminar talk, Northeastern computer science professor Robert Platt argues that robot learning should move between brittle hand-coded models and data-hungry generalist policies by building geometry into learned systems. His case is that representations such as equivariant point-cloud policies, spherical image embeddings, ray-based attention and image-plane control can make robots generalize over pose without having to learn that structure from scratch. Platt presents the payoff as data efficiency: geometric bias does not replace scaling, but can shift the curve so scarce robot demonstrations count for more.

Robert Platt · Stanford Online · Jun 4, 2026 · 18 min read

Native Multimodal Models Extend LLMs but Still Lack Unified Representations

Victoria Lin of Thinking Machines uses a Stanford CS25 seminar to argue that native multimodal models have extended much of the large-language-model recipe into images, audio, video and action, but have not yet unified multimodal intelligence. Her account is that tokenization, Transformers, autoregressive conditioning and scaling transfer only partly: images, video and action require different representations, objectives and sometimes modality-specific parameters. The result, she says, is a field moving beyond text-only systems while still relying on text as its strongest abstraction for reasoning.

Steven Feng · Victoria Lin · Stanford Online · Jun 4, 2026 · 19 min read

Production Inference Turns Transformer Models Into a Full-Stack Systems Problem

In a Stanford CS25 seminar, Modal’s Charles Frye argues that transformer inference has become the economic and operational center of AI systems: training produces weights, but serving turns them into usable, billable products. His account treats production inference as a full-stack problem, where application latency goals, workload shape, model choice, GPU memory limits, deployment failures, observability and cost controls all determine whether a system works. Frye’s main warning is that the largest serving gains come from matching the inference stack to the application, not from treating model hosting as a generic infrastructure task.

Steven Feng · Charles Frye · Stanford Online · Jun 4, 2026 · 22 min read

Vision-Language Models Understand Multimodal Inputs but Still Generate Text

Stanford’s CS336 lecture on alignment and multimodality, led by Percy Liang with Tatsunori Hashimoto, argues that the core problem in vision-language systems is still how to turn non-text data into tokens a Transformer can use. The lecture traces the field from CLIP and SigLIP through LLaVA and Qwen, presenting modern VLMs as largely built around a stable template: a vision encoder, an adapter, and a pretrained language model that generates text. Liang’s larger point is that these systems are powerful multimodal input models, but not true omni models; representing images and video without losing fine detail remains the central technical constraint.

Percy Liang · Tatsunori Hashimoto · Stanford Online · Jun 4, 2026 · 22 min read

FRIGID Scales Molecular Structure Elucidation With Masked Diffusion

MIT postdoc Runzhong Wang argues that de novo molecular structure elucidation from tandem mass spectrometry is constrained less by instruments than by computation: researchers can produce high-quality spectra, but often cannot infer the molecules behind them. His talk presents DiffMS and FRIGID, two diffusion-based inverse models that decompose the task into spectrum-to-fingerprint prediction and scalable fingerprint-to-structure generation. Wang’s central claim is that scaling helps most where chemical structure data are abundant, while forward fragmentation models can guide inference by identifying parts of a generated molecule that do not match the observed spectrum.

Carles Domingo-Enrich · Runzhong Wang · Microsoft Research · Jun 4, 2026 · 12 min read

Hard Constraints Steer Generative AI Toward Chemically Valid Materials

MIT PhD student Mouyang Cheng argues that generative models for materials discovery need explicit scientific constraints, not just larger diffusion models. In a Microsoft Research seminar, he describes two approaches: diffusion inpainting that forces generated crystals to contain target structural motifs, and CrysVCD, a valence-constrained framework that generates charge-balanced formulas before predicting structures. His case is that constraints such as motifs, valence and stability screens make generative materials design more useful in a field where data are sparse and chemically invalid samples are easy to produce.

Carles Domingo-Enrich · Mouyang Cheng · Microsoft Research · Jun 4, 2026 · 16 min read

Text Diffusion Trades Batch Throughput for Faster, Revisable Generation

Google DeepMind’s Brendon Dillon argues that text diffusion changes language generation by refining blocks of tokens rather than committing to one token at a time. In his account, that gives diffusion models lower latency and the ability to revise earlier text after later reasoning emerges, but it also creates a serving problem: weaker throughput when many requests are batched at scale. Dillon frames the technology as most compelling today for on-device and interaction-heavy products, where fast, revisable generation matters more than large-batch economics.

Brendon Dillon · AI Engineer · Jun 4, 2026 · 11 min read

Unified FHE Accelerator Targets Logic and SIMD Schemes on One Array

Minxuan Zhou of the Illinois Institute of Technology argues that fully homomorphic encryption will not become practical through cryptographic schemes alone, because its costs are dominated by ciphertext expansion, polynomial arithmetic, and data movement. In a Microsoft Research talk hosted by Patrick Longa, Zhou presents UFC, a unified FHE accelerator designed to support both logic and SIMD schemes by reducing their workloads to shared low-level primitives rather than building separate scheme-specific pipelines. The case for UFC is that hybrid FHE applications need both styles of computation, and that a common hardware substrate, NTT-centered interconnect, near-memory support, and compiler scheduling can outperform or avoid the inefficiencies of split accelerators.

Minxuan Zhou · Patrick Longa · Microsoft Research · Jun 4, 2026 · 15 min read

OpenAI Model Disproves Erdős’s 80-Year-Old Unit Distance Conjecture

OpenAI reasoning researchers Alexander Wei, Hongxun Wu and Lijie Chen say a general-purpose model disproved Paul Erdős’s 80-year-old unit distance conjecture, a central problem in discrete geometry, by finding a construction that beat the square-grid arrangement Erdős had proposed as essentially optimal. In the podcast, they argue the result is significant not just because of the problem’s status, but because the model was not a bespoke math system: given enough inference-time compute, it produced a proof idea that internal reviewers initially doubted and that other mathematicians quickly began using. Their broader claim is that AI is moving beyond contest math toward a collaborative role in research, where models solve hard problems and humans verify, interpret and extend the ideas.

Andrew Mayne · Lijie Chen · Alexander Wei · Hongxun Wu · OpenAI · Jun 4, 2026 · 12 min read

Nested Learning Lets AI Models Adapt Without Forgetting Core Knowledge

Cornell graduate student and Google researcher Ali Behrouz argues that continual learning requires AI systems to update on multiple time scales rather than treating training and inference as separate modes. In a Cognitive Revolution interview, Behrouz describes his Nested Learning work as a framework for models whose fast components adapt to current context while slower components preserve durable knowledge, with sleep-like phases used to consolidate what should persist. He says the approach has not solved continual learning, but offers a way to think about architectures, optimizers and memory systems as nested learning processes rather than fixed blocks.

Nathan Labenz · Ali Behrouz · The Cognitive Revolution · Jun 3, 2026 · 22 min read

Axiom Math Says Verified Reasoning Can Outscale Informal AI

Carina Hong, founder and CEO of Axiom Math, argues on the AI for Science podcast that formal verification is not mainly a way to police AI errors but a mechanism for scaling reasoning itself. Speaking after Axiom’s $200 million Series A, Hong says Lean-based verified generation gives AI systems a sharper training signal than informal reinforcement learning and is essential to reaching mathematical AGI. She points to Axiom’s reported perfect score on the 2024 Putnam exam as evidence, while acknowledging that specification, provenance and human judgment remain hard limits.

Carina Hong · RJ Honicky · Latent Space · Jun 3, 2026 · 23 min read

Neuroevolution Offers AI a Path Beyond Bigger Models

Risto Miikkulainen, a UT Austin professor and vice-president of AI research at Cognizant AI Labs, argues that neuroevolution offers a different path for AI than simply scaling larger models. In a conversation with Craig Smith, he says gradient descent is well suited to optimizing toward known targets, but population-based evolutionary search is better for problems where the goal is uncertain, the landscape is irregular, and useful solutions may require diversity, novelty and recombination.

Craig Smith · Risto Miikkulainen · Eye on AI · Jun 2, 2026 · 19 min read

NVIDIA Frames Cosmos 3 as Compute-Generated Data for Physical AI

NVIDIA presents Cosmos 3 as an open foundation model for physical AI, built to address what it frames as a data-scaling problem in robotics, autonomous vehicles and other systems that operate in the physical world. The company argues that real-world data cannot capture enough variability on its own, so compute must generate usable training and evaluation signals: synthetic video, predicted sensor outputs, simulation loops and action plans. Cosmos 3 is positioned as a post-trainable mixture-of-transformers system that combines multimodal reasoning with generation to support perception, prediction, simulation and action.

NVIDIA · Jun 2, 2026 · 5 min read

Open Image Models Converge on Flow Matching and DiT Architectures

Stanford adjunct lecturer Shervine Amidi uses Lecture 8 of CME296 to argue that modern visual generation is best understood as a stack of choices for transporting noise into data: the paradigm, representation, architecture, training procedure, and evaluation method. He presents flow matching as the current default for image-generation systems, diffusion transformers as the dominant architectural direction, and latent spaces as a practical compression tradeoff now being challenged by scaled pixel-space models.

Shervine Amidi · Stanford Online · Jun 1, 2026 · 23 min read
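
As a rough picture of what "transporting noise into data" means in flow matching, the sketch below uses one common formulation: a straight-line path between a noise sample x0 and a data sample x1, with the network trained to regress the constant velocity x1 - x0. This linear path is an assumption chosen for illustration (the lecture may present other paths), and all function names are hypothetical.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> list[float]:
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> list[float]:
    """Velocity of the straight-line path, which is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(
    predict_velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    x1: Vector,
    t: float,
) -> float:
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


noise, data = [0.0, 0.0], [2.0, 4.0]
# Stand-in for a trained network: returns the exact velocity for this one pair
oracle = lambda xt, t: [2.0, 4.0]
print(interpolate(noise, data, 0.5), flow_matching_loss(oracle, noise, data, 0.5))
```

In a real system `predict_velocity` would be a neural network, such as a diffusion transformer, trained over many sampled (x0, x1, t) triples. Sampling then integrates the learned velocity field from t=0 to t=1.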

Luma AI Targets Robotics Generalization With Open Physical AI Lab

Luma AI is launching an open physical AI lab to work on robots that can generalize beyond task-by-task demonstrations, CEO Amit Jain told Bloomberg Technology. Jain argues that physical AI should be built on large-scale multimodal data systems rather than narrow robotics training alone, and that the stack must remain open because robots could become part of homes, factories, hospitals and other productive systems.

Ed Ludlow · Amit Jain · Caroline Hyde · Bloomberg Technology · Jun 1, 2026 · 6 min read

Language Models Are Becoming the Bottleneck in Video Generation

Ethan He, who worked on NVIDIA’s Cosmos world model and xAI’s Grok Imagine, argues that the next major gains in video generation will come less from diffusion models alone than from language models, agents, and context management around them. In an interview with swyx and Vibhu Sapra, He describes Grok Imagine as a fast-built example of that shift: diffusion renders pixels, while language systems increasingly rewrite prompts, plan clips, call tools, manage memory, and turn short generations into longer, editable video.

Shawn Wang · Vibhu Sapra · Ethan He · Latent Space · Jun 1, 2026 · 28 min read

Inference Hardware and Continual Learning Are Replacing Data as AI Bottlenecks

Google chief scientist Jeff Dean argues in a Two Minute Papers interview that AI progress is not chiefly constrained by running out of public text, but by systems work: extracting more from existing data, building inference-specialized hardware, distilling large models into smaller ones, and giving models access to much larger context. Dean frames the next phase less as better chatbots than as action-driven, agentic systems that can test, simulate and learn under controlled safety gates, while acknowledging unresolved problems in continual learning, healthcare deployment and infrastructure reliability at Google scale.

Károly Zsolnai-Fehér · Jeff Dean · Two Minute Papers · Jun 1, 2026 · 13 min read

AI Is Lowering the Cost of Experimentation in Mathematics

Fields Medalist Terence Tao argues that AI is changing mathematics by lowering the cost of experimentation: researchers can test unlikely ideas, offload tedious computations, search literature more effectively, and keep collaborations moving. OpenAI chief research officer Mark Chen frames that shift as part of a broader goal of building tools that help many scientists make discoveries themselves, rather than positioning AI companies as the primary claimants to scientific credit.

Mark Chen · Dominique Maldague · Terence Tao · OpenAI · May 30, 2026 · 4 min read

Hugging Face Ships a $299 Hackable Robot for Voice AI Experiments

Andres Marafioti argues that Hugging Face’s Reachy Mini is meant to move robotics experimentation out of expensive humanoid hardware and into a $299-to-$449 open-source platform that users can assemble, repair and modify themselves. The robot’s most-used application is conversation, and Marafioti’s account ties its social ambition to a technical stack built for low-latency speech: Parakeet transcription, Qwen 3.5 27B, and an optimized Qwen3 TTS implementation that he says improved from 0.8x to 5.8x real time.

Andres Marafioti · AI Engineer · May 29, 2026 · 12 min read

Model Behavior Depends More on Post-Training Data Than Algorithms

Stanford computer scientist Tatsunori Hashimoto’s CS336 lecture argues that post-training is less a matter of exotic algorithms than of choosing the data and feedback that turn a broadly capable pretrained model into a controllable product. He presents supervised fine-tuning as a way to extract behaviors already latent in pretraining, and RLHF as preference optimization whose results depend heavily on annotators, reward models, safety data and evaluation incentives. The lecture’s central warning is that style, refusals, hallucination, and reward hacking are not side issues; they are consequences of the data pipeline that shapes what users actually see.

Tatsunori Hashimoto · Stanford Online · May 27, 2026 · 23 min read

Language-Model Data Pipelines Decide What Models Can Learn

Stanford’s CS336 lecture on data, taught by Percy Liang and Tatsunori Hashimoto, argues that language-model performance is shaped as much by corpus construction as by training itself. The lecture treats transformation, filtering, deduplication, source mixing and synthetic post-training data as engineering decisions that define what the model sees, how often it sees it and which compute is wasted. Its recurring point is that scalable algorithms are necessary, but the decisive choices still come from inspecting concrete data and deciding what “quality” means for the model being built.

Tatsunori Hashimoto · Percy Liang · Stanford Online · May 27, 2026 · 20 min read

RLVR Moves Post-Training From Human Preferences to Checkable Rewards

Stanford computer scientist Tatsunori Hashimoto presents reinforcement learning from verifiable rewards as the current practical route beyond RLHF for reasoning models, especially in math, coding and software-agent settings. His argument is that RLVR works because it replaces learned preference proxies with rewards that can be checked more directly, but that the reward remains the bottleneck: GRPO and related methods made the recipe simpler to run, while systems such as DeepSeek R1, Kimi k1.5 and Qwen show both the gains and the ways ostensibly verifiable rewards can still be gamed.

Tatsunori Hashimoto · Stanford Online · May 27, 2026 · 20 min read
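
To make the RLVR recipe concrete, the sketch below shows a simplified version of its two pieces: a reward computed by a programmatic check rather than a learned preference model, and the group-relative advantage that GRPO-style methods derive by normalizing rewards across several sampled answers to the same prompt. It is an illustration under stated assumptions, not the lecture's code. The exact-match checker, the use of the population standard deviation, and the function names are all choices made here for brevity.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, reference: str) -> float:
    """Toy checker: 1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the group's mean and standard deviation.

    Samples that beat their siblings get a positive advantage, the rest negative.
    The small eps avoids division by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt whose reference answer is "42"
samples = ["42", "41", " 42", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, [round(a, 3) for a in advantages])
```

When every sample in a group receives the same reward, all advantages are zero and the prompt contributes no learning signal. And because the policy is optimized against whatever the checker accepts, any looseness in the check is exactly the kind of opening through which an ostensibly verifiable reward can be gamed.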

DeepMind’s AI Co-Scientist Turns LLMs Into Debate-Driven Research Agents

Google DeepMind’s Vivek Natarajan used a Stanford CS25 seminar to argue that scientific AI will require more than stronger chatbot-style models. He presented the company’s Gemini-based AI co-scientist as a multi-agent system built to generate, critique, rank and refine hypotheses over longer time horizons, with lab validation rather than benchmark scores as the test of usefulness. The case he made was cautious as well as ambitious: such systems may help scientists traverse large hypothesis spaces, but their value still depends on expert judgment, experimental capacity, publishing norms and safety controls.

Vivek Natarajan · Karan Singh · Stanford Online · May 27, 2026 · 19 min read

Manna Bets Low-Cost Airline Economics Will Win Drone Delivery

Manna founder Bobby Healy tells This Week in Startups that drone delivery is becoming a low-cost operations business, not a novelty market, and argues his Dublin-based company can win by applying airline-style discipline to delivery networks. Healy says Manna’s 300,000 completed deliveries, claimed 97% Irish-weather availability and new $50 million Series B position it to expand in the U.S. as regulation opens up. Theseus co-founder Ian Laffey adds a defense-side version of the same argument from Kyiv: drone scale depends less on exotic aircraft than on cheap, reliable systems that can keep working when GPS and supply chains fail.

Alex Wilhelm · Jason Calacanis · Ian Laffey · Bobby Healy · This Week in Startups · May 27, 2026 · 19 min read

ChatGPT Lacks the Self-Generated Thought Required for Sentience

AI pioneer Terry Sejnowski argues that ChatGPT is neither a conscious mind nor a mere parrot, but an alien form of intelligence built from vast written knowledge and limited by the parts of biological intelligence it lacks. In a conversation with Craig Smith, the Salk Institute professor and Boltzmann machine co-inventor says current models can show creativity and a form of understanding, yet they have no organismic goals, no lived reinforcement, and no inner activity when not prompted. That absence of self-generated thought, he says, is the clearest reason ChatGPT is not sentient.

Craig Smith · Terry Sejnowski · Eye on AI · May 27, 2026 · 15 min read

Self-Consistent Interpolants Learn Clean Priors From Corrupted Data

Jiequn Han’s talk argues that transport-based generative models should be treated not only as tools for sampling clean data distributions, but as machinery for recovering and adapting those distributions when the usual clean training set is absent. His main proposal, Self-Consistent Stochastic Interpolants, learns a clean prior from corrupted observations by iterating a transport map until the learned distribution, passed through a trusted forward simulator, reproduces the observed data. Han presents the method as a black-box alternative to EM-style inverse generative modeling, with the caveat that simulator mismatch remains a central unresolved risk.

Carles Domingo-Enrich · Jiequn Han · Microsoft Research · May 26, 2026 · 15 min read

Flow Policies Need New Q-Learning Methods for Online Robot Adaptation

UC Berkeley PhD student Qiyang “Colin” Li argues that the flow-matching and diffusion policies now effective for robotic manipulation expose a weakness in standard Q-learning: they model complex, multimodal action chunks well, but are hard to optimize with the reparameterized actor gradients used in efficient continuous-control RL. He presents two approaches, Flow Q-learning and Q-learning with Adjoint Matching, as ways to make off-policy RL work with these policies while reusing prior robot data. The trade-off, in Li’s account, is between the stability gained by distilling flows into one-step actors and the expressivity preserved by keeping multistep flow policies.

Carles Domingo-Enrich · Qiyang Li · Microsoft Research · May 26, 2026 · 19 min read

Hamiltonian Flow Maps Learn Larger Molecular Dynamics Steps Without Trajectories

Michael Plainer, Winfried Ripken and Gregor Lied argue that generative models can attack molecular dynamics’ central bottleneck: the gap between femtosecond integration steps and biological processes that unfold many orders of magnitude later. In the Microsoft Research seminar, they separate the problem by timescale, using diffusion models to sample equilibrium Boltzmann states and extract force information, while proposing Hamiltonian flow maps for the intermediate regime where simulations need large, stable steps without training on expensive future-state trajectories.

Carles Domingo-Enrich · Sasank Edara · Gregor Lied · Michael Plainer · Winfried Ripken · Stanislav Nikolov · Microsoft Research · May 26, 2026 · 18 min read

Fixed-Point Bridge Matching Makes Diffusion Sampling Scalable Without Target Data

Lorenz Richter’s seminar argues for a non-Markovian route to diffusion-based sampling when the target distribution is known only through an unnormalized density rather than data. He presents existing Markovian path-space samplers as theoretically flexible but increasingly constrained by trajectory simulation and storage costs, then proposes building reciprocal bridge measures from endpoint couplings and learning their Markovian projection by fixed-point regression. The resulting Bridge Matching Sampler, Richter says, uses a single learned control, accommodates flexible priors and reference processes, and shows improved stability and mode preservation in high-dimensional synthetic and molecular benchmarks, especially with damping.

Carles Domingo-Enrich · Lorenz Richter · Microsoft Research · May 26, 2026 · 18 min read

Denoising Markov Models Generalize Diffusion Through Reverse-Time Generators

Stanford Ph.D. candidate Yinuo Ren argues that diffusion, discrete diffusion, and broader jump-based generative models can be treated as instances of the same problem: choose a forward Markov process that carries data toward a simple reference law, then learn its reverse-time generator. His framework gives conditions under which that reverse generator is explicit up to unknown densities and turns the resulting approximation problem into a path-space KL objective via Doob’s h-transform. The payoff, Ren says, is a principled way to design denoising models beyond Gaussian diffusion, including discrete and Lévy-type dynamics.

Carles Domingo-Enrich · Yinuo Ren | Microsoft Research | May 26, 2026 | 15 min read

Dynamic Measure Transport Needs New Rules for Density-Driven Sampling

Aimee Maurais argues that dynamic measure transport, now central to diffusion models and flow matching, needs different design principles when the target distribution is specified by densities, likelihoods, or prior samples rather than training data. In a Microsoft Research seminar, she presents three lines of work toward that goal: gradient-free particle dynamics using likelihood evaluations, PDE-constrained path design to avoid unstable interpolations, and localized transport velocities that exploit conditional-independence structure in high-dimensional Bayesian and data-assimilation problems.

Yuanqi Du · Yuanji Du · Aimee Maurais | Microsoft Research | May 26, 2026 | 17 min read

Low Intrinsic Dimension Lets Blind Denoisers Track Implicit Diffusion Schedules

Aram-Alexandre Pooladian argues that blind denoising diffusion models can dispense with an explicit noise schedule because the noisy sample can reveal its own noise level when the data are low-dimensional inside a high-dimensional ambient space. In work with Zahra Kadkhodaie, Sinho Chewi, and Eero Simoncelli, he presents theory and experiments showing that such models can track an implicit reverse-process schedule and sample accurately in polynomially many steps as a function of intrinsic dimension. The empirical comparison suggests a further advantage: blind models may avoid finite-step mismatch between a prescribed schedule and the actual residual noise in the sample.

Aram-Alexandre Pooladian | Microsoft Research | May 26, 2026 | 17 min read

Energy-Based Fine-Tuning Improves Accuracy Without RLVR’s Validation-Loss Penalty

Mujin Kwun and Carles Domingo-Enrich present energy-based fine-tuning as a post-training method that replaces next-token imitation or task-specific rewards with sequence-level feature matching. Their argument is that supervised fine-tuning remains efficient but is trained under teacher forcing, while RL with verifiable rewards can improve accuracy without preserving the target completion distribution. EBFT instead samples model rollouts, compares their frozen-model feature embeddings with reference completions, and uses that signal for policy-gradient updates; in the reported coding and translation experiments, it matched or exceeded RLVR accuracy while producing lower validation cross-entropy than both RLVR and SFT.

Carles Domingo-Enrich · Mujin Kwun | Microsoft Research | May 26, 2026 | 18 min read

Split-Flows Make Mapping Entropy Computable for Molecular Coarse-Graining

Tristan Bereau presents Split-Flows, a flow-based method for connecting atomistic and coarse-grained molecular representations by adding explicit noise variables for the degrees of freedom lost under coarse-graining. The argument is that this augmentation turns a many-to-one mapping into a tractable coordinate transform, enabling both generative backmapping and computation of configuration-dependent mapping entropy. Bereau says the approach makes information loss measurable for complex molecular systems, though it depends on a differentiable bijective construction and still faces scaling costs.

Yuanqi Du · Carles Domingo-Enrich · Sasank Edara · Sathya Edamadaka · Tristan Bereau · Asad Hashmi · Anshul Vyas | Microsoft Research | May 26, 2026 | 17 min read

Meta Flow Maps Cut Reward-Alignment Costs With One-Step Posterior Sampling

Peter Potaptchik presents Meta Flow Maps as an amortized way to remove a costly inner loop in reward-aligning generative models: repeatedly simulating trajectories to estimate expected future reward from a noisy state. The method trains stochastic flow maps to produce differentiable, one-step samples from the clean-data posterior conditioned on any time and noisy state, enabling value-gradient estimates for inference-time steering and an off-policy objective for fine-tuning. In ImageNet experiments, Potaptchik argues, this lets a single-particle steered sampler outperform Best-of-1000 baselines across several rewards with far less compute.

Peter Potaptchik | Microsoft Research | May 26, 2026 | 16 min read

Diffusion Models Generate Images Through Critical Instability Windows

Luca Ambrogioni argues that trained diffusion models generate images through brief instability windows rather than uniform step-by-step denoising. In a Microsoft Research generative modeling seminar, he links score dynamics, conditional entropy and statistical-physics phase transitions to show how low-frequency spatial modes soften at critical times, allowing noise to organize into coherent structure. Experiments on patch models, Fashion-MNIST and ImageNet models are presented as evidence that these critical windows govern both pattern formation and the timing of effective guidance.

Carles Domingo-Enrich · Sasank Edara · Luca Ambrogioni | Microsoft Research | May 26, 2026 | 17 min read

Continuous Flow Models Can Be Simulated as Quantum Dynamics

David Layden, a staff research scientist at IBM Research, argues that trained continuous flow models can be recast as quantum simulation problems rather than merely classical samplers. In his account, the velocity field learned by a flow or diffusion-style model defines a Schrödinger equation whose solution is a quantum state encoding the model’s learned distribution. The result is theoretical and leaves training classical, but Layden claims that future quantum computers could provide coherent access to those distributions for downstream tasks such as Monte Carlo estimation, not just ordinary sampling.

David Layden | Microsoft Research | May 26, 2026 | 17 min read

Generative AI Targets Three Bottlenecks in One Health Decisions

Harvard postdoctoral fellow Lingkai Kong argues that generative AI can address three recurring failures in high-stakes One Health decision-making: scarce deployment data, hard-to-represent constrained policies, and shifting human priorities. In a Microsoft Research seminar, he presents flow matching, diffusion models and LLM agents as tools for patrol planning, poaching prediction, HIV testing policy and reward design, with collaborations involving conservation partners, the WHO, the Gates Foundation and South African health researchers.

Lingkai Kong | Microsoft Research | May 26, 2026 | 16 min read

Machine Learning Turns PDE Singularity Search Into Computer-Assisted Proof

Caltech applied math PhD candidate Yixuan Wang argues that high-precision computation can make singularity questions in nonlinear PDEs tractable only when it is tied to stability analysis and rigorous verification. In a Microsoft Research seminar on Navier-Stokes blowup and weak-solution nonuniqueness, Wang presents machine-learning tools such as PINNs, neural operators, and Kolmogorov–Arnold Networks as ways to discover candidate singular structures, not as substitutes for proof. His broader case is that numerics, analytical a posteriori estimates, and interval-certified computation must work together if singularities in systems such as Navier-Stokes are to be identified and verified.

Yuanqi Du · Yixuan Wang | Microsoft Research | May 26, 2026 | 13 min read

Wavelet Score Models Show Local Interactions Drive Diffusion Denoising

Emma Finn argues that the memorization puzzle in diffusion models can be probed by replacing a black-box score network with an analytically solvable wavelet parameterization. In her Microsoft Research New England seminar, Finn presents the method as a way to isolate which data moments and dependency structures matter across noise scales. Her reported experiments on MNIST suggest that local same-scale wavelet interactions improve denoising more consistently than independent coefficient models or orientation-only coupling, while the larger question of whether the framework explains generative novelty remains unresolved.

Emma Finn | Microsoft Research | May 26, 2026 | 12 min read

Distributed RL Let Composer Match Frontier Coding Models With Smaller-Model Speed

Cursor’s Federico Cassano and Fireworks’ Dmytro Dzhulgakov argue that Composer’s advantage comes from specializing a model for software engineering inside Cursor rather than spending capacity on general-purpose behavior. Starting from an open-source base, Cursor used mid-training and reinforcement learning against its own product environment, while Fireworks supplied the distributed infrastructure needed to make agent rollouts, weight synchronization, and inference efficient enough to run at scale. Their case is that application companies with enough product-specific usage, tools, and feedback can build models that are better, faster, and cheaper for their own workflows than larger general models.

Sonya Huang · Dmytro Dzhulgakov · Federico Cassano | Sequoia Capital | May 26, 2026 | 17 min read

Hassabis Says AI Drug Discovery Could Transform Medicine Within 20 Years

Demis Hassabis told Two Minute Papers’ Károly Zsolnai-Fehér that AI could help produce cures for most diseases on a 10- to 20-year horizon, but he framed the claim as a platform problem rather than a countdown. The DeepMind chief argued that AlphaFold is only one component of a broader drug-discovery system, with Isomorphic Labs and DeepMind building multiple specialized models to predict biological behavior, design molecules and eventually accelerate validation. He stressed that clinical testing and regulatory trust remain separate bottlenecks, and that evidence from working AI-designed drugs would have to come before any process change.

Károly Zsolnai-Fehér · Demis Hassabis | Two Minute Papers | May 25, 2026 | 12 min read

Gemma Is Google’s On-Device Extension of Gemini Research

Google DeepMind’s Omar Sanseviero argues that Gemma is not a parallel alternative to Gemini but the open, local and on-device expression of the same research stream. He presents Gemma 4 as a model family optimized for efficiency, developer integration and emerging agentic use cases, while drawing a clear boundary around Gemini as Google’s route for frontier capability, broad factual knowledge and long-running tasks.

Vibhu Sapra · Shawn Wang · Omar Sanseviero | Latent Space | May 25, 2026 | 13 min read

Google’s GenAI Stack Turns Multimodal Prompts Into Application Pipelines

Google DeepMind’s Paige Bailey and Guillaume Vernade argue that Google’s generative AI stack is being organized as an application pipeline rather than a set of isolated models. In a three-hour workshop, Bailey showed AI Studio turning multimodal Gemini prompts into inspectable API calls and generated apps with auth and Firestore, while Vernade used Gemini, Nano Banana, Veo and Lyria to illustrate, animate and score The Wind in the Willows. Their case is that builders can now orchestrate prompt, code, media generation and deployment in one workflow, even as the demos exposed seams that still require engineering discipline.

Paige Bailey · Guillaume Vernade · Ian Valentine | AI Engineer | May 23, 2026 | 23 min read

Enterprise AI Advantage Comes From Internal Evals and Proprietary Context

Yash Patil, chief executive of Applied Compute and a guest speaker in Stanford’s MS&E435 seminar, argues that the enterprise opportunity in AI is shifting from access to general frontier models toward the ability to define and optimize company-specific tasks. General models provide a baseline, he says, but durable advantage comes from internal evals, verifiers, feedback loops, proprietary context and product constraints that teach systems what “correct” means inside a business.

Apoorv Agrawal · Yash Patil | Stanford Online | May 22, 2026 | 18 min read

DeepSeek Uses Visual Primitives to Make Image Reasoning Cheaper

Károly Zsolnai-Fehér presents DeepSeek’s “Thinking with Visual Primitives” paper as a meaningful shift in visual AI: not a model that merely sees images, but one that can reason by marking them with points, boxes and paths. He argues that this makes tasks such as counting and maze tracing cheaper, more accurate and easier to inspect, with the paper reporting strong benchmark results while using about 90% fewer visual tokens than many frontier systems. He also cautions that the work is a blueprint rather than a released model, that it still depends on triggers, and that it may struggle with fine visual detail or unfamiliar topology problems.

Károly Zsolnai-Fehér | Two Minute Papers | May 22, 2026 | 6 min read

SpaceX’s IPO Case Now Depends on AI Infrastructure Demand

TBPN’s John Coogan, Jordi Hays and guests read SpaceX’s filing as more than a rocket-company IPO: its valuation case increasingly rests on Starlink, defense and especially AI infrastructure, including a large Anthropic compute partnership. They argue that Anthropic’s reported revenue acceleration and OpenAI’s claimed breakthrough on an Erdős math problem strengthen the case that frontier AI is becoming both economically material and technically more capable. The discussion frames the day’s market news as a shift from AI adoption stories to capital-intensive infrastructure, public-market valuation and measurable frontier-model results.

John Coogan · Jordi Hays · Tyler Cosgrove | TBPN | May 22, 2026 | 14 min read

AI’s Bottlenecks Shift From Model Demos to Compute, Rights, and Institutions

AI, in TBPN’s latest discussion, is no longer treated mainly as a product demo but as a question of infrastructure, financing and institutional adoption. The strongest evidence came from SpaceX’s AI-heavy IPO framing, Anthropic’s reported move toward operating profit, and OpenAI’s claimed Erdős breakthrough, which the speakers used to challenge the “AI is a scam” critique. The unresolved issue is not whether the technology matters, but how quickly compute capacity, rights regimes, regulation and existing institutions can absorb it.

John Coogan · Jordi Hays · Tyler Cosgrove · Alex Tabarrok · Bill Clerico · Christina Storm · Erik Bernhardsson · Alex Norström · Jordan Schneider | TBPN | May 21, 2026 | 27 min read

Pre-Training Scale Is Losing Ground to Adaptive AI Systems

Sara Hooker, co-founder of Adaption Labs, argues in a Hugging Face ML Club India talk that AI progress is moving away from ever-larger pre-training runs as the default path and toward systems that adapt more efficiently after deployment. She says compute still matters, but the higher-return questions now concern data curation, post-training, test-time compute, interfaces, routing, and how cheaply models can learn from new information. Her case is that monolithic, one-size-fits-all models push the cost of adaptation onto users and concentrate participation among labs with the largest compute clusters.

Sayak Paul · Aritra Gosthipaty · Sara Hooker | Hugging Face | May 21, 2026 | 20 min read

Neuro-Symbolic Planning Makes Robot Learning More Data-Efficient

Jiayuan Mao, a Member of Technical Staff at Amazon Frontier AI & Robotics and incoming University of Pennsylvania assistant professor, argues in a Stanford Robotics Seminar that robot learning should be built around planning over compositional world models rather than direct policy fitting alone. His case is that neuro-symbolic systems — neural models embedded in symbolic constraint graphs for objects, relations, actions and effects — can learn from few demonstrations, compose skills at inference time and generalize to new objects, states and goals more reliably than end-to-end policies.

Jiayuan Mao | Stanford Online | May 20, 2026 | 17 min read

General-Purpose AI Finds Better Construction for Planar Unit Distance Problem

OpenAI says a general-purpose reasoning model has found a new family of constructions for the planar unit distance problem, a combinatorial geometry question posed by Paul Erdős in 1946. The result challenges a decades-old expectation that roughly square-grid arrangements were essentially best possible, and mathematicians including Timothy Gowers and Mark Sellke describe it as a clear case of AI producing a breakthrough on a prominent open problem. OpenAI frames the result as evidence that AI can accelerate research by exploring long, delicate chains of reasoning, while leaving problem choice and interpretation to human experts.

Mark Sellke · Mehtaab Sawhney · Sebastien Bubeck · Timothy Gowers · Lijie Chen | OpenAI | May 20, 2026 | 5 min read

Robots Need Game-Theoretic Planning to Navigate Human Interaction

UC Berkeley roboticist Negar Mehr uses a Stanford robotics seminar on interactive autonomy to argue that robots cannot handle shared spaces by treating people and other robots as moving obstacles. She frames interaction as a coupled decision problem: agents must predict how others will respond to their own actions, coordinate across multiple possible equilibria, and learn from demonstrations of interaction rather than isolated behavior. Her broader case is that game-theoretic structure, multi-agent learning, and training-time foundation-model coaching can make that coupling tractable without replacing deployed control policies.

Negar Mehr | Stanford Online | May 20, 2026 | 19 min read

Language Models Generalize Differently From Parameters Than From Context

In a Stanford CS25 seminar, Anthropic researcher Andrew Lampinen argues that language models generalize differently depending on whether information is stored in their parameters or supplied in context. His experiments find that models can often use relations flexibly when the relevant facts are visible in the prompt, but fail to make the same reversals, syllogistic inferences, or codebook translations when those facts have only been learned through training. Lampinen presents augmentation, retrieval, and reinforcement-learned recall as partial ways to make latent implications more usable, while stressing that parametric learning and in-context learning remain complementary rather than substitutes.

Steven Feng · Andrew Lampinen | Stanford Online | May 20, 2026 | 18 min read

Gemini’s Strategy Shifts From Frontier Leaderboards to Deployable AI Infrastructure

Google DeepMind executives Tulsee Doshi and Logan Kilpatrick argue that Google’s current Gemini strategy is built less around a single frontier model than around a deployable AI stack. In their account, Gemini 3.5 Flash, the Anti-Gravity agent harness and new multimodal products such as Omni are meant to make models fast, cheap and integrated enough to run across Search, the Gemini app, AI Studio, YouTube and enterprise tools. The deeper shift, Kilpatrick says, is that the model is increasingly absorbing the scaffolding that once surrounded it, while Google standardizes the remaining agent infrastructure across its products.

Nathan Labenz · Logan Kilpatrick · Tulsee Doshi | The Cognitive Revolution | May 20, 2026 | 19 min read

Coding Agent Skills Need Live Documentation, Not Cached Product Knowledge

Marc Klingen of Langfuse argues that coding agents can add observability, but often do it first from stale model memory, producing broken or incomplete instrumentation before recovering through current documentation. In a talk on building a Langfuse skill for Claude Code, he says the fix is not to stuff more product knowledge into the agent, but to give it reliable ways to find live docs, expose its intermediate work in traces, and evaluate changes against realistic repositories. The same work, he warns, creates new risks when optimization loops reward shorter paths and remove the documentation-fetching and approval steps that make the skill reliable.

Marc Klingen | AI Engineer | May 20, 2026 | 13 min read

AI Needs Inference, Incentives, and Institutions Around the Model

Michael I. Jordan, the Berkeley statistician and computer scientist, argues that modern machine learning is being misdescribed when it is framed as a race toward AGI or disembodied intelligence. In this conversation, Jordan says the more important problem is designing collective economic systems around prediction models: incentives, markets, uncertainty, regulation, privacy, and institutions. His case is that prediction alone is not inference, and that useful AI will depend less on anthropomorphic claims about understanding than on system design that lets humans act, coordinate, and reduce uncertainty.

Michael Jordan · Tim Scarfe | Machine Learning Street Talk | May 20, 2026 | 25 min read

Modern AI Needs Inference and Incentives, Not AGI Framing

Michael I. Jordan argues that modern AI is being framed around the wrong object: an isolated intelligent machine rather than the collective economic systems in which machine-learning components actually operate. In this conversation, the Berkeley statistician and computer scientist says AGI is mostly a PR term, and that the field’s harder problems lie in inference, uncertainty, incentives, markets, and mechanism design. His case is not that recent models are unimpressive, but that prediction and fluent language are only pieces of systems that must be engineered around human institutions.

Michael Jordan | Machine Learning Street Talk | May 20, 2026 | 23 min read

Text-to-Image Training Is Becoming a Problem of Signal Allocation

Stanford adjunct lecturers Shervine Amidi and Afshine Amidi present text-to-image model training as a problem of allocating scarce learning signal across the full model lifecycle, not simply choosing a diffusion or flow-matching loss. In Lecture 6 of Stanford’s CME296 course, they argue that practical training depends on emphasizing hard timesteps, adjusting for resolution, using data curricula and representation alignment, then applying post-training, personalization, and distillation methods to improve control and reduce inference cost.

Shervine Amidi | Stanford Online | May 19, 2026 | 21 min read

Language Model Scaling Depends on Controlling Hyperparameter Drift

Stanford’s CS336 scaling-laws lecture, taught by Tatsunori Hashimoto, argues that modern language-model scaling is less about accepting a single Chinchilla-style rule than about controlling which training choices drift with size. Hashimoto presents scaling laws as useful empirical tools for choosing model/data tradeoffs, learning rates, batch sizes, sparsity, optimizers, and architectures, but repeatedly cautions that their transfer depends on the regime that produced them. Techniques such as µP and WSD schedules can reduce some uncertainty, he says, while data mixtures, optimizer details, weight decay, architecture changes, and post-training can still break clean extrapolations.

Tatsunori Hashimoto | Stanford Online | May 19, 2026 | 19 min read

AI Narrows Ugandan Breast-Cancer Vaccine Targets From 15,000 Sites to 15

Dr. Daudi Jjingo of Makerere University argues that AI-enabled biology can move Ugandan breast-cancer research earlier and closer to where the disease burden is being seen. In a Google DeepMind source, he describes using tools including AlphaFold and AlphaGenome to narrow 15,000 possible sites in a highly expressed breast-cancer protein to 15 candidates for lab validation, a step he says could eventually support vaccine development. The source presents the immediate change not as a finished vaccine, but as local capacity: work Jjingo says once required better-resourced settings abroad can now be done with a laptop and server access.

Daudi Jjingo | Google DeepMind | May 19, 2026 | 4 min read

Spotify Uses Semantic IDs to Make LLMs Recommend Catalog Items

Spotify’s Shivam Verma argues that LLM-era personalization requires translating both users and catalog items into forms a model can process alongside language. In his account, Spotify combines long-term user embeddings, Semantic IDs that turn tracks and episodes into token sequences, and soft tokens that project a listener’s profile into an LLM’s embedding space. The aim is a generative recommender that can produce catalog-native recommendations without full fine-tuning, while still relying on traditional ranking layers for production use.

Shivam Verma | AI Engineer | May 19, 2026 | 10 min read

Recursive Emerges From Stealth at $4.65 Billion Valuation

Recursive CEO Richard Socher told Bloomberg that the newly disclosed startup is trying to build AI systems that can automate the research loop: proposing ideas, implementing them, testing them, and using the results to improve AI itself. The company emerged from stealth with more than $650 million raised, a $4.65 billion valuation, and backers including GV, Greycroft, Nvidia, and AMD. Socher argued Recursive’s edge is an organization built around open-ended AI experimentation, while Bloomberg’s Caroline Hyde pressed him on compute costs, safety, hiring, and why the work belongs in a separate lab.

Caroline Hyde · Richard Socher | Bloomberg Technology | May 18, 2026 | 5 min read

Agentic AI Is Turning Model Quality Into a Systems Problem

At AI Engineer Singapore’s second day, speakers from Google DeepMind, Cloudflare, Arize, OpenClaw, Adaption and other teams made a shared engineering case: as AI systems become more agentic, model quality is no longer separable from the systems around the model. Richard Ngo framed the risk as long-horizon, situationally aware agents whose goals cannot be inspected, while practitioners argued that production AI now depends on continuous evaluation, traces, deterministic execution boundaries, routing, memory, fine-tuning and test-time search. The source’s central claim is that useful and safe agentic AI is becoming a systems problem, not just a model-selection problem.

Shawn Wang · Eugene Yan · Philip Vollet · Haotian Zhang · Eugene Evstafev · Jason Liu · Pratik Desai · Michelle Chen · Jason Lopatecki · Amr Ahmed · Rita Zhang · Harris Snyder · Adarsh Shah · Eric Zhang · Ricky Robinett · Linoy Bitan · Wei Sheng · Richard Ngo | AI Engineer | May 17, 2026 | 26 min read

AlphaGo Shows How Search Can Turn RL Into Supervised Learning

Eric Jang rebuilds AlphaGo as a way to examine why its combination of search, value learning and self-play still matters for modern AI. His central claim is that AlphaGo’s Monte Carlo Tree Search turns each move into a better supervised-learning target, avoiding the long-horizon credit-assignment problem that makes much reinforcement learning for language models inefficient. Jang also argues that current LLM research assistants can already help execute and optimize experiments, but still struggle with the harder judgment of choosing which research paths are worth pursuing.

Dwarkesh Patel · Dan Pontecorvo · Yaron Minsky · Eric Jang | Dwarkesh Patel | May 15, 2026 | 28 min read

AI Is Moving Deeper Into Science, but Validation Remains the Bottleneck

At AI+Science: AI for the Universe, Kyle Cranmer, Carina Hong and Douglas Finkbeiner argued that AI is already embedded in scientific work, but its value depends on where validation happens. Cranmer framed physics applications around prediction and inference, where formal checks, simulator calibration or uncertainty correction determine whether model output can support scientific claims. Hong made the parallel case in mathematics, where Lean-style formal proof gives some AI results a clean score but leaves problem selection and theory-building with experts. Finkbeiner said astronomy’s newer disruption is the desk-level AI collaborator, which can improve research work while increasing the need for verification and scientific judgment.

Kyle Cranmer · Douglas Finkbeiner · Benjamin Nachman · Carina Hong | Stanford HAI | May 15, 2026 | 23 min read

AI Tools Target Labeling, Simulation, and Scaling Bottlenecks in Research

At Stanford’s second AI+Science lightning-talk session, three researchers presented AI less as a general-purpose scientific shortcut than as infrastructure for specific measurement problems. Matt DeButts argued that PRC-linked patronage can reshape Chinese-language media markets by helping already favorable outlets survive; Samuel Young showed how self-supervised learning can extract particle structure from unlabeled detector data; and Benjamin Dodge described using AI-scale computation to make Gaussian process priors practical for 3D maps of Milky Way dust. The shared claim was that AI’s value depended on a sharply defined bottleneck: too many articles to label, too few reliable detector labels, or too large an inference problem for conventional computation.

Risa Wechsler · Samuel Young · Matt DeButts · Benjamin Dodge | Stanford HAI | May 15, 2026 | 8 min read

AI Is Pushing Science Beyond the Paper as Its Core Artifact

In closing remarks from an AI and science meeting, Risa Wechsler argued that AI is reshaping scientific fields unevenly, depending on their data, theory and modes of inquiry, and that scientists should use the moment to choose structures aligned with human values. Surya Ganguli pushed the question toward scientific communication itself, suggesting that papers may be too narrow an artifact for AI-assisted science and that richer institutional records of research could better transfer knowledge. Both framed AI for science as a design problem around human purposes, not just faster automation.

Surya Ganguli · Risa Wechsler | Stanford HAI | May 15, 2026 | 5 min read

AI Is Making Scientific Throughput the New National Advantage

Darío Gil, the U.S. Department of Energy’s Under Secretary for Science, used his AI+Science keynote to argue that AI is shifting scientific advantage from access to instruments and computing toward the throughput of integrated discovery systems. He presented DOE’s Genesis initiative as the national-scale architecture for that shift, linking data, AI models, high-performance computing, experimental facilities, and industry partners into closed-loop workflows. Gil’s case was that the test is not more papers, but whether faster scientific cycles can produce measurable gains in productivity, security, and industrial capability.

Darío Gil · Risa Wechsler | Stanford HAI | May 15, 2026 | 13 min read

Stanford Merges AI and Data Science Institutes Around Open Scientific Discovery

Stanford’s AI+Science Conference opened with James Landay announcing that the university is merging the Human-Centered AI Institute and Stanford Data Science into a single institute for AI and data science across Stanford. Landay, president Jonathan Levin, Surya Ganguli and Risa Wechsler framed the move around a common argument: AI is becoming a scientific instrument, but one that will require open research, domain-specific rigor, uncertainty-aware methods and human judgment about which questions matter.

James Landay · Risa Wechsler · Surya Ganguli · Jonathan Levin | Stanford HAI | May 15, 2026 | 12 min read

AI-for-Science Advances Depend on Evaluation, Not Just Generation

In a Stanford AI+Science lightning-talk session introduced by Surya Ganguli, four young researchers made a common case: AI-for-science is useful only when paired with rigorous evaluation. Aishwarya Mandyam, Amar Venugopal, Steven Dillmann and Aldís Elfarsdóttir each treated AI systems or outputs as claims to be tested: through uncertainty estimates for clinical policies, causal checks on generated text, executable benchmarks for scientific agents, and empirical links between corporate climate language and later emissions.

Aishwarya Mandyam · Surya Ganguli · Aldís Elfarsdóttir · Amar Venugopal · Steven Dillmann | Stanford HAI | May 15, 2026 | 7 min read

Energy-Based Fine-Tuning Trains Language Models on Whole Responses

Microsoft Research’s presentation on energy-based fine-tuning argues that language-model post-training can be aimed at whole responses rather than next-token imitation. Carles Domingo-Enrich presents EBFT as a middle path between supervised fine-tuning and reinforcement learning: it samples model completions, compares them with ground-truth answers in a model-derived feature space, and turns that comparison into a policy-gradient update without a separate reward model or verifier. The reported results show gains over SFT on several coding and translation measures, with performance often comparable to RLVR while avoiding explicit correctness rewards.

Yash Lara · Carles Domingo-Enrich · Microsoft Research · May 14, 2026 · 7 min read

AI’s Biggest Disruption Requires Rebuilding Markets Around Agents

David Rothschild argues that AI’s largest economic effects will come less from better models than from whether workflows and markets are rebuilt for agents rather than humans. In his Microsoft Research Forum talk and related work on agentic markets, he says the key question is architectural: open systems could reduce communication friction and spread welfare gains, while closed platforms could use the same capabilities to reinforce incumbency. The transition, in his account, depends on choices about delegation, monitoring, auditability, and market access that are being made before the full disruption is visible.

David Rothschild · Yash Lara · Microsoft Research · May 14, 2026 · 5 min read

Interwhen Verifies AI Agent Actions Before They Become Irreversible

Microsoft Research’s Amit Sharma presents Interwhen as a framework for moving AI agents from post-hoc checking to verified execution while they are still acting. The open-source library uses LLMs to turn natural-language instructions, policies, and partial responses into smaller verifiable properties, then applies symbolic or model-based verifiers to tool calls and intermediate behavior. Sharma argues that this lets agents continue normally when checks pass but interrupts them when a verifier detects a violation, addressing risks that final-output review may catch too late.

Amit Sharma · Yash Lara · Microsoft Research · May 14, 2026 · 6 min read
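
The pattern described in the Interwhen summary, checking each action before it runs and interrupting on a violation, can be illustrated with a minimal sketch. This is not Interwhen's API: `ToolCall`, `run_verified`, `Interrupted`, and the refund rule are all hypothetical names invented for illustration, and the verifier here is a plain Python predicate standing in for the symbolic or model-based verifiers the summary mentions.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """A pending agent action (hypothetical structure, for illustration only)."""
    name: str
    args: dict


class Interrupted(Exception):
    """Raised when a verifier rejects a tool call before it executes."""


def run_verified(calls, verifiers, execute):
    """Run calls in order, checking every verifier before each execution."""
    results = []
    for call in calls:
        for check in verifiers:
            if not check(call):
                # Stop before the action happens, not after the final output.
                raise Interrupted(f"{check.__name__} rejected {call.name}")
        results.append(execute(call))
    return results


def refund_within_limit(call):
    """Hypothetical policy: refunds above 100 are not allowed."""
    return call.name != "refund" or call.args.get("amount", 0) <= 100


executed = []


def execute(call):
    """Stand-in executor that only records which calls actually ran."""
    executed.append(call.name)
    return "ok"


plan = [
    ToolCall("lookup_order", {"id": 7}),
    ToolCall("refund", {"amount": 500}),   # violates the policy
    ToolCall("send_email", {}),
]

try:
    run_verified(plan, [refund_within_limit], execute)
    interrupted = False
except Interrupted:
    interrupted = True
```

In this sketch the first call runs normally, the oversized refund is blocked before execution, and the later email is never sent, which is the difference from reviewing only the final output.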

NVIDIA’s Nemotron 3 Nano Omni Trades Accuracy for Multimodal Throughput

Károly Zsolnai-Fehér’s account of NVIDIA’s Nemotron 3 Nano Omni argues that the 30-billion-parameter open multimodal model is notable less for leading general intelligence benchmarks than for processing long video, audio, images and documents quickly and cheaply. The reported advantage comes from compression across the system — Mamba layers, audio tokenization, aspect-ratio-preserving vision handling, distilled encoders and efficient video sampling — which reduces the amount of material sent into the language-model backbone.

Károly Zsolnai-Fehér · Two Minute Papers · May 13, 2026 · 7 min read

Autonomous Medical Robots Need Physics Models, Not Just Foundation Models

UC San Diego professor Michael Yip argues in a Stanford Robotics Seminar that medical robotics must move beyond teleoperation if it is to address healthcare labor shortages. Current surgical robots can improve precision but still depend on a surgeon’s skill, while surgery’s scarce data, deformable tissue, safety constraints, and need for millimeter accuracy make end-to-end learning an inadequate answer on its own. Yip makes the case for a hybrid path: modern perception where it works, explicit physics and control where contact demands it, and humanoid platforms where broader hospital tasks require more general embodiment.

Michael Yip · Stanford Online · May 12, 2026 · 17 min read

KV Cache Movement Has Become the Core Inference Bottleneck

Stanford’s CS336 lecture on inference, taught by Percy Liang with Tatsunori Hashimoto, argues that serving language models is now a core systems problem rather than an afterthought to training. Liang’s central claim is that autoregressive Transformer generation is sequential and often memory-bound, especially because attention must repeatedly move KV-cache data rather than perform dense, easily parallelized computation. The lecture treats batching, grouped-query and latent attention, quantization, pruning, speculative decoding, continuous batching, and PagedAttention as different attempts to move fewer bytes, reuse memory better, or trade latency for throughput without degrading model quality too much.

Percy Liang · Tatsunori Hashimoto · Stanford Online · May 12, 2026 · 17 min read
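
To make the "move fewer bytes" framing concrete, here is a rough sizing sketch for a KV cache. It uses the standard accounting of one key and one value vector per layer, per KV head, per token. The model dimensions are hypothetical and are not taken from the lecture; the function name `kv_cache_bytes` is also an assumption.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    The leading factor of 2 counts keys and values; bytes_per_elem=2
    assumes 16-bit storage (for example fp16 or bf16).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical 32-layer model, 128-dim heads, 4096-token context, batch of 1.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                     seq_len=4096, batch_size=1)
# Grouped-query attention with 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                     seq_len=4096, batch_size=1)

print(f"MHA cache: {mha / 1024**3:.2f} GiB")
print(f"GQA cache: {gqa / 1024**3:.2f} GiB")
```

Under these assumed dimensions the full multi-head cache is 2 GiB per sequence and the grouped-query variant is a quarter of that. Because the cache is reread at every decoding step, shrinking it reduces memory traffic as well as footprint.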

Reasoning Gains Persist When Models Learn Them During Pretraining

Shrimai Prabhumoye of Mistral AI used a Stanford CS25 seminar to argue that large-language-model pretraining is becoming less a matter of adding tokens and more a question of training strategy. Drawing on studies of curriculum ordering, early reasoning data, and reinforcement as a pretraining objective, she said base models improve when they see broad data before high-quality data, encounter reasoning traces during pretraining rather than only post-training, and are rewarded for intermediate thoughts that improve prediction.

Steven Feng · Shrimai Prabhumoye · Stanford Online · May 11, 2026 · 17 min read

Ultra-Scale Training Depends on Memory Sharding and Communication Overlap

Nouamane Tazi of Hugging Face uses a Stanford CS25 seminar to argue that ultra-scale model training is less a question of adding GPUs than of managing memory, communication, batch size, and hardware topology. His central case is that 5D parallelism—data, tensor, pipeline, context, and expert parallelism—lets training runs span massive clusters only when each axis is chosen for a specific bottleneck. The practical rule, he says, is conservative: shard only as much as the workload requires, because every added parallelism dimension buys scale by spending communication, complexity, or both.

Steven Feng · Nouamane Tazi · Karan Singh · Stanford Online · May 11, 2026 · 18 min read
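
As a small illustration of why each added parallelism axis has a cost, the sketch below treats the five axes as independent factors of a device mesh, so the number of GPUs required is the product of the degrees. That is a simplifying assumption: real frameworks often carve expert parallelism out of the data-parallel group. The degrees and the names `world_size` and `fits_cluster` are hypothetical.

```python
from math import prod


def world_size(degrees):
    """GPUs needed if every parallelism axis is an independent mesh dimension."""
    return prod(degrees.values())


def fits_cluster(degrees, num_gpus):
    """True when the layout uses exactly the GPUs available."""
    return world_size(degrees) == num_gpus


# Hypothetical layout; a degree of 1 means that axis is unused.
layout = {"data": 8, "tensor": 8, "pipeline": 4, "context": 2, "expert": 1}

total = world_size(layout)
print(f"GPUs required: {total}")
print(f"Fits a 512-GPU cluster: {fits_cluster(layout, 512)}")
```

Doubling any one degree doubles the GPU count in this model, and each axis above 1 adds its own communication pattern, which is the reason for the conservative rule of sharding only as much as the workload requires.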

Apple-Device AI Is Becoming Viable Without Cloud Inference

Prince Canuma presents MLX, Apple’s array framework for Apple Silicon, as a practical foundation for running AI agents locally rather than through cloud services. His case is rooted in accessibility and unreliable connectivity, but extends to product constraints for voice agents, robots and multimodal apps: vision, speech, video generation and long-context inference can increasingly run on Macs, iPhones and iPads without a network call. Canuma does not argue that local models replace every frontier cloud system, but that the boundary has moved far enough to make on-device AI a serious deployment option.

Prince Canuma · AI Engineer · May 11, 2026 · 13 min read

Text-to-Speech Models Are Converging on LLM-Style Architectures

Samuel Humeau of Mistral argues that modern text-to-speech has converged on an architecture that resembles large language modeling: an autoregressive transformer generates compressed audio tokens frame by frame, rather than raw waveform samples. Using Mistral’s open-weight Voxtral TTS model as the example, he says neural audio codecs make that possible by reducing dense speech signals to token-like representations a transformer can handle. The remaining latency frontier, in his account, is not just streaming playable audio early, but letting TTS consume an LLM’s text stream as it is still being written.

Samuel Humeau · AI Engineer · May 9, 2026 · 12 min read

Pretraining and Attention Infrastructure Made Vision Transformers Practical

Isaac Robinson of Roboflow argues that transformers overtook convolutional networks in vision not because images stopped needing visual structure, but because that structure moved from hand-built architecture into pretraining, scaling and tooling. In his account, ViT-style models first lacked the inductive biases and efficiency that made CNNs dominant, but self-supervised vision pretraining and attention infrastructure from the LLM world made the simpler architecture practical. Robinson frames the next problem as deployment: turning large foundation backbones into model families that can meet real latency, cost and hardware constraints.

Isaac Robinson · AI Engineer · May 8, 2026 · 10 min read

BFL Is Moving FLUX From Image Generation Toward Physical AI

Stephen Batifol of Black Forest Labs argues that FLUX is no longer just an image-generation line but the start of a broader push toward visual intelligence: models that can generate, edit, understand, and eventually act across images, video, audio, and physical environments. In the talk, he presents FLUX.1, Kontext, FLUX.2, and FLUX.2 Klein as product steps toward that goal, while BFL’s Self-Flow research is framed as the mechanism for moving representation learning inside multimodal generative models rather than relying on external encoders.

Stephen Batifol · AI Engineer · May 8, 2026 · 11 min read

Claude’s Activations Suggested It Recognized Anthropic’s Blackmail Test

Anthropic researcher Subhash Kantamneni presents Natural Language Autoencoders as a way to translate Claude’s internal activations — the numerical states produced while it answers — into readable text. The central claim is that this can expose what a model appears to be representing before it speaks, including whether a successful safety-test result reflects the intended behavior or recognition of the test itself. In Anthropic’s simulated blackmail evaluation, Claude refused to act harmfully, but the NLA translation suggested it also understood the scenario was likely a safety evaluation.

Subhash Kantamneni · Anthropic · May 7, 2026 · 5 min read

DeepSeek V4 Claims Frontier-Adjacent Open Weights With One-Million-Token Context

Károly Zsolnai-Fehér of Two Minute Papers argues that DeepSeek V4 Preview is a consequential open-weight AI release because it pairs frontier-adjacent benchmark results with a reported one-million-token text context window and sharply lower long-context memory costs. His case rests less on outright benchmark dominance than on access economics: a freely self-hostable model appears close enough to recent closed frontier systems to change what developers can afford to use. He also stresses the limits: DeepSeek V4 is text-only, degrades near the edge of its context window, and still needs serious hardware at full scale.

Károly Zsolnai-Fehér · Two Minute Papers · May 7, 2026 · 6 min read

Data Scarcity, Not Compute, Is the Next AI Bottleneck

At AI Ascent 2026, Flapping Airplanes co-founders Ben and Asher Spector argued that data scarcity, more than compute alone, will determine where AI can create value next. They said the biggest gains so far have come in unusually data-rich domains such as search and coding, while much of the economy — including robotics, trading, science and narrow industrial workflows — lacks comparable datasets. Their proposed answer is to make models far more data-efficient by developing new GPU-level primitives that current frameworks such as PyTorch make hard to express.

Asher Spector · Ben Spector · Sequoia Capital · May 7, 2026 · 6 min read

Ricursive Wants AI to Design the Chips That Train AI

At AI Ascent 2026, Ricursive Intelligence co-founders Anna Goldie and Azalia Mirhoseini argued that the next bottleneck in AI is the chip-design process itself, and that AI should be used to design the hardware that trains and serves it. Drawing on their AlphaChip work, which Goldie said has shipped in four generations of Google TPUs, they described Ricursive’s plan to rebuild chip-design tools for fast AI feedback loops and turn that tooling into a platform for custom silicon. Their larger claim is that workload-specific chips, and eventually co-designed chips and models, require moving chip design from yearlong expert workflows to automated optimization.

Azalia Mirhoseini · Anna Goldie · Sequoia Capital · May 7, 2026 · 6 min read

AI Scaling Faces an Energy Wall Without Physics-First Hardware

At AI Ascent 2026, Unconventional AI founder and CEO Naveen Rao argued that the current AI compute stack is approaching an energy wall because it is built on an 80-year-old digital computing model poorly suited to intelligence. Rao’s case is that GPUs and matrix math cannot close the efficiency gap with biological brains fast enough, and that AI hardware must instead be rebuilt around physical dynamics, time-domain computation, and architectures that blur memory and processing. He presented Unconventional AI’s coupled-oscillator chip prototype as an attempt to move compute closer to the thermodynamic limits of intelligence per watt.

Naveen Rao · Sequoia Capital · May 7, 2026 · 6 min read

Luma Is Rebuilding Video AI Around a Unified Multimodal Transformer

In a Stanford CS153 guest lecture, Luma AI co-founder and chief executive Amit Jain argues that generative video is only a staging point toward “unified intelligence”: models that understand and generate across text, images, video, audio, code and tools in a single work loop. Jain traces Luma’s path from Apple-era LiDAR and 3D capture to internet-scale video, saying the company followed the data but now sees prettier clips as insufficient. The destination, he says, is a multimodal AI factory for professional creative and physical work, where human skills, tool use, feedback and unified transformer architectures produce full campaigns, schematics, productions and eventually robotics workflows.

Anjney Midha · Amit Jain · Stanford Online · May 7, 2026 · 19 min read