Topic

Multimodal AI

Models and products that work across text, image, audio, video, speech, documents, screens, and other modalities.

Flows Agent Turns Creative Briefs Into Editable AI Production Pipelines

ElevenLabs presents Flows Agent as a conversational assistant for building and revising node-based creative workflows inside ElevenCreative Flows. The company’s case is that a user can describe an ad or other asset in natural language, have the agent assemble the models, prompts, nodes, and connections, then keep the resulting pipeline visible for edits, approvals, and reuse. The demo emphasizes cost controls for credit-heavy generation, node-level revisions through chat, and templates that turn a completed flow into a repeatable production system.

ElevenLabs · Jun 18, 2026 · 6 min read

Camera AirPods Would Give Siri Visual Context in Apple’s 2027 Push

Bloomberg’s Mark Gurman says Apple is preparing a dense 2026 and 2027 hardware cycle that includes its first foldable iPhone, a second-generation foldable, a 20th-anniversary iPhone and camera-equipped AirPods. Gurman argues the AirPods cameras are meant not for photography or facial recognition but to give Siri visual context about a user’s surroundings, while Snap’s new Specs show the same broader push toward ambient, augmented computing despite high prices and limited near-term adoption.

Ed Ludlow · Mark Gurman · Bloomberg Technology · Jun 17, 2026 · 4 min read

GRU Space Plans Lunar-Regolith Bricks as the First Step Toward a Moon Hotel

On This Week in Startups, GRU Space founder Skyler Chan argues that a Moon hotel is the first commercial wedge for a larger off-Earth manufacturing business: using lunar regolith to make construction materials rather than shipping them from Earth. Chan lays out a plan to prove the technology by making a brick on the Moon, then scale toward robotic habitats, NASA construction work, space tourism and eventual claims on lunar resources. The same episode turns to Anthropic’s forced shutdown of Fable 5 and Mythos 5, which Jason Calacanis and Lon Harris frame as a warning that frontier capabilities can be cut off before law, politics and operating norms have settled.

Jason Calacanis · Lon Harris · Skyler Chan · This Week in Startups · Jun 16, 2026 · 21 min read

Dubbing v2 Preserves Speaker Performance Across 90-Plus Languages

ElevenLabs presents Dubbing v2 as an AI dubbing model designed to transfer a speaker’s performance across more than 90 languages, not just translate the words. The company argues that by conditioning on the original audio rather than a transcript, the system can preserve voice, tone, emphasis, emotion and timing while adapting phrasing for natural delivery in the target language. The walkthrough positions the tool as an automated localization workflow for creators, marketers and studios, with speaker similarity as the main setting users adjust between voice resemblance and native-language naturalness.

ElevenLabs · Jun 12, 2026 · 6 min read

Models Will Absorb Today’s Agent Harnesses Within a Year

Logan Kilpatrick, who leads Google AI Studio and the Gemini API, argues that the current rush to build agent harnesses may have a short shelf life. In an interview with Sequoia Capital’s Sonya Huang, he says models are absorbing the scaffolding around agents and could make much of today’s custom harness layer less distinctive within about 12 months. Google’s own strategy runs on both sides of that claim: Antigravity has become a shared agent layer across products, while Kilpatrick says the durable advantage for builders will move to focus, domain knowledge, risk tolerance and useful outcomes for users.

Logan Kilpatrick · Sonya Huang · Sequoia Capital · Jun 11, 2026 · 19 min read

MiniCPM-V 2.6 Runs at 18 Tokens per Second on iPhone

OpenBMB used its Build Small hackathon session to argue that small models are valuable when they can be deployed where applications and data already live: on phones, laptops, mobile apps and edge devices. Its main example was MiniCPM-V 2.6, a vision-language model shown running on an iPhone 15 Pro at 18 tokens per second with llama.cpp and 4-bit quantization. The broader claim was that compact, open models paired with existing runtimes can expand access, reduce cloud dependence, and improve privacy and latency for local AI use cases.

Hugging Face · Jun 10, 2026 · 6 min read

Codex Positions Its Data Plugin as an End-to-End Analytics Workspace

OpenAI’s Codex data science demo presents the product as an analytics workspace that can take a business question, use Databricks data, and produce a decision-ready report for leadership. The case made in the demo is that Codex can act as an agentic data analyst configured to a team’s tools and templates: generating a cancellation-spike analysis, exposing the source query behind a chart, allowing live edits, and exporting the finished work as a Google Slides executive readout.

OpenAI · Jun 9, 2026 · 4 min read

Responsible Mental Health AI Depends on Measurement, Co-Design, and Trust

At Stanford’s 2026 AI for Mental Health Symposium, Carolyn Rodriguez, Ehsan Adeli, Brandon Staglin and Vaile Wright argued that the urgent question is no longer whether people will use AI for mental health, but whether the field can make that use safe, clinically meaningful and trustworthy. The panel’s case was that responsible deployment will require measurable standards for quality and harm, early involvement from clinicians and people with lived experience, regulatory and payment systems that support trust, and designs that strengthen rather than replace human relationships.

Brandon Staglin · Ehsan Adeli · Vaile Wright · Carolyn Rodriguez · Stanford HAI · Jun 8, 2026 · 19 min read

ElevenLabs Adds Studio and Flows Agents to Automate Creative Production

Luke Harries used ElevenLabs’ Warsaw summit to argue that AI creative production is moving beyond prompt-based asset generation toward agent-directed workflows. Presenting ElevenCreative, he introduced Studio Agent and Flows Agent as layers above models and editing tools, intended to help teams ideate, script, prompt, edit, localize, and reuse campaigns. His case was that marketers’ role shifts from executing each production step to directing and approving systems that can produce hero assets, performance variations, and localized creative continuously.

Luke Harries · ElevenLabs · Jun 8, 2026 · 6 min read

Tech Founders Argue IPOs Can Create More Upside After Listing

At an All-In Liquidity IPO panel, Altimeter’s Brad Gerstner, Cerebras chief executive Andrew Feldman and Planet Labs chief executive Will Marshall made the case that public markets are again becoming a place where venture-backed technology companies can compound, not merely exit. Gerstner argued that investors often give up large gains by forcing distributions after an IPO, while Feldman said more money is historically made after companies go public than before. Marshall and Feldman also described the IPO less as an operating transformation than as a change in capital, credibility and scrutiny, with execution still determining whether the listing creates lasting value.

Jason Calacanis · Chamath Palihapitiya · David Sacks · Brad Gerstner · Andrew Feldman · Will Marshall · All-In Podcast · Jun 6, 2026 · 13 min read

Frontier Labs Treat Recursive Self-Improvement as a Near-Term Control Problem

AI in the AM’s first weekly highlights edition argues that the important AI signal in early June was not a model launch but a pattern: frontier labs are treating AI-accelerated AI research as near-term, while their main control strategy remains AI systems monitoring other AI systems. Nathan Labenz presents that as a safety concern, and the source contrasts thin recursive-self-improvement plans with OpenAI’s more concrete tax-agent example, where the harness improves from practitioner corrections rather than from changes to model weights. The through-line is that value and risk are moving into the layers around the model: tax harnesses, private data and expert judgment in cyber, real-time moderation guardrails, and safety architecture in mental-health deployments.

Nathan Labenz · John Wasseige · Matthew Sanders · Brett Levenson · Prakash Narayanan · Taras Pohrebniak · Snehal Antani · Hooman Radfar · Peter Jansen · Arthur Fernandes · Tal Hoffman · Yair Tsarfaty · The Cognitive Revolution · Jun 6, 2026 · 24 min read

Hackathon Caps Models at 32B Parameters to Reward Tinkerable AI Apps

Build Small is a Hugging Face and Gradio hackathon organized around a hard constraint: every model used must be under 32 billion parameters. Yuvraj Sharma framed the rule as a way to move AI building away from dependence on giant hosted models and back toward systems that participants can inspect, fine-tune, run locally, and ship as working Gradio Spaces. Sponsor presentations from Black Forest Labs, OpenBMB, OpenAI, NVIDIA, Modal, JetBrains, and Cohere largely reinforced that premise, offering small models, credits, tools, and prize categories meant to turn the constraint into runnable projects rather than demos in name only.

Shashank Verma · Vaibhav Srivastav · Stephen Batifol · Julian Mack · Yuvraj Sharma · Felicia Chang · Nikita Pavlichenko · Hannah Blair · Zhong Zhang · Hugging Face · Jun 5, 2026 · 20 min read

Native Multimodal Models Extend LLMs but Still Lack Unified Representations

Victoria Lin of Thinking Machines uses a Stanford CS25 seminar to argue that native multimodal models have extended much of the large-language-model recipe into images, audio, video and action, but have not yet unified multimodal intelligence. Her account is that tokenization, Transformers, autoregressive conditioning and scaling transfer only partly: images, video and action require different representations, objectives and sometimes modality-specific parameters. The result, she says, is a field moving beyond text-only systems while still relying on text as its strongest abstraction for reasoning.

Steven Feng · Victoria Lin · Stanford Online · Jun 4, 2026 · 19 min read

Vision-Language Models Understand Multimodal Inputs but Still Generate Text

Stanford’s CS336 lecture on alignment and multimodality, led by Percy Liang with Tatsunori Hashimoto, argues that the core problem in vision-language systems is still how to turn non-text data into tokens a Transformer can use. The lecture traces the field from CLIP and SigLIP through LLaVA and Qwen, presenting modern VLMs as largely built around a stable template: a vision encoder, an adapter, and a pretrained language model that generates text. Liang’s larger point is that these systems are powerful multimodal input models, but not true omni models; representing images and video without losing fine detail remains the central technical constraint.

Percy Liang · Tatsunori Hashimoto · Stanford Online · Jun 4, 2026 · 22 min read

NVIDIA Frames Cosmos 3 as Compute-Generated Data for Physical AI

NVIDIA presents Cosmos 3 as an open foundation model for physical AI, built to address what it frames as a data-scaling problem in robotics, autonomous vehicles and other systems that operate in the physical world. The company argues that real-world data cannot capture enough variability on its own, so compute must generate usable training and evaluation signals: synthetic video, predicted sensor outputs, simulation loops and action plans. Cosmos 3 is positioned as a post-trainable mixture-of-transformers system that combines multimodal reasoning with generation to support perception, prediction, simulation and action.

NVIDIA · Jun 2, 2026 · 5 min read

OpenAI CFO Says Compute Scarcity Will Define Its Next Phase

OpenAI CFO Sarah Friar used an All-In interview to frame the company less as an IPO candidate chasing public-market timing than as an infrastructure-scale AI business trying to finance scarce compute, broaden distribution, and defend the intelligence layer between users and the underlying technology. Friar argued that OpenAI’s consumer and enterprise products are meant to compound off the same foundation, even as the company raises unprecedented capital, diversifies cloud and chip supply, and considers ads without letting sponsored results distort ChatGPT.

Chamath Palihapitiya · Jason Calacanis · David Sacks · David Friedberg · Sarah Friar · All-In Podcast · Jun 2, 2026 · 15 min read

YouTube Is Becoming Hollywood’s Talent Market and IP Proving Ground

TBPN’s John Coogan and Jordi Hays argue that YouTube is moving from Hollywood competitor to Hollywood’s talent market, where creator-led films prove creative judgment, production ability and audience response before studio capital arrives. The episode extends that pattern to AI policy, software and prediction markets: established institutions are trying to absorb signals formed outside their usual channels, from internet-proven filmmakers and frontier AI labs to traders and startups testing demand before regulators, studios or public markets have settled their response.

Jordi Hays · John Coogan · Marc Benioff · Nico Ferreyra · Mike Schroepfer · Graham Stephan · Bernie Su · Sue Khim · Scott Trinkham · Adam Iscoe · Jason Oppenheim · Danial Jameel · Tyler Bohall · TBPN · Jun 1, 2026 · 27 min read

Open Image Models Converge on Flow Matching and DiT Architectures

Stanford adjunct lecturer Shervine Amidi uses Lecture 8 of CME296 to argue that modern visual generation is best understood as a stack of choices for transporting noise into data: the paradigm, representation, architecture, training procedure, and evaluation method. He presents flow matching as the current default for image-generation systems, diffusion transformers as the dominant architectural direction, and latent spaces as a practical compression tradeoff now being challenged by scaled pixel-space models.

Shervine Amidi · Stanford Online · Jun 1, 2026 · 23 min read
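
To make the "transporting noise into data" framing in the summary above concrete, here is a minimal sketch of one common flow-matching setup: a straight-line path between a noise sample and a data sample, whose velocity a network would be trained to predict. The straight-line path, the function names, and the oracle stand-in for a trained model are assumptions for illustration and are not taken from the lecture.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for this path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
data = [2.0, -1.0]                           # toy "data" point (assumed)
noise = [random.gauss(0, 1) for _ in data]   # Gaussian noise sample


def oracle(x, t):
    """Stand-in for a trained network: returns the true velocity for this pair."""
    return target_velocity(noise, data)


sample = euler_sample(oracle, noise, steps=10)
print(sample)  # lands on the data point, up to floating-point error
```

In a real system the oracle is replaced by a learned model trained to regress `target_velocity` at points produced by `interpolate`, and sampling integrates that learned field instead.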

Nvidia Targets AI PCs With New Blackwell Chip and MediaTek CPU

Bloomberg Technology’s Caroline Hyde and Ed Ludlow framed Nvidia’s Computex announcements as an attempt to extend AI demand beyond the data center and into PCs, software and physical systems. The central case, led by Jensen Huang and assessed by Bloomberg reporters and analysts, is that Nvidia’s new RTX Spark chip and agentic-AI thesis could redraw parts of the PC and enterprise software markets, even as questions remain about performance, Arm’s history in PCs and the health of the broader hardware cycle.

Caroline Hyde · Ed Ludlow · Jensen Huang · Ian King · Isabelle Lee · Mark Gurman · Amit Jain · Mandeep Singh · Julie Samuels · George Ferguson · Matt Day · Vince Hu · Matt Wittmer · Stephen Engle · Bloomberg Technology · Jun 1, 2026 · 13 min read

Luma AI Targets Robotics Generalization With Open Physical AI Lab

Luma AI is launching an open physical AI lab to work on robots that can generalize beyond task-by-task demonstrations, CEO Amit Jain told Bloomberg Technology. Jain argues that physical AI should be built on large-scale multimodal data systems rather than narrow robotics training alone, and that the stack must remain open because robots could become part of homes, factories, hospitals and other productive systems.

Ed Ludlow · Amit Jain · Caroline Hyde · Bloomberg Technology · Jun 1, 2026 · 6 min read

Language Models Are Becoming the Bottleneck in Video Generation

Ethan He, who worked on NVIDIA’s Cosmos world model and xAI’s Grok Imagine, argues that the next major gains in video generation will come less from diffusion models alone than from language models, agents, and context management around them. In an interview with swyx and Vibhu Sapra, He describes Grok Imagine as a fast-built example of that shift: diffusion renders pixels, while language systems increasingly rewrite prompts, plan clips, call tools, manage memory, and turn short generations into longer, editable video.

Shawn Wang · Vibhu Sapra · Ethan He · Latent Space · Jun 1, 2026 · 28 min read

AI Moves Medical Alerts From Fall Response to Fall Prevention

LogicMark chief executive Chia-Lin Simmons argues that medical-alert technology for older adults has remained too reactive, built around emergency buttons that assume a user can call for help after a fall. In an interview with Craig Smith, she describes LogicMark’s shift toward AI-supported monitoring that builds individual baselines from activity, sleep, medication and location patterns, then flags signs of decline before a crisis. Simmons says the aim is not to replace human responders, but to give families, caregivers and monitoring services earlier signals that can help more seniors age at home safely.

Craig Smith · Chia-Lin Simmons · Eye on AI · Jun 1, 2026 · 17 min read

A Two-Hour AI Prototype Let Museum Visitors Talk to Statues

Joe Reeve of ElevenLabs argues that his “talk to a statue” prototype mattered less as a museum product than as evidence of what can now be assembled quickly from existing AI APIs. Built in Cursor in about two hours, the app identifies a photographed statue, generates historical context and a plausible voice, spins up an ElevenLabs agent, and starts a conversation in roughly 30 seconds. Reeve says the harder remaining questions are institutional rather than purely technical: who authors the object’s story, what voice it should have, and how multimodal voice interfaces should work.

Joe Reeve · AI Engineer · Jun 1, 2026 · 14 min read

Sarvam and NVIDIA Build Full-Stack Sovereign AI Infrastructure for India

Sarvam co-founder Pratyush Kumar argues that India’s AI sovereignty cannot mean putting Indian-language interfaces on foreign-built systems. In an NVIDIA-backed account of Sarvam’s work, he describes a full-stack effort to build foundational models, data pipelines, inference systems and developer APIs inside India, using NVIDIA H100 clusters and NeMo tooling to process Indian-language data at scale. The case is that voice-first AI for India’s population requires domestic capability across data, models, applications and accelerated-compute expertise.

Pratyush Kumar · NVIDIA · Jun 1, 2026 · 5 min read

Voice Agents Need Colocated Models to Stay Under One Second

Rishabh Bhargava of Together AI argues that production voice agents are now constrained less by demos than by a sub-second engineering budget spanning speech-to-text, LLMs, text-to-speech, networking, and scaling. In his account, users notice delays above 500ms and abandon calls around one second, making even 75ms network hops material once model latency is optimized. The practical architecture remains a cascade, he says, because it lets teams control tool calling, evaluation, and reliability while speech-to-speech models still lag on production requirements.

Rishabh Bhargava · AI Engineer · May 31, 2026 · 10 min read
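
The latency budget described in the summary above can be sketched as simple arithmetic. Only the 500 ms and one-second thresholds and the 75 ms hop cost come from the summary; the per-stage latencies, the hop counts, and the function names are hypothetical values chosen for illustration.

```python
NOTICE_MS = 500    # delay users start to notice (from the summary)
ABANDON_MS = 1000  # delay around which users abandon calls (from the summary)
HOP_MS = 75        # cost of one network hop between stages (from the summary)


def turn_latency_ms(stage_ms, network_hops):
    """Total response latency: cascade stages plus one HOP_MS per network hop."""
    return sum(stage_ms.values()) + network_hops * HOP_MS


def classify(total_ms):
    """Label a turn against the two thresholds."""
    if total_ms > ABANDON_MS:
        return "abandon risk"
    if total_ms > NOTICE_MS:
        return "noticeable"
    return "ok"


# Assumed per-stage latencies for a speech-to-text -> LLM -> text-to-speech cascade.
stages = {"stt": 150, "llm": 200, "tts": 100}

colocated = turn_latency_ms(stages, network_hops=0)    # stages share one site
distributed = turn_latency_ms(stages, network_hops=3)  # each stage hosted separately

print(colocated, classify(colocated))
print(distributed, classify(distributed))
```

With these assumed numbers, the same three models stay under the 500 ms threshold when colocated and cross it once three inter-service hops are added, which is the sense in which a 75 ms hop becomes material.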

Hugging Face Ships a $299 Hackable Robot for Voice AI Experiments

Andres Marafioti argues that Hugging Face’s Reachy Mini is meant to move robotics experimentation out of expensive humanoid hardware and into a $299-to-$449 open-source platform that users can assemble, repair and modify themselves. The robot’s most-used application is conversation, and Marafioti’s account ties its social ambition to a technical stack built for low-latency speech: Parakeet transcription, Qwen 3.5 27B, and an optimized Qwen3 TTS implementation that he says improved from 0.8x to 5.8x real time.

Andres Marafioti · AI Engineer · May 29, 2026 · 12 min read

AI Photo Analysis Is Moving From Skin Care to Cosmetic Advice

George Mack, Nirav Savjani, Tim Ferriss and Chris Williamson argue that image-capable AI is moving from practical skin-care triage into cosmetic judgment. Mack says Gemini identified a fungal skin treatment that years of doctors and lifestyle changes had missed; Savjani says the same photo-upload pattern is now driving looksmaxing tools that recommend facial changes, procedures and appearance edits. The discussion turns on a boundary the speakers see becoming harder to police: when AI advises what to do to a face, it can also normalize a version of that face that no longer matches reality.

Chris Williamson · Nirav Savjani · George Mack · Tim Ferriss · Chris Williamson · May 29, 2026 · 7 min read

Snowflake Rally Reflects AI Demand More Than Amazon Deal

Bloomberg Technology framed Snowflake’s 34% stock surge less as a reaction to its $6 billion Amazon Web Services deal than as a repricing of its AI software position. Snowflake chief executive Sridhar Ramaswamy pointed to stronger product revenue, higher retention and adoption of tools such as Cortex, while Bloomberg’s Brody Ford argued the AWS agreement mainly helps answer how Snowflake can manage the infrastructure costs of building AI features.

Ed Ludlow · Caroline Hyde · Mark Gurman · Brody Ford · Sridhar Ramaswamy · Sampriti Bhattacharyya · Jo Constantz · Jared Isaacman · Eric Vishria · Stephen Engle · Shweta Khajuria · Alexandra Levine · Yeyi Yun · Arthur Mensch · Carson Block · Bloomberg Technology · May 28, 2026 · 12 min read

Text-to-Image Evaluation Requires Metrics Matched to Specific Failure Modes

Stanford adjunct lecturers Afshine Amidi and Shervine Amidi argue that evaluating text-to-image models starts with separating aesthetic quality from prompt adherence, then choosing metrics suited to the failure being tested. In Lecture 7 of Stanford’s CME296 course on diffusion and large vision models, they treat human ratings, FID, CLIPScore, reference-based measures, multimodal judges, and benchmarks as imperfect instruments rather than substitutes for a universal image-quality score. Their central warning is practical: automated and qualitative evaluations can be useful, but only when their assumptions, calibration, and failure modes are made explicit.

Shervine Amidi · Stanford Online · May 28, 2026 · 19 min read
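
As an illustration of one metric named in the summary above, CLIPScore is commonly defined as a rescaled, clipped cosine similarity between an image embedding and a caption embedding, with a weight of 2.5 in the original formulation. The sketch below uses hand-written toy vectors in place of real CLIP embeddings, so the inputs are assumptions; it shows the scoring rule only and is not the lecture's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore rule: w * max(cos(image, text), 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)


# Toy embeddings (hypothetical); a real pipeline would obtain these from a CLIP model.
image = [1.0, 0.0]
matching_caption = [1.0, 0.0]
unrelated_caption = [0.0, 1.0]
opposed_caption = [-1.0, 0.0]

print(clip_score(image, matching_caption))
print(clip_score(image, unrelated_caption))
print(clip_score(image, opposed_caption))
```

Because the score depends only on embedding alignment, it measures prompt adherence as the embedding model sees it and says nothing about aesthetic quality, which is why the summary treats it as one instrument among several.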

Chip Ganassi Racing Uses OpenAI to Find Tenths Between Sessions

OpenAI’s Joyce Ruffell presents the company’s collaboration with Chip Ganassi Racing as an effort to turn an already data-rich IndyCar operation into a faster decision-making system. The case made in the source is not that AI replaces race judgment, but that it can connect historical, test, race, pit-stop, and strategy data quickly enough to matter in the narrow windows between sessions and during a race. At Long Beach, the argument is illustrated through Alex Palou’s win: a late pit-strategy adaptation, precise crew execution, and trusted information flow produced the margin.

Barry Wanser · Alex Palou · Chip Ganassi · Joyce Ruffell · Will Plummer · OpenAI · May 28, 2026 · 6 min read

ElevenLabs Says Dubbing v2 Preserves Performance Across 90 Languages

ElevenLabs is introducing Dubbing v2 alpha as an AI dubbing model built around preserving the original speaker’s performance, not just translating a transcript. The company says the system conditions directly on source audio so tone, pacing, emphasis and emotional delivery can carry across more than 90 languages, with sync-aware translation adapting phrasing to fit the timing of the original. ElevenLabs is positioning the launch for creators, marketers and studios that want automated localization without building a separate dubbing pipeline.

Jimmy Donaldson · ElevenLabs · May 28, 2026 · 5 min read

Neuralink Says 20-Patient Scale Is Advancing Brain-AI Interfaces

Neuralink co-founder and president DJ Seo told Sequoia partner Shaun Maguire at AI Ascent 2026 that the company has moved from a single human implant demonstration to more than 20 patients, while still treating its current work as restoration of lost function rather than elective enhancement. Seo argued that Neuralink’s larger aim is not faster computer control but a higher-bandwidth interface between brains and AI, eventually enabling direct, multimodal transfer of concepts. The path there, he said, depends less on a single implant breakthrough than on scaling surgery, robotics, manufacturing, clinical evidence and neural-data models.

Elon Musk · Shaun Maguire · Alex Conley · Noland Arbaugh · Jake Harrell · Nick Wray · Audrey Crews · Sammy Nio · Dongjin Seo · Kenneth Shock · Brad Smith · Sequoia Capital · May 28, 2026 · 12 min read

Transformers.js Turns Local AI Models Into JavaScript Pipelines

Nico Martin presents Transformers.js as the JavaScript application layer around local AI models, not the engine that performs the model math. In his explanation, ONNX defines the model graph and weights, ONNX Runtime executes the computation, and Transformers.js handles the surrounding work: loading assets, converting inputs to tensors, selecting devices and precision, and decoding outputs. Martin argues that this task-based abstraction is why one `pipeline()` API can support very different workloads, from text generation to depth estimation, while hiding much of the model-specific wiring from developers.

Nico Martin · Hugging Face · May 27, 2026 · 7 min read

Gemma Is Google’s On-Device Extension of Gemini Research

Google DeepMind’s Omar Sanseviero argues that Gemma is not a parallel alternative to Gemini but the open, local and on-device expression of the same research stream. He presents Gemma 4 as a model family optimized for efficiency, developer integration and emerging agentic use cases, while drawing a clear boundary around Gemini as Google’s route for frontier capability, broad factual knowledge and long-running tasks.

Vibhu Sapra · Shawn Wang · Omar Sanseviero · Latent Space · May 25, 2026 · 13 min read

Heterogeneous Model Routing Beats Frontier Baselines on Visual Web Tasks

Adrian Bertagnoli of Callosum argues that AI scaling is moving away from monolithic models running on uniform GPU clusters and toward heterogeneous systems that route subtasks across different models, chips and workflows. He points to Callosum results in visual web navigation and recursive long-context reasoning, where mixed model-and-hardware systems reportedly matched or beat frontier baselines while cutting cost and latency, as evidence that agentic workloads should be decomposed rather than sent wholesale to the most capable model.

Adrian Bertagnoli · AI Engineer · May 24, 2026 · 10 min read

Google’s GenAI Stack Turns Multimodal Prompts Into Application Pipelines

Google DeepMind’s Paige Bailey and Guillaume Vernade argue that Google’s generative AI stack is being organized as an application pipeline rather than a set of isolated models. In a three-hour workshop, Bailey showed AI Studio turning multimodal Gemini prompts into inspectable API calls and generated apps with auth and Firestore, while Vernade used Gemini, Nano Banana, Veo and Lyria to illustrate, animate and score The Wind in the Willows. Their case is that builders can now orchestrate prompt, code, media generation and deployment in one workflow, even as the demos exposed seams that still require engineering discipline.

Paige Bailey · Guillaume Vernade · Ian Valentine · AI Engineer · May 23, 2026 · 23 min read

Android Makes Gemini Nano a Shared System Service for Apps

Google’s Florina Muntenescu and Oli Gaymond argue that Android’s on-device AI strategy depends on treating Gemini Nano as a shared system service, not something each app ships and manages itself. In their account, AICore centralizes the three-to-four-gigabyte model, scheduling, battery management and privacy boundaries, while developers call higher-level ML Kit GenAI APIs. The constraint is reach: those APIs need recent flagship-class devices, so Google is positioning hybrid cloud fallback and LiteRT-LM as alternatives when local Gemini Nano is unavailable or too limiting.

Florina Muntenescu · Oli Gaymond · AI Engineer · May 22, 2026 · 11 min read

DeepSeek Uses Visual Primitives to Make Image Reasoning Cheaper

Károly Zsolnai-Fehér presents DeepSeek’s “Thinking with Visual Primitives” paper as a meaningful shift in visual AI: not a model that merely sees images, but one that can reason by marking them with points, boxes and paths. He argues that this makes tasks such as counting and maze tracing cheaper, more accurate and easier to inspect, with the paper reporting strong benchmark results while using about 90% fewer visual tokens than many frontier systems. He also cautions that the work is a blueprint rather than a released model, and still depends on triggers and may struggle with fine visual detail or unfamiliar topology problems.

Károly Zsolnai-Fehér · Two Minute Papers · May 22, 2026 · 6 min read

Google’s AI Strategy Emphasizes Scale Over Frontier Model Leadership

Kevin Roose and Casey Newton read Google’s I/O announcements as evidence of a company that has regained operational confidence in AI without yet proving frontier leadership. Roose argues Google is leaning on speed, cost, distribution and infrastructure, putting capable models across search, coding, video and cloud tools at enormous scale. Newton is more skeptical: fast and cheap, he says, is not the same as best, and many of Google’s most important product claims remain untested until users can rely on them in real workflows.

Kevin Roose · Casey Newton · Demis Hassabis · Hard Fork · May 21, 2026 · 7 min read

Gemini Omni Flash Replaces Veo as Google’s Default Video Model

ElevenLabs’ breakdown of Google’s I/O 2026 launch presents Gemini Omni as a major reset of Google’s AI video stack, with Omni Flash already replacing Veo as the default video model in the Gemini app. The source argues that the significance is not just better text-to-video generation, but a shift toward multimodal, conversational video creation: users can combine text, images, audio, video, and reference photos, then revise clips through successive instructions while preserving characters and scenes.

ElevenLabs · May 21, 2026 · 6 min read

Google’s I/O Pitch Put Distribution Ahead of Model Breakthroughs

John Coogan and Jordi Hays read Google I/O as a mixed signal: Google’s smart-glasses strategy looks stronger where it combines Gemini with eyewear distribution and Google’s own services, but its model launches exposed the risk of tying AI progress to a fixed conference calendar. On TBPN, they argued that Street View may be an underappreciated AI training asset and that AI video still has to move from impressive short clips to coherent long-form outputs. The episode also framed a potential SpaceX IPO and Nvidia’s latest results as evidence that the financial returns from space and AI infrastructure are already arriving at exceptional scale.

John Coogan · Jordi Hays · Tyler Cosgrove · Steve Wozniak · TBPN · May 21, 2026 · 14 min read

Google’s AI Assets Are Becoming a Product Coherence Problem

John Coogan and Jordi Hays read Google’s I/O as evidence that the company’s AI advantage is becoming a product-navigation problem: it has data, distribution, models and hardware partnerships, but its demos and product names left questions about coherence and pace. Across the source, that same pressure appears in more operational forms, as AI pushes companies to turn technical capability into usable workflows, secure software dependencies and faster product systems. Tae Kim’s Nvidia argument and the expected SpaceX IPO make the capital-market version of the question explicit: whether investors will keep paying for scarce infrastructure, extreme scale and growth curves that may take years to prove out.

Jordi Hays · John Coogan · Dylan Field · Immad Akhund · Brian Chesky · Marcus Milione · Feross Aboukhadijeh · Tae Kim · TBPN · May 20, 2026 · 32 min read

Any-to-Any Agents Rely on Orchestrated Multimodal Models, Not One Network

Google DeepMind’s Patrick Löber presents “any-to-any” agents as an orchestration problem rather than a claim that one model already handles every modality. In his architecture, Gemini reads and reasons across PDFs, images, audio, video and other sources, then uses function calling to invoke specialized native models for images, speech, live audio, video or embeddings. Löber argues that the useful shift is not generating every possible format, but letting an agent decide when a diagram, spoken explanation or other output is warranted.

Patrick Loeber · AI Engineer · May 20, 2026 · 10 min read

Gemini’s Strategy Shifts From Frontier Leaderboards to Deployable AI Infrastructure

Google DeepMind executives Tulsee Doshi and Logan Kilpatrick argue that Google’s current Gemini strategy is built less around a single frontier model than around a deployable AI stack. In their account, Gemini 3.5 Flash, the Antigravity agent harness and new multimodal products such as Omni are meant to make models fast, cheap and integrated enough to run across Search, the Gemini app, AI Studio, YouTube and enterprise tools. The deeper shift, Kilpatrick says, is that the model is increasingly absorbing the scaffolding that once surrounded it, while Google standardizes the remaining agent infrastructure across its products.

Nathan Labenz · Logan Kilpatrick · Tulsee Doshi | The Cognitive Revolution | May 20, 2026 | 19 min read

Fine-Tuning Pushed FunctionGemma From 46% to 90% Function-Calling Accuracy

Cormac Brick, a Google AI Edge engineer, argues that on-device agents are becoming practical when developers either use system models such as Gemini Nano through Android AI Core or ship narrow, fine-tuned tiny models with LiteRT-LM. His main example is FunctionGemma, a 270-million-parameter function-calling model that rose from about 46% accuracy out of the box to more than 90% on most tested app-intent functions after synthetic-data fine-tuning. Brick presents the tradeoff plainly: system GenAI is easier when it fits, while app-shipped tiny models require more work but can run locally, offline, and with more control.

Chintan Parikh · Cormac Brick | AI Engineer | May 20, 2026 | 11 min read

Google’s AI Repricing Turns on Product Restraint and Developer Adoption

John Coogan and Jordi Hays use Google I/O to argue that Alphabet is being repriced less as a search incumbent threatened by AI than as a full-stack AI company, though they say Google still has to prove it can turn models such as Gemini Omni and Flash into useful products without cluttering every surface. The Diet TBPN episode also treats distribution as the common pressure point behind several unrelated fights: whether smartphones help explain the timing of global fertility decline, why a small Spotify icon change provoked backlash, and whether podcasts or childcare are eroding the market for serious nonfiction.

John Coogan · Jordi Hays | TBPN | May 20, 2026 | 15 min read

Text-to-Image Training Is Becoming a Problem of Signal Allocation

Stanford adjunct lecturers Shervine Amidi and Afshine Amidi present text-to-image model training as a problem of allocating scarce learning signal across the full model lifecycle, not simply choosing a diffusion or flow-matching loss. In Lecture 6 of Stanford’s CME296 course, they argue that practical training depends on emphasizing hard timesteps, adjusting for resolution, using data curricula and representation alignment, then applying post-training, personalization, and distillation methods to improve control and reduce inference cost.

Shervine Amidi | Stanford Online | May 19, 2026 | 21 min read

AI’s Value Is Shifting From Model Demos to Distribution and Measurement

Google’s problem at I/O, Jordi Hays argued, was no longer proving that its AI models are impressive, but making Gemini useful rather than redundant across products that investors increasingly view as part of a full-stack AI business. The TBPN discussion extended that framing across the rest of the show: AI’s value, the hosts and guests argued, depends less on model spectacle than on distribution, workflow integration, economics and adoption by institutions. That distinction ran from Google’s risk of crowding users with Gemini entry points to SendCutSend’s physical capacity constraints, Commure’s push to automate healthcare administration, and METR’s effort to turn frontier-model risk into something auditable.

Jordi Hays · John Coogan · Ajeya Cotra · Jim Belosic · Tanay Tandon · Aidan Dewar · Fai Nur · Philip Inghelbrecht | TBPN | May 19, 2026 | 31 min read

AI Growth Is Running Into Power, Memory, and Inference Bottlenecks

TBPN’s discussion recast the AI boom around physical and economic bottlenecks — power, cooling, chip scarcity, inference cost and memory — rather than model ambition alone. Mike Isaac, Rowan Trollope and Dean Leitersdorf described an industry where local utilities, low-level inference optimization and fast state management are becoming central constraints, a capacity problem the hosts also saw in the whey protein shortage. Everlane’s reported sale to Shein pointed to a different limit: Hays argued that venture-backed ethical basics struggled against price pressure, brand preference and the demand for sustained growth. Joanna Stern supplied the adoption constraint, arguing from her reporting that AI’s progress will be judged through trust, job anxiety, children’s safety and whether new devices ease or deepen phone dependence.

John Coogan · Jordi Hays · Joanna Stern · Rowan Trollope · Dean Leitersdorf · Mike Isaac | TBPN | May 18, 2026 | 24 min read

Gemini Becomes the Prompt Engineer for Google’s Gen Media Stack

Google DeepMind developer advocate Guillaume Vernade demonstrates a gen-media workflow built around Gemini as the orchestrator rather than as a one-shot generator. Using The Wind in the Willows, he shows Gemini reading the full book, producing structured prompts and scripts, and handing them to Nano Banana, Veo, Lyria and TTS models for images, video, music and narration. His broader case is that multimodal production depends less on a single model than on schemas, reference assets, state management, cost controls and prompt handoffs between specialist systems.

Guillaume Vernade · Paige Bailey | AI Engineer | May 18, 2026 | 19 min read

Economic Entanglement, Not Decoupling, Defines the New China Bargain

Salesforce CEO Marc Benioff joined the All-In hosts for a discussion that framed U.S.-China relations, enterprise AI, and the software selloff around the same question: when dependence is a stabilizer and when it becomes leverage. Benioff argued that more trade with China can lower conflict risk and that large software platforms remain valuable because AI still needs trusted customer data, cash-flowing distribution, and enterprise deployment. David Friedberg, Chamath Palihapitiya, and Jason Calacanis extended the argument across Taiwan, chips, AI assistants, El Niño-driven food risk, and private-market SPVs, where interconnection can either absorb shocks or transmit them.

Jason Calacanis · Chamath Palihapitiya · David Friedberg · Marc Benioff | All-In Podcast | May 15, 2026 | 20 min read

Images 2.0 Moves Image Generation From Novelty to Workflow Tool

OpenAI product lead Adele Li and researcher Kenji Hata argue that Images 2.0 marks a shift from novelty image generation to a working visual layer inside ChatGPT. In a podcast discussion with Andrew Mayne, they point to 1.5 billion images generated weekly, sharper text rendering, stronger photorealism, broader aspect ratios and more consistent characters as evidence that the model is moving into education, internal communication, marketing assets, software mockups and other practical creative work.

Andrew Mayne · Adele Li · Kenji Hata | OpenAI | May 14, 2026 | 12 min read

MagenticLite Brings Full Agent Workflows to Small Language Models

Microsoft Research is presenting MagenticLite as a full-stack agentic system designed to make small language models usable for multi-step work across a browser and local files. Weili Shi, Harkirat Behl and Hussein Mozannar argue that the capability comes from specializing the stack rather than relying on frontier-scale models: MagenticBrain handles planning, coding and delegation, while Fara 1.5 controls the browser. The release also emphasizes user oversight, with the agent pausing for credentials, approvals or other points where the user needs to take control.

Hussein Mozannar · Harkirat Behl · Weili Shi | Microsoft Research | May 14, 2026 | 7 min read

GPT-Realtime-2 Turns Voice Agents Into Tool-Using Reasoning Systems

OpenAI’s Build Hour on GPT-Realtime-2 presented the new realtime voice release as a shift from conversational voice interfaces toward tool-using, stateful agents. Teri Yu and Erika Kettleson argued that GPT-Realtime-2’s larger context window, stronger instruction following, parallel tool calling and controllable speech behavior let developers build voice systems that can operate apps, reason across workflows and know when not to speak. Sierra’s Ken Murphy and Soham Ray added that production voice agents still depend on the surrounding system: guardrails, tuned turn-taking, tracing, redaction, evaluations and customer-specific workflows.

Ken Murphy · Teri Yu · Sarah Urbonas · Soham Ray · Erika Kettleson | OpenAI | May 13, 2026 | 14 min read

AI Companions Are Tempting Because They Make Relationships Too Easy

Joanna Stern, author of I Am Not a Robot, argues on Big Technology Podcast that AI’s most plausible near-term role is not as a standalone gadget or replacement professional, but as a second layer on devices, workflows, and relationships people already use. Drawing on a year of trying to put AI into daily life, she says the tools can be genuinely useful in wearables, medical interpretation, and solo work, while chatbot companionship exposes a more troubling risk: systems that are always available, agreeable, and easier than human relationships.

Alex Kantrowitz · Joanna Stern | Alex Kantrowitz | May 13, 2026 | 15 min read

Agents Can Now Fine-Tune Open Models Through Prompted Workflows

Merve Noyan argues that open models have moved from downloadable artifacts into an operational stack for selection, serving, inspection, training and deployment. In her Hugging Face presentation, she makes the case that access to model weights now matters because developers can quantize, fine-tune and run models locally or at the edge, while Hub benchmarks, inference providers, traces, MCP and Skills let agents act directly on those workflows. Her strongest example is a coding agent that can size hardware, choose infrastructure and launch a fine-tuning job from a prompt.

Merve Noyan | AI Engineer | May 13, 2026 | 12 min read

NVIDIA’s Nemotron 3 Nano Omni Trades Accuracy for Multimodal Throughput

Károly Zsolnai-Fehér’s account of NVIDIA’s Nemotron 3 Nano Omni argues that the 30-billion-parameter open multimodal model is notable less for leading general intelligence benchmarks than for processing long video, audio, images and documents quickly and cheaply. The reported advantage comes from compression across the system — Mamba layers, audio tokenization, aspect-ratio-preserving vision handling, distilled encoders and efficient video sampling — which reduces the amount of material sent into the language-model backbone.

Károly Zsolnai-Fehér | Two Minute Papers | May 13, 2026 | 7 min read

The Mouse Pointer Becomes a Reference Tool for AI Interfaces

Google DeepMind researcher Adrien Baranes argues that the mouse pointer can become more than a tool for selecting and clicking. In an experimental prototype, he presents the cursor as an AI-mediated reference layer: a way for Gemini to connect words such as “this,” “that,” and “here” to the precise objects, app data, and screen content a user is indicating. The aim is to make pointing function as shared context between a person and an AI system across documents, calendars, maps, and images.

Adrien Baranes | Google DeepMind | May 13, 2026 | 5 min read

Altman Testimony Casts Musk’s OpenAI Claims as a Fight Over Control

OpenAI’s trial, Anthropic’s secondary-market flare-up, and two media deals are read on Diet TBPN as fights over control, enforceability, and credibility. John Coogan argues that Musk v. OpenAI is increasingly about not only whether OpenAI betrayed its nonprofit mission, but also whether Elon Musk accepted a for-profit path only if he controlled it; Jordi Hays frames the Anthropic panic as a test of whether private-company transfer restrictions can hold against demand for AI exposure. Coogan and Hays treat Thinking Machines’ demo separately, as a bet that real-time interaction should be native to AI models, while eBay’s rejected GameStop bid and Byron Allen’s BuzzFeed investment turn on market confidence.

John Coogan · Jordi Hays · Alex Shan | TBPN | May 13, 2026 | 15 min read

Codex Can Now Operate Local Mac Apps Without Taking Over

OpenAI’s Ari Weinstein argues that computer use turns Codex from a coding agent into a system that can operate local Mac applications by seeing interfaces, clicking, typing and continuing work in the background. In a demonstration with Romain Huet, Weinstein presents the feature as distinct from a full-desktop takeover: Codex uses a separate cursor, combines screenshots with macOS accessibility data, and requires app-by-app permission before it can see or type into local software.

Romain Huet · Ari Weinstein | OpenAI | May 12, 2026 | 6 min read

Autonomous Medical Robots Need Physics Models, Not Just Foundation Models

UC San Diego professor Michael Yip argues in a Stanford Robotics Seminar that medical robotics must move beyond teleoperation if it is to address healthcare labor shortages. Current surgical robots can improve precision but still depend on a surgeon’s skill, while surgery’s scarce data, deformable tissue, safety constraints, and need for millimeter accuracy make end-to-end learning an inadequate answer on its own. Yip makes the case for a hybrid path: modern perception where it works, explicit physics and control where contact demands it, and humanoid platforms where broader hospital tasks require more general embodiment.

Michael Yip | Stanford Online | May 12, 2026 | 17 min read

Apple-Device AI Is Becoming Viable Without Cloud Inference

Prince Canuma presents MLX, Apple’s array framework for Apple Silicon, as a practical foundation for running AI agents locally rather than through cloud services. His case is rooted in accessibility and unreliable connectivity, but extends to product constraints for voice agents, robots and multimodal apps: vision, speech, video generation and long-context inference can increasingly run on Macs, iPhones and iPads without a network call. Canuma does not argue that local models replace every frontier cloud system, but that the boundary has moved far enough to make on-device AI a serious deployment option.

Prince Canuma | AI Engineer | May 11, 2026 | 13 min read

Travel AI Needs Visual Agents, Not Chatbot Booking Flows

Airbnb chief executive Brian Chesky argues that today’s AI chatbots are the wrong interface for travel and e-commerce, even as AI becomes central to how Airbnb operates. In a live TBPN conversation, Chesky said consumer AI’s next wave will depend on richer, more visual and collaborative agentic products, not text-first chat boxes or another round of enterprise software. He also tied Airbnb’s recent growth reacceleration to more hands-on “founder mode” management, saying AI makes operating intensity more important rather than less.

Jordi Hays · John Coogan · Brian Chesky | TBPN | May 8, 2026 | 15 min read

Pretraining and Attention Infrastructure Made Vision Transformers Practical

Isaac Robinson of Roboflow argues that transformers overtook convolutional networks in vision not because images stopped needing visual structure, but because that structure moved from hand-built architecture into pretraining, scaling and tooling. In his account, ViT-style models first lacked the inductive biases and efficiency that made CNNs dominant, but self-supervised vision pretraining and attention infrastructure from the LLM world made the simpler architecture practical. Robinson frames the next problem as deployment: turning large foundation backbones into model families that can meet real latency, cost and hardware constraints.

Isaac Robinson | AI Engineer | May 8, 2026 | 10 min read

BFL Is Moving FLUX From Image Generation Toward Physical AI

Stephen Batifol of Black Forest Labs argues that FLUX is no longer just an image-generation line but the start of a broader push toward visual intelligence: models that can generate, edit, understand, and eventually act across images, video, audio, and physical environments. In the talk, he presents FLUX.1, Kontext, FLUX.2, and FLUX.2 Klein as product steps toward that goal, while BFL’s Self-Flow research is framed as the mechanism for moving representation learning inside multimodal generative models rather than relying on external encoders.

Stephen Batifol | AI Engineer | May 8, 2026 | 11 min read

Luma Is Rebuilding Video AI Around a Unified Multimodal Transformer

In a Stanford CS153 guest lecture, Luma AI co-founder and chief executive Amit Jain argues that generative video is only a staging point toward “unified intelligence”: models that understand and generate across text, images, video, audio, code and tools in a single work loop. Jain traces Luma’s path from Apple-era LiDAR and 3D capture to internet-scale video, saying the company followed the data but now sees prettier clips as insufficient. The destination, he says, is a multimodal AI factory for professional creative and physical work, where human skills, tool use, feedback and unified transformer architectures produce full campaigns, schematics, productions and eventually robotics workflows.

Anjney Midha · Amit Jain | Stanford Online | May 7, 2026 | 19 min read

Descript Bets Creator AI on Reliable Editing, Not Content Slop

Laura Burkhauser, Descript’s chief executive, distinguishes generative AI tools for creators from the “slop” she defines as mass-produced content arbitrage. Her case is that Descript’s future depends less on adding AI everywhere than on making editing automation reliable, reversible and useful for recorded human media. That means choosing third-party models by fit and taste, building in-house systems where Descript has workflow data, and treating creator backlash as a product constraint rather than a branding problem.

Nathan Labenz · Laura Burkhauser | The Cognitive Revolution | May 7, 2026 | 19 min read

Gemma 4 Moves On-Device AI From Chatbots to Local Agents

Chintan Parikh of Google DeepMind argues that on-device AI is moving from local chatbots toward local agents, as smaller Gemma 4 edge models become capable of tool calling, structured output and reasoning on phones, laptops and embedded hardware. With Weiyi Wang joining the Q&A, Parikh presents LiteRT as the deployment layer for that shift across Android, iOS, desktop, web and IoT. His case is pragmatic rather than absolute: edge inference can improve latency, privacy, offline use and cost, but teams still have to manage memory, quantization, accelerator support and when to call the cloud.

Weiyi Wang · Chintan Parikh | AI Engineer | May 7, 2026 | 11 min read