Voice and Audio AI
Speech recognition, voice agents, text-to-speech, audio generation, dubbing, call automation, and voice-first AI interfaces.
Flows Agent Turns Creative Briefs Into Editable AI Production Pipelines
ElevenLabs presents Flows Agent as a conversational assistant for building and revising node-based creative workflows inside ElevenCreative Flows. The company’s case is that a user can describe an ad or other asset in natural language, have the agent assemble the models, prompts, nodes, and connections, then keep the resulting pipeline visible for edits, approvals, and reuse. The demo emphasizes cost controls for credit-heavy generation, node-level revisions through chat, and templates that turn a completed flow into a repeatable production system.
ElevenMusic Lets Creators Publish Tracks to Explore and Earn After 11,000 Streams
ElevenLabs presents ElevenMusic as an AI music platform where discovery, remixing, publishing, and earning are meant to operate as one loop. The source argues that creators can turn a lyric, melody, mood, or existing track into publishable music, place it on the Explore page for others to stream or remix, and use audience response to guide further work. It also makes the monetization path conditional: creators must subscribe to Pro, meet an 11,000-stream threshold, and satisfy the platform’s royalty terms before earning from listens.
Apple’s Revamped Siri May Be Good Enough to Ease Its AI Crisis
Bloomberg’s Mark Gurman argues that Apple’s revamped Siri is not a leap ahead of ChatGPT, Gemini or Claude, but may be good enough to stabilize Apple’s position in AI. Speaking with Ed Ludlow, Gurman said the new Siri finally delivers on much of the assistant promise Apple made years ago, while still falling short on advanced tasks such as deep research, long-document summaries and creating spreadsheets or slide decks. His case is that Apple can ease its AI crisis if Siri now handles the everyday questions and device-assistant tasks that most of its more than 2 billion users actually need.
Dubbing v2 Preserves Speaker Performance Across 90-Plus Languages
ElevenLabs presents Dubbing v2 as an AI dubbing model designed to transfer a speaker’s performance across more than 90 languages, not just translate the words. The company argues that by conditioning on the original audio rather than a transcript, the system can preserve voice, tone, emphasis, emotion and timing while adapting phrasing for natural delivery in the target language. The walkthrough positions the tool as an automated localization workflow for creators, marketers and studios, with speaker similarity as the main setting users adjust between voice resemblance and native-language naturalness.
Human Attention Is Becoming the Bottleneck in AI Coding Workflows
Zack Proser, an Applied AI engineer at WorkOS, argues that AI coding has shifted the bottleneck from tool speed to human attention. His proposed workflow uses voice dispatch, isolated git worktrees, agents that read Slack and Linear, remote phone control, and layered verification so developers can keep agent loops moving without staying pinned to a desk or rubber-stamping work they can no longer track.
ElevenMusic Turns Music Discovery Into AI Remixing and Prompted Creation
ElevenLabs presents ElevenMusic as a music platform that begins with discovery and turns listening into creation. The onboarding shows users moving between Explore, where they can browse and remix tracks from more than 4,000 independent and emerging artists, and Studio, where they can upload material or generate new tracks from prompts. Its central argument is practical: the main user skill is not production technique but writing a specific musical brief that gives the model enough genre, mood, instrumentation, vocal, and energy cues to produce a closer result.
Apple’s New Siri Tests Who Controls the Default AI Assistant
John Coogan and Jordi Hays read Apple’s WWDC as a test of whether the company can turn its long-delayed Siri promise into a defensible AI interface without giving up control of defaults, privacy, and the iPhone camera. The Diet TBPN segment argues that Apple’s AI story is less about a single keynote than about older bets now becoming technically possible, while Anthropic’s Claude Fable release and Meta’s data-center training push show the same shift toward long-running inference and physical AI infrastructure.
ElevenLabs Adds Studio and Flows Agents to Automate Creative Production
Luke Harries used ElevenLabs’ Warsaw summit to argue that AI creative production is moving beyond prompt-based asset generation toward agent-directed workflows. Presenting ElevenCreative, he introduced Studio Agent and Flows Agent as layers above models and editing tools, intended to help teams ideate, script, prompt, edit, localize, and reuse campaigns. His case was that marketers’ role shifts from executing each production step to directing and approving systems that can produce hero assets, performance variations, and localized creative continuously.
LOT Turns to ElevenLabs for Multilingual AI Passenger Support
LOT Polish Airlines chief executive Michał Fijoł used an ElevenLabs summit in Warsaw to announce a collaboration that will bring ElevenAgents into the airline’s passenger support. His argument was that customer communication has become an operational challenge for LOT: nearly 200 IT systems, flights across dozens of markets, and routine passenger questions arriving in multiple languages and time zones. Fijoł positioned AI voice support not as a replacement for airline staff, but as a way to handle language, timing, and information access at a scale a Warsaw-centered contact model cannot easily cover.
Voice Cloning Preserves Identity for People Losing Speech to MND
At ElevenLabs’ Warsaw summit, Gabi Leibowitz argued that voice cloning can do more than replace lost speech with functional text-to-speech: it can preserve the vocal traits that make people recognizable to themselves and others. The case was told through Irene Perrin, a former history teacher living with motor neuron disease, who uses an ElevenLabs-cloned voice to continue volunteering at St George’s Chapel and says the technology has given back part of the identity the disease took away.
TELUS Digital Cuts Contact-Center Onboarding Time 20% With AI Voice Simulations
TELUS Digital’s vice president of product, Mitch Lieberman, presents the company’s Agent Trainer as a response to a high-volume contact-center onboarding problem: 70,000 associates, 20,000 to 30,000 hires a year, and industry churn of 30% to 50%. Built on ElevenAgents, the voice and chat simulation platform is intended to get new agents ready for customer interactions faster, with TELUS Digital reporting a 20% reduction in time to proficiency, more than 50,000 completed simulations, and early signs of lower churn.
ElevenLabs Unveils Dubbing v2 and Previews More Controllable Eleven v4
ElevenLabs co-founder Mati Staniszewski used a Warsaw summit keynote to argue that AI’s next constraint is not intelligence but communication people can trust. He presented two new models — Dubbing v2, designed to preserve an original performance across languages, and a preview of Eleven v4, aimed at finer control over speech, emotion, accent, whispering and song — as evidence of that thesis. The broader case was that voice AI becomes commercially useful only when models are tied to agents, integrations, authentication, memory and deployment systems that let companies put spoken interfaces into production.
Cognitive Surrender Is the Core Risk for AI Product Teams
Tony Fadell, the iPod creator, iPhone co-creator and Nest founder, argues that AI raises the value of product judgment rather than replacing it. In a conversation with Lenny Rachitsky, Fadell says builders should use AI to prototype and accelerate bounded work, but not “cognitively surrender” decisions about architecture, taste, marketing, ethics or what is worth building. His broader case is that great products still come from opinionated judgment applied to real pain, new technology and the full customer journey, not from tools that merely make shipping easier.
Hackathon Caps Models at 32B Parameters to Reward Tinkerable AI Apps
Build Small is a Hugging Face and Gradio hackathon organized around a hard constraint: every model used must be under 32 billion parameters. Yuvraj Sharma framed the rule as a way to move AI building away from dependence on giant hosted models and back toward systems that participants can inspect, fine-tune, run locally, and ship as working Gradio Spaces. Sponsor presentations from Black Forest Labs, OpenBMB, OpenAI, NVIDIA, Modal, JetBrains, and Cohere largely reinforced that premise, offering small models, credits, tools, and prize categories meant to turn the constraint into runnable projects rather than demos in name only.
Voice AI Benchmarks Understate Errors in Real Multi-Speaker Audio
Hervé Bredin of pyannoteAI argues that voice AI benchmarks often make speech-to-text look more solved than it is by evaluating cleaner audio that is closer to single-speaker conditions. In his talk, he shows NVIDIA Parakeet scoring an 11.4% word error rate on AMI meeting audio in the Open ASR Leaderboard but 26% in pyannoteAI’s run on the same dataset using the table microphone rather than headset audio. Bredin’s broader case is that conversational AI needs fine-grained speaker diarization and speaker-attributed transcription, because words alone do not capture who spoke, when they overlapped, or how real multi-speaker conversations are structured.
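For readers unfamiliar with the metric behind those figures, word error rate is conventionally computed as the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the scoring code used by the Open ASR Leaderboard or pyannoteAI; the function name and the sample sentences are hypothetical, and real evaluations usually normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four words.
example_wer = word_error_rate("a b c d", "a x c")
```

Because the denominator is the reference length, a system that inserts many extra words, as can happen with overlapping speakers, can score above 100%.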
AI Voice Agents Are Beating the Average Customer-Service Rep
Tom Chen, chief product officer at Aircall, argues that AI voice agents should be judged against the average customer-service interaction, not the best human rep. In his account, the technology is already good enough for many routine calls, can handle far more concurrency at lower cost, and may improve satisfaction when customers are given a clear choice between faster AI service and a human agent. The main constraint, Chen says, is often not the model but the undocumented company knowledge the agent needs to resolve issues.
Coding Agents Are Becoming a Managed Workforce Inside Conductor
Conductor CEO and co-founder Charlie Holtz argues that AI coding tools should be managed more like a team of workers than used as autocomplete inside an IDE. In a demo of how he uses Conductor to build Conductor, Holtz shows a workflow built around starting multiple agent workspaces, reviewing their pull requests, and merging only the work that passes human judgment. He says the shift makes prompts, architecture, review discipline, and “slop-free” parts of the codebase more important as hand-written code becomes less central.
Useful AI Systems Are Emerging Inside Controlled Enterprise Workflows
TBPN’s latest discussion framed the commercial AI moment less as a race to looser autonomy than as a shift toward bounded systems. Across Microsoft’s Build announcements, Suno’s funding, creator films, stablecoins, crypto markets, cybersecurity, and workflow software, the central argument was that AI becomes useful when it is embedded in infrastructure that can price, route, audit, secure, or constrain it. John Coogan and guests applied that lens most directly to Microsoft’s agent strategy, where Azure and Microsoft 365, not a new phone, become the controlled operating environment for enterprise agents.
The Model Alone Is No Longer the AI Product
At AI Engineer Melbourne 2026’s Day 1 keynote program, speakers including Shawn Wang, George Cameron, Sarah Sachs, Igor Costa, Vamsi Ramakrishnan and Geoffrey Huntley argued that AI engineering has moved beyond picking the strongest model. Their shared case was that useful AI products now depend on the systems around models: harnesses, routing, evals, memory, state, latency budgets, deterministic tools and cost controls. The model still matters, but the keynote program framed product advantage as an architecture and economics problem, not a leaderboard problem.
Screen Fatigue Is Driving New Markets for Physical Consumer Products
Sam Parr and Shaan Puri use a My First Million episode to test seven unconventional business ideas against a narrower question: whether each points to real demand or just novelty. Their strongest cases are for anti-phone hardware, social wellness formats, physical screen-free media and VR trade training, where they argue odd-looking products attach to existing pressures such as phone addiction, screen fatigue and labor shortages. They are more skeptical of ideas that rely on unverifiable claims or inflated mission language, including AI pet translation and clinical-trial prediction markets.
Travelers Deploys AI Claims Assistant Nationwide After Eight-State Pilot
Travelers’ claims CIO Erik Roen argues that putting an AI assistant into first notice of loss required changing the operating model around claims, not just adding a model to a call flow. In a conversation with OpenAI chief revenue officer Denise Dresser, Roen says the insurer moved from an eight-state pilot to countrywide deployment by pairing OpenAI’s technology with cross-functional business ownership, continuous evaluations, near-real-time monitoring and fail-safes for a workflow that helps customers decide whether and how to file a claim.
Language Models Are Becoming the Bottleneck in Video Generation
Ethan He, who worked on NVIDIA’s Cosmos world model and xAI’s Grok Imagine, argues that the next major gains in video generation will come less from diffusion models alone than from language models, agents, and context management around them. In an interview with swyx and Vibhu Sapra, He describes Grok Imagine as a fast-built example of that shift: diffusion renders pixels, while language systems increasingly rewrite prompts, plan clips, call tools, manage memory, and turn short generations into longer, editable video.
A Two-Hour AI Prototype Let Museum Visitors Talk to Statues
Joe Reeve of ElevenLabs argues that his “talk to a statue” prototype mattered less as a museum product than as evidence of what can now be assembled quickly from existing AI APIs. Built in Cursor in about two hours, the app identifies a photographed statue, generates historical context and a plausible voice, spins up an ElevenLabs agent, and starts a conversation in roughly 30 seconds. Reeve says the harder remaining questions are institutional rather than purely technical: who authors the object’s story, what voice it should have, and how multimodal voice interfaces should work.
Sarvam and NVIDIA Build Full-Stack Sovereign AI Infrastructure for India
Sarvam co-founder Pratyush Kumar argues that India’s AI sovereignty cannot mean putting Indian-language interfaces on foreign-built systems. In a NVIDIA-backed account of Sarvam’s work, he describes a full-stack effort to build foundational models, data pipelines, inference systems and developer APIs inside India, using NVIDIA H100 clusters and NeMo tooling to process Indian-language data at scale. The case is that voice-first AI for India’s population requires domestic capability across data, models, applications and accelerated-compute expertise.
Voice Agents Need Colocated Models to Stay Under One Second
Rishabh Bhargava of Together AI argues that production voice agents are now constrained less by what works in a demo than by a sub-second engineering budget spanning speech-to-text, LLMs, text-to-speech, networking, and scaling. In his account, users notice delays above 500ms and abandon calls at around one second, making even 75ms network hops material once model latency is optimized. The practical architecture remains a cascade, he says, because it lets teams control tool calling, evaluation, and reliability while speech-to-speech models still lag on production requirements.
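The budget arithmetic can be made concrete with a rough sketch. Only the 500ms, one-second, and 75ms figures come from the summary above; the per-stage latencies and hop counts below are hypothetical placeholders chosen for illustration, not measurements from Together AI.

```python
NOTICEABLE_MS = 500   # from the summary: users notice delays above this
ABANDON_MS = 1000     # from the summary: calls are abandoned around here
HOP_MS = 75           # from the summary: cost of one network hop


def turn_latency_ms(stage_ms: dict[str, int], network_hops: int) -> int:
    """Total response latency: model stages plus network hops between them."""
    return sum(stage_ms.values()) + network_hops * HOP_MS


# Hypothetical per-stage latencies for a cascaded voice agent.
stages = {"speech_to_text": 100, "llm_first_token": 200, "tts_first_audio": 100}

# Colocated models: assume a single hop between the caller and the stack.
colocated = turn_latency_ms(stages, network_hops=1)

# Models hosted separately: assume a hop into and out of each service.
distributed = turn_latency_ms(stages, network_hops=4)

colocated_ok = colocated <= NOTICEABLE_MS
distributed_ok = distributed <= NOTICEABLE_MS
```

Under these assumed numbers the colocated stack lands at 475ms, just under the noticeable threshold, while the same models spread across four hops reach 700ms, which is why hop count matters once the models themselves are fast.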
Personal AI Systems Need Separate Layers for Memory and Autonomy
Nathan Labenz opens his personal AI infrastructure to a security audit by Daniel Miessler, showing a system that combines a high-context Claude Code “second brain” with lower-access autonomous agents for operational work. Their central argument is that useful personal AI should not collapse memory, authority, and autonomy into one assistant: raw personal history should be preserved and audited, while agents that act in the world need narrower permissions, clear roles, and containment. Miessler frames the longer-term model as an assistant that navigates from current state to ideal state while continually pruning obsolete scaffolding as models improve.
Hugging Face Ships a $299 Hackable Robot for Voice AI Experiments
Andres Marafioti argues that Hugging Face’s Reachy Mini is meant to move robotics experimentation out of expensive humanoid hardware and into a $299-to-$449 open-source platform that users can assemble, repair and modify themselves. The robot’s most-used application is conversation, and Marafioti’s account ties its social ambition to a technical stack built for low-latency speech: Parakeet transcription, Qwen 3.5 27B, and an optimized Qwen3 TTS implementation that he says improved from 0.8x to 5.8x real time.
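To see what the reported TTS speedup means for conversation, the sketch below converts a real-time factor into wall-clock synthesis time. It assumes "Nx real time" means N seconds of audio produced per second of compute, which the summary does not spell out; the ten-second utterance is a hypothetical example.

```python
def synthesis_seconds(audio_seconds: float, speed_x_realtime: float) -> float:
    """Wall-clock time to synthesize audio at a given multiple of real time."""
    if speed_x_realtime <= 0:
        raise ValueError("speed must be positive")
    return audio_seconds / speed_x_realtime


# Hypothetical 10-second reply at the two speeds quoted in the summary.
before = synthesis_seconds(10.0, 0.8)  # slower than real time
after = synthesis_seconds(10.0, 5.8)   # faster than real time
```

Below 1x, audio cannot be streamed without gaps because it is consumed faster than it is produced; above 1x, playback can begin as soon as the first chunk is ready.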
ElevenLabs Music v2 Adds Section Editing and Mid-Track Genre Shifts
ElevenLabs’ launch walkthrough for Music v2 presents the model as a more controllable generative music system, not only a higher-quality one. Alec Wilcock says the new version improves vocals, instrumentation, arrangement, multilingual output and dense vocal delivery, while adding section-by-section composition, targeted inpainting and the ability for one song to move between genres without losing coherence. The company also says the model is trained on licensed data and that generated tracks are cleared for commercial use.
Claude Code Reverse Engineers Viking VoIP Phone’s Undocumented Configuration Protocol
Boris Starkov of ElevenLabs presents the Viking K-1900D-IP phone as a reverse-engineering case study in which Claude Code turned an unusable, undocumented VoIP handset into a working AI demo. Starkov argues that Claude did the investigative work: discovering a two-letter command protocol, brute-forcing valid registers, intercepting the manufacturer’s Windows XP-era software through a TCP proxy, and deriving the one-byte checksum needed to write persistent configuration. His account is also a claim about agency in hardware work: he says he acted largely as Claude’s hands while Claude orchestrated the protocol break.
ElevenLabs Says Dubbing v2 Preserves Performance Across 90 Languages
ElevenLabs is introducing Dubbing v2 alpha as an AI dubbing model built around preserving the original speaker’s performance, not just translating a transcript. The company says the system conditions directly on source audio so tone, pacing, emphasis and emotional delivery can carry across more than 90 languages, with sync-aware translation adapting phrasing to fit the timing of the original. ElevenLabs is positioning the launch for creators, marketers and studios that want automated localization without building a separate dubbing pipeline.
Voice Will Become the Default Interface for Enterprise AI
Luiz Domingos, chief technology officer of Mitel, argues that enterprise AI has moved past pilots and into communications workflows where latency, compliance, auditability and human oversight determine whether systems can be deployed. In a conversation with Craig Smith, Domingos says cloud-only AI will not meet the needs of real-time voice and regulated industries, and that edge and hybrid deployments will become central. His larger prediction is that enterprise AI will increasingly be accessed by voice rather than screens, especially for frontline workers whose jobs do not fit a desktop interface.
ElevenLabs Adds Licensed Stan Lee AI Voice to Creator Tools
ElevenLabs is introducing an approved AI replica of Stan Lee’s voice through a partnership with Stan Lee Universe, positioning the late comic-book creator as a licensed feature inside its voice and creator tools. The company says users can request to license Lee’s voice for projects, hear it in Eleven Reader, generate Stan Lee cameos, and use Stan-inspired music, while repeatedly framing the launch around official authorization, rights ownership, and Lee’s mythology of stories being carried forward.
ElevenLabs Launches Music v2 for Licensed Commercial AI Song Generation
ElevenLabs is presenting Music v2 as a licensed-data AI music model built to generate vocal-led tracks from detailed natural-language prompts, not just loops or backing beds. The launch materials argue that the model can produce finished-sounding, one-shot outputs across styles and languages, while adding workflow features such as targeted inpainting, section-by-section composition, and deployment through ElevenMusic, ElevenCreative, and a forthcoming ElevenAPI.
Synthetic Intimacy, Surveillance, and Stimulation Are Raising the Cost of Impulse
Chris Williamson’s inaugural Mostly Wise conversation with Andrew Huberman, Matt McCusker and Tom Segura uses health advice, comedy, AI replicas and conspiracy talk to examine where useful tools become distortions. Huberman repeatedly argues for moderation and mechanism over slogans — from low-dose tadalafil and sleep protocols to cannabis, sunscreen and self-control — while Segura and McCusker test those claims against comedy, parenting and lived experience. The broader case is that modern life increasingly requires judgment about thresholds: when optimization becomes rumination, evidence becomes pattern-seeking, and synthetic intimacy or surveillance starts to reshape ordinary behavior.
Google’s GenAI Stack Turns Multimodal Prompts Into Application Pipelines
Google DeepMind’s Paige Bailey and Guillaume Vernade argue that Google’s generative AI stack is being organized as an application pipeline rather than a set of isolated models. In a three-hour workshop, Bailey showed AI Studio turning multimodal Gemini prompts into inspectable API calls and generated apps with auth and Firestore, while Vernade used Gemini, Nano Banana, Veo and Lyria to illustrate, animate and score The Wind in the Willows. Their case is that builders can now orchestrate prompt, code, media generation and deployment in one workflow, even as the demos exposed seams that still require engineering discipline.
Divergent Says Software-Defined Factories Can Build Drones in 71 Days
Lukas Czinger, co-founder of Divergent Technologies, argues that the bottleneck in defense hardware is not design but the tooling and fixed production lines that make iteration slow once a product leaves prototype. In a livestream interview, he said Divergent’s software-defined factory can move autonomous aircraft and other complex systems from digital design into production without rebuilding the supply chain around each change, citing a 71-day clean-sheet build of a flyable small uncrewed aircraft as proof of the model.
AI’s Bottlenecks Shift From Model Demos to Compute, Rights, and Institutions
AI, in TBPN’s latest discussion, is no longer treated mainly as a product demo but as a question of infrastructure, financing and institutional adoption. The strongest evidence came from SpaceX’s AI-heavy IPO framing, Anthropic’s reported move toward operating profit, and OpenAI’s claimed Erdős breakthrough, which the speakers used to challenge the “AI is a scam” critique. The unresolved issue is not whether the technology matters, but how quickly compute capacity, rights regimes, regulation and existing institutions can absorb it.
Gemini Omni Flash Replaces Veo as Google’s Default Video Model
ElevenLabs’ breakdown of Google’s I/O 2026 launch presents Gemini Omni as a major reset of Google’s AI video stack, with Omni Flash already replacing Veo as the default video model in the Gemini app. The source argues that the significance is not just better text-to-video generation, but a shift toward multimodal, conversational video creation: users can combine text, images, audio, video, and reference photos, then revise clips through successive instructions while preserving characters and scenes.
Google’s AI Assets Are Becoming a Product Coherence Problem
John Coogan and Jordi Hays read Google’s I/O as evidence that the company’s AI advantage is becoming a product-navigation problem: it has data, distribution, models and hardware partnerships, but its demos and product names left questions about coherence and pace. Across the source, that same pressure appears in more operational forms, as AI pushes companies to turn technical capability into usable workflows, secure software dependencies and faster product systems. Tae Kim’s Nvidia argument and the expected SpaceX IPO make the capital-market version of the question explicit: whether investors will keep paying for scarce infrastructure, extreme scale and growth curves that may take years to prove out.
Any-to-Any Agents Rely on Orchestrated Multimodal Models, Not One Network
Google DeepMind’s Patrick Löber presents “any-to-any” agents as an orchestration problem rather than a claim that one model already handles every modality. In his architecture, Gemini reads and reasons across PDFs, images, audio, video and other sources, then uses function calling to invoke specialized native models for images, speech, live audio, video or embeddings. Löber argues that the useful shift is not generating every possible format, but letting an agent decide when a diagram, spoken explanation or other output is warranted.
Gemini’s Strategy Shifts From Frontier Leaderboards to Deployable AI Infrastructure
Google DeepMind executives Tulsee Doshi and Logan Kilpatrick argue that Google’s current Gemini strategy is built less around a single frontier model than around a deployable AI stack. In their account, Gemini 3.5 Flash, the Antigravity agent harness and new multimodal products such as Omni are meant to make models fast, cheap and integrated enough to run across Search, the Gemini app, AI Studio, YouTube and enterprise tools. The deeper shift, Kilpatrick says, is that the model is increasingly absorbing the scaffolding that once surrounded it, while Google standardizes the remaining agent infrastructure across its products.
Fine-Tuning Pushed FunctionGemma From 46% to 90% Function-Calling Accuracy
Cormac Brick, a Google AI Edge engineer, argues that on-device agents are becoming practical when developers either use system models such as Gemini Nano through Android AI Core or ship narrow, fine-tuned tiny models with LiteRT-LM. His main example is FunctionGemma, a 270 million parameter function-calling model that rose from about 46% accuracy out of the box to more than 90% on most tested app-intent functions after synthetic-data fine-tuning. Brick presents the tradeoff plainly: system GenAI is easier when it fits, while app-shipped tiny models require more work but can run locally, offline, and with more control.
ElevenLabs Adds Albert Einstein’s Voice to Its Licensed AI Marketplace
ElevenLabs is offering a licensed AI version of Albert Einstein’s voice through its Iconic Marketplace, positioning it for narration, education, documentaries, and immersive storytelling. The company argues that Einstein’s voice can be used as both a cultural artifact and a creative tool, while saying the marketplace is curated and that each voice is approved and managed with the relevant rights holder.
Gemini Becomes the Prompt Engineer for Google’s Gen Media Stack
Google DeepMind developer advocate Guillaume Vernade demonstrates a gen-media workflow built around Gemini as the orchestrator rather than as a one-shot generator. Using The Wind in the Willows, he shows Gemini reading the full book, producing structured prompts and scripts, and handing them to Nano Banana, Veo, Lyria and TTS models for images, video, music and narration. His broader case is that multimodal production depends less on a single model than on schemas, reference assets, state management, cost controls and prompt handoffs between specialist systems.
Abridge Bets Clinical Conversations Can Become Healthcare’s Intelligence Layer
Abridge executives Janie Lee and Chaitanya “Chai” Asawa argue that the patient-clinician conversation is becoming healthcare’s core intelligence layer, not merely an input for automated notes. In a discussion with Redpoint’s Jacob Effron, they describe Abridge’s move from ambient documentation into clinical decision support, prior authorization and other workflows that depend on EHR data, payer rules, medical literature and local guidelines. Their case is that healthcare AI will be judged less by chatbot fluency than by whether it can deliver accurate, low-latency, privacy-preserving support inside clinical workflows without adding to clinicians’ alert burden.
GPT-Realtime-2 Turns Voice Agents Into Tool-Using Reasoning Systems
OpenAI’s Build Hour on GPT-Realtime-2 presented the new realtime voice release as a shift from conversational voice interfaces toward tool-using, stateful agents. Teri Yu and Erika Kettleson argued that GPT-Realtime-2’s larger context window, stronger instruction following, parallel tool calling and controllable speech behavior let developers build voice systems that can operate apps, reason across workflows and know when not to speak. Sierra’s Ken Murphy and Soham Ray added that production voice agents still depend on the surrounding system: guardrails, tuned turn-taking, tracing, redaction, evaluations and customer-specific workflows.
Agent Workflows Route Conversations Through Specialized Subagents
ElevenLabs is introducing Workflows, a visual editor for its Agents Platform that lets builders design routed conversation flows instead of placing all business logic inside one agent prompt. The company argues that specialized subagents, each with their own instructions, tools, knowledge bases and model choices, give teams more control over cost, latency and accuracy. The product is positioned as a way to combine AI interpretation with predefined actions, verification steps and human handoffs on the same design surface.
Suno Bets That Making Songs Can Become a Mass Consumer Medium
Suno founder and CEO Mikey Shulman argues that AI music should not be understood as a cheaper substitute for streaming catalogs, but as a new form of active consumer entertainment. In a conversation with Sequoia’s Sonya Huang, he says Suno’s technical choices — modeling raw sound, prioritizing full songs, and using preference data rather than conventional benchmarks — support a product thesis that making music can be as much the point as listening to it. Shulman also frames partnerships with labels such as Warner as central to building new participatory music formats, not as a concession to incumbents.
Altman Testimony Casts Musk’s OpenAI Claims as a Fight Over Control
OpenAI’s trial, Anthropic’s secondary-market flare-up, and two media deals are read on Diet TBPN as fights over control, enforceability, and credibility. John Coogan argues that Musk v. OpenAI is increasingly not only about whether OpenAI betrayed its nonprofit mission, but whether Elon Musk accepted a for-profit path only if he controlled it; Jordi Hays frames the Anthropic panic as a test of whether private-company transfer restrictions can hold against demand for AI exposure. Coogan and Hays treat Thinking Machines’ demo separately, as a bet that real-time interaction should be native to AI models, while eBay’s rejected GameStop bid and Byron Allen’s BuzzFeed investment turn on market confidence.
Platform Dependence Is Breaking Across AI Products and Digital Media
AI and media incumbents are being forced to respond to systems changing faster than their strategies, regulations or business models. Sriram Krishnan, Aarthi Ramamurthy and Condé Nast chief executive Roger Lynch make that case across AI regulation that may miss the next generation of products, private AI investing repackaged through SPVs, and media businesses built on platform traffic that is disappearing. Lynch’s counterpoint is that media companies can still endure if they move away from click incentives and toward authority, direct audience relationships and human creative work.
Apple-Device AI Is Becoming Viable Without Cloud Inference
Prince Canuma presents MLX, Apple’s array framework for Apple Silicon, as a practical foundation for running AI agents locally rather than through cloud services. His case is rooted in accessibility and unreliable connectivity, but extends to product constraints for voice agents, robots and multimodal apps: vision, speech, video generation and long-context inference can increasingly run on Macs, iPhones and iPads without a network call. Canuma’s argument is not that local models replace every frontier cloud system, but that the boundary has moved far enough to make on-device AI a serious deployment option.
Production AI Features Need Feedback Loops, Not One-Shot Prompts
Mehedi Hassan, a product engineer at Granola, argues that the hard part of shipping AI features is not getting a model to work once in a demo, but making its behavior reliable and inspectable in production. Using Granola’s meeting-notes app as the case, he says web search, chat, and prompt personalization quickly expose costs, context limits, provider instability, and role-specific user expectations that a single prompt cannot absorb. Granola’s response, in his account, was to build feedback loops: internal tracing, broadly usable debugging tools, and faster ways to test product variants before shipping.
Text-to-Speech Models Are Converging on LLM-Style Architectures
Samuel Humeau of Mistral argues that modern text-to-speech has converged on an architecture that resembles large language modeling: an autoregressive transformer generates compressed audio tokens frame by frame, rather than raw waveform samples. Using Mistral’s open-weight Voxtral TTS model as the example, he says neural audio codecs make that possible by reducing dense speech signals to token-like representations a transformer can handle. The remaining latency frontier, in his account, is not just streaming playable audio early, but letting TTS consume an LLM’s text stream as it is still being written.
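To make the frame-by-frame idea concrete, here is a minimal Python sketch of an autoregressive loop over codec frames. It is not Voxtral TTS or any Mistral code: the sample rate, frame rate, codebook count and codebook size are illustrative assumptions, and `toy_next_frame` is a hypothetical stand-in that returns random tokens where a real system would run a transformer forward pass.

```python
import random
from typing import Iterator, List

# All four constants are assumptions chosen for illustration only.
SAMPLE_RATE_HZ = 24_000   # raw waveform samples per second
FRAME_RATE_HZ = 12.5      # codec frames per second
CODEBOOKS = 8             # tokens emitted per frame
CODEBOOK_SIZE = 2048      # vocabulary size of each codebook


def tokens_per_second() -> float:
    """Tokens the model must generate for one second of audio."""
    return FRAME_RATE_HZ * CODEBOOKS


def toy_next_frame(text: str, history: List[List[int]],
                   rng: random.Random) -> List[int]:
    """Hypothetical stand-in for a transformer step.

    A real model would condition on `text` and `history`;
    this stub ignores both and samples random token ids.
    """
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(CODEBOOKS)]


def generate_frames(text: str, seconds: float,
                    seed: int = 0) -> Iterator[List[int]]:
    """Yield one frame of codec tokens at a time, autoregressively."""
    rng = random.Random(seed)
    history: List[List[int]] = []
    for _ in range(int(seconds * FRAME_RATE_HZ)):
        frame = toy_next_frame(text, history, rng)
        history.append(frame)  # each new frame extends the context
        yield frame            # a codec decoder could start playback here


if __name__ == "__main__":
    frames = list(generate_frames("hello", seconds=2.0))
    print(len(frames), "frames,", tokens_per_second(), "tokens/s")
```

Under these assumed numbers the model emits 100 tokens per second of audio instead of 24,000 waveform samples, which is the kind of reduction that lets a language-model-style transformer handle speech. Because frames are yielded one at a time, playback can begin before the full utterance has been generated.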
Voice AI Still Confuses Natural Speech With Real Conversation
Neil Zeghidour, CEO of Gradium AI and one of the researchers behind the full-duplex voice model Moshi, argues that voice AI’s long-promised “Her” moment is still being confused with better synthetic speech. His case is that cascaded voice agents are useful but structurally too slow and lossy to feel conversational, while speech-to-speech models improve flow but remain limited unless they can listen and speak simultaneously, use tools reliably, understand paralinguistic cues, and run cheaply enough to scale.
ElevenLabs Voice Engine Wraps Existing Chat Agents Without Rebuilding Them
Luke Harries of ElevenLabs argues that the next step for chat agents is not a new orchestration stack but a voice layer around the agents companies have already built. His case for ElevenLabs’ Voice Engine is that teams can keep their existing LLM logic, RAG, tools and business rules, while offloading speech-to-text, text-to-speech, turn-taking and interruption handling to a wrapper. The product is positioned for companies that want voice interfaces across web, phone and meeting channels without rebuilding their chat agents inside a fully managed platform.
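The wrapper pattern described here can be sketched in a few lines of Python. This is an illustration of the general idea only, not the Voice Engine API: `VoiceWrapper`, `handle_turn` and the three callables are hypothetical names, and the speech stages are stubbed with byte encoding so the example runs without any audio service. Turn-taking and interruption handling are omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceWrapper:
    """Hypothetical voice layer around an existing text-only chat agent."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    chat_agent: Callable[[str], str]     # existing agent, left unchanged
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Run one user turn: audio in, audio out."""
        text_in = self.transcribe(audio_in)
        reply = self.chat_agent(text_in)  # LLM logic, RAG, tools live here
        return self.synthesize(reply)


def existing_agent(message: str) -> str:
    """Stand-in for a team's current chat agent."""
    return f"You said: {message}"


# Stub speech stages: treat the audio bytes as UTF-8 text.
wrapper = VoiceWrapper(
    transcribe=lambda audio: audio.decode("utf-8"),
    chat_agent=existing_agent,
    synthesize=lambda text: text.encode("utf-8"),
)

if __name__ == "__main__":
    print(wrapper.handle_turn(b"hello"))
```

The point of the design is that `existing_agent` never learns it is being used over voice: the speech stages can be swapped without touching the agent's logic.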
ElevenLabs Shows Voice Isolator Cleaning Noisy iPhone Audio in Seconds
ElevenLabs presents Voice Isolator, a tool inside ElevenCreative, as a fast way to salvage noisy recordings that cannot be reshot. In the tutorial, the company demonstrates a single workflow on an iPhone recording made on a London street: upload or drag in the file, click send, and play back an isolated voice track. The presenter says the street noise is removed and the file is processed within nine seconds, while interviews, podcasts, social clips, audio files and video files are named as broader use cases.
OpenAI Splits Audio API Into Translation, Transcription, and Voice-Agent Models
OpenAI is presenting three new API audio models as infrastructure for voice applications that can translate, transcribe, reason and act in real time. Romain Huet’s demonstration centered on GPT-Realtime-Translate, which keeps pace with multilingual speech, and GPT-Realtime-2, a voice-agent model that can follow turn-taking instructions, use business context and call tools while explaining its work. GPT-Realtime-Whisper completes the set as a streaming speech-to-text model for live transcription.
Voice Will Be the Primary Interface for AI Agents and Robots
At Sequoia’s AI Ascent 2026, ElevenLabs co-founder and CEO Mati Staniszewski argues that audio was an overlooked frontier in 2022 because the AI field was focused on text and images, leaving room for a smaller company to build quickly and monetize early. His broader case is that as AI becomes more capable, voice becomes the interface problem: the way people will use agents, robots, services, education and healthcare. Staniszewski says the next hard problems are emotional intelligence, timing, authentication and workflow, not merely making synthetic speech sound human.
Descript Bets Creator AI on Reliable Editing, Not Content Slop
Laura Burkhauser, Descript’s chief executive, distinguishes generative AI tools for creators from the “slop” she defines as mass-produced content arbitrage. Her case is that Descript’s future depends less on adding AI everywhere than on making editing automation reliable, reversible and useful for recorded human media. That means choosing third-party models by fit and taste, building in-house systems where Descript has workflow data, and treating creator backlash as a product constraint rather than a branding problem.