ElevenLabs Unveils Dubbing v2 and Previews More Controllable Eleven v4

Mati Staniszewski · ElevenLabs · Sunday, June 7, 2026 · 10 min read

ElevenLabs co-founder Mati Staniszewski used a Warsaw summit keynote to argue that AI’s next constraint is not intelligence but communication people can trust. He presented two new models — Dubbing v2, designed to preserve an original performance across languages, and a preview of Eleven v4, aimed at finer control over speech, emotion, accent, whispering and song — as evidence of that thesis. The broader case was that voice AI becomes commercially useful only when models are tied to agents, integrations, authentication, memory and deployment systems that let companies put spoken interfaces into production.

The bottleneck is not intelligence, but communication

Mati Staniszewski framed ElevenLabs around a specific bet: AI adoption will be constrained less by raw intelligence than by whether systems can communicate in ways people trust. In 2022, he said, most AI work was focused on “solving intelligence.” ElevenLabs started from the view that if an AI “sounds robotic or interacts strangely,” it will not be widely adopted by the people it is meant to serve.

That thesis led the company first to voice, which Staniszewski described as the place where the gap between human and AI communication is most obvious. Voice is not only a matter of converting text into audio. In his account, the hard problem includes tone, emotion, prosody, cultural nuance, and the back-and-forth of conversation. The aim he stated is broader than text-to-speech: “AI that communicates at a human level across every channel and modality.”

The setting mattered to the argument. Staniszewski opened at Teatr Wielki, the Polish National Opera, describing it as a building designed to carry a single voice over an orchestra while preserving clarity. He tied that acoustical and cultural setting to the company’s origin: he and co-founder Piotr Dabkowski grew up in the Warsaw suburbs and met at Copernicus 33rd High School when they were 15, and Dabkowski built ElevenLabs’ first model in Warsaw.

That first model, Staniszewski said, was “contextually aware, emotional,” and crossed the uncanny valley of speech. Since then, the company has built models across transcription, orchestration, dubbing, and music. The technical substance centered on two new models: Dubbing v2 and a preview of Eleven v4.

Dubbing v2 is designed to preserve the original performance, not just translate the words

Mati Staniszewski introduced Dubbing v2 as ElevenLabs’ “first end-to-end dubbing model,” built to address what he called one of the largest weaknesses in traditional AI dubbing: flat audio. The company’s founding inspiration, he said, came partly from watching foreign films dubbed into Polish by a lector, where “whether it's male or female, every voice, every character, and every emotion is the same.”

The difference he claimed for Dubbing v2 is architectural. Instead of generating speech only from a transcript, the model has what he called a “new fused architecture” that conditions on the original audio. In practical terms, that means the model can hear the source performance — its emotion, emphasis, and delivery — and carry those qualities into the target language.

Rather than generating a speech from a transcript, Dubbing v2 has a new fused architecture that allows it to condition on the original audio.

Mati Staniszewski

A Ramp commercial illustrated the claim by moving across English, Polish, and German. The visible commercial copy included a finance-department scene with expense categories, receipts, invoices, purchase orders, blocked categories, automated expense syncing, and Ramp’s “Love Finance” tagline. The staged sequence showed the same ad moving between languages: English setup, Polish lines about receipts and expense reports, German lines about integrated card policies and automated expense submission, then Polish again for the closing finance line.

Staniszewski’s claim was not that automated dubbing eliminates all finishing work. He said Dubbing v2 makes the automated dub “great quality” and allows further editing “to make it truly excellent.” The significance, as presented, is that dubbing moves from transcript-level translation toward performance transfer.

Eleven v4 pushes voice control into whispers, singing, accents, and mid-line emotion shifts

Mati Staniszewski presented Eleven v4 as an unreleased text-to-speech model and an “exclusive preview.” A slide described it as “the most realistic & controllable model.” Staniszewski emphasized control: the ability to direct not only what a voice says but how it performs each line.

The English sample was written to test range rather than simply naturalness. It staged two voices talking about Warsaw as “old and new at the same time,” then deliberately shifted through registers: angry and dark voices, a “proper” accent, non-robotic delivery, whispering, emotion changes within a line, a Shakespearean “To be, or not to be,” and a sung fragment of “A Whole New World.” The point of the sample was to show that a generated voice could move between styles that usually expose synthetic speech: intimacy, theatricality, accent, song, and sudden emotional change.

Staniszewski did not present the singing and whispering as finished endpoints. He said quality would “continue getting better,” but that the sample conveyed “how incredible the voice can be.” He then played a Polish sample, which began casually — “Cześć. Nie, zaczekaj. Daj mi sekundę” — and then escalated into a more theatrical address to the room: asking whether people were listening, claiming to hear when the room goes silent, when someone holds their breath, and when “something inside moved.”

For Staniszewski, the takeaway was that Eleven v4 gives “extreme control over every line,” with improved voice quality and performance. He said it would launch “in just a couple of weeks.”

This model gives you the extreme control over every line.

Mati Staniszewski

The product stack is meant to connect models, brand systems, agents, and deployment

Mati Staniszewski separated ElevenLabs’ work into three layers: research, product, and deployment. Research produces the models, but applying AI in the world, in his view, requires a full platform. He described the product layer as a set of connected tools — ElevenCreative, ElevenAgents, and ElevenAPI — built around a shared understanding of an organization’s audience, brand, and goals. The product-suite slide used the formulation “a platform that can power millions of on-brand interactions.”

Product area	Role in Staniszewski’s framing
ElevenCreative	Content production: brand voice, campaign production, localization across image, video, and audio, and repeatable creative workflows.
ElevenAgents	Conversational experiences: voice agents for product, operations, and support teams, with orchestration, integrations, monitoring, and enterprise control.
ElevenAPI	Developer access to the same platform capabilities, presented as part of the interconnected product suite.

The product-suite slide presented ElevenCreative, ElevenAgents, and ElevenAPI as connected parts of one platform.

ElevenCreative was positioned for content production: marketing teams designing a brand voice, producing and localizing campaigns across image, video, and audio, and scaling those workflows. Staniszewski also mentioned a “white-glove audio service” from ElevenLabs’ production team for narration and localization.

ElevenAgents was positioned for conversational experiences. Product, operations, and support teams, he said, can use it to deploy voice agents in more than 70 languages with enterprise-grade security and control. He described the product as combining orchestration, configuration, integrations, and monitoring.

70+ languages supported for ElevenAgents voice agents, according to Staniszewski

He said ElevenAgents is already powering 5 million agents, which handle “2.5 years of conversation” every day. The dashboard slide behind him showed the same figures, alongside interface references to workflows, analytics, calls, web, and custom channels.
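
As a rough sanity check on those figures, the arithmetic can be sketched as below. The assumptions are not from the keynote: a 365-day year, “2.5 years of conversation” read as total talk time summed across all conversations in one day, and all 5 million agents counted equally. The function names are illustrative.

```python
# Back-of-the-envelope check of the keynote figures.
# Assumptions (not from the talk): a 365-day year, and "2.5 years of
# conversation" meaning total talk time summed across all agents per day.

HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption
SECONDS_PER_HOUR = 3600


def daily_talk_hours(years_per_day: float) -> float:
    """Convert 'years of conversation per day' into hours per day."""
    return years_per_day * DAYS_PER_YEAR * HOURS_PER_DAY


def average_concurrency(years_per_day: float) -> float:
    """Average number of conversations in progress at any moment."""
    return daily_talk_hours(years_per_day) / HOURS_PER_DAY


def seconds_per_agent(years_per_day: float, agents: int) -> float:
    """Average talk time per agent per day, in seconds."""
    return daily_talk_hours(years_per_day) * SECONDS_PER_HOUR / agents


if __name__ == "__main__":
    print(daily_talk_hours(2.5))              # 21900.0 hours per day
    print(average_concurrency(2.5))           # 912.5 conversations at once
    print(seconds_per_agent(2.5, 5_000_000))  # about 15.8 seconds per agent
```

Under these assumptions the claim works out to roughly 21,900 hours of conversation a day, or about 900 conversations running at any given moment. The per-agent average is small, which suggests that many of the 5 million agents are lightly used on a given day.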

The central product claim was consistency across touchpoints. In Staniszewski’s framing, one platform should let a company present the same brand experience from marketing content, to sales onboarding, to support troubleshooting. Lead qualification was one example: instead of prospects waiting days for a response, a voice agent can begin a live conversation immediately, ask questions, qualify the prospect, route them to the right representative, answer nuanced questions, and sync details into a CRM.

Healthcare was his example of agents moving beyond reactive support. He described post-discharge check-ins, medication reminders, and screening follow-ups happening at the right time and in the patient’s language, without requiring a human scheduler. He framed these as routine outreach tasks doctors “simply just can't do for everyone with the time that's available.”

The deployment layer was the least glamorous and, in Staniszewski’s account, essential. For established companies, production AI requires work on legacy systems, workflows, observability, change management, and reliability. ElevenLabs’ forward deployed engineers embed with customers, connect domain knowledge to the company’s technology, and relay deployment feedback to research and product. A slide showed this as a loop among research and product, forward deployed engineering, and customers: data and deployment flow through forward deployed engineering, while customer deployments feed back into product and research.

The travel demo treated voice as an interface for authentication, memory, routing, and action

Mati Staniszewski used a tourism website for Warsaw to show what a voice agent could do when connected to tools, memory, and multiple sub-agents. He acknowledged the cliché directly: “everyone uses their personal travel agent example, but no one has actually built it yet.” He then said ElevenLabs had signed a partnership the previous week with the Greek government to build voice agents for tourism.

The initial agent was not just a spoken search box. Staniszewski described a system composed of sub-agents for practical travel information, local experiences, refunds, schedules, and other parts of the flow. In the first exchange, the agent greeted the user, recommended a one-day Warsaw itinerary beginning with Old Town and the POLIN Museum, and then identified an uploaded image as the National Theatre in Warsaw. When asked about booking, it offered a ballet performance of Dracula.
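
The sub-agent structure described above can be sketched as a simple router. This is a minimal illustration and not ElevenLabs’ implementation: the `SubAgent` class, the `route` function, the sub-agent names, and the keyword lists are all hypothetical, and keyword overlap stands in for whatever intent detection a production system would use.

```python
# Hypothetical sketch of a top-level agent routing to sub-agents.
# Names and keywords are illustrative; keyword overlap stands in for
# whatever intent detection a production system would use.
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SubAgent:
    name: str
    keywords: Tuple[str, ...]


SUB_AGENTS: List[SubAgent] = [
    SubAgent("practical_info", ("metro", "transport", "tram")),
    SubAgent("local_experiences", ("itinerary", "museum", "ballet")),
    SubAgent("refunds", ("refund", "cancel")),
    SubAgent("schedules", ("schedule", "timetable")),
]


def route(utterance: str) -> SubAgent:
    """Pick the sub-agent whose keywords overlap most with the utterance."""
    tokens = set(re.findall(r"[a-z]+", utterance.lower()))
    scored = [(len(tokens & set(agent.keywords)), agent) for agent in SUB_AGENTS]
    best_score, best_agent = max(scored, key=lambda pair: pair[0])
    # With no keyword match, fall back to the first (general) sub-agent.
    return best_agent if best_score > 0 else SUB_AGENTS[0]
```

In this sketch, `route("Can I get a refund for my ticket?")` selects the `refunds` sub-agent, while an utterance with no matching keyword falls back to `practical_info`.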

The more important version added personalization. Staniszewski said two capabilities were needed: authentication and memory. Authentication allows the agent to know that the user is logged in and that the caller is who they claim to be; he mentioned WhatsApp, security questionnaires, and login flows, with templates available for those functions. Memory and integrations allow a concierge agent to know past experiences, preferences, availability, and communication channels. In the example, the agent had access to Google Calendar for availability, WhatsApp and Gmail for notifications and information, and Salesforce for recording visits and retrieving context.
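
The relationship between the two capabilities can be sketched as a gate: personal memory is exposed to the agent only once the caller is authenticated. Everything here is a hypothetical illustration rather than ElevenLabs’ API. `Session`, `MemoryStore`, `authenticate`, and `context_for` are invented names, and a shared one-time code stands in for the WhatsApp, questionnaire, or login flows mentioned in the talk.

```python
# Hypothetical sketch: memory is only readable after authentication.
# A one-time code stands in for a real login or verification flow.
import hmac
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Session:
    user_id: Optional[str] = None  # stays None until authentication succeeds


@dataclass
class MemoryStore:
    records: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def remember(self, user_id: str, key: str, value: str) -> None:
        self.records.setdefault(user_id, {})[key] = value

    def recall(self, user_id: str) -> Dict[str, str]:
        return dict(self.records.get(user_id, {}))


def authenticate(session: Session, user_id: str, supplied: str, expected: str) -> bool:
    """Mark the session as authenticated if the supplied code matches."""
    # compare_digest avoids leaking match position through timing.
    if hmac.compare_digest(supplied.encode(), expected.encode()):
        session.user_id = user_id
        return True
    return False


def context_for(session: Session, memory: MemoryStore) -> Dict[str, str]:
    """Return personal context only for an authenticated session."""
    if session.user_id is None:
        return {}
    return memory.recall(session.user_id)
```

With a stored preference such as `memory.remember("u1", "seating", "stalls")`, `context_for` returns an empty dictionary until `authenticate` succeeds for that session, and the stored preference afterwards.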

Once those parts were connected, the agent recognized him and resumed the theatre booking. It knew he had previously booked Dracula and suggested Romeo and Juliet instead. It used his calendar to identify the day his friends arrived in Warsaw. When he said the most important thing was sitting together, the agent found four adjacent seats on June 3 in the stalls, which it described as his preferred area for the atmosphere of the show.

The agent then suggested adding a Chopin recital, processed the booking in the background, answered a transport question by recommending the M1 line to Ratusz Arsenał, and sent the tickets by WhatsApp. The substance of the demonstration was not any one booking step, but the combination: a voice agent could use identity, preference, calendar context, web navigation, payment or booking workflows, messaging, and background task execution while the conversation continued.
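
The background-execution pattern, in which a booking is processed while the conversation continues, can be sketched with Python’s `asyncio`. This is a minimal illustration under stated assumptions: `process_booking`, `answer`, and `conversation` are hypothetical names, and the short sleep stands in for a slow call to a booking system.

```python
# Hypothetical sketch: a booking runs in the background while the agent
# keeps answering. The sleep stands in for a slow booking-system call.
import asyncio
from typing import Tuple


async def process_booking(seats: int) -> str:
    await asyncio.sleep(0.05)  # simulated latency of the booking backend
    return f"booked {seats} seats"


async def answer(question: str) -> str:
    await asyncio.sleep(0)  # yield control once, as a real reply would
    return f"answer to: {question}"


async def conversation() -> Tuple[str, bool, str]:
    # Start the booking without waiting for it to finish.
    booking = asyncio.create_task(process_booking(4))
    reply = await answer("How do I get to the theatre?")
    still_pending = not booking.done()  # booking is still in flight here
    confirmation = await booking        # collect the result when ready
    return reply, still_pending, confirmation


def run_demo() -> Tuple[str, bool, str]:
    return asyncio.run(conversation())
```

Calling `run_demo()` returns the reply first, a flag showing the booking was still pending when the reply was ready, and then the booking confirmation.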

The demo then handed off to a special theatre guide using the cloned voice of Piotr Fronczewski, whom Staniszewski described as one of Poland’s most renowned actors. The guide introduced himself as someone who had “graced that stage many times” and explained that in 1969 the theatre installed a “moon ceiling” with thousands of ceramic bowls, creating a celestial effect and improving the opera’s acoustics. When Staniszewski asked for the same explanation in Polish, the voice repeated it in Polish.

Fronczewski had given permission for the use of his voice, Staniszewski said. He presented the handoff as an example of future interactions with cultural icons and experts, something that goes beyond a booking assistant. The broader point was that a voice agent could authenticate, personalize, navigate a website, process a booking, branch work in the background, answer practical questions, and move between agents while preserving voice continuity.

The enterprise examples move from support to embedded, domain-specific agents

Mati Staniszewski closed the business case with customer examples. Deutsche Telekom, described on the slide as Europe’s largest telco with more than 250 million mobile customers, started with conversational agents for customer support. The work is moving, he said, toward embedding agents inside the telecom network, so users can summon them directly for assistance or real-time translation during a call.

MasterClass was framed around access to expertise. Staniszewski said the company is creating agents for instructors, so users can talk with Gordon Ramsay about cooking or practice negotiation with former FBI hostage negotiator Chris Voss. He added that more than 75% of customers who use the conversational experience choose to engage by voice.

75%+ MasterClass conversational-experience users who choose voice, according to Staniszewski

In Poland, LOT Polish Airlines is building an AI agent for reservations and customer support, with human handoff as needed, and is looking toward a broader reinvention of the customer journey. The National Health Fund example was healthcare infrastructure: a voice agent for Centralna e-Rejestracja. The slide stated a goal to prevent “10 million+ missed appointments”; Staniszewski described the potential as preventing “tens of millions of missed appointments each year” and significantly improving access to healthcare across Poland.

These examples were used to support a single claim: conversational agents are not only a demo interface but a production tool for customer journeys. The agent can sit in support, bookings, public services, education, tourism, or telecom infrastructure, provided it can connect to the relevant systems, enforce handoffs, and speak in a way customers will actually use.

Poland is presented as origin story and operating base, not just stage backdrop

Mati Staniszewski made the Warsaw setting part of the closing argument, not merely the venue. He tied ElevenLabs’ founding story to Poland’s broader economic and engineering trajectory, saying the country is now a “trillion-dollar economy,” “on the cusp of joining the G20,” and home to engineers building important technology companies. Growing up in that environment, he said, gave ElevenLabs’ founders both ambition and skills.

The personal register remained present: government figures and families, including those of ElevenLabs engineers and Staniszewski’s own parents, were in the audience. But he connected that homecoming to a broader market claim. Organizations across industries, he said, are beginning to transform how they communicate with the people they serve.

The future he described is one where AI expands access to support and expertise, builds personalized experiences for important audiences, crosses language barriers, and amplifies human potential. In his telling, that depends on voice systems that can perform, listen, authenticate, act, route, and integrate — not simply generate speech from text.
