Dubbing v2 Preserves Speaker Performance Across 90-Plus Languages

ElevenLabs | Friday, June 12, 2026 | 6 min read

ElevenLabs presents Dubbing v2 as an AI dubbing model designed to transfer a speaker’s performance across more than 90 languages, not just translate the words. The company argues that by conditioning on the original audio rather than a transcript, the system can preserve voice, tone, emphasis, emotion, and timing while adapting phrasing for natural delivery in the target language. The walkthrough positions the tool as an automated localization workflow for creators, marketers, and studios, with speaker similarity as the main setting users adjust to trade voice resemblance against native-language naturalness.

Dubbing v2 is built around performance transfer, not transcript translation

Dubbing v2 is presented as an AI dubbing model that carries the original speaker’s performance into another language, not only the semantic content. The core distinction is that the model listens to the original recording rather than relying only on a transcript. In the explanation given, that is what allows the dub to preserve tone, emphasis, and emotion in the speaker’s delivery, while automatically cloning the voice so the result still sounds like the source speaker.

The opening example states the promise directly: a speaker moves across English, Spanish, and French while saying the system can keep “your voice, your tone, and your way of speaking.” In ElevenCreative, the feature is described on screen as supporting 92 languages, using sync-aware translations, providing automatic voice cloning, and preserving emotion and delivery. The spoken explanation elsewhere uses “over 90 languages,” while a visual counter lands on “90+ Languages.”

90+ languages supported for Dubbing v2, as described in the walkthrough

The translation claim is also about fit. Dubbing v2 is described as adapting phrasing so the target language sounds natural to a native speaker rather than translating word for word. The source description says its sync-aware translation logic adapts phrasing for natural delivery and keeps starts and stops aligned to the original out of the box. In the product visuals, “Translations are sync-aware” appears beside versions of the phrase in Spanish, Arabic, French, and Japanese, with translated wording indicating that phrasing adapts naturally.

The product claim is therefore not simply “translate this video.” Translation, voice cloning, delivery preservation, and timing alignment are treated as one workflow. The practical standard used throughout is whether the dubbed clip still sounds like the source speaker, whether it sounds natural in the target language, and whether it lines up with the existing edit.

The workflow accepts files or URLs, with hard limits on duration and size

Inside ElevenCreative, dubbing begins from the Dubbing tool in the left sidebar. If Dubbing is not visible, it can be found under “More tools” and pinned to the sidebar. The tool accepts either a video upload or a pasted URL; TikTok and YouTube links are given as examples.

For file upload, the interface allows selecting a local file or dragging and dropping it into the tool. The demonstration uses a short video clip from the ElevenLabs Warsaw Summit, shown in the UI as “Warsaw summit demo booths_original.MOV.”

The operational constraints are specific. A source video must be at least 11 seconds long. The maximum duration is 180 minutes. The file size must be below 2GB.

Constraint	Limit described
Minimum video length	11 seconds
Maximum video duration	180 minutes
Maximum file size	Below 2GB

Input limits shown or stated for Dubbing v2 in ElevenCreative
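The limits above can be checked before upload. The following is a minimal sketch in Python, not part of ElevenCreative or any ElevenLabs API: the function name `check_dubbing_input` is hypothetical, and reading "2GB" as decimal gigabytes is an assumption, since the walkthrough does not say which unit is meant.

```python
# Limits as stated in the walkthrough for Dubbing v2 in ElevenCreative.
MIN_DURATION_S = 11             # minimum video length: 11 seconds
MAX_DURATION_S = 180 * 60       # maximum duration: 180 minutes
MAX_SIZE_BYTES = 2 * 1000**3    # "below 2GB"; decimal gigabytes are an assumption


def check_dubbing_input(duration_s: float, size_bytes: int) -> list[str]:
    """Return the limits a clip violates; an empty list means it fits.

    Hypothetical helper for illustration only.
    """
    problems = []
    if duration_s < MIN_DURATION_S:
        problems.append("shorter than 11 seconds")
    if duration_s > MAX_DURATION_S:
        problems.append("longer than 180 minutes")
    if size_bytes >= MAX_SIZE_BYTES:
        # The stated rule is "below 2GB", so exactly 2GB is rejected too.
        problems.append("file size not below 2GB")
    return problems
```

Under these assumptions, a 31-second, 50 MB clip passes with no violations, while a 10-second clip is rejected for length alone.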

After upload, the user chooses one or more target languages from the available list. Spanish and French are selected for the same English source clip. The job is then submitted. The presenter says short videos should finish quickly; in the example, the two short dubbed clips complete in 31 seconds. That timing is shown as part of this run, not as a general processing guarantee.

Speaker similarity is the main creative tradeoff

The main adjustable setting shown is “speaker similarity,” exposed under an Advanced toggle. The interface tooltip describes it as controlling how closely the dubbed voice mimics the original speaker. Higher values preserve more of the original speaker; lower values sound more natural in the target language.

The spoken explanation describes the range as 1 to 10, with 10 as the highest speaker similarity, but the comparison later uses settings at 0 and 10. The interface labels the same control as “Cloning strength” in the output details, with values shown at 6, 10, and 0. The practical meaning is consistent: one end prioritizes sounding like the original speaker, while the other prioritizes target-language naturalness.

“Natural in target language” sits at one end of the on-screen control; “More like the original speaker” sits at the other.

The presenter’s preferred setting is “5 or 6,” with 6 used for the main Spanish and French examples. At that level, he says the dub sounds enough like the original speaker while remaining natural. He also says the dubbed version still lines up with the edit of the original video, whereas a setting of 0 is likely to be “a little off with the original flow of the video.”

The comparison is framed as a creative decision rather than a universal rule. A tip shown on screen says: “Try different settings to find what works best for you.” The presenter’s own conclusion is narrower: in his opinion, the sweet spot for speaker similarity is around 6.
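The tradeoff can be summarized as a small lookup. This is an illustrative sketch only: the function `describe_similarity` is hypothetical, the 0 to 10 range follows the values shown in the comparison rather than the spoken "1 to 10," and the cutoffs at 3 and 8 are arbitrary assumptions, not thresholds given by ElevenLabs.

```python
def describe_similarity(value: int) -> str:
    """Summarize the tradeoff for a speaker-similarity value.

    Assumes a 0-10 scale (the on-screen comparison uses 0 and 10).
    The band boundaries below are illustrative assumptions.
    """
    if not 0 <= value <= 10:
        raise ValueError("speaker similarity is assumed to run from 0 to 10")
    if value <= 3:
        return "favors natural delivery in the target language"
    if value >= 8:
        return "favors resemblance to the original speaker"
    return "balances resemblance and naturalness"
```

With these assumed bands, the presenter's preferred settings of 5 and 6 both fall in the balanced middle, while the 0 and 10 settings from the comparison sit at opposite ends.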

The examples compare voice resemblance, target-language naturalness, and edit timing

The source clip is an English-language video from a demo room at the ElevenLabs Warsaw Summit. In the original, the speaker says: “the demo room, behind me we’ve got a demo from ElevenLabs, but there are also demos from our partners, such as Nvidia, Deutsche Telekom...”

At a similarity setting of 6, the Spanish dub renders the same content: “Ahora mismo estamos en la sala de demostraciones, detrás de mí tenemos una demo de ElevenLabs pero también hay demos de nuestros socios como Nvidia, Deutsche Telekom...” After playing it, the presenter says he can hear his tone, accent, and delivery from the original video.

The French version similarly preserves the structure of the line: “Alors là on est dans la salle de démo. Derrière moi on a une démo d’ElevenLabs, mais il y a aussi des démos de nos partenaires, comme Nvidia, Deutsche Telekom...” The presenter says the French “sounds perfect” and adds that, as someone speaking French, he can attest to that. He also points specifically to the delivery of the brand names, saying they match the timing of the edit.

The cloning-strength comparison makes the tradeoff more concrete. At 10, the Spanish dub sounds, in the presenter’s assessment, like him speaking Spanish. At 0, the Spanish line changes slightly: it uses “sala de demostración” rather than “sala de demostraciones” and has a different cadence. He says this version still sounds great and might sound much better to a native speaker, but it sounds less like him.

Once the user is satisfied with a dub, the interface allows the clip to be downloaded. The settings can also be inspected through “Show more,” which in the demonstrated Spanish output reveals fields including source language, target language, source upload, dubbed range, cloning strength, watermark status, credits used, and a download option.

The product pitch is automated localization without a custom dubbing pipeline

The closing use cases are creators localizing videos and marketers localizing ad campaigns. The accompanying launch description broadens that positioning: creators localizing YouTube videos in ElevenCreative, marketers scaling ads across markets, and studios producing broadcast-grade dubs through ElevenProductions. It says professional dubbing can cost hundreds of dollars per minute and presents Dubbing v2 as a way to automate that kind of workflow without building a custom pipeline.

The landing page shown at the end carries the same positioning: “The original performance in every language.” It also displays “Start dubbing free,” “Talk to sales,” and language examples including English, Portuguese, German, Italian, Hindi, and Spanish. The page claims the product is trusted by “1,000,000+ creators and leading enterprises,” with Headspace and Meta visible on screen.

The launch description also offers seven days of free usage on every plan: 1 minute on Free, 15 minutes on Starter, and 30 minutes on Creator and above. It links to the Dubbing v2 product page, a Creator Dubbing Partner Program application, and ElevenProductions.
