Voice AI Benchmarks Understate Errors in Real Multi-Speaker Audio

Hervé Bredin · AI Engineer · Friday, June 5, 2026 · 10 min read

Hervé Bredin of pyannoteAI argues that voice AI benchmarks often make speech-to-text look more solved than it is, because they evaluate on cleaner audio that is closer to single-speaker speech. In his talk, he shows Nvidia Parakeet scoring 11.4% word error rate on AMI meeting audio in the Open ASR Leaderboard but 26% in pyannoteAI’s run on the same dataset using the table microphone rather than headset audio. Bredin’s broader case is that conversational AI needs fine-grained speaker diarization and speaker-attributed transcription, because words alone do not capture who spoke, when they overlapped, or how real multi-speaker conversations are structured.

The benchmark gap starts with the microphone

Hervé Bredin uses a simple discrepancy to expose a larger problem in voice AI evaluation: the same speech-to-text model, on the same named dataset, can look close to solved or far from it depending on which microphone is used.

The example is Nvidia Parakeet on the AMI meeting dataset. On the Hugging Face Open ASR Leaderboard, the displayed result for nvidia/parakeet-tdt-1.1b-ctc-v1 on AMI was 11.4% word error rate. When pyannoteAI applied the same model to AMI on its side, the result was 26%.

The difference, in Bredin’s explanation, was not the model or the nominal dataset. It was the audio condition. AMI meetings, he said, involve four to five people in meeting rooms and include both headset microphones and a microphone in the middle of the table. The leaderboard number used headset microphone audio. The pyannoteAI number used the central table microphone. One condition is closer to single-speaker speech; the other is multi-speaker, distant-microphone meeting audio.

Evaluation	Audio condition described by Bredin	WER shown
Open ASR Leaderboard on AMI	Headset microphone	11.4%
pyannoteAI run on AMI	Central table microphone	26.0%

The Parakeet AMI comparison shown in the talk attributed the WER gap to microphone condition, not to a different model.

Bredin’s broader claim is that most speech-to-text models are trained on single-speaker data and do not generalize cleanly to multi-speaker recordings. Distant microphones, speaker changes, cross-talk, interruptions, and code-switching can all degrade performance. The AMI example shows how a benchmark result can depend strongly on whether the evaluated audio is close-talking headset speech or distant, multi-speaker meeting audio.
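For readers less familiar with the metric behind both numbers: word error rate counts the substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. The following is a minimal Python sketch of that definition, not the scoring code used by the Open ASR Leaderboard or by pyannoteAI. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration; real evaluations typically apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
print(word_error_rate("the dog ate the cake", "the dog ate cake"))
```

Because the denominator is the reference length, a system that drops the quieter speaker in an overlap is charged one deletion per missed word, which is one way distant, multi-speaker audio inflates the score.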

That gap matters because transcription is only the first layer of conversational understanding. If the speech-to-text system struggles with overlap and speaker changes, downstream speaker attribution can fail before the diarization system even gets a chance to resolve who spoke when.

Speaker attribution is not a timestamp join

A tempting view of speaker-attributed transcription is that the problem should be mechanical. One system knows the words and their timestamps; another system knows the speakers and their timestamps. Assign each word to the speaker segment it overlaps.

Bredin’s live notebook demo was meant to show why that is insufficient. He described three reconciliation problems. First, speech-to-text systems do not transcribe overlapping speech well. Second, speech-to-text and diarization timestamps disagree. Third, voice activity detection can be inconsistent: diarization may detect speech that transcription does not transcribe, while transcription may produce words where diarization does not mark speech.

In the demo, Nvidia Parakeet produced word-level timestamps for a two-speaker phone call. The transcript was largely recognizable: “Hello? Hello. Oh, hello. I didn’t know you were there. Neither did I...” Bredin aligned that output against pyannoteAI Precision 2’s diarization.

Some cases were clear even with timestamp drift. A word near one speaker’s segment probably belonged to that speaker. The edge cases carried the technical point. Bredin pointed to the third word, “Oh,” positioned between two diarized speech turns. After playing “Oh, hello,” he said it was not quite clear which segment should own the word. Elsewhere, the word “okay” appeared in a region where diarization indicated two speakers speaking but the transcription contained only one word.

Those are not merely annotation annoyances. Overlap, interruption, and brief acknowledgments are often the moments where the conversational structure matters most. They are also exactly the moments where a naive timestamp join is most likely to attribute words incorrectly.
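The naive join described above can be sketched in a few lines of Python. This is an illustration of the baseline Bredin argues against, not pyannoteAI’s orchestration. The `Word` and `Turn` types, the `naive_join` function, and every timestamp in the example are hypothetical values chosen to reproduce the “Oh” edge case, not data from the demo.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Duration shared by two intervals, or 0.0 if they do not intersect."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def naive_join(words: List[Word], turns: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Give each word the speaker of the turn it overlaps most (None if no overlap)."""
    attributed = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        attributed.append((word.text, best_speaker))
    return attributed


# Invented timestamps: "Oh," falls in the gap between two diarized turns.
turns = [Turn("A", 0.0, 0.5), Turn("B", 0.7, 1.2), Turn("A", 1.5, 3.0)]
words = [
    Word("Hello?", 0.1, 0.4),
    Word("Hello.", 0.8, 1.1),
    Word("Oh,", 1.25, 1.45),
    Word("hello.", 1.55, 1.9),
]
print(naive_join(words, turns))
```

With these values the join leaves “Oh,” unattributed, because it overlaps no turn at all. It also has no way to produce two speakers for a region where diarization reports overlap but the transcript contains a single word.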

The STT orchestration from pyannoteAI is intended to manage that reconciliation. In the demo, Bredin submitted a cloud API job combining Precision 2 diarization with Parakeet transcription and retrieved a word-level transcription with speaker labels. He highlighted an overlapping region around “in New Jersey. And I’m Sheila in Texas,” where one speaker has an “um” at the end of her turn while the other interrupts. The orchestrated output interleaved the words from the two speakers.

The mechanism behind that orchestration came up in the question period. Asked whether the trick was model training or heuristics, Bredin said the reconciliation method is proprietary. He did disclose one part already available in the Community 1 model: “exclusive diarization.” In overlapping regions, it selects the speaker most likely to be transcribed by the speech-to-text model, simplifying reconciliation.
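To make the idea of an exclusive timeline concrete, here is a rough Python sketch that flattens overlapping turns so that at most one speaker is active at any instant. It is not the Community 1 implementation, whose selection criterion is not described in the talk beyond the sentence above. The `exclusive_timeline` function is hypothetical, and the `pick_longest` rule (keep the speaker with the longer turn) is only a placeholder assumption standing in for “most likely to be transcribed.”

```python
from typing import Callable, List, Tuple

# A turn is (speaker, start_seconds, end_seconds).
TurnTuple = Tuple[str, float, float]


def exclusive_timeline(
    turns: List[TurnTuple],
    pick: Callable[[List[TurnTuple]], TurnTuple],
) -> List[TurnTuple]:
    """Return non-overlapping turns, using `pick` to choose one speaker per overlap."""
    # Cut the timeline at every turn boundary.
    bounds = sorted({t for _, start, end in turns for t in (start, end)})
    result: List[TurnTuple] = []
    for lo, hi in zip(bounds, bounds[1:]):
        # Turns that fully cover this elementary interval.
        active = [turn for turn in turns if turn[1] <= lo and hi <= turn[2]]
        if not active:
            continue  # silence
        speaker = pick(active)[0]
        if result and result[-1][0] == speaker and result[-1][2] == lo:
            # Extend the previous segment instead of starting a new one.
            result[-1] = (speaker, result[-1][1], hi)
        else:
            result.append((speaker, lo, hi))
    return result


def pick_longest(active: List[TurnTuple]) -> TurnTuple:
    """Placeholder rule (an assumption): prefer the speaker with the longer turn."""
    return max(active, key=lambda turn: turn[2] - turn[1])


# Speaker B interrupts during the last half second of speaker A's turn.
print(exclusive_timeline([("A", 0.0, 2.0), ("B", 1.5, 3.0)], pick_longest))
```

Once the timeline is exclusive, every transcribed word overlaps at most one speaker, which is why this step simplifies the reconciliation with speech-to-text output.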

When asked whether none of it is in the model training, Bredin clarified that pyannoteAI wants the approach to support any speech-to-text system without changing the STT model itself. That includes internally fine-tuned STT models adapted for a particular use case.

“So really it's supposed to work with any STT, even fine-tuned ones that you might have internally for, because you fine-tune them for your particular use case and nobody has it, you can combine it like that.”

Hervé Bredin

Transcription misses the structure that makes conversations intelligible

Bredin frames the problem with voice AI as larger than speech-to-text. Transcription answers “what was said”: a sequence of words from an audio stream. But in a conversation, words alone often do not explain the exchange. For many applications, the next necessary layer is speaker-attributed transcription: “who said what.”

That additional layer is not cosmetic. In meeting note-taking, it determines who owns an action item. In automatic video dubbing, it determines which translated line should be rendered in which speaker-consistent cloned voice. In podcast intelligence, it enables a system to track hosts and guests across episodes or even across different podcasts.

Even “who said what” is still a partial representation. Conversations also depend on “when.” The timing of words can reveal an interruption, a backchannel, or a pause. A small acknowledgment during another person’s turn may signal agreement or engagement; if it disappears from the transcript, the interaction changes. A pause between turns can carry information about hesitation, emphasis, or state of mind.

The next layer is “how.” Non-speech vocalizations, laughter, coughing, disfluency, stress, and prosody can all affect interpretation. A cough may be just a cough, or it may be part of how someone manages discomfort. Laughter may signal that something was funny, awkward, or inappropriate. Stress can alter meaning even when the words are unchanged. Bredin uses the sentence “the dog ate the cake” to make the point: stressing “dog,” “ate,” or “cake” changes what the utterance implies.

The outer layers matter too. A system may need to know “to whom” a person is speaking: a presenter addressing an audience is different from the same person answering one audience member’s question. Acoustic conditions add another signal. A quiet room, a street, and a noisy restaurant are different conversational contexts, and those conditions affect both what systems can infer and why a person may be speaking in a particular way.

The work described at pyannoteAI starts with a narrower but foundational problem: speaker diarization.

Diarization is not speaker naming; it is the harder problem of finding who speaks when

Hervé Bredin defines speaker diarization as answering “who speaks when.” It begins with a recording of a conversation and usually proceeds through several stages.

The first is voice activity detection: determining whether anyone is speaking at a given moment. The next is segmentation: dividing speech regions into smaller turns, including speaker-change points and cross-talk. This is where interruptions and backchannels become important. Short turns are not noise to be discarded. A small “yes” can convey the most important information in a conversation; if it is missed, the listener may lose the speaker’s state of mind.

Only after that does the system assign speaker identities to turns. In Bredin’s example, the system detects two speakers and colors their regions green and black. But these are not real-world identities. Speaker diarization does not normally output “John,” “Hervé,” or “Jack.” It outputs labels such as speaker 1, speaker 2, and speaker 3. If the colors are swapped, the diarization is still correct so long as each speaker’s turns remain internally consistent.
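One way to see why swapped labels do not matter is to score a hypothesis only after finding the best one-to-one mapping between its labels and the reference labels. The sketch below does this by brute force over frame-level labels. It is an illustration under simplifying assumptions (one speaker per frame, very few speakers), and the function name `best_label_mapping` and the toy sequences are hypothetical. Evaluation toolkits generally solve the same assignment problem with a more efficient method such as the Hungarian algorithm.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple

Label = Optional[str]  # None marks a non-speech frame


def best_label_mapping(
    reference: List[Label], hypothesis: List[Label]
) -> Tuple[Dict[str, Label], int]:
    """Find the one-to-one hypothesis-to-reference label mapping with most agreeing frames."""
    ref_labels = sorted({r for r in reference if r is not None})
    hyp_labels = sorted({h for h in hypothesis if h is not None})
    # Pad with None so that extra hypothesis speakers can remain unmatched.
    candidates: List[Label] = ref_labels + [None] * len(hyp_labels)

    best_map: Dict[str, Label] = {}
    best_score = -1
    for perm in permutations(candidates, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        score = sum(
            1
            for r, h in zip(reference, hypothesis)
            if r is not None and h is not None and mapping[h] == r
        )
        if score > best_score:
            best_map, best_score = mapping, score
    return best_map, best_score


reference = ["A", "A", None, "B", "B", "B"]
green_black = ["green", "green", None, "black", "black", "black"]
black_green = ["black", "black", None, "green", "green", "green"]

# Swapping the colors changes the mapping but not the number of correct frames.
print(best_label_mapping(reference, green_black))
print(best_label_mapping(reference, black_green))
```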

That label permutation points to one of the reasons diarization remains difficult. The system often does not know in advance how many speakers are present. A meeting note-taker might have a list of invitees, but that does not mean the actual audio has one channel per invited attendee. Two people may join from the same device; someone not on the invite may appear; a listed attendee may not speak.

The task is also complicated by overlapping speech, very short turns, imbalanced speaking time across participants, and the same acoustic problems that affect other speech-processing tasks. Bredin’s claim is direct: even though the research community has worked on diarization for a long time, the problem is still not solved.

A screenshot from Hugging Face’s model hub supported his argument that the problem is not niche. Bredin said that when filtering audio models and sorting by downloads, three of the top seven models were related to speaker identity and speaker diarization, with the remaining models related to speech-to-text. His interpretation was that the community’s demand for diarization is visible in the tooling people actually download.

The error rate is a debugging map, not just a score

Diarization systems are commonly evaluated with diarization error rate, or DER. Bredin breaks it into three components: speaker confusion, missed detection, and false alarm. The durations of the three error types are summed and divided by the total speech duration.

Speaker confusion means the system assigned speech to the wrong speaker. False alarm means it detected speech where the ground truth contains no speech. Missed detection is the reverse: the reference contains speech that the system failed to detect. Missed detection can happen during overlapping speech when the system captures only one of the people speaking.
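The three components can be computed frame by frame, as in the following Python sketch. It is a simplified illustration rather than the pyannote.metrics implementation: it assumes fixed-length frames, hypothesis labels already mapped onto reference labels, and no forgiveness collar around turn boundaries. The function name `der_components` and the ten-frame example are made up for the purpose.

```python
from typing import Dict, List, Set


def der_components(
    reference: List[Set[str]], hypothesis: List[Set[str]]
) -> Dict[str, float]:
    """Frame-level DER; each frame is the set of speakers active in it."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)
        total += n_ref                                # reference speech
        missed += max(0, n_ref - n_hyp)               # speech not detected
        false_alarm += max(0, n_hyp - n_ref)          # speech that is not there
        confusion += min(n_ref, n_hyp) - n_correct    # detected, wrong speaker
    if total == 0:
        raise ValueError("reference contains no speech")

    return {
        "missed detection": missed,
        "false alarm": false_alarm,
        "confusion": confusion,
        "total": total,
        "DER": (missed + false_alarm + confusion) / total,
    }


reference = [{"A"}] * 4 + [{"A", "B"}] + [{"B"}] * 3 + [set(), set()]
hypothesis = (
    [{"A"}] * 4
    + [{"A"}]       # overlap: only one of the two speakers captured
    + [{"A"}]       # B's speech attributed to A
    + [{"B"}] * 2
    + [{"B"}]       # speech detected where the reference is silent
    + [set()]
)
print(der_components(reference, hypothesis))
```

In the toy example each error type occurs in exactly one frame, so the breakdown points at three different places in the recording even though the aggregate is a single number.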

The metric was demonstrated on a roughly 30-second phone call between two women. The manually labeled transcript begins with short, alternating turns: “Hello?”, “Hello,” “Oh, hello. I didn’t know you were there,” “Neither did I,” followed by introductions from Diane in New Jersey and Sheila in Texas.

The open-source pyannote speaker diarization Community 1 model produced a DER of about 5% on the sample. The notebook visualization made the number inspectable: it showed where the system confused speakers, where it detected speech that was not in the reference, and where it missed speech that was present.

The pyannoteAI Precision 2 model, run through the company’s cloud API on the same sample, made fewer false alarm and missed detection errors and reached about 3% DER.

The useful engineering loop is not only producing a final label sequence. It is being able to see where the system failed locally, connect those regions to the aggregate DER, and decide whether the remaining errors matter for the application. Bredin’s demo tooling was built around that loop: pyannote.audio for diarization, pyannote.metrics for benchmarking, ipyannote for interactive visualization of diarization and speaker-attributed transcription, and pyannoteai-sdk for accessing pyannoteAI premium models. He said the notebook was available in the pyannoteAI/tutorials GitHub repository.

State of the art depends almost entirely on the acoustic setting

Asked how well state-of-the-art diarization works today, Bredin said the answer depends heavily on the use case. He showed benchmark results from pyannoteAI for two settings that make the spread obvious.

On conversational telephone speech—two people speaking over the phone, similar to the demo sample—the best system on the displayed benchmark reached 8% diarization error rate. The slide described the dataset as DIHARD CTS: 61 conversations with two speakers. The chart compared systems by false alarm, missed detection, and confusion.

In a restaurant setting, the result was dramatically worse. The slide described “Dinner with friends in a restaurant,” using the DIHARD Restaurant dataset: 12 conversations with four to eight speakers. In that noisy, multi-speaker environment, the best system shown reached 41% DER.

Setting	Dataset described on slide	Speakers	Best DER shown
Conversational telephone speech	DIHARD CTS / 61 conversations	2 speakers	8%
Dinner with friends in a restaurant	DIHARD Restaurant / 12 conversations	4 to 8 speakers	41%

Bredin's benchmark examples show how diarization performance changes with acoustic and conversational complexity.

Diarization can look close to solved in clean, constrained settings and remain far from solved in real conversational environments with noise, multiple participants, and overlapping speech.

The practical claim is narrower than full conversation understanding

Bredin’s slides describe a broad target for voice AI: understanding who said what, when, how, to whom, and in which acoustic conditions. The implementation focus in the talk is narrower. Speaker diarization is the starting layer, and fine-grained diarization is presented as the practical path to better speaker-attributed transcription.

The “take home messages” slide condensed the position into three claims. First, there is more than words in a conversation. Second, speech-to-text works well for single-speaker recordings but struggles with real conversational speech. Third, in the slide’s wording, fine-grained speaker diarization “fixes most issues” and unlocks new use cases.

The next areas of work named on the slide were real-time diarization and source separation. The talk did not elaborate on either, but their placement clarifies the technical direction: beyond assigning speaker labels after the fact, the unresolved problems include doing it as the audio arrives and separating voices when multiple people speak at once.
