Voice AI Still Confuses Natural Speech With Real Conversation
Neil Zeghidour, CEO of Gradium AI and one of the researchers behind the full-duplex voice model Moshi, argues that the field keeps mistaking better synthetic speech for voice AI’s long-promised “Her” moment. His case is that cascaded voice agents are useful but structurally too slow and lossy to feel conversational, while speech-to-speech models improve flow but remain limited unless they can listen and speak simultaneously, use tools reliably, understand paralinguistic cues, and run cheaply enough to scale.

The gap is no longer just whether voice sounds natural
Neil Zeghidour framed the current state of voice AI around an analogy he clearly dislikes but still treats as the right benchmark: the conversational operating system in Her. The reference has become, in his words, “the most overused, most annoying analogy” in the field, but it remains useful because the film’s imagined interaction still captures what today’s systems are trying to approximate: low-friction speech, immediate responsiveness, contextual intelligence, and sensitivity to what the user is not explicitly saying.
His main distinction was not between bad voice demos and good ones. It was between speech that sounds natural and interaction that behaves like conversation. An ElevenLabs demo of a government helper and one of Gradium’s own demos, in which a small robot was asked to adopt the voice and personality of a “gym addict,” both sounded more natural than older systems. Both also exposed the same basic failure: the interaction still did not feel human. The latency was noticeable, simultaneous speaking was not handled, and the system’s “intelligence” was largely the intelligence of a text model wrapped in voice.
That distinction matters because many voice agents are still fundamentally cascaded systems: speech is transcribed to text, an LLM reasons over the transcript, and a TTS model synthesizes the spoken response. The architecture is practical, but it is lossy by construction: anything not preserved in the transcription (tone, hesitation, discomfort, backchanneling, emotional cues) is mostly unavailable to the agent.
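That loss is visible in the shape of the loop itself. Here is a minimal sketch of a cascaded turn, with placeholder functions standing in for real STT, LLM, and TTS services (all names are illustrative, not any vendor’s API):

```python
# Minimal sketch of a cascaded voice agent: STT -> LLM -> TTS.
# All function names are placeholders, not any vendor's API.

def transcribe(audio: bytes) -> str:
    """Speech-to-text: returns words only. Tone, pauses, hesitation,
    and backchannels are discarded at this step."""
    ...

def respond(history: list[dict], user_text: str) -> str:
    """Text LLM: reasons over the transcript, never hears the audio."""
    ...

def synthesize(text: str) -> bytes:
    """Text-to-speech: expressive output, but not conditioned on how
    the user actually sounded."""
    ...

def cascaded_turn(history: list[dict], audio: bytes) -> bytes:
    user_text = transcribe(audio)  # paralinguistic signal is lost here
    history.append({"role": "user", "content": user_text})
    reply = respond(history, user_text)
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)       # strict turn-taking: the loop is half-duplex
```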
Zeghidour described the requirements for a real “Her” moment as threefold: naturalness, intelligence, and empathetic response. Naturalness means low latency and the ability to listen and speak at the same time. Intelligence means understanding nuance and intent, maintaining context across turns, and performing agentic tasks. Empathy means detecting and responding to user emotions in a way that builds connection rather than merely producing fluent audio.
The central tension is that today’s practical systems are improving quickly where product teams can use them, but their architecture strips away much of what makes voice valuable in the first place. Speech-to-speech systems preserve more of that signal, but most current versions remain constrained in a different way: they are still not truly conversational.
Cascaded systems are useful, but their latency budget is structurally wrong
In Gradium’s own stack, Zeghidour said the company works on the voice-model building blocks rather than orchestration or vertical applications: streaming speech-to-text, multilingual text-to-speech, voice cloning, semantic VAD, speech-to-speech transformation, translation, and dialogue. The company’s goal is to be a model provider for teams building voice agents and voice products.
But even a fast component does not solve the whole system. Zeghidour showed a latency comparison in which Gradium’s TTS appeared faster than several other models, but he immediately undercut the implied win: text-to-speech alone still takes more than 200 milliseconds. In human conversation, he said, the entire stack—understanding, producing an answer, and pronouncing it—needs to fit around that same 200-millisecond window.
That means a system can fight over 10 or 20 milliseconds of TTS latency and still fail to feel human. Worse, that latency measurement only covers text conversation. It does not include tool calls, reasoning, search, retrieval, booking, or any real task that a voice agent is supposed to perform.
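The arithmetic is unforgiving. In the illustrative budget below, only the roughly 200-millisecond turn-taking target and the more-than-200-millisecond TTS figure come from the talk; the other component numbers are assumptions:

```python
# Illustrative latency budget for one conversational turn. Only the
# ~200 ms turn-taking target and ">200 ms for TTS alone" come from the
# talk; the other component numbers are assumptions, not measurements.
BUDGET_MS = 200

components_ms = {
    "speech_to_text_streaming_tail": 80,   # assumed
    "llm_first_token": 150,                # assumed
    "tts_first_audio": 200,                # ">200 ms" per the talk
}

total = sum(components_ms.values())
print(f"total {total} ms vs budget {BUDGET_MS} ms")  # total 430 ms vs budget 200 ms
# And this is before any tool call, which adds an unpredictable
# 500-4000 ms on top (per the talk).
```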
To make the contrast concrete, Zeghidour used a scene from Her in which the AI instantly scans emails and extracts the relevant information. Today’s agents do not do that instantly. If a voice agent has to call a tool, the user waits. Zeghidour said tool-call latency through services such as OpenRouter can run from 500 milliseconds to four seconds. That range is unpredictable, and for him it is becoming the main bottleneck.
His proposed mitigation was not simply “make the model faster.” It was to make the interaction more resilient to waiting. One pattern he described was contextual fillers: the LLM splits into two paths, sending a tool call while also producing speech that keeps the conversation moving. When the tool result returns, the system inserts the answer back into the dialogue.
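A minimal sketch of that two-path pattern, assuming hypothetical `call_tool`, `speak`, and `generate_filler` helpers (none of these names come from the talk):

```python
import asyncio

# Sketch of the two-path "contextual filler" pattern. The helpers are
# hypothetical stand-ins; a real agent would stream audio and call real tools.

async def call_tool(name: str, args: dict) -> dict:
    await asyncio.sleep(2.0)  # tool latency is unpredictable: 0.5-4 s per the talk
    return {"hotels": ["Hotel A", "Hotel B"]}

async def speak(text: str) -> None:
    print(f"[agent] {text}")  # placeholder for streaming TTS output

async def generate_filler(topic: str) -> str:
    # In the demo this came from the LLM; here it is a canned line.
    return f"{topic} is a fascinating place. Let me pull up some options..."

async def answer_with_filler(topic: str) -> None:
    # Path 1: fire the tool call immediately.
    tool_task = asyncio.create_task(call_tool("search_hotels", {"city": topic}))
    # Path 2: keep the conversation moving while the tool runs.
    await speak(await generate_filler(topic))
    # Splice the result back into the dialogue once it lands.
    result = await tool_task
    await speak("I found these hotels: " + ", ".join(result["hotels"]))

asyncio.run(answer_with_filler("Tokyo"))
```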
He demonstrated this with a quickly built travel-booking interface. The user wanted to go to Tokyo; while the agent retrieved options, it filled the gap by saying Tokyo was “a fascinating mix of ultra-modern skyscrapers and beautiful, peaceful shrines” before presenting hotels. Zeghidour acknowledged the demo needed polishing, but the point was architectural: if tool duration is unpredictable, the agent needs a way to speak naturally while waiting rather than leaving a dead pause.
The implication is that cascaded systems can be made more usable, but not by pretending the latency problem is confined to TTS. Once agents perform actions, the slowest and least predictable part of the loop may be the external work the agent is doing.
Speech-to-speech reduces latency but does not automatically create conversation
Neil Zeghidour said customers and investors often respond to cascaded-system latency by asking about speech-to-speech. In the simplified version, speech-to-speech replaces the three blocks—speech-to-text, LLM, text-to-speech—with a single model that takes speech as input and emits speech as output. That can reduce latency substantially. It still does not necessarily produce a human-like conversation.
The issue is duplexing. Zeghidour argued that every speech-to-speech model except Moshi is half-duplex. A half-duplex model is either listening or speaking. It cannot genuinely do both at once. Human conversation, by contrast, is full-duplex: information flows both ways continuously, and people regularly overlap, cough, interject, affirm, or signal that they are following without intending to take the floor.
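The difference shows up in the shape of the event loop. The sketch below illustrates only the two structures; it is not Moshi’s actual architecture:

```python
import asyncio
from typing import AsyncIterator

# Illustrative contrast between the two loop shapes, not Moshi's
# actual architecture.

async def half_duplex_loop(listen_turn, respond, speak) -> None:
    """Strict alternation: the model cannot hear while it talks."""
    while True:
        user_audio = await listen_turn()  # wait for the user to finish
        reply = respond(user_audio)
        await speak(reply)                # user audio here reads as an interruption

async def full_duplex_loop(mic: AsyncIterator[bytes], model, speak) -> None:
    """Two always-on streams: input keeps flowing even while output
    is being produced, so overlap is just more input, not an error."""
    async def inbound() -> None:
        async for frame in mic:
            model.ingest(frame)           # heard even mid-sentence

    async def outbound() -> None:
        while True:
            await speak(model.next_output_frame())

    await asyncio.gather(inbound(), outbound())
```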
A slide made the distinction visually: in the half-duplex timeline, the user and AI alternate turns; in the full-duplex timeline, “Alice” and “Bob” have overlapping streams across time. The slide’s caption stated the design claim directly:
A human conversation has no latency. Information flows both ways constantly.
Backchanneling was Zeghidour’s concrete example. A user may say “mhmm,” “sure,” or “yeah” while the other person is talking, not to interrupt but to show attention. In Japanese, Zeghidour said, frequent backchanneling can be a sign of politeness and active listening. He said speakers can overlap for as much as 20% of a conversation.
Half-duplex voice agents misread this. Zeghidour showed a screen-recorded interaction with a speech-to-speech model in which he tried to brainstorm a talk about voice AI. When he made small affirming interjections, the model repeatedly treated them as interruptions, stopped its own answer, restarted, apologized, and then interrupted again. He explicitly told the model: “No, no, please stop interrupting. You know, it’s called backchanneling. Humans do it all the time.” The interaction still broke down.
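Gradium lists semantic VAD among its building blocks, and the decision the model kept getting wrong can be made concrete with a toy heuristic (purely illustrative, not Gradium’s method):

```python
# Toy illustration of the backchannel-vs-interruption decision that
# half-duplex agents get wrong. Not Gradium's semantic VAD, just a
# naive heuristic to make the problem concrete.

BACKCHANNELS = {"mhmm", "mm", "uh-huh", "yeah", "sure", "right", "ok"}

def should_yield_floor(user_text: str, user_duration_s: float) -> bool:
    """Return True if the agent should stop talking and hand over the turn."""
    words = user_text.lower().strip(".,!? ").split()
    # Short, affirmative overlap is attention, not a bid for the floor.
    if user_duration_s < 1.0 and len(words) <= 2 and all(w in BACKCHANNELS for w in words):
        return False  # keep talking; the user is signaling "I'm listening"
    return True       # substantive overlap: stop and listen

assert should_yield_floor("mhmm", 0.4) is False
assert should_yield_floor("wait, that's wrong", 1.2) is True
```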
For Zeghidour, this is why many impressive voice demos are recorded in quiet rooms, near the phone, under controlled conditions. Real conversation contains overlap, noise, partial words, coughs, side comments, and social signals. A system that cannot listen while speaking is fragile under those conditions.
The slide’s “no latency” line was less a literal claim about physics than a design principle: once two people are continuously adapting to each other, latency is no longer just a turn-by-turn delay. It is the wrong abstraction for the interaction.
Moshi solved flow, not usefulness
Moshi was Zeghidour’s counterexample to half-duplex speech-to-speech. Developed at Kyutai, the Paris-based open-science AI lab from which Gradium spun out, Moshi was presented as the first speech-native, full-duplex dialogue system. Zeghidour said it is still the only full-duplex model and, despite being almost two years old, “still kind of aged well.”
In the demo, Zeghidour’s co-founder Alex spoke with Moshi in a space-mission scenario. The point was not the content of the exchange—“The planet is Sirius 22,” “Can you plot a trajectory course?”—but the interaction pattern. Moshi could start responding before the user had fully completed the utterance, and the user could speak over it without the model ignoring the new information. The result, Zeghidour said, was still the most robust conversational experience of its kind: robust to noise, multiple speakers, and overlapping speech.
But he was blunt about what Moshi did not solve. It was a research prototype, not a useful agent. It had no tool calling. It could not perform tasks. It lacked observability and reliability, with no good way to detect when a user said something the system should not accept. And it did not meaningfully understand or convey empathy.
Zeghidour’s Moshi lessons separated the breakthrough from the product gap:
| What worked | What remained hard |
|---|---|
| Full-duplex conversation with natural, uninterrupted dialogue flow | A research prototype rather than a real agent |
| Real-time interaction where user and AI can speak simultaneously | Lack of observability and reliability |
| Robust conversational flow | Weak empathy and paralinguistic understanding |
This led Zeghidour to a pragmatic shift. He said he used to be “really at war against cascaded systems,” but now sees why they remain so practical and convenient. Full-duplex interactive models are the path to human-like interaction, in his view, but they cannot replace cascaded systems until they reach the same level of reliability, intelligence, personalization, and production control.
That is the hard part. Zeghidour suggested the flow problem is now known to be solvable: in his view, someone implementing a Moshi-like approach and training it on better data could get to sound and conversational flow that feel indistinguishable from humans. But sounding human is not the same as being useful, inspectable, and integrated with tools. The remaining challenge is not merely generating fluid speech; it is giving that fluid model the capabilities product teams currently get from text-first agents.
The missing signal is in the voice, not the transcript
The most conceptually important part of Zeghidour’s argument was his focus on paralinguistic understanding: the information carried by how someone speaks rather than by the words alone. He returned to Her for this point, using a scene in which Samantha infers from the protagonist’s tone that he is challenging her, perhaps because he is curious about how she works.
Zeghidour described this as the AI understanding that the character is “a bit uncomfortable.” In a speech-to-speech model, the raw information needed for that inference is technically present because the audio has not been reduced to text. But presence is not the same as use. If a model is trained on the audio version of an instruction dataset, and the dataset is mostly factual question answering, Zeghidour asked, why would the model learn to exploit tone, hesitation, discomfort, or other social cues?
That is where the cascaded architecture is especially limiting. Once speech is transcribed, the system may keep the words but discard much of the interaction. A user’s pause, uncertainty, irritation, confidence, accent, cultural backchanneling habits, and emotional state can be flattened into text. The resulting agent may sound expressive on output while being deaf to the expressive content of the input.
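One schematic way to picture the loss, with invented field names:

```python
from dataclasses import dataclass

# Invented field names, purely to picture what a transcript keeps
# versus what the audio carried.

@dataclass
class SpeechInput:
    words: str                  # e.g. "sure, go ahead"
    pitch_contour: list[float]  # rising, flat, or falling?
    pauses_ms: list[int]        # hesitation before "sure"?
    loudness_db: list[float]    # trailing off or emphatic?
    overlapped_agent: bool      # said while the agent was talking?

def to_transcript(speech: SpeechInput) -> str:
    # Everything a cascaded agent's LLM ever sees: the words, nothing else.
    return speech.words
```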
Zeghidour’s claim was not that current speech-to-speech systems already solve empathy. He said Moshi did not. His claim was that the path to “Her”-like interaction requires models that understand the voice itself, not just produce plausible synthetic speech. The model has to learn when an “um” is hesitation, when a “sure” is permission to continue, when a tone signals discomfort, and when a user’s spoken style changes the right response.
This is also why he rejected the idea that voice has become a commodity. The surface layer—pleasant voice output—may be increasingly available. The last mile, in his view, is harder: full-duplex interaction, paralinguistic comprehension, reliability, personalization, low cost, and production-grade control all at once.
Cost and privacy may become the scaling wall
Even if the interaction problem were solved, Zeghidour argued that large-scale voice AI faces another constraint: inference cost. He asked the audience to imagine a truly successful assistant, always available on a computer, used asynchronously throughout the workday, and spoken to for hours. Under current economics, he said, that product is difficult to make profitable.
He did not name competitors’ API pricing, but he said voice is expensive and that the voice modes of many hyperscalers run at a loss. In consumer voice apps, he said, LLM cost is now “almost nothing” compared with the rest of the bill. Speech-to-text is cheap; diarization is affordable; TTS consumes most of the cost.
His most pointed warning was that he had seen teams burn through fundraising on TTS bills before they had the chance to grow a user base. For voice applications, especially consumer ones, unit economics are not a side issue. They can determine whether a product survives long enough to find product-market fit.
Privacy compounds the problem. Zeghidour argued that the more people open up to an AI system, the more they will want control over where the data goes. If an assistant is continuously involved in private work and personal life, users may prefer local processing rather than sending voice and context to remote systems. He linked this concern to a broader fear of database breaches, saying users will be more comfortable if their private data stays local.
Gradium’s answer, at least for text-to-speech, is Phonon: an on-device TTS model designed to run on a smartphone CPU. Zeghidour emphasized that “on-device” can be used loosely; for Gradium, it does not mean requiring a gaming GPU. It means CPU inference on a phone.
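The resulting developer pattern looks roughly like the sketch below; `LocalTTS` and its methods are an invented interface for illustration, not Phonon’s actual API, which is in private beta:

```python
from typing import NewType

Voice = NewType("Voice", bytes)  # placeholder for a speaker representation

class LocalTTS:
    """Invented interface for the on-device pattern; not Phonon's API."""

    def __init__(self, weights_path: str):
        # ~100M weights loaded once; inference runs on the phone's CPU,
        # so no audio or text leaves the device and no API is billed.
        self.weights_path = weights_path

    def clone_voice(self, reference_clip: bytes) -> Voice:
        """Derive a speaker representation from a short clip, no retraining."""
        ...

    def synthesize(self, voice: Voice, text: str) -> bytes:
        """Generate audio faster than real time on a smartphone CPU."""
        ...
```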
The benchmark slide compared Phonon with several on-device TTS models (lower word error rate, WER, is better; higher speaker similarity is better):
| Model | Weights | WER | Speaker similarity |
|---|---|---|---|
| Phonon | ~100M | 1.48% | 56.37% |
| Kani TTS 2 | 450M | 4.07% | 40.73% |
| NeXTTS AR | 552M | 2.18% | 47.51% |
| NeXTTS Nano | 229M | 1.71% | 40.15% |
| Kokoro | 100M | 0.90% | — |
Zeghidour said Kokoro is a good model but lacks voice cloning. Phonon, by contrast, was presented as multilingual, able to reproduce a voice from a short reference clip without retraining, and faster than real time on CPU with no perceptible delay. He played a demo in a stylized character voice whose script began, “Uh, geez Morty, stop looking for a signal,” emphasizing that it ran locally on a smartphone CPU. The product point was simple: if TTS runs locally, a developer can power voice applications “without paying a single cent of API fee.”
Gradium opened a private beta for Phonon with the stated goal of enabling consumer voice apps that can scale usage without losing money on API calls.
Voice is not a commodity if the target is real interaction
Zeghidour’s closing position was explicitly opposed to competitors who say voice is now a commodity. He called that “completely false.” The commodity view may fit a narrow layer of the stack: generating speech that sounds smooth enough for a demo. It does not fit the broader target he described.
The “Her” moment, in his framing, requires at least four things to converge. First, the interaction must be full-duplex, so the system can handle overlap, backchanneling, and interruption the way people do. Second, it must be useful: agentic, tool-capable, reliable, observable, and controllable in production. Third, it must understand paralinguistic information rather than stripping speech down to text. Fourth, it must scale on both cost and privacy, which means reducing dependence on expensive cloud TTS and moving more inference on-device where possible.
Gradium’s position is shaped by that split between research and production. Zeghidour described Kyutai as an open-science lab backed by philanthropists including Eric Schmidt, Rodolphe Saadé, and Xavier Niel, where projects such as Moshi, Hibiki-Zero, and Pocket-TTS were developed. Gradium, the for-profit structure, was created to turn that kind of research into production models. A slide said the company raised $70 million in 2025.
Zeghidour’s claim is not that the “Her” moment is imminent because demos sound good. It is almost the opposite: the field keeps mistaking better voice output for solved voice interaction. His argument was that the hardest work is in the remaining gap—the science and engineering needed to make voice agents simultaneous, able to detect and respond to emotion, useful, observable, private, and cheap enough to run at scale.
