Voice AI Still Confuses Natural Speech With Real Conversation
Neil Zeghidour, CEO of Gradium AI and one of the researchers behind the full-duplex voice model Moshi, argues that voice AI’s long-promised “Her” moment is still being confused with better synthetic speech. His case is that cascaded voice agents are useful but structurally too slow and lossy to feel conversational, while speech-to-speech models improve flow but remain limited unless they can listen and speak simultaneously, use tools reliably, understand paralinguistic cues, and run cheaply enough to scale.
AI Engineer·May 9, 2026·12 min read