Text-to-Speech Models Are Converging on LLM-Style Architectures
Samuel Humeau of Mistral argues that modern text-to-speech has converged on an architecture that resembles large language modeling: an autoregressive transformer generates compressed audio tokens frame by frame, rather than raw waveform samples. Using Mistral’s open-weight Voxtral TTS model as the example, he says neural audio codecs make that possible by reducing dense speech signals to token-like representations a transformer can handle. The remaining latency frontier, in his account, is not just streaming playable audio early, but letting TTS consume an LLM’s text stream as it is still being written.
AI Engineer·May 9, 2026·12 min read
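The generation loop described above can be sketched in a few lines. This is a toy illustration, not Voxtral's actual implementation: every function name is hypothetical, and a stand-in arithmetic "model" replaces the transformer. What it shows is the shape of the interface: text arrives in chunks (as an LLM might stream it), and the TTS model emits compressed audio-codec token IDs frame by frame, so a codec decoder could start producing playable audio before the text is complete.

```python
# Hypothetical sketch of an LLM-style streaming TTS loop: an autoregressive
# model emits audio-codec tokens frame by frame while consuming partial text.
# All names and the "model" below are illustrative stand-ins.

def next_audio_token(text_so_far: str, audio_tokens: list[int]) -> int:
    """Stand-in for a transformer forward pass: maps the current text and
    audio context to the next codec token ID from a toy 1024-entry codebook."""
    return (len(text_so_far) * 31 + len(audio_tokens) * 7) % 1024

def stream_tts(text_chunks: list[str], frames_per_chunk: int = 2) -> list[int]:
    """Interleave incoming text with audio-token generation: as each text
    chunk arrives (e.g. from an LLM still writing), emit a few codec frames."""
    text_so_far = ""
    audio_tokens: list[int] = []
    for chunk in text_chunks:
        text_so_far += chunk  # consume the partial upstream text stream
        for _ in range(frames_per_chunk):
            audio_tokens.append(next_audio_token(text_so_far, audio_tokens))
            # A real system would hand each token (or small group of tokens)
            # to a neural codec decoder here, yielding audio incrementally.
    return audio_tokens

tokens = stream_tts(["Hello, ", "world", "!"])
```

The key design point mirrored here is that audio frames are produced inside the text loop rather than after it, which is what collapses time-to-first-audio when the upstream text is itself still being generated.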