Voice Agents Need Colocated Models to Stay Under One Second
Rishabh Bhargava of Together AI argues that production voice agents are now constrained less by demos than by a sub-second engineering budget spanning speech-to-text, LLMs, text-to-speech, networking, and scaling. In his account, users notice delays above 500ms and abandon calls around one second, making even 75ms network hops material once model latency is optimized. The practical architecture remains a cascade, he says, because it lets teams control tool calling, evaluation, and reliability while speech-to-speech models still lag on production requirements.

Voice agents fail when any one constraint fails
Production voice agents have four simultaneous constraints: latency, intelligence, naturalness, and reliability. Rishabh Bhargava’s point is not that each is individually difficult. It is that a usable system has to satisfy all of them at once.
Latency is the most unforgiving. Bhargava says human conversation works on roughly 300-millisecond response cues. Once an AI system takes more than 500 milliseconds to respond, users notice. At one second or two seconds, they hang up. In voice, silence is not neutral; it is interpreted as failure.
The second constraint is intelligence. A production caller cannot merely sound fluent. It has to handle real workflows: ambiguous instructions, multi-turn state, and tool calls that let it book appointments, look up information, or otherwise act in external systems. Bhargava treats tool calling as essential because it is how agents get access to the real world.
The third constraint is naturalness. A voice agent has to sound pleasant enough to be trusted. That includes language coverage, accent, pronunciation of names and products, and emotional tone appropriate to the situation. The slide frames the risk bluntly: robotic voices destroy trust, while users expect human-quality speech.
Reliability is the fourth constraint. A demo with one caller does not answer the production question. Bhargava asks what happens at 100, 1,000, or 10,000 concurrent calls. The slide specifies 99.9% uptime and cost-effective scale, with “smart scaling” rather than silent failures; the silence-as-failure framing comes from the latency constraint, where delayed response is experienced by users as a failed interaction.
This is an AND problem. You have to solve every single one of them at the same time.
The reason voice AI matters, in Bhargava’s account, is both practical and interface-level. Billions of phone calls are still handled by humans in customer support, healthcare appointment scheduling, and similar workflows. But he also argues that voice is becoming a new way to interact with computers more generally: ChatGPT voice mode, people speaking to coding agents, and systems where talking is the natural input rather than typing. Voice is not presented as speculative research. In 2026, he says, rich voice conversations are “primarily an engineering problem.”
The dominant production architecture is still a cascade
Bhargava describes the current production pattern as a pipeline, or cascading architecture. Audio chunks stream from the user into an agent orchestrator — he names Pipecat, LiveKit, or homegrown systems as examples — and then pass through speech-to-text, turn detection, an LLM, and text-to-speech. The system sends generated audio chunks back to the user over a streaming connection.
| Stage | Role in the voice agent | Engineering concern Bhargava emphasizes |
|---|---|---|
| Agent orchestrator | Receives streamed user audio chunks and coordinates the workflow | WebSocket streaming, logging, observability, and function calling |
| Speech-to-text | Converts incoming audio into text | Word error rate, time to completed transcript, turn detection, multilingual support, and streaming-native architecture |
| LLM | Decides what to do and produces text output, including tool calls | Low time to first token, instruction following, multi-turn conversation, and tool-calling reliability |
| Text-to-speech | Turns generated text into streamed audio | Time to first audio, real-time factor below one, naturalness, pronunciation, emotion, and language coverage |
That architecture is straightforward conceptually, but it creates a compound latency budget. Each component has its own quality and speed target, and network hops between components can become material even when the models themselves are already fast.
The speech-to-text model is the “ears” of the agent. Its obvious quality metric is word error rate. Bhargava says state-of-the-art models are typically around 6% word error rate on open benchmarks, though the acceptable number depends on the use case. The operational issue is not only aggregate transcript accuracy. Errors on key words can be fatal to the workflow: a person’s name, a drug name, a product name. Once speech-to-text gets such a term wrong, the LLM and the TTS system will carry the mistake forward.
For latency, the relevant STT measure is time to completed transcript: after a user stops speaking, how long until the transcript is ready for the LLM. Bhargava gives Together’s Nemotron Speech ASR as an example where the p90 is consistently around 100 milliseconds.
Speech-to-text also has to support capabilities that simple batch transcription does not solve cleanly. Turn detection is one. The system must decide whether a pause means the user is done speaking or merely pausing. If the agent starts talking before the user’s turn has ended, the experience fails in a way familiar from human interruption. Bhargava calls turn detection “still somewhat unsolved” and says it could support a 20-minute talk by itself.
Language coverage matters when customers speak different languages. Streaming-native model architecture is also becoming important. Bhargava contrasts older batch-oriented models, with Whisper as the canonical example, against newer streaming-oriented designs. Whisper was trained on 30-second audio clips, which is too long for real-time interaction; production systems have had to chunk audio, pad with silence, make repeated calls, and stitch transcripts together.
Newer approaches, including Nvidia Nemotron Speech ASR as shown in Bhargava’s example, use encoders trained with shorter lookahead windows — from about 80 milliseconds to 1.12 seconds — and cache activations so heavy computation is not repeated for every small audio step. In his description, the point of the cache-aware encoder is not only lower latency in the abstract. It changes the inference pattern for streaming audio: as small frames arrive, the model can reuse prior computation rather than repeatedly reprocessing overlapping chunks.
The LLM dominates both the latency budget and the intelligence budget
The LLM is the “brain” of the voice agent, and Rishabh Bhargava treats it as the dominant cost and latency component. The core metric is streaming latency, especially time to first token. A practical target is roughly 200 to 300 milliseconds, because the first tokens need to begin feeding the text-to-speech model quickly.
That target constrains model size. Bhargava identifies the 8-billion to 30-billion parameter range as the sweet spot for the tradeoff between latency and intelligence. Larger models can burn through the budget. Smaller models may be fast but can lose the instruction-following and tool-calling capability required for meaningful voice agents.
Tool-calling evaluation matters because the system needs a relatively small model to remain useful inside the latency budget. Bhargava separates component-level evaluation from system evaluation: if speech-to-text and text-to-speech are assumed to be good, the LLM’s tool-calling evals look similar to tool-calling evals for LLMs more broadly. The questions are whether the tool call is correct, whether the output is parsable, and whether the structure conforms to what downstream systems expect.
He says the tool-call structure should be “very close to 100%.” Correctness is more use-case dependent. For a workflow where a wrong tool call can create a bad customer outcome, the threshold will be different from one where the agent can recover.
One production response to this constraint is fine-tuning smaller LLMs on use-case-specific data. Bhargava says Together sees customers do this to improve tool-calling quality while keeping the model small enough to stay inside the voice latency budget. The practical trade is clear: rather than rely on a larger general model that responds too slowly, specialize a smaller model so the pipeline can preserve its latency target.
Text-to-speech quality is still judged by listening
Text-to-speech is the “voice” of the agent. Its first latency metric is time to first audio: once text is available, how quickly can the system produce the first audio chunk for streaming back to the user? Bhargava cites Cartesia Sonic 3 on Together as an example with p90 time to first audio under 120 milliseconds.
The second metric is real-time factor: generation must be faster than playback. Bhargava gives the example that if a model produces 10 seconds of audio in five seconds, its real-time factor is 0.5. Production systems want this below one so they do not buffer.
Quality is harder. Bhargava says objective measures exist, but for TTS “nothing quite beats listening to audio samples” for the voices and models that matter to the product. Teams need to judge whether the audio creates the end-customer experience they want.
The important capabilities are naturalness across voices, pronunciation control, emotional control, and language coverage. Names and product names need to be pronounced correctly. Emotional tags — happy, angry, sad — are an early form of control, and Bhargava says the models are getting pretty good at it. But the evaluation remains closer to product judgment than a single clean metric.
In a fast pipeline, network latency becomes large enough to matter
Once the component models are optimized, Rishabh Bhargava argues, system-level engineering becomes decisive. His rough budget is that the LLM consumes the largest share of latency and cost, followed by TTS, then STT. The slide puts the LLM at about 60% of the budget. But many latency numbers are engine latency only: how long the model takes to produce output. Real systems also pay network latency.
The colocation chart is the clearest demonstration. In a distributed voice-agent pipeline, the same optimized model stages can still incur three 75-millisecond network hops between the orchestrator and model services. That adds 225 milliseconds before the user hears a response. If the components are colocated in the same data center, those hops can fall to about 5 milliseconds each, or 15 milliseconds total. In Bhargava’s example, total time to first audio falls from roughly 710 milliseconds to roughly 500 milliseconds.
| Architecture | Network assumption | Network time shown | Total time to first audio shown |
|---|---|---|---|
| Distributed | 75 ms × 3 hops | 225 ms | ~710 ms |
| Co-located | 5 ms × 3 hops | 15 ms | ~500 ms |
The chart’s component values differ slightly between the shown slides — one version shows TTS at 105 milliseconds, another at 180 milliseconds — but the network lesson is unchanged. The point is not a single canonical TTS number; it is that network hops can consume a large fraction of the budget even after model engine latency has been brought into the right range.
Bhargava’s interpretation is that reducing each hop from 75 milliseconds to 5 milliseconds yields about a 30% reduction in an already optimized system. The operational lesson is that “every 10 milliseconds matters” in real-time systems.
Colocation is literal in this account. If a voice agent is running in London but calling an LLM whose servers are in the United States, traffic has to travel across that distance and back. If an open-source model can run in the same data center as the voice-agent server, the hop becomes intra-data-center rather than transatlantic. Distance, apart from other networking concerns, has a direct impact.
Colocation is not the only system-level issue. Autoscaling is different for voice agents than for more asynchronous workloads. Scaling up needs to happen aggressively because queued requests translate directly into bad real-time experiences. Scaling down is trickier because conversations can be stateful and long-lived; killing a pod arbitrarily can disrupt an active call. Bhargava suggests systems may need to wait for conversations to finish before removing capacity.
Global deployment matters for both latency and compliance. Systems should run close to end users to reduce network time, and in regions where data residency or GDPR compliance matters, teams need the ability to deploy where required.
Speech-to-speech promises a simpler architecture, but not yet a simpler production decision
Rishabh Bhargava identifies pure speech-to-speech models as the next major architectural direction. Instead of coordinating speech-to-text, an LLM, and text-to-speech, a single model would accept audio and produce audio while still handling instructions and function calls. Bhargava points to OpenAI’s real-time API and Nvidia Nemotron VoiceChat as examples of this direction.
The appeal is obvious. A single model reduces orchestration complexity and can preserve information that text transcription discards. Tone, emotion, accent, and hesitation survive as model input rather than being flattened into words. The model can also support fuller duplex communication: producing audio while still receiving audio. That enables backchannels such as “uh-huh” or “I see,” and can make interruption handling more native than in a pipeline where barge-ins require additional engineering.
But Bhargava says these models are not yet widely used in production because they still struggle with instruction following and tool calling. The real-world pattern, in his telling, is that teams try them, spend time prompt-engineering around issues, and eventually move back to a pipeline architecture.
The evaluation and observability implications are part of the production decision. Some pieces can remain similar if a transcription model runs alongside the audio system, giving teams some auditability over input and output audio. But the evaluation target changes. There is no pure text-to-speech component and no pure text-to-text component. The evals become full-duplex conversation evals over longer interactions, with metrics focused on the whole conversation rather than isolated model calls.
Bhargava also says this eval machinery typically sits on top of the base inference API rather than being assumed as a native property of the model interface. When asked whether speech-to-speech systems can output transcripts side by side with the conversation in a way that supports evaluation, he answers that the eval layer would typically be built on top of the base inference API.
The tradeoff is therefore not pipeline versus speech-to-speech in the abstract. Pipeline systems are operationally complicated but allow component-level control, measurement, and substitution. Speech-to-speech systems may eventually better match human conversation, especially around tone and interruption, but their current limitations in instruction following, tool use, and evaluation keep many production deployments on the cascade.


