
OpenAI Splits Audio API Into Translation, Transcription, and Voice-Agent Models

Romain Huet · Jason Wei · Dominic Grillo · OpenAI · Thursday, May 7, 2026 · 6 min read

OpenAI is presenting three new API audio models as infrastructure for voice applications that can translate, transcribe, reason, and act in real time. Romain Huet’s demonstration centered on GPT-Realtime-Translate, which keeps pace with multilingual speech, and GPT-Realtime-2, a voice-agent model that can follow turn-taking instructions, use business context, and call tools while explaining its work. GPT-Realtime-Whisper completes the set as a streaming speech-to-text model for live transcription.

OpenAI is positioning audio models as live interfaces for translation, transcription, and action

Romain Huet introduced OpenAI’s new real-time audio models for the API as infrastructure for voice applications that do more than capture speech. The announced set includes GPT-Realtime-Translate for live speech translation; GPT-Realtime-2 for voice agents that can follow instructions, reason through tasks, and take actions; and GPT-Realtime-Whisper, described in the source context as streaming speech-to-text that transcribes live as the speaker talks.

The demonstration itself focused on two of the three models: GPT-Realtime-Translate and GPT-Realtime-2. GPT-Realtime-Whisper is part of the announcement, but the source context only describes its function as live streaming speech-to-text; it is not separately demonstrated or explained in the spoken walkthrough.
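
The announcement does not include integration details, but if the three models ride on OpenAI’s existing Realtime WebSocket API, selecting one would presumably happen at connection time. Here is a minimal Python sketch under that assumption; the lowercase model identifiers, the endpoint, and the shape of the first server event are guesses modeled on the current Realtime API, not anything confirmed by the source.

```python
# Hypothetical: open a Realtime session with one of the announced models.
# The model names come from the announcement; the endpoint and the shape
# of the first server event are assumptions based on the existing
# Realtime API.
import asyncio
import json
import os

import websockets  # pip install websockets

MODEL = "gpt-realtime-translate"  # assumed id; likewise "gpt-realtime-2", "gpt-realtime-whisper"

async def main() -> None:
    url = f"wss://api.openai.com/v1/realtime?model={MODEL}"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # websockets >= 13 uses `additional_headers`; older releases call it `extra_headers`.
    async with websockets.connect(url, additional_headers=headers) as ws:
        first = json.loads(await ws.recv())
        # The demo UI surfaced a live session state and a session ID.
        print(first.get("type"), first.get("session", {}).get("id"))

asyncio.run(main())
```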

The product claim is that voice can become an active interface: translation that keeps pace with a speaker, agents that remain in the conversation while reasoning in the background, and assistants that can operate inside software systems the user already uses.

70+ → 13
input languages to output languages for GPT-Realtime-Translate

The translation demo used a web view labeled “Live translation waveform,” with input above a baseline and translated output below it. The screen included separate input and output transcript panes, a live session state, and a session ID. Huet said the English output was the model’s live audio output, captured directly from the laptop with transcriptions, and specified that there was “no edit to the audio.”
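
Nothing of the session plumbing is shown on screen beyond the transcripts, session state, and session ID, but a translation session would plausibly be configured once with a target language. A hypothetical session.update payload follows: "output_language" is an invented field, and pointing input transcription at GPT-Realtime-Whisper is an assumption, since the source only names it as the streaming speech-to-text model.

```python
# Hypothetical session configuration for the translation demo.
# "output_language" is an invented field; the source states the language
# boundary (70+ in, 13 out) but not the config schema. Using
# GPT-Realtime-Whisper for the input transcript pane is also an assumption.
session_update = {
    "type": "session.update",
    "session": {
        "output_language": "en",  # one of the 13 output languages, per the stated boundary
        "input_audio_transcription": {"model": "gpt-realtime-whisper"},
    },
}
# await ws.send(json.dumps(session_update))
```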

The translation model follows sentence shape, not just individual words

GPT-Realtime-Translate was shown translating Huet’s French speech into English audio while he was still speaking. His first claim was simultaneity: the model could listen and translate at the same time.

Ce qui est très impressionnant, c'est que le modèle peut m'écouter et traduire en même temps que je parle.

Romain Huet · Source

The model’s English output rendered that sentence as: “What’s really impressive is that the model can listen to me and translate while I’m speaking.”

Huet then emphasized timing. The model, he said, can “wait for the key word, like the verb,” and then begin translating directly. In the demonstration, that waiting was presented as part of what makes the output feel more like conversation: the translation keeps pace without flattening sentence structure into a word-by-word stream.
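
Waiting for the verb before speaking implies the client receives translated audio and text incrementally, not per finished sentence. A sketch of a receive loop under that assumption; the delta-style event names are modeled on the existing Realtime API and are not confirmed for these models.

```python
# Hypothetical receive loop: translated speech and its transcript stream
# in as deltas while the speaker is still talking. Event names are
# modeled on the existing Realtime API and are assumptions here.
import base64
import json

async def consume(ws, audio_out) -> None:
    async for raw in ws:
        event = json.loads(raw)
        if event["type"] == "response.audio.delta":
            # Translated audio, streamed as base64-encoded PCM chunks.
            audio_out.write(base64.b64decode(event["delta"]))
        elif event["type"] == "response.audio_transcript.delta":
            # Translation text, filling the output transcript pane live.
            print(event["delta"], end="", flush=True)
```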

The interaction also included an interruption in German from Dominic Grillo. Grillo said he could interrupt Huet in German and that the model would switch between his German and Huet’s French. The translated output rendered that as: “I can even interrupt in German, and the model switches effortlessly between my German and your French.” Grillo also included technical terms — “GPT,” “real time,” “OpenAI,” and “computer use” — and the translation said the model had no trouble handling them.

Huet summarized the capability as real-time translation across 70 different languages, “really following the shape of every sentence.” OpenAI’s source description gives the more specific boundary: GPT-Realtime-Translate translates speech from 70-plus input languages into 13 output languages while keeping pace with the speaker.

The use cases Huet named were media platforms, customer support tools, and education. What the demonstration showed was live translation from French into English, interruption in German, switching between input languages, and preservation of technical vocabulary in the translated output.

GPT-Realtime-2 is designed to keep the conversation alive while tools run

GPT-Realtime-2 brings “intelligent reasoning to voice agents,” according to Romain Huet. OpenAI describes it as its first voice model with GPT-5-class reasoning, able to handle harder requests and carry the conversation forward naturally.

Huet asked a phone-based assistant to look at his calendar for an upcoming customer meeting. The assistant replied that he had a meeting with Sablecrest Robotics in 12 minutes and that he was meeting with Alex Kim, the CTO.

The connected CRM context made the request more than a calendar lookup. The account on screen was Sablecrest Robotics, an enterprise customer in San Francisco. The visible account state showed an active expansion deal, a security review blocker, a pending infosec questionnaire, a CTO who had joined the last call, architecture questions from Alex Kim, and a warehouse launch announced that morning.

| CRM field | Visible account context |
| --- | --- |
| Account | Sablecrest Robotics, enterprise, San Francisco |
| Meeting | In 12 minutes; 2:30 PM–3:00 PM; Alex Kim, CTO |
| Deal state | Expansion deal active |
| Blocker | Security review; infosec questionnaire pending |
| Recent context | Warehouse launch announced this morning |
| Last activity | Security review meeting; architecture diagrams, data retention summary, and pen test report requested |

The CRM screen supplied business context for the voice agent’s later update request.

Huet then asked the assistant to stay quiet “until I say back to demo.” That instruction became part of the interaction model: the agent did not interrupt while Huet and Jason Wei discussed how developers should design these systems.

Romain, don't forget. Now that these models have things like reasoning and parallel tool calling, it's even more important to use things like preambles. This way the model can explain itself and update the user.

Jason Wei · Source

Huet agreed and made the product-design reason explicit: actions can take a few seconds, so the model should acknowledge what it is doing. With GPT-Realtime-2, he said, the model can communicate during reasoning and tool calling so the user stays informed.
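
Both behaviors, preambles around tool calls and the stay-quiet cue, read like things a developer would encode in the session instructions and trust the model to honor. A hypothetical instructions payload; the prompt actually used in the demo is not shown in the source.

```python
# Hypothetical session instructions encoding the two behaviors
# demonstrated: preambles before and during tool calls, and a
# user-controlled mute cue. The demo's real prompt is not in the source.
instructions = (
    "Before calling any tool, say one short sentence explaining what you are "
    "about to do, grounded in the account context. While tools run, give brief "
    "status updates so the user stays informed. If the user asks you to stay "
    "quiet until a cue phrase, keep listening but say nothing until you hear "
    "that exact phrase."
)
session_update = {"type": "session.update", "session": {"instructions": instructions}}
# await ws.send(json.dumps(session_update))
```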

The assistant then demonstrated the earlier “stay quiet” instruction by waiting until Huet said “back to demo.” When he did, it responded: “I’m here when you’re ready to continue the demo.” Huet’s point was that the agent had continued listening to the broader environment without interrupting the humans speaking around it.

The CRM example turns voice from query into workflow

Romain Huet asked the assistant to “update the CRM and put the meeting of today as a brief and the next steps.” The assistant did more than confirm receipt. It produced a preamble tied to the account context visible in the CRM and the stated task: “Let me pull the latest context and update your CRM. Sablecrest launched warehouse automation this morning. Expansion is active. Security review is the blocker.”

That line carried the developer pattern OpenAI was trying to show. The assistant was presented as able to draw on business context, act through connected tools, and narrate its work while tool calls ran. The source does not show the completed CRM write; it shows the voice agent beginning the update with a status explanation grounded in the CRM context.
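
The narrated update maps onto the familiar function-calling loop: the model emits a tool call, the client executes it against the CRM, and the result goes back so the agent can confirm out loud. A sketch assuming an invented update_crm tool and Realtime-style function-call events; the source shows only the spoken preamble, not the wire protocol or the completed write.

```python
# Hypothetical tool definition and call handling for the CRM update.
# "update_crm", its parameters, and the `crm` client are invented for
# illustration; event names are modeled on the existing Realtime API.
import json

UPDATE_CRM_TOOL = {
    "type": "function",
    "name": "update_crm",
    "description": "Write a meeting brief and next steps to the account record.",
    "parameters": {
        "type": "object",
        "properties": {
            "account": {"type": "string"},
            "brief": {"type": "string"},
            "next_steps": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["account", "brief"],
    },
}

async def handle_event(ws, event, crm) -> None:
    # Assumed event name for a finished tool call with parsed arguments.
    if event["type"] == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = crm.update(args["account"], args["brief"], args.get("next_steps", []))
        # Return the tool output so the agent can confirm the write aloud.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        }))
        await ws.send(json.dumps({"type": "response.create"}))
```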

Huet generalized from the CRM example to “any kind of system”: dashboards, services a company already uses, connected devices, and more. The CRM scenario served as the concrete case for a broader interface: an agent preserving conversational context, reasoning through a request, and taking an action inside a product the user already works in.

The final exchange reinforced the same turn-taking behavior. Huet again told the assistant to stay quiet while he wrapped up, then said “back to demo” and asked how it was. The assistant replied: “Smooth and clear. It felt natural and demo-friendly.” The response itself was lightweight; the demonstrated behavior was that the assistant accepted an instruction about when to speak, remained available, and resumed only when cued.
