GPT-Realtime-2 Turns Voice Agents Into Tool-Using Reasoning Systems
OpenAI’s Build Hour on GPT-realtime-2 presented the new realtime voice release as a shift from conversational voice interfaces toward tool-using, stateful agents. Teri Yu and Erika Kettleson argued that GPT-realtime-2’s larger context window, stronger instruction following, parallel tool calling, and controllable speech behavior let developers build voice systems that can operate apps, reason across workflows, and know when not to speak. Sierra’s Ken Murphy and Soham Ray added that production voice agents still depend on the surrounding system: guardrails, tuned turn-taking, tracing, redaction, evaluations, and customer-specific workflows.

Realtime-2 moves voice agents closer to reasoning systems, not just conversational interfaces
OpenAI’s realtime audio release is organized around three models: GPT-realtime-translate, GPT-realtime-whisper, and GPT-realtime-2. Teri Yu described the first as live speech translation from more than 70 input languages into 13 output languages, aimed at settings such as video calls, live streams, and customer service. She described GPT-realtime-whisper as streaming speech-to-text that can transcribe while a person is still speaking, with tunable latency “as low as 200 milliseconds” and support for 80 input languages.
GPT-realtime-2 was presented as the agent model in the release. Yu called it OpenAI’s “most intelligent voice model,” bringing “GPT-5 class reasoning into voice” for agents that need stronger prompt adherence, tool calling, multilingual performance, and longer-running task context. The new capabilities listed on OpenAI’s slide were preambles, a 128K context window, parallel tool calling, domain understanding, context over turns, and controllable tone.
The context window was framed as a practical production change rather than a benchmark abstraction. Yu said it had increased 4x to 128K, roughly translating, in Erika Kettleson’s phrasing, to “almost an hour” of context. Yu said the larger window improves instruction following and intelligence because the system does not have to truncate as much.
Yu grouped GPT-realtime-2 use cases into three broad categories: voice-to-action, systems-to-voice, and voice-to-voice. Voice-to-action covers hands-free applications such as shopping, scheduling, in-car experiences, and development workflows. Systems-to-voice describes spoken interfaces over apps, calendars, CRMs, and support dashboards. Voice-to-voice covers direct spoken exchanges such as customer service calls.
OpenAI showed benchmark slides comparing GPT-realtime-2 with GPT-realtime-1.5. The visible figures showed GPT-realtime-2 ahead on both an intelligence benchmark and an instruction-following benchmark. Yu summarized the improvement as “major jumps” in intelligence, instruction following, and tool calling.
| Benchmark | GPT-realtime-2 | GPT-realtime-1.5 |
|---|---|---|
| Big Bench Audio: Intelligence | 96.6% | 81.4% |
| Audio MultiChallenge: Instruction Following | 48.5% | 34.7% |
The translation demo illustrated that the release is not only about the agent model. Kettleson asked Yu to speak while the demo translated into Spanish. The app showed input and output transcripts side by side, using GPT-realtime-translate with GPT-realtime-whisper powering transcription. The model translated a casual breakfast exchange, including references to an acai bowl and soy milk in the spoken prompt; the Spanish output also rendered part of Kettleson’s breakfast line as “un café con leche.” Kettleson noted that a Spanish-language chat commenter had said “wow, that’s good,” and pointed out that the demo preserved speaker differences through what she called dynamic voice cloning. Yu’s framing was “dynamic tone matching.”
The shopping demo was about tool routing, not voice polish
Erika Kettleson’s first live build was an e-commerce site called Supply Co. The realtime model was not only answering questions. It was operating the interface through tools, inspecting the current page, calling multiple tools in sequence, and combining shopping context with an external weather lookup.
Kettleson began as a shopper planning a Pacific Northwest hiking trip. The assistant recovered her shopping state: she still needed a tent and hiking shoes, while her day pack, trail socks, and insulated bottle were already covered. When she asked for tents under $450 for three to four people, the UI updated to show camping tents filtered by capacity and budget.
The important behavior was not the product list itself, but the chain of reasoning and action around it. The assistant compared a higher-rated three-to-four-person tent at $419.85 with free shipping tomorrow against a cheaper $357 quick-open design that was not marked as fully in stock. When Kettleson asked about one- and two-star reviews on the more expensive tent, the assistant summarized the complaints: first-time setup could be slower than expected, and the tent was not ideal for heavy storms, strong wind, and downpours. It then connected those reviews to the trip context, saying the complaints were not deal breakers for light to moderate rain but would matter if rough weather was expected.
Kettleson asked it to check the Seattle-area forecast for the weekend after next. The assistant reported a “moderate” storm risk — rain and a breezy spell, but no clear storm — and recommended the tent with a footprint and solid stakes. It then added the tent to her cart, checked her saved boot size, searched waterproof hiking boots, opened the cheaper option at her request, and added a pair of “Lightweight Waterproof Hiking Boots” priced at $224.85. The cart total was $644.70.
Kettleson emphasized that she was passing the model “15 or 20 tools,” a number she said would have been beyond what she would recommend for previous realtime models. The model had to select from those tools, inspect the current page, decide what action was needed, call multiple tools in a row, and reach out to an external weather source without forcing the shopper to leave the Supply Co site.
Her distinction was between a simple “speech in, single action out” workflow and a more complete shopping assistant. In the demo, the model updated the visual experience as it spoke, while reasoning across product search, reviews, cart state, saved sizing, and weather. For builders, the implication was that tool design is no longer just about exposing a single command behind a voice utterance. The assistant needed enough structured tools to search, inspect, compare, add to cart, and bring in external data, while the application still controlled what actions were available.
This is like an actual shopping assistant that can reason across all these tools and update the visual experience as it talks.
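For builders who want to see what that tool surface might look like, the sketch below registers a few Supply Co-style tools on a Realtime session. The tool names, schemas, and instructions are hypothetical; the `session.update` event shape follows OpenAI's published Realtime API, and `ws` is assumed to be an already-open WebSocket connection to the realtime endpoint.
```python
import json

# Hypothetical Supply Co tools; a real deployment would expose many more,
# per Kettleson's "15 or 20 tools" remark.
SHOPPING_TOOLS = [
    {
        "type": "function",
        "name": "search_products",
        "description": "Search the catalog with optional price and capacity filters.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_price": {"type": "number"},
                "capacity": {"type": "integer"},
            },
            "required": ["query"],
        },
    },
    {
        "type": "function",
        "name": "add_to_cart",
        "description": "Add a product to the shopper's cart.",
        "parameters": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string"},
                "quantity": {"type": "integer"},
            },
            "required": ["product_id"],
        },
    },
    {
        "type": "function",
        "name": "get_weather_forecast",
        "description": "Fetch an external forecast for a location and date range.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "start_date": {"type": "string"},
                "end_date": {"type": "string"},
            },
            "required": ["location"],
        },
    },
]


async def configure_shopping_session(ws):
    """Register tools and instructions on an already-open Realtime WebSocket."""
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "You are the Supply Co shopping assistant. Use tools to search, "
                "compare, and manage the cart; never invent prices or stock levels."
            ),
            "tools": SHOPPING_TOOLS,
            "tool_choice": "auto",
        },
    }))
```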
Voice-to-action can also mean knowing when not to speak
Erika Kettleson’s second demo inverted the shopping scenario. Kettleson acted as a product manager at Supply Co using a product analytics dashboard called Metric Loop. Here, the voice assistant was not meant to chat continuously. It was meant to execute UI actions, preserve investigation state, and speak only when asked.
The dashboard showed an “Activation Investigation” with filters, charts, a metrics dictionary, and an investigation notebook. Kettleson said she had heard there were issues in Europe, then used voice to drive the workflow: filter by Europe; compare the last seven days against the prior seven; surface relevant filters; narrow to voice search, first-time shoppers, and footwear; launch a root-cause investigation; and compare mobile Safari to Chrome.
The assistant was configured not to confirm every action aloud. That silence was part of the point. Kettleson said the model was good enough at instruction following that “it only talked when I asked it to.” She wanted it to take actions without forcing her to click through the dashboard, not to verbally acknowledge every filter change.
Only at the end did she ask Metric Loop to give a two-sentence spoken overview for an engineering team. The assistant summarized the finding as a mobile Safari-specific regression: the product detail page size selector validation did not update correctly, so first-time Europe footwear shoppers got stuck after choosing a size and could not add to cart. Chrome was near baseline, which pointed toward PDP release behavior on Safari rather than a broad traffic-quality or search issue.
Kettleson’s interpretation was that the assistant behaved like “an analyst in the loop.” She said she was passing in many tools and asking the model to reason over a large amount of mock data. She also said that, with this class of reasoning, the model can write code, create dashboards, and make artifacts. The durable pattern she wanted builders to see was a realtime agent that can route across tools, maintain investigation state, and turn an analytics workflow into something conversational without losing developer control over the data or the UI.
That last constraint matters. Kettleson did not present the dashboard as an autonomous analyst replacing the application. She presented it as an interface layer over a controlled workflow: the model can operate tools, expose reasoning, and speak when asked, while developers retain control over the data and UI.
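The client-side pattern behind that kind of dashboard control is a tool-dispatch loop: read the model's completed tool calls, run the corresponding UI actions, and hand the results back. The sketch below assumes the event shapes of OpenAI's published Realtime API; the dashboard actions themselves are hypothetical stubs.
```python
import json


def set_filter(**kwargs):
    """Hypothetical UI action: apply a dashboard filter (stubbed)."""
    return {"ok": True, "applied": kwargs}


def compare_ranges(**kwargs):
    """Hypothetical UI action: compare two date ranges (stubbed)."""
    return {"ok": True, "comparison": kwargs}


TOOL_HANDLERS = {
    "set_filter": set_filter,
    "compare_ranges": compare_ranges,
}


async def handle_server_event(ws, event):
    """On response.done, execute any tool calls and return their outputs."""
    if event.get("type") != "response.done":
        return
    handled_any = False
    for item in event["response"].get("output", []):
        if item.get("type") != "function_call":
            continue
        handler = TOOL_HANDLERS.get(item["name"])
        args = json.loads(item["arguments"] or "{}")
        result = handler(**args) if handler else {"error": "unknown tool"}
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],
                "output": json.dumps(result),
            },
        }))
        handled_any = True
    if handled_any:
        # Let the model continue; whether it narrates the action or stays
        # silent is governed by the session instructions, as in the demo.
        await ws.send(json.dumps({"type": "response.create"}))
```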
Sierra’s production lesson: the model is only one layer of a trusted voice agent
Sierra’s contribution shifted the standard from demo usefulness to production trust. Ken Murphy described Sierra as building AI agents that resolve real customer service issues end to end, not just answer questions. Its customers include large enterprises and Fortune 100 companies, which means the agents need to run at high scale, integrate with complex systems, and follow nuanced business policies.
Murphy’s core point was that a customer-facing agent is not merely generating responses. It is deciding when to act, what tools to call, what information to use from a large knowledge database, and whether an action is allowed. In that environment, he said, “small error rates quickly become real business risk.” An agent that hallucinates a policy or takes the wrong action even 0.1% of the time is “not shippable.”
The challenge is not just can we make the agent sound natural, it's whether we can build, evaluate, constrain, and operate agents that businesses trust to represent them directly with their customers.
Sierra showed a demo call for a lending scenario. The agent introduced itself as Jade from Demo Loans, gave an opt-out and recording disclosure, and referred to a prior March conversation in which Ken had been eyeing Noe Valley. Murphy said he was still looking in San Francisco, now in the Sunset, and wanted a single-family home with a budget of about $350,000. The agent then said his licensed mortgage banker, Sarah, was there to help and that he was “in great hands.”
Murphy said Realtime-2 addresses several major production voice problems Sierra faces: latency, turn-taking, reasoning, and speech quality. Voice, he said, is “pretty unforgiving”; even a half-second pause can feel awkward or broken. He argued that voice-to-voice architecture is compelling because it can remove parts of the traditional cascaded stack — separate speech-to-text transcription and text-to-speech synthesis — making interactions feel faster, smoother, and more human.
But Sierra does not treat the model as the full system. The architecture slide placed GPT-realtime-2 inside a larger stack: user audio flows through an Agent Harness with tools, guardrails and policies, tool calls, supervisors, barge-in, redaction, and tracing before producing the agent response. Murphy described that harness as the production layer around Realtime-2. It defines customer-specific workflows, the tools an agent can use, the language and branding it should follow, the guardrails it needs, and the grounding required to align with customer policies.
Sierra also uses its own custom-tuned VAD models for turn-taking, because customer service audio often includes background noise, accents, interruptions, and people changing course mid-sentence. The harness handles tracing, redaction of sensitive information, PCI-compliant payment flows, and other production features intended to keep agents accurate, safe, and on-brand. Murphy framed this as the difference between a strong base model and a system that is controllable, observable, and safe enough for large companies to trust with their customers.
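Sierra did not show its harness code, but the flavor of one small piece, redacting sensitive text before it reaches tracing, can be sketched generically. The patterns and token names below are illustrative assumptions, not Sierra's implementation.
```python
import re

# Illustrative redaction rules; a production harness would use far more robust
# detection (and, for payments, PCI-compliant flows rather than regexes).
REDACTIONS = [
    (re.compile(r"\b\d{13,16}\b"), "[CARD]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
]


def redact(text: str) -> str:
    """Replace obvious PII patterns before a transcript line is stored."""
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text


def trace_turn(trace: list, role: str, text: str) -> None:
    """Append a redacted turn to whatever tracing sink the harness uses."""
    trace.append({"role": role, "text": redact(text)})
```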
Sierra’s early testing, according to Murphy, showed latency gains: calls roughly 30% faster at P50 and “up to 200% faster at P90” compared with its cascaded system. He also said voice quality had been strong and competitive with dedicated synthesis providers in Sierra’s internal evaluations. But he warned that speed and voice quality are not enough. In a traditional stack, Sierra can evaluate transcription, the agent harness, and synthesis separately. With voice-to-voice, it has to evaluate the full call: listening, reasoning, acting, and speaking.
| Measure | Sierra's reported initial result versus its cascaded system |
|---|---|
| P50 call speed | Roughly 30% faster |
| P90 call speed | Up to 200% faster |
For that, Sierra uses simulations that replay realistic customer calls customized to each customer’s workflow. The bar is whether the agent completes the task, not whether it sounds natural.
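Murphy and Ray did not describe the internals of those simulations, but the grading principle, scoring the end state rather than the dialogue, can be expressed in a few lines. Everything below (the agent interface, the scripted turns, the expected state) is an assumption for illustration, not Sierra's harness.
```python
def run_simulated_call(agent, scripted_caller_turns, expected_end_state):
    """Replay a scripted caller against an agent and grade task completion."""
    state = {}
    for utterance in scripted_caller_turns:
        # `agent.handle_turn` is an assumed interface: it takes the caller's
        # words plus current state and returns the state after any tool calls.
        state = agent.handle_turn(utterance, state)
    completed = all(state.get(key) == value
                    for key, value in expected_end_state.items())
    return {"task_completed": completed, "final_state": state}
```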
Realistic voice benchmarks expose failures that clean demos hide
Soham Ray argued that voice agents “don’t get to pick their audio.” Ideal calls have clean audio, clear turn boundaries, neutral accents, and crisp information transfer. Real customer service calls have interruptions, accents, background television, children, car noise, coughs, callers spelling things out letter by letter, frame drops, and people changing course mid-sentence.
Ray’s central question was whether a voice agent can handle a customer on the side of a highway, or in a car with kids, or an impatient caller who aggressively interrupts, while still making progress on the task. He described several common failure modes.
One is tool hallucination: the agent tells a customer “I’ve updated your shipping address” when no tool was ever called. The customer believes the change happened, but it did not. Another is conversational confusion: a polite “mm-hmm” or a stray “hold on, kids” causes the agent to trip up because it cannot distinguish a signal to ignore from one requiring a response. A third is a logical misstep: the user wants to cancel a flight, and the agent cancels the wrong one. Ray emphasized that in airline scenarios this can strand someone in a city or country they do not want to be in. A fourth is spell-out failure: if a customer spells a name or code and the agent mishears one letter, authentication can fail and the call can die before any work begins.
Ray said some agents are good at conversation, while others are better at getting things done through tools. The difficult part is combining both; errors compound and performance degrades. Spelling errors, for example, are not isolated if the incorrect spelling remains in memory and is later used in a tool call. Ray said agents can get “mired in their own mistakes” and fail to recover.
He also made a nuanced point about cascaded systems. Sierra has found that cascaded models can shine in some of these conditions because the system can be overfit to production audio and customer-specific complexity. But Ray said it has been encouraging to see voice-to-voice models begin to absorb that complexity and improve both reasoning and communication.
On Sierra’s τ-voice benchmark, Ray said the leaderboard has been dominated by thinking models. The visible chart, attributed to Artificial Analysis and tau-bench.com, showed τ-voice agentic performance for OpenAI Realtime rising from 27.9% for gpt-4o realtime in December 2024 to 30.4% for gpt-realtime in August 2025 and 37.2% for gpt-realtime-2 high in May 2026. The slide also said that where other benchmarks peak above 95%, τ-voice peaks at 37.2%, because realistic domains and conversational dynamics make agentic tasks much harder. A separate note on the chart listed a “best reasoning ceiling” of about 85% for gpt-4o-text.
| Date | Model shown | τ-voice pass@1 |
|---|---|---|
| Dec. 2024 | gpt-4o realtime | 27.9% |
| Aug. 2025 | gpt-realtime | 30.4% |
| May 2026 | gpt-realtime-2 (high) | 37.2% |
Ray’s conclusion was that thinking in voice conversations is hard. The model cannot simply disappear into a buffering loop. It needs to say something like “give me a moment” or “let me think about that,” tolerate being interrupted during that period, maintain state, and continue. Those issues, he said, “explode multilingually.” His assessment of Realtime-2 was positive but not final: OpenAI has made “some really big strides,” Realtime-2 is “significantly better,” and there is still “a high ceiling.”
The production guidance centered on interruption, context, and escalation
The Q&A focused less on model features and more on engineering decisions builders face once voice agents leave the demo environment.
On interruptions and barge-in, Sierra said the right approach depends on the use case. Realtime-2 includes its own VAD models, which Sierra found to work well, but it still uses custom VAD tuned for customer service call audio with heavy background noise, children, television, and similar conditions. Teri Yu added that padding recommendations and tuning depend on the device and environment.
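On the API side, the built-in turn detection Yu referred to is configured through `session.update`; the parameter names below follow OpenAI's published Realtime API, and the values are placeholder examples rather than recommendations.
```python
import json


async def tune_turn_detection(ws):
    """Adjust server-side VAD for a noisier device or environment (example values)."""
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.6,            # higher = less sensitive to background noise
                "prefix_padding_ms": 300,    # audio retained before detected speech
                "silence_duration_ms": 700,  # pause length that ends the user's turn
            },
        },
    }))
```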
Erika Kettleson pointed out a specific control: built-in turn detection can be disabled on a turn-by-turn basis. If the model must say something, such as a disclaimer, the developer can disable VAD for that turn so the user cannot interrupt it, then re-enable interruption handling afterward. She framed this as stronger than relying only on instruction following.
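In API terms, that per-turn control amounts to clearing `turn_detection`, requesting the protected response, and restoring VAD once it finishes. The sketch below follows the published Realtime API event shapes; the wait-for-completion helper and disclaimer text are assumptions about how an application would wire this up.
```python
import json


async def speak_uninterruptible(ws, wait_for_response_done, disclaimer_text):
    """Deliver a disclaimer the caller cannot barge in on, then restore VAD."""
    # 1. Disable built-in turn detection for this turn only.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},
    }))
    # 2. Ask for a response that reads the disclaimer.
    await ws.send(json.dumps({
        "type": "response.create",
        "response": {"instructions": f"Read this verbatim: {disclaimer_text}"},
    }))
    # 3. Wait until the model finishes speaking (assumed helper that blocks
    #    until a response.done event arrives), then re-enable interruptions.
    await wait_for_response_done()
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"turn_detection": {"type": "server_vad"}},
    }))
```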
On what separates a useful assistant from a system that improves business workflows, Yu said voice is especially suited to “fast capture” and “fast intent” use cases where speaking is easier or more convenient than typing. Text has greater information density, but voice works when people are driving, on mobile, or expressing a rough idea that the model can turn into a coherent prompt. Kettleson added that people can speak about four times faster than they can type, making voice a more casual and intimate way to provide context. Yu also noted that some countries, including Brazil and India, can be more voice-first in how users prefer to interact with apps.
Sierra’s production answer returned to orchestration. For customer service agents, the issue is not just a single model turn but the whole call. The system has to define workflows with each customer, understand policies such as refunds, and expose the right context to Realtime-2 at the right time.
For sessions longer than an hour, Yu suggested multiple sessions running at once or chained sessions. Kettleson said the most important pattern is saving state so the next session can be rehydrated, whether the cause is a call exceeding an hour, a dropped call, or a user calling back. With the 128K context window, she said, developers can provide much more context at the start of the next session, but they still need to persist what happened in the prior one.
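A minimal version of that rehydration pattern: persist a compact summary when a session ends, then seed the next session with it before the caller speaks. The `conversation.item.create` shape follows the published Realtime API; the summary format is an application-level assumption.
```python
import json


def summarize_for_handoff(transcript_tail, known_facts):
    """Application-defined: condense the prior session into a compact payload."""
    return {"prior_transcript_tail": transcript_tail, "known_facts": known_facts}


async def rehydrate_session(ws, saved_state):
    """Inject prior-session state as a context item at the start of a new session."""
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "system",
            "content": [{
                "type": "input_text",
                "text": "Context from the caller's previous session: "
                        + json.dumps(saved_state),
            }],
        },
    }))
```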
On escalation to a stronger text model through an advisory pattern, Kettleson said the decision should be based on evals. With older realtime models, OpenAI had often recommended escalation for reasoning-heavy tasks because the realtime model lacked the needed intelligence. Soham Ray said Realtime-2 and thinking models change that pattern somewhat: the model can now say it needs a moment, reason, and return with an answer. Sierra’s production approach still includes supervisors that asynchronously review the conversation and inject additional information if the agent needs to get back on track. Sierra also supports both Realtime-2 and traditional text-model stacks, choosing based on agent complexity and the need for speed.
Kettleson added that developers can inject context at any time without triggering a model response, using conversation item creation. A longer-running background tool call can complete while the realtime model continues speaking; once results are available, they can be inserted into context for the model to use.
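That injection pattern looks like the sketch below: the background task completes, its result is added with `conversation.item.create`, and no `response.create` is sent, so the model keeps talking and simply has the new information available on its next turn. The background task itself is hypothetical.
```python
import json


async def inject_background_result(ws, background_task):
    """Add a late-arriving tool result to context without triggering a response."""
    result = await background_task  # e.g. a slow inventory or CRM lookup (assumed)
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "system",
            "content": [{
                "type": "input_text",
                "text": "Background lookup finished: " + json.dumps(result),
            }],
        },
    }))
    # Deliberately no response.create: the item waits in context until the
    # model's next turn, so current speech is not cut off.
```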
On context and decision consistency across tools, Kettleson emphasized that Realtime-2 is a reasoning model with preambles and frontier-model-like tool behavior. It can make parallel tool calls without losing context, and by default the turn-to-turn state is passed forward. Yu’s final practical warning was about instruction following: because the model follows instructions well, conflicting prompt instructions can cause it to try to satisfy both. She said OpenAI has found it useful to ask models themselves to identify conflicting guidance and suggest prompt optimizations.