OpenAI Just Collapsed the Voice Agent Stack
The voice agent stack everyone has been quietly building for two years just got rerouted. On May 7, 2026, OpenAI shipped three new models on its Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. The headline number is the context window, up from 32K tokens to 128K, but that isn’t the part that matters. The part that matters is where the reasoning happens.
Today’s production voice agents run a three-step pipeline: speech to text, then an LLM call, then text to speech. The latency is the sum of three round trips, plus the emotional and prosodic information that gets thrown away every time audio is converted to text. GPT-Realtime-2 collapses that into a single model with native audio in, audio out, and GPT-5-class reasoning happening inside the voice loop instead of between transcription steps.
That’s an architectural shift, not a version bump. And it landed just two days after ElevenLabs disclosed $500M+ ARR at an $11B valuation on the promise of being voice AI infrastructure. The timing is not a coincidence.
Quick Summary: What OpenAI Shipped on May 7
Date: May 7, 2026
Models launched: GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper
GPT-Realtime-2 reasoning: GPT-5-class, in-band with audio
Context window: 128K tokens (up from 32K)
Translate model: 70+ input languages, 13 output
Whisper streaming STT: ~$0.017/min
Surface: OpenAI Realtime API
Bottom line: OpenAI just put a frontier reasoning model inside the voice loop and priced the streaming STT below most pure-play vendors. Every team building on top of ElevenLabs, Vapi, or Bland needs to rerun the build-vs-buy math this week.
The previous Realtime API, originally announced in late 2024, was good. It was also clearly a bridge. The model behind it was a tuned voice variant of GPT-4o, with a 32K context window and a noticeable gap in reasoning capability versus the text-only frontier line. Anyone running a serious agent on it knew the trade: faster speech-to-speech, but you gave up the smartness you’d get from a full GPT-5 call. So the production pattern was to keep the reasoning in a separate text model and use Realtime mostly as a sophisticated I/O layer.
GPT-Realtime-2 closes that gap. The reasoning quality OpenAI claims is GPT-5-class, which in practice means tool-use chains, multi-turn instruction following, and complex agentic flows now hold up over a voice channel without the model losing the thread. The 128K context is the enabler — long support calls, hour-plus interviews, and stateful agentic flows no longer need an external memory layer stitched together with embeddings and rolling summaries. You can just keep talking.
GPT-Realtime-Translate is a separate model tuned for live multilingual conversation. 70+ input languages and 13 output; narrower output coverage than ElevenLabs claims, but optimized hard for real-time interpretation latency rather than studio-quality dubbing. The use case is obvious: cross-border support, multinational meetings, and healthcare interpretation in markets where supporting a given language pair is a legal requirement.
GPT-Realtime-Whisper is the one most people will skim past, and it’s probably the most disruptive of the three. It’s a streaming speech-to-text model priced at roughly $0.017/min — about $1.02/hour. That undercuts most of the standalone STT market and bundles it inside the same API everyone is already integrating for the rest of OpenAI’s stack. Deepgram, AssemblyAI, and the long tail of transcription startups now have to compete with a model that ships in the same SDK as their customers’ LLM calls.
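The per-minute rate translates directly into that hourly figure. A quick sanity check, with the monthly call volume as an illustrative assumption rather than a real deployment number:

```python
# Back-of-envelope check on GPT-Realtime-Whisper streaming STT pricing.
# The $0.017/min rate is from the launch coverage above; the call volume
# is an illustrative assumption.
RATE_PER_MIN = 0.017

print(f"Per hour: ${RATE_PER_MIN * 60:.2f}")  # $1.02

monthly_minutes = 50_000  # assumed support-team volume
print(f"Monthly STT bill at 50K minutes: ${RATE_PER_MIN * monthly_minutes:,.2f}")
```

At that rate, transcription stops being a line item anyone negotiates over, which is exactly the pressure the standalone STT vendors now face.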
For background on the prior generation of streaming STT, our Whisper-versus-services breakdown walks through the trade-offs. The short version: the new pricing makes “I’ll just call Whisper from OpenAI” the default answer for anyone not already deep into a specialized vendor.
Here’s the architectural picture for an enterprise voice agent in early 2026, before this launch.
Caller speaks. A streaming STT model — Deepgram, AssemblyAI, or Whisper running in your VPC — converts audio to text. The text goes to your LLM (usually GPT-5 or Claude). The LLM responds. A TTS model — ElevenLabs, Cartesia, or PlayHT — converts that response back to audio. The full path is three vendors, three round trips, three places where emotion, hesitation, intent, and tone get lost in translation.
The total latency budget on a phone call is roughly 800ms before it feels unnatural. Hitting that budget while keeping reasoning quality high has been the central problem of voice agent engineering for two years.
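The budget problem is easiest to see as arithmetic. A minimal sketch of the three serial hops, with per-leg latencies as illustrative assumptions, not measured benchmarks:

```python
# Illustrative latency budget for the three-vendor voice pipeline. The
# ~800ms conversational budget is from the article; the per-leg numbers
# below are assumptions for the sketch, not vendor benchmarks.
BUDGET_MS = 800

pipeline_ms = {
    "streaming STT (final transcript ready)": 300,
    "LLM first token": 350,
    "TTS first audio chunk": 250,
}

total = sum(pipeline_ms.values())
print(f"Three-step pipeline: ~{total} ms")  # 900 ms
print(f"Over budget by {total - BUDGET_MS} ms, before telephony overhead")
```

Even with generous per-leg numbers, three serial hops leave no headroom, which is the structural case for collapsing the pipeline into a single speech-to-speech call.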
GPT-Realtime-2 doesn’t optimize that pipeline. It removes it. Audio goes in, audio comes out, and the reasoning step that previously required handing off to a separate text model now happens in-band. The model hears tone. It hears the pause before “yes” that signals the caller is uncertain. It generates speech that reflects that — slower, with the right prosody, asking a clarifying question instead of barreling forward. None of that information was available to a text-only LLM sitting between two converters.
This is why the 128K context window matters more than the number suggests. Long voice sessions previously had to summarize and re-prompt aggressively because the audio-to-text-to-audio path multiplied token usage. With native audio handling at 128K, an agent can maintain a full hour-plus call in working memory without the context-stitching gymnastics that complicate every production voice deployment today.
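The session-length claim can be sanity-checked with rough token math. A sketch assuming prior-generation audio tokenization rates (very roughly 600 tokens per minute of speech) and a hypothetical prompt overhead; GPT-Realtime-2's actual rates aren't published in this piece:

```python
# Rough session-length check for the 128K window. The audio token rate
# is an assumption based on earlier Realtime-style models (~600 tokens
# per minute of speech); the prompt overhead is a hypothetical figure.
CONTEXT_TOKENS = 128_000
AUDIO_TOKENS_PER_MIN = 600   # assumed tokenization rate
PROMPT_OVERHEAD = 8_000      # assumed system prompt + tool schemas

usable = CONTEXT_TOKENS - PROMPT_OVERHEAD
print(f"~{usable / AUDIO_TOKENS_PER_MIN:.0f} minutes of audio in context")
```

Under these assumptions the window holds a multi-hour call, which is why the summarize-and-re-prompt machinery becomes optional rather than mandatory.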
The downstream effect: a meaningful chunk of the engineering work done at companies like Vapi, Retell, and Bland — orchestration, latency tuning, prompt engineering across the boundaries — becomes optional. Not obsolete. Optional. And optional infrastructure tends to compress in price quickly.
The competitive picture is clearest if you look at the bundling.
ElevenLabs sells two things: voice synthesis quality that leads the field and an enterprise agent platform built on top of it. Per ElevenLabs’ published Agents pricing, self-serve voice agents tier between $0.08 and $0.12 per minute depending on the model combination (Standard at $0.08/min, Turbo at $0.10/min, Premium at $0.12/min). Enterprise contracts carry custom, negotiated rates that aren’t publicly disclosed. That published self-serve range is the reference point for an enterprise RFP — actual negotiated rates at volume would typically come in lower.
OpenAI’s pricing for the Realtime API has historically been audio-per-minute equivalent, with separate input and output rates. The exact GPT-Realtime-2 pricing matters, but the structural shift is what changes the math: the reasoning model is now bundled with the audio I/O at the API level. A buyer doing the comparison is no longer pricing “ElevenLabs voice + my own LLM” against “OpenAI voice + my own LLM.” It’s “ElevenLabs voice + my own LLM” against “OpenAI does the whole thing.”
For ElevenLabs, the response is already legible. The company has been pushing hard into the agent platform layer — ElevenLabs Agents is a deployment surface that handles voice, conversation, and orchestration as one product. The institutional money on the cap table is there because the company plans to be more than a TTS API. But OpenAI shipping reasoning inside the voice loop forces ElevenLabs to differentiate on the two things OpenAI doesn’t do as well: voice quality and voice variety. Custom cloned voices. Studio-grade output for media production. Languages where OpenAI is thin. The race condition is whether ElevenLabs’s voice quality moat compounds faster than OpenAI’s reasoning bundle eats the agent market underneath it.
For the orchestration layer — Vapi, Retell, Bland, Synthflow — the picture is harder. These companies built businesses on the premise that stitching together STT, LLM, and TTS was hard enough to be worth paying someone to do well. GPT-Realtime-2 moves the hard part inside a single API call. The orchestration layer still has work to do — telephony integration, CRM hooks, transcript routing, compliance recording, post-call analytics — but the core technical complexity it was solving just became someone else’s problem to solve.
For a useful frame on how the broader enterprise AI deployment market is moving, the pattern here is the same one we’ve seen in coding agents and search. The frontier lab ships a primitive that absorbs the middleware, and the middleware companies pivot to the workflow surface — telephony, CRM, vertical-specific compliance — where the lab doesn’t compete.
If your team has been running an in-house voice agent on the three-vendor stack, the build-vs-buy decision is now meaningfully different than it was on May 6.
The old buy case: ElevenLabs Agents at published self-serve rates of $0.08–$0.12/min gives you voice quality that leads the market, sub-second latency, and a deployment model that ships in days. You don’t have to manage three vendors, three SLAs, and three places where things break at 2 AM. The premium over a roll-your-own stack is the price of not running a voice infra team.
The old build case: roll Whisper or Deepgram for STT, GPT-5 or Claude for reasoning, and ElevenLabs or Cartesia for TTS. Cheaper per minute at scale, more control over each layer, more freedom to tune for your specific use case. Worth it only if voice is core to your product and you have engineers who understand audio pipelines.
GPT-Realtime-2 collapses both cases toward a third option. Run the whole agent on the OpenAI Realtime API. One vendor. One SLA. Reasoning quality close to what your text agents get. Sub-second latency baked in. Pricing that should land below ElevenLabs’s bundled enterprise rate, especially for high-volume voice loads where you’d otherwise be paying ElevenLabs’s negotiated rate plus your separate LLM costs.
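The three options reduce to a per-minute cost comparison. A toy model: the ElevenLabs figure uses the published self-serve Turbo tier quoted above, while the DIY component rates and the swept Realtime bundle rates are assumptions, since OpenAI's final pricing isn't in this piece:

```python
# Toy monthly cost comparison for the buy / build / bundle options. Only
# the ElevenLabs self-serve tier is a published rate; DIY component rates
# and the Realtime bundle rates below are illustrative assumptions.
monthly_minutes = 100_000

elevenlabs_bundle = 0.10 * monthly_minutes           # Turbo tier, $0.10/min
diy_stack = (0.017 + 0.05 + 0.03) * monthly_minutes  # STT + assumed LLM + assumed TTS

for realtime_rate in (0.05, 0.07, 0.09):             # swept assumption
    print(f"Realtime @ ${realtime_rate:.2f}/min: ${realtime_rate * monthly_minutes:,.0f}  "
          f"ElevenLabs: ${elevenlabs_bundle:,.0f}  DIY: ${diy_stack:,.0f}")
```

The point is structural rather than numerical: whatever OpenAI's final rate, the buyer is now comparing one bundled line item against two stacked ones.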
The trade-off you’re accepting: voice quality that’s good but short of ElevenLabs’s standard, voice variety that’s narrower than ElevenLabs’s catalog, and complete platform dependence on OpenAI for a customer-facing surface. That last one matters more than the first two. Voice agents are the customer experience for a growing number of support, sales, and scheduling workflows. Single-vendor dependence on the surface that talks to your customers is a procurement risk every CIO has to weigh.
For more context on how that platform-dependence question played out in another category, the OpenAI ending Azure exclusivity post is worth the re-read. The same logic — distribution everywhere, but with the lab still owning the model relationship — applies here.
Here’s how this lands against the most active voice AI vendors.
ElevenLabs. Still wins on voice quality, voice cloning, and language breadth. Loses the agentic-reasoning argument unless the company can ship a comparable in-loop reasoning model fast. The Nvidia and Salesforce Ventures money on the Series D cap table gives ElevenLabs the runway to build that, but the timing window just compressed.
Vapi, Retell, Bland. The orchestration story now needs a vertical or workflow angle to justify the layer. Pure latency engineering and prompt orchestration aren’t enough when GPT-Realtime-2 ships those as defaults. Expect aggressive moves into specific verticals — healthcare scheduling, real estate qualification, restaurant reservations — where compliance, integrations, and domain prompts are the moat.
Mistral Voxtral. The open-source play gets more interesting, not less. Teams that don’t want full OpenAI dependence will look harder at running their own reasoning model on top of an open-source voice layer. The total cost of ownership math is rough — you need GPU capacity and an MLops team — but the strategic optionality argument just got stronger for any enterprise that views voice as core IP.
Murf, PlayHT, and the long tail of TTS-only vendors. The hardest position in the market right now. They were never going to win on agent reasoning, and the voice generator landscape was already consolidating around a handful of leaders. Expect quiet acquisitions or pivots into specialized verticals — audiobook production, podcast generation, specific accessibility tooling — over the next twelve months.
Deepgram, AssemblyAI, and the streaming STT crowd. Real pressure from GPT-Realtime-Whisper at ~$0.017/min. The differentiation has to be domain accuracy (medical transcription, legal dictation, multilingual code-switching), latency in adversarial conditions, or on-prem deployment for regulated industries. The “we have a great Whisper alternative” pitch is no longer enough.
Three concrete moves if voice agents are anywhere near your roadmap:
1. Rerun the build-vs-buy math against the new bundle. The spreadsheet you built for the three-vendor stack in 2025 is stale.
2. Stand up a parallel pilot on the Realtime API alongside your existing stack before committing in either direction.
3. Start the compliance and data-residency review now; the bundled-vendor convenience only matters if it survives that review.
The instinct to wait three months and see how the pricing settles is reasonable. The instinct to assume your existing vendor will quietly match the bundle is not. The platform shift here is structural — a frontier lab bundling reasoning with voice — and it’s unlikely to reverse just because a TTS-led competitor would prefer the old shape of the market.
Two things are happening at once, and they’re pulling the voice AI category in different directions.
The first is the platform absorption pattern that has now played out in coding (GPT-5 absorbing Cursor’s stack), search (ChatGPT and Perplexity absorbing the long tail), and content generation. Voice was the obvious next category, and GPT-Realtime-2 is the move. The frontier lab ships a primitive that turns the middleware into an optional layer. Standalone vendors compete on quality, vertical depth, or workflow proximity rather than core technical complexity.
The second is the enterprise voice market hitting infrastructure scale at exactly the same time. ElevenLabs’s $500M ARR isn’t a fluke — it’s the leading edge of a real procurement category that didn’t exist 24 months ago. Pension funds and crossover funds are underwriting voice AI as infrastructure. Salesforce, Twilio, and IBM are integrating it into the agent stack they sell to enterprises. Telcos are running it in production. That market is real and it’s growing fast.
Those two trends collide in this week’s news. Platform absorption says voice is going to be a feature inside OpenAI. Infrastructure scale says voice is a category big enough to support multiple independent leaders. Both can be true for two years before the resolution becomes obvious. What probably won’t survive that window: the orchestration-only middleware layer, the STT-only vendor that doesn’t have a domain moat, and the TTS-only vendor without a quality reason to exist.
What probably will survive: ElevenLabs as the voice quality leader serving media, brand, and high-end enterprise. The vertical-specific voice agent companies (healthcare scheduling, legal intake, real estate) where workflow depth beats raw model capability. The compliance-grade on-prem players for regulated industries. And, of course, OpenAI sitting on top of the reasoning bundle for everyone else.
This is the voice AI category’s “ChatGPT moment for the stack.” Not the consumer awakening — the enterprise re-architecture.
Until May 7, the rational way to design a production voice agent in 2026 was a multi-vendor pipeline. STT here, LLM there, TTS somewhere else, glued together by a startup that specialized in latency. After May 7, the rational default is a single-vendor bundle on the OpenAI Realtime API, with multi-vendor orchestration reserved for use cases where voice quality, language coverage, or compliance posture genuinely demand it. The default just flipped.
That doesn’t kill ElevenLabs — the institutional cap table and the agent platform investments are positioned for exactly this kind of pressure. The company will defend by going deeper into voice quality, language breadth, and vertical-specific workflow integration. The IPO timeline gets more interesting, not less, because OpenAI’s move forces ElevenLabs to demonstrate that it’s a category leader rather than a feature that got absorbed.
For Vapi, Retell, and the orchestration tier, the pivot is harder. We’d expect the next twelve months to look like aggressive vertical specialization, telephony and CRM depth, and compliance-grade deployment options that OpenAI doesn’t compete on directly. Some of those companies are well-positioned to make that pivot. Some aren’t.
For buyers, the practical move is to treat this as the procurement reset moment. The voice agent stack you spec’d in 2025 was correct for 2025 pricing and 2025 capability. It’s not the right spec for 2026. The teams that re-bid the stack now will save material money and end up with simpler infrastructure than the teams that wait for the dust to settle.
We’d argue the dust isn’t going to settle. The voice category is going to look like the search category — one frontier-lab default that wins by bundle, a small number of specialists that win on quality and depth, and a long tail of vendors that get absorbed or pivot. GPT-Realtime-2 is the move that makes that shape inevitable. The only question is who lands on which side of the line.
What is GPT-Realtime-2? GPT-Realtime-2 is OpenAI’s voice AI model launched on May 7, 2026, with GPT-5-class reasoning that operates natively inside the audio loop rather than between separate transcription steps. It ships with a 128K-token context window (up from 32K in the previous Realtime model) and is available through the OpenAI Realtime API.
What are the three new voice models OpenAI launched on May 7, 2026? GPT-Realtime-2 (the flagship voice agent model with GPT-5-class reasoning), GPT-Realtime-Translate (real-time multilingual interpretation across 70+ input languages and 13 output languages), and GPT-Realtime-Whisper (a streaming speech-to-text model priced at approximately $0.017 per minute).
How does GPT-Realtime-2 compare to ElevenLabs voice agents? ElevenLabs voice agents bundle synthesis with conversational AI orchestration at published self-serve rates of $0.08–$0.12 per minute (enterprise contracts are negotiated separately at custom rates). GPT-Realtime-2 bundles reasoning, voice input, and voice output in a single API call. ElevenLabs still leads on voice quality, custom voice cloning, and language variety; GPT-Realtime-2 leads on integrated reasoning capability and bundled pricing for teams that don’t need studio-grade voice. For procurement context, the ElevenLabs Series D analysis covers how the institutional money is positioning around this competition.
Does GPT-Realtime-2 replace Whisper for transcription? For most streaming use cases, yes. GPT-Realtime-Whisper is OpenAI’s new streaming STT model at roughly $0.017 per minute, integrated into the same Realtime API. The previous Whisper API remains available for batch transcription, and specialist vendors like Deepgram and AssemblyAI retain advantages in domain-specific accuracy and on-prem deployment. Our Whisper vs. transcription services breakdown walks through the trade-offs.
Can GPT-Realtime-2 do live language translation? GPT-Realtime-2 itself supports multilingual conversation, but the dedicated GPT-Realtime-Translate model is tuned specifically for real-time interpretation. It handles 70+ input languages and 13 output languages, optimized for low latency in cross-border support, healthcare interpretation, and multilingual meeting scenarios.
What does the 128K context window enable for voice agents? Hour-plus support calls, multi-turn agentic flows, and long stateful conversations no longer require external memory layers, embedding-based retrieval, or rolling summarization. The agent can maintain the full session in working memory, which simplifies architecture and removes a class of bugs that show up at the boundaries of context-stitching systems.
Should we migrate our voice agents off ElevenLabs or Vapi to GPT-Realtime-2? It depends on your workflow. If you’re using ElevenLabs primarily for voice quality (media production, brand voice, complex voice cloning), the migration case is weak. If you’re using ElevenLabs Agents or a Vapi-style orchestration tier mainly for STT-LLM-TTS plumbing on standard support and sales workflows, the GPT-Realtime-2 bundle is likely cheaper and architecturally simpler. Run a parallel pilot before committing.
Is GPT-Realtime-2 compliant for healthcare or financial services? OpenAI publishes its compliance posture on the OpenAI Trust portal, but voice routing through OpenAI servers introduces data residency and PHI/PII handling questions that any regulated buyer needs to evaluate. The bundled-vendor convenience only matters if it survives your compliance review. Plan for a security review before any production migration.
How does this affect orchestration vendors like Vapi, Retell, and Bland? The core technical complexity those vendors solved — STT-LLM-TTS orchestration with low-latency stitching — is now bundled inside the OpenAI Realtime API. The remaining moat is workflow proximity: telephony integration, CRM hooks, compliance recording, vertical-specific prompts, and post-call analytics. Expect those vendors to pivot harder into vertical specialization over the next twelve months.
When should we expect ElevenLabs or other vendors to ship a comparable in-loop reasoning model? Unclear. ElevenLabs has the cap-table runway and Nvidia/Google compute alignment to attempt it, but training a frontier-grade voice-native reasoning model is a different scale of investment than tuning a TTS system. Watch for announcements from ElevenLabs, Mistral, and Anthropic over the next two quarters; the Mistral Voxtral release is the open-source baseline to track for the alternative path.
Last updated: May 10, 2026. Sources: OpenAI GPT-Realtime-2 launch announcement · OpenAI Realtime API documentation · OpenAI Realtime API original announcement · ElevenLabs Agents · OpenAI Trust portal.
Related reading: ElevenLabs $500M ARR: Voice AI Goes Institutional · Voice AI Is Faster Than Typing · Whisper vs. Transcription Services · Mistral Voxtral Open-Source Voice AI · Best AI Voice Generators 2026 · OpenAI Ends Azure Exclusivity · Enterprise AI Deployment Guide