OpenAI shipped a new speech-to-speech model today: gpt-realtime-2

May 7, 2026

This is the first speech-to-speech model good enough to use in my voice agents that do "real work."

Or real play, for that matter. Here's gpt-realtime-2 as the brain of the ship AI in Gradient Bang.

The voice-to-voice response and tool calling times here are unedited, so you can see exactly what the interaction with the model is like in an agent with a very complex system instruction and frequent tool calls.

(I did clip out the subagent task execution segments, after gpt-realtime-2 starts a subagent via a tool call. Subagents in this config used gpt-5.2 "medium" effort.)

I have a lot of benchmarks and tests for voice agents.

Being the "brain" for Gradient Bang is an excellent, very hard, test for today's models. The ship AI agent implementation is complicated:

- 25 tool definitions
- a large system prompt (~8k tokens)
- structured data events are injected in large volume from the game server
- context sharing with task subagents
- situational inference triggering
- async context compression
- a UI with streaming text display and bi-directional sync with model state

Here is gpt-realtime-2's performance on the aiewf-eval benchmark.

This new model is now the best-performing fully speech-to-speech model.

(Ultravox is a native audio input model that outputs text. The Ultravox API is a very well-engineered speech-to-speech multi-model system.)

The model supports four reasoning effort levels. Here are the aiewf-eval results for each reasoning effort.

The sweet spot is "low", which, on this benchmark, actually performs better than higher reasoning levels.

This is a pattern we see fairly often on our complex, long multi-turn, agentic task benchmarks. The highest reasoning efforts don't always score the best. Possibly AI models, like humans, sometimes think themselves into sub-optimal decisions!

Codex implemented the gpt-realtime-2 support in the Gradient Bang voice agent code. It did a good job!

Here's the PR. This is still a work in progress because mostly I told Codex to implement compaction as a phase 2 feature, and I haven't gone back to the terminal to kick that off.

There also may be a bug in tool call handling or inference triggering timing in my (very complicated) pipeline. You can see in the video that the voice agent interrupts itself once, while it's planning a task. I need to track that down.

PR link: https://t.co/2F7UeKqSad

Here's the official OpenAI announcement post for the new model:
^[1]

Congratulations to the OpenAI Realtime team!

https://x.com/OpenAI/status/2052438194625593804 ↩