← kwindla hultman kramer

.@OpenAI has announced that native speech-to-speech APIs are coming soon

May 22, 2024

.@OpenAI has announced that native speech-to-speech APIs are coming soon. A lot of us are eagerly awaiting access. Potential new use cases *and* even more latency reduction!

But you don't have to wait for GPT-4o native speech input to build really cool voice+AI applications.

We are seeing GPT-4o consistently deliver ~300ms time-to-first-byte for text prompts. This is fast!
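If you want to check a number like this yourself, here is a minimal sketch of timing time-to-first-byte on any streaming response. `fake_model_stream` is a hypothetical stand-in for a real streaming inference call, and its 50 ms delay is an arbitrary placeholder, not a measurement.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def measure_ttfb(stream: Iterable[T]) -> Tuple[float, Iterator[T]]:
    """Return (seconds until the first chunk arrived, iterator over all chunks)."""
    start = time.perf_counter()
    it = iter(stream)
    first = next(it)  # blocks until the first chunk is available
    ttfb = time.perf_counter() - start

    def replay() -> Iterator[T]:
        # Hand the first chunk back so the caller still sees the full stream.
        yield first
        yield from it

    return ttfb, replay()


def fake_model_stream(first_chunk_delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming inference response."""
    time.sleep(first_chunk_delay_s)  # simulated network + model latency
    for token in ["2", " + ", "2", " is ", "4"]:
        yield token


if __name__ == "__main__":
    ttfb, chunks = measure_ttfb(fake_model_stream())
    print(f"TTFB: {ttfb * 1000:.0f} ms")
    print("".join(chunks))
```

Swapping `fake_model_stream` for a real streaming client gives you TTFB as your application actually experiences it, which is the number that matters for conversational feel.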

You can pipeline @DeepgramAI transcription → GPT-4o → Deepgram Aura or @elevenlabsio voices.

The latency of this tech stack is quite good; conversations feel natural. You just need to be careful about pipelining the data efficiently through the three stages (speech-to-text, inference, and text-to-speech).
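One way to picture "pipelining efficiently" is that each stage should start working as soon as the previous stage has produced anything useful, rather than waiting for it to finish. The sketch below uses `asyncio` queues to forward each complete sentence to the speech stage while the model is still generating. Everything here is a stub: `fake_llm_tokens`, `llm_stage`, `tts_stage`, and `run_pipeline` are hypothetical names, the transcription stage is replaced by pre-made text, and no real Deepgram, OpenAI, or ElevenLabs API is called.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_llm_tokens(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    for token in ["Two ", "plus ", "two ", "is ", "four. ", "Anything ", "else?"]:
        await asyncio.sleep(0.01)  # simulated per-token latency
        yield token


async def llm_stage(transcripts: asyncio.Queue, sentences: asyncio.Queue) -> None:
    """Consume transcripts; emit each sentence as soon as it is complete."""
    while (text := await transcripts.get()) is not None:
        buffer = ""
        async for token in fake_llm_tokens(text):
            buffer += token
            if buffer.rstrip().endswith((".", "?", "!")):
                await sentences.put(buffer.strip())
                buffer = ""
        if buffer.strip():  # flush a trailing partial sentence
            await sentences.put(buffer.strip())
    await sentences.put(None)  # propagate end-of-stream downstream


async def tts_stage(sentences: asyncio.Queue, spoken: List[str]) -> None:
    """Stub speech synthesis: records each sentence it would have voiced."""
    while (sentence := await sentences.get()) is not None:
        await asyncio.sleep(0.01)  # simulated synthesis time
        spoken.append(sentence)


async def run_pipeline(user_utterances: List[str]) -> List[str]:
    transcripts: asyncio.Queue = asyncio.Queue()
    sentences: asyncio.Queue = asyncio.Queue()
    spoken: List[str] = []
    # The transcription stage is stubbed: utterances arrive already as text.
    for utterance in user_utterances:
        transcripts.put_nowait(utterance)
    transcripts.put_nowait(None)  # end-of-stream marker
    await asyncio.gather(
        llm_stage(transcripts, sentences),
        tts_stage(sentences, spoken),
    )
    return spoken


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["what's 2+2"])))
```

Splitting on sentence boundaries is just one reasonable chunking choice. The point is that the speech stage receives its first sentence after a handful of tokens instead of after the full completion.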

The hardest problems to solve from scratch are phrase endpointing and handling interruptions. You may want to either use @pipecat_ai to build your application or look at the implementations of phrase endpointing and interruption handling in Pipecat.
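To make "phrase endpointing" concrete, here is a deliberately naive sketch: treat the phrase as finished once a fixed amount of silence follows speech. This is not Pipecat's implementation. `SilenceEndpointer`, `find_endpoints`, the 20 ms frame size, and the 300 ms threshold are all assumptions chosen for illustration, and the input is a list of per-frame voice-activity booleans that a real app would get from a VAD model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SilenceEndpointer:
    frame_ms: int = 20         # assumed duration of one audio frame
    end_silence_ms: int = 300  # assumed silence needed to end a phrase
    _speaking: bool = False
    _silence_ms: int = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True when a phrase has just ended."""
        if is_speech:
            self._speaking = True
            self._silence_ms = 0  # any speech resets the silence timer
            return False
        if not self._speaking:
            return False  # silence before any speech is not an endpoint
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.end_silence_ms:
            self._speaking = False
            self._silence_ms = 0
            return True
        return False


def find_endpoints(vad_frames: Iterable[bool], endpointer: SilenceEndpointer) -> List[int]:
    """Return the frame indices at which a phrase was judged complete."""
    return [i for i, is_speech in enumerate(vad_frames) if endpointer.push(is_speech)]


if __name__ == "__main__":
    # Leading silence, speech, a short 100 ms pause, more speech, then a long pause.
    frames = [False] * 5 + [True] * 10 + [False] * 5 + [True] * 10 + [False] * 20
    print(find_endpoints(frames, SilenceEndpointer()))
```

The short mid-phrase pause does not trigger an endpoint, but the long trailing one does. A fixed threshold like this forces a trade-off between cutting people off and responding slowly, which is part of why this is hard to get right from scratch.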

A couple additional things:

Don’t use WebSockets for voice conversational AI in mobile apps or browsers. WebSockets are fine for moving audio between servers in the cloud. But not for moving audio from edge devices to/from the cloud. Use WebRTC. You’ll have latency, reliability, and monitoring/observability issues with WebSockets.

GPT-4o is impressively fast for text inference, and fast even with tool calling. The latency improvement over GPT-4T is really, really great. But the “vision” (image input) latency isn’t great (yet). In our testing, GPT-4o TTFB goes from ~300ms to ~2.5s if you add an image to the input.


I tweet about this stuff a lot.

Here's a video from last night that gives a pretty good sense of GPT-4o's latency. The response to "Hey, robot: what's 2+2" feels immediate. (The video is short because I wasn't explicitly trying to show latency. I was showing a new wake-word class I was experimenting with.)

https://t.co/clqjgj9p9C

Here's a primer on latency in speech-to-speech apps.

https://t.co/iwaLPtNUrC

Here is super simple test code with inference TTFB logging to the console. This is a quick way to get an intuitive feel for the latency of different models in the context of a full app stack built according to current best practices for this stuff.

Text only prompts:

https://t.co/AhE463W86u

Text+image prompts:

https://t.co/WkGYMxYM2j