May 14, 2024

After the really cool @OpenAI omnimodal announcements yesterday, everybody is talking about latency!

For today's voice conversational stacks, latency is the sum of:
1. getting audio from the client to the cloud
2. transcription
3. phrase endpointing
4. LLM inference
5. text-to-speech
6. getting audio from the cloud to the client

GPT-4o combines 2, 4, 5, and possibly 3 into a single inference step that the model can do natively (audio-to-audio).

This is really, really cool from an LLM capabilities perspective. It can also improve latency significantly. OpenAI says that GPT-4o's average time to first token is ~300ms. That's fast!

But it's actually not clear to me why the Humane pin (and some other new AI hardware) has such slow response times. Here are typical latencies today for each part of a well-chosen, optimized stack.
1. WebRTC transport, including jitter buffer [75ms]
2. transcription [250ms]
3. phrase endpointing [250ms]
4. LLM [200ms - 1200ms]
5. TTS [200ms]
6. WebRTC transport [75ms]

This adds up to between one and two seconds. And it's possible to get down to 500ms with really good infrastructure that prioritizes latency over most everything else. Phrase endpointing (which is tricky) and choice of LLM (which involves cost/quality/latency trade-offs) are the two biggest variables here, most of the time.
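
A quick sketch of that arithmetic in Python, for anyone who wants to plug in their own numbers. The per-stage values are copied from the list above; the names `STAGES` and `total_latency_ms` are made up for illustration, and treating the stages as strictly sequential (no overlap or streaming between them) is a simplifying assumption.

```python
# Per-stage latency as (name, low_ms, high_ms), copied from the list above.
# Only the LLM stage is given as a range; the rest are single typical values.
STAGES = [
    ("WebRTC transport in, incl. jitter buffer", 75, 75),
    ("transcription", 250, 250),
    ("phrase endpointing", 250, 250),
    ("LLM", 200, 1200),
    ("TTS", 200, 200),
    ("WebRTC transport out", 75, 75),
]


def total_latency_ms(stages):
    """Sum the low and high ends, assuming stages run strictly one after another."""
    low = sum(lo for _, lo, _ in stages)
    high = sum(hi for _, _, hi in stages)
    return low, high


low, high = total_latency_ms(STAGES)
print(f"voice-to-voice latency: {low}ms to {high}ms")
```

With these numbers the LLM is the only stage with a range, so the whole spread between the low and high totals comes from model choice.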

Rajiv Ayyangar (@rajivayyangar)

I tried out the Humane AI Pin.

Latency was the biggest issue: it took awkwardly long to respond.

The latency just seemed to get in the way of everything I was trying to do.

Latency is a solvable issue though, @kwindla points out.