Fast voice-to-voice response times for voice agents: nice technical overview…

May 6, 2025

Fast voice-to-voice response times for voice agents: nice technical overview and sample code from @cerebriumai.

All the platforms here are participating in the Voice AI course that starts this week. Join the course and get $2,500 in credits total for Cerebrium serverless GPU infrastructure, @FixieAI Ultravox, @cartesia voice generation, and @trydaily WebRTC.

Sign-up link in the thread below, plus more notes on how 600ms voice-to-voice latency stacks up against typical production voice agents today.

cerebriumai (@cerebriumai)

⚡ Voice AI that responds in 600ms?

Meet Ultravox — a multimodal LLM from @FixieAI that skips the STT step and processes audio directly into an LLM.

Built on @cerebriumai infra, deployed with Pipecat from @trydaily

Most voice agents today have P50 voice-to-voice response latencies of 1,200 to 1,500 ms, measured on the client from the end of user speech to the first byte of non-silence audio from the agent.
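
As a rough illustration of that measurement, here is a minimal sketch that computes P50 from per-turn client timestamps. The function names and the idea of logging two timestamps per turn are my assumptions for the example, not something from the Cerebrium article.

```python
from statistics import median

def voice_to_voice_ms(user_speech_end_ms, agent_first_audio_ms):
    """Latency for one turn: end of user speech to first non-silence agent audio."""
    return agent_first_audio_ms - user_speech_end_ms

def p50_latency_ms(turns):
    """Median (P50) latency over (user_speech_end_ms, agent_first_audio_ms) pairs."""
    return median([voice_to_voice_ms(end, first) for end, first in turns])

# Hypothetical client-side timestamps in milliseconds, for illustration only.
example_turns = [(0, 1200), (5000, 6500), (10000, 10800)]
print(p50_latency_ms(example_turns))
```

Both timestamps should come from the client, so that network transport in both directions is included in the number.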

This is okay, but faster would be better. I usually tell people that aiming for 800ms, aspirationally, is good. 600ms is great!

To achieve 600ms you need to optimize all the parts of your voice agent processing loop:
- network transport
- audio input handling (turn detection, especially)
- inference (your models and inference stack need to have excellent time-to-first-token performance)
- multi-model data pipelining
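
To make the pipelining point concrete, here is a back-of-the-envelope sketch comparing a pipelined loop (each stage starts as soon as the previous one emits its first chunk) with a sequential one (each stage waits for the previous one to finish). The stage names and every number below are hypothetical placeholders I picked for illustration, not measurements from any of the platforms mentioned.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first usable chunk
    full_output_ms: int   # time until the stage has produced all of its output

def pipelined_ms(stages):
    """Approximate time to first audio when every stage streams to the next."""
    return sum(s.first_output_ms for s in stages)

def sequential_ms(stages):
    """Time to first audio when each stage waits for the previous to finish.

    Only the final stage contributes its first-chunk time, since the user
    hears audio as soon as that stage starts emitting.
    """
    *upstream, last = stages
    return sum(s.full_output_ms for s in upstream) + last.first_output_ms

# Hypothetical placeholder budget, NOT measured values.
STAGES = [
    Stage("network in", 40, 40),
    Stage("turn detection", 200, 200),
    Stage("LLM inference", 150, 900),   # first_output_ms ~ time to first token
    Stage("voice generation", 100, 600),
    Stage("network out", 40, 40),
]

TARGET_MS = 600
for label, total in [("pipelined", pipelined_ms(STAGES)),
                     ("sequential", sequential_ms(STAGES))]:
    verdict = "within" if total <= TARGET_MS else "over"
    print(f"{label}: {total} ms ({verdict} the {TARGET_MS} ms target)")
```

With placeholder numbers like these, only the streaming version fits under the target, which is why time-to-first-token and first-chunk latency matter more than total generation time.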

Read the article in the Cerebrium thread for a technical overview and complete code.