May 6, 2025
Fast voice-to-voice response times for voice agents: nice technical overview and sample code from @cerebriumai.
All the platforms here are participating in the Voice AI course that starts this week. Join the course and get $2,500 in credits total for Cerebrium serverless GPU infrastructure, @FixieAI Ultravox, @cartesia voice generation, and @trydaily WebRTC.
Sign-up link in the thread below, plus more notes on how 600ms voice-to-voice latency stacks up against typical production voice agents today.
⚡ Voice AI that responds in 600ms?
Meet Ultravox — a multimodal LLM from @FixieAI that skips the STT step and processes audio directly into an LLM.
Built on @cerebriumai infra, deployed with Pipecat from @trydaily

Most voice agents today have P50 voice response latencies of 1,200 to 1,500 ms, measured on the client from the end of user speech to the first byte of non-silence audio from the agent.
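A rough sketch of how that client-side number could be computed, in Python. Everything here is an assumption for illustration: the `voice_to_voice_ms` helper, the `(arrival_ms, rms)` frame format, the silence threshold, and the sample timestamps are all hypothetical and not taken from the Cerebrium article or from Pipecat.

```python
import statistics

def voice_to_voice_ms(speech_end_ms, agent_frames, silence_threshold=0.01):
    """Milliseconds from end of user speech to the first non-silent agent frame.

    agent_frames is a list of (arrival_ms, rms) pairs in arrival order
    (hypothetical format). Returns None if the agent never produced
    non-silent audio after the user stopped speaking.
    """
    for arrival_ms, rms in agent_frames:
        if arrival_ms >= speech_end_ms and rms > silence_threshold:
            return arrival_ms - speech_end_ms
    return None

# Made-up turns: (end of user speech, agent frames received on the client)
turns = [
    (1000, [(1200, 0.0), (2250, 0.2)]),   # silent frame first, then audio
    (5000, [(5900, 0.0), (6400, 0.3)]),
    (9000, [(10300, 0.15)]),
]

latencies = [voice_to_voice_ms(end, frames) for end, frames in turns]
latencies = [ms for ms in latencies if ms is not None]

# P50 is the median of the per-turn latencies
p50 = statistics.median(latencies)
print(f"P50 voice-to-voice latency: {p50} ms")
```

Skipping silent frames matters: an agent can start streaming silence well before real speech arrives, and counting that first byte would understate the latency the user actually hears.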
This is okay, but faster would be better. I usually tell people that aiming for 800ms, aspirationally, is good. 600ms is great!
To achieve 600ms you need to optimize all the parts of your voice agent processing loop:
- network transport
- audio input handling (turn detection, especially)
- inference (your models and inference stack need to have excellent time-to-first-token performance)
- multi-model data pipelining
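One way to reason about the list above is as a latency budget. The Python sketch below adds up per-stage estimates and compares them to a target. The stage names and millisecond values are placeholders I made up for illustration, not measurements from any of the platforms mentioned here.

```python
# Hypothetical per-stage estimates in milliseconds (illustrative only)
budget_ms = {
    "network_transport": 80,
    "turn_detection": 200,
    "model_time_to_first_token": 200,
    "voice_generation_first_byte": 100,
}

def check_budget(stages, target_ms):
    """Return (total_ms, headroom_ms); negative headroom means over budget."""
    total = sum(stages.values())
    return total, target_ms - total

total, headroom = check_budget(budget_ms, target_ms=600)
print(f"total={total} ms, headroom={headroom} ms")

# The largest stage is the first place to look for savings
slowest = max(budget_ms, key=budget_ms.get)
print(f"largest contributor: {slowest}")
```

A plain sum assumes the stages run strictly one after another. Pipelining between models lets stages overlap, so real end-to-end latency can come in below the sum.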
Read the article in the Cerebrium thread for a technical overview and complete code.
Sign up for the month-long Voice AI course and $10k in total free credits here: https://t.co/yjxxM3Brtw