← kwindla hultman kramer

Instant startup for voice AI agents

March 12, 2025

Instant startup for voice AI agents ...

Typical startup time for a voice agent today is 2-4 seconds. Here's an example starter kit that shows how to reduce that to 200ms.

This code uses client-side buffering and careful sequencing to start capturing audio for the voice AI to respond to as soon as the local microphone is live.

In the demo video, here are the relevant numbers:

- run 1: network and bot ready 1905ms. audio live: 157ms
- run 2: network and bot ready 1828ms. audio live: 155ms
- run 2: network and bot ready 1848ms. audio live: 161ms

You can see in the video that I start to talk before the network connection is live, but that the voice bot "heard" everything I said.

The tech stack in the video is:

- @GoogleDeepMind Gemini Multimodal Live API for voice-to-voice interaction
- @pipecat_ai for the agent orchestration
- @trydaily for WebRTC audio transport

Link to code and more notes below.

Starter kit code is here:

https://t.co/DwBtbihiTu

The buffering mechanism is built into the @trydaily WebRTC transport in @pipecat_ai, so you can implement this fast connect logic in your own voice AI agents with ~10 lines or so of code.

There's lots of data about "bounce rates" for fast vs slow web page loads. Google's Page Rank docs say that bounce rates increase by 32% when load time increases from 1 to 3 seconds, and increase by 90% if load time reaches 5 seconds.

Voice agents aren't web pages and we don't have similarly time-tested research about voice AI in production, yet. But, in general, users don't like to wait! So it seems safe to assume that fast connect times for voice agents are important.

The basic issue is that establishing network connections can be slow.

WebRTC can be particularly slow, because there are so many moving parts. WebRTC is the fastest way to send and receive audio/video data once a connection is established. But connection setup time is a bottleneck.

Add additional server-side bot startup time, including things like service discovery and possible resource scale-out, and it's easy to have your median "bot ready" latency exceed several seconds.

Buffering audio on the client, then sending the audio in a quick burst as soon as the connection is live, can work around these setup bottlenecks in the vast majority of cases.

Shout out to Filipi, who wrote this core code and the example showing how to use it!

  1. https://github.com/pipecat-ai/pipecat/tree/main/examples/instant-voice