Voice AI fast response - phrase endpointing

September 4, 2024

How does a Voice AI bot/agent/process know when it should process input speech and respond?

This problem is called phrase endpointing.

Getting phrase endpointing right is critical for voice interactions.

The Open Source, state-of-the-art phrase endpointing implementation in @pipecat_ai uses:

- voice activity detection
- a speech "confidence" metric
- a running average of audio volume

Using these three signals together makes it possible to reliably detect pauses in human speech in a wide range of audio environments.

Pipecat has support for two VAD implementations: Silero VAD and the VAD from the libwebrtc codebase. We usually use Silero VAD in production at @trydaily. It's fast, flexible, doesn't use much CPU, and is quite good.

It's also pretty configurable. Pipecat exposes four VAD parameters:
- confidence
- start_secs
- stop_secs
- min_volume

You can read the source code if you're interested in how these are used.

The only one that most people will change from app-level code is stop_secs.

You can think of stop_secs as how long a speaker needs to pause for the "stopped talking" event to trigger.

A shorter stop_secs allows the AI bot to respond faster. But with a shorter stop_secs, the bot will more often talk when the human isn't really done speaking.

The shortest possible setting in the Pipecat implementation is 100ms. This is too short for most use cases! A typical setting is 300-500ms.

In general, we're aiming for 800ms of total voice-to-voice latency. This isn't always achievable, depending on how slow our whole inference and tools stack is. Here's some very rough math:

- network transport and media processing: 200ms
- speech-to-text and phrase endpointing: 200ms
- llm inference: 200ms
- tts: 200ms

So ... 800ms is tight, but achievable.

Often, though, a longer stop_secs is a good trade-off. For example, in a setting like a job interview, people will often want to pause to think. Here, a stop_sec setting of something like 800ms or even 1s can feel ideal.

Another important note: good phrase endpointing also depends on good interruption handling. If you can quickly detect the user starting to speak again and flush your output pipeline, you can use a shorter stop_secs setting.

@jonptaylor added a stop_secs slider to the Daily Bots demo last weekend, making it really easy to play with different settings. I've embedded a video. The link to the demo is in the next tweet in this thread.

Pipecat's phrase endpointing implementation is the best code I've personally seen for this. But there are lots of ways you can tweak this approach for incremental improvement. And in the future, there will be definitely be new ways to solve this problem.

One thing we've done for some specialized use cases (and demos) is to rearrange the typical phrase endpointing pipeline so that llm inference and tts happen *before* the "stopped speaking" event fires, and voice output is accumulated and gated on that event.

Almost all voice-to-voice apps today have a data flow pipeline that looks like this.

1. audio input
2. stt (transcription)
3. [ wait on phrase endpointing ]
4. llm inference
5. tts
6. audio output

But you *can* do llm inference "greedily" on every chunk of text that comes out of the speech-to-text model. Then, if the stt model outputs another chunk of text before you see a "stopped speaking" event, throw away the inference and redo it.

1. audio input
2. stt
3. | llm inference
4. | tts
5. [ accumulate and wait on phrase endpointing ]
6. audio output

The advantage of this approach is that you can set a longer stop_secs and still get fast response times. The disadvantage is that this is cost-prohibitive unless llm inference and tts are very cheap. This really only makes sense if you're running llm and tts inference locally and can model the GPU resources for this bot as "all you can eat."

Some interesting work that is also worth referencing, that I'm excited to see more of in the future:

- @krispHQ has excellent audio processing, noise reduction, and speech models. These are commercial models that require a license, but Krisp makes a consumer noise-cancelling and meeting assistant app that uses them. The Krisp app is definitely worth checking out if you haven't seen it before. Good audio processing really helps with transcription accuracy and phrase endpointing. I know the Krisp team is working on pushing the state-of-the-art forward in areas like phrase endpointing, too. (We license the Krisp models at @trydaily and you can enable many of the Krisp features in your apps if you build on our SDKs.)

- The team at @Vapi_AI is doing really cool phrase endpointing R&D. They have beta versions of experimental features that you can try out. They are terrific engineers building a great product.

- I've had some fun conversations with the team at @heytavus about using video to help with phrase endpointing. Humans use visual cues as part of how we know when to talk. Why shouldn't AI voice and video agents do the same thing?

Try out different stop_secs settings:

^[1]

Pipecat's phrase endpointing code:

^[2]

The Silero VAD repo:

^[3]