March 27, 2025
Smarter voice AI turn detection is a "2025 problem."
By which I mean: in 2024 all of us in the realtime, multimodal AI ecosystem spent most of our time working on relatively low-level things ...
➡️ basic turn detection using VAD
➡️ fast, reliable interruption handling
➡️ context management for multi-turn voice
➡️ reliable, instrumented, interruptible function calling
➡️ developer-friendly tooling for low-latency network transport (WebRTC, telephony integrations)
➡️ developer-friendly realtime, multimodal data "pipeline" abstractions
➡️ basic observability
The implementations for all of these in frameworks like @pipecat_ai are now pretty good! Lots of people are deploying voice AI to production at scale.
Now the priority is solving the next level of pain points — 2025 problems.
➡️ better turn detection
➡️ reliable scripting of complex realtime workflows
➡️ making devops for realtime AI easier
The new @OpenAI Semantic VAD is a big step forward for turn detection. 🎉🎉🎉
🧵 about improving turn detection ...
"Turn detection" just means deciding when a voice agent should talk. There are several ways to approach this problem.
1. Wait for a pause in user speech. Typically a small, specialized "voice activity detection" ML model is used to detect these pauses. A VAD model takes audio as input and outputs a classification — "speech" or "not-speech." This is what most voice AI agents use today.
2. Transcribe the audio and match on text patterns to classify segments of speech as "continuing" or "end of turn." You can do this with regular expressions or a small language model. This is usually used in combination with VAD, and can be an improvement over using VAD alone. But this approach can't capture the audio patterns in human speech that often indicate people are not finished talking.
3. Train a small audio classifier model to directly output "continuing" or "end of turn" confidence.
4. Do turn detection in the LLM itself. LLMs that natively understand audio can do turn detection, if you prompt them well. And new LLM architectures that can both accept audio streams as input and bidirectionally stream audio out don't need to rely on external VAD or turn detection at all.
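To make approaches 1 and 2 concrete, here's a minimal sketch in Python. It assumes you already have a per-frame "speech" / "not-speech" boolean from a VAD model, and it doesn't call any real VAD library. The names (`PauseTurnDetector`, `looks_unfinished`) and the threshold values are made up for illustration, not taken from Pipecat or any other framework.

```python
import re
from typing import Optional


class PauseTurnDetector:
    """Approach 1 (sketch): end the user's turn after a long enough pause.

    Feed it one VAD decision per audio frame. All thresholds are
    illustrative assumptions, not recommended production values.
    """

    def __init__(self, frame_ms: int = 20, start_ms: int = 200, stop_ms: int = 800):
        self.frame_ms = frame_ms  # duration of each audio frame
        self.start_ms = start_ms  # continuous speech needed to confirm a turn started
        self.stop_ms = stop_ms    # continuous silence needed to declare end of turn
        self._speech_ms = 0
        self._silence_ms = 0
        self._in_turn = False

    def update(self, is_speech: bool) -> Optional[str]:
        """Return "turn_started", "turn_ended", or None for this frame."""
        if is_speech:
            self._speech_ms += self.frame_ms
            self._silence_ms = 0
            if not self._in_turn and self._speech_ms >= self.start_ms:
                self._in_turn = True
                return "turn_started"
        else:
            self._silence_ms += self.frame_ms
            if not self._in_turn:
                # Ignore short noise blips before a turn is confirmed.
                self._speech_ms = 0
            elif self._silence_ms >= self.stop_ms:
                self._in_turn = False
                self._speech_ms = 0
                return "turn_ended"
        return None


# Approach 2 (sketch): a text pattern that suggests the user is mid-thought.
# The word list is a hypothetical example, not a tested heuristic.
_CONTINUING = re.compile(r"(\b(and|but|so|um|uh|because)|,)\s*$", re.IGNORECASE)


def looks_unfinished(transcript: str) -> bool:
    """True if the transcript ends on a filler word, conjunction, or comma."""
    return bool(_CONTINUING.search(transcript.strip()))
```

One way to combine the two: when the detector returns "turn_ended", check `looks_unfinished()` on the latest transcript and wait a bit longer before responding if it returns True. Like VAD alone, neither piece sees the audio cues (intonation, trailing off) that approaches 3 and 4 can pick up.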
Some links ...
Silero VAD is an open source, very good, and quite efficient VAD model. (It uses about 1/8th of a typical cloud virtual CPU.)
https://t.co/GBej9b9fnu
Here's the @pipecat_ai implementation of VAD, which can use Silero or other VAD models as plugins. If you are building voice AI agents from scratch without using a framework like Pipecat or an API like the OpenAI Realtime API, this source code is worth reading. Tuning how you use VAD is an important thing to get right.
https://t.co/nxo4N0tqtZ
@pbbakkum's overview of the new OpenAI Semantic VAD capability:
https://t.co/Wz98EQZkg3
The Pipecat community-driven, open source "smart turn" model project:
https://t.co/YbiYc7Y8VT