New turn detection model for voice agents from the excellent team at @krispHQ

August 5, 2025

Fast, accurate turn detection is one of the "hard problems" for all of us building voice AI right now.

Many, many voice agents use Krisp's background voice cancellation models to improve transcription accuracy and reduce unintended interruptions. It's great to see the Krisp team working on turn detection, too, and offering customers a really good, native audio turn detection model.

The launch blog post is a good read for anyone interested in voice AI. It covers:
- approaches to turn taking
- the importance of latency
- building a test dataset and analyzing model performance
- comparison with the open source, native audio @pipecat_ai smart-turn model
- roadmap: fused text/audio, handling backchannel utterances, and more

I work on the open source smart-turn model, and (unsurprisingly) I would have set up the comparison differently. It's really interesting to see the Krisp team's focus on training a robust confidence metric with a strong temporal component.

We did some experiments in that direction in an early version of the open source model. We achieved a Pareto-better (by our measurements) latency/accuracy trade-off by:

1. Training the model with a goal of over-fitting to a bi-modal classification and not trying to map confidence to a useful range.
2. In practice, biasing towards false positives. (Because latency is so, so important in voice AI.)

You can see the difference in approach in the side-by-side video comparison in the Krisp launch post. The two models are configured so that the response time is frequently more than 3 seconds for a "positive" decision. In production use cases, the open source model would never be configured that way. The open source model was trained with the goal of accommodating a small percentage of false positives in order to achieve typical turn decision times of <400ms!
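To make the "bias towards false positives" idea concrete, here is a minimal sketch in Python. It is not code from smart-turn or from Krisp. The scores, labels, and both threshold values are made up for illustration, and `is_turn_complete` and `error_rates` are hypothetical helper names. The assumption is a model that outputs a near-bimodal "turn complete" score between 0 and 1, so the only tuning knob is where the decision threshold sits.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    false_positive: float  # share of unfinished turns wrongly flagged as complete
    false_negative: float  # share of finished turns the classifier missed


def is_turn_complete(score: float, threshold: float) -> bool:
    """Binary decision on a single model score (hypothetical helper)."""
    return score >= threshold


def error_rates(scores, labels, threshold) -> Rates:
    """Compute false positive and false negative rates at a given threshold."""
    pairs = list(zip(scores, labels))
    fp = sum(1 for s, done in pairs if not done and is_turn_complete(s, threshold))
    fn = sum(1 for s, done in pairs if done and not is_turn_complete(s, threshold))
    negatives = sum(1 for _, done in pairs if not done)
    positives = sum(1 for _, done in pairs if done)
    return Rates(fp / negatives, fn / positives)


# Made-up example data: scores cluster near 0 or 1, with a couple of
# ambiguous cases in the middle. True means the speaker really finished.
scores = [0.02, 0.05, 0.40, 0.97, 0.99, 0.35, 0.95, 0.03]
labels = [False, False, False, True, True, True, True, False]

balanced = error_rates(scores, labels, threshold=0.5)  # illustrative value
biased = error_rates(scores, labels, threshold=0.3)    # illustrative value

print("threshold 0.5:", balanced)
print("threshold 0.3:", biased)
```

On this toy data, lowering the threshold converts the one missed end of turn into one early trigger. In a voice agent, a missed end of turn usually means waiting longer before responding, which is why a small false positive rate can be the better trade when latency matters most.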

But having a confidence value that is a useful input to a contextual classification decision is definitely valuable. I think we can continue to improve the training data sets for the open source smart-turn model and achieve that in future checkpoints.

I'm really looking forward to continuing the work on this challenging problem and sharing ideas with other researchers. And to offering the Krisp model as an option for Pipecat Cloud customers.

VIVA SDK turn-taking model from @krispHQ launch blog post: https://krisp.ai/blog/turn-taking-for-voice-ai/