December 19, 2024
Shockingly good voice agent performance from Gemini Multimodal Live in a noisy environment ...
This is post 3 in our series — 2⃣5⃣ demos heading into 2⃣0⃣2⃣5⃣ ᓚᘏᗢ Building multimodal AI with Pipecat and Gemini
Voice AI agents succeed or fail based on their real-world interruption handling performance.
The AI needs to stop talking whenever the user starts talking. This is tricky. Background noise, and even background speech, should not trigger an interruption.
@pipecat_ai's open source, state-of-the-art interruption handling implementation combines:
— A small "voice activity detection" AI model
— Baselining against a running average of audio volume
— Optionally, audio processing using the excellent @krispHQ audio processing models.
Krisp's models are specialized for background noise filtering and primary speaker isolation.
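To make the first two ingredients concrete, here is a minimal sketch of how a VAD score and a running-average volume baseline could be combined into one "should we interrupt?" decision. This is not Pipecat's actual code. The class name `InterruptionGate`, its parameters, and all thresholds are hypothetical, and the VAD confidence is assumed to come from some external model that returns a value between 0 and 1 per audio frame.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square volume of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


@dataclass
class InterruptionGate:
    """Hypothetical gate: interrupt only on loud, speech-like, sustained audio."""

    vad_threshold: float = 0.7   # assumed minimum VAD confidence for "speech"
    volume_ratio: float = 2.0    # frame must be this many times the baseline
    min_volume: float = 0.01     # absolute floor so silence never counts as loud
    smoothing: float = 0.05      # weight of each new frame in the running average
    start_frames: int = 3        # consecutive qualifying frames required
    baseline: float = 0.0        # running average of background volume
    _streak: int = 0

    def update(self, frame: Sequence[float], vad_confidence: float) -> bool:
        """Feed one frame plus its VAD score; return True if the bot should stop."""
        volume = rms(frame)
        loud = volume > self.min_volume and volume > self.baseline * self.volume_ratio
        speech = vad_confidence >= self.vad_threshold

        if loud and speech:
            self._streak += 1
        else:
            self._streak = 0
            # Only frames that did not qualify as foreground speech update the
            # baseline, so the user's own voice does not raise the noise floor.
            self.baseline += self.smoothing * (volume - self.baseline)

        return self._streak >= self.start_frames


if __name__ == "__main__":
    gate = InterruptionGate()
    noise = [0.05] * 160   # steady background hum
    voice = [0.5] * 160    # much louder foreground speaker
    for _ in range(50):
        gate.update(noise, vad_confidence=0.2)
    print([gate.update(voice, vad_confidence=0.9) for _ in range(3)])
```

The design idea in this sketch is that neither signal is trusted alone: quiet background speech can score high on the VAD but fails the volume test, and a loud bang passes the volume test but fails the VAD. Requiring a few consecutive qualifying frames filters out short blips.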
As you can see in the video, the performance of Gemini Multimodal Live's native audio input + Pipecat + Krisp is truly next-level.
(That's @chadbailey59 talking to Gemini in the video!)