November 5, 2024
.@aconchillo and @mark_backman have been experimenting with using an LLM to determine whether it's the LLMs turn to talk in a real-time voice conversation.
You could call this semantically aware phrase endpointing, or llm-as-a-judge turn detection, or just LLMs all the way down.
It's working nicely, though there's definitely more prompt engineering possible here.
The @AnthropicAI release of Claude 3.5 Haiku today was fantastic timing for this, because using a small, fast model for the phrase endpointing classification is ideal.
Here's a demo video.
The architecture is:
1. Pipecat's very good core phrase endpointing implementation, which uses a specialized "voice activity detection" model's speech confidence score, plus a smoothed running average of audio input energy, plus a configurable non-speech silence interval.
2. Claude Haiku looking at the most recent assistant response and user message in the context, prompted to classify the user text as complete or not.
3. Claude Sonnet doing "greedy inference" every time a transcription fragment arrives.
4. Sonnet's output is gated until Haiku determines that the user has finished talking.
5. After another configurable user silence timeout (set to 5 seconds here) elapses, the output gate opens even if Haiku doesn't think the user is finished. This is a backstop in case the classification doesn't correctly identify the end-of-speech.
This use of the ParellelPipeline construct gives you the LLM-as-a-judge functionality for free, from a latency perspective.
It's also clear from the testing we've done so far with this approach that we could crank down the core VAD silence interval setting to further reduce average response latency. I just didn't bother to change the default, when recording this demo. (There's a cost implication to doing this, though, because you're running the greedy inference more often.)
One thing that surprised me quite a lot, while testing this today, is that the 5-second idle timer doesn't seem slow at all in real-world use. @aconchillo originally had this set to 3 seconds and I lengthened it.
Five seconds would be terribly slow as a setting for all phrase endpointing. (The pipecat default is 800ms.)
But because it doesn't fire very often, and usually fires only when you've stopped and started talking once or twice anyway (causing the "user text" that Haiku is trying to evaluate to be a bit of a mess), subjectively it doesn't feel long at all.
I expect that the time-to-first-token numbers you can see in the console, in the video above, won't be representative of Haiku's performance.
They're fine (400-550ms), but when I was testing Haiku earlier today it was consistently delivering sub-300ms TTFT.
I think people are very excited about Haiku and are hammering the Anthropic GPUs!
You can run this demo, try out "natural turn detection," and make improvements to the code. PR is here: https://t.co/XnPjKMn0di