July 23, 2025
The team at @Speechmatics just shipped a really clean integration of realtime speaker diarization for voice agents. I've tinkered quite a bit with multi-speaker voice agent pipelines, and this is the best implementation I've seen so far.
Voice AI in 2025 is at a really interesting point.
A wide variety of companies are deploying voice agents at scale for a range of use cases. (Customer support. Outbound telephone calls for healthcare workflows. Answering the phone for restaurants and other small businesses. Teaching and tutoring. Phone screens. User research. And many more.)
But there are still a *lot* of interesting problems to solve and new components to build.
Speaker diarization means figuring out who is talking, and when. If your voice agent knows that multiple people are talking, and who said what, you can do lots of useful things and build new kinds of interactions:
- incorporate "side conversations" between people into the LLM's understanding of context
- ignore side conversations that the LLM shouldn't respond to
- (relatedly) do a better job overall with turn detection and selective response
For example, imagine a parent and child sitting together talking to an LLM tutor. The LLM can do a much better job guiding the lesson if the child and parent transcriptions are separate and properly marked.
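Here's a rough sketch of that idea in plain Python. It assumes the speech-to-text service hands you transcript segments that each carry a speaker label. The `S1`/`S2` labels, the `Segment` class, and the helper names are all made up for illustration; this is not the Speechmatics or Pipecat API, just one way the "who said what" information could be fed to an LLM and used for selective response.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # hypothetical diarization label, e.g. "S1"
    text: str


def merge_turns(segments):
    """Merge consecutive segments from the same speaker into a single turn."""
    turns = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            merged = turns[-1].text + " " + seg.text
            turns[-1] = Segment(seg.speaker, merged)
        else:
            turns.append(Segment(seg.speaker, seg.text))
    return turns


def format_for_llm(turns, names=None):
    """Render turns as speaker-tagged lines to place in the LLM context."""
    names = names or {}
    return "\n".join(
        f"<{names.get(t.speaker, t.speaker)}> {t.text}" for t in turns
    )


def should_respond(turns, addressed_speakers):
    """Respond only when the latest turn comes from a speaker the agent serves."""
    return bool(turns) and turns[-1].speaker in addressed_speakers


# Assumed example input: a child (S1) and a parent (S2) talking to a tutor.
segments = [
    Segment("S1", "What is"),
    Segment("S1", "seven times eight?"),
    Segment("S2", "Take your time, sweetie."),
    Segment("S1", "Is it 56?"),
]

turns = merge_turns(segments)
context = format_for_llm(turns, names={"S1": "child", "S2": "parent"})
print(context)
print(should_respond(turns, addressed_speakers={"S1"}))
```

In this sketch the parent's aside stays in the context, so the LLM can see it, but the agent only replies when the child has the last word. A real pipeline would also need to handle interruptions and overlapping speech, which this ignores.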
@uberboffin posted a great demo video, with source code, showing realtime speaker diarization in action.
The code is running on the ESP32 embedded hardware that a lot of us are having fun hacking on these days! (cc @_pion, @thorwebdev, and @aconchillo).
We played a game of Guess Who? using @Speechmatics diarization that knows who’s talking, running on a tiny ESP32 using WebRTC via @pipecat_ai from @trydaily.
Yes. Really. 😎
Matt Barty and I went up against “Humphrey”, trying to guess a mystery Brit ... 🎩
With diarization,
