So my weekend project has been a CoreML version of smart-turn v2.

Smart Turn v2: open source, native audio turn detection in 14 languages.

New checkpoint of the open source, open data, open training code, semantic VAD model on @huggingface, @FAL, and @pipecat_ai.

- 3x faster inference (12ms on an L40)
- 14 languages (13 more than v1, which was English-only)
- New synthetic data set `chirp_3_all` with ~163k audio samples
- 99% accuracy on held-out `human_5_all` test data

Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, which eases the voice AI trade-off between unwanted turn latency and the agent interrupting people before they have finished speaking.
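To make that trade-off concrete, here is a minimal sketch of how an application might combine a silence timer with a turn-completion probability. It is not the smart-turn or Pipecat API: `TurnConfig`, `should_end_turn`, and every threshold value are hypothetical, and the probability is assumed to come from whatever model you run.

```python
from dataclasses import dataclass


@dataclass
class TurnConfig:
    # All values are illustrative assumptions, not smart-turn defaults.
    completion_threshold: float = 0.5  # probability at which the turn counts as complete
    min_silence_ms: int = 200          # never end the turn before this much silence
    max_silence_ms: int = 3000         # always end the turn after this much silence


def should_end_turn(silence_ms: int, completion_prob: float,
                    cfg: TurnConfig = TurnConfig()) -> bool:
    """Return True when the agent may start speaking."""
    if silence_ms < cfg.min_silence_ms:
        # Too early: a short pause is likely mid-sentence.
        return False
    if silence_ms >= cfg.max_silence_ms:
        # Fallback: the user has been quiet long enough regardless of the model.
        return True
    # In between, defer to the model's estimate that the turn is finished.
    return completion_prob >= cfg.completion_threshold
```

Lowering `min_silence_ms` or `completion_threshold` makes the agent respond faster but interrupt more often. Raising them does the opposite.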

Training scripts for both @modal_labs and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model!

Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too.

You can help by contributing data, doing architecture experiments, or cleaning open source data! Keep reading ...