kwindla hultman kramer

So my weekend project has been working on a CoreML version of smart-turn v2

July 21, 2025

The bad news is I haven't managed to produce a performant, accurate CoreML conversion from the PyTorch model.

The good news is that the PyTorch MPS backend actually does a really good job with this model. CPU inference is ~600ms. MPS inference is ~110ms.
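If you want to reproduce a CPU vs MPS comparison like this on your own machine, here is a rough sketch of one way to time it. The `stand_in` network and the input shape are placeholders I made up for illustration, not the smart-turn architecture or its real input format, so the numbers it prints will not match the ones above. The one detail that matters for MPS is calling `torch.mps.synchronize()` before stopping the clock, since GPU work is queued asynchronously.

```python
import statistics
import time

import torch


def pick_device() -> torch.device:
    """Prefer Apple's MPS backend when it is available, else fall back to CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def median_latency_ms(model, example, device, warmup=3, iters=10):
    """Median forward-pass latency in milliseconds on the given device."""
    model = model.to(device).eval()
    example = example.to(device)
    timings = []
    with torch.no_grad():
        for i in range(warmup + iters):
            start = time.perf_counter()
            model(example)
            if device.type == "mps":
                # MPS kernels run asynchronously; wait for them to finish
                # so the measurement covers the actual compute.
                torch.mps.synchronize()
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            if i >= warmup:  # discard warmup iterations
                timings.append(elapsed_ms)
    return statistics.median(timings)


# Hypothetical stand-in network, NOT the smart-turn model.
stand_in = torch.nn.Sequential(
    torch.nn.Conv1d(1, 8, kernel_size=400, stride=320),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 1),
)
# Assumed input: one second of fake 16 kHz mono audio, shape (batch, channels, samples).
audio = torch.randn(1, 1, 16000)

cpu_ms = median_latency_ms(stand_in, audio, torch.device("cpu"))
best = pick_device()
best_ms = median_latency_ms(stand_in, audio, best)
print(f"cpu: {cpu_ms:.2f} ms, {best.type}: {best_ms:.2f} ms")
```

On a machine without MPS, `pick_device()` falls back to CPU and both numbers come from the same device. To test the real thing, swap `stand_in` and `audio` for the actual model and a real input.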

110ms is more than fast enough for local use cases, because inference here runs in parallel with transcription or greedy LLM inference.
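To make the "runs in parallel" point concrete, here is a toy sketch using `asyncio`. The two coroutines are fakes that just sleep: the 110ms figure is the MPS number above, and the 200ms transcription time is a made-up value for illustration. When both run concurrently, the wall-clock time is roughly the slower of the two, not the sum.

```python
import asyncio
import time

TURN_DETECT_S = 0.110  # MPS inference time quoted in the post
TRANSCRIBE_S = 0.200   # hypothetical transcription latency, for illustration only


async def fake_turn_detection() -> bool:
    """Stand-in for running the turn detection model."""
    await asyncio.sleep(TURN_DETECT_S)
    return True


async def fake_transcription() -> str:
    """Stand-in for a speech-to-text call."""
    await asyncio.sleep(TRANSCRIBE_S)
    return "hello"


async def handle_audio_chunk():
    """Run both tasks concurrently and report total wall-clock time."""
    start = time.perf_counter()
    turn_complete, text = await asyncio.gather(
        fake_turn_detection(), fake_transcription()
    )
    elapsed = time.perf_counter() - start
    return turn_complete, text, elapsed


turn_complete, text, elapsed = asyncio.run(handle_audio_chunk())
print(f"turn_complete={turn_complete} text={text!r} elapsed={elapsed * 1000:.0f} ms")
```

As long as turn detection finishes before the slower task, it adds essentially nothing to the end-to-end latency in this toy setup.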

kwindla@kwindla

Smart Turn v2: open source, native audio turn detection in 14 languages.

New checkpoint of the open source, open data, open training code, semantic VAD model on @huggingface, @FAL, and @pipecat_ai.

- 3x faster inference (12ms on an L40)
- 14 languages (13 more than v1, which was English-only)
- New synthetic data set `chirp_3_all` with ~163k audio samples
- 99% accuracy on held out `human_5_all` test data

Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, and mitigates the voice AI trade-off between unwanted turn latency vs the agent interrupting people before they are finished speaking.

Training scripts for both @modal_labs and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model!

Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too.

You can help by contributing data, doing architecture experiments, or cleaning open source data! Keep reading ...

Video from @kwindla's post

PRs for the smart-turn repo and @pipecat_ai supporting class:
- https://github.com/pipecat-ai/smart-turn/pull/22
- https://github.com/pipecat-ai/pipecat/pull/2224