Voice AI turn taking is a solved problem

February 17, 2026

Voice AI turn taking is a solved problem.

The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.)

@mark_backman made a @pipecat_ai PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it.

The approach combines three layers of processing:

1. Voice activity detection, with a short (200ms) trigger.
2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed.
3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context.
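One way to picture how the three layers fit together is as a cascade, cheapest check first. This is a minimal sketch and not Pipecat's actual implementation: the gating order, the `TurnSignals` container, and the `should_respond` function are all assumptions made for illustration. Only the 200ms VAD trigger comes from the list above.

```python
from dataclasses import dataclass
from typing import Callable

VAD_TRIGGER_MS = 200  # the short VAD trigger from the list above


@dataclass
class TurnSignals:
    """Hypothetical container for the two audio-level signals."""
    silence_ms: int            # trailing silence reported by VAD
    audio_turn_complete: bool  # verdict of the native audio turn model


def should_respond(signals: TurnSignals,
                   llm_turn_complete: Callable[[], bool]) -> bool:
    """Gate a response through all three layers, cheapest first (assumed order)."""
    if signals.silence_ms < VAD_TRIGGER_MS:
        return False  # layer 1: not enough silence yet, user is still speaking
    if not signals.audio_turn_complete:
        return False  # layer 2: inflection or filler suggests more is coming
    return llm_turn_complete()  # layer 3: conversation context decides
```

In this sketch the LLM check is passed in as a callable so that it only runs when the two cheaper audio layers have already agreed the user has probably finished.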

None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year.

Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection.

Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response.

- ✓ means the agent should respond normally (immediately)
- ○ is a "short incomplete" - the agent should wait 5 seconds
- ◐ is a "long incomplete" - the agent should wait 10 seconds

The wait times, and the details of the prompt, are configurable, of course.
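To make the tagging idea concrete, here is a rough sketch of how a leading tag could be split off an LLM response and mapped to a wait time. It is not the code from Mark's mixin: `parse_tagged_response`, `TurnDecision`, and the fallback for an untagged response are hypothetical names and choices. The three characters and the default 5 and 10 second waits come from the list above.

```python
from typing import Dict, NamedTuple, Optional

COMPLETE = "✓"
SHORT_INCOMPLETE = "○"
LONG_INCOMPLETE = "◐"

# Default waits in seconds, taken from the list above; meant to be overridden.
DEFAULT_WAITS: Dict[str, float] = {
    COMPLETE: 0.0,
    SHORT_INCOMPLETE: 5.0,
    LONG_INCOMPLETE: 10.0,
}


class TurnDecision(NamedTuple):
    tag: Optional[str]   # the leading tag, or None if the model omitted it
    wait_seconds: float  # how long the agent should wait
    text: str            # the response with the tag stripped off


def parse_tagged_response(response: str,
                          waits: Dict[str, float] = DEFAULT_WAITS) -> TurnDecision:
    """Split a leading turn tag off an LLM response (hypothetical helper)."""
    stripped = response.lstrip()
    if stripped and stripped[0] in waits:
        tag = stripped[0]
        return TurnDecision(tag, waits[tag], stripped[1:].lstrip())
    # Assumption: an untagged response is treated as a normal, complete turn.
    return TurnDecision(None, 0.0, stripped)
```

Because the tag is the very first thing the model emits, a streaming pipeline could make this decision on the first token rather than waiting for the full response, which is where the near-zero latency comes from. Custom wait times can be passed in, for example `parse_tagged_response(text, {"✓": 0.0, "○": 3.0, "◐": 8.0})`.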

Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency.

Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful.

The LLM in the video is GPT-4.1. We've tested the prompt and single-token adherence with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open-weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.

Docs for this in Pipecat^[1]

Mark's original PR^[2]

The `UserTurnCompletionLLMServiceMixin` shipped in Pipecat 0.0.101.