May 9, 2026
This is a great question! There are not (yet) any speech-to-speech models that are workable bases for production fine tuning. But that's likely to change pretty soon.
To be a good starting point for fine tuning, a base model needs to already be reasonably good at multi-turn instruction following and tool calling.
My mental model, here, is that the fine tuning process needs to be able to "find", in the model weights, the conversation patterns in your data set. If the model is too weak, those patterns aren't there to find and emphasize.
In my experience fine tuning text models for multi-turn and voice agents, this means the model needs to score something like 85% on the aiewf-eval benchmark.
So today that's text models like Nemotron 3 Nano, Gemma 4 31b, Qwen 3.5 27b, GPT OSS 120b.
There aren't any open weights speech-to-speech models that are close to this bar, yet.
The best options today are Kyutai Moshi and NVIDIA Personaplex (which is a Moshi-family model). Moshi is a very nifty model architecture. Fine-tuning these is a great research project. I don't think you'll get to production level performance for most voice use cases, though, no matter how much fine-tuning (or even post-training) you do on these models.
There's a new NVIDIA Nemotron Voicechat model coming soon, though, that should be a great base model for voice agent fine-tuning. It's in early access. You can try out the demo ...
@kwindla @dan_jenkins what is a good speech-to-speech model to finetune on top of, does this exist, e.g. like how we finetune llama 3.1 8B?
Nemotron Voicechat demo[1]
aiewf-eval benchmark[2]