← kwindla hultman kramer

Multilingual voice AI tutor using the new @cartesia_ai Sonic-2 voice model and…

March 11, 2025

Multilingual voice AI tutor using the new @cartesia_ai Sonic-2 voice model and @GoogleDeepMind Gemini 2.0 Flash native audio understanding ...

Here's the code. It's a single Python file.

https://t.co/CGMZck6M3c

Congratulations to the team at @cartesia_ai on the new Sonic-2 model (and on the Series A funding announcement)!

So much hard work and experience goes into training these models. I was impressed by Cartesia's speech model when they first launched 10 months ago, and even more impressed with the constant improvements they've shipped since then.

The tech stack here is:

- A standard Cartesia voice ("Sophie"), localized to have a French accent in the Cartesia dashboard.

- Gemini 2.0 Flash. I'm using Gemini's native audio input, which handles multiple languages really nicely. I don't actually speak French in this video recording (I'm not that brave), but when I do try to speak French, Gemini understands and can even critique my pronunciation.

- A Gemini function to control the Sonic-2 "speed" parameter, so I can ask the tutor to speak more slowly or more quickly!

- 300 lines of @pipecat_ai code to implement the voice AI logic

More thoughts on Sonic-2 ...

Cartesia has a loyal following among voice AI developers because their models:

- are fast,
- follow the context of a text transcript well when they speak,
- are served by extremely reliable infrastructure with excellent metrics for median and p95 latency,
- offer a lot of features (voice cloning, speed control),
- and are priced affordably.

The new Sonic-2 models are a general improvement across the board. But I'm particularly impressed with this multilingual capability. In the past, with all voice models, I've had to set up multiple output pipelines to get good performance when mixing languages. This is complicated and hard to make really reliable.

Sonic-2 is completely happy to mix French and English together. Generation is a little more consistent if I actually tell Cartesia the output language for each generation. But, as you can hear in the video, it's really quite good even though I'm specifying "fr-FR" for every voice output here.

Here's what the Cartesia team says about Sonic-2 ...

⚡️ the fastest model on the market today with just 90ms of latency. We also released a smaller, faster version of this model, Sonic Turbo, with just 40ms of latency!
🎯 the most controllable model on the market – excellent transcript following (especially with long, complex transcripts), speed controls, spell tags, and more
🌀 the best at cloning with more than 1.5x as many blinded raters preferring clones on this model over the next leading provider.
📈 the most reliable - we have the most reliable serving stack for voice generation, with 99.9% uptime and the fastest P90 latencies across the globe.

Blog post here: https://t.co/uJKi5876ut

  1. https://gist.github.com/kwindla/d80b881557d0851eea5d2745ed2583cd