Quality, latency, cost: pick three

September 23, 2024

For conversational voice AI applications, the three most important voice model attributes are realism (quality), time to first audio byte (latency), and cost.

These days, when I talk to startup founders building real-time voice AI apps, I almost always recommend they use Cartesia Sonic as the voice component. Sonic performs very well on all three of those critical metrics.

The other thing that impresses me about Cartesia is how consistently the team adds features and makes improvements with every release: lower latency, support for more languages, word-level timestamps, and fixes for long-tail pronunciation and prosody issues.

I'm particularly obsessed with latency, and every day we collect a lot of data about the latency we see from all the partners we work with.

The Cartesia WebSocket API's time to first byte is consistently under 180ms, including network overhead. (And the P95 is actually really close to 180ms, too. If you run infrastructure at scale, you know how impressive that is!)
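If you haven't worked with percentile metrics before, here is a minimal sketch of how a median and a P95 can be computed from a batch of time-to-first-byte samples. It uses the nearest-rank method, which is only one of several common percentile definitions. The sample values are made up for illustration and are not real measurements from Cartesia or Daily. The `percentile` helper is a hypothetical name, not part of any vendor API.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Rank is 1-based, so subtract 1 to index into the sorted list.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical time-to-first-byte samples in milliseconds (illustrative only).
ttfb_ms = [142, 151, 148, 160, 155, 149, 171, 158, 146, 153,
           163, 150, 157, 144, 168, 152, 147, 159, 176, 154]

p50 = percentile(ttfb_ms, 50)
p95 = percentile(ttfb_ms, 95)
print(f"median: {p50}ms, P95: {p95}ms")
```

The reason a tight P95 matters is that the median hides the slow tail. In a conversation, the slow responses are the ones users notice, so a P95 that sits close to the typical value means almost every turn feels equally fast.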

We did a blog post with Cartesia about why we chose Sonic as the primary voice model when we launched Daily Bots last month. Feel free to read it if you're interested in additional notes on latency, features, and model architecture.

Or, just go straight to the thread below to play @JonPTaylor's homage to Metal Gear Solid, which uses custom Sonic voices to take you back to the glory days of PlayStation CD-ROM gaming.

Cartesia (@cartesia)

Daily Bots by @trydaily's launch last month saw hundreds of devs sign up instantly. We're thrilled to be their main voice provider, powering top multimodal agents.

Check out this Metal Gear-inspired demo showcasing Sonic's real-time gaming potential. Link to full story below