Talk to Cartesia speech-to-text about Cartesia speech-to-text

June 10, 2025

Talk to Cartesia speech-to-text about Cartesia speech-to-text.

Cartesia launched a streaming STT model today, called Ink-Whisper, that's optimized for realtime voice AI.

@pipecat_ai has launch-day support for this new model, so I figured I'd talk to the model about itself.

The voice pipeline here is:

Cartesia Ink-Whisper -> Google Gemini 2.0 Flash -> Cartesia Sonic 2

Full code and other useful links in 🧵

Github repo with this demo (run it yourself): https://t.co/EpsGLdo9vZ

The code follows a common strategy for setting up a voice agent knowledge base.

We insert the full text of the Cartesia Ink technical blog post into the system instruction that we pass to Gemini 2.0 Flash. And we embed the two tables from the blog post in the first context message, as images. Gemini is great at understanding tabular data embedded in images.

Here's that blog post. It has lots of good technical info about what the Cartesia team was aiming for, and accomplished, with Ink-Whisper.
https://t.co/zNzphcWaLf

Here's the Pipecat implementation code, if you're interested in how a streaming transcription model API integrates with the Pipecat realtime AI pipelines architecture:
https://t.co/fyNPpsrYIf

https://github.com/kwindla/cartesia-ink-demo ↩