March 12, 2025
Yes! Kokoro is quite good. And, as you say, it's a small enough model to run locally on a wide variety of devices. 82M parameters. It's shockingly small for how good it is.
Whisper is fantastic. I use it for most things I build where conversational voice-to-voice latency is not a critical design goal.
For realtime, conversational voice, whisper has two drawbacks:
1. Anything smaller than whisper-large-v3-turbo isn't good enough to use for input to a voice agent. turbo is an 800M-parameter bf16 model. I think that's too big to run on "the average person's" mobile device or web browser. So if you run whisper on-device today, you're targeting power users.
2. The whisper architecture wasn't designed for low latency. You can optimize it to make it very fast. But it's very hard to optimize whisper's time-to-first-token. All of the open source whisper derivatives I've tested have a time-to-first-token of 800ms or more.
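To put rough numbers on both points, here's a minimal Python sketch. It's an illustration under stated assumptions, not a benchmark: `weight_size_gb`, `time_to_first_token`, and `fake_transcriber` are hypothetical helpers I made up for this example, the size estimate only counts raw weights (no activations, no runtime overhead), and the fp32 precision for Kokoro is an assumption chosen as a conservative upper bound.

```python
import time
from typing import Iterable, Iterator, Tuple

# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_size_gb(num_params: int, dtype: str = "bf16") -> float:
    """Approximate size of the raw weights in GB (10^9 bytes).

    Ignores activations, caches, and runtime overhead, so real memory
    use will be higher than this.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first item arrives, that first item).

    `stream` is any iterable that yields tokens or text chunks as they
    are produced, for example a streaming transcription generator.
    """
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first


def fake_transcriber(delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming ASR model.

    Sleeps before yielding the first chunk to simulate model latency.
    Swap in a real streaming transcriber to measure actual numbers.
    """
    time.sleep(delay_s)
    yield "hello"
    yield "world"


if __name__ == "__main__":
    # ~800M parameters at 2 bytes each is about 1.6 GB of weights alone.
    print(f"turbo (bf16):  {weight_size_gb(800_000_000, 'bf16'):.2f} GB")
    # 82M parameters, assuming fp32 storage as an upper bound.
    print(f"Kokoro (fp32): {weight_size_gb(82_000_000, 'fp32'):.2f} GB")

    ttft, first = time_to_first_token(fake_transcriber())
    print(f"first chunk {first!r} after {ttft * 1000:.0f} ms")
```

The timer starts before the first `next()` call, so it captures everything the model does before emitting anything, which is the number that matters for conversational latency.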
Always interested in other thoughts, ideas, and experiences with these tools. What do you think? You've built some really cool stuff!
@kwindla @vokaysh @tavus @withdelphi Have you given Kokoro a try? It’s really fast to run in the browser using WebGPU, removing the network latency altogether. And you can also use Whisper in the browser for ASR.