July 27, 2025
It’s not trivial to train a really good unified end-to-end audio model (projectors between the stages, etc.). The Ultravox work is impressive!
I do think native audio is the long-term future. But fusing the inference stages is a trade-off: you gain some (big) advantages in audio understanding and (potentially) latency, and you lose some things too.
Evals are harder. Observability is harder. Being able to pick and choose the most effective STT and TTS based on your own evals is sometimes very helpful.
I’m also hoping that we get true bidirectional streaming architectures that we can scale up, soon. I’m actually more excited about that than about anything else!
@kwindla I wonder how hard it is to frankenstein these models together so that it's one end-to-end model?
Kinda like is doing