April 6, 2025
Vapi is great! Really good platform capabilites, good performance, great track record of reliability and enterprise-grade support.
We've delivered 500ms voice-to-voice for specialized use cases with stt->llm->tts.
You can host all the models right next to each other in the same cluster/AZ and optimize all the inference stacks for latency. That pushes the cost up. Most people are targeting $0.04-0.06/minute for voice AI, right now. Which is hard to do if you're optimizing primarily for latency. (You can do it with a balanced approach, using "standard" models and APIs, targeting 800-1200ms).
On the other hand, the speech-to-speech approach has two issues. First, there isn't a speech-to-speech model yet that is as good at instruction-following and function calling as the SOTA LLLMs in text mode. Second, for most real-world voice AI use cases you need an "orchestration" layer of some kind. You're generally doing context management and compression, changing the tools list, doing "guardrails" processing in parallel with the main conversation. The speech-to-speech APIs don't give you this level of control, yet.
@kwindla @GroqInc @pipecat_ai @twilio Thanks. This is similar to what I'm getting in Vapi. I feel like we need a single model thats speech-to-speech like Gemini live.