March 11, 2024
This weekend in one of the AI engineer group chats I'm in, we had an interesting discussion about self-hosting the components of voice/conversational AI apps as compared to using SaaS services.
This is a generally interesting topic, and I've had this (new) AI infra conversation a lot recently. I've also had a similar conversation many, many times over the years with a bunch of different friends and customers about the economics and trade-offs of managing your own WebRTC servers vs using a service like @trydaily.
Here's what I posted about voice bots:
We’ve done a lot of cost/performance analysis here for the voice+AI apps we are helping our big customers roll out. The main issue with self-hosting right now is that open source models aren’t quite there yet for all three core functions: STT, LLM, and TTS. Whisper-large is great, but getting very low latency is non-trivial. Mixtral and Llama-2 are good, but not as good as GPT-4 for open-ended conversation. ElevenLabs and PlayHT voices are much better than open source voice models, and Deepgram Aura is good and crazy fast.
You can definitely build for some use cases with self-hosted components. And I agree with you that a big long-term trend will be towards open source and also more models running on device.
It’s worth noting that the economics of self-hosting are tough. We would have to be doing a massive amount of Whisper transcription (hacked up to be real-time) before we’d save money over using @DeepgramAI. Similarly, we did the cost exercise for self-hosting custom fine-tunes of Llama-2 70B for ambient scribing/clinical notes (healthcare). Even with very optimistic growth projections, the numbers say we should just stick with GPT-4 this year. 😁
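To make the break-even reasoning above concrete, here is a minimal Python sketch. It assumes a very simple model: self-hosting has a fixed monthly cost (GPUs plus engineering time) and an optional small per-minute cost, while the hosted API charges a flat per-minute price. The function name and every dollar figure are made-up placeholders for illustration, not the real figures from this analysis and not any vendor's actual pricing.

```python
import math


def breakeven_minutes_per_month(
    fixed_monthly_cost: float,
    api_price_per_minute: float,
    self_hosted_price_per_minute: float = 0.0,
) -> float:
    """Monthly audio minutes at which self-hosting costs the same as the API.

    Hypothetical model: self_hosted = fixed + minutes * self_hosted_rate,
    api = minutes * api_rate. Below the returned volume the API is cheaper.
    """
    saving_per_minute = api_price_per_minute - self_hosted_price_per_minute
    if saving_per_minute <= 0:
        # Self-hosting never catches up if it is not cheaper per minute.
        return math.inf
    return fixed_monthly_cost / saving_per_minute


if __name__ == "__main__":
    # Placeholder inputs only: $10,000/month of GPUs and ops time,
    # versus a hosted API at $0.005 per audio minute.
    minutes = breakeven_minutes_per_month(10_000.0, 0.005)
    print(f"Break-even at roughly {minutes:,.0f} minutes per month")
```

Plugging in your own fixed costs and per-minute prices shows how much volume you need before self-hosting pays off. A real analysis would also account for utilization, redundancy, and the engineering time needed to hit latency targets.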
All this stuff is evolving quickly, though. Exciting times!