April 9, 2024
Yesterday there was a thread on the @latentspacepod discord about what's changed in the voice+AI domain over the past few months.
I think three big things have changed since ~October last year.
First, as with everything in Generative AI, there's a proliferation of more and better models! Specifically, lower latency models for TTS that have good voice quality (like @DeepgramAI Aura) and improvements in latency for the most human-like models (@elevenlabsio and @play_ht). Deepgram is ~200ms time-to-first-byte, ElevenLabs and Fal are now ~400ms. (@OpenAI TTS is ~800ms, which I think is just outside the "acceptable" band for time-to-first-byte for most use cases.)
Image generation is so, so, so much faster. In October last year there were a few people trying to get down to sub-second generation times. Now you can do sub-100ms! Both @radamar and @FAL are accounts you should follow if you're interested in fast image generation.
Last October we built a story-teller demo with GPT-4 and DALL-E. The best we could do then was have images that lagged 5-6 seconds behind the story. Our updated version of that demo now has pretty great looking images (SDXL on @FAL) that don't lag the story at all.
Fast vision models are also here today. You can run Moondream (which is fantastic) locally or on Fal. QWEN-VL is great. Good/cheap/fast vision opens up really interesting new use cases.
Second, phrase endpointing and handling interruptions are sufficiently solved problems for now. We were all still trying to figure out the best way to do both of those things, back in October. Now we have good sample code for both in our open source framework. And platforms like @Vapi_AI provide production-grade services that implement really good endpointing and interruption handling.
It's possible to imagine doing even better/faster/more natural phrase endpointing — for example by training small, fast models specifically to handle endpointing. But we don't really need to do that for voice agents to work well for pretty much all the use cases people have thought of so far.
Third, tool calling works well enough to use as a core building block for voice bots. I've actually been pretty amazed at how significant this feels, in terms of making it possible to build conversational bots that interact with other systems. GPT-4 tool calling is tricky to use well, but very good once you've done the work on prompting and evals. I played with the @FireworksAI_HQ FireFunction-V1 model recently at an @aiengfoundation hackathon and was impressed.
@latentspacepod Here's a video from @chadbailey59 showing the possibilities of fast voice response + tool calling.
@latentspacepod @chadbailey59 Source code for that patient intake demo is here[1]
The open source framework for building multi-modal/interactive/real-time/voice AI things is here[2]
@latentspacepod @chadbailey59 My original blog post that started the thread on the Latent Space discord is here[3]
It's still a good place to start if you're just getting into building voice agents. But given how fast all things in generative AI are continually developing, it's overdue…