December 23, 2024
Gemini 2.0 drops the beat.
Watch the video all the way through — I had four legit "no way it did that" reactions when @JonPTaylor sent this to me.
"It looks like it's only hitting on the first beat of each bar."
This is Jon collaborating with Gemini to create a song in @Ableton Live.
Jon is using the Multimodal Live API to stream audio and video to Gemini and have a conversation about the song he's creating.
What blows my mind is what's unlocked by the combination of:
- Conversational audio
- Video understanding
- Gemini's extensive knowledge base (it even knows, or can intuit, how to use Ableton Live)
- Gemini's core reasoning capabilities
Code for the UI is here. It's a complete starter kit for ultra-low-latency Multimodal Live API audio/video apps:
https://t.co/sGSCV4i47u
The app streams voice and video to a @pipecat_ai bot via WebRTC.
The Pipecat pipeline sends voice and video to the Gemini Multimodal Live API, transcribes all the audio, saves the conversation so you can come back to it later, and powers the text+audio chat interface.
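To make that flow concrete, here is a minimal, library-free sketch in plain Python. It is not the Pipecat API and not the code from the starter kit. Every name in it (`Frame`, `ConversationLog`, `transcribe`, `fake_model`, `run_pipeline`) is hypothetical, and the model call is a stub that just echoes a canned reply. The only point is the shape: media frames come in, audio gets transcribed and logged, the model answers, and the answer is logged too so the conversation can be restored later.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Frame:
    """One unit of media moving through the pipeline (hypothetical type)."""
    kind: str     # "audio" or "video"
    source: str   # "user" or "bot"
    payload: str  # stand-in for raw media; here it is already text


@dataclass
class ConversationLog:
    """In-memory stand-in for whatever store persists the conversation."""
    turns: List[dict] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.turns.append({"role": role, "text": text})


def transcribe(frame: Frame, log: ConversationLog) -> None:
    # Stand-in for speech-to-text: only audio frames produce transcript text.
    if frame.kind == "audio":
        log.add(frame.source, frame.payload)


def fake_model(frame: Frame) -> Optional[Frame]:
    # Stub for the realtime model call (assumption: it replies to user audio).
    if frame.kind == "audio" and frame.source == "user":
        return Frame("audio", "bot", f"(reply to: {frame.payload})")
    return None


def run_pipeline(frames: List[Frame], log: ConversationLog) -> List[Frame]:
    """Push each incoming frame through transcription and the model stub."""
    outputs = []
    for frame in frames:
        transcribe(frame, log)       # log what the user said
        reply = fake_model(frame)    # video frames get no reply in this stub
        if reply is not None:
            transcribe(reply, log)   # log what the bot said
            outputs.append(reply)
    return outputs


if __name__ == "__main__":
    log = ConversationLog()
    incoming = [
        Frame("video", "user", "screen capture of the DAW"),
        Frame("audio", "user", "why does the kick sound off?"),
    ]
    for out in run_pipeline(incoming, log):
        print(out.source, out.payload)
    print(log.turns)
```

In the real app the transport, transcription, and model stages are handled by WebRTC, Pipecat, and the Gemini Multimodal Live API respectively. The sketch only shows how those stages line up.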
More on why this architecture (and WebRTC in general) is useful here:
https://t.co/zKhlWiMrsE
This experiment was inspired by @shresbm using Multimodal Live as a photo editing helper:
https://t.co/BhgalYXyXF
This is post 6 in our series "25 demos for 2025": building multimodal AI apps with @googledevs Gemini and @pipecat_ai.