Gemini Multimodal Live + Android + WebRTC

January 7, 2025

Build Gemini voice and video apps with the @pipecat_ai open source SDK for Android.

The Android SDK supports WebSocket, WebRTC, and HTTP network transport options. Change one line of code to switch protocols.
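To make the "change one line" idea concrete, here is a minimal sketch of a pluggable transport design. It is written in Python for brevity, and every name in it (`Transport`, `WebSocketTransport`, `WebRTCTransport`, `VoiceClient`, the example URL) is hypothetical. This is not the Pipecat Android SDK's API, just an illustration of the pattern.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical transport interface; not the actual Pipecat SDK API."""

    @abstractmethod
    def connect(self, url: str) -> str:
        """Open a session and return a short description of it."""


class WebSocketTransport(Transport):
    def connect(self, url: str) -> str:
        return f"websocket -> {url}"


class WebRTCTransport(Transport):
    def connect(self, url: str) -> str:
        return f"webrtc -> {url}"


class VoiceClient:
    """App code depends only on the Transport interface."""

    def __init__(self, transport: Transport) -> None:
        self.transport = transport

    def start(self, url: str) -> str:
        return self.transport.connect(url)


# Switching protocols means changing only this one line.
client = VoiceClient(transport=WebRTCTransport())
print(client.start("https://example.invalid/session"))
```

Because the rest of the app only talks to `VoiceClient`, swapping `WebRTCTransport()` for `WebSocketTransport()` leaves everything else untouched.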

Here's a video demo — this is an Android native app that uses Gemini Multimodal Live voice and video.

Read the thread below for technical details and for links to repos and docs.

Here's a blog post that walks you through getting started with Gemini Multimodal Live and WebRTC on Android:

https://t.co/WtzakWWBTw

This SDK is part of the open source Pipecat ecosystem. There are SDKs for JavaScript, React, Android, iOS, Python, and C++.

The repo for the Pipecat Android SDK is here:

https://t.co/JfM6Yy6GB1

The Pipecat iOS SDK is here:

https://t.co/p3eQ6VYc3V

The Pipecat SDK for Gemini and WebRTC on the web (and React) is here:

https://t.co/5TXdt5l7rB

Contributions are welcome! If you want to write a new pluggable network transport for the SDK, check out this repo and README:

https://t.co/B2T7GbnRrU

Here's a full-featured multimodal chat application that demonstrates how to use the Gemini Multimodal Live WebSocket API, HTTP single-turn APIs, and WebRTC all in one app. (They all have their place for different use cases.)

https://t.co/sGSCV4i47u

Pipecat's SDKs for Web, React, Android, iOS, Python, and C++ are architecture-compatible.

Read the rest of the thread for more technical background ...

Some more technical background ...

If you're just starting out with voice AI, you might gravitate towards using a WebSocket library for networking. WebSockets are familiar, simple, and widely supported. They are great for server-to-server use cases and for use cases where latency is not a primary concern, and they are fine for prototyping and general hacking.

But WebSockets shouldn't be used in production for client-server, real-time media connections.

For production apps, you need to use WebRTC. WebRTC was designed from the ground up as *the* protocol for real-time media on the Internet.

The major problems with WebSockets for real-time media delivery to and from end-user devices are:

- WebSockets are built on TCP, so audio streams are subject to head-of-line blocking. TCP also automatically retransmits lost packets, even when those packets are delayed so much that they cannot be used for playout.

- The Opus audio codec used for WebRTC is tightly coupled to WebRTC's bandwidth estimation and packet pacing (congestion control) logic, making a WebRTC audio stream resilient to a wide range of real-world network behaviors that would cause a WebSocket connection to accumulate latency.

- The Opus audio codec has very good forward error correction, making the audio stream resilient to relatively high amounts of packet loss. (This only helps you if your network transport can drop late-arriving packets and doesn't do head-of-line blocking, though.)

- Audio sent and received over WebRTC is automatically time-stamped, so both playout and interruption logic are trivial. These are harder to get right for all corner cases when using WebSockets.

- WebRTC includes hooks for detailed performance and media quality statistics. A good WebRTC platform will give you detailed dashboards and analytics for both aggregate and individual session statistics that are specific to audio and video. This level of observability is somewhere between very hard and impossible to build for WebSockets.

- WebSocket reconnection logic is very hard to implement robustly. You will have to build a ping/ack framework (or fully test and understand the framework that your WebSocket library provides). TCP timeouts and connection events behave differently on different platforms.
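To see why head-of-line blocking (the first bullet above) matters for audio, here is a toy simulation. It is a deliberately simplified model, not real TCP or WebRTC behavior. The 20 ms packet interval, 40 ms network delay, 200 ms retransmission penalty, and 100 ms playout budget are all assumed numbers chosen for illustration.

```python
FRAME_MS = 20            # assumed: one audio packet every 20 ms
NETWORK_DELAY_MS = 40    # assumed one-way network delay
RETRANSMIT_MS = 200      # assumed extra delay when a lost packet is resent
PLAYOUT_BUDGET_MS = 100  # assumed time a packet has before it is too late to play


def arrival_times(num_packets, lost):
    """Time each packet reaches the receiver; lost packets arrive after a resend."""
    times = []
    for seq in range(num_packets):
        t = seq * FRAME_MS + NETWORK_DELAY_MS
        if seq in lost:
            t += RETRANSMIT_MS
        times.append(t)
    return times


def in_order_delivery(arrivals):
    """TCP-like: a packet reaches the app only after every earlier packet has."""
    delivered, latest = [], 0
    for t in arrivals:
        latest = max(latest, t)
        delivered.append(latest)
    return delivered


def late_packets(delivered):
    """Sequence numbers that miss their playout deadline."""
    return [
        seq
        for seq, t in enumerate(delivered)
        if t > seq * FRAME_MS + PLAYOUT_BUDGET_MS
    ]


arrivals = arrival_times(10, lost={2})
print(late_packets(in_order_delivery(arrivals)))  # [2, 3, 4, 5, 6, 7, 8]
print(late_packets(arrivals))                     # [2]
```

With in-order delivery, one lost packet makes seven packets miss their deadline, because everything behind it waits for the resend. A transport that hands packets to the app as they arrive loses only the one packet, which is the kind of gap that forward error correction is designed to cover.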

Finally, good WebRTC implementations today come with very good echo cancellation, noise reduction, and automatic gain control. You will likely need to figure out how to stitch this audio processing into an app that uses WebSockets.

In addition, long-haul public Internet routes are problematic for latency and real-time media reliability, no matter what the underlying network protocol is. So if your end-users are a significant distance from the servers hosting the model, it's important to try to connect the user to a media router as close to them as possible. Beyond that first "edge" connection, you can then use a more efficient backbone route. A good WebRTC platform will do this for you automatically.
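As a rough sketch of what "connect to the closest media router" can mean in practice, the snippet below picks the edge region with the lowest median round-trip time. The region names and RTT samples are made up for illustration, and `pick_edge` is a hypothetical helper. Real platforms use more signals than this.

```python
from statistics import median


def pick_edge(probes):
    """Return the region with the lowest median RTT (in milliseconds).

    Using the median rather than the minimum or mean makes the choice
    less sensitive to a single unusually slow or fast probe.
    """
    return min(probes, key=lambda region: median(probes[region]))


# Illustrative, made-up RTT samples in milliseconds.
probes = {
    "edge-a": [180, 175, 190],
    "edge-b": [25, 32, 28],
    "edge-c": [95, 250, 90],
}
print(pick_edge(probes))  # edge-b
```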

If you're interested in network protocols designed for sending media, here's a technical overview of RTMP, HLS, and WebRTC:

https://t.co/bZETYoGrZU

For a deep dive into WebRTC edge and mesh routing, here's a long post about Daily's global WebRTC infrastructure:

https://t.co/RLOIMLBxGM

This is post 8 in our series of 25 multimodal demos for 2025 — Building with Gemini and Pipecat. Check back for more posts every day!

https://www.daily.co/blog/build-a-voice-agent-for-android-with-gemini-multimodal-live/