Serverless WebRTC for Voice AI

April 14, 2025

Introducing the new `SmallWebRTCTransport` in Pipecat 0.0.63 ... 🧵

Voice AI agents use either WebSockets or WebRTC for realtime audio transport.

➡️ WebSockets are great for server-to-server use cases and for general prototyping.
➡️ WebRTC gives you the best latency and reliability for web and native mobile use cases.

WebRTC is a relatively complicated protocol. Using WebRTC in production typically means running clusters of specialized WebRTC servers (usually in multiple geographic regions).

We wanted to make it easier to use WebRTC while you're prototyping, or for small-scale, experimental, or customized deployments.

The SmallWebRTCTransport removes the need for a WebRTC server entirely. The transport sets up a direct peer-to-peer connection between the Python agent code and the client.
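
To make the "no WebRTC server" idea a bit more concrete, here's a toy sketch of the offer/answer handshake that every WebRTC connection starts with. It's plain Python with no WebRTC library, and the names (`SessionDescription`, `ToyPeer`, `signal_once`) are made up for illustration; none of this is Pipecat's API. The point is only that signaling is a single request/response exchange between the client and the agent process, after which media can flow directly between the two peers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionDescription:
    """Stand-in for an SDP blob (real SDP also carries codecs, ICE info, etc.)."""
    type: str  # "offer" or "answer"
    peer: str  # name of the peer that created it


@dataclass
class ToyPeer:
    """Hypothetical peer that only tracks local/remote descriptions."""
    name: str
    local: Optional[SessionDescription] = None
    remote: Optional[SessionDescription] = None

    def create_offer(self) -> SessionDescription:
        self.local = SessionDescription("offer", self.name)
        return self.local

    def accept_offer(self, offer: SessionDescription) -> SessionDescription:
        if offer.type != "offer":
            raise ValueError("expected an offer")
        self.remote = offer
        self.local = SessionDescription("answer", self.name)
        return self.local

    def accept_answer(self, answer: SessionDescription) -> None:
        if answer.type != "answer":
            raise ValueError("expected an answer")
        self.remote = answer

    @property
    def connected(self) -> bool:
        # Both sides need a local and a remote description.
        return self.local is not None and self.remote is not None


def signal_once(client: ToyPeer, agent: ToyPeer) -> None:
    """One round trip: the only step that needs a request/response channel."""
    offer = client.create_offer()       # browser or mobile client
    answer = agent.accept_offer(offer)  # Python agent process
    client.accept_answer(answer)
```

In a real setup the round trip in `signal_once` would typically be an HTTP request from the client to the agent process, and an actual WebRTC stack would handle ICE, encryption, and media on both ends.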

Server-side transport docs are here:

https://t.co/KlpehWVUgS

There are client SDK plugins for JavaScript/React, iOS, Android, and C++:

https://t.co/tqYLALUoVw

Here's super-simple example code: a minimal Python voice agent and a single-file JavaScript/HTML client.

https://t.co/DE3nFVJjXm

Here's a full starter kit that uses @GroqInc for the agent, the SmallWebRTCTransport when you're developing and testing locally, and Pipecat Cloud for hosting.

When you deploy this code to the cloud, it uses either the DailyWebRTCTransport (for web/app connections), or the Twilio WebSocketTransport (for phone calls).

https://t.co/GjFjHQzsoE
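
The local-versus-cloud switch can be sketched as a small selection function. This is a hypothetical illustration, not the starter kit's actual code: the `ENV` environment variable, the `channel` argument, and the `pick_transport` helper are all assumptions, and the function returns the transport labels used in this post as plain strings rather than constructing real transport objects.

```python
import os
from typing import Optional

# Labels as used in this post; not verified class names.
LOCAL_TRANSPORT = "SmallWebRTCTransport"
CLOUD_TRANSPORTS = {
    "web": "DailyWebRTCTransport",         # web/app connections
    "phone": "Twilio WebSocketTransport",  # phone calls
}


def pick_transport(channel: str, env: Optional[str] = None) -> str:
    """Return the transport label for a channel ("web" or "phone").

    `env` defaults to the (assumed) ENV environment variable, falling
    back to "local" when it is unset.
    """
    env = env or os.environ.get("ENV", "local")
    if env == "local":
        # Local development and testing: peer-to-peer, no WebRTC server.
        return LOCAL_TRANSPORT
    try:
        return CLOUD_TRANSPORTS[channel]
    except KeyError:
        raise ValueError(f"unknown channel: {channel!r}") from None
```

The agent pipeline itself stays the same in both cases; only the transport at the edges changes.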

If you've been doing WebRTC stuff for a long time, a few things will be obvious to you:

1⃣ This is a "back to the future" kind of thing. WebRTC's roots are as a peer-to-peer protocol. As WebRTC usage has grown, a lot of effort has gone into building really terrific WebRTC servers (SFUs) and infrastructure. But in a literal, technical sense, every WebRTC connection is actually a peer-to-peer connection.

2⃣ This approach is heavily influenced by the WebRTC community's work on WHIP. Thank you to @murillo, @_pion, and all the other people who put so much effort into developing and standardizing WHIP. (https://t.co/yyfnNRt5EJ)

A bunch of people have asked me why I'm excited about "serverless" WebRTC, given that I have been working on cloud WebRTC infrastructure for the last 15 years and co-founded a company that sells WebRTC cloud developer tools (@trydaily).

The honest answer is that I think it's a lot more interesting to be trying to figure out what the future looks like than to be limited by what you've built in the past.

But having said that, I do still think you need cloud infrastructure for most production voice AI use cases. At least for now!

See the discussion of network routing in the Voice AI Illustrated Guide for a deep dive into network infrastructure:

https://t.co/E0MyvpNCaA

Cloud services also offer useful things like observability tooling, dashboards, and integrated recording and transcription. Building these features for a peer-to-peer codebase like SmallWebRTCTransport is definitely possible. But nobody has done that yet.

Finally, I think a lot of interesting multimodal AI use cases are going to involve more than just one human and one "agent": games and social applications, copilots of various kinds, education applications. If you have more than two participants in a session, you need a WebRTC server.
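
A rough way to see why: without a server, the obvious alternative is a full mesh, where every participant connects directly to every other one. The sketch below counts connections and upstream copies under that assumption and compares it with routing through a server, where each participant sends one upstream copy (ignoring simulcast). The helper names are hypothetical and the numbers are simple arithmetic, not measurements.

```python
def mesh_connections(n: int) -> int:
    """Total peer-to-peer connections in a full mesh of n participants."""
    return n * (n - 1) // 2


def mesh_uplinks_per_peer(n: int) -> int:
    """Upstream copies of its media each participant sends in a full mesh."""
    return max(n - 1, 0)


def server_uplinks_per_peer(n: int) -> int:
    """Upstream copies each participant sends when a server forwards media."""
    return 1 if n > 0 else 0


for n in (2, 4, 8):
    print(n, mesh_connections(n), mesh_uplinks_per_peer(n), server_uplinks_per_peer(n))
```

With two participants the mesh is just the single peer-to-peer connection described above. With eight, every participant is encoding and uploading seven copies of its media.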