July 22, 2025
You don't need a WebRTC server for voice agents.
If you're deploying your own voice AI infrastructure, you should almost certainly be using the new(†) serverless WebRTC approach.
Serverless is much simpler, which translates to faster development, better scaling, and higher reliability. You'll have slightly lower latency, too, compared to doing a network hop through a (single zone) WebRTC server cluster.(‡)
More notes below ...
Here is a guide to serverless WebRTC that covers when it makes sense to opt for the serverless approach and when it makes sense to use a traditional WebRTC server.
https://t.co/fdzYXhCo3D
The tldr is that all of the benefits of using WebRTC servers for voice agents come from running a multi-region deployment with mesh networking. This is not something that you can do unless you devote a *lot* of engineering and devops resources to your WebRTC infrastructure. (No open source WebRTC codebase implements multi-region server management and mesh networking between server clusters.)
So your two choices are:
1. Deploy your own voice agent infrastructure and use serverless WebRTC, or
2. Use a commercial WebRTC cloud for your ultra-low-latency audio transport.
In 2025, both of these are good options!
A few notes:
- (†) "Serverless" WebRTC isn't actually new. It's a "back to the future" kind of thing! WebRTC has always supported direct connections between two peers. But for the last ten years or so, most WebRTC use cases needed servers, so most of the engineering effort in the WebRTC community went into building great WebRTC servers (SFUs) rather than writing code to support and evolve serverless approaches (peer-to-peer).
- (‡) Routing through a server adds a small amount of latency (10-100 ms) compared to a best-possible direct connection. It turns out that in real-world deployments, though, if you have a very large network footprint and optimized mesh routing across private network connections, then your P50 latency is going to be better than what you get over public Internet connections. (And your throughput will be higher and your jitter lower.) More on that here: https://t.co/RLOIMLAZRe
- If your use case involves multi-participant sessions (bigger than one human and one bot) or if you are doing video in addition to audio, you should use a WebRTC server.
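If it helps to see the rules of thumb above as code, here is a minimal sketch of the decision. Everything in it is an assumption made for illustration: the `Session` fields, the `choose_transport` function, and the three string labels are hypothetical names, not part of any WebRTC library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical description of one voice agent session."""
    humans: int        # human participants in the session
    bots: int          # voice agents in the session
    uses_video: bool   # True if the session carries video as well as audio
    self_hosted: bool  # True if you deploy your own voice agent infrastructure


def choose_transport(session: Session) -> str:
    """Return an illustrative label for the transport the notes above suggest."""
    # "Multi-participant" here means bigger than one human and one bot.
    multi_participant = session.humans + session.bots > 2
    if multi_participant or session.uses_video:
        # Multi-participant or video sessions: use a WebRTC server.
        return "webrtc-server"
    if session.self_hosted:
        # One human, one bot, audio only, own infrastructure: go serverless.
        return "serverless-webrtc"
    # Otherwise, hand audio transport to a commercial WebRTC cloud.
    return "commercial-webrtc-cloud"


if __name__ == "__main__":
    print(choose_transport(Session(humans=1, bots=1, uses_video=False, self_hosted=True)))
```

This only encodes the choices described in this post. A real decision would also weigh things like your team's devops capacity and where your users are located.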
For more on networking for voice agents in general, see the Voice AI Primer: https://t.co/ee78el9kRo