May 1, 2025

Nice thread from @baseten with metrics and technical details about serving Qwen3 235B with SGLang.

244ms is a *very* good P50 TTFT for such a big model.

Note that this number is the server-side metric. For end-to-end TTFT you'll also need to factor in the network stack processing and transit time. So add ~10-100ms depending on how far away from the cluster your clients are.
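As a rough illustration of that arithmetic, here is a minimal sketch that adds the ~10-100ms network range to the 244ms server-side P50. The function and constant names are made up for this example, and a simple sum is an assumption; real network overhead varies per request.

```python
# Server-side P50 TTFT reported in the thread, in milliseconds.
SERVER_P50_TTFT_MS = 244

# Rough network overhead range from the note above (~10-100 ms),
# depending on how far clients are from the cluster.
NETWORK_OVERHEAD_MS = (10, 100)


def end_to_end_ttft_ms(server_ttft_ms: int, network_ms: int) -> int:
    """Estimate client-observed TTFT as server TTFT plus network overhead."""
    return server_ttft_ms + network_ms


best = end_to_end_ttft_ms(SERVER_P50_TTFT_MS, NETWORK_OVERHEAD_MS[0])
worst = end_to_end_ttft_ms(SERVER_P50_TTFT_MS, NETWORK_OVERHEAD_MS[1])
print(f"Estimated end-to-end P50 TTFT: {best}-{worst} ms")
```

So a nearby client would see roughly 254ms and a distant one roughly 344ms at the median.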

For conversational voice AI use cases, I'd like to see a smaller spread between P50 and P95. I'm 100% sure that's achievable here with different concurrency/batching tuning.
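To make "spread between P50 and P95" concrete, here is a small sketch that computes both percentiles with the nearest-rank method. The sample values are hypothetical and are not Baseten's measurements; the `percentile` helper is written for this example only.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Return the nearest-rank percentile of samples (pct in 0-100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical per-request TTFT samples in milliseconds (NOT real data).
ttft_ms = [210, 225, 238, 244, 251, 260, 275, 310, 420, 690]

p50 = percentile(ttft_ms, 50)
p95 = percentile(ttft_ms, 95)
spread = p95 - p50
print(f"P50={p50} ms, P95={p95} ms, spread={spread} ms")
```

For voice, the spread matters because a user hears the slow tail as an awkward pause, even when the median feels instant.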

Baseten (@baseten)

Early benchmarks of Qwen 3 with SGLang show promising initial results and key avenues for improvement.

We're seeing:
- Up to 76 TPS per user for real-time
- Up to 4600 total token throughput for batch
- 32 concurrent requests as a good balance for prod

Details in 🧵