February 4, 2026

I sat down with @zachk and @bnicholehopkins to talk about how we benchmark models for voice AI. Benchmarks are hard to do well, and good ones are really useful!

We covered what makes an LLM actually "intelligent" in a real-world voice conversation, the latency vs. intelligence trade-off, how speech-to-speech models compare to text-mode LLMs, infrastructure and full-stack challenges, and what we're all most focused on in 2026.

Open-source voice agent LLM benchmark: https://github.com/kwindla/aiewf-eval

Technical deep dive into voice agent benchmarking: https://www.daily.co/blog/benchmarking-llms-for-voice-agent-use-cases/