There's a lot of good work ramping up now around evals for product teams…

May 6, 2025

There's a lot of good work ramping up now around evals for product teams working on voice/multimodal conversational AI applications. This is a little different from evals for leaderboards, but related, I think.

Some questions that are top of mind for me right now:

- Do you need audio or can you just do evals on transcribed text? (I think you need audio.)
- What are the different needs for evals in different parts of product work: during development, offline evals of production voice agents, and integrated evals of voice agents (content guardrails, etc.)?
- Can we use simulated conversations to improve the "coverage" of our evals?
- How do we craft evals that exercise application-specific workflows? (Voice AI workflows are increasingly being represented as state machines or multi-agent systems.)
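
To make the last two questions a bit more concrete, here is a rough sketch of what "coverage" could mean when a workflow is modeled as a state machine: replay simulated conversations and count which transitions they actually exercise. Everything here is hypothetical (the states, the intents, the `TRANSITIONS` table, the function names), and it assumes user intents have already been extracted from each turn, which is its own hard problem for audio.

```python
# Hypothetical voice-agent workflow: (current state, user intent) -> next state.
TRANSITIONS = {
    ("greeting", "give_name"): "collect_dob",
    ("greeting", "ask_human"): "handoff",
    ("collect_dob", "give_dob"): "confirm",
    ("collect_dob", "ask_human"): "handoff",
    ("confirm", "yes"): "done",
    ("confirm", "no"): "collect_dob",
}


def run_conversation(intents, start="greeting"):
    """Replay a list of user intents; return the final state and transitions taken."""
    state = start
    taken = []
    for intent in intents:
        key = (state, intent)
        if key not in TRANSITIONS:
            break  # intent not handled in this state: stop the replay
        taken.append(key)
        state = TRANSITIONS[key]
    return state, taken


def transition_coverage(conversations):
    """Return the fraction of transitions exercised and the set never reached."""
    seen = set()
    for intents in conversations:
        _, taken = run_conversation(intents)
        seen.update(taken)
    return len(seen) / len(TRANSITIONS), set(TRANSITIONS) - seen


# Made-up simulated conversations, each reduced to a sequence of user intents.
SIMULATED = [
    ["give_name", "give_dob", "yes"],
    ["give_name", "give_dob", "no", "give_dob", "yes"],
    ["ask_human"],
]

coverage, missing = transition_coverage(SIMULATED)
print(f"transition coverage: {coverage:.0%}, never exercised: {missing}")
```

The `missing` set is the useful part: it tells you which paths through the workflow no simulated conversation has touched yet, which is where you would want to generate more test conversations.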

Some of the teams/people that I follow who are working on multimodal, multiturn evals: @covaldev, @freeplay_ai, @weights_biases, @tom_shapland.

clem 🤗 @ClementDelangue

What are you using to evaluate models or AI systems? So far we're building lighteval & leaderboards on the hub but still feels early & a lot more to build. What would be useful to you?