June 12, 2025
100%. I’ve been talking to a lot of voice AI developers about this lately. You need basic evals for your conversation flow and tool use, AND you need to monitor in production for latency and API errors. Otherwise you are flying blind. Relatedly, you should not be using preview models in production. (The possible exception is if you have very, very good monitoring and evals. But you don’t, so don’t!)
I keep saying this - if you rely on an AI model, you need to continuously monitor it like any other critical IT system. Otherwise you're blind to any changes in the models, prompts, or any of the intermediary systems that might affect its behavior.
See also: https://t.co/e828uI2qR5
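To make "monitor in production for latency and API errors" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `TurnMonitor` name, the idea of recording one latency number per model or API call, and the alert thresholds (1500 ms p95, 2% error rate) are all hypothetical and not taken from any particular voice AI framework. A real deployment would more likely ship these numbers to an existing metrics system.

```python
from dataclasses import dataclass, field


@dataclass
class TurnMonitor:
    """Hypothetical in-memory monitor for per-call latency and API errors."""

    latencies_ms: list[float] = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms: float, ok: bool = True) -> None:
        """Record one model/API call: how long it took and whether it succeeded."""
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def error_rate(self) -> float:
        """Fraction of recorded calls that failed (0.0 if nothing recorded)."""
        total = len(self.latencies_ms)
        return self.errors / total if total else 0.0

    def p95_ms(self) -> float:
        """95th-percentile latency using the nearest-rank method."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        # Integer arithmetic for ceil(0.95 * n), avoiding float rounding.
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def alerts(self, max_p95_ms: float = 1500.0, max_error_rate: float = 0.02) -> list[str]:
        """Return human-readable alerts; the default thresholds are placeholders."""
        found = []
        if self.p95_ms() > max_p95_ms:
            found.append(f"p95 latency {self.p95_ms():.0f} ms exceeds {max_p95_ms:.0f} ms")
        if self.error_rate() > max_error_rate:
            found.append(f"error rate {self.error_rate():.1%} exceeds {max_error_rate:.1%}")
        return found
```

The idea is to call `record(...)` around every model or API request and check `alerts()` on a schedule. Tracking a tail percentile rather than an average is a deliberate choice here, since in a voice conversation it is the occasional slow turn that users notice.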
Voice AI evals are important. But it's hard to get started.
Here's a quick video clip of an unusual voice AI conversation turn that I happened to capture on video.
This is the kind of thing that you want traces for, so you can see how often it happens, and write evals against this behavior.
Backing up: I'm a big fan of everything @sh_reya and @HamelHusain have been doing to help people build good evals. This week I talked to their "AI Evals for Engineers and PMs" course about voice AI.
Borrowing Hamel's mantra, "look at your data," I called my presentation ...
Put your voice conversation data somewhere (so you can look at it (and listen to it)).
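One way to read that title in code: write every conversation turn to a file you can reopen later, then count how often a behavior shows up. The sketch below is only an illustration under stated assumptions. The `TurnTrace` fields, the JSON Lines file layout, and the `empty_reply` check are hypothetical, not the format from the presentation or from any specific tool, and `empty_reply` is a stand-in for whatever unusual behavior you actually spot in your own data.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable


@dataclass
class TurnTrace:
    """Hypothetical record of one conversation turn."""

    conversation_id: str
    turn_index: int
    user_transcript: str
    bot_transcript: str
    audio_path: str        # where the recorded audio lives, so you can listen to it
    tool_calls: list[str]  # names of tools the model called during this turn
    latency_ms: float


def append_trace(path: Path, trace: TurnTrace) -> None:
    """Append one turn to a JSON Lines file (one JSON object per line)."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(trace)) + "\n")


def load_traces(path: Path) -> list[TurnTrace]:
    """Read every stored turn back into TurnTrace objects."""
    with path.open(encoding="utf-8") as f:
        return [TurnTrace(**json.loads(line)) for line in f if line.strip()]


def rate(traces: list[TurnTrace], predicate: Callable[[TurnTrace], bool]) -> float:
    """Fraction of turns for which the predicate is true (0.0 for no turns)."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if predicate(t)) / len(traces)


def empty_reply(trace: TurnTrace) -> bool:
    """Example check (an assumption): the bot produced no text for this turn."""
    return not trace.bot_transcript.strip()
```

With traces on disk, `rate(load_traces(path), empty_reply)` tells you how often the behavior happens, and the same predicate can later serve as a simple eval that fails when the rate goes above a level you choose.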
