Voice AI evals are important

June 11, 2025

Voice AI evals are important. But it's hard to get started.

Here's a quick video clip of an unusual voice AI conversation turn that I happened to capture on video.

This is the kind of thing that you want traces for, so you can see how often it happens, and write evals against this behavior.

Backing up: I'm a big fan of everything @sh_reya and @HamelHusain have been doing to help people build good evals. This week I talked to their "AI Evals for Engineers and PMs" course about voice AI.

Borrowing Hamel's mantra, "look at your data," I called my presentation ...

Put your voice conversation data somewhere (so you can look at it (and listen to it)).

As always, doing a talk was good motivation to clean up code I had floating around, vibe code some new bits and pieces, and craft a useful README!

Here's the repo I put together for the session:
https://t.co/ImuWBIDh43

Code in the repo shows:
1. How to add open telemetry traces to a voice agent (using @langfuse).
2. How to write code to save a custom selection of conversation turn data and metrics to any simple storage target. In this case I used sqlite.
3. The utility of vibe coding quick eval scripts to look at saved data in specific ways.

Here's the full video from Monday's session: https://t.co/gfbRjJez0G

And the sign-up page for the next cohort of Hamel's and Shreya's evals course:
https://t.co/w6dt2TXpGE

** I recommend building on an evals/ops platform, to help you organize and scale your evals. **

The teams that I've worked with most closely are (in alphabetical order): @covaldev, @datadoghq, @freeplay_ai, @langfuse , and @weights_biases. All these teams have built @pipecat_ai integrations for their platforms.

(Also, a big thank you to @nvidia for writing the original open telemetry code for Pipecat.)

But ... I get a lot of value out of also writing my own hands-on data spelunking code. I've heard Hamel recommend this, too: get your hands dirty with the data "locally." Use spreadsheets, notebooks, etc.

Writing your own rough, simple eval tooling helps you wrap your arms and your mind around the problems that you want your evals to solve.

There's no world in which tools just make evals that work for your use case and agents. Tools help you create robust, maintainable evals. But you have to build up both general intuitions about doing good evals, and specific intuitions about your particular problem domain.

Here's the conversation history for the strange LLM response in the video at the top of this thread.

https://t.co/l9VSkuRS5O

There was a mis-transcription. The STT model rendered "otel" as "hotel".

> Let's check and see if we've got your traces in hotel.

But we still would not have expected this response:

> It sounds like we're on an intriguing mission! I'm going to scan through the grand chandeliers and ornate carpets of the hotel for traces. Be right back with the results!

One thing to dig into is whether an initial soft refusal to follow a precise instruction in the first conversation turn increases the likelihood of this kind of unusual response later in the session.

Based on my experience with LLMs, the answer is yes. But if this were a real application running in production, I'd want to quantify and characterize that.

In this case, the first turn was:

{
"role": "user",
"content": "Say the exact phrase 'I am here and ready to help!'"
},
{
"role": "assistant",
"content": "I'm right here and ready to help you with anything you need!"
},

I did test how often this soft refusal happens with this exact prompt. The answer, the day I tested, was 5% of the time. (See the repo above for the code.)