Evals for voice AI agents — Gemini 2.0 sets a new standard for reliable…

January 10, 2025

Evals for voice AI agents — Gemini 2.0 sets a new standard for reliable instruction following. The team at @covaldev does a lot of interesting work with synthetic data and evaluations for voice AI agents.

In this video @bnicholehopkins walks through an eval comparing Gemini 2.0 and GPT-4o. These two models are currently the best-performing LLMs for conversational voice AI.

The results are subtle, but the tldr is:
- Gemini follows the eval task instructions more reliably.
- GPT-4o completes all the steps in the eval tasks more often.

One thing that a test like this highlights is the need for task-specific evals. Models have different strengths. Often a slightly different approach to prompting is required to leverage a model's strengths and work around its weaknesses.

That means that if you're serious about building voice agents, you need to build your own evals. (Coval's tools can help you do that.)

For example, you might be using a larger context window than average, or defining more tools (functions) than average. So "off the shelf" evals might not actually tell you which model will perform best for you.

In general, the things you want to test in a functional eval — the building blocks for reliable voice agent performance — are:

1. Instruction following — does the LLM "remember" and correctly follow the instructions in your prompt, even as the context window gets longer during a multi-turn conversation.

2. Function calling — almost all voice agent workflows depend heavily on function calling. Does the LLM call the functions you define, when needed, with the correct arguments. Can the LLM call multiple functions at a time to perform complex operations?

3. Context awareness — can the LLM pull relevant information out of the context as the context length and the complexity of information in the context window grows? Imagine the metaphorical "needle in the haystack." Can the LLM find the needle when it needs to!

The performance of the best LLMs is improving rapidly. GPT-4o today is much better than GPT-4o was three months ago. Gemini 2.0 is a very impressive improvement over Gemini 1.5.

I'm particularly excited about new capabilities of Gemini 2.0 that we mostly haven't written evals for, yet, in the voice AI community: the ability to combine function calls flexibly, built-in code execution, built-in search.

Here is the @covaldev blog post with more details about their scripted evaluation framework and the results of this eval:

^[1]

Here's the paper that the @bnicholehopkins talks about in the video and blog post:

^[2]

Follow @covaldev for posts about evals, synthetic data, and voice AI testing.

https://t.co/R05taCBg3o

This is the 10th post in our series, 25 Multimodal Demos for 2025 — building with Gemini and Pipecat. Check back every day for more multimodal, conversational, and sometimes just fun demos and code examples.