I listen to the @thursdai_pod every week

December 19, 2025

I listen to the @thursdai_pod every week. It's an opinionated roundup of all the new model releases, tools and platforms, and open source AI developments.

This morning was the Thursd/AI 2025 year in review. This was the year of agents, a year of accelerating progress, and a year in which it felt like we packed months' worth of progress into single weeks, multiple times.

From the voice AI perspective, here are the things that are top of mind for me about 2025:

1. The year voice agents hit mainstream adoption. We're seeing lots of voice agent growth in use cases like customer support, answering the phone for small businesses, market research, outbound calls to prepare patients for healthcare appointments, and many more.

2. The year of Google. From the Gemini 2 model releases in December last year, to the Gemini 3 model releases this month, Google has been on a tear. Fast, excellent models. The cheapest per-token intelligence cost of any frontier lab. Nano Banana image generation. Veo video. Good self-serve APIs and enterprise-quality inference on Google Cloud. Gemini 2.5 Flash took a lot of voice agent LLM developer mindshare from OpenAI/gpt-4o.

3. Many, many exciting releases of transcription and voice models from the leading audio model labs, the frontier labs, and new players. Deepgram, Speechmatics, Soniox, and AssemblyAI all have very good realtime transcription models now, with advanced features like built-in turn detection and speaker identification. Cartesia and ElevenLabs, both known for their text-to-speech models, also both released transcription models. Rime, Inworld, and MiniMax released text-to-speech models that are getting adoption in voice AI applications. And OpenAI and Google released interesting, steerable speech-to-text and text-to-speech models built on their respective LLM foundations (GPT-4 and Gemini). There are too many audio model releases every month to fully evaluate.

4. Speech-to-speech models made progress, but are still research models rather than production models, from the perspective of those of us building voice agents for enterprise use cases. We need to see improvements in function calling reliability, instruction following, and API maturity. It's clear, though, that speech-to-speech models are going to continue to improve and use cases will move over to speech-to-speech models as this happens.

5. Relatedly, we are starting to use multiple models together in parallel "inference loops" in our production voice agents. We can use fast models for the main voice pipeline, and slower models or dedicated tool-calling models in parallel. We can implement guardrails for content safety in this way, too. And we can route different parts of a conversation to different models, to improve cost, latency, and performance in flexible ways.

6. NVIDIA is ramping up open source work in LLMs and speech models. The NVIDIA Parakeet transcription model is the first open source model that can challenge the commercial models for voice agent use cases. The Nemotron models look like they will pick up the baton from the Llama 3 LLMs and become standard open components in research and production systems.

7. The best LLMs now saturate my hard, multi-turn benchmarks. In some ways, this is the biggest story of 2025 for me! Six months ago the best model for voice agents was arguably still gpt-4o -- a model that was more than a year old. And gpt-4o still made an average of 3 mistakes per 30-turn conversation in my private benchmarks. Now, GPT-5.1, Gemini 3 Flash, and Claude Sonnet 4.5 score perfectly on this benchmark. But ... right now all three of these models have time-to-first-token numbers that are too high for product use. All I want for Christmas is for Anthropic, Google, and OpenAI to offer inference tiers optimized for low latency.
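To make point 5 a bit more concrete, here is a minimal sketch of one conversation turn handled by a parallel "inference loop". Everything in it is an assumption for illustration: `fast_reply`, `guardrail_ok`, and `slow_tool_call` are hypothetical stand-ins that only simulate latency with `asyncio.sleep`, not calls to any real model API or framework.

```python
import asyncio
from typing import Optional

# Hypothetical stand-ins for real model calls. Names, latencies, and
# matching rules are assumptions made up for this sketch.
async def fast_reply(user_text: str) -> str:
    await asyncio.sleep(0.01)  # simulates a low-latency conversational model
    return f"reply to: {user_text}"

async def guardrail_ok(user_text: str) -> bool:
    await asyncio.sleep(0.01)  # simulates a small content-safety classifier
    return "forbidden" not in user_text.lower()

async def slow_tool_call(user_text: str) -> Optional[dict]:
    await asyncio.sleep(0.05)  # simulates a slower, dedicated tool-calling model
    if "appointment" in user_text.lower():
        return {"tool": "lookup_appointment", "args": {"query": user_text}}
    return None

async def handle_turn(user_text: str) -> dict:
    # Start all three inference calls at once instead of one after another.
    reply_task = asyncio.create_task(fast_reply(user_text))
    guard_task = asyncio.create_task(guardrail_ok(user_text))
    tool_task = asyncio.create_task(slow_tool_call(user_text))

    # The fast reply and the guardrail finish first; the tool model keeps running.
    reply, ok = await asyncio.gather(reply_task, guard_task)
    if not ok:
        tool_task.cancel()  # no point waiting on the slow model
        return {"say": "Sorry, I can't help with that.", "tool": None}

    # A real pipeline would start speaking the reply here, before the
    # tool result arrives. This sketch just waits for it.
    tool = await tool_task
    return {"say": reply, "tool": tool}

if __name__ == "__main__":
    print(asyncio.run(handle_turn("When is my appointment?")))
```

The point of the shape is that the user-facing latency is set by the fast model and the guardrail, while the slower model's work overlaps with them instead of adding to the wait.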

The full @thursdai_pod 2025 year in review stream: https://www.youtube.com/live/EO53hFlWlzE?t=726s