Last week at the SF Voice AI Meetup, I moderated a panel about multi-modal…

May 11, 2026

Last week at the SF Voice AI Meetup, I moderated a panel about multi-modal model training, with Jagadeesh Balam who works on speech models at @NVIDIAAI, Fabian Seipel of @ai_coustics, and @code_brian from Tavus.

I always really enjoy the opportunity to hear from people working on models (small, large, text, audio, pixels, transformer-based, diffusion, etc.)!

Some notes:

- Brian said "latency is solved," if you're thinking about latency as a mechanical problem. Humans take ~700ms to think about things before they respond in conversation. Current STT->LLM->TTS pipelines can beat that. What's missing is the higher-level architecture for "thinking": queuing what to talk about next, deciding what to say first and how, tracking emotional tone, etc.

- Jagadeesh said that as we do more and more interesting things with the models, the bar for performance goes up. Transcription was "solved" for non-realtime use cases, but now voice agents need fast and accurate transcription of very tricky strings like email addresses and mixed alphanumeric account numbers. And for speech-to-speech models, we have to clear the bar of performing well in long, multi-turn conversations. Part of the challenge here is generating very good training data. "Data simulation for training is unsolved. If it were solved, all our model roadmaps would be done by now!" I appreciate this viewpoint, because I don't think we talk enough about the challenge of having large amounts of *exactly* the right training data.

- Fabian talked about how ai|coustics generates training data for very fast, very specialized audio models that improve the performance of voice agents. His team includes people who spend a lot of their time simulating room geometries, mic frequency responses, WebRTC processing artifacts, and many other things. He calls them "professional audio destroyers."
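
To make the "audio destroyer" idea concrete, here's a toy sketch of degrading a clean signal to create a training pair. This is my own illustration and not ai|coustics' pipeline: the function names, the random decaying impulse response standing in for a simulated room, and the white-noise step are all assumptions, and a real setup would model room geometry, mic responses, and codec artifacts far more carefully.

```python
import math
import random

def toy_impulse_response(length, decay, rng):
    """Hypothetical stand-in for a simulated room: decaying random echoes."""
    ir = [rng.uniform(-1.0, 1.0) * math.exp(-decay * n) for n in range(length)]
    ir[0] = 1.0  # direct path arrives first, at full strength
    return ir

def convolve(signal, ir):
    """Direct-form convolution; output has len(signal) + len(ir) - 1 samples."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to the requested signal-to-noise ratio."""
    power = sum(x * x for x in signal) / len(signal)
    sigma = math.sqrt(power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, sigma) for x in signal]

def degrade(clean, rng, ir_length=64, decay=0.05, snr_db=15.0):
    """Apply the toy 'room' and then noise; all defaults are illustrative."""
    reverberant = convolve(clean, toy_impulse_response(ir_length, decay, rng))
    return add_noise(reverberant, snr_db, rng)

# A 440 Hz tone stands in for clean speech (assumed 8 kHz sample rate).
rng = random.Random(0)
sample_rate = 8000
clean = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
noisy = degrade(clean, rng)
```

The (clean, noisy) pair is the kind of thing an enhancement model could train on: noisy in, clean out. Randomizing the parameters per example is what would give you variety.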

If you're interested in voice and realtime multi-modal AI, come hang out with us at YC on May 30th in SF.

Jagadeesh and other engineers from NVIDIA will be there to help you use Nemotron speech models and LLMs!

https://t.co/lnwtqyhLLZ

kwindla (@kwindla)

✨ Voice AI, open models, and next-generation evals hackathon at @ycombinator in SF on May 30th. ✨

We're co-hosting with @cekuraAi, and we've pulled in our friends at @NVIDIAAIDev, @AWS, and @twilio for expertise and mentoring.

We'll help you build state-of-the-art voice agents using:
- NVIDIA Nemotron models
- AWS SageMaker and Bedrock inference
- Twilio telephony
- Cekura evaluation tooling
- Pipecat orchestration and Pipecat Cloud agent hosting

Up for grabs:
- A guaranteed YC interview
- Special judges' prizes from NVIDIA, AWS, and Twilio for the most impactful and technically impressive projects

Join us to learn from engineers who built all the tools you're using, compare notes with other voice AI developers, and show off your ideas!

Space is limited. Apply below.