Consumer research — LLM + real-time + voice and video

September 20, 2024

@Praveen_Maloo shows a great example here of a voice AI experience that's both scripted and open-ended.

People "follow scripts" all the time in our conversations. We ask a question. If we don't get an answer that's complete or that we understand, we ask follow-up questions. Or maybe we engage with that answer and then circle back later to the first question.

This is a big part of what makes conversations feel "human-like."

Large Language Models are good at this aspect of human-like conversation — capable of aiming towards goals while also being flexible.

But you need to do a few things to build production-quality scripted, open-ended LLM applications.

☀️ Iterating to get prompts right takes a fair amount of effort! All of us who build these kinds of things the first time use prompts that are too short, not specific enough, and generally just not great. Prompt engineering for complex use cases takes practice. Multi-turn, goal-oriented, human-like conversation is a complex use case.

☀️ Function calling (or structured data outputs) is important for these use cases. Function calling is quite good with the current generation of state-of-the-art LLMs, but the "control loop" for function calling is a little hard to wrap your mind around the first couple of times you write LLM+function calling code. And testing to make sure function calling works reliably means building good eval frameworks.

☀️ You need to build an orchestrator (some people call this an executor) to manage the conversation context, handle the function calls, and maybe modify your prompting on the fly based on conversation state. For real-time voice applications, an orchestrator is a non-trivial amount of code. The @pipecat_ai Open Source voice/multi-modal orchestrator exists because a bunch of us compared notes last year as we built real-time voice AI applications and realized we were solving the same problems over and over!

☀️ Guardrails that protect against prompt injection attacks, hallucinations, and inappropriate content are important for production applications.

☀️ Function calling and guardrails both add latency. Too much latency makes the experience not very human-like. See the follow-up tweets in @Praveen_Maloo's thread about this topic.

Praveen Maloo@Praveen_Maloo

So, I decided to have some fun and run a mock design/branding study using unSurvey while waiting for a coffee this morning. This time, we’re going back in time to the infamous 1985 New Coke design! 🥤👀

In case you didn’t know,@CocaCola made a huge move back in '85, replacing