February 11, 2025
Long context is all you need?
It's surprisingly hard to build a really good RAG system for voice agents.
What if you just ... put your whole knowledge base into a Gemini 2.0 Flash Lite system instruction?
Tom and Adrian wrote code to implement and benchmark an approach they call Model Augmented Generation.
This approach pairs Gemini 2.0 Flash Lite (acting as a lookup engine) with Gemini 2.0 Flash (powering the voice agent conversation loop).
Check out @tom_shapland's thread for a detailed architecture write-up, example code, cost estimates, and thoughtful notes about LLM memory techniques.
tldr: results are very good; cost is very low.
(And no, long context isn't all you need. But for a lot of voice AI use cases, leveraging long context will give you better results than content chunking.)
@Google's Gemini 2.0 solved RAG for Voice AI. Put the KB in Gemini’s system prompt and have your agent make a tool call to Gemini.
Recall: ~100%
Latency: ~0.9 s
Cost: 300 q's / dy on 50 page KB. $26 / mo ($7 with prompt caching)
See thread for @pipecat_ai code and more info.