December 18, 2024
"Yeah the trailer does show a gladiator riding a rhinoceros, which is not something you see every day."
A streaming audio+video command-line client for Gemini Multimodal Live ...
Whenever I use a new API, I start by writing the simplest working thing that I can, using as few dependencies as possible.
For the Gemini Multimodal Live API, my "minimal working thing" hack was a very basic command-line client.
After getting audio in and out working, I switched gears and wrote the @pipecat_ai service for the new Gemini API. But I had so much fun with the first hack that I wanted to do more with it.
So, here's a complete Python command-line client for the Multimodal Live API, written with minimal dependencies and lots of comments/docs notes. It supports:
- text, audio, and screen capture video input
- text or audio output
- setting the system instruction
- setting an initial message with a command-line arg
- the grounded search built-in tool
- the code execution built-in tool
- importing functions from a file and automatically generating Gemini function declarations
If you check it out, let me know what you think. Link in the thread below.
Repo is here[1]
A couple of fun things to highlight about this code.
The core audio and text features really are "minimal dependency." (Only pyaudio and websockets.)
But I also wanted to support screen sharing and function calling.
For screen sharing, I…
For function calling, I spent some time thinking about how to make it really easy to declare functions and to wire those up to toolCall -> toolResponse logic.
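To make the toolCall -> toolResponse idea concrete, here's a minimal sketch of the dispatch step. It is not the code from the repo. The message shapes (`toolCall.functionCalls` with `id`, `name`, and `args`, answered by `toolResponse.functionResponses`) are my assumption about the API's JSON format, and `get_weather`, `TOOLS`, and `build_tool_response` are hypothetical names made up for illustration.

```python
from typing import Callable, Dict, Optional


def get_weather(city: str) -> dict:
    """Hypothetical tool: return a canned weather report for a city."""
    return {"city": city, "forecast": "sunny"}


# Registry mapping function names to Python callables (hypothetical).
TOOLS: Dict[str, Callable] = {"get_weather": get_weather}


def build_tool_response(message: dict, tools: Dict[str, Callable]) -> Optional[dict]:
    """Run each function call in a toolCall message and build a toolResponse.

    Returns None when the message carries no toolCall. The field names are
    assumptions about the wire format, not confirmed by the original post.
    """
    tool_call = message.get("toolCall")
    if tool_call is None:
        return None

    responses = []
    for call in tool_call.get("functionCalls", []):
        name = call.get("name")
        func = tools.get(name)
        if func is None:
            result = {"error": f"unknown function: {name}"}
        else:
            try:
                result = {"result": func(**call.get("args", {}))}
            except Exception as exc:  # report the failure instead of crashing
                result = {"error": str(exc)}
        # Echo the call id back so the response can be matched to the call.
        responses.append({"id": call.get("id"), "name": name, "response": result})

    return {"toolResponse": {"functionResponses": responses}}
```

In a client, the returned dict would be serialized with `json.dumps` and sent back over the same websocket. Errors are returned as data so the model can see what went wrong.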
Sometimes the most time-consuming thing about setting up LLM function calling is writing the JSON-formatted function declarations. Function declarations are certainly verbose and there's no elegant way to pass them in as command-line arguments.
I had just decided to write some Python introspection and docstring-snarfing code to automatically generate function declarations ... when I saw that Google had just implemented exactly that capability in the new google-genai SDK.
It seemed worth adding another, optional dependency to avoid re-inventing this wheel. (Also, I had fun reading the google-genai library source code to see how they wrote the code I had been planning to tackle.)
This does require that the functions you import have good docstrings. But you always write good docstrings, right? :-)
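For a feel of what "introspection and docstring-snarfing" can look like, here's a rough standard-library sketch. It is not how google-genai does it and not the code in the repo. The declaration layout and the uppercase type names are assumptions about the schema format, and `declaration_from_function` and `set_volume` are hypothetical names.

```python
import inspect

# Rough mapping from Python annotations to schema type names (an assumption).
TYPE_NAMES = {str: "STRING", int: "INTEGER", float: "NUMBER", bool: "BOOLEAN"}


def declaration_from_function(func) -> dict:
    """Build a function declaration dict from a signature and docstring."""
    signature = inspect.signature(func)
    properties = {}
    required = []
    for name, param in signature.parameters.items():
        # Unannotated or unrecognized types fall back to STRING.
        properties[name] = {"type": TYPE_NAMES.get(param.annotation, "STRING")}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "OBJECT",
            "properties": properties,
            "required": required,
        },
    }


def set_volume(level: int, muted: bool = False) -> str:
    """Set the speaker volume to a level between 0 and 100."""
    return f"volume={level} muted={muted}"


declaration = declaration_from_function(set_volume)
```

The docstring becomes the description the model sees, which is why good docstrings matter here. A real implementation would also need to handle nested types, lists, and per-parameter descriptions.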