kwindla hultman kramer

December 18, 2024

"Yeah the trailer does show a gladiator riding a rhinoceros, which is not something you see every day."

A streaming audio+video command-line client for Gemini Multimodal Live ...

Whenever I use a new API, I start by writing the simplest working thing that I can, using as few dependencies as possible.

For the Gemini Multimodal Live API, my "minimal working thing" hack was a very basic command-line client.

After getting audio in and out working, I switched gears and wrote the @pipecat_ai service for the new Gemini API. But I had so much fun with the first hack that I wanted to do more with it.

So, here's a complete Python command-line client for the Multimodal Live API, written with minimal dependencies and lots of comments and documentation notes. It supports:

- text, audio, and screen capture video input
- text or audio output
- setting the system instruction
- setting an initial message with a command-line arg
- the grounded search built-in tool
- the code execution built-in tool
- importing functions from a file and automatically generating Gemini function declarations

If you check it out, let me know what you think. Link in the thread below.

Repo is here[1]

A couple of fun things to highlight about this code.

The core audio and text features really are "minimal dependency." (Only pyaudio and websockets.)

But I also wanted to support screen sharing and function calling.

For screen sharing, I…

For function calling, I spent some time thinking about how to make it really easy to declare functions and to wire those up to toolCall -> toolResponse logic.
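As a rough illustration of the toolCall -> toolResponse wiring, here is a minimal sketch. It is not the code from the repo. The message shapes (`toolCall.functionCalls` with `id`, `name`, and `args`, answered by `toolResponse.functionResponses`) are assumptions about the API's JSON format, and `get_weather` and `handle_tool_call` are hypothetical names made up for the example.

```python
from typing import Callable, Dict, Optional


def get_weather(city: str) -> dict:
    """Hypothetical example tool: return a canned weather report for a city."""
    return {"city": city, "forecast": "sunny"}


# Registry mapping the function names declared to the model onto Python callables.
TOOLS: Dict[str, Callable] = {"get_weather": get_weather}


def handle_tool_call(message: dict, tools: Dict[str, Callable]) -> Optional[dict]:
    """Run every function call in a server message and build the reply.

    Assumed incoming shape (an assumption, not taken from the post):
    {"toolCall": {"functionCalls": [{"id": ..., "name": ..., "args": {...}}]}}
    """
    tool_call = message.get("toolCall")
    if tool_call is None:
        return None  # not a tool call; the caller handles other message types

    responses = []
    for call in tool_call.get("functionCalls", []):
        func = tools.get(call["name"])
        if func is None:
            result = {"error": f"unknown function: {call['name']}"}
        else:
            try:
                result = {"result": func(**call.get("args", {}))}
            except Exception as exc:
                # Report the failure to the model instead of crashing the client.
                result = {"error": str(exc)}
        # Echo the id and name so the reply can be matched to the call.
        responses.append(
            {"id": call["id"], "name": call["name"], "response": result}
        )
    return {"toolResponse": {"functionResponses": responses}}
```

In a real client, the returned dict would be serialized with `json.dumps` and sent back over the same websocket connection.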

Sometimes the most time-consuming thing about setting up LLM function calling is writing the JSON-formatted function declarations. Function declarations are certainly verbose and there's no elegant way to pass them in as command-line arguments.
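To give a sense of the verbosity, here is a sketch of a hand-written declaration for a single, hypothetical two-argument function. The field names (`functionDeclarations`, `parameters`, and the uppercase type names) are assumptions about the declaration schema, not something copied from the repo.

```python
import json

# Hand-written declaration for a hypothetical get_weather(city, units) function.
get_weather_declaration = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "OBJECT",
        "properties": {
            "city": {
                "type": "STRING",
                "description": "Name of the city, for example 'Rome'.",
            },
            "units": {
                "type": "STRING",
                "description": "Either 'metric' or 'imperial'.",
            },
        },
        "required": ["city"],
    },
}

# Declarations are grouped into a tools list (assumed shape).
tools = [{"functionDeclarations": [get_weather_declaration]}]

# Serialized form, as it might appear inside a session setup message.
tools_json = json.dumps(tools, indent=2)
```

That is roughly twenty lines of JSON for one small function, which is why typing it out on a command line is impractical.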

I had just decided to write some Python introspection and docstring-snarfing code to automatically generate function declarations ... when I saw that Google had just implemented exactly that capability in the new google-genai SDK.

It seemed worth adding another, optional dependency to avoid re-inventing this wheel. (Also, I had fun reading the google-genai library source code to see how they wrote the code I had been planning to tackle.)

This does require that the functions you import have good docstrings. But you always write good docstrings, right? :-)
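For a feel of what the introspection approach involves, here is a minimal standard-library sketch. It is not the google-genai implementation and does not use that SDK. The helper name `declaration_from_function`, the type-name mapping, and the output shape are all assumptions made for illustration.

```python
import inspect
from typing import Any, Callable

# Assumed mapping from Python annotations to schema type names.
_TYPE_NAMES = {str: "STRING", int: "INTEGER", float: "NUMBER", bool: "BOOLEAN"}


def declaration_from_function(func: Callable[..., Any]) -> dict:
    """Build a function-declaration dict from a signature and docstring."""
    signature = inspect.signature(func)
    properties = {}
    required = []
    for name, param in signature.parameters.items():
        # Unannotated or unrecognized parameter types fall back to STRING.
        properties[name] = {"type": _TYPE_NAMES.get(param.annotation, "STRING")}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        # The docstring becomes the description the model sees.
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "OBJECT",
            "properties": properties,
            "required": required,
        },
    }


def set_volume(level: int, device: str = "speakers") -> str:
    """Set the playback volume (0-100) on the named output device."""
    return f"{device} volume set to {level}"


declaration = declaration_from_function(set_volume)
```

The docstring is the only description the model gets in this sketch, so a function with an empty or vague docstring produces a declaration the model can do little with.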

  1. https://github.com/pipecat-ai/multimodal-live-cmdline/