December 15, 2024

This is such a nice example of collaborating with Gemini on a multi-faceted computer use task.

I'm convinced that voice is broadly more efficient than typing for a wide range of what we might call "multimodal support" experiences.

It's going to take some time to get used to talking to our computers. But as this video shows, a voice-to-voice UI for the LLM interactions complements the "traditional" UI of the image editing workflow really well.

Gemini's capabilities are a big step forward for these use cases. For the last couple of months, I've been running an always-on screen capture plus voice assistant when I write code. I can ask the assistant things like "please explain the stack trace," or "I forgot the name of the python dot env pip install again, what is it?" (I usually have voice output turned off. For programming, it turned out that I prefer voice input and text output.)

That prototype was made possible by Claude 3.5 Sonnet's combination of visual parsing and programming abilities. I hacked it together as a @pipecat_ai pipeline, with voice transcription feeding Claude, a WebRTC video feed sending the screenshare up to the cloud all the time, and a function call pulling the most recent frame from the screenshare whenever an image was needed.
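The "pull the most recent frame" piece is the easiest part to show in isolation. Here is a minimal sketch in plain Python, not the actual Pipecat code: `LatestFrameBuffer`, `get_screenshot_tool`, and the `max_age_s` staleness check are all hypothetical names and choices I'm using for illustration. The idea is that the video feed keeps overwriting a single slot, and the function call the LLM triggers just reads whatever is in that slot.

```python
import base64
import threading
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frame:
    data: bytes          # encoded image bytes (for example, a JPEG)
    captured_at: float   # seconds since the epoch


class LatestFrameBuffer:
    """Holds only the newest screenshare frame; older frames are dropped."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._frame: Optional[Frame] = None

    def push(self, data: bytes, captured_at: Optional[float] = None) -> None:
        # Called by the video receiver for every incoming frame.
        ts = time.time() if captured_at is None else captured_at
        with self._lock:
            self._frame = Frame(data, ts)

    def latest(self) -> Optional[Frame]:
        with self._lock:
            return self._frame


def get_screenshot_tool(
    buffer: LatestFrameBuffer,
    max_age_s: float = 5.0,
    now: Optional[float] = None,
) -> dict:
    """Hypothetical function-call handler: return the newest frame, if fresh."""
    frame = buffer.latest()
    current = time.time() if now is None else now
    if frame is None:
        return {"ok": False, "reason": "no frame received yet"}
    age = current - frame.captured_at
    if age > max_age_s:
        return {"ok": False, "reason": "latest frame is stale"}
    return {
        "ok": True,
        "image_base64": base64.b64encode(frame.data).decode("ascii"),
        "age_s": age,
    }
```

Keeping one slot instead of a queue means the model only ever sees the current screen, and the always-on feed costs nothing in LLM tokens until a question actually needs an image.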

In my testing so far, Gemini seems to be better at individual visual tasks, much better at visual tasks that involve several sequential images, and massively faster. The Multimodal Live WebSocket API also makes it easier to build multi-turn, multimodal, conversational apps. These are all very impactful improvements.

Of course, the next big, open-ended question is: what does a UI look like that's built natively around these capabilities? I love the copilot / technical assistant / idea partner mode we're just starting to explore. And these modes are very powerful.

But they are also surely stepping stones towards a completely new UI paradigm that's as different from "windows, icons, menus, pointers" as WIMP is from the command line.

Shrestha Basu Mallick @shresbm

The Gemini realtime API is kind enough to let me interrupt it to keep things moving! Thanks for all the love you all are giving it. We'll keep cooking more!

Here is Gemini working with me to edit an image in Adobe Photoshop and identify the bird at the end - a motmot.