Voice AI programming with Gemini² and Cursor

January 21, 2025

Adrian built a Gemini voice + vision AI agent that writes software indirectly, collaborating with a human and with Gemini running inside @cursor_ai.

More details and thoughts below.

The setup here is:

1. Prompt the "designer" agent with a project plan. The designer is a voice- and vision-enabled AI agent — Gemini 2.0 running in a @pipecat_ai multimodal pipeline.

2. The designer can see your screen so it always has live context.

3. For each step, the designer AI prompts the coding agent in Cursor, then pauses for input/feedback from the human.
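
To make that loop concrete, here's a rough sketch in Python. This is not Adrian's code, and it doesn't use the real Pipecat, Gemini, or Cursor APIs. Every name in it (`run_designer_loop`, `send_to_coder`, `get_human_feedback`, `describe_screen`, `StepRecord`) is hypothetical, and the three callables stand in for whatever actually delivers a prompt to Cursor, listens to the human, and reads the screen.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepRecord:
    """What happened for one step of the plan (hypothetical structure)."""
    step: str
    prompt: str
    feedback: str


def run_designer_loop(
    plan: List[str],
    send_to_coder: Callable[[str], None],
    get_human_feedback: Callable[[str], str],
    describe_screen: Callable[[], str],
) -> List[StepRecord]:
    """Walk the plan one step at a time.

    For each step: read the current screen context, prompt the coding
    agent, then pause until the human responds. The human saying "stop"
    ends the loop early.
    """
    history: List[StepRecord] = []
    for step in plan:
        # Assumption: screen context is available as a text description.
        context = describe_screen()
        prompt = f"Step: {step}\nCurrent screen context: {context}"

        # Assumption: this stands in for the glue that types into Cursor.
        send_to_coder(prompt)

        # Block here until the human has reviewed the change.
        feedback = get_human_feedback(step)
        history.append(StepRecord(step, prompt, feedback))

        if feedback.strip().lower() == "stop":
            break
    return history
```

The point of passing the three callables in is that the loop itself doesn't care how the glue works, so the accessibility-API hack could be swapped for something sturdier without touching the designer logic.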

Like a lot of the most mind-blowing things people are building these days, this feels to me both like a glimpse of the future and a hack! Adrian uses the macOS accessibility APIs to glue things together, and manually accepts the changes.

I borrowed the "designer" terminology from Adrian's colleague at Canonical AI, @tom_shapland. Tom talks about this pattern as a "designer AI" collaborating with an "engineer AI".

(Follow Tom for lots of great pointers to voice AI resources.)

I also think of this as a nice example of a multi-agent system, with each agent specialized in terms of both task and UI.

The voice+vision agent is realtime, multimodal, conversational scaffolding wrapped around Gemini 2.0. Similarly, the Gemini coding agent leverages the full Cursor framework and its capabilities.

It's clear that a lot of new AI applications that ship in 2025 will be built as multi-agent systems.