Voice-only programming with the new OpenAI Realtime API

August 28, 2025

Voice-only programming with the new OpenAI Realtime API ...

I spend a lot of time these days pair programming with LLMs. Often I'm talking rather than typing.

This "voice dictation" use case has become an important vibe benchmark for me. Being able to create text input just by talking, flexibly, in a context dependent way, with tool calling, is a *hard* problem for today's models.

Natural language dictation requires a very high degree of contextual intelligence, instruction following accuracy, and tool calling reliability.

Today's new gpt-realtime model is quite good at this hard problem.

The original realtime model release last year was impressive. Seeing what a speech-to-speech model could do got a lot of people excited about the possibilities of voice AI. The improvements since that first release are equally impressive. I can use this new model, now, for real world tasks that were past the edge of the "jagged frontier" before.

Here's a video showing a couple of fun (and tricky) modes of voice input.

It's hard to over-state how big a change this LLM+voice programming workflow is. And how fast that change has happened.

A few months ago I had to force myself to work this way. (I want to live in the future!) But now it's difficult to imagine going back to writing code by hand (literally) with a keyboard.

I've been meaning to clean up some of the voice dictation code I use every day and post it for other people to try. The GA today of the OpenAI Realtime API was a good excuse to spend a few hours making a repo and writing some things down.

Here's a repo. You can `git clone` this, export your OpenAI API key, and run a single `uv` command to try this yourself.

https://t.co/gDSGq2BuqI

And here's the PR that Codec CLI created from the voice interaction in the demo video, above. :-) -> https://t.co/ghlpEiEQiM

My goals for voice input are to:

1. Be able to talk to my computer the same way I talk to another person. I don't want to have to dictate literal phrases. I want to stop and start, go back and correct things I said before, rely on previous context, have my tools interpret what I mean to say rather than what I literally said, and have the model fill in gaps and rewrite things for me on the fly.

2. Do many of the things I can easily do with a keyboard and mouse. Send input to different windows. Perform sequences of actions. Copy and paste. Take screenshots.

3. Have context and memory, so I don't have to repeat myself every session.

4. Add new tools and work patterns to my everyday environment.

This is always a work in progress for me. Half the time, half my code is broken! (I try a new model, I add a bunch of stuff that makes instruction following less reliable, or I refactor some of the Pipecat tool integrations and only get partway through before I have to do something else.)

I just put the basic dictation input functionality into this public repo, for now. But there's enough there to try out this way of working and see if it's interesting to you.

Feel free to create issues and PRs. I'll try to add more code over time, and keep things a little more stable, going forward, in case other people want to work together on this.

The launch livestream^[1]

The Realtime API docs^[2]