← kwindla hultman kramer

Another nice trick made possible by Gemini 2.0 native audio: for long-running…

December 24, 2024

Another nice trick made possible by Gemini 2.0 native audio: for long-running, live voice AI conversations, use native audio input once and then replace with Gemini's transcript.

This reduces the token count for each turn of user speech by ~10x.

For a ten-minute conversation, this reduces the token count by ~100x. (Because the conversation history compounds every turn.)

For live transcription I use this system instruction:

```
transcriber_system_instruction = """You are an audio transcriber. You are receiving audio from a user. Your job is to
transcribe the input audio to text exactly as it was said by the user.

You will receive the full conversation history before the audio input, to help with context. Use the full history only to help improve the accuracy of your transcription.

Rules:
- Respond with an exact transcription of the audio input.
- Do not include any text other than the transcription.
- Do not explain or add to your response.
- Transcribe the audio input simply and precisely.
- If the audio is not clear, emit the special string "-".
- No response other than exact transcription, or "-", is allowed.
```

Dwarkesh Patel@dwarkesh_sp

Human quality transcripts using Gemini 2.0 w Audio.

Lmk if you come up with a prompt that works better!

Blame Claude for all coding mistakes.

More context here:

[1]

Demo code here:

[2]

  1. https://x.com/kwindla/status/1870974144831275410
  2. https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/22d-natural-conversation-gemini-audio.py