I love grounded hot tech takes like this

November 12, 2024

I love grounded hot tech takes like this. Even if I disagree with them. Which I do with this one.

Text 🤝 Voice 🤝 Images

Why choose? We'll all be using multi-modal input soon. I already talk, type, and visionize (is that a word?) for much of the day.

I run an always-on background process that:

- Captures the screen every five seconds and buffers the last few images
- Starts a voice conversation when I hit a hot-key
- Accepts typed queries in a text box
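The capture-and-buffer part of that process can be sketched roughly like this. It is a minimal illustration, not the actual code: `FrameBuffer`, `run_forever`, and the injected `capture` callable are hypothetical names, and "the last few images" is assumed to mean three.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Tuple


class FrameBuffer:
    """Keep only the most recent screen captures (hypothetical helper)."""

    def __init__(self, capture: Callable[[], bytes], max_frames: int = 3) -> None:
        # `capture` is any function that returns one encoded screenshot.
        self._capture = capture
        # A deque with maxlen silently drops the oldest frame when full.
        self._frames: Deque[Tuple[float, bytes]] = deque(maxlen=max_frames)

    def tick(self) -> None:
        """Grab one frame and store it with its capture time."""
        self._frames.append((time.time(), self._capture()))

    def recent(self) -> List[bytes]:
        """Return buffered frames, oldest first, ready to attach to a query."""
        return [image for _, image in self._frames]


def run_forever(buffer: FrameBuffer, interval_s: float = 5.0) -> None:
    """Background loop: one capture every `interval_s` seconds."""
    while True:
        buffer.tick()
        time.sleep(interval_s)
```

Passing the capture function in keeps the buffer independent of any particular screenshot library, and `run_forever` would typically run in a daemon thread so the hot-key and text box handlers can call `recent()` whenever a query comes in.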

I ask it things like: "Look at my editor and my terminal and tell me why I'm getting this Pydantic bug." Both Claude Sonnet and GPT-4o can read the text from screenshots and answer this kind of question easily.

Output is both text and voice by default. @mark_backman wrote a @pipecat_ai frame processor that filters out LLM formatting. So now I can have both nice text and comprehensible voice.

(Though I will say that you haven't lived until you've heard a TTS system read a table out loud to you. "Vertical bar plus plus plus plus plus plus plus plus plus vertical bar ...")
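To give a feel for what filtering LLM formatting before TTS involves, here is a rough plain-Python sketch. It is not the Pipecat frame processor mentioned above and does not use Pipecat's API; `strip_formatting_for_tts` is a hypothetical name, and the rules only cover a handful of common Markdown patterns.

```python
import re


def strip_formatting_for_tts(text: str) -> str:
    """Turn Markdown-ish LLM output into plain sentences for speech (a sketch)."""
    spoken = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and code-fence markers.
        if not line or line.startswith("```"):
            continue
        # Skip table rules such as |----|----| or +----+----+.
        if re.fullmatch(r"[\s|+:=-]{3,}", line):
            continue
        # Read table rows as comma-separated cells instead of "vertical bar".
        if line.startswith("|"):
            cells = [cell.strip() for cell in line.strip("|").split("|")]
            line = ", ".join(cell for cell in cells if cell)
        # Drop heading and bullet markers at the start of the line.
        line = re.sub(r"^#{1,6}\s+", "", line)
        line = re.sub(r"^[-*+]\s+", "", line)
        # Keep link text, drop the URL.
        line = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", line)
        # Unwrap **bold**, *italic*, and `code` spans.
        line = re.sub(r"(\*\*|\*|`)(.+?)\1", r"\2", line)
        spoken.append(line)
    return "\n".join(spoken)
```

The original text still goes to the screen untouched; only the copy sent to the TTS engine is filtered, which is how you get both nice text and comprehensible voice from one response.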

If I'm by myself in my office, I talk to it. But if I want to copy and paste, or I have loud music on, or I'm sitting in an office with other people, I ... don't talk. I type.

I need to clean this code up and share it. This way of working feels like the future. I fully expect tools like @cursor_ai to have native voice input in the not-too-distant future.

Philipp Tsipman @ptsi

🌶️ AI UX take:
AI voice-only desktop UIs are going to flop — no matter how fast inference gets. Here's why:
* No one wants to talk to an Orb all day 🔮

Text > Voice
* Text is more multi-purpose -- you can combine it with images, code, etc
* Text output is much more re-mixable —