Yes, and intuitively this seems to be what humans do in conversation!

July 25, 2024

We have built @pipecat_ai bots that do this in a couple of different ways (using a small, fast predictor model; various prompting strategies for the LLM).

For commercial use cases today it’s not really worth doing this. You need to run the LLM colocated with the STT so you can do “greedy” inference and throw away most results. That’s much more expensive than using an LLM via an API.
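To make the cost point concrete, here is a rough sketch of what "greedy" inference looks like. It is not Pipecat code: `greedy_respond`, the `generate` callable, and the `(text, is_final)` shape of the partial transcripts are all made up for illustration. A real pipeline would also run these calls concurrently and cancel them, rather than looping one at a time.

```python
from typing import Callable, Iterable, Optional, Tuple


def greedy_respond(
    partials: Iterable[Tuple[str, bool]],
    generate: Callable[[str], str],
) -> Tuple[Optional[str], int]:
    """Run the LLM on every partial transcript and keep only the last reply.

    `partials` yields hypothetical (text, is_final) pairs from the STT.
    Returns (reply, number_of_discarded_generations).
    """
    reply: Optional[str] = None
    discarded = 0
    for text, is_final in partials:
        # Speculative call: the user may still be talking.
        candidate = generate(text)
        if is_final:
            # The utterance really ended here, so this reply is usable.
            reply = candidate
            break
        # More speech arrived, so this generation is thrown away.
        discarded += 1
    return reply, discarded


# Toy usage with a stand-in for the LLM.
partials = [
    ("what's the", False),
    ("what's the weather", False),
    ("what's the weather in Paris", True),
]
reply, discarded = greedy_respond(partials, lambda t: f"reply to: {t}")
```

In this toy run, two of the three generations are discarded. That wasted work is why this is hard to justify when every call is a paid API request.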

But this kind of “predictive streaming” approach could definitely be something that LLMs are trained to do natively in the future, both in the text and audio domains.

Alex Mizrahi @killerstorm

@kwindla @GroqInc Latency can be reduced further if you let the LLM predict the end of the human utterance and make a 'speculative' answer, together with a fast local classifier which can tell if the end actually matches the prediction.

Might be unnecessary, but LLMs are inherently predictive, so