July 27, 2025

The TTFT (latency) for the Qwen model here is about the same as using Groq’s API with their larger models.

Prefill time dominates TTFT, and Groq’s speed advantage barely comes into play here, for three reasons: the network penalty is ~100ms, most individual turns add only a few hundred tokens, and local KV caching works well.
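As a rough back-of-the-envelope sketch of that reasoning, TTFT can be modeled as network round trip plus the time to prefill whatever tokens are not already in the KV cache. The ~100ms penalty and the "few hundred tokens" per turn come from the post; the prefill rates and the function name `ttft_ms` are hypothetical placeholders, not measurements of the M4 or of Groq.

```python
def ttft_ms(new_tokens: int, prefill_tok_per_s: float, network_ms: float = 0.0) -> float:
    """Rough TTFT estimate: network round trip plus prefill time for uncached tokens."""
    return network_ms + 1000.0 * new_tokens / prefill_tok_per_s


NEW_TOKENS = 300           # "a few hundred tokens" added per turn (from the post)
NETWORK_MS = 100.0         # ~100ms network penalty (from the post)
LOCAL_PREFILL = 1_000.0    # hypothetical local prefill rate, tokens/s
CLOUD_PREFILL = 10_000.0   # hypothetical cloud prefill rate, tokens/s

# With a warm KV cache, only the new turn's tokens need prefill.
local = ttft_ms(NEW_TOKENS, LOCAL_PREFILL)
cloud = ttft_ms(NEW_TOKENS, CLOUD_PREFILL, NETWORK_MS)

print(f"local: {local:.0f} ms, cloud: {cloud:.0f} ms")
```

The point of the sketch is only that with a few hundred uncached tokens per turn, a fixed network penalty is a large share of the cloud number, so a much faster prefill rate buys less than it would on long, uncached prompts. Plug in your own measured rates before drawing conclusions.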

Having said that, for almost all production use cases you can’t run a good enough LLM locally on real users’ devices. This demo uses 110GB of (unified) RAM!

And for anything where you care about throughput and not just latency, Groq blows the M4 out of the water.

ForeverPatriotic (@MI_Patriotic)

@kwindla Is the latency due to QWEN3? What about using the GROQ endpoint? Would it be near instant if I was ok sending my data to cloud?