July 30, 2025
Tonight @charles_irl and I are hanging out with ~400 of our closest friends and talking about open source voice AI models, tools, and infrastructure.
Join us in person in San Francisco at the @trychroma office, or on the livestream.
Charles gave a fantastic talk at the @aiDotEngineer World's Fair last month: What Every AI Engineer Needs To Know About GPUs.
Watch that talk today, and ask Charles all your GPU questions tonight!
Charles opens his talk with an analogy: GPUs are like databases, in the sense that almost all the code we write uses them so it's valuable to have good intuitions about how to write code that uses them effectively.
I've used this analogy, too, as a tool for thinking about LLMs and how leveraging what LLMs can do is changing the code we write. Understanding the core internal algorithms and mechanics of databases/GPUs/LLMs is important, even if you're never planning to write, say, a CUDA kernel.
One thing we've barely started to internalize yet in our tools and libraries, as Charles says, as soon as you start using GPUs for anything, some very surprising operations become almost free.
Watch the talk for more on this, but one important take-away is that right now there's a lot of potential to improve all the LLM-related code we use.
For example, we can do more speculative decoding down in our inference libraries. But we can also do a bunch of different kinds of "greedy inference," prediction, and speculative decoding-like things in our application code.
Yesterday I was writing some multi-turn voice conversation code using a local model. KV caching was working well and helping keep the TTFT low for even fairly long conversations. But as soon as I started asking the model to generate fairly long single-turn responses, TTFT went way up. Obvious optimization: use the time while waiting for the user's next input to do a dummy inference (which you'll throw away) just to shove the LLM's most recent output into the KV cache. Not rocket science, but doing this cut the TTFT for turns following long responses in half.
Your (UI) latency is my (optimization) opportunity.
Come hang out with us tonight.
Watch @charles_irl 's @aiDotEngineer talk and read his GPU Glossary.