← kwindla hultman kramer

123ms TTFT for Llama 3.1 70B, vLLM, on an H100

October 10, 2024

123ms TTFT for Llama 3.1 70B, vLLM, on an H100. (256 input tokens.)

Nice data point and write-up from the team at @cerebriumai. This matches what I've seen from vLLM. I'm glad/sad to see that Cerebrium — which is a lot better at GPU wrangling than I am — didn't find any tricks I missed. 😅

There's a lot more benchmark data floating around for inference throughput than for latency. For voice AI use cases, though, it's often worth tweaking settings to optimize for latency, even if that raises costs a bit.
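To make the latency/throughput distinction concrete, here's a minimal sketch of how you might time both from a single streamed response. It's not how Cerebrium ran their benchmark. `measure_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for whatever streaming client you actually use. Throughput here is simply tokens divided by total wall-clock time, which is one of several reasonable definitions.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, token_count) for a token stream.

    The clock starts when this function is called, so call it right as the
    request is sent (assumption: `tokens` is lazy and yields as tokens arrive).
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            # Time to first token: what a voice user perceives as the delay.
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    return ttft, count / (end - start), count


def fake_stream(first_delay: float, gap: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(first_delay)  # simulated queueing + prefill
    for i in range(n):
        if i:
            time.sleep(gap)  # simulated per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, tps, n = measure_stream(fake_stream(0.05, 0.01, 20))
    print(f"TTFT: {ttft * 1000:.0f} ms, {tps:.1f} tok/s over {n} tokens")
```

Two servers can post the same tokens-per-second figure and still have very different TTFT, which is why the two are worth measuring separately for voice.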

cerebriumai (@cerebriumai)

We benchmarked vLLM, SGLang and TensorRT to see which framework would be most performant when running Llama 3.1 70B on an H100

However, performant can mean different things based on what you value (TTFT/throughput).

#llama3 #vllm #tensorrt #h100