
Serving LLMs on IaaS: throughput vs latency tuning with practical guardrails
Serving LLMs on IaaS is queueing plus memory pressure dressed up as ML. Every request has a prefill phase (prompt → KV cache) and a decode phase (token-by-token output). Throughput tuning pushes batching and concurrency; latency tuning caps them to protect TTFT and ITL. With vLLM on a single L40S (PCIe), you win by setting hard limits and enforcing admission control.

TTFT, ITL, TPS: stop mixing the metrics

If you tune the wrong metric, you’ll ship a fast benchmark and a slow product. You need three numbers, and they mean different things:

- TTFT (time to first token): how long the user waits before anything shows up. Interactive UX lives here.
- ITL (inter-token latency): the “smoothness” of streaming output once decoding starts. Chat feels broken when this jitters.
- Throughput (tokens/sec): the finance metric. It decides cost per request.

One important detail: E2E latency includes queueing + prefill + decode. TTFT is where queueing hides when you’re overloaded.

Practical measu
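The "hard limits" idea above maps directly onto vLLM's engine flags. A minimal launch sketch — the model name and every value here are illustrative assumptions for a single L40S, not tuned recommendations; measure on your own workload:

```shell
# Sketch: hard limits for a single L40S. Model and values are
# illustrative assumptions, not tuned recommendations.
# --max-model-len:          cap context so per-request KV cache is bounded
# --max-num-seqs:           hard cap on sequences batched concurrently
# --max-num-batched-tokens: limit tokens scheduled per step (protects ITL)
# --gpu-memory-utilization: leave headroom instead of filling VRAM
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90
```

Raising `--max-num-seqs` and `--max-num-batched-tokens` buys throughput; lowering them protects TTFT and ITL — that is the whole trade in two flags.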
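All three metrics fall out of per-token arrival timestamps on a streamed response. A minimal sketch (the function name `stream_metrics` is mine, not from the article):

```python
def stream_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT, mean ITL, and decode throughput from token timestamps.

    request_start: wall-clock time the request was enqueued, so TTFT
                   includes queueing + prefill, matching E2E accounting.
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("no tokens generated")
    # TTFT: first token arrival minus enqueue time (queueing hides in here).
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive tokens once decoding has started.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    # Decode throughput: tokens emitted per second of decode time only.
    decode_window = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return {"ttft_s": ttft, "mean_itl_s": mean_itl, "decode_tps": tps}
```

For example, a request enqueued at t=0 whose tokens arrive at 0.5 s, 0.6 s, 0.7 s, 0.8 s has a TTFT of 0.5 s, a mean ITL of 0.1 s, and a decode throughput of 10 tokens/sec — note the benchmark-friendly throughput says nothing about the half second the user stared at a blank screen.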
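Admission control is what keeps queueing out of TTFT: beyond a hard concurrency cap, reject immediately (e.g. HTTP 429) instead of letting requests pile up in a hidden queue. A sketch under my own assumptions — `AdmissionGate` is a hypothetical name, and in practice the cap would be sized against the engine's own sequence limit:

```python
import threading

class AdmissionGate:
    """Reject work beyond a hard concurrency cap instead of queueing it.

    Hypothetical sketch: excess load fails fast so the client can retry
    or shed, rather than inflating TTFT in an unbounded server queue.
    """

    def __init__(self, max_concurrent: int):
        self._sem = threading.Semaphore(max_concurrent)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately when at capacity.
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        # Call when the request finishes (success or failure).
        self._sem.release()
```

The design choice is the non-blocking acquire: a blocking one would just rebuild the queue you were trying to eliminate.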



