Concurrent LLM Serving: Benchmarking vLLM vs SGLang vs Ollama


via Dev.to · zkaria gamal

I wanted to know exactly how the three most popular open-source LLM serving engines perform when real users hit your server at the same time. So I built this educational repo and ran identical tests on a single GPU.

Repo: https://github.com/zkzkGamal/concurrent-llm-serving

**Setup**

- Model: Qwen/Qwen3.5-0.8B
- Hardware: Single GPU
- Concurrency: 16 simultaneous requests (only 4 for Ollama)
- Task: Diverse AI & programming questions (max_tokens=150)

**The Results (spoiler: one engine destroys the others)**

| Engine | Requests | Total Time | Per-Request Latency | Concurrency Model |
|--------|----------|------------|---------------------|-------------------|
| SGLang | 16 | 2.47s | 0.68–2.46s | True parallel batching + RadixAttention |
| vLLM | 16 | 11.26s | ~10.25–11.26s | PagedAttention + continuous batching |
| Ollama | 4 | 134.72s | 26–134s | Sequential (time-sliced) |

SGLang was 4.6× faster than vLLM and completely smoked Ollama.

**Why the huge difference? The Core Algorithms**

1. KV-Cache & Memor…
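A setup like the one above can be reproduced with a small concurrent client. This is a minimal sketch, assuming an OpenAI-compatible `/v1/chat/completions` endpoint (both vLLM and SGLang expose one); the URL, prompts, and port are placeholders, not values from the repo:

```python
# Sketch of a concurrent benchmark client: fire N requests at once
# and report total wall time plus the per-request latency spread.
# BASE_URL, MODEL, and PROMPTS are illustrative placeholders.
import asyncio
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint
MODEL = "Qwen/Qwen3.5-0.8B"
PROMPTS = [f"Question {i}: explain a programming concept." for i in range(16)]

def post_completion(prompt: str) -> float:
    """Send one blocking request and return its latency in seconds."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 150,
    }).encode()
    req = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

async def run_benchmark() -> None:
    start = time.perf_counter()
    # Run the blocking requests in a thread pool so all 16 are in flight
    # simultaneously; a sequential server will serialize them anyway.
    latencies = await asyncio.gather(
        *(asyncio.to_thread(post_completion, p) for p in PROMPTS)
    )
    total = time.perf_counter() - start
    print(f"total: {total:.2f}s, per-request: "
          f"{min(latencies):.2f}-{max(latencies):.2f}s")

# Usage (with a server running on localhost:8000):
# asyncio.run(run_benchmark())
```

Measuring both the total time and the min–max per-request spread is what exposes the difference in concurrency models: a batching engine shows a tight spread, while a time-sliced one shows latencies that grow with queue position.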

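The table credits SGLang's win partly to RadixAttention, which caches KV state in a radix tree so requests sharing a prompt prefix reuse its computation instead of redoing it. A toy sketch of the idea only (real RadixAttention manages KV-cache tensors on the GPU; this just counts reused vs. recomputed tokens with a plain trie):

```python
# Toy illustration of prefix reuse, the idea behind RadixAttention:
# requests that share a prompt prefix reuse the cached work for that
# prefix. Tokens are plain strings here purely for illustration.

class PrefixCache:
    def __init__(self):
        self.root = {}  # trie: token -> child node

    def insert(self, tokens):
        """Return (reused, computed) token counts for this request."""
        node, reused = self.root, 0
        for i, tok in enumerate(tokens):
            if tok in node:          # prefix hit: no recompute needed
                node, reused = node[tok], reused + 1
            else:                    # miss: compute and cache the rest
                for t in tokens[i:]:
                    node[t] = {}
                    node = node[t]
                return reused, len(tokens) - reused
        return reused, 0

cache = PrefixCache()
system = ["<sys>", "You", "are", "helpful", "."]
r1 = cache.insert(system + ["What", "is", "Rust?"])
r2 = cache.insert(system + ["What", "is", "Go?"])
print(r1, r2)  # → (0, 8) (7, 1): the second request recomputes 1 token
```

The second request reuses the shared system prompt plus the common "What is" prefix, which is why workloads with repeated prefixes (chat templates, few-shot prompts) benefit the most.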
Continue reading on Dev.to
