Concurrent LLM Serving: Benchmarking vLLM vs SGLang vs Ollama


via Dev.to · zkaria gamal

I wanted to know exactly how the three most popular open-source LLM serving engines perform when real users hit your server at the same time. So I built this educational repo and ran identical tests on a single GPU.

Repo: https://github.com/zkzkGamal/concurrent-llm-serving

**Setup**

- Model: Qwen/Qwen3.5-0.8B
- Hardware: Single GPU
- Concurrency: 16 simultaneous requests (only 4 for Ollama)
- Task: Diverse AI & programming questions (max_tokens=150)

**The Results (spoiler: one engine destroys the others)**

| Engine | Requests | Total Time | Per-Request Latency | Concurrency Model |
|--------|----------|------------|---------------------|-------------------|
| SGLang | 16 | 2.47s | 0.68–2.46s | True parallel batching + RadixAttention |
| vLLM | 16 | 11.26s | ~10.25–11.26s | PagedAttention + continuous batching |
| Ollama | 4 | 134.72s | 26–134s | Sequential (time-sliced) |

SGLang was 4.6× faster than vLLM and completely smoked Ollama.

**Why the huge difference? The Core Algorithms**

1. KV-Cache & Memor…
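A setup like the one above can be reproduced with a small concurrent client. This is a minimal sketch, assuming an OpenAI-compatible `/v1/chat/completions` endpoint (both vLLM and SGLang expose one); the URL, prompts, and port are placeholders, not values from the repo:

```python
# Sketch of a concurrent benchmark client: fire N requests at once
# and report total wall time plus the per-request latency spread.
# BASE_URL, MODEL, and PROMPTS are illustrative placeholders.
import asyncio
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint
MODEL = "Qwen/Qwen3.5-0.8B"
PROMPTS = [f"Question {i}: explain a programming concept." for i in range(16)]

def post_completion(prompt: str) -> float:
    """Send one blocking request and return its latency in seconds."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 150,
    }).encode()
    req = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

async def run_benchmark() -> None:
    start = time.perf_counter()
    # Run the blocking requests in a thread pool so all 16 are in flight
    # simultaneously; a sequential server will serialize them anyway.
    latencies = await asyncio.gather(
        *(asyncio.to_thread(post_completion, p) for p in PROMPTS)
    )
    total = time.perf_counter() - start
    print(f"total: {total:.2f}s, per-request: "
          f"{min(latencies):.2f}-{max(latencies):.2f}s")

# Usage (with a server running on localhost:8000):
# asyncio.run(run_benchmark())
```

Measuring both the total time and the min–max per-request spread is what exposes the difference in concurrency models: a batching engine shows a tight spread, while a time-sliced one shows latencies that grow with queue position.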

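The table credits SGLang's win partly to RadixAttention, which caches KV state in a radix tree so requests sharing a prompt prefix reuse its computation instead of redoing it. A toy sketch of the idea only (real RadixAttention manages KV-cache tensors on the GPU; this just counts reused vs. recomputed tokens with a plain trie):

```python
# Toy illustration of prefix reuse, the idea behind RadixAttention:
# requests that share a prompt prefix reuse the cached work for that
# prefix. Tokens are plain strings here purely for illustration.

class PrefixCache:
    def __init__(self):
        self.root = {}  # trie: token -> child node

    def insert(self, tokens):
        """Return (reused, computed) token counts for this request."""
        node, reused = self.root, 0
        for i, tok in enumerate(tokens):
            if tok in node:          # prefix hit: no recompute needed
                node, reused = node[tok], reused + 1
            else:                    # miss: compute and cache the rest
                for t in tokens[i:]:
                    node[t] = {}
                    node = node[t]
                return reused, len(tokens) - reused
        return reused, 0

cache = PrefixCache()
system = ["<sys>", "You", "are", "helpful", "."]
r1 = cache.insert(system + ["What", "is", "Rust?"])
r2 = cache.insert(system + ["What", "is", "Go?"])
print(r1, r2)  # → (0, 8) (7, 1): the second request recomputes 1 token
```

The second request reuses the shared system prompt plus the common "What is" prefix, which is why workloads with repeated prefixes (chat templates, few-shot prompts) benefit the most.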
Continue reading on Dev.to
