
# Best LLMs for Ollama on a 16GB VRAM GPU
Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark shows exactly what you can expect from 9 popular LLMs on Ollama running on an RTX 4080. With a 16GB VRAM GPU, I faced a constant trade-off: bigger models with potentially better quality, or smaller models with faster inference.

## TL;DR

Here is the comparison table of LLM performance on an RTX 4080 16GB with Ollama 0.15.2:

| Model | RAM+VRAM Used | CPU/GPU Split | Tokens/sec |
|---|---|---|---|
| gpt-oss:20b | 14 GB | 100% GPU | 139.93 |
| ministral-3:14b | 13 GB | 100% GPU | 70.13 |
| qwen3:14b | 12 GB | 100% GPU | 61.85 |
| qwen3-vl:30b-a3b | 22 GB | 30%/70% | 50.99 |
| glm-4.7-flash | 21 GB | 27%/73% | 33.86 |
| nemotron-3-nano:30b | 25 GB | 38%/62% | 32.77 |
| devstral-small-2:24b | 19 GB | 18%/82% | 18.67 |
| mistral-small3.2:24b | 19 GB | 18%/82% | 18.51 |
| gpt-oss:120b | 66 GB | 78%/22% | 12.64 |

**Key insight**: Models that fit entirely in VRAM are dramatically faster. GPT-OSS 20B achieves 139.93 tokens/sec, while GPT-OSS 120B with heavy CPU offloading crawls at 12.64 tokens/sec, an 11x speed difference.



