
# Best LLMs for Ollama on a 16GB VRAM GPU
Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark shows exactly what you can expect from 9 popular LLMs on Ollama running on an RTX 4080. With a 16GB VRAM GPU, I faced a constant trade-off: bigger models with potentially better quality, or smaller models with faster inference.

## TL;DR

Here is the comparison table of LLM performance on an RTX 4080 16GB with Ollama 0.15.2:

| Model | RAM+VRAM Used | CPU/GPU Split | Tokens/sec |
|---|---|---|---|
| gpt-oss:20b | 14 GB | 100% GPU | 139.93 |
| ministral-3:14b | 13 GB | 100% GPU | 70.13 |
| qwen3:14b | 12 GB | 100% GPU | 61.85 |
| qwen3-vl:30b-a3b | 22 GB | 30%/70% | 50.99 |
| glm-4.7-flash | 21 GB | 27%/73% | 33.86 |
| nemotron-3-nano:30b | 25 GB | 38%/62% | 32.77 |
| devstral-small-2:24b | 19 GB | 18%/82% | 18.67 |
| mistral-small3.2:24b | 19 GB | 18%/82% | 18.51 |
| gpt-oss:120b | 66 GB | 78%/22% | 12.64 |

**Key insight**: Models that fit entirely in VRAM are dramatically faster. GPT-OSS 20B achieves 139.93 tokens/sec, while GPT-OSS 120B with heavy CPU offloading crawls at 12.64 tokens/sec, an 11x speed difference.



