
# How I Stopped GGUF Models From Crashing My GPU: A Pre-flight VRAM Check
## The crash that started this

I was loading a Q4_K_M quantized 13B model on a 24GB card. The model file was about 7.5GB. Free VRAM according to `nvidia-smi`: 21GB. Plenty of headroom. I hit run, watched the loader bar, and the process died on the last few layers with `CUDA out of memory`.

That was not a one-off. I had the same crash twice that week, each time after eyeballing free VRAM and convincing myself a model would fit. After the second one I stopped trusting my eyes and started actually doing the math. This post is the math, and the small CLI tool I now run before any local inference job.

## Why "free VRAM" is not what you think

`nvidia-smi` reports a snapshot. It tells you what is allocated right now. It does not tell you what your model loader is about to allocate, and it does not account for the things that are about to grow.

Three buckets eat into the gap between "reported free" and "actually usable":

1. CUDA context overhead. Loading a CUDA context for inference costs a few hundred megabytes of VRAM.
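The pre-flight idea can be sketched in a few lines of Python. This is a minimal sketch, not the actual tool from the post: the overhead constants, the function names, and the safety margin are rough assumptions you would want to tune for your own card and loader.

```python
import subprocess

# Assumed figures, not measured values from the post.
CUDA_CONTEXT_MIB = 600    # typical CUDA context + driver allocations
SAFETY_MARGIN_MIB = 1024  # headroom for fragmentation and KV-cache growth

def query_free_vram_mib(gpu_index: int = 0) -> int:
    """Ask nvidia-smi for currently free VRAM on one GPU, in MiB."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.free",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return int(out.strip().splitlines()[0])

def preflight_check(model_file_mib: float, free_mib: float) -> bool:
    """Return True only if the model plus estimated overheads fits in VRAM."""
    needed = model_file_mib + CUDA_CONTEXT_MIB + SAFETY_MARGIN_MIB
    return needed <= free_mib
```

The point of the check is that `preflight_check` compares against the model file size *plus* the buckets that grow after loading, instead of eyeballing `nvidia-smi` output alone.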
