
# How I Stopped GGUF Models From Crashing My GPU: A Pre-flight VRAM Check
## The crash that started this

I was loading a Q4_K_M quantized 13B model on a 24GB card. The model file was about 7.5GB. Free VRAM according to `nvidia-smi`: 21GB. Plenty of headroom. I hit run, watched the loader bar, and the process died on the last few layers with `CUDA out of memory`.

That was not a one-off. I had the same crash twice that week, each time after eyeballing free VRAM and convincing myself a model would fit. After the second one I stopped trusting my eyes and started actually doing the math. This post is the math, and the small CLI tool I now run before any local inference job.

## Why "free VRAM" is not what you think

`nvidia-smi` reports a snapshot. It tells you what is allocated right now. It does not tell you what your model loader is about to allocate, and it does not account for the things that are about to grow.

Three buckets eat into the gap between "reported free" and "actually usable":

1. CUDA context overhead. Loading a CUDA context for inference costs a few hundred megabytes of VRAM.
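The pre-flight idea can be sketched in a few lines of Python. This is a minimal sketch, not the actual tool from the post: the overhead constants, the function names, and the safety margin are rough assumptions you would want to tune for your own card and loader.

```python
import subprocess

# Assumed figures, not measured values from the post.
CUDA_CONTEXT_MIB = 600    # typical CUDA context + driver allocations
SAFETY_MARGIN_MIB = 1024  # headroom for fragmentation and KV-cache growth

def query_free_vram_mib(gpu_index: int = 0) -> int:
    """Ask nvidia-smi for currently free VRAM on one GPU, in MiB."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.free",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return int(out.strip().splitlines()[0])

def preflight_check(model_file_mib: float, free_mib: float) -> bool:
    """Return True only if the model plus estimated overheads fits in VRAM."""
    needed = model_file_mib + CUDA_CONTEXT_MIB + SAFETY_MARGIN_MIB
    return needed <= free_mib
```

The point of the check is that `preflight_check` compares against the model file size *plus* the buckets that grow after loading, instead of eyeballing `nvidia-smi` output alone.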
