How I Stopped GGUF Models From Crashing My GPU: A Pre-flight VRAM Check
via Dev.to · Dmytro Romanov · 3h ago

The crash that started this

I was loading a Q4_K_M quantized 13B model on a 24GB card. The model file was about 7.5GB. Free VRAM according to nvidia-smi: 21GB. Plenty of headroom. I hit run, watched the loader bar, and the process died on the last few layers with CUDA out of memory.

That was not a one-off. I had the same crash twice that week, each time after eyeballing free VRAM and convincing myself a model would fit. After the second one I stopped trusting my eyes and started actually doing the math. This post is the math, and the small CLI tool I now run before any local inference job.

Why "free VRAM" is not what you think

nvidia-smi reports a snapshot. It tells you what is allocated right now. It does not tell you what your model loader is about to allocate, and it does not account for the things that are about to grow. Three buckets eat into the gap between "reported free" and "actually usable":

1. CUDA context overhead. Loading a CUDA context for inference costs a few hundred

Continue reading on Dev.to
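The excerpt cuts off before the author's actual tool, but the pre-flight check it describes can be sketched as a short script: measure the model file, add fixed overheads, and compare against what nvidia-smi reports as free. The overhead constants below are placeholder assumptions of mine, not measured values from the article, and `estimate_required_mib` / `free_vram_mib` are hypothetical names; tune the margins for your own loader and context length.

```python
#!/usr/bin/env python3
"""Pre-flight VRAM check sketch: will a GGUF model plausibly fit?

Assumption-laden: the overhead figures are rough placeholders,
not the truncated article's measured numbers.
"""
import os
import subprocess
import sys

# Placeholder safety margins (MiB) -- assumptions, tune for your setup.
CUDA_CONTEXT_MIB = 500    # per-process CUDA context cost (assumed)
SAFETY_MARGIN_MIB = 1024  # headroom for KV cache and scratch buffers (assumed)

def estimate_required_mib(model_bytes: int,
                          context_mib: int = CUDA_CONTEXT_MIB,
                          margin_mib: int = SAFETY_MARGIN_MIB) -> int:
    """Model weights plus fixed overheads, in whole MiB (rounded up)."""
    model_mib = -(-model_bytes // (1024 * 1024))  # ceiling division
    return model_mib + context_mib + margin_mib

def free_vram_mib(gpu_index: int = 0) -> int:
    """Ask nvidia-smi for currently free VRAM on one GPU, in MiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits", f"--id={gpu_index}"],
        text=True)
    return int(out.strip().splitlines()[0])

def main() -> int:
    model_path = sys.argv[1]
    required = estimate_required_mib(os.path.getsize(model_path))
    free = free_vram_mib()
    verdict = "OK" if required <= free else "WILL NOT FIT"
    print(f"required ~{required} MiB, free {free} MiB -> {verdict}")
    return 0 if required <= free else 1

if __name__ == "__main__" and len(sys.argv) > 1:
    raise SystemExit(main())
```

Run as `python vram_check.py model.gguf` before launching inference; a nonzero exit code means the estimate does not fit, which is easy to wire into a shell script.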


Related Articles

  • Borrow-checking surprises (Lobsters • 2h ago)
  • Verifying human authorship with human.json (Lobsters • 4h ago)
  • On Vinyl Cache and Varnish Cache (Lobsters • 5h ago)
  • GUID v4 vs v7: Why You Should Care About the Shift (Reddit Programming • 5h ago)
  • The Future of Everything is Lies, I Guess (Lobsters • 6h ago)