
# How to benchmark NexusQuant on your own model
Running benchmarks on someone else's hardware tells you very little. This guide shows you how to measure NexusQuant's impact on your model, your data, and your hardware in under 15 minutes.

## Prerequisites

```shell
pip install nexusquant-kv transformers torch datasets
```

You need a HuggingFace causal LM (any model using split-half RoPE — that's every Llama, Mistral, Qwen, and Phi variant since 2023).

## Step 1: Load your model

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # replace with yours

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()
```

If you are on a smaller GPU, pass `load_in_8bit=True` or start from a quantized checkpoint; the benchmark logic is the same.

## Step 2: Compute baseline perplexity

Perplexity (PPL) is the standard quality metric for language models. Lower is better. We measure it o
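The article is cut off mid-way through Step 2, but the measurement it describes is the standard sliding-window perplexity evaluation. Here is a minimal sketch of that baseline; the function name `sliding_ppl` and its parameters are my own, not part of NexusQuant's API. Each window conditions on overlapping context but only scores the tokens not already counted by the previous window, so no token is scored twice.

```python
import torch


@torch.no_grad()
def sliding_ppl(model, input_ids, max_len=1024, stride=512):
    """Token-weighted sliding-window perplexity over one long sequence.

    Overlapping context is fed to the model for conditioning, but the
    loss is only taken over tokens not scored by an earlier window.
    """
    seq_len = input_ids.size(1)
    total_nll, total_tokens = 0.0, 0
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        trg_len = end - prev_end              # tokens scored in this window
        ids = input_ids[:, begin:end].to(model.device)
        targets = ids.clone()
        targets[:, :-trg_len] = -100          # mask the re-used context
        out = model(ids, labels=targets)      # HF models average NLL over targets
        total_nll += out.loss.item() * trg_len
        total_tokens += trg_len
        prev_end = end
        if end == seq_len:
            break
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```

Typical usage with the model and tokenizer from Step 1 would be `ids = tokenizer(text, return_tensors="pt").input_ids` followed by `sliding_ppl(model, ids)`, where `text` is a few thousand tokens of your own data. Run it once on the fp16 baseline, once after applying NexusQuant, and compare the two numbers: the PPL delta is the quality cost of the quantization.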


