
Running LLMs Locally: A Rigorous Benchmark of Phi-3, Mistral, and Llama 3.2 on Ollama
Abstract

This report presents a comprehensive evaluation of three small language models (SLMs) – Llama 3.2 (3B), Phi-3 mini, and Mistral 7B – running locally via Ollama. A FastAPI-based benchmarking framework was developed to measure inference speed, resource consumption, and the models' ability to produce valid JSON outputs as defined by Pydantic schemas. A retry mechanism with reprompting was implemented to handle malformed responses. The models were tested on a suite of 30 prompts spanning general knowledge, mathematics, coding, reasoning, and creative writing. The results highlight trade-offs between speed, accuracy, and resource usage, providing actionable insights for deploying local AI assistants in production environments.

1. Introduction

Local deployment of small language models offers privacy, low latency, and cost advantages over cloud-based APIs. However, ensuring consistent, structured outputs is essential for integration into applications. This project benchmarks three popular small language models – Llama 3.2 (3B), Phi-3 mini, and Mistral 7B – served locally through Ollama.
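The retry-with-reprompting loop described in the abstract can be sketched as follows. This is a minimal illustration, not the project's actual code: `query_model` is a stand-in for whatever function sends a prompt to the local Ollama server, and the required-key check is a simplified substitute for the Pydantic schema validation the framework actually performs.

```python
import json

REQUIRED_KEYS = {"answer", "confidence"}  # hypothetical schema fields


def validate_output(text: str) -> dict:
    """Parse the model's reply and check it against a minimal 'schema'.

    The report's framework uses Pydantic models for this step; a plain
    key check keeps this sketch self-contained.
    """
    data = json.loads(text)  # raises a ValueError subclass on malformed JSON
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def generate_with_retry(prompt: str, query_model, max_retries: int = 3) -> dict:
    """Ask the model for JSON; on a validation failure, reprompt with the error."""
    current_prompt = prompt
    for _ in range(max_retries):
        raw = query_model(current_prompt)
        try:
            return validate_output(raw)
        except ValueError as err:
            # Reprompt: feed the validation error back to the model so it
            # can correct its previous malformed reply.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid ({err}). "
                "Respond with valid JSON only."
            )
    raise RuntimeError(f"no valid JSON after {max_retries} attempts")


# Demo with a stand-in for the Ollama call: fails once, then succeeds.
replies = iter(['not json at all', '{"answer": "42", "confidence": 0.9}'])
result = generate_with_retry("What is 6*7? Reply as JSON.", lambda p: next(replies))
print(result["answer"])  # -> 42
```

In practice, `query_model` might POST to Ollama's `/api/chat` endpoint with `"format": "json"`, which biases the model toward emitting parseable JSON and makes the retry path the exception rather than the rule.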


