~21 tok/s Gemma 4 on a Ryzen mini PC: llama.cpp, Vulkan, and the messy truth about local chat
How-To · DevOps


Hermes Rodríguez · via Dev.to

Hands-on guide based on a real setup: Ubuntu 24.04 LTS, AMD Radeon 760M (Ryzen iGPU), lots of RAM (e.g. 96 GiB), llama.cpp built with GGML_VULKAN, an OpenAI-compatible API via llama-server, Open WebUI in Docker, and OpenCode or VS Code (§11) using the same API.

Who this is for: if you buy (or plan to buy) a mini PC or small tower with plenty of RAM and disk, this walkthrough gets you to local inference: GGUF weights on your box, chat and APIs on your LAN, without treating a cloud vendor as mandatory for every request. The documented path is AMD iGPU + Vulkan; if your hardware differs, keep the Ubuntu → llama.cpp → weights → server flow and adjust §5–§6 (deps and build) for your GPU.

Reference hardware (validated while writing this guide): Minisforum UM760 Slim mini PC (Device Type: MINI PC on the chassis label; vendor Minisforum / Micro Computer (HK) Tech Limited) with AMD Ryzen 5 7640HS, Radeon 760M Graphics, 96 GiB DDR5 RAM, ~1 TiB NVMe, Ubuntu 24.04 LTS. This is not a minimu…
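The Ubuntu → llama.cpp → weights → server flow described above can be sketched end to end. This is a minimal sketch, not the article's exact commands: the GGUF model path is a placeholder, and the Vulkan dependency package names are assumptions for Ubuntu 24.04 (check llama.cpp's Vulkan build docs for your distro).

```shell
# Vulkan build dependencies (package names assumed for Ubuntu 24.04;
# glslc may also come from the LunarG Vulkan SDK)
sudo apt install build-essential cmake git libvulkan-dev glslc vulkan-tools
vulkaninfo --summary   # confirm the Radeon 760M shows up as a Vulkan device

# Build llama.cpp with the Vulkan backend
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j"$(nproc)"

# Serve a GGUF model with an OpenAI-compatible API on the LAN
# (-ngl 99 offloads all layers to the iGPU; the model path is a placeholder)
./build/bin/llama-server -m ~/models/model.gguf -ngl 99 \
  --host 0.0.0.0 --port 8080

# From any client: the same /v1/chat/completions shape the cloud APIs use
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

# Optional: Open WebUI in Docker pointed at the same local API
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  ghcr.io/open-webui/open-webui:main
```

Because llama-server speaks the OpenAI wire format, the same base URL works for Open WebUI, OpenCode, or VS Code extensions without any cloud account.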

Continue reading on Dev.to


