~21 tok/s Gemma 4 on a Ryzen mini PC: llama.cpp, Vulkan, and the messy truth about local chat
How-To · DevOps


Hermes Rodríguez · via Dev.to

Hands-on guide based on a real setup: Ubuntu 24.04 LTS, AMD Radeon 760M (Ryzen iGPU), lots of RAM (e.g. 96 GiB), llama.cpp built with GGML_VULKAN, an OpenAI-compatible API via llama-server, Open WebUI in Docker, and OpenCode or VS Code (§11) using the same API.

Who this is for: if you buy (or plan to buy) a mini PC or small tower with plenty of RAM and disk, this walkthrough gets you to local inference: GGUF weights on your box, chat and APIs on your LAN, without treating a cloud vendor as mandatory for every request. The documented path is AMD iGPU + Vulkan; if your hardware differs, keep the Ubuntu → llama.cpp → weights → server flow and adjust §5–§6 (deps and build) for your GPU.

Reference hardware (validated while writing this guide): Minisforum UM760 Slim mini PC (Device Type: MINI PC on the chassis label; vendor Minisforum / Micro Computer (HK) Tech Limited) with AMD Ryzen 5 7640HS, Radeon 760M Graphics, 96 GiB DDR5 RAM, ~1 TiB NVMe, Ubuntu 24.04 LTS. This is not a minimu…
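The Ubuntu → llama.cpp → weights → server flow described above can be sketched end to end. This is a minimal sketch, not the article's exact commands: the GGUF model path is a placeholder, and the Vulkan dependency package names are assumptions for Ubuntu 24.04 (check llama.cpp's Vulkan build docs for your distro).

```shell
# Vulkan build dependencies (package names assumed for Ubuntu 24.04;
# glslc may also come from the LunarG Vulkan SDK)
sudo apt install build-essential cmake git libvulkan-dev glslc vulkan-tools
vulkaninfo --summary   # confirm the Radeon 760M shows up as a Vulkan device

# Build llama.cpp with the Vulkan backend
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j"$(nproc)"

# Serve a GGUF model with an OpenAI-compatible API on the LAN
# (-ngl 99 offloads all layers to the iGPU; the model path is a placeholder)
./build/bin/llama-server -m ~/models/model.gguf -ngl 99 \
  --host 0.0.0.0 --port 8080

# From any client: the same /v1/chat/completions shape the cloud APIs use
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'

# Optional: Open WebUI in Docker pointed at the same local API
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  ghcr.io/open-webui/open-webui:main
```

Because llama-server speaks the OpenAI wire format, the same base URL works for Open WebUI, OpenCode, or VS Code extensions without any cloud account.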

Continue reading on Dev.to


