I built an Ollama alternative with TurboQuant, model groups, and multi-GPU support
NewsTools


via Dev.to (deharoalexandre-cyber)

The problem

I run multi-model architectures: 3 LLMs receiving the same prompt, deliberating, and producing a consensus response. Think of it as a voting system where individual model biases cancel out.

Ollama swaps models sequentially. vLLM is cloud-oriented. llama.cpp server handles one model at a time. None of them could do what I needed: load 3+ models simultaneously, send them the same prompt in parallel, collect all responses, and handle failures gracefully. So I built EIE.

What EIE does

EIE (Elyne Inference Engine) is a local inference server for GGUF models. It loads models, serves them via an OpenAI-compatible REST API, and manages GPU memory. It does one thing: serve completions. No agents, no RAG, no UI. Everything else runs on top.

Model Groups

This is the core idea. Instead of thinking in individual models, EIE thinks in groups:

```yaml
groups:
  - name: core
    models: [mistral-7b, granite-3b, exaone-2.4b]
    required_responses: 3
    type: parallel
    pinned: true
    fallback: partia
```
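EIE's actual dispatch code isn't shown in this excerpt, but the core idea of a group — send one prompt to every member in parallel, keep whatever responses come back, and fail only when fewer than `required_responses` succeed — can be sketched with `asyncio`. All names here (`fan_out`, `good_model`, `bad_model`) are hypothetical illustrations, not EIE's API:

```python
import asyncio


async def fan_out(prompt, models, required_responses):
    """Send the same prompt to every model concurrently.

    Collects whichever responses succeed; raises only if fewer than
    `required_responses` models answered.
    """
    # return_exceptions=True lets one failing backend surface as a value
    # instead of cancelling the other in-flight requests.
    results = await asyncio.gather(
        *(m(prompt) for m in models), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, Exception)]
    if len(ok) < required_responses:
        raise RuntimeError(
            f"only {len(ok)} of {required_responses} required responses arrived"
        )
    return ok


# Toy stand-ins for model backends (hypothetical).
async def good_model(prompt):
    return f"answer to: {prompt}"


async def bad_model(prompt):
    raise ConnectionError("GPU worker down")


if __name__ == "__main__":
    answers = asyncio.run(
        fan_out("2+2?", [good_model, good_model, bad_model], 2)
    )
    print(len(answers))  # 2: one backend failed, but the group still met quorum
```

The key design choice mirrored here is that a group degrades gracefully: one dead backend doesn't abort the request as long as the `required_responses` quorum is met.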

Continue reading on Dev.to
