
Flash-MoE: Running a 397B Parameter Model on a Laptop
TL;DR: A new Mixture-of-Experts implementation lets you run a 397-billion-parameter model on consumer hardware. No cloud. No API costs. Just your laptop and patience.

The Breakthrough

Yesterday, Flash-MoE hit the Hacker News front page with 332 points. The pitch is simple: run massive models locally by only activating the parameters you need.

Traditional dense models activate every parameter for every token. For a 397B model, that means 397 billion parameters' worth of compute per token — which is why you normally need datacenter GPUs.

Mixture-of-Experts (MoE) works differently. The model holds 397B total parameters but activates only ~50B per token: a "router" picks which expert networks to use for each input. Flash-MoE optimizes this routing to be memory-efficient enough for consumer GPUs.

Why This Matters

The economics shift:

| Approach | Cost per 1M tokens | Hardware needed |
|---|---|---|
| GPT-4 API | $30+ | None (cloud) |
| Local 70B | ~$0.001 | RTX 4090 |
| Flash-MoE 397B | ~$0.001 | RTX 4090 + patience |

Same c
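To make the routing idea concrete, here is a minimal sketch of top-k expert routing in NumPy. This is not Flash-MoE's actual code — the function names, dimensions, and k=2 choice are illustrative assumptions — but it shows the core mechanism: only k of the experts run for a given token, so per-token compute scales with k rather than with the total number of experts.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Illustrative top-k MoE routing (not Flash-MoE's implementation).

    Only k of len(experts) expert networks run for this token,
    so compute scales with k, not the total parameter count."""
    logits = x @ router_w                       # one score per expert
    probs = np.exp(logits - logits.max())       # stable softmax
    probs /= probs.sum()
    topk = np.argsort(probs)[-k:]               # indices of the k best experts
    weights = probs[topk] / probs[topk].sum()   # renormalize over chosen experts
    # weighted sum of the selected experts' outputs; the others are never run
    return sum(w * experts[i](x) for i, w in zip(topk, weights))

# Toy demo: 8 experts, each a random linear map; only 2 run per token.
rng = np.random.default_rng(0)
d = 16
router_w = rng.normal(size=(d, 8))
experts = [(lambda W: (lambda v: v @ W))(rng.normal(size=(d, d)))
           for _ in range(8)]
x = rng.normal(size=d)
y = moe_forward(x, router_w, experts, k=2)
print(y.shape)
```

In a real MoE transformer the same idea applies per layer and per token, and the hard part — which Flash-MoE targets — is keeping the inactive experts' weights out of scarce GPU memory until the router asks for them.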
Continue reading on Dev.to

