
Your iPhone Just Ran a 400B AI Model. It Shouldn't Be Possible.
Someone just ran a 400-billion-parameter AI model on an iPhone. No server connection. No cloud streaming. It was in airplane mode, running on a phone with 12GB of RAM. On paper, the model needs over 200GB of memory (400 billion parameters at roughly 4 bits per weight works out to about 200GB). The phone has 12. How does this seemingly defy the laws of computer science?

The Trick: Stream, Don't Load

A developer known as @anemll posted a demo over the weekend using an open-source project named Flash-MoE. It implements a Mixture of Experts architecture that activates only 4 to 10 of its 512 experts per token. Instead of preloading the complete model into memory, it streams weights directly from the phone's NVMe SSD to the GPU on demand. (Both tricks are sketched in code at the end of this post.)

Imagine a library where you only take books off the shelf when you get a question. The rest of the books stay put. Except the "shelf" is a 2GB/s SSD and the "books" are neural network weights.

The Speed? 0.6 Tokens Per Second

That's crazy slow for a chatbot. You'd be looking at 30 seconds for a single short sentence: at 0.6 tokens per second, 30 seconds buys you about 18 tokens.
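To make the routing half of the trick concrete, here's a minimal sketch in Python/NumPy. This is not Flash-MoE's actual code: the 512-expert count and the 4-to-10 active range come from the post, but the hidden size, the router weights, and the top-k of 8 are invented for illustration.

```python
import numpy as np

# Numbers from the article: 512 experts, a handful active per token.
# HIDDEN and TOP_K are illustrative assumptions.
NUM_EXPERTS = 512
TOP_K = 8
HIDDEN = 1024

rng = np.random.default_rng(0)
router_weights = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02

def route(token_hidden_state: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pick the TOP_K experts for one token.

    Returns (expert_ids, gate_weights). Every other expert is never
    touched, so its weights never need to be in memory at all.
    """
    logits = token_hidden_state @ router_weights       # (NUM_EXPERTS,)
    top_ids = np.argsort(logits)[-TOP_K:]              # best-scoring experts
    gates = np.exp(logits[top_ids])
    gates /= gates.sum()                               # softmax over top-k only
    return top_ids, gates

token = rng.standard_normal(HIDDEN)
expert_ids, gates = route(token)
print(expert_ids, gates.round(3))  # 8 expert ids out of 512
```

The point to notice: the 504 experts that lose the vote contribute nothing to this token, so their weights can stay on disk.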
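And here's the streaming half, again as a hypothetical sketch rather than the project's real implementation. One common way to get "load only what you touch" behavior is to memory-map the weight file, so the OS pages in just the experts a token routes to. The file name and the toy dimensions below are made up.

```python
import numpy as np

NUM_EXPERTS, HIDDEN, FFN = 512, 64, 128   # toy sizes, not the real model's
WEIGHTS_FILE = "experts.bin"              # hypothetical checkpoint path

# Create a dummy checkpoint once so the demo is self-contained.
np.zeros((NUM_EXPERTS, HIDDEN, FFN), dtype=np.float16).tofile(WEIGHTS_FILE)

# Memory-map the file: nothing is read from the SSD until it's indexed,
# so resident memory tracks the few experts a token actually activates.
experts = np.memmap(WEIGHTS_FILE, dtype=np.float16, mode="r",
                    shape=(NUM_EXPERTS, HIDDEN, FFN))

def expert_forward(expert_id: int, x: np.ndarray) -> np.ndarray:
    w = np.asarray(experts[expert_id])    # this line triggers the disk read
    return x @ w                          # (HIDDEN,) @ (HIDDEN, FFN)

x = np.random.default_rng(1).standard_normal(HIDDEN).astype(np.float16)
for eid in (3, 417, 42):                  # only these 3 of 512 touch RAM
    y = expert_forward(eid, x)
print(y.shape)                            # (FFN,)
```

A real on-device stack would do something fancier (async prefetch, caching hot experts), but the memory math is the same: resident weights scale with active experts per token, not with total model size.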
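Where does 0.6 tokens per second come from? If every active expert's weights have to be pulled off the SSD for every token, bandwidth alone lands almost exactly on the observed number. A crude back-of-envelope, using the post's figures; the "8 active experts, no caching, nearly all weights live in the experts" assumptions are mine:

```python
# Why ~0.6 tokens/sec is roughly what SSD bandwidth predicts.
total_weights_gb = 200        # on-disk model size (from the article)
num_experts      = 512        # total experts (from the article)
active_experts   = 8          # middle of the article's 4-10 range
ssd_gb_per_sec   = 2.0        # SSD read bandwidth (from the article)

gb_per_expert = total_weights_gb / num_experts      # ~0.39 GB
gb_per_token  = gb_per_expert * active_experts      # ~3.1 GB read per token
seconds_per_token = gb_per_token / ssd_gb_per_sec   # ~1.6 s
print(f"{1 / seconds_per_token:.2f} tokens/sec")    # ~0.64, right in line
```

In other words, the bottleneck isn't compute at all; it's how fast the SSD can feed weights to the GPU.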



