
Your iPhone Just Ran a 400B AI Model. It Shouldn't Be Possible.
Someone just ran a 400-billion-parameter AI model on an iPhone. No server connection. No cloud streaming. It was in airplane mode, running on a phone with 12GB of RAM. On paper, the model needs over 200GB of memory (400 billion parameters at roughly 4 bits per weight works out to about 200GB). The phone has 12. How does this seemingly defy the laws of computer science?

The Trick: Stream, Don't Load

A developer known as @anemll posted a demo over the weekend using an open-source project named Flash-MoE. It implements a Mixture of Experts architecture that activates only 4 to 10 of its 512 experts per token. Instead of preloading the complete model into memory, it streams weights directly from the phone's NVMe SSD to the GPU on demand. (Both tricks are sketched in code at the end of this post.)

Imagine a library where you only take books off the shelf when you get a question. The rest of the books stay put. Except the "shelf" is a 2GB/s SSD and the "books" are neural network weights.

The Speed? 0.6 Tokens Per Second

That's crazy slow for a chatbot. You'd be looking at 30 seconds for a single short sentence: at 0.6 tokens per second, 30 seconds buys you about 18 tokens.
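To make the routing half of the trick concrete, here's a minimal sketch in Python/NumPy. This is not Flash-MoE's actual code: the 512-expert count and the 4-to-10 active range come from the post, but the hidden size, the router weights, and the top-k of 8 are invented for illustration.

```python
import numpy as np

# Numbers from the article: 512 experts, a handful active per token.
# HIDDEN and TOP_K are illustrative assumptions.
NUM_EXPERTS = 512
TOP_K = 8
HIDDEN = 1024

rng = np.random.default_rng(0)
router_weights = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02

def route(token_hidden_state: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pick the TOP_K experts for one token.

    Returns (expert_ids, gate_weights). Every other expert is never
    touched, so its weights never need to be in memory at all.
    """
    logits = token_hidden_state @ router_weights       # (NUM_EXPERTS,)
    top_ids = np.argsort(logits)[-TOP_K:]              # best-scoring experts
    gates = np.exp(logits[top_ids])
    gates /= gates.sum()                               # softmax over top-k only
    return top_ids, gates

token = rng.standard_normal(HIDDEN)
expert_ids, gates = route(token)
print(expert_ids, gates.round(3))  # 8 expert ids out of 512
```

The point to notice: the 504 experts that lose the vote contribute nothing to this token, so their weights can stay on disk.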
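And here's the streaming half, again as a hypothetical sketch rather than the project's real implementation. One common way to get "load only what you touch" behavior is to memory-map the weight file, so the OS pages in just the experts a token routes to. The file name and the toy dimensions below are made up.

```python
import numpy as np

NUM_EXPERTS, HIDDEN, FFN = 512, 64, 128   # toy sizes, not the real model's
WEIGHTS_FILE = "experts.bin"              # hypothetical checkpoint path

# Create a dummy checkpoint once so the demo is self-contained.
np.zeros((NUM_EXPERTS, HIDDEN, FFN), dtype=np.float16).tofile(WEIGHTS_FILE)

# Memory-map the file: nothing is read from the SSD until it's indexed,
# so resident memory tracks the few experts a token actually activates.
experts = np.memmap(WEIGHTS_FILE, dtype=np.float16, mode="r",
                    shape=(NUM_EXPERTS, HIDDEN, FFN))

def expert_forward(expert_id: int, x: np.ndarray) -> np.ndarray:
    w = np.asarray(experts[expert_id])    # this line triggers the disk read
    return x @ w                          # (HIDDEN,) @ (HIDDEN, FFN)

x = np.random.default_rng(1).standard_normal(HIDDEN).astype(np.float16)
for eid in (3, 417, 42):                  # only these 3 of 512 touch RAM
    y = expert_forward(eid, x)
print(y.shape)                            # (FFN,)
```

A real on-device stack would do something fancier (async prefetch, caching hot experts), but the memory math is the same: resident weights scale with active experts per token, not with total model size.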
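Where does 0.6 tokens per second come from? If every active expert's weights have to be pulled off the SSD for every token, bandwidth alone lands almost exactly on the observed number. A crude back-of-envelope, using the post's figures; the "8 active experts, no caching, nearly all weights live in the experts" assumptions are mine:

```python
# Why ~0.6 tokens/sec is roughly what SSD bandwidth predicts.
total_weights_gb = 200        # on-disk model size (from the article)
num_experts      = 512        # total experts (from the article)
active_experts   = 8          # middle of the article's 4-10 range
ssd_gb_per_sec   = 2.0        # SSD read bandwidth (from the article)

gb_per_expert = total_weights_gb / num_experts      # ~0.39 GB
gb_per_token  = gb_per_expert * active_experts      # ~3.1 GB read per token
seconds_per_token = gb_per_token / ssd_gb_per_sec   # ~1.6 s
print(f"{1 / seconds_per_token:.2f} tokens/sec")    # ~0.64, right in line
```

In other words, the bottleneck isn't compute at all; it's how fast the SSD can feed weights to the GPU.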



