Where is the computing limit of smartphones?
On March 23, a large language model with 400 billion parameters was successfully run on a smartphone.
Technical "Black Technology": Flash Memory Streaming and Mixture of Experts (MoE) Model
Given the phone's severe shortage of memory capacity, this "impossible task" was achieved mainly through two techniques:
Forced SSD "Expansion": using the open-source project Flash-MoE, the device streams model weights directly from solid-state storage (SSD) to the GPU, breaking through the physical memory limit.
Advantages of the MoE Architecture: "MoE" stands for Mixture of Experts, meaning that generating each word activates only a small fraction of the 400 billion parameters, rather than loading the entire model at once.
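The two mechanisms above can be sketched together in a toy example: the expert weights live in a file standing in for flash storage, the file is memory-mapped so nothing is resident in RAM until touched, and only the top-k experts chosen by a router are actually read for each token. All names and sizes here (NUM_EXPERTS, TOP_K, HIDDEN) are illustrative assumptions, not the real Flash-MoE interface.

```python
# Toy sketch of MoE-style sparse expert activation with weights streamed
# from storage. Illustrative only; not the actual Flash-MoE implementation.
import os
import tempfile

import numpy as np

NUM_EXPERTS = 8   # total experts stored on "flash" (assumed toy size)
TOP_K = 2         # experts actually activated per token
HIDDEN = 4        # toy hidden dimension

# 1. Write all expert weights to a file, standing in for the SSD.
weights = np.random.default_rng(0).standard_normal(
    (NUM_EXPERTS, HIDDEN, HIDDEN)).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "experts.npy")
np.save(path, weights)

# 2. Memory-map the file: pages are read from disk only when accessed,
#    so the full 8-expert tensor never has to fit in RAM at once.
flash = np.load(path, mmap_mode="r")

def moe_forward(x: np.ndarray, router_logits: np.ndarray) -> np.ndarray:
    """Route a token to its TOP_K experts and read only their weights."""
    top = np.argsort(router_logits)[-TOP_K:]      # chosen expert ids
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                          # softmax over chosen experts
    out = np.zeros_like(x)
    for gate, e in zip(gates, top):
        w = np.asarray(flash[e])                  # pulls ONLY expert e from disk
        out += gate * (x @ w)
    return out

token = np.ones(HIDDEN, dtype=np.float32)
logits = np.array([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
y = moe_forward(token, logits)
print(y.shape)  # only 2 of 8 expert matrices were ever read
```

The point of the sketch is the ratio: per token, only TOP_K/NUM_EXPERTS of the weights cross the storage bus, which is what makes a model far larger than RAM feasible at all, at the cost of disk-read latency on every step.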
Speed Drawback: A Word Every Two Seconds
Although it "ran successfully," the actual experience is still far from being "usable." Test results show:
Generation Speed: only 0.6 tokens/second, i.e., roughly 1.7 seconds to produce each word.
Power Consumption: sustained local computation at this intensity drains the battery quickly, and the heat generated is far from negligible.
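At the reported 0.6 tokens/second, the practical impact is easy to sanity-check with a little arithmetic (the 200-token response length below is an assumed example, not from the test):

```python
# Back-of-the-envelope latency at the reported 0.6 tokens/second.
tokens_per_second = 0.6
seconds_per_token = 1 / tokens_per_second        # ~1.67 s per token

response_tokens = 200                            # assumed typical short answer
total_minutes = response_tokens * seconds_per_token / 60
print(f"{seconds_per_token:.2f} s/token, "
      f"{total_minutes:.1f} min for {response_tokens} tokens")
# → 1.67 s/token, 5.6 min for 200 tokens
```

In other words, even a short answer takes minutes end to end, which is why the demo is a proof of feasibility rather than a usable product.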
Industry Insight: The "Singularity" of Local Large Models Is Approaching?
Although the current generation speed is frustrating, the symbolic significance of this demonstration exceeds its practical value. It proves that running top-scale large models locally on a smartphone is not a dead end.
Privacy Protection: running locally means data never needs to leave the device for the cloud, offering strong privacy guarantees.
Offline Feasibility: getting responses from a top-tier model without any internet connection is becoming realistic.
