Where is the computing limit of smartphones? Apple's latest flagship, the iPhone 17 Pro, has just given an answer that is both impressive and somewhat embarrassing.

On March 23, a large language model with 400 billion parameters was successfully run on an iPhone 17 Pro. Note that even after quantization, such models usually require at least 200GB of memory to run, while the iPhone 17 Pro ships with only 12GB of LPDDR5X memory.

Technical "Black Technology": Flash Memory Streaming and Mixture of Experts (MoE) Model

Despite this severe shortage of memory, the "impossible task" was achieved mainly through two techniques:

SSD as Forced "Memory Expansion": Using the open-source project Flash-MoE, the device streams model data from the solid-state drive (SSD) directly to the GPU, bypassing the physical memory limit.

Advantages of the MoE Architecture: "MoE" stands for Mixture of Experts, meaning the system only needs to activate a small fraction of the 400 billion parameters for each generated token, rather than loading the entire model.
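The two techniques above can be sketched together: a router picks a few "experts" per token, and only those experts' weights need to be fetched (in a Flash-MoE-style system, streamed from the SSD on demand). The following is a minimal, illustrative sketch; the expert count, top-k value, and dimensions are hypothetical and not taken from the actual project.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 64   # hypothetical expert count (illustrative only)
TOP_K = 2          # only a few experts run per token
D = 128            # hidden dimension (illustrative only)

# Router: a linear layer that scores every expert for the current token.
router_w = rng.standard_normal((D, NUM_EXPERTS))

# In a real system these weights would live on the SSD and be streamed in
# on demand; here they are just an in-memory dict for illustration.
experts = {i: rng.standard_normal((D, D)) for i in range(NUM_EXPERTS)}

def moe_forward(token_hidden):
    """Run only the top-k experts for this token; the rest are never loaded."""
    scores = token_hidden @ router_w
    top = np.argsort(scores)[-TOP_K:]            # indices of the chosen experts
    logits = scores[top] - scores[top].max()     # stable softmax over chosen
    gates = np.exp(logits) / np.exp(logits).sum()
    out = np.zeros_like(token_hidden)
    for gate, idx in zip(gates, top):
        w = experts[idx]                         # <-- the only weights touched
        out += gate * (token_hidden @ w)
    return out

x = rng.standard_normal(D)
y = moe_forward(x)
print(y.shape)  # (128,)
# Per token, only TOP_K / NUM_EXPERTS of the expert weights are computed:
print(f"fraction of experts active: {TOP_K / NUM_EXPERTS:.3f}")
```

The key point is the inner loop: only `TOP_K` of the `NUM_EXPERTS` weight matrices are ever read for a given token, which is what makes streaming from flash storage tolerable at all.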

Speed Drawback: A Word Every Two Seconds

Although it "ran successfully," the actual experience is still far from being "usable." Test results show:

Generation Speed: Only 0.6 tokens per second. In other words, it takes roughly 1.7 seconds to generate a single token.

Power Consumption Pressure: This high-intensity local computation drains the phone's battery rapidly, and the heat it generates is also far from negligible.
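As a back-of-the-envelope check on what 0.6 tokens per second means in practice (the 100-token reply length is an illustrative assumption, not a figure from the test):

```python
tokens_per_second = 0.6
seconds_per_token = 1 / tokens_per_second    # ~1.67 s per token

# A modest 100-token reply would therefore take:
reply_tokens = 100                           # assumed reply length
reply_seconds = reply_tokens * seconds_per_token

print(f"{seconds_per_token:.2f} s/token")    # 1.67 s/token
print(f"{reply_seconds:.0f} s per reply")    # 167 s, nearly 3 minutes
```

At that rate even short answers take minutes, which is why the demonstration is symbolic rather than practical for now.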

Industry Insight: The "Singularity" of Local Large Models Is Approaching?

Although the current generation speed is frustrating, the symbolic significance of this demonstration exceeds its practical value. It proves that running top-scale large models locally on a smartphone is not a dead end.

Privacy Protection: Running locally means data never needs to be uploaded to the cloud, offering much stronger privacy.

Offline Feasibility: Getting responses from top-tier AI models without an internet connection is becoming possible.