When large models are no longer just "describing images" or "generating images from text," but can understand, plan, and perform cross-modal operations in complex environments the way humans do, multimodal AI is undergoing a qualitative leap. On October 30, the Beijing Zhiyuan Institute of Artificial Intelligence officially released its next-generation multimodal world model, Emu3.5. For the first time, it brings autoregressive Next-State Prediction (NSP) into multimodal sequence modeling, marking a key step in AI's move from "perception and understanding" to "intelligent operation."
NSP Architecture: Teaching AI to "Predict How the World Will Change"
The core breakthrough of Emu3.5 lies in its unified NSP framework: the model treats multimodal inputs such as text, images, and action instructions as a continuous sequence of states, and achieves end-to-end reasoning by predicting the "next state." This means Emu3.5 not only understands the current scene but also anticipates the outcome of each action and plans the optimal action path accordingly.
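Zhiyuan has not released the implementation details behind NSP, but the description above reads as standard autoregressive next-token prediction applied to a single interleaved stream of text, image, and action tokens. The following is a minimal sketch under that assumption; the vocabulary layout, modality-marker ids, and model dimensions (TinyNSPModel, BOS/IMG/ACT, and so on) are hypothetical illustrations, not Emu3.5's actual design.

```python
# Minimal sketch of autoregressive next-state prediction (NSP) over an
# interleaved multimodal token stream. Everything here (vocab layout,
# modality markers, model size) is a hypothetical illustration.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024          # shared vocabulary for text, image, and action tokens
BOS, IMG, ACT = 1, 2, 3    # hypothetical modality-marker token ids


class TinyNSPModel(nn.Module):
    """A small decoder-only transformer that predicts the next token,
    i.e. the next piece of the next state, in the unified stream."""

    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        # Causal mask: each position may only attend to earlier states.
        seq_len = tokens.size(1)
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        hidden = self.backbone(self.embed(tokens), mask=mask)
        return self.lm_head(hidden)


def nsp_loss(model, tokens):
    """Training objective: cross-entropy on the next token, regardless of
    whether that token encodes text, an image patch, or an action."""
    logits = model(tokens[:, :-1])
    targets = tokens[:, 1:]
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1)
    )


if __name__ == "__main__":
    model = TinyNSPModel()
    # One fake interleaved sequence: [BOS, text..., IMG, patches..., ACT, actions...]
    stream = torch.randint(4, VOCAB_SIZE, (1, 32))
    stream[0, 0], stream[0, 10], stream[0, 24] = BOS, IMG, ACT
    print("NSP loss:", nsp_loss(model, stream).item())
```

The point of the sketch is that one loss and one predictor cover all modalities: the "next state" is simply whatever token comes next, whether it belongs to text, an image, or an action.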

For example, when a user asks to "move the coffee cup in this photo to the right side of the table and brighten the overall tone," Emu3.5 can not only accurately identify the objects and background but also decompose the request into steps, carrying out compound operations such as object relocation and lighting adjustment while keeping each output physically plausible and visually consistent.
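Purely as an illustration (the state representation below is an assumption, not something Zhiyuan has published), such a compound instruction can be pictured as a short plan that the model rolls forward one predicted state at a time:

```python
# Hypothetical illustration of stepping through a compound edit as a
# sequence of predicted states; the State fields and predict_next_state
# helper are assumptions, not Emu3.5's internal representation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class State:
    description: str                                       # scene at this step
    pending_ops: List[str] = field(default_factory=list)   # edits still to apply


def predict_next_state(state: State) -> State:
    """Stand-in for the model's next-state prediction: apply the first
    pending operation and return the resulting state."""
    op, *rest = state.pending_ops
    return State(description=f"{state.description} -> {op}", pending_ops=rest)


state = State(
    description="photo: coffee cup on the left of the table, dim lighting",
    pending_ops=["move cup to the right side", "brighten the overall tone"],
)
while state.pending_ops:        # roll the plan forward state by state
    state = predict_next_state(state)
    print(state.description)
```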
Embodied Intelligence Begins to Emerge: Comprehensive Upgrade of Cross-Scenario Operation Capabilities
In testing, Emu3.5 has shown strong cross-modal generalization and embodied operational capabilities:
Text-image collaborative generation: produces highly detailed images from complex descriptions (such as "a rain-soaked street in cyberpunk style, with neon lights reflecting off the wet road");
Intelligent image editing: supports semantic-level modifications (such as "change the character's clothing to a vintage suit") with no manual selection required;
Spatiotemporal dynamic reasoning: edits video frame sequences coherently, for example "make a running character suddenly stop and turn around."
These capabilities make it highly promising in scenarios that require a "perception-decision-execution" loop, such as robot control, virtual assistants, and intelligent design.
New Paradigm of Multimodal Fusion: Breaking Information Silos
Unlike early multimodal models that merely aligned features across modalities, Emu3.5 unifies text, vision, and action into a single predictable state stream, enabling genuinely free switching between modalities and collaborative cross-modal reasoning. Researchers can use it to process heterogeneous data efficiently, while ordinary users can complete, through natural language alone, creative tasks that previously required professional software.
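A rough way to picture this "single state stream" idea: heterogeneous inputs are flattened into one token sequence tagged with modality markers, so the same predictor can read or emit any modality. The sketch below is a simplified assumption; the marker names and token ids are invented for illustration.

```python
# Hypothetical sketch of packing heterogeneous inputs into one predictable
# token stream; the modality markers and token ids are illustrative only.
from typing import List

TEXT, IMAGE, ACTION = "<text>", "<image>", "<action>"   # assumed modality markers


def interleave(text_tokens: List[int],
               image_tokens: List[int],
               action_tokens: List[int]) -> List[str]:
    """Flatten text, vision, and action tokens into a single state stream,
    tagging each segment with its modality so the model can switch freely
    between modalities while predicting the next state."""
    stream: List[str] = []
    for marker, toks in ((TEXT, text_tokens),
                         (IMAGE, image_tokens),
                         (ACTION, action_tokens)):
        stream.append(marker)
        stream.extend(str(t) for t in toks)
    return stream


print(interleave([101, 102], [7001, 7002, 7003], [42]))
# ['<text>', '101', '102', '<image>', '7001', '7002', '7003', '<action>', '42']
```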
Zhiyuan stated that Emu3.5 will first be applied in education (intelligent courseware generation), healthcare (multimodal medical record analysis), and entertainment (AI director), and that it will continue to open-source selected capabilities to foster the multimodal ecosystem.
Conclusion: From "Understanding the World" to "Operating the World"
The release of Emu3.5 is not only an upgrade in technical metrics but also a shift in AI's role: evolving from a passively responding "tool" into an actively planning "collaborator." When a model starts to predict "what will happen next," it truly begins its journey toward general intelligence. With the NSP architecture as its fulcrum, Zhiyuan is levering open the next breakthrough in multimodal AI.
