The NEO-unify architecture adopted by
SenseTime Open Sources SenseNova U1, Achieving a Multimodal Native Unified Architecture


Competition in large models is shifting toward agents. SenseTime is developing the industry's first natively multimodal agent base, which integrates "understanding, generation, and action" in a unified core, benchmarks directly against GPT-Image 2, and pushes AI from passive Q&A to active execution...
SenseTime is secretly developing U1Pro, a multimodal large model targeting design scenarios, in an effort led by Chief Scientist Lin Dahua. The model belongs to the "Ri Ri Xin" family and aims to compete with OpenAI's GPT-Image 2, emphasizing long-range logic and thinking capabilities. It is expected to enter internal testing and commercial use in July.
On April 28, SenseTime open-sourced the "SenseNova U1" series, a "native understanding and generation unified model". It moves away from the modular splicing that traditional multimodal models rely on and integrates vision and language deeply through a unified architecture, marking a significant breakthrough for domestic AI in multimodal technology...
SenseTime has launched Seko2.0, the world's first AI agent for multi-scene video generation, enabling continuous narratives that go beyond single clips. It keeps characters, scenes, and style highly consistent, improving plot coherence and visual uniformity. Powered by SenseTime's proprietary multimodal model, it can scale to short videos, ads, and education...
Recently, a research team from the University of Hong Kong, The Chinese University of Hong Kong, and SenseTime released GoT-R1, a groundbreaking framework. By introducing reinforcement learning (RL), this new multimodal large model significantly enhances AI's semantic and spatial reasoning in visual generation tasks, generating high-fidelity, semantically consistent images from complex text prompts. The advance marks another leap in image generation technology. Currently, although existing multimodal large models have made significant progress in generating images based on text prompts