ByteDance's AI Video Revolution: Vidi2 Launches
ByteDance, the Chinese tech giant behind TikTok, has recently launched a new AI video editing model called Vidi2, which has drawn industry attention for its video understanding and creation capabilities. According to the latest technical evaluation, Vidi2 surpasses Google's Gemini 3 Pro in depth of video comprehension, positioning it as the most advanced multimodal video model currently available. Designed for long-duration videos, the model can process hours of raw footage and automatically generate scripts, TikTok short videos, or feature-length films from simple prompts, greatly lowering the barrier to video creation.
The release of Vidi2 marks a key step for ByteDance in multimodal AI. As the parent company of TikTok and Douyin, ByteDance has long focused on the short-video ecosystem. The tool is intended to further empower content creators worldwide and to help shift AI from an auxiliary tool to a core means of production.

Core Technological Breakthroughs: Spatiotemporal Localization and End-to-End Editing
The core of Vidi2 is its fine-grained spatiotemporal localization: it can pinpoint both where and when specific actions and objects appear in a video. Traditional video AI often struggles to "read" a long video in full, so it misidentifies when events start and end and who takes part in them. Vidi2 instead combines a multimodal encoder with a language model backbone, integrating text, visual frames, and audio to analyze video down to the pixel level.
Specifically, users need only enter a text query, and the model returns precise start and end timestamps along with the trajectory of the target object's bounding box. For example, querying "cat jumping scene" against a documentary several hours long will not only locate the clip to the second but also automatically extract it and edit it into a standalone short video. The model uses adaptive visual token compression to keep memory usage efficient, preserving key context whether the clip is extremely short or extremely long.
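To make the shape of such a result concrete, here is a minimal Python sketch of what a query response might look like and how the located clip could be cut out. Vidi2's actual output format is not described here, so the `BoxAtTime` and `GroundingResult` classes, their field names, and the `ffmpeg_cut_command` helper are all hypothetical; only the ffmpeg options used (`-ss`, `-i`, `-t`, `-c copy`) are standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoxAtTime:
    """Bounding box of the target object at one timestamp (hypothetical schema)."""
    t: float   # seconds from the start of the video
    x1: float  # left edge, normalized to [0, 1]
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge


@dataclass
class GroundingResult:
    """Assumed shape of a spatiotemporal localization answer for one query."""
    query: str
    start: float            # clip start, in seconds
    end: float              # clip end, in seconds
    track: List[BoxAtTime]  # bounding-box trajectory of the target object

    def duration(self) -> float:
        return self.end - self.start

    def boxes_in_clip(self) -> List[BoxAtTime]:
        # Keep only the boxes whose timestamps fall inside the located clip.
        return [b for b in self.track if self.start <= b.t <= self.end]


def ffmpeg_cut_command(result: GroundingResult, src: str, dst: str) -> List[str]:
    """Build (but do not run) an ffmpeg command that extracts the located clip."""
    return [
        "ffmpeg",
        "-ss", f"{result.start:.3f}",      # seek to the clip start
        "-i", src,
        "-t", f"{result.duration():.3f}",  # keep only the clip's duration
        "-c", "copy",                      # stream copy, no re-encoding
        dst,
    ]


result = GroundingResult(
    query="cat jumping scene",
    start=3725.0,
    end=3729.5,
    track=[
        BoxAtTime(3724.0, 0.10, 0.40, 0.30, 0.70),
        BoxAtTime(3726.0, 0.35, 0.20, 0.55, 0.50),
        BoxAtTime(3729.0, 0.60, 0.40, 0.80, 0.70),
    ],
)
command = ffmpeg_cut_command(result, "documentary.mp4", "cat_jump.mp4")
```

Note that stream copy cuts on keyframes, so a real pipeline that needs second-accurate edges would likely re-encode the clip instead.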
In benchmark tests, Vidi2 significantly outperformed commercial models such as Gemini 3 Pro and GPT-5 on VUE-STG (a spatiotemporal grounding benchmark) and VUE-TR-V2 (a temporal retrieval benchmark). Its accuracy on video question-answering tasks also improved notably. The model is particularly strong at narrative-aware automatic editing, supporting one-click trimming, subtitling, and restructuring of storylines. The ByteDance team stated that the model's training data combines synthetic editing data with large volumes of real video, so that generated content is high-fidelity and natural.
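Temporal retrieval results of this kind are commonly scored by how well a predicted time span overlaps the annotated one. The sketch below shows temporal intersection-over-union (tIoU), a widely used overlap measure; whether VUE-TR-V2 or VUE-STG use exactly this metric is an assumption, and the function name is illustrative.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, truth: Span) -> float:
    """Overlap of two time spans divided by the length of their union."""
    # Length of the overlapping part; zero when the spans do not touch.
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


# A prediction of 10-20 s against an annotation of 15-25 s overlaps for 5 s
# out of a combined 15 s.
score = temporal_iou((10.0, 20.0), (15.0, 25.0))
```

A score of 1.0 means the predicted span matches the annotation exactly, and 0.0 means the two spans do not overlap at all.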

Application Scenario Innovations: From Material Search to Intelligent Script Generation
Vidi2 is not just an editor; it is also an intelligent creative assistant. Its end-to-end workflow moves video production away from manually dragging clips on a timeline toward a process driven by natural language: the user enters a theme prompt, and the model automatically generates a title, a hook, and a shot-by-shot script, then produces the final video. Creators can easily turn long footage into vertical TikTok shorts or expand it into film-length narratives, with applications in news, advertising, and entertainment.
Imagine a documentary director who uploads several hours of raw footage with the simple prompt "focus on an inspiring story about environmental protection." Vidi2 would then generate a complete script and editing outline. This not only speeds up iteration but also makes creative expression more accessible: even non-professional users can achieve professional-level output through a simple conversation.

Industry Impact and Future Outlook
The release of Vidi2 has intensified competition in the global AI video market and consolidated ByteDance's position at the forefront of multimodal AI. The company's earlier MagicVideo series had shown potential in text-to-video generation, whereas Vidi2 focuses on closing the loop between understanding and editing. Experts predict that the tool will reshape the content industry, reduce production costs, and push short-form video toward longer formats.
