ByteDance's Volcano Engine announced on May 6 that the Doubao large model family has officially introduced its first full-modal understanding model, Doubao-Seed-2.0-lite. A major upgrade to the series, the new model moves beyond the limits of any single modality to achieve native, unified understanding of video, images, audio, and text, marking a key step forward in multimodal interaction.

The model performs strongly on visual and logical reasoning. In complex reasoning tests in advanced disciplines such as physics and medicine, it significantly surpasses the Pro version released in February this year, and in cutting-edge areas such as fine-grained perception and embodied understanding it reaches industry-leading levels. With integrated speech understanding, Doubao-Seed-2.0-lite can perform "audio-visual synchronized" deep joint reasoning: it not only "understands" video content, but also judges the consistency of that content against the accompanying audio, and can even pinpoint specific events in long videos on instruction and reconstruct complex character relationships.

In audio processing, the new model delivers high accuracy in both transcription and perception, supporting transcription in 19 languages including Chinese and English, and translation between 14 languages. Beyond accurate semantic recognition, it can also pick up emotional shifts and ambient sounds in speech, bringing its understanding closer to natural human cognition.

Notably, the Agent and coding capabilities of Doubao-Seed-2.0-lite have been upgraded in parallel. The model follows multi-turn, complex instructions far more reliably, with stronger abilities to decompose tasks and verify its own work. On the development side, its coding capabilities cover front-end pages, 3D scenes, and game development, and it can deliver visually polished, fully engineered results.

Additionally, the model achieves integrated understanding and execution of GUIs (graphical user interfaces) for the first time. It can not only recognize elements such as buttons and menus in web pages or applications, but also perform operations like clicking, dragging, and typing just as a human would, closing the loop from "understanding the interface" to "completing the task end-to-end."

The technology has already been applied in fields such as e-sports review, online education, and cross-border e-commerce. In e-sports scenarios, for example, the AI can act as a coach, continuously analyzing up to 25 hours of match video and voice and automatically generating tactical review diagrams. Alongside it, a lighter version, Doubao-Seed-2.0-mini, has also been launched, giving enterprises a more economical option for deploying full-modal reasoning at scale.
