Yi-VL Multimodal Language Model Released with Two Versions

Zero One Everything (01.AI) has released Yi-VL, a multimodal language model and the latest member of its Yi model family, built for both image understanding and conversational generation. It comes in two versions, Yi-VL-34B and Yi-VL-6B. Yi-VL achieved top results on both the English benchmark MMMU and the Chinese benchmark CMMMU, demonstrating its strength on complex interdisciplinary tasks. On MMMU, a new multimodal benchmark, Yi-VL-34B reached 41.6% accuracy and outperformed other large multimodal models, showing strong interdisciplinary knowledge comprehension and application. Yi-VL is built on the open-source LLaVA architecture and has three components: a Vision Transformer (ViT) that encodes images, a projection module that aligns image features with the text feature space, and a large language model (Yi-34B-Chat or Yi-6B-Chat) that provides language comprehension and generation.
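For readers curious how the projection step fits between the image encoder and the language model, here is a minimal sketch in plain Python. It is not Yi-VL's actual code. The two-layer MLP with a GELU activation, the toy dimensions, the random weights, and the helper names (`Projector`, `build_multimodal_sequence`) are all illustrative assumptions. The sketch only shows the general LLaVA-style idea: patch features from a vision encoder are mapped to the language model's embedding width so they can share one input sequence with text token embeddings.

```python
import math
import random

def gelu(x: float) -> float:
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(vec, weights, bias):
    # weights has one row per output unit; each row has len(vec) entries
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def random_layer(in_dim, out_dim, rng):
    # Hypothetical initialization: small Gaussian weights, zero bias
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

class Projector:
    """Hypothetical two-layer MLP mapping vision features to the text embedding width."""

    def __init__(self, vision_dim: int, text_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1, self.b1 = random_layer(vision_dim, text_dim, rng)
        self.w2, self.b2 = random_layer(text_dim, text_dim, rng)

    def __call__(self, patch_features):
        tokens = []
        for feat in patch_features:
            hidden = [gelu(h) for h in linear(feat, self.w1, self.b1)]
            tokens.append(linear(hidden, self.w2, self.b2))
        return tokens

def build_multimodal_sequence(image_tokens, text_embeddings):
    # Simplification: image tokens are simply prepended to the text embeddings
    return image_tokens + text_embeddings

# Toy sizes (assumed, far smaller than any real ViT or Yi model)
VISION_DIM, TEXT_DIM, NUM_PATCHES = 8, 4, 3
rng = random.Random(1)
patches = [[rng.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[0.0] * TEXT_DIM for _ in range(2)]

projector = Projector(VISION_DIM, TEXT_DIM)
image_tokens = projector(patches)
sequence = build_multimodal_sequence(image_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))  # prints "5 4"
```

The key design point is that only the projection changes the width of the image features; after it, image and text positions have the same dimensionality, so the language model can attend over both as one sequence.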

Meta's Latest Audio Model SPIRIT LM: Making AI Not Just Talk, But Also Express Emotion!

Meta AI recently open-sourced SPIRIT LM, a foundational multimodal language model that can freely mix text and speech, opening new possibilities for multimodal tasks involving both audio and text. SPIRIT LM starts from a pre-trained 7-billion-parameter text language model and extends it to the speech modality through continued training on text and speech units. As a result, it can understand and generate text like a large text model, understand and generate speech, and even interleave text and speech within a single output.

AI Company Zero One Everything Releases Open Source Yi-9B Model, the Strongest in Its Series

Zero One Everything has open-sourced the Yi-9B model, the strongest model in the Yi series for coding and mathematics. It also performs well overall and can be easily deployed on consumer-grade graphics cards. The company was founded by Kai-Fu Lee, Chairman and CEO of Innovation Works.

National University of Singapore Releases Open Source Multimodal Language Model NExT-GPT to Advance Multimedia AI Applications

NExT-GPT is an open-source multimodal language model developed by the National University of Singapore. It can process text, images, video, and audio, providing robust support for multimedia AI applications. It has a three-tier architecture: linear projection layers, a Vicuna LLM core, and modality-specific transformation layers, with the intermediate layers trained using MosIT (modality-switching instruction tuning). What sets NExT-GPT apart is its ability to generate output in whichever modality the user requests. Its open-source release lets researchers and developers build applications that combine multimodal inputs, with potential uses across a wide range of fields.

The Mastermind Behind Teaching Robots to Cook Tomato and Eggs: Genesis AI Open-Sources Its Full-Stack Training Platform

Genesis AI has opened up its World 1.0 platform, a high-performance, full-stack simulation infrastructure for robotics and physical AI developers. The platform lets agents train efficiently in a virtual training ground, lowering the barrier to learning skills such as cooking and accelerating the real-world deployment of embodied intelligence.
