Zhejiang University Alumni Collaborate with Microsoft to Launch Multimodal Model LLaVA, Challenging GPT-4V


Apple has launched the multimodal model Manzano, which uses a dual architecture to address a long-standing difficulty in the AI field: balancing visual understanding and image generation within a single model.
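The announcement does not detail Manzano's dual design, so the sketch below is only a hedged illustration of one common pattern for serving both tasks: a shared vision encoder feeding a continuous branch for understanding and a discrete-token branch for generation. All layer names and sizes are assumptions for illustration, not Apple's published architecture.

```python
# Generic dual-branch sketch: one shared encoder, two output heads.
# This is an illustrative assumption, NOT Apple's Manzano design.
import torch
import torch.nn as nn

class DualBranchVisionModel(nn.Module):
    def __init__(self, d_img=768, d_llm=1024, codebook_size=8192):
        super().__init__()
        self.encoder = nn.Linear(d_img, d_img)                 # stand-in for a shared ViT encoder
        self.understand_head = nn.Linear(d_img, d_llm)         # continuous embeddings fed to an LLM
        self.generate_head = nn.Linear(d_img, codebook_size)   # logits over discrete image tokens

    def forward(self, patches):                                # patches: (n_patches, d_img)
        h = self.encoder(patches)
        return self.understand_head(h), self.generate_head(h).argmax(-1)

patches = torch.randn(16, 768)
embeddings, image_tokens = DualBranchVisionModel()(patches)
print(embeddings.shape, image_tokens.shape)                    # (16, 1024) and (16,)
```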
Moonshot AI plans to launch the multimodal models K2.1/K2.5 in the first quarter of 2026. They are upgraded versions of its trillion-parameter open-source model Kimi K2, aimed at strengthening multimodal processing and agent capabilities. Since its release in July 2025, Kimi K2 has performed strongly in areas such as code generation, thanks in part to its mixture-of-experts architecture.
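As a rough illustration of the mixture-of-experts idea mentioned above, the PyTorch sketch below routes each token to its top-k expert MLPs, so only a fraction of the total parameters is active per token. The layer sizes, expert count, and routing details are illustrative assumptions, not Kimi K2's actual configuration.

```python
# Minimal top-k mixture-of-experts layer: a router scores experts per token,
# and only the selected experts run. Sizes are illustrative, not Kimi K2's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)            # one score per expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = self.router(x)                                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(MoELayer()(tokens).shape)                                # torch.Size([4, 512])
```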
The Zhipu team has open-sourced four core models: GLM-4.6V for visual understanding, AutoGLM for device control, GLM-ASR for speech recognition, and GLM-TTS for speech synthesis, showcasing its latest progress in the multimodal field and laying groundwork for video generation technology.
Google has introduced image recognition features for NotebookLM, letting users upload photos of whiteboard notes, textbook pages, or tables; the system automatically recognizes the text and performs semantic analysis, so users can search image content directly in natural language. The feature is free on all platforms, and a local processing option for privacy is planned. Under the hood, multimodal models distinguish handwritten from printed text, parse table structures, and link the recognized content with existing notes.
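NotebookLM's implementation is proprietary and not described in the announcement; the sketch below only illustrates the general idea of searching image content with a natural-language query, using the open CLIP model exposed through sentence-transformers. The file names and query text are hypothetical.

```python
# Generic semantic image search: embed images and a text query into the same
# space, then rank by cosine similarity. Not Google's NotebookLM code.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")           # joint text/image embedding model

image_paths = ["whiteboard.jpg", "textbook_page.png"]  # hypothetical uploaded images
image_embs = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

query = "the table comparing training costs"           # natural-language query
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, image_embs)[0]        # similarity of the query to each image
best = scores.argmax().item()
print(f"Best match: {image_paths[best]} (score {scores[best].item():.3f})")
```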
ByteDance and university collaborators have launched Sa2VA, which pairs LLaVA's video understanding with SAM-2's precise object segmentation, using the two models' complementary strengths to improve video analysis.
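Sa2VA's actual interface is not reproduced here; the sketch below only illustrates the division of labor described above, with placeholder stubs standing in for the LLaVA-style describer and the SAM-2-style video segmenter. Both function names and their return values are hypothetical.

```python
# Hypothetical data flow: a vision-language model names the object of interest,
# and a promptable video segmenter turns that reference into per-frame masks.
# The two functions are placeholder stubs, not Sa2VA's real API.
from typing import List

def llava_describe(frames: List[str], question: str) -> str:
    """Stub for the LLaVA-style vision-language model (video understanding)."""
    return "the person in the red jacket"              # natural-language reference to segment

def sam2_track(frames: List[str], reference: str) -> List[str]:
    """Stub for the SAM-2-style promptable segmenter (per-frame masks)."""
    return [f"mask({reference}) @ frame {i}" for i, _ in enumerate(frames)]

frames = ["frame0", "frame1", "frame2"]                # stand-ins for decoded video frames
reference = llava_describe(frames, "Who is speaking in this clip?")
masks = sam2_track(frames, reference)                  # one segmentation mask per frame
print(reference)
print(masks)
```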