Zhejiang University Alumni Collaborate with Microsoft to Launch Multimodal Model LLaVA, Challenging GPT-4V


Apple has launched the multimodal model Manzano, which uses a dual architecture to address a long-standing difficulty in the AI field: balancing visual understanding and image generation within a single model.
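The announcement does not detail Manzano's dual design, so the sketch below is only a hedged illustration of one common pattern for serving both tasks: a shared vision encoder feeding a continuous branch for understanding and a discrete-token branch for generation. All layer names and sizes are assumptions for illustration, not Apple's published architecture.

```python
# Generic dual-branch sketch: one shared encoder, two output heads.
# This is an illustrative assumption, NOT Apple's Manzano design.
import torch
import torch.nn as nn

class DualBranchVisionModel(nn.Module):
    def __init__(self, d_img=768, d_llm=1024, codebook_size=8192):
        super().__init__()
        self.encoder = nn.Linear(d_img, d_img)                 # stand-in for a shared ViT encoder
        self.understand_head = nn.Linear(d_img, d_llm)         # continuous embeddings fed to an LLM
        self.generate_head = nn.Linear(d_img, codebook_size)   # logits over discrete image tokens

    def forward(self, patches):                                # patches: (n_patches, d_img)
        h = self.encoder(patches)
        return self.understand_head(h), self.generate_head(h).argmax(-1)

patches = torch.randn(16, 768)
embeddings, image_tokens = DualBranchVisionModel()(patches)
print(embeddings.shape, image_tokens.shape)                    # (16, 1024) and (16,)
```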
Moonshot AI plans to launch the multimodal models K2.1/K2.5 in the first quarter of 2026. They are upgraded versions of its trillion-parameter open-source model Kimi K2, aimed at strengthening multimodal processing and agent capabilities. Since its release in July 2025, Kimi K2 has performed strongly in areas such as code generation, thanks in part to its mixture-of-experts architecture.
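As a rough illustration of the mixture-of-experts idea mentioned above, the PyTorch sketch below routes each token to its top-k expert MLPs, so only a fraction of the total parameters is active per token. The layer sizes, expert count, and routing details are illustrative assumptions, not Kimi K2's actual configuration.

```python
# Minimal top-k mixture-of-experts layer: a router scores experts per token,
# and only the selected experts run. Sizes are illustrative, not Kimi K2's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)            # one score per expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = self.router(x)                                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(MoELayer()(tokens).shape)                                # torch.Size([4, 512])
```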
The Zhipu team has open-sourced four core models: GLM-4.6V for visual understanding, AutoGLM for device control, GLM-ASR for speech recognition, and GLM-TTS for speech synthesis, showcasing its latest progress in the multimodal field and laying groundwork for video generation technology.
Google has introduced image recognition features for NotebookLM, letting users upload photos of whiteboard notes, textbook pages, or tables; the system automatically recognizes the text and performs semantic analysis, so users can search image content directly in natural language. The feature is free on all platforms, and a local processing option for privacy is planned. Under the hood, multimodal models distinguish handwritten from printed text, parse table structures, and link the recognized content with existing notes.
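NotebookLM's implementation is proprietary and not described in the announcement; the sketch below only illustrates the general idea of searching image content with a natural-language query, using the open CLIP model exposed through sentence-transformers. The file names and query text are hypothetical.

```python
# Generic semantic image search: embed images and a text query into the same
# space, then rank by cosine similarity. Not Google's NotebookLM code.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")           # joint text/image embedding model

image_paths = ["whiteboard.jpg", "textbook_page.png"]  # hypothetical uploaded images
image_embs = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

query = "the table comparing training costs"           # natural-language query
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, image_embs)[0]        # similarity of the query to each image
best = scores.argmax().item()
print(f"Best match: {image_paths[best]} (score {scores[best].item():.3f})")
```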
ByteDance and university collaborators have launched Sa2VA, which pairs LLaVA's video understanding with SAM-2's precise object segmentation, using the two models' complementary strengths to improve video analysis.
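Sa2VA's actual interface is not reproduced here; the sketch below only illustrates the division of labor described above, with placeholder stubs standing in for the LLaVA-style describer and the SAM-2-style video segmenter. Both function names and their return values are hypothetical.

```python
# Hypothetical data flow: a vision-language model names the object of interest,
# and a promptable video segmenter turns that reference into per-frame masks.
# The two functions are placeholder stubs, not Sa2VA's real API.
from typing import List

def llava_describe(frames: List[str], question: str) -> str:
    """Stub for the LLaVA-style vision-language model (video understanding)."""
    return "the person in the red jacket"              # natural-language reference to segment

def sam2_track(frames: List[str], reference: str) -> List[str]:
    """Stub for the SAM-2-style promptable segmenter (per-frame masks)."""
    return [f"mask({reference}) @ frame {i}" for i, _ in enumerate(frames)]

frames = ["frame0", "frame1", "frame2"]                # stand-ins for decoded video frames
reference = llava_describe(frames, "Who is speaking in this clip?")
masks = sam2_track(frames, reference)                  # one segmentation mask per frame
print(reference)
print(masks)
```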