Zhejiang University Alumni Collaborate with Microsoft to Launch Multimodal Model LLaVA, Challenging GPT-4V


Google has introduced image recognition features for NotebookLM, allowing users to upload photos of whiteboard notes, textbook pages, or tables; the system automatically recognizes the text and performs semantic analysis, so users can search image content directly in natural language. The feature is free on all platforms, and a local-processing option will soon be added to protect privacy. Under the hood, the multimodal system distinguishes handwritten from printed text, analyzes table structures, and intelligently links the results with existing notes.
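Google has not published how this feature is implemented; as a rough illustration of the general technique (OCR plus semantic embedding search, not NotebookLM's actual pipeline), the hedged sketch below indexes image text and answers natural-language queries. The libraries, model name, and file paths are assumptions chosen for illustration.

```python
# Hypothetical sketch: OCR + embedding search over image text.
# Not Google's implementation; pytesseract and sentence-transformers
# are stand-ins used only to illustrate the idea.
from PIL import Image
import pytesseract
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

def index_images(paths):
    """Extract text from each image and embed it for semantic search."""
    docs = []
    for path in paths:
        text = pytesseract.image_to_string(Image.open(path))
        docs.append({"path": path, "text": text, "vec": encoder.encode(text)})
    return docs

def search(query, docs, top_k=3):
    """Return the images whose extracted text best matches a natural-language query."""
    q = encoder.encode(query)
    ranked = sorted(docs, key=lambda d: -util.cos_sim(q, d["vec"]).item())
    return ranked[:top_k]

# Example: find the whiteboard photo that discussed the project timeline.
# hits = search("project timeline milestones", index_images(["whiteboard1.png", "notes2.jpg"]))
```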
ByteDance and several universities have launched Sa2VA, which integrates LLaVA for video understanding with SAM-2 for precise object segmentation, combining the two models' complementary capabilities to enhance video analysis.
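Sa2VA's own interface is not shown here; as a hedged sketch of the division of labor described above (a vision-language model decides *what* to track, SAM-2 segments *where* it is), the snippet below uses SAM-2's public video predictor. The checkpoint path, config name, frame directory, and the hard-coded click point standing in for the VLM's answer are all assumptions.

```python
# Hedged sketch of a LLaVA + SAM-2 style pipeline, not Sa2VA's actual code.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "checkpoints/sam2_hiera_large.pt"   # assumed local SAM-2 weights
model_cfg = "sam2_hiera_l.yaml"                  # assumed config name

predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    # 1) A VLM such as LLaVA would answer "where is the red car in frame 0?".
    #    Here a hard-coded click point stands in for that answer.
    points = np.array([[210, 350]], dtype=np.float32)  # (x, y) in frame 0
    labels = np.array([1], dtype=np.int32)             # 1 = positive click

    # 2) SAM-2 turns the prompt into a mask and propagates it through the video.
    state = predictor.init_state(video_path="frames/")  # directory of JPEG frames
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1, points=points, labels=labels
    )
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu()  # per-frame segmentation masks
```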
"Alibaba has established a robotics and embodied AI team", led by executive Lin Junyang, aimed at developing innovative robotics technology and promoting the advancement of embodied AI. Embodied AI refers to intelligent systems that can interact with the environment through physical bodies, marking the company's further expansion in the field of intelligence.
During a technical livestream at 1 AM today, OpenAI officially launched its latest and most capable multimodal models: o4-mini and the full-power o3. Both models can process text, images, and audio, and can act as agents, automatically invoking tools such as web search, image generation, and code interpretation. They also feature a deep-thinking mode that lets them reason about images within a chain of thought.
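As a minimal sketch of sending an image to one of these reasoning models through the OpenAI Python SDK, the snippet below uses the standard chat completions interface; the model identifier "o3" and the image URL are assumptions, so check the official documentation for the exact model names and tool options.

```python
# Hedged sketch: multimodal request to a reasoning model via the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3",  # assumed identifier for the newly launched model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this whiteboard diagram describe?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```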
MiniCPM-V 2.6, the latest release in ModelBest's (Mianbi Intelligence) MiniCPM-V series, has rapidly climbed into the Top 3 on the GitHub and Hugging Face trending lists and surpassed ten thousand GitHub stars. Since the series' initial release in February, it has accumulated over a million downloads, becoming a benchmark for on-device model capability. With 8 billion parameters, MiniCPM-V 2.6 brings real-time video understanding, multi-image joint understanding, and multi-image in-context learning to on-device multimodal models, while the quantized model needs only 6GB of memory and reaches inference speeds of up to 18 tokens per second.
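For a sense of how such an on-device model is typically driven, the hedged sketch below follows the usage pattern commonly published on MiniCPM-V model cards on Hugging Face; the repository id, the `chat()` signature, and the image path are taken from memory and may differ from the current release, so treat them as assumptions.

```python
# Hedged sketch of single-image inference with MiniCPM-V 2.6 via transformers.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

repo = "openbmb/MiniCPM-V-2_6"  # assumed Hugging Face repository id
model = AutoModel.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")   # placeholder image path
question = "What is in the image?"
msgs = [{"role": "user", "content": [image, question]}]

# The custom chat() helper is provided by the model's remote code;
# its exact signature should be confirmed against the model card.
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```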