Xi Xiaoyao Technology Talk | Stop Saying GPT-4V is Amazing! It Can't Even Recognize Peking Duck, Can You Believe It??


AI startup Moondream has announced the completion of a $4.5 million seed round and is making a contrarian argument: in the world of AI models, smaller may be better. Backed by Felicis Ventures, Microsoft's M12 GitHub Fund, and Ascend, the company has launched a visual language model with only 1.6 billion parameters that it says can match the performance of models four times its size.
H2O.ai recently announced two new visual language models aimed at making document analysis and optical character recognition (OCR) tasks more efficient. The two models, H2OVL-Mississippi-2B and H2OVL-Mississippi-0.8B, are reported to be highly competitive with models from large tech companies, potentially offering businesses with document-heavy workflows a more efficient solution.
The GitHub page of Alibaba Group's Tongyi Qwen project (QwenLM) has been unexpectedly taken down, and users who try to access it see a 404 error. Project leader Lin Junyang responded in a social media post, stating that the team has not disappeared and is in communication with officials to resolve the takedown. Despite the team's efforts, the page remains inaccessible. The team recently released the Qwen2-VL model, which can process video content up to 20 minutes long and scores strongly on multiple authoritative multimodal benchmarks, with some results even exceeding those of GPT.
On September 2nd, Tongyi Qwen announced the open-sourcing of its second-generation visual language model, Qwen2-VL, and launched APIs for the 2B and 7B sizes, as well as their quantized versions, on the Aliyun Bailian platform for direct user access. Qwen2-VL delivers performance improvements in several areas: it can understand images of different resolutions and aspect ratios, and it achieves world-leading results on benchmarks such as DocVQA, RealWorldQA, and MTVQA.
NVIDIA has collaborated with several universities to introduce NVEagle, a large visual language model that can hold a conversation about images. NVEagle can analyze image content and answer questions about it, for example by identifying a person in an image, such as Jensen Huang. The model improves its understanding of visual information by converting images into visual tokens and combining them with text embeddings. To address the challenges of high-resolution image processing, the research team explored various visual encoders and fusion strategies and built models such as Eagle-X5-7B and Eagle-X.