Xi Xiaoyao Technology Talk | Stop Saying GPT-4V is Amazing! It Can't Even Recognize Peking Duck, Can You Believe It??

Researchers have constructed a new benchmark, HallusionBench, to evaluate the image reasoning capabilities of the much-discussed visual language model GPT-4V. The results indicate that models like GPT-4V perform poorly on HallusionBench: they often succumb to language hallucinations driven by their parametric memory, with error rates as high as 90%. GPT-4V also performs unsatisfactorily on visual tasks involving geometry, which highlights the current limits of its visual abilities, and simple image manipulations can easily mislead it. In contrast, LLaVA-1.5, while less knowledgeable than GPT-4V, makes fewer common-sense errors. The study reveals the limitations of current visual language models in image reasoning and provides insights for future improvements.

OpenAI Talent Mobility: Former Researcher Tian Yonglong Joins Tencent, Focused on Visual Language Model Development

Tian Yonglong, a former researcher at OpenAI, has joined Tencent's Large Language Model Department, focusing on the development of visual language models. This move is seen as a key recruitment for Tencent to strengthen its multi-modal large model strategy, highlighting the intense competition for cutting-edge talent.

Moondream Raises $4.5 Million to Launch a 1.6 Billion Parameter Efficient AI Model with 5K GitHub Stars

AI startup Moondream has officially announced the completion of a $4.5 million seed round and put forward a disruptive view: in the world of AI models, smaller models may hold the advantage. The company, backed by Felicis Ventures, Microsoft's M12 GitHub Fund, and Ascend, has launched a visual language model with only 1.6 billion parameters that can compete in performance with models four times its size.

Small but Powerful! H2O.ai Launches New AI Visual Models to Outperform Tech Giants in Document Analysis

H2O.ai recently announced two new visual language models aimed at making document analysis and optical character recognition (OCR) tasks more efficient. The two models, H2OVL-Mississippi-2B and H2OVL-Mississippi-0.8B, are remarkably competitive with models from large tech companies and may offer businesses with document-heavy workflows a more efficient solution.

Alibaba Cloud's Tongyi Qwen Responds to GitHub Page 404: In Contact with Officials

The GitHub page of QwenLM, the Tongyi Qwen project of Alibaba Group, was unexpectedly taken down, and users who try to access it receive a 404 error. Project leader Lin Junyang responded in a social media post, stating that the team has not disappeared and is in communication with officials to resolve the takedown. Despite the team's efforts, the page remains inaccessible. Notably, the team recently released the Qwen2-VL model, which excels at processing video content up to 20 minutes long and performs strongly on multiple authoritative evaluation benchmarks for multimodal models, with some scores even exceeding those of GPT.

Related Recommendations

Unlocking PB-Level Video Assets! InfiniMind, Founded by a Former Google Employee, Helps Enterprises Mine Video Dark Data