Weekly AI Highlights | Wenxin Yiyan Launches 5 Major Plugins, OpenAI Announces First Public Acquisition, GPT-4 Adds Review Feature


A new study subjected 12 mainstream large models to high-pressure tests and found that their performance declined sharply when deadlines were shortened and penalties increased. For example, the failure rate of Gemini 2.5 Pro rose from 18.6% to 79%, and GPT-4o's performance likewise fell by nearly half. In critical tasks such as biosecurity, the models even skipped key steps and made serious mistakes.
The ICLR 2026 review system suffered large-scale AI infiltration: detection found that among 76,000 reviews, 21% were fully generated by large models, 35% were polished by AI, and only 43% were written by humans. Machine-written reviews were longer and gave higher scores, but often contained errors such as hallucinated citations, triggering protests from authors. The organizing committee issued an emergency ban and plans to block AI-generated content at the submission stage to rebuild trust.
The 2025 Hong Kong Fintech Week focuses on the integration of fintech and AI, bringing together guests including Carrie Lam and Geoffrey Hinton. Zhu Guang, CEO of Du Xiaoman, emphasized the innovative applications of large models in financial services, moving customer feedback from monthly surveys to real-time responses and achieving a customer-centric transformation of service.
"Baidu E-commerce Selection" brand uses large model technology to optimize risk control review, achieving full machine review, instant feedback, and high interpretability, solving the problems of low efficiency and slow response in traditional manual review, and enhancing e-commerce security and user experience.
A study by the University of Chicago found significant differences in the performance of AI text detectors: some tools achieved high accuracy, while others frequently misclassified texts, especially short ones. The Pangram detector performed best in both accuracy and cost-effectiveness. The study, based on 1,992 human-written texts and outputs from four mainstream large models, covered six text genres and revealed shortcomings in the reliability and robustness of current detectors.