Investigation into the Chaos of Large Model Evaluation: Parameter Scale Does Not Represent Everything

With the surge in popularity of ChatGPT, a range of large model evaluation leaderboards has appeared in China and abroad. However, models with similar parameter counts often rank very differently from one leaderboard to another. Industry and academia attribute this mainly to the use of different evaluation sets, and also to the growing share of subjective questions, which raises doubts about the fairness of the evaluations. As a result, third-party evaluation platforms such as OpenCompass and FlagEval have started to gain attention. Industry observers add that a truly comprehensive and effective evaluation must also cover dimensions such as model robustness and security, and that this work is still being explored.

Trillion-Level Computing Race Intensifies: Kimi K3 to be Unveiled in Third Quarter, Parameter Scale Targeting 2.5 Trillion

Competition among China's large AI models is intensifying. After DeepSeek V4 sparked widespread discussion, Moonshot's Kimi K3 is expected to debut in the third quarter of this year, with a parameter count that could reach 2.5 trillion, far exceeding the 1.6 trillion of DeepSeek V4 Pro and the roughly 1 trillion of Baidu Wenxin 5.0. Parameter scale has become a key indicator for measuring model capabilities.

Behind the Hype of DeepSeek-V4: How Does the Open-Source Framework One-Eval End the AI Evaluation Nightmare?

Ten hours after the release of DeepSeek-V4, the DCAI team at Peking University generated a comprehensive automated evaluation report using One-Eval, a newly released open-source evaluation framework. Traditional large model evaluation is cumbersome and requires significant effort to set up testing pipelines. One-Eval significantly improves efficiency, marking a new stage in the industry.

DeepSeek V4 Lite Evolves Stealthily: A 200 Billion-Parameter Small Model with Impressive Performance, Approaching Top Overseas Models

As a pre-release version of V4, DeepSeek V4 Lite has attracted attention with 200 billion parameters and a context length of up to 1 million tokens. After continuous upgrades, its performance is comparable to top closed-source models, showing outstanding results in various benchmark tests and demonstrating strong competitiveness.

Alibaba Qwen2-72B Tops HELM Ranking: Performance Surpasses Llama3-70B

Stanford University's large model evaluation leaderboard, HELM MMLU, recently released its latest results. Percy Liang, director of the Stanford Center for Research on Foundation Models, stated that Alibaba's Tongyi Qianwen model Qwen2-72B has surpassed Llama3-70B in the ranking, becoming the best-performing open-source large model.

"Baimao Battle" Family's First, When Will Cheating in Large Model 'Scoring' Stop?

📊 Evaluation System of Large Models: The current evaluation system for large models has issues such as open-source datasets that can be manipulated, fairness problems arising from closed evaluation datasets, and evaluation metrics that are not sufficiently scientific and comprehensive.

💡 Trend of Large Model Applications: The article mentions that large models have evolved from model-level development to innovation at the application level.

🔎 Commercialization Issues of Large Models: For large model teams, achieving commercialization is far more important than rankings and parameters.