DeepSeek-V4 was released just 10 hours ago, and the DCAI team at Peking University has already produced a full-scale automated evaluation report. This turnaround speed has attracted widespread attention in the AI engineering community, and the core driver behind it is Peking University's newly open-sourced evaluation framework, One-Eval.
For a long time, large model evaluation has been considered a "nightmare" for engineers. In traditional workflows, most of the effort goes into setting up the testing pipeline rather than into model scoring itself: selecting benchmark sets, writing glue scripts, adapting data fields, and parsing runtime logs. The emergence of One-Eval marks a paradigm shift in industry efficiency.
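To make that pain concrete, here is a minimal sketch of what such a hand-rolled pipeline typically looks like. Every name in it (the JSONL schema, adapt_fields, parse_score, model_call) is a hypothetical illustration of the steps the article lists, not code from any real framework:

```python
import json
import re

# Illustrative hand-rolled evaluation pipeline; all names are hypothetical.
# Each function below is glue code the engineer must write and maintain.

def adapt_fields(raw_item: dict) -> dict:
    """Field adaptation: map one benchmark's schema onto the harness schema."""
    return {"prompt": raw_item["question"], "gold": raw_item["answer"]}

def parse_score(log_line: str):
    """Log parsing: scrape a metric back out of free-form runtime output."""
    match = re.search(r"score=([\d.]+)", log_line)
    return float(match.group(1)) if match else None

def run_eval(benchmark_path: str, model_call) -> float:
    """Load a JSONL benchmark, query the model per item, exact-match score."""
    with open(benchmark_path, encoding="utf-8") as f:
        items = [adapt_fields(json.loads(line)) for line in f]
    correct = sum(model_call(it["prompt"]).strip() == it["gold"] for it in items)
    return correct / len(items)
```

Multiply this by every benchmark's different schema and log format, and the maintenance burden the article describes becomes clear.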
The Dilemma of Traditional Evaluation: Black Box and Pollution
Current large model evaluation faces serious challenges. As models grow in size and complexity, the drawbacks of static evaluation pipelines have become increasingly apparent. First, the operational threshold is high: parameter configuration is complicated and the tooling tolerates almost no mistakes. Second, transparency is lacking: the final score often resembles a "black box," making it difficult to trace what specifically earned a model its score.
The issue that troubles the industry most is "data contamination." Because a model may have encountered the test questions during training, leaderboard credibility has declined, and a high score no longer equates to genuine capability. To address these pain points, the industry urgently needs more flexible and transparent evaluation tools.
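A common way to probe for this kind of contamination is a word-level n-gram overlap check between benchmark items and the training corpus. The sketch below is a generic illustration of that technique under simplified assumptions (whitespace tokenization, in-memory corpus); it is not part of One-Eval:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams in a text, lowercased (naive tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, training_docs, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0
```

A nonzero rate does not prove cheating, but it flags which benchmark questions a model may simply have memorized.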
One-Eval: Interactive Revolution Driven by Intelligent Agents
The Peking University team designed One-Eval to outflank traditional tooling entirely, transforming complex script operations into a natural-language-driven intelligent agent mode.
Users simply describe their test intent in a dialogue; the system automatically identifies the requirements, matches them to the appropriate benchmark suites (covering domains such as finance, law, and healthcare), and silently completes the backend configuration. One-Eval also introduces a "global state" bus architecture that keeps the entire evaluation process traceable. To preserve rigor, it retains a "human-in-the-loop" mechanism that waits for human confirmation at key decision points, striking a balance between full automation and professional intervention.
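The One-Eval API is not shown in the article, so the following is a minimal hypothetical sketch of the flow as described: intent matching against a benchmark registry, a shared global state that records every step, and a confirmation checkpoint before execution. All identifiers and benchmark names here are assumptions, not One-Eval's actual interface:

```python
from dataclasses import dataclass, field

# Hypothetical sketch; none of these names come from the One-Eval codebase.

BENCHMARK_REGISTRY = {            # intent keyword -> benchmark suite (invented)
    "finance": "fin-bench-v2",
    "law": "law-bench-v1",
    "healthcare": "med-bench-v3",
}

@dataclass
class GlobalState:
    """The 'global state' bus: every stage logs here, keeping the run traceable."""
    events: list = field(default_factory=list)

    def log(self, event: str) -> None:
        self.events.append(event)

def match_benchmarks(intent: str, state: GlobalState) -> list:
    """Map a natural-language intent onto registered benchmark suites."""
    suites = [s for kw, s in BENCHMARK_REGISTRY.items() if kw in intent.lower()]
    state.log(f"matched benchmarks: {suites}")
    return suites

def confirm(prompt: str) -> bool:
    """Human-in-the-loop checkpoint: pause for confirmation at a key decision."""
    return input(f"{prompt} [y/N] ").strip().lower() == "y"

def run(intent: str) -> None:
    state = GlobalState()
    state.log(f"user intent: {intent}")
    suites = match_benchmarks(intent, state)
    if confirm(f"Run {suites} against the target model?"):
        state.log("run approved; launching evaluation")
        # ... dispatch evaluation jobs here ...
    else:
        state.log("run rejected by human reviewer")
    print("\n".join(state.events))   # full audit trail of the session

run("Evaluate the model's finance and law capabilities")
```

The point of the design, as described, is that the audit trail and the confirmation gate are built into the bus itself rather than bolted onto per-benchmark scripts.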
The Business Fundamentals of the Evaluation Market
Large model evaluation is not only a technical task; it is also a multibillion-dollar business. Taking the industry giant Scale AI as an example, its business model has evolved into three closed loops:
Service Charges: Providing basic subscription services such as compliance audits and permission management for enterprises.
Defining Standards: Introducing mechanisms such as blind testing by human experts to redefine industry credibility, then charging high fees to large model vendors seeking certification.
Data Completion: The deepest layer of the moat. After diagnosing a model's weaknesses, the company sells targeted, high-quality fine-tuning datasets to address them.
