Google recently launched its latest AI model, Gemini 3, claiming leadership on multiple academic benchmarks. Vendor-reported benchmark results have their limitations, however, so Prolific conducted an independent evaluation comparing Gemini 3 with other models in real-world use. The assessment involved 26,000 users and employed a blind-test approach for a rigorous comparison of AI models, focusing on practical metrics such as user trust, adaptability, and communication style.

(Image: Google's Gemini large model)

According to Prolific's HUMAINE benchmark, Gemini 3 Pro's user trust score rose from 16% to 69%, a record high for the leaderboard; its predecessor, Gemini 2.5 Pro, had prevailed on trust, ethics, and safety in only 16% of cases. Gemini 3 also ranked first in three major evaluation categories: performance and reasoning, interaction and adaptability, and trust and safety. It was surpassed only by DeepSeek V3 on communication style.

The test showed that Gemini 3 performed consistently well across 22 different user groups spanning variables such as age, gender, race, and political orientation, and users were five times more likely to choose it in the double-blind comparison. Phelim Bradley, co-founder and CEO of Prolific, attributed Gemini 3's win to its consistency across scenarios and to a personality and style that appeal to a broad range of users.

The HUMAINE methodology exposed shortcomings in how the industry evaluates models. Participants hold multi-turn conversations with two models without knowing which is which, so the test captures how a model's perceived performance varies with its audience. Bradley noted that although AI-based judging is used in some cases, human evaluation remains crucial because human data yields more valuable insights.

Advising businesses on choosing AI models, Bradley emphasized adopting a more rigorous evaluation framework, one that focuses on a model's consistency across usage scenarios and user groups rather than on peak performance in a single task. With such an approach, companies can better select the AI model that fits their specific needs.

Key Points:

🌟 Gemini 3 Pro earned a 69% score in the user trust test, far exceeding its predecessor's 16%.

📊 The model ranked first in performance, interaction, and trust, and was notably consistent across diverse user groups.

🔍 Prolific advocates for companies to adopt a more rigorous evaluation framework to choose the AI model that best suits their needs.