Recently, the LongCat team at Meituan released a new benchmark called UNO-Bench, designed to systematically evaluate how well multimodal large language models understand content across different modalities. The benchmark covers 44 task types and five modality combinations, aiming to give a comprehensive picture of model performance in both single-modal and full-modal (omni-modal) scenarios.

The core of UNO-Bench is its dataset. The team curated 1,250 full-modal samples with a cross-modal solvability of 98%, supplemented by 2,480 enhanced single-modal samples. The samples are grounded in real-world applications, with particular attention to Chinese-language contexts. Notably, an automated compression step makes evaluation 90% faster while preserving 98% consistency with results across 18 public benchmarks.


To better assess complex reasoning, UNO-Bench also introduces a multi-step open-ended question format, paired with a general-purpose scoring model that automatically grades six different question types with 95% accuracy. This evaluation method offers a new approach to judging multimodal models beyond multiple-choice accuracy.
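To illustrate how scoring a multi-step open-ended answer might work in practice, here is a minimal Python sketch of a judge-model grading loop. The function names (`score_multi_step_answer`, `judge_model`), the prompt wording, and the step-by-step rubric are illustrative assumptions, not the actual UNO-Bench scoring model.

```python
from typing import Callable, List


def score_multi_step_answer(
    question: str,
    reference_steps: List[str],
    candidate_answer: str,
    judge_model: Callable[[str], str],
) -> float:
    """Hypothetical sketch: ask a judge model whether each reference reasoning
    step is satisfied by the candidate answer, then return the fraction of
    steps credited. This is an assumed design, not UNO-Bench's implementation."""
    if not reference_steps:
        return 0.0
    credited = 0
    for step in reference_steps:
        prompt = (
            f"Question: {question}\n"
            f"Required reasoning step: {step}\n"
            f"Candidate answer: {candidate_answer}\n"
            "Does the candidate answer satisfy this step? Reply YES or NO."
        )
        verdict = judge_model(prompt).strip().upper()
        credited += int(verdict.startswith("YES"))
    return credited / len(reference_steps)
```

In such a setup, `judge_model` would wrap whatever scoring model is used, and the final benchmark score would aggregate these per-step fractions across samples.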


Currently, UNO-Bench focuses mainly on Chinese-language scenarios. The team says it is actively seeking partners to jointly develop English and multilingual versions. Interested developers can download the UNO-Bench dataset from the Hugging Face platform, and the related code and project documentation are publicly available on GitHub.
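As a quick start, the dataset can typically be loaded with the Hugging Face `datasets` library. The repository id used below ("meituan-longcat/UNO-Bench") is an assumption inferred from the project name; check the official project page for the exact id and split names.

```python
from datasets import load_dataset

# Assumed repository id; verify against the official UNO-Bench page.
ds = load_dataset("meituan-longcat/UNO-Bench")

# Inspect the available splits and a sample record.
print(ds)
print(ds[list(ds.keys())[0]][0])
```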

With the release of UNO-Bench, evaluation standards for multimodal large language models stand to improve further, giving researchers a powerful tool and helping drive progress across the industry.

Project address: https://meituan-longcat.github.io/UNO-Bench/