Meta recently introduced the WebSSL series of models. These models range in size from 300 million to 7 billion parameters and are trained solely on image data, with the aim of exploring how far language-free visual self-supervised learning (SSL) can go. The research bears on future multimodal tasks and offers a new perspective on how visual representations are learned.
Previously, OpenAI's CLIP model garnered significant attention for its strong performance in multimodal tasks such as visual question answering (VQA) and document understanding. However, language-supervised methods depend on large-scale datasets that are complex and difficult to acquire. To address this, Meta trained on the 2 billion images in its MetaCLIP dataset (MC-2B) and used no language supervision at all. This strategy allowed researchers to evaluate pure visual self-supervised learning without constraints imposed by data or model size.

The WebSSL models employ two primary visual self-supervised learning paradigms: joint embedding learning (DINOv2) and masked modeling (MAE). All models were trained on 224x224 resolution images, and the visual encoder was kept frozen during downstream evaluation to ensure that differences in results stem solely from the pre-training strategy. The model series was trained at five capacity levels (ViT-1B to ViT-7B) and evaluated on the Cambrian-1 benchmark, which comprises 16 VQA tasks covering general visual understanding, knowledge reasoning, OCR (optical character recognition), and chart interpretation.
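To make the masked-modeling idea concrete, here is a minimal sketch of how a 224x224 image could be split into patches and randomly masked before an encoder sees it. The patch size of 16 and the 75% mask ratio are assumptions chosen for illustration, not figures from Meta's announcement, and the helper names `patch_grid` and `random_mask` are hypothetical.

```python
import random


def patch_grid(image_size: int, patch_size: int) -> int:
    """Return the number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def random_mask(num_patches: int, mask_ratio: float, seed: int = 0):
    """Split patch indices into (visible, masked) lists.

    In masked modeling the encoder only sees the visible patches and the
    model is trained to reconstruct the masked ones.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Assumed values for illustration: patch size 16, mask ratio 0.75.
num_patches = patch_grid(224, 16)
visible, masked = random_mask(num_patches, 0.75)
print(num_patches, len(visible), len(masked))
```

Under these assumed settings the image yields 196 patches, of which 49 remain visible and 147 are hidden. Joint embedding methods such as DINOv2 work differently: they compare representations of different views of the same image rather than reconstructing hidden patches.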
Experimental results show that WebSSL's performance on VQA tasks improves significantly as model size increases. Notably, WebSSL even surpasses CLIP on OCR and chart tasks. Furthermore, high-resolution (518px) fine-tuning significantly improved WebSSL's performance on document tasks, narrowing the gap with some high-resolution models.

It's noteworthy that WebSSL, despite lacking language supervision, exhibits good alignment with some pre-trained language models (like LLaMA-3). This suggests that large-scale visual models can implicitly learn features related to text semantics, offering new insights into the relationship between vision and language.
Meta's WebSSL models not only perform well on traditional benchmarks but also pave the way for future research in language-free learning.
