Ant Financial Technology recently launched a "Multilingual Multimodal Large Model Training Framework" at the Hong Kong FinTech Festival. The framework is intended to address the bottlenecks that large models face in multilingual settings. As artificial intelligence advances rapidly, large models are becoming an important tool for improving efficiency across industries. However, conventional large models trained primarily on English often perform poorly on minority languages: they frequently mix languages in their output ("language confusion") and produce disordered reasoning, which severely limits their global application.

To tackle this challenge, Ant Financial Technology's research team developed the new framework, which ranked first on the Multicultural Multilingual Visual Question Answering benchmark (CVQA). It performed particularly well on resource-scarce minority languages such as Egyptian Arabic, Javanese, Bahasa Indonesia, and Sundanese, demonstrating strong multilingual recognition capabilities.
The core of this breakthrough is a language-aware optimization framework. It uses a mechanism in which the model "thinks in the target language," combined with fine-grained, multi-dimensional reward strategies and automated data solutions, enabling deeper understanding and processing of minority languages. According to the test results, the framework improved accuracy by approximately 9.5% over open-source models of similar scale on mainstream multilingual visual question answering (VQA) benchmarks, and on some tasks it outperformed leading international closed-source models such as GPT-4o and Gemini-2.5-flash, achieving the top overall score.
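
The framework's actual reward design has not been published in this announcement, so the following is only a minimal sketch of what a multi-dimensional reward that encourages "thinking in the target language" could look like. Every name here (`script_share`, `RewardWeights`, `composite_reward`) and the weights are hypothetical. The language check is a crude Unicode-script proxy: it can tell Arabic script from Latin script, but it cannot separate Latin-script languages such as Javanese or Sundanese from English, where a real system would need a language-identification model.

```python
import unicodedata
from dataclasses import dataclass


def script_share(text: str, script: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with `script`.

    Crude stand-in for language identification (an assumption of this sketch).
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(script))
    return hits / len(letters)


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; not taken from the announcement.
    accuracy: float = 0.6
    language: float = 0.3
    format: float = 0.1


def composite_reward(
    reasoning: str,
    answer: str,
    gold: str,
    script: str,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of three reward dimensions, each scored in [0, 1]."""
    # Dimension 1: is the final answer correct (exact match)?
    accuracy = 1.0 if answer.strip() == gold.strip() else 0.0
    # Dimension 2: how much of the reasoning is written in the target script?
    language = script_share(reasoning, script)
    # Dimension 3: did the model produce both a reasoning trace and an answer?
    fmt = 1.0 if reasoning.strip() and answer.strip() else 0.0
    return (
        weights.accuracy * accuracy
        + weights.language * language
        + weights.format * fmt
    )


# Same correct answer, but one trace reasons in Arabic and the other in English.
reward_arabic = composite_reward("هذا قط", "قط", "قط", script="ARABIC")
reward_english = composite_reward("This is a cat", "قط", "قط", script="ARABIC")
```

With these assumed weights, the trace that reasons in Arabic script receives a higher reward than the one that reasons in English, even though both reach the same correct answer. That gap is the kind of signal a language-consistency reward is meant to provide during training.
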
On the security side, Ant Financial Technology also introduced an image security framework that combines visual analysis with common-sense reasoning to detect forgeries. It can efficiently identify visual inconsistencies and logical contradictions in images. The technology not only locates tampered regions but also provides an explainable analysis, significantly strengthening risk control for digital content.
Both capabilities are core technologies for Ant Financial Technology's global business and have been deployed at scale in the ZOLOZ document authentication product (RealDoc), which supports 119 languages and efficiently handles multilingual business documents, contracts, and other documents in scenarios such as insurance claims, credit reviews, and cross-border trade.
