The 10th Advanced Forum on Language Services was held at Guangzhou University on December 6 and 7. At the event, the Guangzhou University team officially launched the AI-DimSum Multimodal Cantonese Corpus Platform, which it developed in-house. The launch marks a new stage in the digital development of Cantonese, a language with tens of millions of speakers worldwide.
Breaking Through the Low-Resource Challenge
Professor Qi Jiayin of Guangzhou University explained that Cantonese is considered a "low-resource language" on the internet. The platform was built around the needs of "Digital China construction" and "Cultural Digitalization of the Greater Bay Area." It creates a multimodal corpus data ecosystem that is rooted in Lingnan culture and aimed at AI applications, following the principles of "standards first, traceable data, and usable services."

Integrated and Modular Infrastructure
The AI-DimSum platform consists of seven subsystems: corpus collection, annotation, large model integration, rights confirmation retrieval, quality assessment, management, and an application store. Together they form an integrated, modular process that runs from data collection through model access to application release.
Massive Corpus Support
The corpus has gathered rich multimodal resources, providing a solid foundation for AI training:
Text: More than 1 million words, including news and literature.
Audio and Video: 3,000 hours of high-fidelity voice annotation completed, with more than 1 TB of audio and video materials.
Video and Film: Works such as "Kung Fu Panda," "Monkey King: Hero Is Back," and "The New Bride from Outside the City," with Cantonese subtitles and annotations.
Evaluation: More than 200,000 multimodal evaluation questions built to assess the content safety of Cantonese large models.
The launch of this platform is expected to significantly strengthen the use of Cantonese in applications, and its value for cultural heritage, in the era of large models.
