The 10th Advanced Forum on Language Services and the 2025 National Emergency Language Service Group Academic Annual Conference were held at Guangzhou University on December 6 and 7. At the conference, Guangzhou University's Philosophy and Social Sciences Key Laboratory launched the AI-DimSum Cantonese Corpus Platform, marking a new stage in the digital development of Cantonese.

Cantonese, an important variety of Chinese, is spoken by hundreds of millions of people worldwide, yet it has long been treated as a low-resource language on the Internet. According to Professor Qi Jiayin of the School of Cyber Security at Guangzhou University, the AI-DimSum platform serves the "Digital China" initiative and the cultural digitalization needs of the Guangdong-Hong Kong-Macao Greater Bay Area. Its aim is to build a multimodal Cantonese corpus data ecosystem that is rooted in Lingnan culture and oriented towards artificial intelligence applications. The system follows the principles of "standards first, traceable data, and available services," providing a solid foundation for Cantonese research and learning.

The AI-DimSum platform comprises seven subsystems: corpus collection, annotation, model integration, rights confirmation and retrieval, quality assessment, management, and an application store. Together they form a complete data processing chain, so that every step from data collection to final application release can be coordinated on one platform, which supports the construction and management of the Cantonese corpus.

The AI-DimSum Cantonese Corpus currently holds over one million words of text data covering fields such as news, literature, and social media. The platform has also completed annotation of 3,000 hours of high-fidelity audio and gathered more than 1 TB of audio-visual materials, including popular animated and film works with Cantonese subtitles such as "Kung Fu Panda" and "Peppa Pig." In addition, it provides a multi-purpose audio and text corpus of over 10,000 sentences of everyday Cantonese, along with 10,000 images documenting Lingnan culture.