As AI begins to truly "understand" Chinese, a quiet technological revolution is under way. In the competition among domestic large models, high-quality Chinese data has become a decisive factor. According to industry research, Chinese content generally makes up more than 60% of the training data of mainstream domestic large models, and in some models reaches 80%, significantly reducing reliance on English corpora. This shift not only improves models' accuracy in understanding the needs of Chinese users but also enables AI to interpret culturally specific concepts such as "heatiness," "dampness," and "looking at cars."
From "Translation" to "Understanding Context": The Complexity of Chinese Drives Data Upgrades
The phrase "looking at cars" means "choosing a car" in a 4S store but may mean "watching vehicles" in a parking lot. Expressions that depend this heavily on context cannot be captured accurately by translation-based training alone. Professor Meng Qingguo of Tsinghua University pointed out: "The metaphors, policy terminology, dialect habits, and cultural symbols in Chinese form a unique semantic network. Only by being rooted in sufficiently deep Chinese data can models truly become 'localized.'"
Zhao Yanjun of iFLYTEK explained further: "heatiness" in traditional Chinese medicine does not literally mean burning but refers to a set of internal-heat symptoms, and the classical poetic image "falling flowers and flowing water" can describe spring scenery or symbolize the passing of love. A model that has not learned enough from high-quality Chinese corpora can only break the text down mechanically, without conveying its cultural essence.
A 3500TB High-Quality Dataset Is in Place, with China Mobile Leading Infrastructure Development
To strengthen the foundation of Chinese AI, the industry is accelerating its efforts. China Mobile has built a general-purpose, high-quality Chinese dataset that covers more than 30 industries and exceeds 3500TB in total volume. It spans scenarios such as government affairs, healthcare, finance, and education, and provides structured, noise-free, and compliant training material for large models. Universities, publishers, and cultural institutions are also digitizing and annotating rare resources such as ancient books, local chronicles, and operas.
Data Silos and Lack of Standards Remain Bottlenecks
Despite significant progress, several challenges remain prominent:
- Data Silos: Data from government, enterprises, and academic institutions are fragmented, making it difficult to form a unified effort;
- Inconsistent Annotation Standards: The same term may have different labels in different datasets, affecting model consistency;
- Privacy and Security: High-value Chinese data involves personal information and nationally sensitive information, requiring new privacy-preserving computation technologies for protection.
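The annotation inconsistency described above can be made concrete with a small check. The following is a minimal sketch, not any institution's actual tooling: the dataset names, label names, and the `find_label_conflicts` helper are all hypothetical, and each dataset is assumed to be a simple mapping from term to label.

```python
from collections import defaultdict


def find_label_conflicts(datasets):
    """Return terms that carry different labels in different datasets.

    `datasets` maps a dataset name to a {term: label} dict (assumed shape).
    The result maps each conflicting term to {label: [dataset names]}.
    """
    labels_by_term = defaultdict(lambda: defaultdict(list))
    for name, annotations in datasets.items():
        for term, label in annotations.items():
            labels_by_term[term][label].append(name)
    # Keep only terms that were given more than one distinct label.
    return {
        term: {label: sorted(names) for label, names in by_label.items()}
        for term, by_label in labels_by_term.items()
        if len(by_label) > 1
    }


# Hypothetical toy annotations, for illustration only.
datasets = {
    "medical_corpus": {"heatiness": "SYMPTOM", "dampness": "TCM_CONCEPT"},
    "forum_corpus": {"heatiness": "EMOTION", "dampness": "TCM_CONCEPT"},
}

conflicts = find_label_conflicts(datasets)
for term, by_label in conflicts.items():
    print(term, by_label)
```

A shared annotation standard would aim to make this kind of report come back empty before datasets are merged.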
Experts call for urgently establishing national standards for Chinese data annotation, promoting cross-institutional data collaboration, and encouraging the use of technologies such as federated learning and trusted execution environments (TEE) to achieve "data available but not visible."
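As a rough illustration of how federated learning keeps data "available but not visible," the sketch below uses federated averaging (FedAvg), one common federated learning algorithm. It is a toy under stated assumptions, not a description of any system mentioned here: the two clients, their data, the linear model, and the hyperparameters are all hypothetical. Each client trains on its own private data, and only model weights are sent to the server.

```python
def local_update(weights, data, lr=0.05, epochs=5):
    """Train y ~ w*x + b on one client's private data; return new weights and sample count."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error over this client's data only.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def federated_average(updates):
    """Server step: average client weights, weighted by each client's sample count."""
    total = sum(n for _, n in updates)
    w = sum(weights[0] * n for weights, n in updates) / total
    b = sum(weights[1] * n for weights, n in updates) / total
    return (w, b)


def train(clients, rounds=300):
    weights = (0.0, 0.0)
    for _ in range(rounds):
        # Raw (x, y) pairs never leave local_update; the server sees weights only.
        updates = [local_update(weights, data) for data in clients]
        weights = federated_average(updates)
    return weights


# Hypothetical private datasets held by two institutions, both drawn from y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]

w, b = train(clients)
print(f"w={w:.3f}, b={b:.3f}")
```

Sharing weights instead of records does not by itself guarantee privacy, since model updates can still leak information, which is one reason such schemes are often combined with additional protections such as TEEs.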
AI + Culture: From Tool to Custodian
AIbase believes that the strategic value of Chinese data goes beyond the technical level: it concerns cultural sovereignty and having a voice in digital civilization. When large models can vividly interpret the metaphors of "Dream of the Red Chamber," accurately generate Song dynasty poetry that follows tonal patterns, and explain the philosophy of "harmony without uniformity" to the world, AI will evolve from a tool into a digital custodian of Chinese civilization.
As the two national strategies of "Artificial Intelligence +" and "Cultural Digitization" converge, building high-quality Chinese data is turning from a technical issue into a mission of the era. This data-driven wave of AI localization has only just begun.
