Beijing Academy of Artificial Intelligence Releases Chinese Internet Corpus CCI3.0 Containing 1000GB Dataset

At the 2024 Beijing Cultural Forum, the Beijing Academy of Artificial Intelligence (BAAI) announced the official release of CCI3.0 (Chinese Corpora Internet), the new generation of its Chinese Internet Corpus, as part of its effort to promote the co-construction and sharing of data. CCI3.0 comprises a 1000GB dataset and a 498GB high-quality subset, CCI3.0-HQ. It is the latest major update to the series, following the initial open-source release of CCI1.0 in November 2023 and the release of CCI2.0 in April 2024.

Since its first open-source release, the CCI series has been downloaded more than 40,000 times and has served more than 500 enterprises and institutions in their large-model R&D, supporting the development of China's artificial intelligence industry ecosystem.


Features of CCI3.0 include:

Expanded scale, broad sources: CCI3.0 includes more than 268 million web pages covering news, social media, blogs, and other fields. Compared with CCI2.0, the data scale has nearly doubled and the number of data sources has grown to more than 20, significantly improving coverage and representativeness.
Fine-grained annotation, empowering applications: The raw data in CCI3.0 was classified and labeled along more than 10 dimensions, including grammar, syntax, and educational level, so that high-value data can be selected. CCI3.0-HQ is a high-quality subset built from samples labeled automatically by a 70B model and further refined with a small-scale quality model, so that it better meets the needs of different industries and application scenarios.
Significant results, better understanding of Chinese: In comparative experiments in which a 500M-parameter model was trained from scratch on 100B tokens of data, CCI3.0 outperformed other datasets in both Chinese-only training and mixed Chinese-English training, and CCI3.0-HQ showed even larger gains.
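
The annotation feature above describes a label-then-filter workflow. A minimal sketch of the filtering step, assuming each document can be given a quality score by a small quality model, might look like the following. The names `toy_quality_score` and `filter_high_quality`, the scoring heuristic, and the 0.5 threshold are all hypothetical illustrations and are not taken from BAAI's actual pipeline.

```python
from typing import Callable, Iterable, Iterator


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a small quality model.

    Scores a text by its share of distinct characters, so highly
    repetitive strings score low. A real quality model would be a
    trained classifier, not a heuristic like this one.
    """
    if not text:
        return 0.0
    return len(set(text)) / len(text)


def filter_high_quality(
    texts: Iterable[str],
    scorer: Callable[[str], float] = toy_quality_score,
    threshold: float = 0.5,  # assumed cutoff, for illustration only
) -> Iterator[str]:
    """Yield only the texts whose score reaches the threshold."""
    for text in texts:
        if scorer(text) >= threshold:
            yield text


# One repetitive string and one ordinary sentence.
samples = ["哈哈哈哈哈哈哈哈", "智源发布中文互联网语料库"]
kept = list(filter_high_quality(samples))
print(kept)
```

Passing the scorer as a parameter keeps the filter independent of any particular model, so a trained classifier could replace the toy heuristic without changing the loop.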

BAAI said it will continue to work with the industry ecosystem to promote the co-construction and sharing of the corpus, to build large-scale, high-quality, knowledge-dense Chinese datasets, and to make a greater contribution to the development of China's artificial intelligence industry.

CCI3.0 Download Links

Flopsera: https://open.flopsera.com/flopsera-open/data-details/BAAI-CCI3

Hugging Face: https://huggingface.co/datasets/BAAI/CCI3-Data

Datahub: https://data.baai.ac.cn/details/BAAI-CCI3
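
For readers who want to inspect the Hugging Face copy without downloading the full 1000GB, a streaming read is one option. The sketch below assumes the third-party `datasets` package is installed and that the repository exposes a split named `train`; the split name and the helper names `stream_cci3` and `preview` are assumptions, and the dataset page may require logging in or accepting terms before access is granted.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def stream_cci3(repo_id: str = "BAAI/CCI3-Data", split: str = "train"):
    """Open the dataset lazily instead of downloading all of it.

    Requires the third-party `datasets` package and network access.
    The split name "train" is an assumption; check the dataset page.
    """
    # Imported here so that preview() stays usable without the package.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def preview(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records from any iterable of records."""
    return list(islice(records, n))
```

Calling `preview(stream_cci3())` would fetch only the first few records. Printing the keys of one record is a reasonable way to discover the field names, which are not listed in this article.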

Global AI Computing Power Is on the Rise! Token Usage Has Seen a Major Reversal for the First Time in Two Years

Global token usage by large AI models has fluctuated recently: after ten consecutive weeks of growth, it has now declined for two consecutive weeks. From April 13 to 19, total global usage remained at 20.6 trillion tokens. Weekly usage by Chinese large models, which had previously performed strongly, fell 23.77% from the previous week to 4.44 trillion tokens, while the US market moved in a different direction, significantly shifting the market landscape.

ZhiYuan Research Institute Jointly Builds Chinese Internet Corpus CCI to Provide Resources for Big Data and Artificial Intelligence Industries

ZhiYuan Research Institute (the Beijing Academy of Artificial Intelligence, BAAI), in collaboration with TuoSi and ZhongKe WenGe, has jointly built the 'Chinese Internet Corpus' (CCI). The corpus has undergone strict screening and cleaning, has a data scale of 104GB, and covers the period from 2001 to 2023. The institute will continue to expand its data sources and improve its data processing workflows to provide more high-quality, reliable data resources. It has also released other high-quality Chinese datasets, such as the WUDAO corpus, COIG, and MTP. The initiative aims to support the big data and artificial intelligence industries.

China's AI Models Account for 40% of the Global Market, Investors Warn of Industry Reshuffle

China's AI models account for 40% of the global market, leading the United States. Investors warn that the large model industry will face a reshuffle. Market predictions indicate that only the strongest models will survive. Giants like Alibaba, Tencent, and Baidu are participating in the battle of hundreds of models. The large model industry may face consolidation and price wars.

Performance Boosted by 50 Times! The Super Driver Behind OpenAI's o1 Model is NVIDIA's Blackwell Architecture

NVIDIA CEO Jensen Huang recently revealed in an interview that OpenAI's latest large language model (LLM), o1, will be based on NVIDIA's Blackwell architecture, which has achieved an astounding improvement in inference capability. As the market gradually emerges from the hype around generative AI, developing LLMs capable of generating 'inference-focused' responses has become increasingly important, marking a significant step toward artificial general intelligence.
