Google recently launched WAXAL, a new speech dataset, in Africa. The project covers 21 African languages, including Acoli, Hausa, Luganda, and Yoruba, and aims to address a long-standing problem: AI systems handle African languages with low recognition accuracy and frequent errors.

The project's core breakthroughs are:

  • Data sovereignty returned: Unlike earlier arrangements in which large companies controlled the data, the WAXAL dataset is owned entirely by the local African institutions that took part in its development, not by Google.

  • Scale: The dataset contains over 11,000 hours of speech and nearly 2 million recordings, including about 1,250 hours of transcribed speech as well as high-fidelity audio for text-to-speech (TTS).

  • Empowering local innovation: The project is open-sourced under a permissive license that allows commercial use. Institutions such as the University of Ghana have already begun using the data to advance localized AI application research in areas such as maternal health.

Despite technical challenges such as linguistic complexity and the lack of tone marks, the release of WAXAL marks Africa's shift from a mere data collector to an owner of technological infrastructure. Google plans to expand coverage to 27 languages in the future, further strengthening Africa's voice in AI.