Release of Financial Inclusion Speech Dataset for native Ghanaian Languages
Researchers at Ashesi University and Nokwary Technologies are happy to announce the public release of speech datasets in Asante Twi, Akuapem Twi, Fanti and Ga - all native languages spoken in Ghana. This project was funded by the Lacuna Fund. To the best of our knowledge, this is the first ever open (publicly available) speech dataset for any Ghanaian language. You can access the dataset here: Financial Inclusion Speech Dataset. You can also learn more about NLP Research at Ashesi University.
This speech dataset was developed to:
- Help answer some research questions we are exploring at Ashesi University around speech and NLP for African languages.
- Support the development of financial applications in native Ghanaian languages to allow illiterate and semi-literate people to fully benefit from digital financial services.
From a research standpoint, this dataset is interesting for three reasons.
First, it will help in exploring research questions around domain-specific and general-purpose dataset and system development in low-resource settings. Advances in modeling mean that we are now able to see decent performance from systems trained with very little data. Are these advanced models suitable for production-grade applications? If not, how much data and what sort of data (domain-specific or general-purpose) provide the most economical way to fine-tune these models to production-readiness? If domain-specific datasets have a role in attaining production-readiness for specific applications, could these domain-specific datasets be developed in a way that maximizes their general utility? These are just examples of domain-related research questions that a dataset such as this allows us to start to explore.
Second, we took an unusual approach to collecting the dataset. Instead of the typical approaches to speech data collection, we chose to use WhatsApp, a chat application widely used in Ghana and across Africa. We developed a bot on WhatsApp through which people could contribute voice recordings. To the best of our knowledge, we are the first to take such an approach. You can read more here: Using WhatsApp for Speech Dataset Building.
Third, we collected similar data in different dialects of the same language. Asante Twi, Fanti and Akuapem Twi are all dialects of Akan. Not only are there many languages in Ghana and Africa in general, there are also many dialects. How do we develop models that work well with different dialects in low-resource settings? Is it necessary to develop dialect-specific models? This dataset will help in exploring these questions.
We are proud to release this dataset. We wish to acknowledge the contributions of the following language experts in the development of this dataset: Dr. Clement Appah, Diana Eshun Morgan and Dr. Joseph Oduro-Frimpong. We also extend our appreciation to the many Ghanaians who contributed voice recordings to this dataset.