When developing a speech dataset, or any dataset for that matter, representation matters. It is important that the people a model trained on the data is meant to serve are represented in it. For our speech dataset, this means good representation across age, gender, and socio-economic status.

Many tools exist for building speech datasets. In our context, however, we saw WhatsApp as a promising platform because of its wide availability in Ghana and across Africa, its ease of use, and people’s familiarity with it. We found that WhatsApp usage extends even to people with limited literacy.

We therefore developed a WhatsApp bot as the main tool for building our speech dataset. To the best of our knowledge, we are the first to leverage the ubiquity of WhatsApp for dataset development. Our experience confirmed the value of a platform like WhatsApp for data collection, and we recommend it to researchers seeking to build datasets for Ghanaian and other African languages.

Our data collection process on WhatsApp was simple. First, recruited participants recorded their consent to the data collection process and sent it as WhatsApp voice notes in their native languages. The chatbot then sent them a template recording to repeat. We took this approach so that people who can speak the language but cannot read it (which is most people) could also contribute. The system then automatically organized the audio files in an AWS S3 bucket.
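
The flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our production bot: the state names, the Session fields, the key layout, and the .ogg extension are all hypothetical, and the storage step is passed in as a plain callable so the sketch runs without AWS credentials.

```python
from dataclasses import dataclass, field

# Hypothetical conversation states; the names are assumptions for this sketch.
AWAITING_CONSENT = "awaiting_consent"
AWAITING_REPETITION = "awaiting_repetition"


@dataclass
class Session:
    """Per-participant state tracked by the bot (hypothetical structure)."""
    participant_id: str
    language: str
    state: str = AWAITING_CONSENT
    prompt_index: int = 0
    stored_keys: list = field(default_factory=list)


def build_key(session, kind):
    """Build a storage key; the language/participant layout is an assumption."""
    prefix = f"{session.language}/{session.participant_id}"
    if kind == "consent":
        return f"{prefix}/consent.ogg"
    return f"{prefix}/prompt_{session.prompt_index:04d}.ogg"


def handle_voice_note(session, audio_bytes, store):
    """Process one incoming voice note and advance the session.

    `store(key, audio_bytes)` is any callable that persists the audio.
    Returns the index of the template recording to send next.
    """
    if session.state == AWAITING_CONSENT:
        # The first voice note is the spoken consent.
        key = build_key(session, "consent")
        store(key, audio_bytes)
        session.state = AWAITING_REPETITION
    else:
        # Later voice notes are repetitions of the current template recording.
        key = build_key(session, "repetition")
        store(key, audio_bytes)
        session.prompt_index += 1
    session.stored_keys.append(key)
    return session.prompt_index
```

In a deployment, `store` could wrap a call such as boto3's S3 client `put_object(Bucket=..., Key=key, Body=audio_bytes)`; an in-memory dictionary is enough to try the sketch locally.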