Company announcement · 2019-03-01

Mozilla releases the largest public domain transcribed voice dataset to date

Open source organization Mozilla has crowdsourced the largest dataset of human voices available for use. It covers 18 languages and adds up to almost 1,400 hours of recorded voice data from more than 42,000 contributors.

Here’s the official announcement:

Today, we’re excited to share our first multi-language dataset with 18 languages represented, including English, French, German and Mandarin Chinese (Traditional), as well as, for example, Welsh and Kabyle. Altogether, the new dataset includes approximately 1,400 hours of voice clips from more than 42,000 people.

With this release, the continuously growing Common Voice dataset is now the largest ever of its kind, with tens of thousands of people contributing their voices and original written sentences to the public domain (CC0). Moving forward, the full dataset will be available for download on the Common Voice site.

Data Qualities

The Common Voice dataset is unique not only in its size and licence model but also in its diversity, representing a global community of voice contributors. Contributors can opt in to provide metadata like their age, sex, and accent so that their voice clips are tagged with information useful in training speech engines.

This approach differs from that of other publicly available datasets, which are either hand-crafted to be diverse (i.e. an equal number of men and women) or only as diverse as the “found” data they are built from (e.g. the TEDLIUM corpus from TED talks has roughly three times as many men as women).

More Common Voices: from 3 to 22 languages in 8 months

Since we enabled multi-language support in June 2018, Common Voice has grown to be more global and more inclusive. This has surpassed our expectations: Over the last eight months, communities have enthusiastically rallied around the project, launching data collection efforts in 22 languages with an incredible 70 more in progress on the Common Voice site.
