In the Massively Multilingual Speech (MMS) project, Meta addresses the challenge of collecting speech data for thousands of languages by combining wav2vec 2.0, its pioneering work in self-supervised learning, with a new dataset that provides labeled data for over 1,100 languages and unlabeled data for nearly 4,000 languages. Some of these, such as the Tatuyo language, have only a few hundred speakers, and for most of them no prior speech technology exists. The results show that the Massively Multilingual Speech models outperform existing models while covering 10 times as many languages. This continues Meta's broader focus on multilinguality: for text, the NLLB project scaled machine translation to 200 languages, and the Massively Multilingual Speech project now scales speech technology to many more.
Concretely, Meta trained self-supervised models on about 500,000 hours of speech data in over 1,400 languages, nearly five times more languages than any known prior work. The resulting models were then fine-tuned for specific speech tasks, such as multilingual speech recognition or language identification.
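To make the language-identification task concrete, here is a minimal inference sketch. It assumes the MMS LID checkpoints Meta later published on the Hugging Face Hub (IDs like "facebook/mms-lid-126") and a placeholder audio file path; it is an illustration, not the project's official pipeline.

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# Checkpoint ID assumes the Hugging Face Hub release of the MMS LID models.
model_id = "facebook/mms-lid-126"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

# Load a speech clip ("sample.wav" is a placeholder path) and convert it
# to the 16 kHz mono input the model expects.
waveform, sr = torchaudio.load("sample.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted])  # an ISO 639-3 code such as "eng"
```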
Results
To get a better understanding of how well models trained on the Massively Multilingual Speech data perform, Meta evaluated them on existing benchmark datasets, such as FLEURS.
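FLEURS is publicly available, so such an evaluation can be reproduced. The sketch below assumes the copy hosted on the Hugging Face Hub as "google/fleurs" (dataset ID and column names per that hosted version) and loads one language split.

```python
from datasets import load_dataset

# Each FLEURS configuration is one language, e.g. "fr_fr" for French.
fleurs_fr = load_dataset("google/fleurs", "fr_fr", split="test")

sample = fleurs_fr[0]
print(sample["transcription"])         # reference transcript
print(sample["audio"]["array"].shape)  # raw 16 kHz waveform
```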
Meta trained multilingual speech recognition models on over 1,100 languages using a 1B parameter wav2vec 2.0 model. As the number of languages increases, performance does decrease, but only very slightly: Moving from 61 to 1,107 languages increases the character error rate by only about 0.4 percent but increases the language coverage by over 18 times.
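As an illustration of how such a model is used, here is a minimal inference sketch. It assumes the MMS ASR checkpoint Meta later released on the Hugging Face Hub ("facebook/mms-1b-all"), which loads a small per-language adapter and vocabulary on top of the shared 1B parameter model.

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, Wav2Vec2ForCTC

# Checkpoint ID assumes the Hugging Face Hub release of the MMS ASR model.
model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Swap in the adapter weights and vocabulary for the target language
# (identified by its ISO 639-3 code, here French).
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

# Transcribe one French utterance from FLEURS (audio is 16 kHz).
sample = load_dataset("google/fleurs", "fr_fr", split="test")[0]
inputs = processor(sample["audio"]["array"], sampling_rate=16_000,
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(predicted_ids))
```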
In a like-for-like comparison with OpenAI's Whisper, Meta found that models trained on the Massively Multilingual Speech data achieve half the word error rate while covering 11 times more languages. This demonstrates that the MMS models can compete with the best current speech models.
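Word error rate is the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A quick way to compute it, and the character error rate quoted above, is the jiwer library (a common third-party choice, not something the MMS work prescribes):

```python
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + insertions + deletions) / words in the reference.
# Here: 2 substitutions over 9 reference words, so roughly 0.22.
print(jiwer.wer(reference, hypothesis))

# Character error rate, the metric quoted for the 1,107-language models.
print(jiwer.cer(reference, hypothesis))
```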
Meta also built text-to-speech systems for over 1,100 languages. A limitation of the Massively Multilingual Speech data is that it contains relatively few different speakers for many languages, often only one. However, because current text-to-speech models are typically trained on single-speaker speech corpora anyway, this proved to be an advantage rather than a drawback for this task.
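The released MMS text-to-speech models are VITS-based, and assuming the per-language checkpoints on the Hugging Face Hub (e.g. "facebook/mms-tts-eng" for English), synthesis takes only a few lines:

```python
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

# One checkpoint per language; the ID assumes the Hugging Face Hub release.
model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer("Speech technology for over a thousand languages.",
                   return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # (batch, num_samples) float tensor

# Write the audio out at the model's native sampling rate.
scipy.io.wavfile.write("out.wav", rate=model.config.sampling_rate,
                       data=waveform[0].numpy())
```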