A team of researchers from Microsoft and Zhejiang University has proposed FastSpeech, a “feed-forward network” that they claim makes text-to-speech (TTS) models faster and more robust.
Microsoft announced the work on its official blog, saying that FastSpeech generates “mel-spectrograms” quickly while also offering robustness, controllability, and high quality.
So why are mel-spectrograms so important, and what is their role?
Neural network-based TTS models usually first generate a mel-scale spectrogram (or mel-spectrogram) autoregressively from text input and then synthesize speech from the mel-spectrogram using a vocoder. The mel scale is a perceptual scale of pitch: it maps frequency in hertz onto units called mels, spaced so that equal steps sound roughly equally far apart to a listener.
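To make the mel scale concrete, here is a minimal sketch of one widely used hertz-to-mel conversion formula, m = 2595 · log10(1 + f / 700). The function names are illustrative, and nothing here is taken from FastSpeech itself; audio toolkits may use slightly different variants of the formula.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in hertz to mels (one common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to hertz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # The same 100 Hz step covers fewer mels at high frequencies than at low
    # ones, mirroring how pitch differences are harder to hear up there.
    low_step = hz_to_mel(1100.0) - hz_to_mel(1000.0)
    high_step = hz_to_mel(8100.0) - hz_to_mel(8000.0)
    print(f"100 Hz step near 1 kHz: {low_step:.1f} mels")
    print(f"100 Hz step near 8 kHz: {high_step:.1f} mels")
```

A mel-spectrogram is a spectrogram whose frequency axis has been warped onto this scale, which is why it is a compact intermediate target between the text and the vocoder.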
All of this is explained at length in the research paper “FastSpeech: Fast, Robust and Controllable Text to Speech,” which has been accepted at the thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019).
According to Microsoft, FastSpeech uses a unique architecture that improves on other TTS models in a number of areas.
Experiments on the LJ Speech dataset, as well as on other voices and languages, indicate that FastSpeech has the following advantages. Briefly outlined, it is:
- fast: FastSpeech speeds up mel-spectrogram generation by 270 times and voice generation by 38 times.
- robust: FastSpeech avoids the issues of error propagation and wrong attention alignments, and thus nearly eliminates word skipping and repeating.
- controllable: FastSpeech can adjust the voice speed smoothly and control the breaks between words.
- high quality: FastSpeech achieves voice quality comparable to that of previous autoregressive models (such as Tacotron 2 and Transformer TTS).
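The speed control in the list above can be pictured with a small sketch. It assumes a model that predicts how many mel-spectrogram frames each phoneme should last and then repeats that phoneme's hidden state accordingly; scaling those durations by a factor stretches or compresses the speech. The function name `regulate_length` and the parameter `alpha` are hypothetical, chosen for illustration, and this is not the authors' code.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def regulate_length(states: Sequence[T], durations: Sequence[float],
                    alpha: float = 1.0) -> List[T]:
    """Repeat each phoneme state for its (scaled) number of frames.

    Hypothetical helper: alpha > 1 lengthens every phoneme (slower speech),
    alpha < 1 shortens it (faster speech).
    """
    if len(states) != len(durations):
        raise ValueError("states and durations must have the same length")
    expanded: List[T] = []
    for state, duration in zip(states, durations):
        # Round half up so the frame count is a non-negative integer.
        frames = max(0, math.floor(duration * alpha + 0.5))
        expanded.extend([state] * frames)
    return expanded


if __name__ == "__main__":
    phonemes = ["HH", "AY"]   # stand-ins for per-phoneme hidden states
    frames = [2, 3]           # assumed predicted durations, in frames
    print(regulate_length(phonemes, frames))             # normal speed
    print(regulate_length(phonemes, frames, alpha=2.0))  # slower speech
```

Because every output frame is produced from an explicit duration rather than from a learned attention alignment, a design along these lines would also account for the robustness claim: no phoneme can be skipped or repeated by a misaligned attention step.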