New text-to-speech model: Fast & robust



A team of researchers from Microsoft and Zhejiang University has proposed FastSpeech, a “feed-forward network” that they say makes text-to-speech (TTS) models faster & more robust.

Microsoft announced the model on its official blog, saying FastSpeech generates “mel-spectrograms” quickly, robustly and controllably while keeping quality high.

So why are mel-spectrograms so important & what’s their role?

Neural network-based TTS models usually first generate a mel-scale spectrogram (or mel-spectrogram) auto-regressively from text input & then synthesize speech from the mel-spectrogram using a vocoder. The mel scale is a perceptual scale of pitch: it maps frequency in hertz onto units called mels, spaced so that equal steps sound about equally far apart to a listener.
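
To make the mel scale concrete, here is a minimal sketch that converts between hertz and mels using one widely used formula, m = 2595 * log10(1 + f / 700). The choice of this formula and the function names hz_to_mel and mel_to_hz are assumptions for illustration; the article does not say which mel variant FastSpeech uses.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert hertz to mels using the 2595 * log10(1 + f / 700) variant."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Invert hz_to_mel: convert mels back to hertz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Print a few reference frequencies and their mel values.
    for f in (250, 500, 1000, 2000, 4000, 8000):
        print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

With this variant, 1000 Hz lands close to 1000 mel, and higher frequencies are squeezed together, which is roughly how listeners perceive pitch.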

All of this is explained in detail in a research paper titled “FastSpeech: Fast, Robust and Controllable Text to Speech,” which has been accepted at the thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019).

FastSpeech uses a new architecture that improves performance in a number of areas when compared to other TTS models, claimed Microsoft.

According to Microsoft, experiments on the LJ Speech dataset, as well as on other voices and languages, show that FastSpeech is:

  • fast: compared with an autoregressive Transformer TTS model, FastSpeech speeds up mel-spectrogram generation by 270 times & end-to-end voice generation by 38 times.
  • robust: FastSpeech avoids the issues of error propagation & wrong attention alignments, & thus nearly eliminates word skipping and repeating.
  • controllable: FastSpeech can adjust the voice speed smoothly & control the breaks between words.
  • high quality: FastSpeech achieves voice quality comparable to previous autoregressive models (such as Tacotron 2 & Transformer TTS).
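
The speed control in the list above can be pictured with a small sketch. In the FastSpeech paper, a component called the length regulator repeats each phoneme's hidden state according to a predicted duration (counted in mel-spectrogram frames), and a scaling factor stretches or shrinks those durations to slow down or speed up the voice. The code below is a simplified, pure-Python illustration of that idea, not the authors' implementation: length_regulate, alpha and the string placeholders standing in for hidden states are hypothetical names, and the half-up rounding rule is an assumption.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def length_regulate(
    states: Sequence[T], durations: Sequence[int], alpha: float = 1.0
) -> List[T]:
    """Repeat each phoneme state for about duration * alpha frames.

    alpha > 1 lengthens every phoneme (slower speech); alpha < 1 shortens
    it (faster speech). Half-up rounding is an assumption of this sketch.
    """
    if len(states) != len(durations):
        raise ValueError("states and durations must have the same length")
    expanded: List[T] = []
    for state, duration in zip(states, durations):
        frames = math.floor(duration * alpha + 0.5)  # half-up rounding
        expanded.extend([state] * frames)
    return expanded


if __name__ == "__main__":
    # Placeholder "hidden states" for four phonemes and their durations.
    phonemes = ["h1", "h2", "h3", "h4"]
    frames_per_phoneme = [2, 2, 3, 1]
    for alpha in (1.0, 1.3, 0.5):
        out = length_regulate(phonemes, frames_per_phoneme, alpha)
        print(f"alpha={alpha}: {len(out)} frames -> {out}")
```

Because every frame is produced from an explicit duration rather than from learned attention, changing alpha changes the output length in a predictable way, which is what lets the speed be adjusted smoothly.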


