Artificial Intelligence · 2018-11-22

Alexa to get newscaster-like voice soon based on new tech – AI

Amazon's Alexa is all set to get a new speaking style soon. Amazon says it is working on a new speaking style for the digital assistant based on “neural text-to-speech” tech, which will eventually sound more like a “newscaster”. If things go well, Amazon said, it would launch this “voice” on enabled devices in the coming weeks.

According to a blog post by Trevor Wood, an applied-science manager, and Tom Merritt, an applied scientist, both in the Alexa Speech group, the new speaking style will be enabled by neural text-to-speech (NTTS), which uses machine learning to “generate expressive voices more quickly.”

With the increased flexibility provided by NTTS, the speaking style of synthesized speech can be tweaked quite easily. Amazon, for example, has already created a news-domain voice by augmenting a large existing data set of style-neutral recordings with only a few hours of newscaster-style recordings. That would have been impossible with previous techniques based on ‘concatenative speech synthesis’, the duo claimed.

Concatenative speech synthesis involves breaking speech samples into distinct sounds (known as phonemes) and then stringing them together to form words and, eventually, sentences.
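To make the contrast with the neural approach concrete, here is a minimal Python sketch of unit concatenation; the phoneme inventory, snippet lengths and crossfade length are invented stand-ins for a real recorded-unit database, not Amazon's implementation.

```python
# A minimal sketch of concatenative synthesis: join pre-recorded
# phoneme units with a short crossfade at each boundary.
import numpy as np

# Hypothetical unit inventory: phoneme -> mono waveform samples (placeholder audio).
unit_db = {
    "HH": np.random.randn(800),
    "EH": np.random.randn(1600),
    "L":  np.random.randn(1200),
    "OW": np.random.randn(2000),
}

def synthesize(phonemes, crossfade=80):
    """Concatenate recorded phoneme units, smoothing each join
    with a linear crossfade of `crossfade` samples."""
    out = unit_db[phonemes[0]].copy()
    for p in phonemes[1:]:
        unit = unit_db[p]
        fade = np.linspace(0.0, 1.0, crossfade)
        out[-crossfade:] = out[-crossfade:] * (1 - fade) + unit[:crossfade] * fade
        out = np.concatenate([out, unit[crossfade:]])
    return out

audio = synthesize(["HH", "EH", "L", "OW"])  # a rough "hello"
```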

How NTTS can model different speaking styles

The Amazon neural TTS system comprises two components (a rough code sketch of the pipeline follows this list):

(1) a neural network that converts a sequence of phonemes — the most basic units of language — into a sequence of “spectrograms,” or snapshots of the energy levels in different frequency bands

(2) a vocoder, which converts the spectrograms into a continuous audio signal
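The snippet below is a schematic sketch of how those two stages hand off to one another. The phoneme table, frame counts and hop length are invented, and the two stand-in functions merely mimic the shapes of data a trained sequence-to-sequence network and neural vocoder would produce.

```python
# A schematic, non-trained sketch of the two-stage NTTS pipeline.
import numpy as np

PHONEME_IDS = {"HH": 0, "EH": 1, "L": 2, "OW": 3}  # hypothetical inventory

def phonemes_to_spectrogram(phonemes, n_mels=80, frames_per_phoneme=10):
    """Stage 1 (stand-in): map a phoneme sequence to spectrogram frames,
    i.e. snapshots of energy in different frequency bands. In the real
    system this is a sequence-to-sequence neural network."""
    ids = np.array([PHONEME_IDS[p] for p in phonemes])
    frames = np.repeat(ids, frames_per_phoneme)
    return np.random.rand(n_mels, frames.size)  # fake energy values, one column per frame

def vocoder(spectrogram, hop_length=200):
    """Stage 2 (stand-in): turn spectrogram frames into a continuous
    audio signal. In the real system this is a neural vocoder."""
    n_samples = spectrogram.shape[1] * hop_length
    return np.random.randn(n_samples) * spectrogram.mean()

spec = phonemes_to_spectrogram(["HH", "EH", "L", "OW"])
audio = vocoder(spec)  # continuous audio signal
```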

When trained on the large data sets used to build general-purpose concatenative-synthesis systems, this sequence-to-sequence approach yields high-quality, neutral-sounding voices. By design, though, those data sets lack the distinctive prosodic features required to represent particular speech styles. So although it is high in quality, the speech generated through this approach displays a limited variety of expression (pitch, breaks, rhythm).

On the other hand, producing a similar-sized data set by enlisting voice talent to read in the desired style is very time-consuming and expensive. (Training a sequence-to-sequence model from scratch requires tens of hours of speech.)

Suffice it to say that with the new method, the high quality achieved with relatively little additional training data allows for the rapid expansion of speaking styles, Amazon claimed.
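As a hedged illustration of that data-efficiency argument, the toy sketch below pre-trains a model on a large “neutral” set and then fine-tunes the same weights on a much smaller “newscaster” set. The linear model, synthetic data and learning rates are placeholders and bear no relation to Amazon's actual training setup; the point is only that the style-specific pass starts from already-trained weights, which is what lets a few hours of recordings go a long way.

```python
# Toy sketch: pre-train on a large neutral corpus, then fine-tune on
# a small style-specific one. A linear model stands in for the far
# larger sequence-to-sequence network.
import numpy as np

def sgd_train(weights, inputs, targets, lr, epochs):
    """Plain least-squares SGD on a linear model."""
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            pred = weights @ x
            weights -= lr * (pred - y) * x
    return weights

rng = np.random.default_rng(0)
dim = 16
weights = np.zeros(dim)

# "Tens of hours" of neutral-style data (synthetic stand-in)...
neutral_x = rng.normal(size=(5000, dim))
neutral_y = neutral_x @ rng.normal(size=dim)
weights = sgd_train(weights, neutral_x, neutral_y, lr=0.01, epochs=2)

# ...then only "a few hours" of newscaster-style data for adaptation.
news_x = rng.normal(size=(200, dim))
news_y = news_x @ rng.normal(size=dim)
weights = sgd_train(weights, news_x, news_y, lr=0.005, epochs=5)
```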

Amazon even conducted a perceptual test of Alexa's new speaking style, in which it asked listeners to rate speech samples on a scale from 0 to 100 according to how suitable their speaking styles were for news reading. Included in the survey was an audio recording of a real newscaster as a reference.

Listeners love it

Listeners apparently loved neutral NTTS and rated it more highly than concatenative synthesis, reducing the score discrepancy between human and synthetic speech by 46%. They preferred the NTTS newscaster style to both other systems, however, shrinking the discrepancy by a further 35%. The preference for the neutral-style NTTS reflects the widely reported increase in general speech-synthesis quality due to neural generative methods.
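For readers wondering how a figure like “reducing the discrepancy by 46%” is derived, the small example below shows the usual calculation on a 0-to-100 scale; the three scores are invented for illustration and are not Amazon's published numbers.

```python
# How a "gap reduction" percentage is typically computed from mean
# listener scores (hypothetical numbers, not Amazon's results).
def gap_reduction(human, baseline, new_system):
    """Percent of the human-vs-baseline score gap closed by new_system."""
    old_gap = human - baseline
    new_gap = human - new_system
    return 100.0 * (old_gap - new_gap) / old_gap

print(gap_reduction(human=85.0, baseline=60.0, new_system=71.5))  # ~46%
```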
