Microsoft’s new “VALL-E” can mimic anyone’s voice - Artificial intelligence

Microsoft researchers recently unveiled a new text-to-speech (TTS) artificial intelligence (AI) model named “VALL-E”.

Given a 3-second audio sample, VALL-E can accurately mimic a person’s voice. When VALL-E learns a particular voice, it can create audio of that person speaking anything while attempting to capture the speaker’s emotional tone.


According to its developers, VALL-E could be used for high-quality text-to-speech applications and for speech editing, in which a recording of a person’s voice is changed by editing a text transcript (making them say something they didn’t originally say). It could also be combined with other generative AI models such as GPT-3 to create audio content.

Microsoft claims that, in its experiments, VALL-E significantly outperformed the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity. It also preserved the speaker’s emotion and the acoustic environment of the acoustic prompt in the synthesized speech.

Image credit: GitHub
