Researchers develop on-device speech recognition models with 97% accuracy



Researchers at the University of Waterloo have pioneered a method of developing on-device speech recognition models with about 97% accuracy. The research, carried out in collaboration with Waterloo-based DarwinAI Corp. and published on arXiv.org, describes a design strategy that achieves roughly 97% accuracy and can run offline on low-end smartphones.

The research builds on earlier work such as the offline navigation, music playback, and temperature control algorithms developed by Alexa's machine learning (ML) team; the offline WaveNet voice model developed by Voysis; and the highly accurate on-device voice recognition model developed by Qualcomm.

Voice recognition systems, which use layers of neuron-mimicking algorithms to interpret human speech, normally depend heavily on remote servers to process and interpret data. The results of this research, however, claim a strategy that makes accurate speech recognition possible entirely offline.

Researchers elaborated in their paper:

In this study, we explore a human-machine collaborative design strategy for building low-footprint (deep neural network) architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration.

The paper further stated:

The efficacy of this design strategy is demonstrated through the design of a family of highly-efficient (deep neural networks), nicknamed EdgeSpeechNets, for limited vocabulary speech recognition.

The step-by-step modeling process first involved constructing a prototype model able to perform speech recognition on a limited vocabulary. The team then chose a design methodology that transforms audio signals into Mel-frequency cepstral coefficients (MFCCs), compact mathematical representations of the signal, and leverages deep residual learning to attain greater representational power and accuracy than the methods traditionally in use.
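To make the MFCC front end concrete, here is a minimal NumPy sketch of the transformation (framing, power spectrum, mel filterbank, log, DCT-II). The exact frame sizes, filter counts, and windowing used in the paper are not stated here; the values below are common defaults chosen for illustration.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=40, n_mfcc=13):
    """Compute Mel-frequency cepstral coefficients from a mono signal."""
    # Slice the signal into overlapping frames and apply a Hann window
    frames = [signal[s:s + n_fft] for s in range(0, len(signal) - n_fft + 1, hop)]
    frames = np.array(frames) * np.hanning(n_fft)

    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular filters spaced evenly on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)

    # Log mel energies, then DCT-II to decorrelate into cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1)) / (2 * n_mels))
    return logmel @ dct.T

# Example: one second of a 440 Hz tone sampled at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
coeffs = mfcc(np.sin(2 * np.pi * 440 * t))
print(coeffs.shape)  # (number of frames, 13 coefficients per frame)
```

The resulting matrix of coefficients, rather than the raw waveform, is what a small network like EdgeSpeechNet would consume.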

The work then moved on to generative synthesis, the phase in which the researchers used a machine-driven exploration strategy that builds deep neural networks with a strong emphasis on performance. This configuration produced speech models whose validation accuracy reached 95% or more.

The researchers went on to evaluate each of the EdgeSpeechNets produced using the Google Speech Commands dataset, and the results, as stated in the research, showed that "the EdgeSpeechNets had higher accuracies at much smaller sizes and lower computation costs than state-of-the-art deep neural networks."

The results clearly show that the EdgeSpeechNets, despite being markedly smaller and requiring less computation, still achieved top-rated performance. This makes them well suited for voice interfaces on local devices.

The researchers hope that future work will apply the new strategy to other domains such as natural language processing (NLP) and visual perception.
