Researchers at Google are applying computer vision techniques to visual representations of sound waves to achieve best-in-class speech recognition without language models. The researchers say the new SpecAugment method needs no additional data or language model to recognize human speech accurately.
SpecAugment works by applying data augmentation techniques from computer vision to spectrograms, the visual representations of speech.
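Two of SpecAugment's augmentations, frequency masking and time masking, simply zero out random bands of a spectrogram (the method also includes time warping, omitted here). Below is a minimal sketch assuming a log-mel spectrogram stored as a NumPy array; the function name and parameter values are illustrative, not the paper's reference implementation.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=10,
                 num_time_masks=2, time_mask_width=25, rng=None):
    """Apply frequency and time masking to a spectrogram.

    spec: 2-D array of shape (num_mel_bins, num_frames).
    Masked regions are set to zero.
    """
    rng = rng or np.random.default_rng()
    augmented = spec.copy()
    num_bins, num_frames = augmented.shape

    # Frequency masking: zero out a random band of consecutive mel bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, freq_mask_width + 1))
        start = int(rng.integers(0, max(num_bins - width, 0) + 1))
        augmented[start:start + width, :] = 0.0

    # Time masking: zero out a random span of consecutive frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, time_mask_width + 1))
        start = int(rng.integers(0, max(num_frames - width, 0) + 1))
        augmented[:, start:start + width] = 0.0

    return augmented
```

Because the masks are applied directly to model inputs during training, the augmentation costs almost nothing at runtime and requires no extra audio data.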
“An unexpected outcome of our research was that models trained with SpecAugment out-performed all prior methods even without the aid of a language model,” Google AI resident Daniel S. Park and research scientist William Chan said in a blog post today. “While our networks still benefit from adding a language model, our results are encouraging in that it suggests the possibility of training networks that can be used for practical purposes without the aid of a language model.”
Applying SpecAugment to models trained on LibriSpeech 960h, the team obtained a 2.6% word error rate. The benchmarks used were:

- LibriSpeech 960h: roughly 1,000 hours of read English speech
- Switchboard 300h: about 260 hours of English telephone conversations
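Word error rate, the metric behind the 2.6% figure, is the word-level edit distance between the system's transcript and the reference, divided by the number of reference words. A minimal sketch (not tied to any particular toolkit):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance (substitutions,
    insertions, deletions) divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)
```

A 2.6% WER means that, on average, the recognizer makes about 26 word-level errors per 1,000 reference words.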
Automatic speech recognition works by converting human speech into machine-readable text, which downstream systems then act on. Known as conversational AI, the technology powers a wide range of products such as Amazon’s Alexa. Google says that better conversational AI capabilities will only help the adoption of the technology and the products built on it.
Advances in computing have already lowered speech recognition errors drastically. Isolating background noise, for example, reportedly improves Alexa’s speech recognition accuracy by 15%.
We recently covered a semi-supervised training method for Alexa that is expected to improve its voice recognition capabilities by 20%.