Amazon’s Alexa team of scientists announced on April 4, 2019, that they have used a very large unlabeled data set to expose Alexa to a wide variety of human speech. The scientists say it is perhaps the largest data set ever used to train an acoustic model. The aim is to help the intelligent assistant better understand human voices.
Semi-Supervised Learning, the technique through which this was achieved, combines sounds tagged by human beings with sounds tagged by machines to train Artificial Intelligence engines such as Amazon’s Alexa. The result was a reduction in speech recognition errors of 10-22%. Scientists say this method works better than Supervised Learning, which relies on human-tagged sounds alone.
Discussing the development, Alexa Senior Applied Scientist Hari Parthasarathi stated in a blog post, “We are currently working to integrate the new model into Alexa, with a projected release date of later this year. The 7,000 hours of annotated data are more accurate than the machine-labeled data, so while training the student, we interleave the two. Our intuition was that if the machine-labeled data began to steer the model in the wrong direction, the annotated data could provide a course correction.”
The acoustic model was trained on up to 1 million hours of untagged sound data in addition to 7,000 hours of human-tagged data, whereas the purely supervised method is limited to the tagged data alone. Acoustic models handle the part of automatic speech recognition that converts audio of human voices into the sound units from which voice commands are recognized.
This major development in Alexa was achieved using a long short-term memory (LSTM) network trained with what is known as the ‘teacher-student’ method. Here, the teacher model is already trained to classify 30-millisecond snippets of audio, and it transfers some of that knowledge to the student model.
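The teacher-student idea can be sketched in a few lines. The sketch below is illustrative only and assumes a toy linear softmax classifier standing in for the real acoustic model; the class count, frame dimension, and training details are hypothetical, not Amazon's actual configuration. The untrained student is fit to match the trained teacher's soft output probabilities on unlabeled frames.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 8   # toy stand-in for the ~3,000 acoustic clusters
FRAME_DIM = 4     # toy stand-in for features of a 30 ms audio frame

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# A fixed, "already trained" teacher: here just a random linear map.
W_teacher = rng.normal(size=(FRAME_DIM, NUM_CLASSES))

# The student starts untrained and learns to mimic the teacher's
# soft output distribution -- no human labels needed for these frames.
W_student = np.zeros((FRAME_DIM, NUM_CLASSES))

frames = rng.normal(size=(256, FRAME_DIM))   # "unlabeled" audio frames
lr = 0.5
for _ in range(200):
    p_teacher = softmax(frames @ W_teacher)  # soft targets from the teacher
    p_student = softmax(frames @ W_student)
    # Gradient of cross-entropy H(p_teacher, p_student) w.r.t. student weights
    grad = frames.T @ (p_student - p_teacher) / len(frames)
    W_student -= lr * grad

# After training, the student's top predictions track the teacher's.
agreement = np.mean(
    p_teacher.argmax(axis=1) == softmax(frames @ W_student).argmax(axis=1)
)
print(f"student/teacher agreement: {agreement:.2f}")
```

Because the teacher supplies the targets, the student can be trained on vast amounts of audio that no human ever annotated, which is what makes the million-hour scale feasible.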
A number of other techniques were used as well, such as:
- A single, rather than dual, pattern of student-model analysis
- Interleaving, or mixing, the human-annotated and machine-labeled data during training
- Storing only the teacher model’s 20 highest-probability outputs, instead of the traditional full distribution over roughly 3,000 clusters
- Letting the student model learn from the maximum of these 20 teacher outputs
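Two of the tricks above are easy to illustrate concretely. The sketch below is a hypothetical toy, not Amazon's code: it shows (1) truncating a teacher's output distribution to its 20 highest-probability entries instead of keeping all ~3,000, and (2) interleaving human-annotated and machine-labeled batches so the accurate human labels can provide the "course correction" described in the quote. Function names and batch labels are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_CLUSTERS = 3000
TOP_K = 20

def top_k_targets(p, k=TOP_K):
    """Keep the k largest probabilities, zero the rest, renormalize."""
    idx = np.argsort(p)[-k:]        # indices of the k highest outputs
    truncated = np.zeros_like(p)
    truncated[idx] = p[idx]
    return truncated / truncated.sum()

# A random full teacher distribution over ~3,000 clusters.
teacher_probs = rng.dirichlet(np.ones(NUM_CLUSTERS))
targets = top_k_targets(teacher_probs)
print(np.count_nonzero(targets))    # only the top entries survive
print(targets.sum())                # still a valid probability distribution

def interleave(human_batches, machine_batches):
    """Alternate annotated and machine-labeled batches during training."""
    for h, m in zip(human_batches, machine_batches):
        yield ("human", h)
        yield ("machine", m)

schedule = list(interleave(["h0", "h1"], ["m0", "m1"]))
print(schedule)
```

Storing 20 numbers per frame instead of 3,000 shrinks the teacher's targets by two orders of magnitude, which matters when the training set runs to a million hours of audio.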
Recently, Amazon also announced a 20% reduction in speech recognition errors, as well as a redesign of the Echo device that reduces the number of microphones from seven to two for better speech recognition capability.