Exploring Attention Mechanism for Acoustic-based Classification of Speech Utterances into System-directed and Non-system-directed

2019 
Voice-controlled virtual assistants (VAs) are now available in smartphones, cars, and standalone home devices. In most cases, the user must first "wake up" the VA by saying a particular word or phrase each time they want the VA to do something. Eliminating the need to say the wake-up word for every interaction could improve the user experience, but it would require the VA to understand whether or not the user is talking to it. In other words, the challenge is to distinguish between system-directed and non-system-directed speech utterances. In this paper, we present a number of neural network architectures for tackling this classification problem using only the acoustic signal. We show that a model comprised of convolutional, recurrent, and feed-forward layers can achieve an equal error rate (EER) below 20% on this task. In addition, we investigate the use of an attention mechanism to help the model focus on the more informative parts of the signal and to improve the handling of variable-length input sequences. The results show that the proposed attention mechanism significantly improves model accuracy, achieving EERs of 16.25% and 15.62% on two distinct realistic datasets.
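To make the described architecture concrete, below is a minimal PyTorch sketch of a conv-recurrent classifier with attention pooling, loosely following the abstract: convolutional layers over acoustic features, a recurrent layer, an attention mechanism that collapses a variable-length sequence into a fixed-size summary, and a feed-forward head. All layer sizes, the feature type (log-mel filterbanks), and the exact attention scoring function are illustrative assumptions, not the paper's reported configuration.

import torch
import torch.nn as nn

class DirectednessClassifier(nn.Module):
    def __init__(self, n_mels=64, conv_ch=64, rnn_hidden=128):
        super().__init__()
        # Convolutional front end over the time axis of the feature matrix.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, conv_ch, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Recurrent layer to model temporal context.
        self.rnn = nn.LSTM(conv_ch, rnn_hidden, batch_first=True,
                           bidirectional=True)
        # Additive attention (assumed form): score each frame, then take a
        # softmax-weighted sum so an utterance of any length collapses to
        # one fixed-size vector, rather than using only the last RNN state.
        self.attn_score = nn.Linear(2 * rnn_hidden, 1)
        # Feed-forward classification head.
        self.classifier = nn.Sequential(
            nn.Linear(2 * rnn_hidden, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # logit: system-directed vs. non-system-directed
        )

    def forward(self, feats):
        # feats: (batch, time, n_mels) acoustic features, e.g. log-mel.
        h = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        h, _ = self.rnn(h)                            # (batch, time, 2*hidden)
        w = torch.softmax(self.attn_score(h), dim=1)  # per-frame weights
        pooled = (w * h).sum(dim=1)                   # attention-pooled summary
        return self.classifier(pooled).squeeze(-1)

# Example: score a batch of two 300-frame utterances of 64-dim features.
model = DirectednessClassifier()
logits = model(torch.randn(2, 300, 64))
probs = torch.sigmoid(logits)

Note that this sketch does not mask padded frames when batching utterances of different lengths; a production model would exclude padding from both the attention softmax and the weighted sum.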