Emotion Recognition from Speech Using Deep Neural Network

2021 
Acoustics is the branch of physics concerned with the properties of sound. Speech is sound energy produced when thoughts are articulated as spoken utterances, and it carries acoustic characteristics such as tone, intensity, loudness, duration, and the presence of noise. As the most expressive mode of sound, speech has been used extensively by researchers to infer a speaker's emotions: it conveys emotion and attitude alongside linguistic content, and interpreting intonation, pitch, and relative loudness can enhance human-computer interaction. Emotion recognition requires not only artificial intelligence to analyze the signal but also a sound knowledge base in social science, psychology, and anthropology to train the machine. In this chapter, speech is analyzed and classified for emotion recognition on the RAVDESS dataset. Features are extracted from speech utterances using Mel-frequency cepstral coefficients (MFCCs), and deep learning architectures then classify the utterances into emotions. Two major approaches are used. The first is a comparative study of neural network architectures (CNN, DNN, GRU, and LSTM) on prosodic features. The second analyzes the well-known traditional computer vision technique called Bag of Visual Words, which clusters SURF features with the unsupervised K-means algorithm and classifies the resulting representations with an SVM. Among the deep learning approaches, the best results were obtained by an LSTM on MFCCs and by an MLP classifier on MFCCs with PCA for dimensionality reduction, each reaching 70% accuracy, while the Bag of Visual Words technique achieved 53% correct classification. Ultimately, the study is extended to construct a hybrid of acoustic features (HAF) fed into an ensemble of bagged MLP classifiers, yielding an accuracy of 85%. The chapter also puts forward important findings from the existing literature.
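
The chapter's implementation is not reproduced here, but the MFCC-plus-LSTM pipeline it describes can be sketched as follows, assuming librosa for feature extraction and Keras for the model. The file handling, layer sizes, frame count, and the eight-class RAVDESS label set are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
import librosa
from tensorflow.keras import layers, models

N_MFCC = 40       # number of cepstral coefficients (assumed hyperparameter)
NUM_CLASSES = 8   # RAVDESS defines eight emotion categories
MAX_FRAMES = 174  # fixed time length after padding/truncation (assumed)

def extract_mfcc(path):
    """Load one utterance and return a fixed-size (MAX_FRAMES, N_MFCC) MFCC matrix."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T  # (frames, N_MFCC)
    # Pad or truncate along time so every utterance has the same shape
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

# LSTM classifier over the MFCC time series (sizes are illustrative)
model = models.Sequential([
    layers.Input(shape=(MAX_FRAMES, N_MFCC)),
    layers.LSTM(128),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50)
```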
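The Bag of Visual Words baseline treats the audio as images, presumably spectrogram renderings of the utterances (an assumption; the abstract does not say). A hedged outline of the SURF / K-means / SVM chain, using opencv-contrib (SURF is patented and only available in builds with the nonfree modules enabled) and scikit-learn; the vocabulary size and SVM kernel are assumptions.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans
from sklearn.svm import SVC

VOCAB_SIZE = 200  # number of visual words (assumed)

# Requires opencv-contrib-python built with OPENCV_ENABLE_NONFREE
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

def surf_descriptors(image_path):
    """Detect SURF keypoints on a (spectrogram) image and return their descriptors."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = surf.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 64), dtype=np.float32)

def build_vocabulary(all_descriptors):
    """Cluster the pooled descriptors with K-means; cluster centres are the visual words."""
    return KMeans(n_clusters=VOCAB_SIZE, n_init=10).fit(np.vstack(all_descriptors))

def bovw_histogram(desc, kmeans):
    """Quantize descriptors to their nearest visual word and histogram the counts."""
    hist = np.zeros(VOCAB_SIZE)
    if len(desc):
        words, counts = np.unique(kmeans.predict(desc), return_counts=True)
        hist[words] = counts
    return hist / max(hist.sum(), 1)  # L1-normalize per image

# Training: pool descriptors, build the vocabulary, then fit an SVM on the histograms
# vocab = build_vocabulary([surf_descriptors(p) for p in train_paths])
# X = np.array([bovw_histogram(surf_descriptors(p), vocab) for p in train_paths])
# clf = SVC(kernel="rbf").fit(X, y_train)
```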
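The best-performing configuration feeds hybrid acoustic features into an ensemble of bagged MLP classifiers. A minimal scikit-learn sketch follows; the particular feature mix (MFCC, chroma, and mel-spectrogram statistics) and all hyperparameters are assumptions rather than the chapter's reported HAF recipe. Note that scikit-learn >= 1.2 uses the `estimator` parameter, while older releases call it `base_estimator`.

```python
import numpy as np
import librosa
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

def hybrid_acoustic_features(path):
    """Assumed HAF: per-utterance means of MFCC, chroma, and mel-spectrogram features."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.concatenate([mfcc, chroma, mel])

# Bagging over MLP base learners (counts and layer sizes are illustrative)
ensemble = BaggingClassifier(
    estimator=MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500),
    n_estimators=10,   # number of bootstrap-trained MLPs
    max_samples=0.8,   # each MLP sees 80% of the training set
    n_jobs=-1,
)
# ensemble.fit(X_train, y_train); accuracy = ensemble.score(X_test, y_test)
```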