Audio and face video emotion recognition in the wild using deep neural networks and small datasets
2016
This paper presents the techniques used in our contribution to the video-based sub-challenge of Emotion Recognition in the Wild 2016. The purpose of the sub-challenge is to classify video clips into seven categories: the six basic emotions (angry, sad, happy, surprise, fear, and disgust) and neutral. Compared to the movie-based datasets of earlier years, this year's test dataset introduced reality TV videos containing more spontaneous emotion. Our proposed solution fuses a facial expression recognition subsystem and an audio emotion recognition subsystem at the score level. For facial emotion recognition, a deep Convolutional Neural Network pre-trained on ImageNet training data is fine-tuned on FER2013 training data and used for feature extraction. Three classifiers (kernel SVM, logistic regression, and partial least squares) are compared on these features. Classifiers learned from different kernels are then fused at the score level to improve system performance. For audio emotion recognition, a deep Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is trained directly on the challenge dataset. Experimental results show that each subsystem individually, and the fused system as a whole, can achieve state-of-the-art performance. The overall accuracy of the proposed approach on the challenge test dataset is 53.9%, compared with the challenge baseline of 40.47%.
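To make the idea of score-level fusion concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each subsystem emits one raw score per emotion class, that scores are normalized with a softmax before being combined, and that the fusion is a weighted sum. The class ordering, the function names (`softmax`, `fuse_scores`, `predict`), the example scores, and the weights are all hypothetical; the abstract does not state how the fusion weights were chosen.

```python
import math
from typing import Dict, List, Sequence

# Seven target classes from the sub-challenge (ordering is an assumption).
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Map raw scores to probabilities so subsystems share a common scale."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(
    subsystem_scores: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted sum of per-class probabilities across subsystems.

    Weights are hypothetical; they are renormalized to sum to 1.
    """
    total_weight = sum(weights[name] for name in subsystem_scores)
    fused = [0.0] * len(EMOTIONS)
    for name, scores in subsystem_scores.items():
        probs = softmax(scores)
        w = weights[name] / total_weight
        for i, p in enumerate(probs):
            fused[i] += w * p
    return fused


def predict(
    subsystem_scores: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> str:
    """Return the emotion label with the highest fused score."""
    fused = fuse_scores(subsystem_scores, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]


if __name__ == "__main__":
    # Made-up scores for a single clip, for illustration only.
    clip_scores = {
        "face": [0.1, 0.0, 0.2, 2.0, 0.5, 0.3, 0.1],
        "audio": [1.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.1],
    }
    clip_weights = {"face": 0.7, "audio": 0.3}
    print(predict(clip_scores, clip_weights))
```

In practice the weights would be tuned on a validation set; fusing at the score level keeps the face and audio subsystems independent, so either can be retrained without touching the other.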
Keywords:
- Computer vision
- Time delay neural network
- Artificial intelligence
- Emotion classification
- Recurrent neural network
- Convolutional neural network
- Artificial neural network
- Support vector machine
- Partial least squares regression
- Machine learning
- Feature extraction
- Computer science
- Pattern recognition
- Speech recognition
- Disgust
- Surprise
- Transfer of learning