Enhanced face/audio emotion recognition: video and instance level classification using ConvNets and restricted Boltzmann Machines
2017
Face-based and audio-based emotion recognition modalities have been studied extensively, obtaining successful classification rates for arousal/valence levels and for settings with multiple emotion categories. However, recent studies focus only on classifying discrete emotion categories with a single image representation and/or a single set of audio feature descriptors. Face-based emotion recognition systems use a single image-channel representation such as principal-components-analysis whitening, isotropic smoothing, or ZCA whitening. Similarly, audio emotion recognition systems use a standardized set of audio descriptors, including only averaged Mel-Frequency Cepstral Coefficients. Both approaches entail the inclusion of decision-fusion modalities to compensate for the limited feature separability and achieve high classification rates. In this paper, we propose two new methodologies for enhancing face-based and audio-based emotion recognition, each based on a single classifier decision and using the EU Emotion Stimulus dataset: (1) a combination of a Convolutional Neural Network for frame-level feature extraction with a k-Nearest Neighbors classifier for the subsequent frame-level aggregation and video-level classification, and (2) a shallow Restricted Boltzmann Machine network for arousal/valence classification.
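The first methodology (frame-level features followed by k-Nearest Neighbors and video-level classification) can be illustrated with a minimal sketch. The abstract does not state how frame-level decisions are aggregated, so the majority vote below is an assumption, as are the function names, the toy two-dimensional vectors standing in for ConvNet features, and the emotion labels.

```python
import math
from collections import Counter


def knn_predict(train_feats, train_labels, query, k=3):
    """Label one frame feature vector by majority vote of its k nearest training frames."""
    # Euclidean distance from the query frame to every training frame.
    dists = sorted(
        (math.dist(feat, query), label)
        for feat, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def classify_video(train_feats, train_labels, video_frames, k=3):
    """Classify each frame with k-NN, then aggregate by majority vote (assumed rule)."""
    frame_preds = [
        knn_predict(train_feats, train_labels, frame, k) for frame in video_frames
    ]
    video_label = Counter(frame_preds).most_common(1)[0][0]
    return video_label, frame_preds


# Hypothetical toy data: 2-D vectors stand in for ConvNet frame features.
train_feats = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["happy", "happy", "happy", "sad", "sad", "sad"]
video_frames = [(0.05, 0.1), (0.15, 0.15), (0.95, 1.0)]

video_label, frame_preds = classify_video(train_feats, train_labels, video_frames)
print(video_label, frame_preds)
```

In this sketch a video receives a single label from one classifier, with no fusion of separate decision streams; other aggregation rules (for example, averaging neighbor distances across frames) would fit the same structure.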