Multimodal approach to engagement and disengagement detection with highly imbalanced in-the-wild data

2018 
Engagement/disengagement detection is a challenging task that arises in a range of human-human and human-computer interaction problems. Despite its importance, the problem remains far from solved, even though a number of studies involving in-the-wild data have already been conducted. Ambiguity in the definition of engaged/disengaged states makes such data hard to collect, annotate, and analyze. In this paper we describe different approaches to building engagement/disengagement models that work with highly imbalanced multimodal data from natural conversations. We establish a baseline of 0.695 unweighted average recall (UAR) with direct classification. We then attempt to detect disengagement by means of engagement regression models, since engagement and disengagement are strongly negatively correlated. To deal with the imbalanced data, we apply class weighting and data augmentation techniques (SMOTE and mixup). We experiment with combinations of modalities to find those that contribute most, using features from both the audio (speech) and video (face, body, lips, eyes) channels. We transform the original features with Principal Component Analysis and experiment with several types of modality fusion. Finally, we combine these approaches and raise performance to 0.715 UAR using four modalities (all channels except face). Audio and lips features turn out to contribute most, which may be closely tied to their connection with speech.
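
The baseline and final scores are reported as unweighted average recall (UAR), i.e. the mean of per-class recalls, which weights the rare disengaged class equally with the majority class. A minimal sketch of the metric using scikit-learn, with hypothetical labels:

```python
# UAR = mean of per-class recalls; each class counts equally
# regardless of how many samples it has.
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 1, 1]   # hypothetical labels: 0 = engaged, 1 = disengaged
y_pred = [0, 0, 0, 1, 1, 0]

uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.3f}")     # (3/4 + 1/2) / 2 = 0.625
```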
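For the class weighting, SMOTE oversampling, and PCA steps mentioned above, the sketch below shows one way such a pipeline could be wired together, assuming scikit-learn/imbalanced-learn APIs and synthetic stand-in data; the classifier choice and dimensions are illustrative, not the authors' exact setup.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))            # stand-in for fused multimodal features
y = (rng.random(500) < 0.1).astype(int)   # ~10% minority (disengaged) class

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),        # oversample the minority class
    ("pca", PCA(n_components=20)),           # transform/compress features
    ("clf", SVC(class_weight="balanced")),   # additionally reweight classes
])

# "recall_macro" is exactly the unweighted average recall (UAR)
scores = cross_val_score(pipe, X, y, cv=5, scoring="recall_macro")
print(f"UAR = {scores.mean():.3f}")
```

The imbalanced-learn pipeline applies SMOTE only during fitting, so held-out folds are scored on unaugmented data.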
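mixup, the second augmentation technique, blends random training pairs: x̃ = λx_i + (1-λ)x_j with λ ~ Beta(α, α), and mixes the labels the same way. A self-contained sketch for tabular features (the α value is an illustrative assumption):

```python
import numpy as np

def mixup(X, y, alpha=0.2, rng=None):
    """Blend random sample pairs: x~ = lam*x_i + (1-lam)*x_j, same for y."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha, size=(len(X), 1))  # one lambda per pair
    idx = rng.permutation(len(X))                   # random partner indices
    X_mix = lam * X + (1.0 - lam) * X[idx]
    y_mix = lam.ravel() * y + (1.0 - lam.ravel()) * y[idx]  # soft labels
    return X_mix, y_mix

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=float)
X_aug, y_aug = mixup(X, y, rng=rng)  # augmented copies to append to training data
```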
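The modality-fusion experiments could follow either early (feature-level) or late (decision-level) fusion; the sketch below contrasts the two under assumed modalities and stand-in logistic-regression models, since the paper's exact fusion schemes are not detailed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
audio = rng.normal(size=(n, 10))        # hypothetical audio features
lips = rng.normal(size=(n, 6))          # hypothetical lips features
y = (rng.random(n) < 0.5).astype(int)

# Early fusion: concatenate modality features, train a single classifier.
early = LogisticRegression().fit(np.hstack([audio, lips]), y)

# Late fusion: one classifier per modality, average their class probabilities.
clf_a = LogisticRegression().fit(audio, y)
clf_l = LogisticRegression().fit(lips, y)
p_fused = (clf_a.predict_proba(audio) + clf_l.predict_proba(lips)) / 2
y_pred = p_fused.argmax(axis=1)
```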