A Multimodal Framework for State of Mind Assessment with Sentiment Pre-classification

2019 
In this paper, we address the AVEC 2019 State of Mind Sub-Challenge (SoMS) and propose a multimodal state-of-mind (SoM) assessment framework for valence and arousal, respectively. For valence, sentiment analysis is first performed on the English text obtained via German speech recognition and translation, classifying the audio-visual session as a positive or negative narrative. Each overlapping 60 s segment of the session is then fed into an audio-visual SoM assessment model trained on positive or negative narratives, and the mean over all segment predictions is taken as the final prediction for the session. For arousal, the positive/negative pre-classification step is omitted. For the audio-visual SoM assessment models, we propose to extract functional features (Function) and VGGish-based deep learning features (VGGish) from speech, and abstract visual features derived from the baseline visual features with a convolutional neural network (CNN). For each feature stream, a long short-term memory (LSTM) model is trained to predict the valence/arousal value of a segment, and a support vector regression (SVR) model is used for decision-level fusion. Experiments on the USoM dataset show that the model combining Function, the baseline ResNet features, and the baseline VGG features obtains promising valence predictions, with a concordance correlation coefficient (CCC) of up to 0.531 on the test set, much higher than the baseline result of 0.219.
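The decision-level fusion, segment-to-session aggregation, and CCC metric described above can be summarised in a short sketch. This is illustrative only, assuming NumPy/scikit-learn and hypothetical function names, shapes, and SVR settings; the abstract does not specify the authors' actual implementation.

```python
import numpy as np
from sklearn.svm import SVR

def ccc(y_true, y_pred):
    """Concordance correlation coefficient, the metric reported in the abstract."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()          # population variances
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()   # population covariance
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

def fuse_and_aggregate(stream_preds, fusion_model):
    """Decision-level fusion of per-stream segment predictions (one column per
    feature stream, e.g. Function / VGGish / visual CNN), then averaging over
    all overlapping 60 s segments to obtain the session-level prediction."""
    segment_scores = fusion_model.predict(stream_preds)  # SVR fusion per segment
    return float(segment_scores.mean())                  # session = mean of segments

# Illustrative usage with hypothetical data (3 feature streams).
rng = np.random.default_rng(0)
train_streams = rng.uniform(-1, 1, size=(100, 3))  # per-stream LSTM outputs (training)
train_targets = rng.uniform(-1, 1, size=100)       # gold valence values (training)
svr = SVR(kernel="rbf").fit(train_streams, train_targets)

session_streams = rng.uniform(-1, 1, size=(20, 3))  # one session's segment predictions
print(fuse_and_aggregate(session_streams, svr))
```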