Improved Speech Emotion Recognition Using LAM and CTC

2021 
Time-sequence-based speech emotion recognition methods have difficulty distinguishing between emotional and non-emotional frames of speech, and cannot quantify the amount of emotional information carried by emotional frames. In this paper, we propose a speech emotion recognition method that uses a Local Attention Mechanism (LAM) and Connectionist Temporal Classification (CTC) to address these issues. First, we extract the Variational Gammatone Cepstral Coefficients (VGFCC) emotional feature from the speech as the input to the LAM-CTC shared encoder. Second, the CTC layer performs automatic hard alignment, which allows the network to have the largest activation value at the emotional key frames of the speech. The LAM layer learns to weight the emotional auxiliary frames to different degrees. Finally, a BP neural network integrates the decoding outputs of the CTC layer and the LAM layer to obtain the emotion prediction. Evaluation on IEMOCAP shows that the proposed model outperforms the state-of-the-art methods, with a UAR of 68.5% and a WAR of 68.1%.
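The abstract reports UAR and WAR without defining them. The sketch below assumes the definitions commonly used in speech emotion recognition: WAR as the overall fraction of correctly classified utterances, and UAR as the mean of per-class recalls with every class counted equally. The function names and the toy labels are hypothetical and are not taken from the paper or from IEMOCAP.

```python
from collections import Counter


def war(y_true, y_pred):
    """Weighted average recall: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls, classes weighted equally."""
    totals = Counter(y_true)  # number of utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical, deliberately imbalanced toy labels (not IEMOCAP data).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

print(f"WAR = {war(y_true, y_pred):.2f}")
print(f"UAR = {uar(y_true, y_pred):.2f}")
```

On imbalanced data the two metrics can diverge noticeably, because WAR is dominated by the majority class while UAR penalizes poor recall on rare classes. This is presumably why both figures are reported.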