Learning Deep Representation of The Emotion Speech Signal

2021 
This paper aims at learning a deep representation of the emotional speech signal directly from the raw audio clip using a 1D convolutional encoder, and at reconstructing the audio signal using a 1D deconvolutional decoder. The learned deep features, which contain the essential information of the signal, should be robust enough to reconstruct the speech signal. The location of the maximal value within each pooled receptive field of a max-pooling layer is passed to the corresponding unpooling layer for reconstructing the audio clip. Residual learning is adopted to ease the training process. A dual training mechanism was developed to enable the decoder to reconstruct the speech signal from the deep representation more accurately: after the convolutional-deconvolutional encoder-decoder was trained as a whole, the decoder was trained again on the transferred features. Experiments conducted on the Berlin EmoDB and SAVEE databases achieve excellent performance.
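The index-passing unpooling described in the abstract can be made concrete with a short sketch. Below is a minimal PyTorch illustration, not the paper's exact configuration: the layer widths, kernel sizes, clip length, optimizer settings, and MSE reconstruction loss are all assumptions, the paper's residual connections are omitted for brevity, and the two-stage loop is only one plausible reading of the dual training mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """1D convolutional encoder over the raw waveform; max-pool indices
    are returned so the decoder can unpool at the matching locations."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 32, kernel_size=9, padding=4)
        self.conv2 = nn.Conv1d(32, 64, kernel_size=9, padding=4)
        self.pool = nn.MaxPool1d(2, return_indices=True)

    def forward(self, x):
        x, idx1 = self.pool(F.relu(self.conv1(x)))
        x, idx2 = self.pool(F.relu(self.conv2(x)))
        return x, (idx1, idx2)

class Decoder(nn.Module):
    """1D deconvolutional decoder; each unpooling layer reuses the
    argmax indices from the corresponding encoder pooling layer."""
    def __init__(self):
        super().__init__()
        self.unpool = nn.MaxUnpool1d(2)
        self.deconv2 = nn.ConvTranspose1d(64, 32, kernel_size=9, padding=4)
        self.deconv1 = nn.ConvTranspose1d(32, 1, kernel_size=9, padding=4)

    def forward(self, z, indices):
        idx1, idx2 = indices
        x = F.relu(self.deconv2(self.unpool(z, idx2)))
        return self.deconv1(self.unpool(x, idx1))

enc, dec = Encoder(), Decoder()

# Stage 1: train encoder and decoder jointly on reconstruction.
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
x = torch.randn(8, 1, 1024)                 # dummy batch of raw audio clips
z, idx = enc(x)
loss = F.mse_loss(dec(z, idx), x)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2 (one reading of the dual training mechanism): freeze the trained
# encoder and retrain only the decoder on the transferred features.
for p in enc.parameters():
    p.requires_grad = False
opt2 = torch.optim.Adam(dec.parameters(), lr=1e-3)
z, idx = enc(x)
loss2 = F.mse_loss(dec(z, idx), x)
opt2.zero_grad(); loss2.backward(); opt2.step()
```

Passing the argmax locations lets the decoder restore energy at the exact sample positions the encoder selected, which is what makes waveform-level reconstruction plausible despite the information discarded by pooling.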