Estimating Mutual Information in Prosody Representation for Emotional Prosody Transfer in Speech Synthesis

2021 
An end-to-end prosody transfer system aims to transfer the speech prosody from one speaker to another speaker. One major application is the generation of emotional speech with a new speaker’s voice. The end-to-end system uses an intermediate representation of prosody, which encompasses both speaker and emotion related information. The present study tackles the problem of estimating the mutual information between emotion and speaker-related factors in the prosody representation. A mutual information neural estimator (MINE) which could measure the mutual information between high-dimensional continuous prosody embedding and discrete speaker/emotion label is applied. The experimental results show that: 1) the prosody representation generated by the end-to-end system indeed contains both emotion and speaker information; 2) The mutual information would be determined by the type of input acoustic features to the reference encoder; 3) normalization for the log F0 feature is very effective in increasing emotion-related information in the prosody representation; 4) adversarial learning can be applied to reduce speaker information in the prosody representation. These results are useful to the further development of an optimal and practical emotional prosody transfer systems.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    31
    References
    0
    Citations
    NaN
    KQI
    []