Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis

2022 
Through borrowing emotional expressions from an emotional speaker, cross-speaker emotion transfer is an effective way to produce emotional speech for target speakers without emotional training data. Since emotion and timbre of the source speaker are heavily entangled in speech, existing approaches often struggle to trade off between speaker similarity and emotional expression in the synthetic speech of the target speaker. In this letter, we propose to disentangle timbre and emotion through information perturbation to conduct cross-speaker emotion transfer, which effectively learns the emotional expression of the source speaker and maintains the timbre of the target speaker. Specifically, we separately perturb the timbre and emotion-related features (e.g., formant and pitch) of source speech to obtain and model the timbre- and emotion-independent signals, based on which the proposed model can deliver the emotional expression for target speakers. Experimental results demonstrate the proposed approach significantly outperforms the baselines in terms of naturalness and similarity, indicating the effectiveness of information perturbation for cross-speaker emotion transfer.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    37
    References
    0
    Citations
    NaN
    KQI
    []