The Importance Weighted Autoencoder in End-to-End Speech Synthesis

2020 
Style modeling in text-to-speech synthesis aims to enrich the emotional expression of synthesized speech. In this paper, the importance weighted autoencoder (IWAE) is introduced into Tacotron, an end-to-end speech synthesis system, to learn latent representations of speaking style that are independent of text content, in an unsupervised way. Compared with the variational autoencoder (VAE), a generative model that can be used to model the prior data distribution, IWAE has a strictly tighter variational lower bound, derived from importance weighting. In the proposed network, the style representation learned by IWAE is fed into Tacotron to generate more rhythmic speech. Finally, the proposed model outperforms Global Style Token (GST) and VAE-Tacotron in speech quality under both objective and subjective evaluation.
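The difference between the VAE and IWAE bounds can be illustrated with a minimal sketch. It assumes the log importance weights log w_i = log p(x, z_i) - log q(z_i | x) have already been computed for k samples per input; the function names `iwae_bound` and `elbo` and the array layout are illustrative assumptions, not taken from the paper's implementation. The VAE bound averages the log weights, while the IWAE bound takes the log of the averaged weights, which by Jensen's inequality is never smaller.

```python
import numpy as np


def iwae_bound(log_w):
    """Estimate the IWAE bound from log importance weights.

    log_w: array of shape (batch, k), where entry [b, i] is assumed to be
    log p(x_b, z_i) - log q(z_i | x_b) for the i-th sample of z.
    """
    log_w = np.asarray(log_w, dtype=float)
    k = log_w.shape[1]
    # Numerically stable log-mean-exp over the k samples.
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = m[:, 0] + np.log(np.exp(log_w - m).sum(axis=1)) - np.log(k)
    # Average the per-example bounds over the batch.
    return log_mean_w.mean()


def elbo(log_w):
    """Estimate the standard VAE bound: the plain mean of the log weights."""
    return np.asarray(log_w, dtype=float).mean()


if __name__ == "__main__":
    # Hypothetical log weights for 2 inputs with k = 3 samples each.
    log_w = np.array([[-1.0, -2.0, -3.0],
                      [-0.5, -0.5, -4.0]])
    print("ELBO estimate:", elbo(log_w))
    print("IWAE estimate:", iwae_bound(log_w))
```

With k = 1 the two estimates coincide, so the IWAE objective reduces to the VAE objective in that case; the gap only opens up when several samples with differing weights are drawn per input.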