Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

2018 
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
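The abstract describes GSTs as a learned bank of embeddings whose soft combination weights act as interpretable style "labels". Below is a minimal sketch of such a token layer, assuming a PyTorch setting; the class name GSTLayer, the dimensions, and the use of nn.MultiheadAttention over a tanh-squashed token bank are illustrative assumptions for this sketch, not implementation details taken from the abstract.

import torch
import torch.nn as nn

class GSTLayer(nn.Module):
    """Sketch of a global-style-token layer: attention over a token bank."""

    def __init__(self, query_dim=128, num_tokens=10, token_dim=256, num_heads=4):
        super().__init__()
        # Bank of style token embeddings, learned with no explicit labels.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.3)
        # Project the reference summary into the token space to use as a query.
        self.query_proj = nn.Linear(query_dim, token_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=token_dim, num_heads=num_heads, batch_first=True)

    def forward(self, ref_embedding):
        # ref_embedding: (B, query_dim), e.g. a fixed-length summary of a
        # reference audio clip produced by some reference encoder.
        b = ref_embedding.size(0)
        q = self.query_proj(ref_embedding).unsqueeze(1)                 # (B, 1, D)
        bank = torch.tanh(self.tokens).unsqueeze(0).expand(b, -1, -1)   # (B, N, D)
        style, weights = self.attn(q, bank, bank)
        # style: (B, 1, D) style embedding that conditions the synthesizer;
        # weights: (B, 1, N) soft, interpretable "labels" over the tokens.
        return style.squeeze(1), weights.squeeze(1)

# Usage sketch:
layer = GSTLayer()
ref = torch.randn(2, 128)      # stand-in for reference-encoder outputs
style, weights = layer(ref)    # style: (2, 256), weights: (2, 10)

Under these assumptions, the two use cases in the abstract fall out naturally: for style control, one can bypass the reference clip and hand-pick the attention weights (a weighted sum of tokens), varying attributes like speed or speaking style independently of the text; for style transfer, one conditions every utterance on the style embedding computed from a single reference clip.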