Prediction of pronunciation variations for speech synthesis: a data-driven approach

2005 
The fact that speakers vary pronunciations of the same word within their own speech is well known, but little has been done to categorize and predict a speaker's pronunciation distribution automatically for unit selection speech synthesis. Recent work demonstrated how to automatically identify a speaker's choice between full and reduced pronunciations using acoustic modeling techniques from speech recognition. We extend this approach and show how its results can be used to predict a speaker's choice of pronunciation for synthesis. We apply machine learning techniques to the automatically categorized data to produce a pronunciation variation prediction model that requires only the utterance text, allowing the system to synthesize novel phrases with variations like those the speaker would make. Empirical studies show that we can improve the automatic pronunciation labels and successfully use the results to predict pronunciations for newly synthesized utterances. Prediction models trained on these automatic labels perform very similarly to those trained on human-labeled data, allowing us to reduce manual effort while achieving comparable results.
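The sketch below illustrates the kind of pipeline the abstract describes: text-only features for each word token, paired with pronunciation labels produced by an automatic acoustic categorization step, are used to train a classifier that predicts full versus reduced pronunciations for novel text. The feature set, training data, and choice of scikit-learn decision tree are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (hypothetical, not the paper's implementation): predict a
# speaker's full vs. reduced pronunciation choice from text-derived features,
# training on labels assigned by an automatic acoustic categorization step.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical automatically labeled tokens: text-only features paired with
# the pronunciation label inferred from the speaker's recorded realization.
labeled_tokens = [
    ({"word": "and", "pos": "CC", "prev_word": "cats", "next_word": "dogs",
      "phrase_position": "medial"}, "reduced"),
    ({"word": "and", "pos": "CC", "prev_word": "<s>", "next_word": "then",
      "phrase_position": "initial"}, "full"),
    ({"word": "the", "pos": "DT", "prev_word": "of", "next_word": "house",
      "phrase_position": "medial"}, "reduced"),
    ({"word": "the", "pos": "DT", "prev_word": "<s>", "next_word": "answer",
      "phrase_position": "initial"}, "full"),
]

features, labels = zip(*labeled_tokens)

# Encode the symbolic features as one-hot vectors and fit a decision tree,
# one plausible learner for this kind of symbolic prediction task.
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(features)
clf = DecisionTreeClassifier()
clf.fit(X, labels)

# At synthesis time only the utterance text is available: derive the same
# features for a novel token and predict which pronunciation to select.
novel_token = {"word": "and", "pos": "CC", "prev_word": "salt",
               "next_word": "pepper", "phrase_position": "medial"}
print(clf.predict(vectorizer.transform([novel_token]))[0])
```

In a real system the predicted label would be passed to the unit selection engine to bias candidate selection toward units with the chosen pronunciation.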