Statistical model training technique based on speaker clustering approach for HMM-based speech synthesis

2015 
We propose an average voice model training technique using a speaker class. The speaker class is obtained on the basis of speaker clustering. The average voice model is trained using the conventional contextual factors and the speaker class. In the speaker adaptation process, the target speaker's speaker class is estimated. Our proposal can synthesize speech with better similarity and naturalness.

This paper proposes an average voice model training technique based on a speaker clustering approach to generate synthetic speech with enhanced similarity to the target speakers' speech. A novel point of the proposed technique is the use of speaker characteristics (called the "speaker class"), which are obtained from unsupervised clustering, as an additional contextual factor for average-voice-based speech synthesis. In the model training process, speaker clustering is first performed over all speakers used for model training to obtain the speaker class of each speaker. The average voice model with multiple speaker characteristics is then trained by using the obtained speaker classes. For speaker adaptation and speech parameter generation, the speaker class of the target speaker is estimated on the basis of the Euclidean distance between the centroid of each cluster and the target speaker's feature. Using the estimated speaker class makes it possible to draw on model parameters whose speaker characteristics are similar to those of the target speaker during speaker adaptation and speech parameter generation. The results of objective and subjective experiments indicated that the proposed technique can synthesize speech with improved similarity and naturalness.
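
The clustering and speaker class estimation steps described above can be illustrated with a minimal sketch. The abstract says only that the clustering is unsupervised and that the target speaker's class is chosen by Euclidean distance to the cluster centroids. The use of k-means, the per-speaker feature vectors, and the function names `kmeans` and `estimate_speaker_class` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def assign(features, centroids):
    """Return the index of the nearest centroid (Euclidean) for each row."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(features, k, n_iter=50, seed=0):
    """Plain k-means over per-speaker feature vectors (assumed clustering method).

    features: array of shape (n_speakers, dim), one vector per training speaker.
    Returns (centroids, labels), where labels[i] is the speaker class of speaker i.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct training speakers.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(features, centroids)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            features[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(features, centroids)


def estimate_speaker_class(target_feature, centroids):
    """Pick the class whose centroid is closest to the target speaker's feature."""
    return int(np.linalg.norm(centroids - target_feature, axis=1).argmin())
```

In this sketch, the labels returned by `kmeans` would play the role of the speaker class attached to each training speaker as an extra contextual factor, and `estimate_speaker_class` would be applied to the target speaker's feature vector before adaptation and parameter generation.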