A Novel Method for Prosody Prediction in Voice Conversion
Citations: 42 | References: 12 | Related papers: 10
Abstract:
Most published voice conversion schemes do not model prosody in detail and only control the F0 level and range. However, detailed prosody can also carry a significant amount of speaker-identity-related information. This paper introduces a new method for converting prosody in voice conversion. A syllable-based prosodic codebook is used to predict the converted F0 using not only the source contour but also linguistic information and segmental durations. The most suitable target contour is selected using a trained classification and regression tree. The F0 contours in the codebook are represented in a transformed domain, which allows compression and fast comparison. The performance of the prosodic conversion is evaluated in a real voice conversion system. The results indicate a significant improvement in speaker identity and naturalness compared with a GMM (Gaussian mixture model) based pitch prediction approach.
Keywords:
Pitch contour
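The "F0 level and range" control that the abstract above contrasts with detailed prosody modeling is commonly implemented as a mean and variance transformation of log-F0. The following is a minimal sketch of that common baseline, not the paper's proposed method; the function names and the convention that unvoiced frames are stored as 0 are assumptions made for illustration.

```python
import numpy as np

def log_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (f0 > 0)."""
    f0 = np.asarray(f0, dtype=float)
    log_f0 = np.log(f0[f0 > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0_level_range(f0_src, src_stats, tgt_stats):
    """Shift and scale source log-F0 so its level and range match the target.

    Unvoiced frames (assumed to be marked with 0) are left at 0.
    """
    f0_src = np.asarray(f0_src, dtype=float)
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    out[voiced] = np.exp((log_f0 - mu_s) / sd_s * sd_t + mu_t)
    return out
```

Because only two statistics per speaker are used, the shape of each syllable's contour is carried over unchanged from the source, which is the limitation the abstract points to.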
Generating near-to-natural F0 contours is an important issue in text-to-speech synthesis and contributes vastly to the quality of synthetic speech. In earlier studies by the authors, a model of German intonation was developed that is based on the quantitative Fujisaki model. A typical F0 contour is described as a sequence of major rises and falls, which are modeled by onsets and offsets of accent commands connected to accented syllables. The current paper addresses perception experiments comparing the intonational naturalness of a Fujisaki-model-based TTS and four other German TTS systems with comparably high segmental quality. Natural speech samples were used as a reference. Three of the TTS systems had PSOLA segmentals and one had LPC segmentals. Two types of experiments were conducted with 20 subjects: (1) a pair comparison of 15 isolated sentences, and (2) a ranking test based on a news passage of about 15 sec produced with each of the systems. Preliminary results from experiment (1) show that, on a naturalness scale from 0 to 5, the natural speech samples reach a maximum score of 4.5, with a value of 2.8 for the best synthesis, the LPC-based one. The system with Fujisaki-model-based intonation leads the group of PSOLA systems, which is closely clustered at a mean of 2.4.
Intonation
Stress
Pitch contour
Citations (1)
Pitch contour
Citations (0)
Prosody plays a crucial role in speech synthesis. To make synthesized speech sound more natural and pleasant to the human ear, various prosody and intonation models, together with emotional models, have been tried over the last few decades. Apart from segmental quality and voice characteristics, the naturalness of any TTS system depends mostly on the quality of its prosody model. Because it is very hard to evaluate a prosody model objectively, this work adopts a perceptual comparison method for the evaluation.
Intonation
Chinese speech synthesis
Citations (2)
Alternation (linguistics)
Foot (prosody)
Sequence (biology)
Citations (1)
This paper presents an approach to hierarchical prosody conversion for emotional speech synthesis. The pitch contour of the source speech is decomposed into a hierarchical prosodic structure consisting of sentence, prosodic word, and subsyllable levels. The pitch contour at the higher level is encoded by discrete Legendre polynomial coefficients. The residual, that is, the difference between the source pitch contour and the pitch contour decoded from the discrete Legendre polynomial coefficients, is then used for pitch modeling at the lower level. For prosody conversion, Gaussian mixture models (GMMs) are used for sentence- and prosodic word-level conversion. At the subsyllable level, the pitch feature vectors are clustered via a proposed regression-based clustering method to generate the prosody conversion functions for selection. Linguistic and symbolic prosody features of the source speech are used to select the most suitable function with a classification and regression tree. Three small emotional parallel speech databases, with happy, angry, and sad emotions respectively, were designed and collected for training and evaluation. Objective and subjective evaluations were conducted. Compared with the GMM-based method for prosody conversion, the hierarchical prosodic structure and the proposed regression-based clustering method achieved improved performance.
Pitch contour
Feature (linguistics)
Hierarchical clustering
Citations (58)
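The encode-and-residual step described in the abstract above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses a least-squares fit of ordinary Legendre polynomials on a uniform grid (via `numpy.polynomial.legendre`) as a stand-in for the discrete Legendre polynomials named in the abstract, and the function names and the polynomial order are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_encode(f0, order=3):
    """Encode a pitch contour as Legendre coefficients plus a residual.

    The contour is placed on a uniform grid over [-1, 1]; the residual is
    what the low-order fit cannot explain and would be passed to the next
    (lower) prosodic level.
    """
    f0 = np.asarray(f0, dtype=float)
    x = np.linspace(-1.0, 1.0, len(f0))
    coeffs = legendre.legfit(x, f0, order)   # least-squares fit
    smooth = legendre.legval(x, coeffs)      # decoded higher-level contour
    return coeffs, f0 - smooth

def legendre_decode(coeffs, n_frames):
    """Rebuild the smooth contour from its coefficients on n_frames points."""
    x = np.linspace(-1.0, 1.0, n_frames)
    return legendre.legval(x, coeffs)
```

Adding the residual back to the decoded contour reproduces the input, so the decomposition loses no information; only the split between levels depends on the chosen order.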
Voice Conversion (VC) aims to convert one speaker's voice to sound like that of another. So far, most voice conversion frameworks focus mainly on spectrum conversion. We note that speaker identity is also characterized by prosody features such as fundamental frequency (F0), energy contour, and duration. Motivated by this, we propose a framework that can perform F0, energy contour, and duration conversion. In the traditional exemplar-based sparse representation approach to voice conversion, a general source-target dictionary of exemplars is constructed to establish the correspondence between source and target speakers. In this work, we propose a Phonetically Aware Sparse Representation of the fundamental frequency and energy contour using the Continuous Wavelet Transform (CWT). Our idea is motivated by the fact that CWT decompositions of F0 and energy contours describe prosody patterns at different temporal scales and allow for effective prosody manipulation in speech synthesis. Furthermore, phonetically aware exemplars lead to better estimation of the activation matrix and therefore possibly to better prosody conversion. We also propose a phonetically aware duration conversion framework that takes into account both phone-level and sentence-level speaking rates. We report that the proposed prosody conversion outperforms traditional prosody conversion techniques in both objective and subjective evaluations.
Pitch contour
Representation
Citations (35)
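To make the idea of describing a contour "at different temporal scales" concrete, the sketch below computes a small CWT of a one-dimensional contour with plain NumPy. It is only an illustration: the Mexican hat mother wavelet, the five dyadic scales (in frames), the zero padding at the edges, and the function names are all assumptions, and the input is assumed to be an already interpolated, continuous contour.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet evaluated at times t."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def cwt_decompose(contour, scales=(2, 4, 8, 16, 32)):
    """Return one row of wavelet coefficients per scale.

    Each row has the same length as the input. Small scales respond to
    fast, local movements and large scales to slow, phrase-like trends.
    """
    x = np.asarray(contour, dtype=float)
    rows = []
    for s in scales:
        half = int(5 * s)                        # support of +/- 5 scale units
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)     # scale-normalized wavelet
        full = np.convolve(x, kernel, mode="full")
        rows.append(full[half:half + len(x)])    # keep the centered part
    return np.vstack(rows)
```

Each row can then be treated as a separate feature stream, which is what makes scale-wise manipulation of prosody possible.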
Observed frequently in human-human interactions, entrainment is a social phenomenon in which speakers become more like each other over the course of a conversation. Acoustic-prosodic entrainment occurs when individuals adapt their acoustic-prosodic speech features, such as pitch and intensity. Because entrainment is correlated with communicative success, naturalness, and conversational flow, as well as with social variables such as rapport, a dialogue system that automatically entrains has the potential to improve verbal interactions by increasing all of these. In an application like the learning companion, such a socially responsive dialogue system may improve learning and motivation. However, it is not clear how to produce entrainment in an automatic dialogue system in ways that produce the effects seen in human-human dialogue. In this paper, we take the first steps towards implementing a spoken dialogue system that can entrain. We propose three methods of pitch adaptation based on analysis of human entrainment, and design and implement a system that can adaptively manipulate the pitch of text-to-speech output. We find a clear relationship between perceptions of rapport and different forms of pitch adaptation. Certain adaptations are perceived as significantly more natural and rapport-like. Ultimately, adapting by shifting the pitch contour of the text-to-speech output by the mean pitch of the user results in the highest reported measures of rapport and naturalness.
Entrainment (biomusicology)
Pitch contour
Citations (29)
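One plausible reading of the mean-shift adaptation mentioned at the end of the abstract above is to move the whole text-to-speech contour by a constant offset so that its mean matches the user's mean pitch. The sketch below follows that reading; it is an assumption, not the authors' system, and the function name and the use of 0 for unvoiced frames are hypothetical.

```python
import numpy as np

def entrain_by_mean_shift(tts_f0, user_f0):
    """Shift the voiced TTS frames so their mean equals the user's mean pitch.

    Frames with value 0 are treated as unvoiced and left unchanged. The
    contour shape is preserved; only its overall level moves.
    """
    tts_f0 = np.asarray(tts_f0, dtype=float)
    user_f0 = np.asarray(user_f0, dtype=float)
    tts_voiced = tts_f0 > 0
    user_voiced = user_f0 > 0
    if not tts_voiced.any() or not user_voiced.any():
        return tts_f0.copy()                     # nothing to adapt to
    offset = user_f0[user_voiced].mean() - tts_f0[tts_voiced].mean()
    out = tts_f0.copy()
    out[tts_voiced] += offset
    return out
```

A real system would also need to keep the shifted values inside a plausible pitch range for the synthesis voice, which this sketch does not attempt.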