Database Mining for Flexible Concatenative Text-to-Speech

2007 
In this paper we explore mining a concatenative text-to-speech database to exploit subtle, naturally occurring stylistic and contextual variability for runtime synthesis. When a desired style or context is made known to the search during synthesis, the cost function can be biased toward units that satisfy these additional criteria. The ability to bias the synthesizer's output toward a particular voice quality, or another characteristic such as speaking rate, increases its flexibility and potential value. We illustrate the approach to synthesizing subtle speech variation by focusing on three aspects: prosodic structure (phrase-finalness), prosodic prominence (prosodic accent), and voice quality (breathiness). Target values for the first two are generated automatically, while the target value for breathiness is specified by the user. We present results that indicate the value of distinguishing our data along these dimensions, and we discuss possible improvements and future uses.
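
The idea of biasing the search cost toward units that match a requested style can be illustrated with a minimal sketch. This is not the paper's implementation: the class names, the feature encoding (two booleans and a breathiness score in [0, 1]), the weights, and the precomputed `base_costs` are all assumptions made for illustration, and a real unit-selection search would also account for join costs across a whole utterance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A candidate database unit with hypothetical style annotations."""
    unit_id: str
    phrase_final: bool
    accented: bool
    breathiness: float  # assumed score in [0, 1]


@dataclass(frozen=True)
class Target:
    """Desired style for one position in the utterance (hypothetical)."""
    phrase_final: bool
    accented: bool
    breathiness: float


# Illustrative weights only; not taken from the paper.
DEFAULT_WEIGHTS = {"phrase_final": 1.0, "accented": 1.0, "breathiness": 2.0}


def style_cost(unit, target, weights=DEFAULT_WEIGHTS):
    """Penalty added to the usual cost when a unit mismatches the target style."""
    cost = 0.0
    if unit.phrase_final != target.phrase_final:
        cost += weights["phrase_final"]
    if unit.accented != target.accented:
        cost += weights["accented"]
    cost += weights["breathiness"] * abs(unit.breathiness - target.breathiness)
    return cost


def select_unit(candidates, target, base_costs, weights=DEFAULT_WEIGHTS):
    """Pick the candidate minimizing base cost plus the style penalty."""
    return min(
        candidates,
        key=lambda u: base_costs[u.unit_id] + style_cost(u, target, weights),
    )


candidates = [
    Unit("a", phrase_final=False, accented=False, breathiness=0.1),
    Unit("b", phrase_final=True, accented=True, breathiness=0.8),
]
base_costs = {"a": 0.2, "b": 0.5}  # stand-in for the conventional target cost
target = Target(phrase_final=True, accented=True, breathiness=0.9)

biased_choice = select_unit(candidates, target, base_costs)
no_bias = {"phrase_final": 0.0, "accented": 0.0, "breathiness": 0.0}
unbiased_choice = select_unit(candidates, target, base_costs, no_bias)
```

With the style weights set to zero the search falls back to the unit with the lowest base cost, while nonzero weights shift the choice toward the unit whose annotations match the requested style.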