Modeling Polyphone Context Withweighted Finite-State Transducers

2006 
As coarticulation effects are prevalent in all speech, a phone must be modeled in its context to achieve optimal performance in large vocabulary continuous speech recognition systems. Schuster and Hori (2005) proposed a technique for modeling polyphone context with weighted finite-state transducers whereby all valid three-state sequences of Gaussian mixture models are enumerated, and there-after the possible connections between these three-state sequences are determined. Hence, the explicit modeling of all possible polyphones is avoided. Rather, Schuster and Hori derive a transducer HC that translates from sequences of Gaussian mixture models directly to phone sequences. The resulting network HC o L o G is much smaller than the conventional network H o C o L o G proposed by Mohri et al (1998). While Schuster and Hori's approach to modeling polyphone context is quite interesting, it is incorrect for contexts larger than triphones. In this work, we correct the errors of Schuster and Hori. Thereafter we discuss how the intermediate size of the network HC can be held in check. We also present the results of a set of experiments comparing network size and speech recognition performance for networks obtained with Schuster and Hori's technique and with the correct technique
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    8
    References
    2
    Citations
    NaN
    KQI
    []