
Recognizing sloppy speech

2005 
As speech recognition moves from the lab into the real world, sloppy speech emerges as a major challenge. Sloppy speech, or conversational speech, is the speaking style people typically use in daily conversation. Its recognition error rate has been found to be double that of read speech in many circumstances. Previous work on sloppy speech has focused on modeling pronunciation changes, primarily by adding pronunciation variants to the dictionary. The resulting improvement, unfortunately, has been unsatisfactory. To improve recognition performance on sloppy speech, we revisit pronunciation modeling and focus on implicit pronunciation modeling, in which we keep the dictionary simple and model reductions through phonetic decision trees and other acoustic modeling mechanisms. A second front of this thesis is to alleviate known limitations of the current HMM framework, such as the frame independence assumption, which sloppy speech can aggravate. Three novel approaches have been explored: (1) Flexible parameter tying. We show that parameter tying is an integral part of pronunciation modeling, and introduce flexible tying to better model reductions in sloppy speech. We find that enhanced tree clustering, together with a single pronunciation dictionary, improves performance significantly. (2) Gaussian transition modeling. Modeling transitions between Gaussians in adjacent states alleviates the frame independence assumption and can be regarded as a pronunciation network at the Gaussian level. (3) Thumbnail features. We try to achieve segmental modeling within the HMM framework by using these segment-level features. While they improve performance significantly in initial passes, the gain becomes marginal when combined with more sophisticated acoustic modeling techniques.
We have also worked on system development for three large-vocabulary tasks: Broadcast News, Switchboard, and meeting transcription. By empirically improving all aspects of speech recognition, from front-ends to acoustic modeling and decoding strategies, we have achieved a 50% relative improvement on the Broadcast News task, a 38% relative improvement on the Switchboard task, and a 40% relative improvement on the meeting transcription task.
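To make the "relative improvement" figures concrete, here is a minimal sketch of how such a number is usually computed. It assumes the metric is word error rate (WER), which the abstract does not state explicitly. The function name and the error rates below are hypothetical illustrations, not results from the thesis.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the error reduction as a fraction of the baseline error rate."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical error rates (in percent), chosen only to illustrate the arithmetic.
baseline, improved = 40.0, 20.0
print(f"{relative_improvement(baseline, improved):.0%} relative improvement")
```

Under this reading, a 50% relative improvement means the error rate was halved, whatever its absolute starting value.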