Phone Synchronous Speech Recognition With CTC Lattices
2017
Connectionist temporal classification (CTC) has recently shown improved performance and efficiency in automatic speech recognition. One popular decoding implementation uses a CTC model to predict phone posteriors at each frame and then performs Viterbi beam search on a modified WFST network, which remains within the traditional frame synchronous decoding framework. In this paper, the peaky posterior property of CTC is carefully investigated, and it is found that ignoring blank frames does not introduce additional search errors. Based on this observation, a novel phone synchronous decoding framework is proposed that removes the substantial search redundancy caused by blank frames, resulting in a significant search speedup. The framework naturally leads to an extremely compact phone-level acoustic space representation: the CTC lattice. Building on the CTC lattice, efficient and effective modular speech recognition approaches are also proposed, including second-pass rescoring for large vocabulary continuous speech recognition (LVCSR) and phone-based keyword spotting (KWS). Experiments showed that phone synchronous decoding can achieve a 3-4 times search speedup without performance degradation compared to frame synchronous decoding. Modular LVCSR with the CTC lattice achieved further WER improvement. KWS with the CTC lattice not only achieved a significant equal error rate improvement but also greatly reduced the KWS model size and increased search speed.
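The key observation above, that blank-dominated frames can be skipped before search, can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, blank threshold, and array shapes below are illustrative assumptions only, showing how filtering peaky CTC posteriors shrinks the search space.

```python
# Minimal sketch (assumed, not from the paper): drop CTC frames whose blank
# posterior dominates, keeping only the phone "spike" frames for search.
import numpy as np

def drop_blank_frames(posteriors: np.ndarray, blank_id: int = 0,
                      blank_threshold: float = 0.9) -> np.ndarray:
    """Keep only frames whose blank posterior is below the threshold.

    posteriors: (T, V) frame-level CTC posteriors, V including the blank symbol.
    Returns the reduced (T', V) matrix on which phone-level search can run.
    """
    keep = posteriors[:, blank_id] < blank_threshold
    return posteriors[keep]

# Toy example: with peaky CTC outputs, most frames are blank-dominated,
# so the frame sequence collapses to a handful of phone spikes.
T, V = 100, 50
post = np.full((T, V), 1e-4)
post[:, 0] = 0.99                       # blank dominates nearly every frame
post[[10, 40, 70], 0] = 0.1             # three phone spikes
post[10, 5] = post[40, 7] = post[70, 9] = 0.9
reduced = drop_blank_frames(post)
print(post.shape, "->", reduced.shape)  # (100, 50) -> (3, 50)
```

In this toy case the 100-frame utterance reduces to 3 phone-level events, which is the kind of redundancy removal that motivates phone synchronous decoding and the compact CTC lattice.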