Neural Speech-to-Text Language Models for Rescoring Hypotheses of DNN-HMM Hybrid Automatic Speech Recognition Systems
2018
In this paper, we propose to leverage end-to-end automatic speech recognition (ASR) systems for assisting deep neural network-hidden Markov model (DNN-HMM) hybrid ASR systems. The DNN-HMM hybrid ASR system, which is composed of an acoustic model, a language model and a pronunciation model, is known to be the most practical architecture in the ASR field. On the other hand, much attention has been paid in recent studies to end-to-end ASR systems that are fully composed of neural networks. It is known that they can yield comparable performance without introducing heuristic operations. However, one problem is that end-to-end ASR systems sometimes suffer from redundant generation and omission of important words in the text generation phase. This is because these systems cannot explicitly consider the connection between the input speech and the output text. Therefore, our idea is to regard end-to-end ASR systems as neural speech-to-text language models (NS2TLMs) and to use them for rescoring hypotheses generated by DNN-HMM hybrid ASR systems. This enables us to leverage end-to-end ASR systems while avoiding the generation issues because the DNN-HMM hybrid ASR systems can generate speech-aligned hypotheses. It is expected that the NS2TLMs improve the DNN-HMM hybrid ASR systems because end-to-end ASR systems correctly handle short-duration utterances. In our experiments, we use state-of-the-art DNN-HMM hybrid ASR systems with convolutional and long short-term memory recurrent neural network acoustic models, and end-to-end ASR systems based on an attentional encoder-decoder. We demonstrate that our proposed method can yield better ASR performance than both the DNN-HMM hybrid ASR system and the end-to-end ASR system.
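The rescoring idea described in the abstract can be illustrated with a minimal sketch: take the N-best hypotheses produced by the DNN-HMM hybrid system and re-rank them by interpolating the hybrid system's score with the NS2TLM score (the attentional encoder-decoder's log-probability of the hypothesis text given the speech). The function names, the `Hypothesis` container, and the interpolation weight `lam` below are assumptions for illustration, not details taken from the paper.

```python
# Sketch of N-best rescoring with an end-to-end ASR model used as a
# neural speech-to-text language model (NS2TLM). Hypothetical interfaces.
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hypothesis:
    text: str
    hybrid_score: float  # score assigned by the DNN-HMM hybrid ASR system


def ns2tlm_log_prob(speech_features: Sequence[Sequence[float]], text: str) -> float:
    """Placeholder for the encoder-decoder score log p(text | speech).

    In practice this would run the attentional encoder-decoder over the
    input speech and sum the per-token log-probabilities of `text`.
    """
    raise NotImplementedError


def rescore(speech_features, nbest: List[Hypothesis], lam: float = 0.5) -> Hypothesis:
    """Return the hypothesis maximizing the interpolated score."""
    def combined(h: Hypothesis) -> float:
        return (1.0 - lam) * h.hybrid_score + lam * ns2tlm_log_prob(speech_features, h.text)

    return max(nbest, key=combined)
```

Because the candidate hypotheses come from the hybrid system, they are already aligned to the speech, so the end-to-end model only scores well-formed candidates rather than generating text freely; this is what avoids the redundant-generation and omission issues mentioned above.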