Multi-speaker Sequence-to-sequence Speech Synthesis for Data Augmentation in Acoustic-to-word Speech Recognition

2019 
Acoustic-to-word (A2W) automatic speech recognition (ASR) enables very fast decoding with a simple architecture and achieves state-of-the-art performance. However, the A2W model suffers from the out-of-vocabulary (OOV) word problem and cannot use text-only data to improve its language modeling capability. Meanwhile, sequence-to-sequence neural speech synthesis has also advanced and now achieves naturalness comparable to human speech. We investigate leveraging sequence-to-sequence neural speech synthesis to augment training data for an ASR system in a target domain. While a speech synthesis model is usually trained on single-speaker data, ASR needs to cover a variety of speakers. In this work, we extend the speech synthesizer so that it can output the speech of many speakers. The multi-speaker speech synthesizer is trained with a large corpus in the source domain and then used to generate acoustic features from texts of the target domain. These synthesized speech features are combined with real speech features of the source domain to train an attention-based A2W model. Experimental results show that the A2W model trained with the multi-speaker synthesizer achieves a significant improvement over the baseline and the single-speaker model.
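A minimal sketch of the augmentation pipeline described in the abstract, under loose assumptions: the class and function names (MultiSpeakerTTS, A2WModel, synthesize) are hypothetical stand-ins rather than the authors' code, and random tensors take the place of real mel-spectrograms and word targets. It only illustrates the flow of synthesizing target-domain features with sampled speakers and pooling them with real source-domain data before training.

```python
# Hypothetical sketch of multi-speaker TTS data augmentation for A2W training.
# Stubs stand in for the real synthesizer and attention-based A2W model.
import random
import torch
import torch.nn as nn

class MultiSpeakerTTS:
    """Stub for a multi-speaker seq2seq synthesizer trained on the source domain."""
    def __init__(self, num_speakers=100, mel_dim=80):
        self.num_speakers = num_speakers
        self.mel_dim = mel_dim

    def synthesize(self, text, speaker_id):
        # A real synthesizer would condition on a speaker embedding and emit a
        # mel-spectrogram; here a random tensor serves as a placeholder.
        num_frames = 10 * max(len(text.split()), 1)
        return torch.randn(num_frames, self.mel_dim)

class A2WModel(nn.Module):
    """Toy encoder/classifier standing in for the attention-based A2W model."""
    def __init__(self, mel_dim=80, vocab_size=20000):
        super().__init__()
        self.encoder = nn.LSTM(mel_dim, 256, batch_first=True)
        self.classifier = nn.Linear(256, vocab_size)

    def forward(self, feats):
        out, _ = self.encoder(feats)
        return self.classifier(out)

# 1) Real source-domain examples: (mel features, word-id targets), random here.
real_data = [(torch.randn(120, 80), torch.randint(0, 20000, (12,)))
             for _ in range(4)]

# 2) Synthesize features from target-domain text with randomly sampled speakers.
tts = MultiSpeakerTTS()
target_texts = ["set an alarm for seven", "play some jazz music"]
synthetic_data = []
for text in target_texts:
    spk = random.randrange(tts.num_speakers)
    feats = tts.synthesize(text, speaker_id=spk)
    words = torch.randint(0, 20000, (len(text.split()),))  # placeholder word ids
    synthetic_data.append((feats, words))

# 3) Pool real and synthetic examples and train the A2W model on the mixture.
training_pool = real_data + synthetic_data
model = A2WModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for feats, words in training_pool:
    logits = model(feats.unsqueeze(0))  # (1, T, vocab)
    # A real recipe would use an attention decoder with cross-entropy over the
    # word targets; this dummy loss just exercises the training loop.
    loss = logits.mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```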