Semi-supervised Training for End-to-end Models via Weak Distillation

Bo Li,Ruoming Pang,Tara N. Sainath,Zelin Wu

Semi-supervised Training for End-to-end Models via Weak Distillation

2019

End-to-end (E2E) models are a promising research direction in speech recognition, as the single all-neural E2E system offers a much simpler and more compact solution compared to a conventional model, which has a separate acoustic (AM), pronunciation (PM) and language model (LM). However, it has been noted that E2E models perform poorly on tail words and proper nouns, likely because the end-to-end optimization requires joint audio-text pairs, and does not take advantage of additional lexicons and large amounts of text-only data used to train the LMs in conventional models. There has been numerous efforts in training an RNN-LM on text-only data and fusing it into the end-to-end model. In this work, we contrast this approach to training the E2E model with audio-text pairs generated from unsupervised speech data. To target the proper noun issue specifically, we adopt a Part-of-Speech (POS) tagger to filter the unsupervised data to use only those with proper nouns. We show that training with filtered unsupervised-data provides up to a 13% relative reduction in word-error-rate (WER), and when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations