An Equal Data Setting for Attention-Based Encoder-Decoder and HMM/DNN Models: A Case Study in Finnish ASR
2021
Standard end-to-end training of attention-based encoder-decoder (AED) ASR models uses only transcribed speech. When such models are compared to HMM/DNN systems, which additionally leverage a large corpus of text-only data and expert-crafted lexica, the differences in modeling cannot be disentangled from the differences in data. We propose an experimental setup in which only transcribed speech is used to train both model types. To highlight the difference that text-only data can make, we use Finnish, where an expert-crafted lexicon is not needed. With 1500 hours of equal data, we find that both ASR paradigms perform similarly, but adding text-only data quickly improves the HMM/DNN system. On a smaller 160-hour subset, HMM/DNN models outperform AED models.