Automatic Segmented-Syllable and Deep Learning-Based Indonesian Audiovisual Speech Recognition

2020 
Many studies have shown that audiovisual speech recognition systems outperform audio-only or visual-only ones. Nevertheless, three crucial components must be carefully designed: the combination of the audio and visual modalities, the acoustic model, and the feature extraction. In this paper, a deep learning-based Indonesian audiovisual speech recognition (INAVSR) system is developed. It combines two models: Indonesian audio speech recognition (INASR) and Indonesian visual speech recognition (INVSR). The INASR is built using Mel frequency cepstral coefficient (MFCC) features together with Mozilla DeepSpeech (MDS) and the Kaituoxu Speech-Transformer (KST), whereas the INVSR is implemented using LipNet. A simple procedure for automatic syllable segmentation of the visual data is proposed; it addresses the out-of-vocabulary (OOV) word problem when recognizing speech in sentence-level video. Evaluation on a small dataset shows that the DeepSpeech-based INASR yields a relatively low word error rate (WER) of 22.0%, while the LipNet-based INVSR gives a somewhat higher WER of 30.8%. The proposed automatic syllable segmentation is able to handle OOV words. Finally, an evaluation on a video dataset shows that the INAVSR system is capable of recognizing audiovisual speech in sentence-level video. Compared to the INASR, the INAVSR performs slightly better, achieving an absolute WER reduction of up to 2.0%.
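The acoustic front end of the INASR is MFCC-based. As a minimal sketch of how such features are typically extracted, assuming the standard librosa implementation and a hypothetical input file (the abstract does not specify the frame length, hop size, or number of coefficients actually used in the paper):

```python
import librosa

# Hypothetical utterance path; 16 kHz is a common rate for speech, but the
# paper's exact sampling and MFCC settings are assumptions here.
signal, sr = librosa.load("utterance.wav", sr=16000)

# One 13-dimensional MFCC vector per analysis frame (librosa defaults).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames)
```

These per-frame vectors are the input features a model such as DeepSpeech or a Speech-Transformer consumes.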
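Both subsystems are evaluated by word error rate, i.e., the word-level edit distance (substitutions, insertions, deletions) between the hypothesis and the reference transcript, normalized by the reference length. A self-contained sketch of that standard metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

By this definition, the reported 22.0% WER means roughly one word in five is substituted, inserted, or deleted relative to the reference transcript.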