An Efficient Acoustic Model Based on Deep Feedforward Sequential Memory Networks and Convolutional Neural Networks

2021 
Automatic Speech Recognition (ASR) is an interdisciplinary field with close ties to everyday life and learning. The deep feedforward sequential memory network (DFSMN) is an acoustic model framework proposed by Alibaba in 2018 that introduces skip connections between the memory blocks of adjacent layers. Compared with the traditional LSTM and BLSTM frameworks, DFSMN alleviates the vanishing gradient problem when used to build very deep structures. In this paper, we propose a new acoustic model that combines a multi-layer DFSMN with a multi-layer CNN and adjusts the structure of the pooling and normalization layers. Our work also includes collecting and processing data sets such as THCHS-30, ST-CMDS and Primewords, extracting audio features with Fbank, and designing and implementing the network with Connectionist Temporal Classification (CTC) as the cost function. Experimental results show that the proposed DFSMN-CNN-CTC model improves the character error rate (CER) by 13%.
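The CER reported above is conventionally computed as the character-level edit distance between the reference transcript and the recognized hypothesis, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation script; the function names `edit_distance` and `cer` are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters here)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1      # drop a reference character
            insertion = curr[j - 1] + 1  # extra hypothesis character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For a Mandarin transcript such as those in THCHS-30, `cer("今天天气很好", "今天天汽很好")` counts one substituted character out of six. Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.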