Learning Frame-Level Recurrent Neural Networks Representations for Query-by-Example Spoken Term Detection on Mobile Devices

2018 
Recurrent neural network (RNN) acoustic models (AMs) with long short-term memory (LSTM) have achieved state-of-the-art performance in large-vocabulary continuous speech recognition (LVCSR). Their strong ability to capture context information makes the acoustic features extracted from an LSTM more discriminative. Feature extraction is also crucial to query-by-example spoken term detection (QbyE-STD), especially at the frame level. In this paper, we explore frame-level recurrent neural network representations for QbyE-STD that are more robust than the original features. In addition, the designed model is lightweight, meeting the small-footprint requirements of mobile devices. First, we use a traditional recurrent autoencoder (RAE) to extract frame-level representations and a correspondence RAE to suppress non-semantic information. We then combine the two models to extract more discriminative features. Common tricks such as frame skipping are used to help the model learn more context information. Experiments and evaluations show that the proposed methods outperform conventional ones under the same computational requirements.
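As a rough illustration of the frame-level RAE idea, here is a minimal sketch in PyTorch: an LSTM encoder whose per-frame hidden states serve as the learned representations, trained to reconstruct the input frames. The class name (FrameRAE), the dimensions (39-dimensional MFCC-style inputs, 64 hidden units), and the plain reconstruction objective are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FrameRAE(nn.Module):
    """Minimal LSTM recurrent autoencoder: the encoder's hidden state
    at each frame is taken as the frame-level representation."""
    def __init__(self, feat_dim=39, hidden_dim=64):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):               # x: (batch, frames, feat_dim)
        z, _ = self.encoder(x)          # z: frame-level representations
        y, _ = self.decoder(z)
        return self.proj(y), z          # reconstruction, features

model = FrameRAE()
frames = torch.randn(8, 100, 39)        # a toy batch of 100-frame utterances
recon, feats = model(frames)
loss = nn.functional.mse_loss(recon, frames)  # plain RAE objective: reconstruct the input
```

A correspondence RAE would keep the same structure but replace the reconstruction target with aligned frames from a different utterance of the same term, pushing the encoder to discard speaker- and channel-specific (non-semantic) detail.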