On the importance of modeling and robustness for deep neural network feature

2015 
A large body of research has shown that acoustic features for speech recognition can be learned from data using neural networks with multiple hidden layers (DNNs) and that these learned features are superior to standard features (e.g., MFCCs). However, this superiority is usually demonstrated when the data used to learn the features is very similar in character to the data used to test recognition performance. An open question is how well these learned features generalize to realistic data that differs in character from their training data. The ability of a feature representation to generalize to unfamiliar data is a highly desirable form of robustness. In this paper we investigate the robustness of two DNN-based feature sets to training/test mismatch using the ICSI meeting corpus. The experiments were performed under three training/test scenarios: (1) matched near-field, (2) matched far-field, and (3) the mismatched condition, near-field training with far-field testing. The experiments leverage simulation and a novel sampling process that we have developed for diagnostic analysis within the HMM-based speech recognition framework. First, diagnostic analysis shows that a DNN-based feature representation that uses MFCC inputs (MFCC-DNN) is indeed superior to the corresponding MFCC baselines in the two matched scenarios, where recognition errors stem from incorrect models, but the DNN-based features and MFCCs have nearly identical and poor performance in the mismatched scenario. Second, we show that a DNN-based feature representation that uses more robust inputs, namely the power-normalized spectrum (PNS) and Gabor filters, performs nearly as well as the MFCC-DNN features in the matched scenarios and much better than both MFCCs and MFCC-DNNs in the mismatched scenario.
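To make the abstract's setup concrete, the sketch below shows the general shape of a DNN feature extractor of the kind described: MFCC frames are spliced with temporal context and passed through hidden layers, and a hidden layer's activations serve as the learned features. This is a minimal illustration under assumed settings; the layer sizes, splicing context, sigmoid nonlinearity, and random weights are illustrative choices, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def splice(frames, context=5):
    """Stack each frame with +/- `context` neighbors (edge-padded),
    a common way to give a DNN temporal context in ASR front ends."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

def dnn_features(x, weights, biases):
    """Forward pass through sigmoid hidden layers; the final hidden
    layer's activations are used as the learned feature representation."""
    h = x
    for W, b in zip(weights, biases):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return h

# Toy dimensions (illustrative): 13-dim MFCCs, +/-5 frame context
# -> 143-dim spliced input, one 256-unit hidden layer, 40-dim output.
mfcc = rng.standard_normal((100, 13))   # stand-in for 100 MFCC frames
x = splice(mfcc, context=5)             # shape (100, 143)
dims = [143, 256, 40]
weights = [rng.standard_normal((dims[i], dims[i + 1])) * 0.1 for i in range(2)]
biases = [np.zeros(d) for d in dims[1:]]
feats = dnn_features(x, weights, biases)  # shape (100, 40)
print(feats.shape)
```

The paper's second feature set would differ only in the input: PNS and Gabor filter outputs would replace the spliced MFCCs before the same kind of network.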