Improving Voice Activity Detection for Multimodal Movie Dialogue Corpus

2018 
Detection of speech segments from sound sequences is an important task for various applications. Methods for the voice activity detection (VAD) task have been developed, but when they are applied to movie data, because of the noise in movies, the accuracy has deteriorated. This noise problem is addressed by using a deep neural network (DNN) model for VAD. Although the overall performance of the DNN-based model was satisfactory, there was a clear drop in performance when singing voices or musical sounds existed as background noise. In this study, the effectiveness of changing VAD models from a binary classifier to a multi-class classifier was examined. As a result, it was found that DNN-based and multi-class VAD models can incorporate singing voices and musical sounds sufficiently. In the experiments, an equal error rate of 3.92% was obtained in movies.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    10
    References
    4
    Citations
    NaN
    KQI
    []