Acoustic and temporal representations in convolutional neural network models of prosodic events

2020 
Abstract

Prosodic events such as pitch accents and phrase boundaries have various acoustic and temporal correlates that are used as features in machine learning models to automatically detect these events from speech. These features are often linguistically motivated, high-level features hand-crafted by experts to best represent the prosodic events to be detected or classified. An alternative approach is to use a neural network that is trained and optimized to learn suitable feature representations on its own. An open question, however, is what exactly the learned feature representation consists of, since the high-level output of a neural network is not readily interpretable. In this paper, we use a convolutional neural network (CNN) that learns such features from frame-based acoustic input descriptors. We are concerned with the question of what the CNN has learned after being trained on different datasets to perform pitch accent and phrase boundary detection. Specifically, we suggest a methodology for analyzing what temporal, acoustic, and context information is latent in the learned feature representation. We use the output representations learned by the CNN to predict various manually computed (aggregated) features using linear regression. The results show that the CNN learns word duration implicitly, and indicate that certain acoustic features may help to locate relevant voiced regions in speech that are useful for detecting pitch accents and phrase boundaries. Finally, our analysis of the latent contextual information learned by the CNN involves a comparison with a sequential model (LSTM) to investigate similarities and differences in what both network types have learned.
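The probing analysis described above can be illustrated with a minimal sketch: fit a linear regression from the CNN's learned word-level representations to a manually computed feature (e.g., word duration) and measure how much variance the representation explains. The array names, shapes, and synthetic data below are illustrative assumptions, not the paper's actual code or data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical CNN output representations: one 128-dimensional vector per word.
rng = np.random.default_rng(0)
cnn_repr = rng.normal(size=(1000, 128))

# Stand-in target: a manually computed (aggregated) feature such as word duration.
# Here it is synthesized so that it is partly linearly recoverable from the representation.
word_duration = cnn_repr[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=1000)

# Linear-regression probe: high held-out R^2 suggests the feature is latent in the representation.
X_train, X_test, y_train, y_test = train_test_split(cnn_repr, word_duration, random_state=0)
probe = LinearRegression().fit(X_train, y_train)
print("R^2 of duration probe:", r2_score(y_test, probe.predict(X_test)))
```

In this setup, a high R² on held-out data would indicate that the probed feature (here, word duration) is implicitly encoded in the learned representation, which is the form of evidence the abstract refers to.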