Visualization and Interpretation of Siamese Style Convolutional Neural Networks for Sound Search by Vocal Imitation

2018 
Designing systems that allow users to search sounds through vocal imitation augments current text-based search engines and advances human-computer interaction. Previously we proposed a Siamese style convolutional neural network called TL-IMINET for sound search by vocal imitation, which jointly addresses feature extraction with a Convolutional Neural Network (CNN) and similarity calculation with a Fully Connected Network (FCN), and is currently the state of the art. However, how such an architecture works is still a mystery. In this paper, we try to answer this question. First, we visualize the input patterns that maximize the activation of different neurons in each CNN tower; this helps us understand what features are extracted from vocal imitations and sound candidates. Second, we visualize the imitation-sound input pairs that maximize the activation of different neurons in the FCN layers; this helps us understand what kinds of input pattern pairs are recognized during similarity calculation. Interesting patterns are found that reveal the local-to-global and simple-to-conceptual learning mechanism of TL-IMINET. Experiments also show, from the visualization perspective, how transfer learning helps to improve TL-IMINET performance.
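To make the two ideas in the abstract concrete, the sketch below shows (a) the general Siamese two-tower pattern, where each CNN tower embeds a spectrogram and an FCN scores the imitation-sound pair, and (b) activation maximization, i.e., gradient ascent on an input spectrogram so that a chosen unit fires strongly. All layer sizes, input shapes, and module names here are assumptions for illustration; this is not the actual TL-IMINET configuration from the paper.

```python
# Minimal sketch, not the authors' implementation: shapes and layer sizes are made up.
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One CNN tower that embeds a spectrogram (vocal imitation or sound candidate)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.conv(x).flatten(start_dim=1)

class SiameseMatcher(nn.Module):
    """Two CNN towers followed by an FCN that scores imitation-sound similarity."""
    def __init__(self, feat_dim):
        super().__init__()
        self.imitation_tower = Tower()
        self.sound_tower = Tower()
        self.fcn = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # probability that the pair matches
        )

    def forward(self, imitation, sound):
        z = torch.cat([self.imitation_tower(imitation),
                       self.sound_tower(sound)], dim=1)
        return self.fcn(z)

def maximize_activation(tower, unit, shape=(1, 1, 64, 64), steps=200, lr=0.1):
    """Gradient ascent on the input so that feature map `unit` of the last conv
    layer fires strongly; the optimized input visualizes what that unit detects."""
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        act = tower.conv(x)[0, unit].mean()  # mean activation of one feature map
        (-act).backward()                    # ascend by minimizing the negative
        opt.step()
    return x.detach()
```

With the hypothetical 64x64 single-channel input above, each tower flattens to 32 * 16 * 16 = 8192 features, so the matcher would be built as `SiameseMatcher(feat_dim=8192)`; the same activation-maximization loop can be pointed at FCN units by optimizing an imitation-sound input pair instead of a single spectrogram.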