IMINET: Convolutional semi-siamese networks for sound search by vocal imitation

2017 
Searching for sounds by text labels is often difficult, as text labels cannot always provide sufficient information about the sound content. Previously we proposed an unsupervised system called IMISOUND for sound search by vocal imitation. In this paper, we further propose a Convolutional Semi-Siamese Network (CSN) called IMINET. IMINET uses two towers of Convolutional Neural Networks (CNNs) to extract features from vocal imitations and sound recordings, respectively. It then uses a fully connected network to predict the similarity between a vocal imitation and a sound recording. We propose three configurations of the CSN that differ in the weight-sharing strategy between the two towers. We also propose late fusion of the retrieval results of IMINET's different configurations with those of IMISOUND, which serves as a baseline. Experiments show significant improvements in retrieval performance from the IMISOUND baseline to the fusion of IMINET's different configurations, and to the various fusions of IMINET with the IMISOUND baseline.
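
The late fusion step mentioned above can be illustrated with a minimal sketch. The abstract does not state how the retrieval results are combined, so everything here is an assumption made for illustration: each system is taken to return one similarity score per candidate sound for a given vocal imitation, scores are min-max normalized per system, and the fused score is a weighted average. The names `min_max`, `late_fusion`, `iminet_scores`, and `imisound_scores`, as well as the score values, are hypothetical.

```python
def min_max(scores):
    """Rescale one system's scores to [0, 1] so systems are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie for this system; it contributes nothing to the ranking.
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def late_fusion(systems, weights=None):
    """Fuse per-system similarity scores and return candidates, best first.

    Assumption: every system scores the same set of candidate sounds.
    """
    if weights is None:
        weights = [1.0 / len(systems)] * len(systems)  # equal weighting by default
    normalized = [min_max(s) for s in systems]
    fused = {
        name: sum(w * s[name] for w, s in zip(weights, normalized))
        for name in normalized[0]
    }
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical scores for one vocal imitation against three candidate sounds.
iminet_scores = {"dog_bark": 0.9, "door_slam": 0.2, "car_horn": 0.5}
imisound_scores = {"dog_bark": 0.3, "door_slam": 0.1, "car_horn": 0.8}

ranking = late_fusion([iminet_scores, imisound_scores])
print(ranking)
```

Fusing at the score level keeps each retrieval system independent, so a configuration can be added or dropped without retraining the others. Other fusion rules, such as rank-based combination, would fit the same interface.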