Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

2018 
There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn audio and video features from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve state-of-the-art performance on established audio classification benchmarks (DCASE2014 and ESC-50), while our visual stream provides a very effective initialization to significantly improve the performance of video-based action recognition models (our self-supervised pretraining yields a remarkable gain in accuracy of 16.7% on UCF101 and 13.0% on HMDB51, compared to learning from scratch).
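To make the training objective concrete, the sketch below shows one plausible margin-based contrastive loss over audio-video pairs: embeddings of temporally synchronized (positive) pairs are pulled together, while embeddings of misaligned or mismatched (negative) pairs are pushed apart. This is a minimal illustration only; the function name, tensor shapes, distance metric, and margin value are assumptions for exposition and not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(video_emb: torch.Tensor,
                          audio_emb: torch.Tensor,
                          is_synced: torch.Tensor,
                          margin: float = 0.99) -> torch.Tensor:
    """Hypothetical margin-based contrastive loss for audio-video synchronization.

    video_emb, audio_emb: (B, D) embeddings from the visual and audio streams.
    is_synced: (B,) float tensor; 1.0 for temporally aligned pairs, 0.0 for negatives.
    margin: illustrative margin below which negative pairs are penalized.
    """
    # Euclidean distance between the two stream embeddings (assumed metric).
    dist = F.pairwise_distance(video_emb, audio_emb)
    # Synchronized pairs: penalize large distances.
    pos_term = is_synced * dist.pow(2)
    # Negative pairs: penalize distances smaller than the margin.
    neg_term = (1.0 - is_synced) * F.relu(margin - dist).pow(2)
    return (pos_term + neg_term).mean()
```

In this setup, the "careful choice of negative examples" mentioned in the abstract would correspond to how the negative pairs are sampled, for instance drawing them from unrelated videos early in training and from temporally shifted clips of the same video later, in line with the curriculum scheme; the exact sampling schedule is described in the paper rather than in this sketch.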