Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

2018 
There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn audio and video features from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve state-of-the-art performance on established audio classification benchmarks (DCASE2014 and ESC-50), while our visual stream provides a very effective initialization to significantly improve the performance of video-based action recognition models (our self-supervised pretraining yields a remarkable gain in accuracy of 16.7% on UCF101 and 13.0% on HMDB51, compared to learning from scratch).
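To make the training objective concrete, the sketch below shows one plausible margin-based contrastive loss over audio-video pairs: embeddings of temporally synchronized (positive) pairs are pulled together, while embeddings of misaligned or mismatched (negative) pairs are pushed apart. This is a minimal illustration only; the function name, tensor shapes, distance metric, and margin value are assumptions for exposition and not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(video_emb: torch.Tensor,
                          audio_emb: torch.Tensor,
                          is_synced: torch.Tensor,
                          margin: float = 0.99) -> torch.Tensor:
    """Hypothetical margin-based contrastive loss for audio-video synchronization.

    video_emb, audio_emb: (B, D) embeddings from the visual and audio streams.
    is_synced: (B,) float tensor; 1.0 for temporally aligned pairs, 0.0 for negatives.
    margin: illustrative margin below which negative pairs are penalized.
    """
    # Euclidean distance between the two stream embeddings (assumed metric).
    dist = F.pairwise_distance(video_emb, audio_emb)
    # Synchronized pairs: penalize large distances.
    pos_term = is_synced * dist.pow(2)
    # Negative pairs: penalize distances smaller than the margin.
    neg_term = (1.0 - is_synced) * F.relu(margin - dist).pow(2)
    return (pos_term + neg_term).mean()
```

In this setup, the "careful choice of negative examples" mentioned in the abstract would correspond to how the negative pairs are sampled, for instance drawing them from unrelated videos early in training and from temporally shifted clips of the same video later, in line with the curriculum scheme; the exact sampling schedule is described in the paper rather than in this sketch.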