AVOT: Audio-Visual Object Tracking of Multiple Objects for Robotics

2020 
Existing state-of-the-art object trackers can run into challenges when objects collide, occlude one another, or come close together. These visually based trackers may also fail to differentiate between objects with the same appearance but different materials. When such failures occur, existing methods may stop tracking or incorrectly begin tracking another object, and they are difficult to recover from since trackers often rely on results from previous frames. By using audio of the impact sounds from object collisions, rolling, and similar interactions, our audio-visual object tracking (AVOT) neural network can reduce tracking error and drift. We train AVOT end to end and use audio-visual inputs over all frames. Our audio-based technique may be used in conjunction with other neural networks to augment visually based object detection and tracking methods. We evaluate its runtime (frames per second, FPS) and accuracy (intersection over union, IoU) against OpenCV object-tracking implementations and a deep learning method. Our experiments on the synthetic Sound-20K audio-visual dataset demonstrate that AVOT outperforms single-modality deep learning methods when audio from object collisions is present. A proposed scheduler network that switches between AVOT and other methods based on audio onset maximizes accuracy and performance across all frames in multimodal object tracking.
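To make the scheduling idea concrete, below is a minimal sketch of onset-gated tracker switching, together with the IoU metric used for evaluation. The abstract describes a learned scheduler network; here a simple RMS-energy threshold stands in for it, and the names `detect_onset`, `avot_tracker`, and `visual_tracker` are hypothetical, not from the paper.

```python
import numpy as np

def iou(box_a, box_b):
    # Intersection over union of two axis-aligned boxes (x1, y1, x2, y2),
    # the accuracy metric named in the abstract.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def detect_onset(audio_frame, threshold=0.02):
    # Stand-in for the paper's scheduler network: flag an audio onset
    # (e.g., a collision sound) when RMS energy exceeds a fixed threshold.
    return np.sqrt(np.mean(np.square(audio_frame))) > threshold

def track(frames, audio_frames, avot_tracker, visual_tracker):
    # Scheduler sketch: route a frame to the audio-visual tracker when an
    # onset is detected, and to a visual-only tracker otherwise.
    boxes = []
    for frame, audio in zip(frames, audio_frames):
        if detect_onset(audio):
            boxes.append(avot_tracker.update(frame, audio))
        else:
            boxes.append(visual_tracker.update(frame))
    return boxes
```

The design intent, per the abstract, is that the audio-visual tracker is most valuable exactly when collision audio is available, so gating on onsets keeps the cheaper or more general visual tracker in the loop for silent frames.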