STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization

2020 
In this article, we tackle cross-modal video moment localization, namely, localizing the most relevant moment in an untrimmed video given a sentence as the query. Most existing methods generate video moment candidates via multi-scale sliding-window segmentation; they hence inevitably produce numerous candidates, which makes the retrieval process less efficient. In addition, spatial scene tracking is crucial for video moment localization, yet it is rarely considered in traditional techniques. To this end, we contribute a spatio-temporal reinforcement learning framework. Specifically, we first exploit temporal-level reinforcement learning to dynamically adjust the boundary of the localized video moment instead of relying on the traditional window-segmentation strategy, which accelerates the localization process. Thereafter, spatial-level reinforcement learning is proposed to track the scene across consecutive frames, thereby filtering out less relevant information. Lastly, an alternating optimization strategy is proposed to jointly optimize the temporal- and spatial-level reinforcement learning, so that the two tasks of temporal boundary localization and spatial scene tracking mutually reinforce each other. Experiments on two real-world datasets demonstrate the effectiveness and rationality of our proposed solution.
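The temporal-level idea described above, treating boundary adjustment as a sequence of discrete actions rather than exhaustively scoring sliding-window candidates, can be illustrated with a minimal sketch. This is not the authors' implementation: the action set, step size, and greedy action selection (a stand-in for a learned policy) are all illustrative assumptions, and the reward here is simply the gain in temporal IoU with a ground-truth moment.

```python
# Minimal sketch (illustrative, not the paper's code): temporal boundary
# adjustment as an action loop. Each action nudges one boundary of the
# current [start, end] window; the "reward" is the gain in temporal IoU
# with a ground-truth moment, and we stop when no action improves it.

def temporal_iou(a, b):
    """Temporal IoU between two [start, end] intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

# Hypothetical action set: move the start or end boundary by one step.
ACTIONS = {
    "start_fwd":  lambda s, e, d: (s + d, e),
    "start_back": lambda s, e, d: (s - d, e),
    "end_fwd":    lambda s, e, d: (s, e + d),
    "end_back":   lambda s, e, d: (s, e - d),
}

def localize(video_len, gt, step=1.0, max_steps=50):
    """Greedy stand-in for a learned policy: at each step take the action
    with the largest IoU gain; terminate when no action improves IoU."""
    s, e = 0.0, video_len  # initialize with the whole video
    for _ in range(max_steps):
        best, best_iou = None, temporal_iou((s, e), gt)
        for fn in ACTIONS.values():
            ns, ne = fn(s, e, step)
            ns, ne = max(0.0, ns), min(video_len, ne)  # clamp to video
            if ne - ns <= 0:
                continue
            iou = temporal_iou((ns, ne), gt)
            if iou > best_iou:
                best, best_iou = (ns, ne), iou
        if best is None:  # terminal condition: no improving action
            break
        s, e = best
    return s, e

start, end = localize(video_len=30.0, gt=(8.0, 15.0))
```

In this toy case the loop converges to the ground-truth moment in far fewer evaluations than scoring every multi-scale window, which mirrors the efficiency argument above; a real agent would of course pick actions from learned visual-textual features rather than from the oracle IoU.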