Weakly supervised moment localization with natural language based on semantic reconstruction

2022 
The goal of cross-modal moment localization is to find the temporal moment in an untrimmed video that semantically corresponds to a natural language query. Most current approaches learn cross-modal moment localization models from fine-grained temporal annotations of the video, which are extremely time-consuming and labor-intensive to obtain. In this paper, we propose a novel framework for weakly supervised cross-modal moment localization that incorporates a proposal generation module and a semantic reconstruction module. The proposal generation module uses a two-dimensional temporal feature map to model cross-modal video representations and can encode the temporal relationships between moment candidates. The semantic reconstruction module, operating on the generated proposals, assesses each proposal's capacity to restore the text query and thereby provides weak supervision for network training. In addition, a punishment loss is proposed to further suppress the influence of the invalid regions of the temporal map. Extensive experimental results show that the proposed method achieves state-of-the-art performance, demonstrating its effectiveness for weakly supervised moment localization with natural language.
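
To make the two-dimensional temporal feature map concrete, the following is a minimal sketch, assuming clip-level features are already extracted: cell (i, j) of the map pools the clips from i through j into a single moment-candidate representation, and cells with j < i are invalid. The function and tensor names here are illustrative assumptions, not the authors' implementation.

```python
import torch

def build_2d_temporal_map(clip_feats: torch.Tensor) -> torch.Tensor:
    """clip_feats: (N, d) clip-level features -> (N, N, d) moment-candidate map."""
    n, d = clip_feats.shape
    # Prefix sums allow O(1) average pooling over any clip span.
    prefix = torch.cat([torch.zeros(1, d), clip_feats.cumsum(dim=0)], dim=0)
    fmap = torch.zeros(n, n, d)
    for i in range(n):
        for j in range(i, n):
            # Average-pool clips i..j into cell (i, j); cells with j < i
            # stay zero and correspond to invalid candidates.
            fmap[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
    return fmap

# Example: 16 clips with 512-d features -> a (16, 16, 512) candidate map.
moment_map = build_2d_temporal_map(torch.randn(16, 512))
```

Because adjacent cells share clips, the map naturally encodes the temporal relationships between overlapping moment candidates.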
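The reconstruction-based weak supervision could look roughly like the sketch below, assuming each proposal is summarized by a pooled feature from the 2D map: a decoder conditioned on that feature tries to restore the query, and the per-proposal cross-entropy serves as an inverse matching score. All module and parameter names are hypothetical, chosen only to illustrate the idea.

```python
import torch
import torch.nn as nn

class QueryReconstructor(nn.Module):
    """Scores a moment proposal by how well it lets a decoder restore the query."""
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def reconstruction_loss(self, proposal_feat, query_tokens):
        """proposal_feat: (B, d); query_tokens: (B, T) token ids.
        Returns per-proposal cross-entropy; lower = better reconstruction."""
        # Teacher forcing: predict token t+1 from tokens <= t, with the
        # proposal feature injected as the decoder's initial hidden state.
        inputs = self.embed(query_tokens[:, :-1])
        h0 = proposal_feat.unsqueeze(0)               # (1, B, d)
        hidden, _ = self.decoder(inputs, h0)
        logits = self.out(hidden)                     # (B, T-1, vocab)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            query_tokens[:, 1:].reshape(-1),
            reduction="none",
        ).view(query_tokens.size(0), -1).mean(dim=1)
        return loss                                   # (B,) proposal scores
```

Under this scheme, proposals from the 2D map that reconstruct the query with low loss would be reinforced during training, while a punishment term would down-weight the invalid cells of the map, as described in the abstract.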