Weakly supervised moment localization with natural language based on semantic reconstruction

2022 
The goal of cross-modal moment localization is to find the temporal moment in an untrimmed video that semantically corresponds to a natural language query. Most current approaches learn cross-modal moment localization models from fine-grained temporal annotations of the video, which are extremely time-consuming and labor-intensive to obtain. In this paper, we propose a novel framework for weakly supervised cross-modal moment localization that incorporates a proposal generation module and a semantic reconstruction module. The proposal generation module uses a two-dimensional temporal feature map to model cross-modal video representations and can encode the temporal relationships between moment candidates. The semantic reconstruction module, operating on the generated proposals, assesses each proposal's capacity to restore the text query and thereby provides weak supervision for network training. In addition, a punishment loss is proposed to further suppress the influence of the invalid regions of the temporal map. Extensive experimental results show that the proposed method achieves state-of-the-art performance, demonstrating its effectiveness for weakly supervised moment localization with natural language.
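
To make the two-dimensional temporal feature map concrete, the following is a minimal sketch, assuming clip-level features are already extracted: cell (i, j) of the map pools the clips from i through j into a single moment-candidate representation, and cells with j < i are invalid. The function and tensor names here are illustrative assumptions, not the authors' implementation.

```python
import torch

def build_2d_temporal_map(clip_feats: torch.Tensor) -> torch.Tensor:
    """clip_feats: (N, d) clip-level features -> (N, N, d) moment-candidate map."""
    n, d = clip_feats.shape
    # Prefix sums allow O(1) average pooling over any clip span.
    prefix = torch.cat([torch.zeros(1, d), clip_feats.cumsum(dim=0)], dim=0)
    fmap = torch.zeros(n, n, d)
    for i in range(n):
        for j in range(i, n):
            # Average-pool clips i..j into cell (i, j); cells with j < i
            # stay zero and correspond to invalid candidates.
            fmap[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
    return fmap

# Example: 16 clips with 512-d features -> a (16, 16, 512) candidate map.
moment_map = build_2d_temporal_map(torch.randn(16, 512))
```

Because adjacent cells share clips, the map naturally encodes the temporal relationships between overlapping moment candidates.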
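The reconstruction-based weak supervision could look roughly like the sketch below, assuming each proposal is summarized by a pooled feature from the 2D map: a decoder conditioned on that feature tries to restore the query, and the per-proposal cross-entropy serves as an inverse matching score. All module and parameter names are hypothetical, chosen only to illustrate the idea.

```python
import torch
import torch.nn as nn

class QueryReconstructor(nn.Module):
    """Scores a moment proposal by how well it lets a decoder restore the query."""
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def reconstruction_loss(self, proposal_feat, query_tokens):
        """proposal_feat: (B, d); query_tokens: (B, T) token ids.
        Returns per-proposal cross-entropy; lower = better reconstruction."""
        # Teacher forcing: predict token t+1 from tokens <= t, with the
        # proposal feature injected as the decoder's initial hidden state.
        inputs = self.embed(query_tokens[:, :-1])
        h0 = proposal_feat.unsqueeze(0)               # (1, B, d)
        hidden, _ = self.decoder(inputs, h0)
        logits = self.out(hidden)                     # (B, T-1, vocab)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            query_tokens[:, 1:].reshape(-1),
            reduction="none",
        ).view(query_tokens.size(0), -1).mean(dim=1)
        return loss                                   # (B,) proposal scores
```

Under this scheme, proposals from the 2D map that reconstruct the query with low loss would be reinforced during training, while a punishment term would down-weight the invalid cells of the map, as described in the abstract.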