Weakly Supervised Temporal Action Localization by Multi-Stage Fusion Network

2020 
Most temporal action localization methods are trained on video datasets with frame-wise annotations, which are expensive and time-consuming to acquire. To alleviate this problem, many weakly supervised temporal action localization methods, which leverage only video-level annotations during training, have been proposed. In this paper, we first analyze three problems of weakly supervised temporal action localization, namely feature similarity, action completeness, and weak annotation. Based on these three problems, we propose a novel network called the multi-stage fusion network, which decomposes the problems into three different modules: the feature, sub-action, and action modules. Specifically, for feature similarity, a triplet loss is introduced in the feature module to ensure that action instances from the same class have similar feature sequences and to enlarge the margin between action instances from different classes. For action completeness, each stage of the sub-action module discovers different sub-actions, and complete action instances are localized in the action module by fusing the multiple sub-actions from the sub-action module. To alleviate weak annotation, the action module localizes multiple action proposals from the multi-stage outputs of the network and selects the proposals with higher confidence scores as the predicted action instances. Extensive experimental results on the THUMOS'14 and ActivityNet 1.2 datasets demonstrate that our method outperforms state-of-the-art methods, improving the average mean Average Precision (mAP) on THUMOS'14 from 40.9% to 43.3%.
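The abstract does not give the exact form of the triplet loss used in the feature module. The sketch below assumes a standard triplet margin loss computed over temporally pooled snippet features; the function names (`feature_triplet_loss`, `pool_sequence`), the mean-pooling choice, and the 1024-dimensional features are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def feature_triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss over pooled action-instance features.

    anchor/positive: features of action instances from the same class;
    negative: a feature of an instance from a different class.
    All tensors have shape (batch, feature_dim).
    Assumed formulation, not taken from the paper.
    """
    d_pos = F.pairwise_distance(anchor, positive)  # same-class distance
    d_neg = F.pairwise_distance(anchor, negative)  # cross-class distance
    # Pull same-class instances together and push different-class
    # instances apart by at least `margin`.
    return F.relu(d_pos - d_neg + margin).mean()

def pool_sequence(feats):
    """Mean-pool a snippet feature sequence of shape (T, D) over time.

    Mean pooling is an assumption; the paper may pool differently.
    """
    return feats.mean(dim=0, keepdim=True)

if __name__ == "__main__":
    torch.manual_seed(0)
    a = pool_sequence(torch.randn(30, 1024))  # anchor instance
    p = pool_sequence(torch.randn(25, 1024))  # same-class instance
    n = pool_sequence(torch.randn(40, 1024))  # different-class instance
    print(feature_triplet_loss(a, p, n).item())
```

Under this reading, the margin hyperparameter directly controls how far apart instances of different classes are driven in feature space, which matches the abstract's stated goal of expanding the inter-class margin while keeping same-class feature sequences similar.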