One-Shot Example Videos Localization Network for Weakly-Supervised Temporal Action Localization

2021 
This paper tackles the problem of example-driven weakly-supervised temporal action localization. We propose the One-Shot Example Videos Localization Network (OSEVLNet) to precisely localize action instances in untrimmed videos given only one trimmed example video. Since frame-level ground truth is unavailable under the weakly-supervised setting, our approach trains a self-attention module automatically with reconstruction and feature discrepancy restrictions. Specifically, the reconstruction restriction minimizes the discrepancy between the original input features and the features reconstructed by a Variational AutoEncoder (VAE) module, while the feature discrepancy restriction maximizes the distance between the weighted features of highly responsive and slightly responsive regions. Our approach achieves comparable or better results on the THUMOS'14 dataset than other weakly-supervised methods while being trained on far fewer videos. Moreover, it is particularly well suited to extending to newly emerging action categories, meeting the requirements of different application scenarios.
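
To make the two restrictions concrete, below is a minimal PyTorch sketch of how they could be implemented. Everything here is an illustrative assumption rather than the authors' published implementation: the feature and latent dimensions, the single-linear self-attention head, mean pooling over snippets, and the unweighted sum of loss terms are all hypothetical choices.

```python
# Minimal sketch of the reconstruction and feature discrepancy
# restrictions described in the abstract. Shapes, layer sizes, and
# pooling choices are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, LATENT_DIM, T = 1024, 128, 64  # assumed feature/latent sizes, snippet count


class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(FEAT_DIM, 2 * LATENT_DIM)  # outputs mu and log-variance
        self.dec = nn.Linear(LATENT_DIM, FEAT_DIM)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(z), mu, logvar


attention = nn.Linear(FEAT_DIM, 1)  # self-attention scores over snippets (assumed form)
vae = VAE()

x = torch.randn(T, FEAT_DIM)          # snippet features of one untrimmed video
a = torch.sigmoid(attention(x))       # per-snippet attention in (0, 1)

# Reconstruction restriction: the VAE must reproduce the input features.
recon, mu, logvar = vae(x)
loss_recon = F.mse_loss(recon, x)
loss_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

# Feature discrepancy restriction: push apart the attention-weighted
# features of highly responsive (foreground) and slightly responsive
# (background) regions.
f_high = (a * x).mean(dim=0)          # foreground-weighted feature
f_low = ((1.0 - a) * x).mean(dim=0)   # background-weighted feature
loss_disc = -F.pairwise_distance(f_high.unsqueeze(0), f_low.unsqueeze(0)).mean()

loss = loss_recon + loss_kl + loss_disc  # equal weights are an assumption
loss.backward()
```

Note that the discrepancy term is written as a negative distance, so minimizing the total loss maximizes the separation between the foreground- and background-weighted features, which is what drives the attention module toward action regions without frame-level labels.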