Spatio-Temporal Action Detection and Localization Using a Hierarchical LSTM

2020 
Video analysis has gained importance in recent years because of its usefulness in a wide variety of applications. The efficiency of a video analytics engine depends primarily on its ability to extract spatio-temporal features that are sufficiently discriminative. Inspired by the way the human visual system operates, we propose a hierarchical architecture that captures spatio-temporal information from an input video at different time scales. The fundamental unit of the proposed architecture is a 3D Inception module followed by two layers of modified Convolutional Long Short-Term Memory (ConvLSTM). At each level, the LSTM cell and hidden states are consolidated and passed to the next level using a visual attention-based pooling approach. The proposed network is applied to video action detection and localization, a foundational element of video analysis. Experiments on the UCF101 and AVA datasets show that the recognition accuracy achieved by the proposed algorithm advances the state of the art in spatio-temporal action detection and localization.
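The abstract does not specify how the attention-based pooling consolidates states between levels. As a rough illustration only, the sketch below shows one generic form of temporal attention pooling: each time step's state is scored against a query vector, the scores are normalized with a softmax, and the states are combined as a weighted sum. The function name `attention_pool`, the `query` vector, the spatial-mean scoring rule, and all tensor shapes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def attention_pool(states, query):
    """Pool T per-step states of shape (T, C, H, W) into one (C, H, W) state.

    Hypothetical scoring rule: dot product between a query vector and the
    spatial mean of each state. The paper's actual rule is not given here.
    """
    # Summarize each time step by its spatial mean: shape (T, C).
    summaries = states.mean(axis=(2, 3))
    # One scalar relevance score per time step: shape (T,).
    scores = summaries @ query
    # Numerically stable softmax over the time axis.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Attention-weighted sum over time: shape (C, H, W).
    pooled = np.tensordot(weights, states, axes=(0, 0))
    return pooled, weights

# Toy data standing in for ConvLSTM hidden states (assumed shapes).
rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 16, 7, 7))  # T=8 steps, C=16 channels, 7x7 maps
query = rng.standard_normal(16)              # would be learned in a real model
pooled, weights = attention_pool(hidden, query)
print(pooled.shape)
```

In a hierarchical setup of the kind the abstract describes, a pooled state like this could serve as the input state for the next, coarser time scale. The same routine could be applied to the cell states as well as the hidden states.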