Group Sparse-Based Mid-Level Representation for Action Recognition

2017 
Mid-level parts have been shown to be effective for human action recognition in videos. Typically, these semantic parts are first mined with heuristic rules, and videos are then represented via the volumetric max-pooling (VMP) method. However, these methods have two issues: 1) the VMP strategy divides videos by static grids, yet a semantic part may occur at different locations in different videos, so VMP is not space-time invariant. To solve this problem, we propose a saliency-driven max-pooling scheme to represent a video. We extract the video's semantic cues with a saliency map and dynamically pool the local maximum responses. This scheme can be regarded as a semantic content-based feature alignment method. 2) The parts discovered by heuristic rules may be intuitive but not discriminative enough for action classification, because the relations between the detectors are neglected. To address this issue, we apply a sparse classifier model to select discriminative parts. Moreover, to further improve the discriminative ability of the representation, we perform feature selection according to the magnitude of the corresponding entries of the model coefficients. We conduct experiments on four challenging datasets—KTH, Olympic Sports, UCF50, and HMDB51. The results show that the proposed method significantly outperforms the state-of-the-art methods.
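A minimal NumPy sketch of the two ideas described above, based only on the abstract: pooling part-detector responses over salient space-time locations instead of static grids, and keeping feature dimensions by classifier-coefficient magnitude. Function names, array shapes, and the threshold parameter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def saliency_driven_max_pool(responses, saliency, saliency_threshold=0.5):
    """Pool part-detector responses over salient space-time regions.

    responses : (T, H, W, D) array of per-location scores of D mid-level
                part detectors over a video of T frames (assumed layout).
    saliency  : (T, H, W) saliency map in [0, 1].
    Returns a D-dimensional video descriptor: for each detector, the maximum
    response restricted to salient locations (falling back to the global
    maximum if no location passes the threshold).
    """
    mask = saliency >= saliency_threshold          # salient space-time locations
    pooled = np.empty(responses.shape[-1])
    for d in range(responses.shape[-1]):
        r = responses[..., d]
        pooled[d] = r[mask].max() if mask.any() else r.max()
    return pooled

def select_by_coefficient_magnitude(features, coef, k=200):
    """Keep the k dimensions with the largest classifier-coefficient
    magnitudes (the coefficient-magnitude feature selection step)."""
    keep = np.argsort(-np.abs(coef))[:k]
    return features[:, keep]
```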