Lightweight Action Recognition in Compressed Videos

2020 
Most existing action recognition models are large convolutional neural networks that work only with raw RGB frames as input. However, practical applications require lightweight models that directly process compressed videos. In this work, for the first time, such a model is developed, which is lightweight enough to run in real-time on embedded AI devices without sacrifices in recognition accuracy. A new Aligned Temporal Trilinear Pooling (ATTP) module is formulated to fuse three modalities in a compressed video. To remedy the weaker motion vectors (compared to optical flow computed from raw RGB streams) for representing dynamic content, we introduce a temporal fusion method to explicitly induce the temporal context, as well as knowledge distillation from a model trained with optical flows via feature alignment. Compared to existing compressed video action recognition models, it is much more compact and faster thanks to adopting a lightweight CNN backbone.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    48
    References
    1
    Citations
    NaN
    KQI
    []