Lightweight Action Recognition in Compressed Videos

Yuqi Huo, Xiaoli Xu, Yao Lu, Yulei Niu, Mingyu Ding, Zhiwu Lu, Tao Xiang, Ji-Rong Wen

2020

Most existing action recognition models are large convolutional neural networks that accept only raw RGB frames as input. Practical applications, however, require lightweight models that process compressed videos directly. This work presents the first such model, lightweight enough to run in real time on embedded AI devices without sacrificing recognition accuracy. A new Aligned Temporal Trilinear Pooling (ATTP) module is formulated to fuse the three modalities in a compressed video. Motion vectors represent dynamic content more weakly than optical flow computed from raw RGB streams. To compensate, we introduce a temporal fusion method that explicitly induces temporal context, together with knowledge distillation, via feature alignment, from a model trained on optical flow. Compared with existing compressed-video action recognition models, the proposed model is much more compact and faster, thanks to its lightweight CNN backbone.
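
The abstract names three ingredients: trilinear fusion of three modality features, temporal fusion, and distillation via feature alignment. The sketch below is a rough illustration of those ideas, not the paper's actual ATTP formulation. It assumes a low-rank form of trilinear pooling (project each feature, then multiply element-wise), plain averaging over time steps as the temporal fusion, and a mean-squared-error alignment loss. All function names, parameter names, and weight shapes are hypothetical.

```python
# Illustrative sketch only; every name and design choice here is an assumption,
# not the authors' implementation.

def project(weights, x):
    """Apply a linear projection: weights is a list of rows, x is a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def trilinear_pool(f_a, f_b, f_c, w_a, w_b, w_c):
    """Fuse three modality features with an assumed low-rank trilinear form.

    Each feature is projected to a shared dimension, then the three
    projections are multiplied element-wise.
    """
    p_a = project(w_a, f_a)
    p_b = project(w_b, f_b)
    p_c = project(w_c, f_c)
    return [a * b * c for a, b, c in zip(p_a, p_b, p_c)]


def temporal_average(step_features):
    """Assumed temporal fusion: average the fused features over time steps."""
    n = len(step_features)
    return [sum(column) / n for column in zip(*step_features)]


def alignment_loss(student_feat, teacher_feat):
    """Assumed feature-alignment loss: mean squared error between a student
    feature and a feature from a teacher trained on optical flow."""
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    identity = [[1, 0], [0, 1]]  # toy projection weights
    fused_t0 = trilinear_pool([1, 2], [3, 4], [5, 6], identity, identity, identity)
    fused_t1 = trilinear_pool([2, 1], [1, 1], [1, 2], identity, identity, identity)
    video_feat = temporal_average([fused_t0, fused_t1])
    print(video_feat, alignment_loss(video_feat, [8.0, 25.0]))
```

In a real model the projections would be learned layers on top of CNN features and the alignment loss would be added to the classification loss during training. The toy vectors here only show how the pieces compose.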
