Learning Coarse and Fine Features for Precise Temporal Action Localization

2019 
Temporal action localization in untrimmed videos is a fundamental task for real-world computer vision applications such as video surveillance systems. Even though a great deal of research attention has been paid to the problem, precise localization of human activities at the frame level still remains a challenge. In this paper, we propose CoarseFine networks, which learn highly discriminative features without loss of temporal granularity using two streams: a coarse network and a fine network. The coarse network aims to classify the action category from the global context of a video, taking advantage of the descriptive power of successful action recognition models. The fine network, in contrast, forgoes temporal pooling and is instead constrained to a low channel capacity; it is specialized to identify the per-frame location of actions based on local semantics. This approach enables CoarseFine networks to learn fine-grained representations without any loss of temporal information. Our extensive experiments on two challenging benchmarks, THUMOS14 and ActivityNet-v1.3, validate that our proposed method outperforms the state-of-the-art by a remarkable margin on per-frame labeling and temporal action localization tasks, while significantly reducing computational cost.
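The two-stream design described above can be sketched minimally as follows. This is a hypothetical NumPy illustration of the idea, not the authors' implementation: the pooling rule, the channel budget `keep_channels`, the fusion by concatenation, and the linear classifier are all illustrative assumptions.

```python
import numpy as np

def coarse_stream(frames):
    # frames: (T, C) per-frame features. The coarse stream pools over time
    # to capture the global context used for action classification.
    return frames.mean(axis=0)  # (C,) clip-level descriptor

def fine_stream(frames, keep_channels=16):
    # The fine stream keeps full temporal resolution (no temporal pooling)
    # but is constrained to a low channel capacity for efficiency.
    # keep_channels is an assumed, illustrative budget.
    return frames[:, :keep_channels]  # (T, keep_channels)

def per_frame_scores(frames, num_classes=4, keep_channels=16, seed=0):
    # Fuse both streams and score every frame; the concatenation-based
    # fusion and random linear head are assumptions for this sketch.
    rng = np.random.default_rng(seed)
    coarse = coarse_stream(frames)              # (C,) global context
    fine = fine_stream(frames, keep_channels)   # (T, k) local semantics
    T = frames.shape[0]
    # Broadcast the global descriptor to every frame, then concatenate
    # with the per-frame fine features before a linear classifier.
    fused = np.concatenate([np.tile(coarse, (T, 1)), fine], axis=1)
    W = rng.standard_normal((fused.shape[1], num_classes))
    return fused @ W                            # (T, num_classes)

frames = np.random.default_rng(1).standard_normal((100, 64))  # 100 frames
scores = per_frame_scores(frames)
print(scores.shape)  # one score vector per frame, so time granularity is kept
```

Note how the fine stream's output has one row per input frame: no temporal resolution is sacrificed, which is what allows frame-level localization.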