Efficient Spatio-Temporal Contrastive Learning for Skeleton-Based 3D Action Recognition

2021 
In this paper, we propose a simple yet effective self-supervised method called spatio-temporal contrastive learning (ST-CL) for 3D skeleton-based action recognition. ST-CL acquires action-specific features by treating the spatio-temporal continuity of motion tendency as the supervisory signal. To yield effective representations, ST-CL first designs a novel contrastive proxy task that provides different spatio-temporal observation scenes for the same 3D action and pulls them together in the embedding space. Second, three key components are devised in the action encoding to efficiently extract representations in contrastive tasks: (1) Information Representation introduces awareness of joint type when analyzing the motion dynamics. (2) Non-local GCN learns a data-driven graph topology and promotes spatial message passing among long-range joints in each frame. (3) Multi-Scale TCN provides larger receptive fields for capturing richer long-range temporal dynamics among adjacent frames. In ST-CL, the effective proxy task yields useful representations, and the efficient action encoding further enhances representation capacity. As validated on four large-scale datasets, ST-CL is a strong baseline with high performance and efficiency for the contrastive learning study of skeleton data. Compared to previous self-supervised methods, the proposed ST-CL consistently achieves significant improvement with a smaller model size and better training efficiency.
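
To make the contrastive proxy task concrete, the following is a minimal sketch of how two observations of the same action could be pulled together in the embedding space. It assumes an InfoNCE-style objective with cosine similarity, which the abstract does not specify; the function name `info_nce_loss`, the temperature value, and the random stand-in embeddings are all hypothetical and are not taken from the paper.

```python
import numpy as np

def info_nce_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss (an assumed objective, not confirmed by the paper).

    view_a, view_b: arrays of shape (N, D). Row i of each array is the
    embedding of a different observation of the same action i.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                      # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; all other entries act as negatives.
    return float(-np.mean(np.diag(log_prob)))

# Hypothetical stand-in embeddings for 8 actions with 16-dimensional features.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = info_nce_loss(emb, emb + 0.01 * rng.normal(size=emb.shape))
mismatched = info_nce_loss(emb, emb[::-1])
print(aligned < mismatched)
```

In this sketch, matching views of the same action give a lower loss than mismatched views, which is the behaviour a contrastive proxy task of this kind is meant to reward.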