A Unified Deep Framework for Hand Pose Estimation and Dynamic Hand Action Recognition from First-Person RGB Videos

2021 
Understanding hand actions from first-person video has recently attracted attention because of its wide range of potential applications, such as hand rehabilitation and augmented reality. Most existing works rely on RGB images. Compared with RGB images, hand joints have certain advantages, as they are robust to illumination and appearance variations. However, previous works on hand action recognition usually employed hand joints that were manually determined. This paper presents a unified framework for both hand pose estimation and hand action recognition from first-person RGB images. First, our framework estimates 3D hand joints from every RGB image using a combination of ResNet and a graph convolutional network. Then, an adaptation of PA-ResGCN, a state-of-the-art method originally designed for the human skeleton, is proposed for hand action recognition from the estimated hand joints. Our framework takes advantage of efficient graph networks to model the graph-like structure of the human hand in both phases: hand pose estimation and hand action recognition. We evaluate the proposed framework on the First Person Hand Action Benchmark (FPHAB). The experiments show that the proposed framework outperforms several state-of-the-art methods on both the hand pose estimation and hand action recognition tasks.
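To make the idea of modeling the hand as a graph concrete, the following is a minimal sketch of a single graph convolution over a hand skeleton. It is not the paper's network: the 21-joint layout (a wrist plus four joints per finger), the joint indexing, the symmetric normalization of the adjacency matrix, and all function names are assumptions chosen for illustration, and only the Python standard library is used.

```python
# Illustrative sketch only: joint layout, indexing, and normalization are assumptions,
# not details taken from the paper.
NUM_JOINTS = 21  # assumed: wrist (index 0) plus 4 joints for each of 5 fingers


def hand_edges():
    """Return the bones of the assumed skeleton as (parent, child) pairs."""
    edges = []
    for finger in range(5):
        base = 1 + 4 * finger
        edges.append((0, base))  # wrist to the first joint of this finger
        for k in range(3):
            edges.append((base + k, base + k + 1))  # chain along the finger
    return edges


def normalized_adjacency(n, edges):
    """Build D^(-1/2) (A + I) D^(-1/2) as a nested list."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def graph_conv(a_hat, feats, weight):
    """One layer: ReLU(A_hat @ X @ W), mixing each joint with its neighbors."""
    out = matmul(matmul(a_hat, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


if __name__ == "__main__":
    a_hat = normalized_adjacency(NUM_JOINTS, hand_edges())
    joints = [[0.1 * i, 0.0, 1.0] for i in range(NUM_JOINTS)]  # dummy 3D joints
    weight = [[0.5, -0.5], [0.1, 0.2], [0.3, 0.0]]             # dummy 3x2 weights
    features = graph_conv(a_hat, joints, weight)
    print(len(features), len(features[0]))
```

In this sketch each joint's output feature is a weighted mix of its own coordinates and those of the joints it shares a bone with, which is the property that makes graph networks a natural fit for skeleton data in both the pose estimation and action recognition stages.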