Human Action Recognition Based on Analysis of Video Sequences

2021 
Human actions are defined as coordinated movements of different body parts performed in a meaningful way to express different aspects of human behaviour. Recognizing human actions through computer vision is a trending research area with applications in both indoor and outdoor environments. Human action recognition (HAR) has broad applications in surveillance, patient behaviour detection, video retrieval, sports video analysis, human-computer interaction, and related fields. However, processing action videos is a challenging and complex task, which motivates the development of a HAR algorithm with better video representation, feature extraction, and classification capabilities to recognize different action classes effectively. In this regard, a semi-supervised tree and 3D local feature based HAR paradigm (sST-3DF) is developed first. Here, a motion history image (MHI) based interest point refinement is proposed to remove noisy interest points. Histogram of oriented gradients (HOG) and histogram of optical flow (HOF) descriptors are extended from the spatial to the spatio-temporal domain to preserve temporal information. These local features are used to build the trees of a random forest. During tree building, a semi-supervised learning scheme is proposed for better splitting of the data points at each node. To recognize an action in a video, the mutual information between all the extracted interest points and each trained class is estimated by passing the points through the random forest. Next, a two-stream sequential network is developed that leverages sequential and shape information to recognize human actions more efficiently. In this technique, a deep bi-directional long short-term memory (DBiLSTM) network is constructed to model the temporal relationship between action frames through sequential learning. Action information in each frame is extracted using a pre-trained convolutional neural network (CNN).
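The MHI-based interest point refinement can be sketched as follows. This is a minimal illustration of the idea, not the thesis's exact implementation: the decay constant `tau` and the motion threshold `delta` are assumed parameters.

```python
import numpy as np

def update_mhi(mhi, frame, prev_frame, tau=15, delta=30):
    """Update a motion history image (MHI) with one new frame.

    Pixels whose inter-frame intensity change exceeds `delta` are
    set to `tau`; all other pixels decay by one toward zero, so
    recent motion stays bright while older motion fades.
    """
    motion = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32)) > delta
    return np.where(motion, tau, np.maximum(mhi - 1, 0))

def refine_interest_points(points, mhi):
    """Discard interest points that fall on static background,
    i.e. keep only (y, x) locations where the MHI records motion."""
    return [(y, x) for (y, x) in points if mhi[y, x] > 0]
```

Spatio-temporal detectors often fire on textured but static background; gating the detections with the MHI in this way suppresses such noisy points.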
During shape learning, the shape information of each action, captured through depth history images (DHIs), is used to train a deep pre-trained CNN. The depth information of each action frame is estimated and projected onto the X-Y plane to create the DHI images. The major limitations of the algorithms discussed above are twofold: first, the performance of existing algorithms degrades considerably in the presence of partial loss of action data due to obstruction; second, the performance of existing networks is limited when the dependency relationships between the different streams of a multi-stream network are not exploited. To handle these problems, a novel double input sequential network (DISNet) is proposed together with a 3D obstruction model, so that the HAR algorithm can cope with partial loss of action data. DISNet, which learns inter-stream information, is jointly trained on the normal data and on artificially obstructed data created from the same video, providing the HAR network with immunity against obstructions. As most of the available action datasets do not contain partial loss of action data, a 3D obstruction model is proposed to manually add obstructions to action videos. All the algorithms discussed above work well when the training data are sufficient. However, in real-time surveillance it is generally difficult to collect a large training dataset for rarely occurring actions. As abnormal actions do not occur frequently, a HAR system must be able to recognize them from insufficient training data. To handle this challenge, a HAR technique is developed that combines local maxima of difference image (LMDI) based interest point detection, a random projection tree with overlapping splits, and a modified voting score for better action recognition.
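The DHI construction described above can be sketched as follows; the accumulation rule and the normalisation to an 8-bit image are assumptions made for illustration, not the thesis's exact recipe.

```python
import numpy as np

def depth_history_image(depth_frames):
    """Accumulate per-frame depth differences into a single 2-D
    depth history image (DHI) on the X-Y plane.  Regions that move
    a lot in depth over the clip end up bright; static regions
    stay dark.
    """
    dhi = np.zeros(depth_frames[0].shape, dtype=np.float64)
    for t in range(1, len(depth_frames)):
        diff = np.abs(depth_frames[t].astype(np.float64)
                      - depth_frames[t - 1].astype(np.float64))
        dhi += diff
    # Normalise to [0, 255] so the DHI can be fed to a pre-trained CNN.
    if dhi.max() > 0:
        dhi = 255.0 * dhi / dhi.max()
    return dhi.astype(np.uint8)
```

Because the result is a single image per clip, a pre-trained 2-D CNN can be fine-tuned on DHIs directly, which is how the text describes the shape stream being trained.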
In the LMDI based interest point detection method, difference images are constructed by consecutive frame differencing, and 3D peak detection is then applied to this stack of difference images to extract the required interest points. Histograms of oriented gradients and histograms of optical flow are extracted as local features around each interest point. These local features are then indexed by random projection trees. An overlapping split is used during tree construction to reduce the failure probability. A Hough voting technique is applied to each test video to compute the highest similarity score with each individual training class. In addition to the Hough voting score, the number of interest points of a query video matched to each training class is considered for recognition. The effectiveness of all the proposed techniques is verified through experiments on publicly available human action recognition datasets such as KTH, Weizmann, UCF Sports, and JHMDB.
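The LMDI detection step can be illustrated with a short sketch, assuming a simple 26-neighbourhood local-maximum test over the difference-image stack; the threshold value is a placeholder, not the thesis's setting.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def lmdi_interest_points(frames, thresh=20):
    """Local maxima of difference image (LMDI) interest point
    detection: build a stack of consecutive-frame difference
    images, then keep voxels that are 3-D local maxima of the
    stack and exceed a threshold.
    """
    # Stack of absolute difference images, shape (T-1, H, W).
    stack = np.abs(np.diff(np.asarray(frames, dtype=np.float64), axis=0))
    # A voxel is a peak if it equals the max over its 3x3x3 neighbourhood.
    peaks = (stack == maximum_filter(stack, size=3)) & (stack > thresh)
    # Each interest point is a (t, y, x) location in the video volume.
    return list(zip(*np.nonzero(peaks)))
```

The resulting (t, y, x) locations are where the HOG and HOF descriptors would be computed before indexing by the random projection trees.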