Efficient encoding of video descriptor distribution for action recognition

2019 
Action recognition has been an active area of study in the literature. Many proposed methods extract a set of descriptors from the video that capture gradient, motion, and other local information. The descriptors are then mapped to a feature vector, which is used for classification. Two widely used methods for this mapping are bag-of-words and Fisher vectors. The former requires k-means clustering and the latter Gaussian mixture model training before feature vectors can be built. Both algorithms need a global training phase over the whole dataset, which is expensive in both time and memory. Moreover, because the final feature vector depends on this initial training phase, the resulting representations do not scale well when the dataset changes. In this paper, we seek to encode the distribution of video descriptors using Maclaurin coefficients of the density function and moments of the distribution. Experiments on three datasets, namely UCF Sports, JHMDB, and KTH, suggest that our methods are much faster than Fisher vectors in training and testing, and are also more scalable: since the features depend only on the video descriptors and not on cluster centers, they accommodate new videos added to the dataset without retraining. However, Fisher vectors are, in some cases, more accurate than the proposed approaches. Comparison to the state of the art shows that our method is faster and, in some cases, achieves comparable accuracy.
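To illustrate the idea of a per-video, training-free encoding, the following is a minimal sketch (not the authors' exact formulation): it maps a video's set of local descriptors to a fixed-length vector by stacking per-dimension central moments. The function name `moment_encoding`, the moment order `K`, and the descriptor dimensions are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: moment-based encoding of a video's descriptor distribution.
# Unlike bag-of-words or Fisher vectors, no global codebook or GMM is needed,
# so each video can be encoded independently of the rest of the dataset.
import numpy as np

def moment_encoding(descriptors: np.ndarray, K: int = 4) -> np.ndarray:
    """Map an (N, D) set of video descriptors to a length K*D vector
    by concatenating per-dimension central moments of orders 1..K."""
    mu = descriptors.mean(axis=0)                    # order-1 moment (mean)
    centered = descriptors - mu
    feats = [mu]
    for k in range(2, K + 1):
        feats.append((centered ** k).mean(axis=0))   # order-k central moment
    return np.concatenate(feats)

# Usage: encode one video's descriptors; adding new videos later requires no
# re-clustering or re-training of a global model.
rng = np.random.default_rng(0)
video_descriptors = rng.normal(size=(500, 96))       # e.g., 500 local descriptors of dim 96
feature = moment_encoding(video_descriptors)         # shape (4 * 96,)
```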