With the rapid growth of image and video data on the web, hashing has been extensively studied for image and video search in recent years. Benefiting from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, previous deep hashing methods have some limitations (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within a one-stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithms. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results show that our method outperforms current state-of-the-art methods on benchmark datasets.
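For intuition, the sketch below shows one way the two loss terms described above could be combined on top of a CNN feature extractor: a hash layer produces K-bit codes, a linear classifier operates directly on those codes, and a pairwise term pulls the inner products of codes toward the pairwise similarity labels. The layer names, dimensions, and the tanh relaxation are illustrative assumptions; the paper itself keeps the codes discrete and optimizes them with alternating minimization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a one-stream hashing head combining a pairwise similarity
# loss and a classification loss on (relaxed) binary codes. HashHead, K, and
# num_classes are placeholders; a tanh relaxation stands in for the paper's
# truly discrete codes optimized by alternating minimization.
class HashHead(nn.Module):
    def __init__(self, feat_dim=2048, K=48, num_classes=10):
        super().__init__()
        self.hash = nn.Linear(feat_dim, K)                 # last layer -> K-bit codes
        self.classifier = nn.Linear(K, num_classes, bias=False)

    def forward(self, feats):
        codes = torch.tanh(self.hash(feats))               # relaxed codes in (-1, 1)
        logits = self.classifier(codes)                     # codes should also classify well
        return codes, logits

def hashing_loss(codes, logits, labels, sim, alpha=1.0):
    # Pairwise term: inner products of codes should agree with the
    # similarity matrix sim[i, j] = 1 if samples i and j share a label, else 0.
    inner = codes @ codes.t() / 2.0
    pairwise = (F.softplus(inner) - sim * inner).mean()
    # Classification term: the codes themselves should be discriminative.
    cls = F.cross_entropy(logits, labels)
    return pairwise + alpha * cls
```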
Hand pose estimation is the basis of dynamic gesture recognition. In vision-based hand pose estimation, the joints of the human hand are highly flexible, and problems such as local similarity and severe occlusion greatly affect the estimation of hand posture. To recognize complicated hand postures, this paper establishes the structural relationship between hand joints and achieves more accurate hand pose estimation through an improved Nonparametric Structure Regularization Machine (NSRM). Based on the NSRM network, the backbone is replaced with the New High-Resolution Net (NHRNet), and the input and output channels of some convolutional layers are reduced. Finally, hand pose estimation experiments are conducted on a public dataset. The experimental results show that the optimized NSRM network achieves higher accuracy and faster recognition speed for hand pose estimation.
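As a rough illustration of the channel-reduction step mentioned above, the toy block below scales the output channels of a convolution by a width multiplier. The actual NSRM/NHRNet layer definitions are not given in the abstract, so the names and values here are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Toy illustration only: shrink a conv block's output channels with a width
# multiplier to cut parameters and inference time.
class SlimConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, width_mult=0.5):
        super().__init__()
        slim_out = max(1, int(out_ch * width_mult))        # fewer output channels
        self.conv = nn.Conv2d(in_ch, slim_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(slim_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Replacing a 256->256 block with SlimConvBlock(256, 256, width_mult=0.5)
# keeps the spatial resolution but halves the output channels.
x = torch.randn(1, 256, 64, 64)
print(SlimConvBlock(256, 256)(x).shape)                    # torch.Size([1, 128, 64, 64])
```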
To address the problem that feature extraction based on the Gabor wavelet transform yields high-dimensional feature vectors, a novel method named GCLBP (Gabor-CSLBP) is proposed in this paper. Built on the Gabor wavelet transform, the proposed algorithm is a local feature extraction method that extracts a new kind of feature by applying the idea of CS-LBP (Center-Symmetric Local Binary Pattern) to the sub-images produced by the Gabor transform. The feature vector obtained by the GCLBP method combines the advantages of the Gabor wavelet transform and CS-LBP: it not only reduces the dimensionality of the feature vector but also improves robustness to image variations. The proposed method is evaluated by extensive experiments on the benchmark databases CMU PIE and Extended Yale B. The experimental results show that the proposed GCLBP method can significantly improve the face recognition rate under complex illumination.
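The following sketch illustrates the pipeline in the order described above: the face image is filtered with a small Gabor bank, each filtered sub-image is encoded with CS-LBP (the four center-symmetric neighbour pairs of the 8-neighbourhood give a 4-bit, 16-bin code), and the per-orientation histograms are concatenated. Kernel size, scale, threshold, and the number of orientations are illustrative assumptions rather than the paper's settings.

```python
import cv2
import numpy as np

def cs_lbp(img, thresh=0.01):
    # Compare the four center-symmetric neighbour pairs of each interior pixel.
    pairs = [((-1, -1), (1, 1)), ((-1, 0), (1, 0)),
             ((-1, 1), (1, -1)), ((0, -1), (0, 1))]
    h, w = img.shape
    code = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for bit, ((dy1, dx1), (dy2, dx2)) in enumerate(pairs):
        a = img[1 + dy1:h - 1 + dy1, 1 + dx1:w - 1 + dx1]
        b = img[1 + dy2:h - 1 + dy2, 1 + dx2:w - 1 + dx2]
        code |= ((a - b) > thresh).astype(np.uint8) << bit
    return code

def gclbp_descriptor(gray, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    # Gabor bank -> CS-LBP codes per sub-image -> concatenated 16-bin histograms.
    gray = gray.astype(np.float32) / 255.0
    hists = []
    for theta in thetas:
        kern = cv2.getGaborKernel((15, 15), 3.0, theta, 8.0, 0.5, 0)
        sub = cv2.filter2D(gray, cv2.CV_32F, kern)
        codes = cs_lbp(sub)
        hist, _ = np.histogram(codes, bins=16, range=(0, 16))
        hists.append(hist / max(hist.sum(), 1))
    return np.concatenate(hists)   # 4 orientations x 16 bins = 64-D vector
```

In practice the histograms would typically be computed per spatial block rather than over the whole image, but the block grid is omitted here for brevity.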
Vision-language models have been widely explored across a wide range of tasks and have achieved satisfactory performance. However, it remains under-explored how to consolidate entity understanding from a varying number of images and align it with pre-trained language models for generative tasks. In this paper, we propose MIVC, a general multiple-instance visual component that bridges the gap between varying image inputs and off-the-shelf vision-language models by aggregating visual representations in a permutation-invariant fashion through a neural network. We show that MIVC can be plugged into vision-language models to consistently improve performance on visual question answering, classification, and captioning tasks on a publicly available e-commerce dataset with multiple images per product. Furthermore, we show that the component provides insight into the contribution of each image to the downstream tasks.
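As a concrete picture of "aggregating visual representations in a permutation-invariant fashion through a neural network", the sketch below uses a generic attention-pooling layer over per-image embeddings. It is not the exact MIVC architecture, and all dimensions and layer names are assumptions; the attention weights merely hint at how per-image contributions could be inspected.

```python
import torch
import torch.nn as nn

# Generic permutation-invariant pooling over a variable number of per-image
# embeddings (dimensions and names are placeholders, not the MIVC design).
class AttentionPool(nn.Module):
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, image_embs, mask=None):
        # image_embs: (batch, n_images, dim); mask: (batch, n_images), 1 = real image
        scores = self.score(image_embs).squeeze(-1)                 # (batch, n_images)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = scores.softmax(dim=-1)                            # attention over images
        pooled = (weights.unsqueeze(-1) * image_embs).sum(dim=1)    # (batch, dim)
        return pooled, weights   # weights expose each image's contribution
```

The pooled vector can then be passed through whatever visual projection the host vision-language model already uses, so the rest of the pipeline is unchanged.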
In this paper, we propose a leg-driven physiology framework for pedestrian detection. The framework is introduced to reduce the search space of candidate regions of pedestrians. Given a set of vertical line segments, we generate a space of rectangular candidate regions based on a model of body proportions. The proposed framework can be used either with or without learning-based pedestrian detection methods to validate the candidate regions. A symmetry constraint is then applied to validate each candidate region and decrease the false positive rate. The experiments demonstrate the promising results of the proposed method in comparison with the Dalal & Triggs method. For example, the rectangular regions detected by the proposed method have areas much closer to the ground truth than those detected by the Dalal & Triggs method.
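A minimal sketch of the leg-driven idea follows: each vertical line segment is treated as a candidate leg, a full-body rectangle is hypothesised above it from body-proportion ratios, and a simple left-right symmetry score on the gradient magnitude filters the candidates. All ratios and thresholds are illustrative assumptions, not the paper's values.

```python
import numpy as np

def candidate_from_leg(x, y_top, y_bottom, leg_ratio=0.5, aspect=0.41):
    # Hypothesise a full-body box from a vertical (leg) segment, assuming the
    # legs span roughly half the body height and a typical pedestrian aspect ratio.
    leg_len = y_bottom - y_top
    height = leg_len / leg_ratio
    width = aspect * height
    return int(x - width / 2), int(y_bottom - height), int(x + width / 2), int(y_bottom)

def symmetry_score(grad_mag, box):
    # Compare the left half of the gradient-magnitude patch with the mirrored
    # right half; higher scores mean a more left-right symmetric region.
    x1, y1, x2, y2 = box
    patch = grad_mag[max(y1, 0):y2, max(x1, 0):x2]
    if patch.size == 0 or patch.shape[1] < 2:
        return 0.0
    half = patch.shape[1] // 2
    left, right = patch[:, :half], patch[:, -half:][:, ::-1]
    return 1.0 / (1.0 + np.abs(left - right).mean())

# Candidates whose symmetry score falls below a threshold would be rejected
# before (or instead of) running a learned classifier such as HOG+SVM.
```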
In recent years, domain adaptation techniques have been widely used to adapt face anti-spoofing models to a cross-scenario target domain. Most previous methods assume that the Presentation Attack Instruments (PAIs) in such a cross-scenario target domain are the same as those in the source domain. However, as malicious users are free to use any form of unknown PAI to attack the system, this assumption does not always hold in practical applications of face anti-spoofing. Thus, unknown PAIs inevitably lead to significant performance degradation, since samples of known and unknown PAIs usually differ substantially. In this paper, we propose an Evidential Semantic Consistency Learning (ESCL) framework to address this problem. Specifically, a regularized evidential deep learning strategy with a two-way balance of class probability and uncertainty is leveraged to produce uncertainty scores for unknown PAI detection. Meanwhile, an entropy-optimization-based semantic consistency learning strategy is also employed to encourage features of live faces and known PAIs to gather in label-conditioned clusters across the source and target domains, while making the features of unknown PAIs self-cluster according to their intrinsic semantic information. In addition, a new evaluation metric, KUHAR, is proposed to comprehensively evaluate the error rate on known classes and unknown PAIs. Extensive experimental results on six public datasets demonstrate the effectiveness of our method in generalizing face anti-spoofing models to both known classes and unknown PAIs of varying types and quantities in a cross-scenario testing domain. Our method achieves state-of-the-art performance on eight different protocols.
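For readers unfamiliar with evidential deep learning, the sketch below shows how a per-sample uncertainty score of the kind used for unknown-PAI detection can be computed from classifier logits via a Dirichlet parameterisation. The ESCL regularization and the entropy-based consistency terms are omitted, and the threshold mentioned in the comment is a placeholder.

```python
import torch
import torch.nn.functional as F

def evidential_outputs(logits):
    # Derive non-negative evidence from the logits, form a Dirichlet, and use
    # its vacuity u = K / sum(alpha) as the per-sample uncertainty score.
    evidence = F.softplus(logits)                       # non-negative evidence per class
    alpha = evidence + 1.0                              # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    prob = alpha / strength                             # expected class probabilities
    K = logits.size(-1)
    uncertainty = K / strength.squeeze(-1)              # vacuity: high for unknown PAIs
    return prob, uncertainty

# A target-domain sample would be flagged as an unknown PAI when its
# uncertainty exceeds a validation-chosen threshold, e.g. uncertainty > 0.7.
```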
To improve the real-time performance of the meanshift algorithm on an embedded system, an improved meanshift algorithm for tracking moving targets is proposed in this paper. To reduce the influence of background pixels in the target model, the target model is built from the target models of consecutive frames; to reduce the number of iterations, a Kalman filter is used to predict the position of the moving object in the current frame; and to improve the accuracy of the target model, it is updated in real time. Finally, the improved algorithm is implemented on a DM6437 platform, and the experimental results show that it can track moving objects effectively.
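The following desktop OpenCV sketch mirrors the loop described above (Kalman prediction of the search window, meanshift refinement, then an on-line update of the colour model). It is not the DM6437 implementation; the multi-frame model construction is simplified to a single running average, and the learning rate is an assumed value.

```python
import cv2
import numpy as np

def make_kalman(x, y):
    kf = cv2.KalmanFilter(4, 2)              # state: x, y, vx, vy; measurement: x, y
    kf.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                                    [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
    kf.statePost = np.array([[x], [y], [0], [0]], np.float32)
    return kf

def track(frames, init_box, lr=0.1):
    x, y, w, h = init_box
    kf = make_kalman(x + w / 2, y + h / 2)
    hsv_roi = cv2.cvtColor(frames[0][y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [16], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    for frame in frames[1:]:
        cx, cy = kf.predict()[:2].ravel()                       # predicted centre
        box = (int(cx - w / 2), int(cy - h / 2), w, h)           # start meanshift there
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _, box = cv2.meanShift(back, box, term)                  # refine the window
        bx, by, bw, bh = box
        kf.correct(np.array([[bx + bw / 2], [by + bh / 2]], np.float32))
        new_roi = cv2.cvtColor(frame[by:by + bh, bx:bx + bw], cv2.COLOR_BGR2HSV)
        new_hist = cv2.calcHist([new_roi], [0], None, [16], [0, 180])
        cv2.normalize(new_hist, new_hist, 0, 255, cv2.NORM_MINMAX)
        hist = ((1 - lr) * hist + lr * new_hist).astype(np.float32)  # on-line model update
        yield box
```

Starting meanshift at the Kalman-predicted centre is what cuts the iteration count: the search window is already close to the target, so convergence typically takes only a few shifts.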