Gender is an important cue in social activities. In this correspondence, we present a study and analysis of gender classification based on human gait. Psychological experiments were carried out, showing that humans can recognize gender from gait information and that the contributions of different body components vary. The prior knowledge extracted from these experiments can be combined with an automatic method to further improve classification accuracy. The proposed method, which incorporates this human knowledge, achieves higher performance than several other methods and is even more accurate than human observers. We also present a numerical analysis of the contributions of different body components, which shows that the head and hair, back, chest, and thigh are more discriminative than other components. We further conducted challenging cross-race experiments, using Asian gait data to classify the gender of Europeans and vice versa, and obtained encouraging results. These results demonstrate that gait-based gender classification is feasible in controlled environments. In real applications, it still faces many difficulties, such as view variation, changes in clothing and shoes, and carried objects. We analyze these difficulties and suggest possible solutions.
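As a hedged illustration of how such component-level prior knowledge might be injected into an automatic classifier, the sketch below weights horizontal bands of a gait representation before classification. The band boundaries, the weight values, and the use of a linear SVM are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical prior weights per horizontal body band (top to bottom),
# loosely reflecting the finding that head/hair, back, chest, and thigh
# are more discriminative; the exact values here are illustrative only.
BAND_WEIGHTS = np.array([1.5, 1.3, 1.2, 1.0, 1.2, 0.8, 0.7])

def weighted_gait_feature(gei, weights=BAND_WEIGHTS):
    """Split a gait energy image (H x W) into horizontal bands and
    scale each band by its prior weight before flattening."""
    bands = np.array_split(gei, len(weights), axis=0)
    return np.concatenate([w * b.ravel() for w, b in zip(weights, bands)])

# Toy data: 20 random 64x44 "gait energy images" with binary gender labels.
rng = np.random.default_rng(0)
X = np.stack([weighted_gait_feature(rng.random((64, 44))) for _ in range(20)])
y = rng.integers(0, 2, size=20)

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(X[:5]))
```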
Video object detection is an important yet challenging topic in computer vision. Traditional methods mainly focus on designing image-level or box-level feature propagation strategies to exploit temporal information. This paper argues that, with a more effective and efficient feature propagation framework, video object detectors can improve in both accuracy and speed. To this end, this paper studies object-level feature propagation and proposes an object query propagation (QueryProp) framework for high-performance video object detection. QueryProp contains two propagation strategies: 1) queries are propagated from sparse key frames to dense non-key frames, reducing redundant computation on non-key frames; and 2) queries are propagated from previous key frames to the current key frame, improving feature representation through temporal context modeling. To further facilitate query propagation, an adaptive propagation gate is designed for flexible key frame selection. We conduct extensive experiments on the ImageNet VID dataset. QueryProp achieves accuracy comparable to state-of-the-art methods and strikes a decent accuracy/speed trade-off.
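A minimal sketch of the two propagation strategies and the adaptive gate follows, with stub functions standing in for the detection head; the drift criterion and all thresholds are illustrative assumptions, not QueryProp's actual gate.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_queries(frame):
    """Stub for the heavy per-frame detection head (run on key frames only)."""
    return rng.standard_normal((10, 256))  # 10 object queries, 256-d each

def propagate(queries, frame):
    """Stub for light-weight query refinement on non-key frames."""
    return queries + 0.01 * rng.standard_normal(queries.shape)

def adaptive_gate(cur, prev_key, threshold=0.02):
    """Decide whether the current frame becomes a new key frame, based on
    how far queries have drifted since the last key frame (illustrative)."""
    if cur is None or prev_key is None:
        return True
    return np.abs(cur - prev_key).mean() > threshold

frames = [np.zeros((224, 224, 3)) for _ in range(30)]  # placeholder frames
key_queries, queries = None, None
for frame in frames:
    if adaptive_gate(queries, key_queries):
        queries = extract_queries(frame)  # full computation on key frame
        if key_queries is not None:
            # Temporal context from the previous key frame (illustrative fusion).
            queries = 0.5 * queries + 0.5 * key_queries
        key_queries = queries
    else:
        queries = propagate(queries, frame)  # cheap propagation to non-key frame
```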
Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning. Although population-based learning has shown its potential for producing diverse behaviors, most previous work still focuses on improving the exploration of a single joint policy. In this paper, we show that, with a suitable exploration method, maintaining a population of joint policies rather than a single one can significantly improve exploration. Our key idea is to guide each member of the population to explore a different region of the environment. To this end, we propose a member-aware exploration objective that explicitly guides each member to maximize its deviation from the regions explored by the other members, forcing the members into different regions. In addition, we propose an exploration-enhanced policy constraint that guides each member toward a joint policy that both differs from the other members and promotes exploration, further increasing the probability of covering distinct regions. Under the reward-free setting, our method achieves a 72% average improvement in the number of explored states over classical exploration methods in the multiple-particle environment. Moreover, under the sparse-reward setting, the proposed method significantly outperforms state-of-the-art methods in the multiple-particle environment, Google Research Football, and StarCraft II micromanagement tasks.
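A toy, count-based sketch of the member-aware idea: each member's intrinsic reward rises for states it rarely visits, and rises further for states the other members rarely visit. The actual objective in the paper differs; `member_intrinsic_reward` and its exact form are assumptions for illustration.

```python
from collections import Counter
import numpy as np

def member_intrinsic_reward(state, own_counts, other_counts, beta=1.0):
    """Illustrative member-aware bonus: novelty w.r.t. the member's own
    visitation counts, amplified when other population members have also
    rarely visited the state (pushing members toward different regions)."""
    own_novelty = 1.0 / np.sqrt(own_counts[state] + 1)
    others_visits = sum(c[state] for c in other_counts)
    deviation = 1.0 / (others_visits + 1)  # high if no other member goes there
    return own_novelty * (1.0 + beta * deviation)

# Toy usage: a population of 3 members exploring integer-valued states.
counts = [Counter() for _ in range(3)]
rng = np.random.default_rng(0)
for step in range(100):
    for i in range(3):
        s = int(rng.integers(0, 10))  # stand-in for the state a member reaches
        r_int = member_intrinsic_reward(s, counts[i], counts[:i] + counts[i + 1:])
        counts[i][s] += 1
```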
The biologically inspired model (BIM) proposed by Serre presents a promising solution to object categorization. It emulates the process of object recognition in the primate visual cortex by constructing a set of scale- and position-tolerant features whose properties are similar to those of the cells along the ventral stream of the visual cortex. However, BIM can be further improved in two respects: mismatches caused by dense inputs, and random feature selection due to its purely feedforward framework. To alleviate these limitations, we develop an enhanced BIM (EBIM) that 1) removes uninformative inputs by imposing sparsity constraints and 2) applies a feedback loop to mid-level feature selection. Each aspect is motivated by relevant psychophysical research findings. To show the effectiveness of EBIM, we apply it to object categorization and conduct empirical studies on four computer vision data sets. Experimental results demonstrate that EBIM outperforms BIM and is comparable to state-of-the-art approaches in terms of accuracy. Moreover, the new system is about 20 times faster than BIM.
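A rough sketch of the first ingredient, imposing sparsity on the inputs: low-variance patches are treated as uninformative and zeroed out before feature extraction. The patch size, the variance criterion, and the keep ratio are illustrative assumptions rather than EBIM's actual constraint.

```python
import numpy as np

def sparsify_patches(image, patch=8, keep_ratio=0.3):
    """Keep only the most informative patches (highest variance) and zero
    out the rest, a crude stand-in for the sparsity constraint EBIM
    imposes on the inputs before feature extraction."""
    h, w = image.shape
    scores, coords = [], []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            block = image[i:i + patch, j:j + patch]
            scores.append(block.var())
            coords.append((i, j))
    keep = int(len(scores) * keep_ratio)
    order = np.argsort(scores)[::-1][:keep]  # indices of high-variance patches
    out = np.zeros_like(image)
    for idx in order:
        i, j = coords[idx]
        out[i:i + patch, j:j + patch] = image[i:i + patch, j:j + patch]
    return out

demo = np.random.default_rng(0).random((64, 64))
sparse = sparsify_patches(demo)  # dense input reduced to informative patches
```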
This paper presents a new training framework for multi-class moving object classification in surveillance-oriented scenes. In many practical multi-class classification tasks, instances that are close to each other in the input feature space, because they have similar features, may nevertheless carry different class labels. Since moving objects vary in view and shape, this phenomenon is common in multi-class moving object classification. In our framework, the input feature space is first divided into several local clusters. Global training and local training are then carried out sequentially with an efficient online learning algorithm. The induced global classifier assigns candidate instances to the most reliable clusters, and the local classifiers trained within those clusters determine which classes the candidate instances belong to. Our experimental results illustrate the effectiveness of our method for moving object classification in surveillance-oriented scenes.
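A compact sketch of the global/local pipeline under stated assumptions: k-means stands in for the clustering step, and `SGDClassifier` stands in for the efficient online learner used in both stages.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 16))  # toy moving-object features
y = rng.integers(0, 4, size=300)    # 4 object classes

# 1) Partition the input feature space into local clusters.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# 2) Global stage: an online classifier learns to route an instance to a
#    cluster (a stand-in for assigning it to the most reliable cluster).
router = SGDClassifier(loss="log_loss", random_state=0).fit(X, km.labels_)

# 3) Local stage: one online classifier per cluster learns the class labels.
local_clfs = {}
for c in range(5):
    mask = km.labels_ == c
    if len(np.unique(y[mask])) > 1:  # skip degenerate single-class clusters
        local_clfs[c] = SGDClassifier(loss="log_loss", random_state=0).fit(X[mask], y[mask])

def predict(x):
    c = int(router.predict(x.reshape(1, -1))[0])
    clf = local_clfs.get(c)
    return int(clf.predict(x.reshape(1, -1))[0]) if clf else -1

print(predict(X[0]))
```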
Dimension reduction has been widely used in real-world applications such as image retrieval and document classification. In many scenarios, different features (or multiview data) can be obtained, and how to best utilize them is a challenge. The conventional strategy of concatenating features from different views into a long vector is not appropriate, because each view has its own statistical properties and physical interpretation; even worse, the performance of the concatenating strategy deteriorates when some views are corrupted by noise. In this paper, we propose multiview stochastic neighbor embedding (m-SNE), which systematically integrates heterogeneous features into a unified representation for subsequent processing, based on a probabilistic framework. Compared with conventional strategies, our approach automatically learns a combination coefficient for each view, adapted to its contribution to the data embedding. This combination coefficient plays an important role in exploiting the complementary information in multiview data. Moreover, our algorithm for learning the combination coefficients converges at a rate of O(1/k^2), the optimal rate for smooth problems. Experiments on synthetic and real data sets suggest the effectiveness and robustness of m-SNE for data visualization, image retrieval, object categorization, and scene recognition.
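A simplified sketch of the combination-coefficient idea: per-view affinity matrices are mixed with weights constrained to the probability simplex. Fixed-bandwidth Gaussians replace the usual perplexity calibration, and the accelerated-gradient update behind the O(1/k^2) rate is omitted; only the mixing and the simplex projection are shown.

```python
import numpy as np

def view_affinity(X, sigma=1.0):
    """Symmetric Gaussian affinities for one view (simplified: a fixed
    sigma instead of per-point perplexity calibration)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def combine_views(Ps, alpha):
    """Combined probability P = sum_v alpha_v * P_v, with alpha on the simplex."""
    return sum(a * P for a, P in zip(alpha, Ps))

def project_simplex(v):
    """Euclidean projection onto the probability simplex (Duchi et al. style)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1
    rho = np.nonzero(u - css / np.arange(1, len(u) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0)

# Two toy views of the same 50 data points.
rng = np.random.default_rng(0)
views = [rng.standard_normal((50, 5)), rng.standard_normal((50, 8))]
Ps = [view_affinity(V) for V in views]
alpha = np.full(len(Ps), 1.0 / len(Ps))
# One illustrative projected update of the weights (the gradient is omitted).
alpha = project_simplex(alpha + 0.1 * rng.standard_normal(len(Ps)))
P = combine_views(Ps, alpha)
```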
The main challenges in local-feature-based image matching are variations in view and illumination. Many methods have recently been proposed to address these problems using invariant feature detectors and distinctive descriptors. However, matching performance is still unstable and inaccurate, particularly under large variations in view or illumination. In this paper, we propose a view- and illumination-invariant image-matching method. We iteratively estimate the relative view and illumination between the images, transform the view of one image to the other, and normalize their illumination for accurate matching. Our method does not aim to increase the invariance of the detector but to improve the accuracy, stability, and reliability of the matching results. Matching performance is significantly improved and is unaffected by changes of view and illumination within a valid range. The proposed method fails only when the initial estimation of view and illumination fails, which offers a new insight for evaluating traditional detectors. We therefore propose two novel indicators for detector evaluation, the valid angle and the valid illumination, which reflect the maximum allowable changes in view and illumination, respectively. Extensive experimental results show that our method improves traditional detectors significantly, even under large variations, and that the two indicators are much more distinctive.
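A hedged OpenCV sketch of the iterate-estimate-normalize loop: SIFT matches yield a homography, one image is warped toward the other's view, and a crude mean/std adjustment stands in for illumination normalization. The image paths, iteration count, and normalization scheme are placeholders, not the paper's implementation.

```python
import cv2
import numpy as np

def match_iteratively(img1, img2, n_iters=3):
    """Iteratively estimate the view change, warp img1 toward img2, and
    roughly normalize illumination before re-matching (illustrative)."""
    sift = cv2.SIFT_create()
    warped = img1.copy()
    H_total = np.eye(3)
    for _ in range(n_iters):
        k1, d1 = sift.detectAndCompute(warped, None)
        k2, d2 = sift.detectAndCompute(img2, None)
        matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)
        if len(matches) < 4:
            break
        src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        if H is None:
            break
        H_total = H @ H_total
        warped = cv2.warpPerspective(img1, H_total, img2.shape[::-1])
        # Crude illumination normalization: match mean/std to img2.
        s = img2.std() / (warped.std() + 1e-6)
        warped = cv2.convertScaleAbs(warped, alpha=s,
                                     beta=img2.mean() - warped.mean() * s)
    return H_total

# img1 = cv2.imread("view_a.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
# img2 = cv2.imread("view_b.png", cv2.IMREAD_GRAYSCALE)
# H = match_iteratively(img1, img2)
```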
We present a new method of computing invariants from videos captured from different views to achieve view-invariant action recognition. To avoid the constraints of collinearity or coplanarity of image points when constructing invariants, we consider several neighboring frames and compute cross ratios across frames (CRAF) as our invariant representation of action. For every five points sampled at different intervals from the action trajectories, we construct a pair of cross ratios (CRs). We then transform the CRs into histograms, which serve as the feature vectors for classification. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods in both effectiveness and stability.
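For concreteness, the sketch below computes one standard determinant-based pair of cross ratios for five points in general position; the specific point sampling and CR construction used by CRAF may differ.

```python
import numpy as np

def _det3(p, q, r):
    """Determinant of three points in homogeneous coordinates."""
    return np.linalg.det(np.stack([p, q, r]))

def cross_ratio_pair(pts):
    """One classical pair of projective invariants for five points in
    general position given as (x, y) pairs; an illustration of the
    cross-ratio idea, not necessarily CRAF's exact construction."""
    p1, p2, p3, p4, p5 = [np.append(np.asarray(q, float), 1.0) for q in pts]
    cr1 = (_det3(p1, p2, p4) * _det3(p1, p3, p5)) / \
          (_det3(p1, p2, p5) * _det3(p1, p3, p4))
    cr2 = (_det3(p2, p1, p4) * _det3(p2, p3, p5)) / \
          (_det3(p2, p1, p5) * _det3(p2, p3, p4))
    return cr1, cr2

# Five trajectory points sampled across neighboring frames (toy values).
print(cross_ratio_pair([(0, 0), (1, 0), (2, 1), (3, 3), (4, 1)]))
```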
Human behavior analysis is an important area of research in computer vision, driven by a wide spectrum of applications such as smart video surveillance and human-computer interaction. In this paper, we present a novel approach to human behavior analysis that addresses two research challenges: motion representation and behavior recognition. For motion representation, we propose a novel motion descriptor, an improved feature based on optical flow: the optical flow is refined with a motion filter and fused with shape and trajectory information. For behavior recognition, a support vector machine is trained with the concatenated histograms as input features. Experimental results on the Weizmann behavior database and the Institute of Automation, Chinese Academy of Sciences real-world multiview behavior database demonstrate the robustness and effectiveness of our method.
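A minimal sketch of the recognition side: dense Farneback optical flow is converted into orientation histograms, a magnitude threshold acts as a simple stand-in for the motion filter, per-frame histograms are concatenated, and a linear SVM is trained. The shape/trajectory fusion is omitted, and all parameters are illustrative.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def flow_histogram(prev_gray, gray, bins=8, mag_thresh=0.5):
    """Orientation histogram of dense optical flow; the magnitude
    threshold suppresses tiny, noisy motions (an assumed filter)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    keep = mag > mag_thresh
    hist, _ = np.histogram(ang[keep], bins=bins, range=(0, 2 * np.pi),
                           weights=mag[keep])
    return hist / (hist.sum() + 1e-6)

def video_feature(frames):
    """Concatenate per-frame-pair flow histograms into one feature vector."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    return np.concatenate([flow_histogram(a, b) for a, b in zip(grays, grays[1:])])

# Toy data: random 5-frame clips standing in for two behavior classes.
rng = np.random.default_rng(0)
clips = [[rng.integers(0, 255, (64, 64, 3), dtype=np.uint8) for _ in range(5)]
         for _ in range(8)]
X = np.stack([video_feature(c) for c in clips])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clf = SVC(kernel="linear").fit(X, y)
```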