In real video surveillance scenarios, visual pedestrian attributes, such as gender, backpack, and clothing type, are very important for pedestrian retrieval and person re-identification. Existing methods for attribute recognition have two drawbacks: (a) handcrafted features (e.g., color histograms, local binary patterns) cannot cope well with the difficulty of real video surveillance scenarios; (b) the relationships among pedestrian attributes are ignored. To address these two drawbacks, we propose two deep learning based models to recognize pedestrian attributes. On the one hand, each attribute is treated as an independent component and the deep learning based single attribute recognition model (DeepSAR) is proposed to recognize each attribute one by one. On the other hand, to exploit the relationships among attributes, a deep learning framework which recognizes multiple attributes jointly (DeepMAR) is proposed. In DeepMAR, one attribute can contribute to the representation of other attributes. For example, the attribute "female" can contribute to the representation of "long hair" and "wearing a skirt". Experiments on recent popular pedestrian attribute datasets illustrate that our proposed models achieve state-of-the-art results.
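As a rough illustration of the joint formulation, the sketch below shows a shared CNN backbone with one sigmoid output per attribute and a frequency-weighted multi-label loss; the backbone choice, attribute count, and exact weighting scheme are assumptions for illustration rather than the configuration used in DeepMAR.

```python
# Minimal sketch of joint multi-attribute recognition: one shared backbone,
# one logit per attribute, weighted BCE loss to handle class imbalance.
# Backbone, attribute count, and weighting are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class MultiAttributeNet(nn.Module):
    def __init__(self, num_attributes: int = 35):
        super().__init__()
        backbone = models.resnet18(weights=None)     # any ImageNet-style CNN works
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.head = nn.Linear(512, num_attributes)   # one logit per attribute

    def forward(self, x):
        return self.head(self.backbone(x))           # shared features -> all attributes jointly

def weighted_bce_loss(logits, targets, pos_ratio):
    """Re-weight positives/negatives by attribute frequency (pos_ratio: (num_attributes,))."""
    w_pos = torch.exp(1.0 - pos_ratio)   # rare positives get larger weight
    w_neg = torch.exp(pos_ratio)
    weights = targets * w_pos + (1 - targets) * w_neg
    return nn.functional.binary_cross_entropy_with_logits(logits, targets, weight=weights)

model = MultiAttributeNet()
imgs = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 2, (4, 35)).float()
loss = weighted_bce_loss(model(imgs), labels, pos_ratio=torch.full((35,), 0.2))
```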
Retrieving specific persons with various types of queries, e.g., a set of attributes or a portrait photo, has great application potential in large-scale intelligent surveillance systems. In this paper, we propose a richly annotated pedestrian (RAP) dataset which serves as a unified benchmark for both attribute-based and image-based person retrieval in real surveillance scenarios. Previous datasets typically have three shortcomings: limited data scale and annotation types, heterogeneous data sources, and controlled scenarios. In contrast, RAP is a large-scale dataset which contains 84,928 images with 72 types of attributes and additional tags of viewpoint, occlusion, body parts, and 2,589 person identities. It is collected in real, uncontrolled scenes and has complex visual variations in pedestrian samples due to changes of viewpoint, pedestrian posture, and clothing appearance. Towards a high-quality person retrieval benchmark, a number of state-of-the-art algorithms for pedestrian attribute recognition and person re-identification (ReID) are evaluated for quantitative analysis on three tasks, i.e., attribute recognition, attribute-based person retrieval, and image-based person retrieval, where a new instance-based metric is proposed to measure the dependency of the predictions of multiple attributes. Finally, some interesting problems, e.g., joint feature learning for attribute recognition and ReID, and cross-day person ReID, are explored to show the challenges and future directions in person retrieval.
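As an illustration of instance-based (example-based) evaluation for multi-attribute prediction, the sketch below computes per-sample accuracy, precision, recall, and F1; it follows the common example-based definitions and is not necessarily the exact metric proposed in the paper.

```python
# Example-based metrics: scores are computed per pedestrian image over its
# full attribute vector, then averaged over images.
import numpy as np

def example_based_metrics(pred, gt):
    """pred, gt: binary arrays of shape (num_samples, num_attributes)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = (pred & gt).sum(axis=1).astype(float)
    union = (pred | gt).sum(axis=1).astype(float)
    acc = np.where(union > 0, inter / np.maximum(union, 1), 1.0).mean()
    prec = np.where(pred.sum(axis=1) > 0, inter / np.maximum(pred.sum(axis=1), 1), 1.0).mean()
    rec = np.where(gt.sum(axis=1) > 0, inter / np.maximum(gt.sum(axis=1), 1), 1.0).mean()
    f1 = 2 * prec * rec / max(prec + rec, 1e-12)
    return acc, prec, rec, f1
```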
Person re-identification (ReID) aims to identify the same person across different cameras. It is a challenging task due to large variations in person pose, occlusion, background clutter, etc. How to extract powerful features is a fundamental problem in ReID and is still an open problem today. In this paper, we design a Multi-Scale Context-Aware Network (MSCAN) to learn powerful features over the full body and body parts, which can well capture local context knowledge by stacking multi-scale convolutions in each layer. Moreover, instead of using predefined rigid parts, we propose to learn and localize deformable pedestrian parts using Spatial Transformer Networks (STN) with novel spatial constraints. The learned body parts can alleviate some difficulties, e.g., pose variations and background clutter, in part-based representation. Finally, we integrate the representation learning of the full body and body parts into a unified framework for person ReID through multi-class person identification tasks. Extensive evaluations on current challenging large-scale person ReID datasets, including the image-based Market1501 and CUHK03 and the sequence-based MARS dataset, show that the proposed method achieves state-of-the-art results.
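A minimal sketch of a multi-scale convolutional block in the spirit of MSCAN is given below: parallel 3x3 convolutions with different dilation rates are concatenated so that each layer covers several receptive-field sizes. The channel sizes and dilation rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convolutions with different dilation rates, concatenated,
    so each layer sees multiple receptive-field sizes (multi-scale context)."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3)):
        super().__init__()
        branch_ch = out_ch // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)

x = torch.randn(2, 32, 56, 56)
y = MultiScaleBlock(32, 96)(x)   # -> (2, 96, 56, 56), spatial size preserved
```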
Person re-identification (ReID) is the task of retrieving particular persons across different cameras. Despite great progress in recent years, it is still confronted with challenges like pose variation, occlusion, and similar appearance among different persons. The large gap between training and testing performance with existing models implies insufficient generalization. Considering this fact, we propose to augment the variation of training data by introducing Adversarially Occluded Samples. These special samples are both a) meaningful, in that they resemble real-scene occlusions, and b) effective, in that they are tough for the original model and thus provide the momentum to jump out of local optima. We mine these samples based on a trained ReID model and with the help of network visualization techniques. Extensive experiments show that the proposed samples help the model discover new discriminative clues on the body and generalize much better at test time. Our strategy makes significant improvements over strong baselines on three large-scale ReID datasets: Market1501, CUHK03, and DukeMTMC-reID.
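One hypothetical way to realize such occluded samples is sketched below: black out the image region where the trained model's activations peak, so the augmented sample hides the clue the model currently relies on. The activation pooling and patch size are assumptions, and the paper's actual mining procedure may differ.

```python
import torch
import torch.nn.functional as F

def occlude_most_discriminative(img, feat_map, patch=32):
    """Return an 'adversarially occluded' copy of img by blacking out the
    region the trained model relies on most (peak of a pooled activation map).
    img: (3, H, W); feat_map: (C, h, w) from the trained ReID model."""
    heat = feat_map.mean(dim=0, keepdim=True).unsqueeze(0)          # (1, 1, h, w)
    heat = F.interpolate(heat, size=img.shape[1:], mode="bilinear",
                         align_corners=False)[0, 0]                 # (H, W)
    y, x = divmod(int(torch.argmax(heat)), heat.shape[1])           # peak location
    occluded = img.clone()
    y0, x0 = max(0, y - patch // 2), max(0, x - patch // 2)
    occluded[:, y0:y0 + patch, x0:x0 + patch] = 0.0                 # drop the clue
    return occluded
```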
Recognizing pedestrian attributes, such as gender, backpack, and clothing type, has received increasing attention recently due to its great potential in intelligent video surveillance. Existing methods usually solve it with end-to-end multi-label deep neural networks, while the structural knowledge of the pedestrian body has been little utilized. Considering that attributes have strong spatial correlations with human structure, e.g., glasses are around the head, in this paper we introduce pedestrian body structure into this task and propose a Pose Guided Deep Model (PGDM) to improve attribute recognition. The PGDM consists of three main components: 1) coarse pose estimation, which distills pose knowledge from a pre-trained pose estimation model; 2) body part localization, which adaptively locates informative image regions with only image-level supervision; 3) multiple-feature fusion, which combines the part-based features for attribute recognition. In the inference stage, we fuse the part-based PGDM results with the global, whole-body results for the final attribute prediction, and the performance is consistently improved. Comparisons with state-of-the-art models on three large-scale pedestrian attribute datasets, i.e., PETA, RAP, and PA-100K, demonstrate the effectiveness of the proposed method.
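A minimal sketch of the late fusion step is shown below: part-based and whole-body attribute scores are combined before thresholding. The fusion weight is an assumed value for illustration, not one taken from the paper.

```python
import torch

def fuse_predictions(global_logits, part_logits, alpha=0.5):
    """Late fusion of whole-body and part-based attribute scores.
    global_logits, part_logits: (batch, num_attributes); alpha is an assumed
    fusion weight, not a value from the paper."""
    probs = alpha * torch.sigmoid(global_logits) + (1 - alpha) * torch.sigmoid(part_logits)
    return (probs > 0.5).long()     # final per-attribute decisions
```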
Fashion retrieval in video suffers from imperfect visual representation and low-quality search results in e-commerce scenarios. Previous works generally focus on searching for identical images from the visual perspective only, but fail to leverage multi-modal information for high-quality commodities. As a cross-domain problem, instructional or exhibiting audio reveals rich semantic information that can facilitate the video-to-shop task. In this paper, we present a novel Visual-Audio Composition Alignment Network (VACANet) for quality fashion retrieval in video. Firstly, we introduce the visual-audio composition module in VACANet, which aims to distinguish attentive and residual entities by learning semantic embeddings from both visual and audio streams. Secondly, a quality alignment training scheme is designed with quality-aware triplet mining and a domain alignment constraint for video-to-image adaptation. Finally, extensive experiments conducted on challenging video datasets demonstrate the scalable effectiveness of our model for quality fashion retrieval.
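The sketch below gives one hypothetical form of quality-aware triplet mining: batch-hard positives and negatives are selected only among samples whose quality score passes a threshold. The quality score, threshold, and margin are assumptions for illustration and may differ from the paper's scheme.

```python
import torch
import torch.nn.functional as F

def quality_aware_triplet_loss(emb, labels, quality, margin=0.3, q_thresh=0.5):
    """Batch-hard triplet loss restricted to sufficiently high-quality samples.
    emb: (B, D) embeddings; labels: (B,) identity labels;
    quality: (B,) assumed per-sample quality scores in [0, 1]."""
    dist = torch.cdist(emb, emb)                                   # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    good = quality >= q_thresh                                     # ignore low-quality candidates
    valid = good.unsqueeze(0) & good.unsqueeze(1)
    pos_d = torch.where(same & valid, dist, torch.zeros_like(dist)).max(dim=1).values
    neg_d = torch.where(~same & valid, dist, torch.full_like(dist, 1e6)).min(dim=1).values
    return F.relu(pos_d - neg_d + margin).mean()
```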
In many real-world datasets, such as WebVision, the performance of DNN-based classifiers is often limited by noisy labeled data. To tackle this problem, image-related side information, such as captions and tags, can reveal underlying relationships across images. In this paper, we present an efficient weakly supervised learning approach using a Side Information Network (SINet), which aims to effectively carry out large-scale classification with severely noisy labels. The proposed SINet consists of a visual prototype module and a noise weighting module. The visual prototype module is designed to generate a compact representation for each category by introducing the side information. The noise weighting module aims to estimate the correctness of each noisily labeled image and produce a confidence score for image ranking during the training procedure. The proposed SINet can largely alleviate the negative impact of noisy image labels and is beneficial for training a high-performance CNN-based classifier. Besides, we release a fine-grained product dataset called AliProducts, which contains more than 2.5 million noisy web images crawled from the internet using queries generated from 50,000 fine-grained semantic classes. Extensive experiments on several popular benchmarks (i.e., WebVision, ImageNet, and Clothing-1M) and our proposed AliProducts show state-of-the-art performance. SINet won first place in the classification task of the WebVision Challenge 2019, outperforming other competitors by a large margin.
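A rough sketch of prototype-based noise weighting is given below: each image's loss is down-weighted according to its similarity to the prototype of its labeled class. The specific weighting formula is an illustrative assumption, not the exact design of SINet.

```python
import torch
import torch.nn.functional as F

def prototype_weighted_loss(features, logits, labels, prototypes, tau=0.1):
    """Down-weight likely mislabeled images via similarity to class prototypes.
    features: (B, dim); logits: (B, num_classes); labels: (B,);
    prototypes: (num_classes, dim), e.g. built with the help of side information."""
    feats = F.normalize(features, dim=1)
    protos = F.normalize(prototypes, dim=1)
    sim = (feats * protos[labels]).sum(dim=1)          # cosine similarity to labeled class
    weight = torch.sigmoid(sim / tau)                  # low similarity -> low confidence
    ce = F.cross_entropy(logits, labels, reduction="none")
    return (weight * ce).mean()
```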
In this paper, we propose a unified multi-modal retrieval framework to tackle two typical video understanding tasks, i.e., matching movie scenes with text descriptions, and scene sentiment classification. Matching movie scenes and text descriptions is a natural multi-modal retrieval problem, while for scene sentiment classification the framework aims to find the most related sentiment tag for each movie scene, which is also a multi-modal retrieval problem. By casting both tasks as multi-modal retrieval, the proposed unified framework can make full use of models pre-trained on large-scale multi-modal datasets; experiments show that this is critical for tasks that have only hundreds of training examples. To further improve performance on movie video understanding, we also collect a large-scale video-text dataset, which contains 427,603 movie-shot and text pairs. Experimental results validate the effectiveness of this dataset.
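The sketch below illustrates the retrieval view of both tasks: scene and text (or sentiment-tag) embeddings from pre-trained encoders are compared by cosine similarity and ranked. The embedding dimension and the encoders producing the embeddings are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_embs, top_k=5):
    """Rank gallery items (text descriptions or sentiment-tag texts) by cosine
    similarity to a scene embedding; both embeddings are assumed to come from
    pre-trained multi-modal encoders."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    scores = g @ q                                      # (num_gallery,)
    return torch.topk(scores, k=min(top_k, g.shape[0])).indices

scene_emb = torch.randn(512)                            # embedding of a movie scene
tag_embs = torch.randn(8, 512)                          # embeddings of 8 sentiment tags
best_tag = retrieve(scene_emb, tag_embs, top_k=1)       # sentiment classification as retrieval
```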
With the development of science and technology and increasing labor costs, fruit collection is gradually shifting to mechanized operation. Existing small-fruit collecting devices have bulky and complex structures, need high-power drives, and impose strict requirements on tree spacing. To improve mechanical collecting devices for small fruits, a collecting device with a flank deployable and foldable mechanism is introduced in this paper. The collecting device consists of three parts: the flank deployable and foldable mechanism, a lifting mechanism, and a mobile clamping mechanism. The kinematics of the flank deployable and foldable mechanism, which is the core mechanism of the collecting device, was analyzed, and the main components of the mechanism were optimized using a genetic algorithm (GA) in MATLAB. The optimal dimensions of the main components are l_AB = 277.88 mm, l_AD = 661.64 mm, and l_PE = 306.58 mm. The 3D model of the flank deployable and foldable mechanism was imported into the dynamic analysis software ADAMS; the motion trajectory of the mechanism was simulated, and the dynamic simulation showed that the driving torque was 2.64 N·m. A prototype of the flank deployable and foldable mechanism was manufactured and its trajectory was recorded by a high-speed photography system. The trajectories from the experiment and the virtual simulation were basically consistent, which verified the design of the collecting device.
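As an illustration of the dimension-optimization step (the paper performs it with a GA in MATLAB), the Python sketch below runs a generic real-coded genetic algorithm over the three link lengths; the objective function, bounds, and GA settings are hypothetical placeholders rather than the paper's actual formulation.

```python
import random

def genetic_optimize(objective, bounds, pop_size=50, generations=200, mutation_rate=0.1):
    """Generic real-coded GA for tuning link lengths (l_AB, l_AD, l_PE).
    The objective and settings are hypothetical placeholders."""
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=objective)                           # minimize the objective
        parents = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]   # arithmetic crossover
            if random.random() < mutation_rate:
                i = random.randrange(len(child))
                lo, hi = bounds[i]
                child[i] = random.uniform(lo, hi)         # uniform mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=objective)

# Hypothetical usage: placeholder objective over assumed mm bounds for the three links.
bounds = [(200.0, 400.0), (500.0, 800.0), (250.0, 400.0)]
best = genetic_optimize(lambda dims: abs(dims[0] + dims[2] - 0.9 * dims[1]), bounds)
```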