Occluded person re-identification (Re-ID) aims to address the occlusion problem when matching occluded or holistic pedestrians across different camera views. Many methods use background crops as artificial occlusion and rely on attention networks to suppress the resulting noise. However, the significant discrepancy between simple background occlusion and realistic occlusion can hurt the generalization of the network. To address this issue, we propose a novel transformer-based Attention Disturbance and Dual-Path Constraint Network (ADP) to enhance the generalization of attention networks. First, to imitate real-world obstacles, we introduce an Attention Disturbance Mask (ADM) module that generates offensive noise, which can distract attention like a realistic occluder, as a more complex form of occlusion. Second, to fully exploit these complex occluded images, we develop a Dual-Path Constraint Module (DPC) that obtains preferable supervision from holistic images through dual-path interaction. With our proposed method, the network can effectively circumvent a wide variety of occlusions using a basic ViT baseline. Comprehensive experiments on person re-ID benchmarks demonstrate the superiority of ADP over state-of-the-art methods.
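To make the ADM idea concrete, the sketch below (PyTorch, illustrative only) shows one plausible way to paste a learnable noise patch onto training images; the class name, patch size, and training signal are assumptions, not the ADP implementation.

```python
import torch
import torch.nn as nn

class AttentionDisturbance(nn.Module):
    """Learnable noise patch pasted onto training images as synthetic occlusion.

    Unlike a fixed background crop, the patch is trainable, so it can be pushed
    (by gradient ascent on the attention it receives) to distract the backbone
    the way a real occluder would.
    """
    def __init__(self, channels=3, patch_h=64, patch_w=64):
        super().__init__()
        self.noise = nn.Parameter(torch.randn(channels, patch_h, patch_w) * 0.1)

    def forward(self, images):
        _, _, h, w = images.shape
        ph, pw = self.noise.shape[1:]
        top = torch.randint(0, h - ph + 1, (1,)).item()
        left = torch.randint(0, w - pw + 1, (1,)).item()
        occluded = images.clone()
        occluded[:, :, top:top + ph, left:left + pw] = self.noise
        return occluded

imgs = torch.rand(4, 3, 256, 128)    # person crops (H=256, W=128)
occ = AttentionDisturbance()(imgs)   # images carrying the disturbance patch
# In training, the patch would be updated by gradient ascent on the attention
# mass it draws from the ViT, while the Re-ID loss is minimized on `occ`.
```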
Visible-infrared person re-identification (VI-ReID) is the task of matching the same individuals across the visible and infrared modalities. Its main challenge lies in the modality gap caused by cameras operating on different spectra. Existing VI-ReID methods mainly focus on learning modality-general features, often at the expense of feature discriminability. To address this issue, we present a novel cycle-construction-based network for neutral yet discriminative feature learning, termed CycleTrans. Specifically, CycleTrans uses a lightweight Knowledge Capturing Module (KCM) to capture rich semantics from the modality-relevant feature maps according to pseudo queries. Afterwards, a Discrepancy Modeling Module (DMM) transforms these features into neutral ones according to the modality-irrelevant prototypes. To ensure feature discriminability, two additional KCMs are deployed for feature cycle construction. With cycle construction, our method can learn effective neutral features for visible and infrared images while preserving their salient semantics. Extensive experiments on the SYSU-MM01 and RegDB datasets validate the merits of CycleTrans against state-of-the-art methods, with gains of +4.57% Rank-1 on SYSU-MM01 and +2.2% Rank-1 on RegDB.
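As a rough illustration of how pseudo queries might capture semantics from a feature map, here is a minimal cross-attention sketch; the module name, dimensions, and query count are assumed for the example and do not come from the released CycleTrans code.

```python
import torch
import torch.nn as nn

class KnowledgeCapturingModule(nn.Module):
    """Learnable pseudo queries attend over a backbone feature map and pull
    out a fixed set of semantic vectors."""
    def __init__(self, dim=256, num_queries=8, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat_map):
        # feat_map: (B, C, H, W) modality-specific features
        tokens = feat_map.flatten(2).transpose(1, 2)               # (B, H*W, C)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        captured, _ = self.attn(q, tokens, tokens)                 # (B, Nq, C)
        return captured

kcm = KnowledgeCapturingModule()
vis_feat = torch.rand(2, 256, 24, 8)   # e.g. a visible-branch feature map
semantics = kcm(vis_feat)              # (2, 8, 256) captured semantic vectors
```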
In this paper, we develop a new navigation system that detects obstacles within a sliding window using an adaptive-threshold clustering algorithm, classifies the detected obstacles with a decision tree, heuristically predicts potential collisions, and finds an optimal path with a simplified Morphin algorithm. Compared with the state of the art in robot navigation, this system offers an optimal collision-free path, a small memory footprint, and low computational complexity. Experiments in simulation and on a real robot across eight scenarios demonstrate that the robot can effectively and efficiently avoid potential collisions with static or dynamic obstacles in its surrounding environment.
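The adaptive-threshold clustering step can be illustrated with a short range-scan sketch; the threshold constants and scan layout below are toy assumptions, not values from the paper.

```python
import numpy as np

def adaptive_cluster(ranges, angles, base_gap=0.05, scale=0.03):
    """Cluster consecutive scan points within one sliding window: adjacent
    points join the same cluster when their Euclidean gap falls below a
    threshold that grows with range, so distant obstacles are not fragmented."""
    xs, ys = ranges * np.cos(angles), ranges * np.sin(angles)
    clusters, current = [], [0]
    for i in range(1, len(ranges)):
        gap = np.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1])
        if gap <= base_gap + scale * ranges[i]:
            current.append(i)
        else:
            clusters.append(current)
            current = [i]
    clusters.append(current)
    return clusters   # each cluster is a candidate obstacle for the decision tree

# Toy scan: two walls at different distances split into two clusters
angles = np.linspace(-0.5, 0.5, 21)
ranges = np.where(angles < 0, 1.0, 2.0)
print(adaptive_cluster(ranges, angles))
```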
Image captioning has attracted ever-increasing research attention in multimedia and computer vision. To encode the visual content, existing approaches typically utilize off-the-shelf deep Convolutional Neural Network (CNN) models to extract visual features, which are fed to Recurrent Neural Network (RNN) based textual generators to output word sequences. More recently, some methods have encoded visual objects and scene information with attention mechanisms. Despite this promising progress, one distinct limitation lies in distinguishing and modeling key semantic entities and their relations, which are widely regarded as important cues for describing image content. In this paper, we propose a novel image captioning model, termed StructCap. It parses a given image into key entities and their relations, organized in a visual parsing tree, which is transformed and embedded under an encoder-decoder framework via visual attention. We give an end-to-end formulation to facilitate joint training of the visual tree parser, the structured semantic attention, and the RNN-based captioning module. Experimental results on two public benchmarks, Microsoft COCO and Flickr30K, show that the proposed StructCap model outperforms state-of-the-art approaches under various standard evaluation metrics.
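For clarity, here is a minimal sketch of a visual parsing tree as a data structure and the depth-first linearization that an encoder-decoder with structured attention could consume; the node fields and the example tree are illustrative assumptions rather than StructCap's parser output.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VPNode:
    """One node of a visual parsing tree: a detected entity, or a relation
    grouping its child entities."""
    label: str
    feature: List[float]                     # pooled region feature for the node
    children: List["VPNode"] = field(default_factory=list)

def linearize(node: VPNode) -> List[VPNode]:
    """Depth-first traversal turning the tree into the node sequence that the
    structured attention would attend over during caption decoding."""
    order = [node]
    for child in node.children:
        order.extend(linearize(child))
    return order

# Toy tree for "a dog lying on a sofa"
tree = VPNode("on", [0.0], [VPNode("dog", [0.2]), VPNode("sofa", [0.7])])
print([n.label for n in linearize(tree)])    # ['on', 'dog', 'sofa']
```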
In this work, we propose a high-fidelity face swapping method, called HifiFace, which preserves the face shape of the source well and generates photo-realistic results. Unlike existing face swapping works that use only a face recognition model to maintain identity similarity, we propose a 3D shape-aware identity to control the face shape with geometric supervision from a 3DMM and a 3D face reconstruction method. Meanwhile, we introduce the Semantic Facial Fusion module to optimize the combination of encoder and decoder features and perform adaptive blending, yielding more photo-realistic results. Extensive experiments on faces in the wild demonstrate that our method preserves identity better, especially the face shape, and generates more photo-realistic results than previous state-of-the-art methods. Code is available at: https://johann.wang/HifiFace
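A minimal sketch of what a 3D shape-aware identity vector could look like is given below: a face recognition embedding is concatenated with 3DMM shape coefficients and projected. All dimensions and the module name are assumptions; the actual HifiFace pipeline and its losses differ.

```python
import torch
import torch.nn as nn

class ShapeAwareIdentity(nn.Module):
    """Fuse a face recognition embedding with 3DMM shape coefficients so that
    the swapping generator is conditioned on identity *and* face shape."""
    def __init__(self, id_dim=512, shape_dim=80, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(id_dim + shape_dim, out_dim)

    def forward(self, id_embed, shape_coeffs):
        # id_embed: (B, id_dim) from a recognition model on the source face
        # shape_coeffs: (B, shape_dim) regressed by a 3D face reconstruction net
        return self.proj(torch.cat([id_embed, shape_coeffs], dim=-1))

sid = ShapeAwareIdentity()
vec = sid(torch.rand(1, 512), torch.rand(1, 80))   # (1, 512) conditioning vector
```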
Recently, person re-identification (re-ID) has attracted increasing research attention, with broad application prospects in video surveillance and beyond. To date, most existing methods rely heavily on well-aligned pedestrian images and hand-engineered part-based models built on the coarsest feature map. In this paper, to relax this restriction of fixed and coarse input alignment, we propose an end-to-end part power set model with multi-scale features, which captures the discriminative parts of pedestrians from global to local and from coarse to fine, enabling part-based, scale-free person re-ID. In particular, we first factorize the visual appearance by enumerating $k$-combinations for all $k$ of $n$ body parts to exploit rich global and partial information and learn discriminative feature maps. Then, a combination ranking module is introduced to guide model training with all combinations of body parts, alternating between ranking combinations and estimating an appearance model. To enable scale-free input, we further exploit the pyramid architecture of deep networks to construct multi-scale feature maps at a modest extra cost in terms of memory and time. Extensive experiments on the mainstream evaluation datasets, including Market-1501, DukeMTMC-reID and CUHK03, validate that our method achieves state-of-the-art performance.
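The part power set itself is easy to make concrete: given per-part features, every non-empty $k$-combination of the $n$ parts is pooled into one descriptor, as in the sketch below (pooling by averaging is an assumption for illustration; the paper's combination ranking module then scores these combinations).

```python
from itertools import combinations
import torch

def part_power_set_features(part_feats):
    """part_feats: (n_parts, dim) per-part features.
    Returns the index tuples and average-pooled features of all 2^n - 1
    non-empty part combinations."""
    n = part_feats.size(0)
    combos, feats = [], []
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            combos.append(idx)
            feats.append(part_feats[list(idx)].mean(dim=0))
    return combos, torch.stack(feats)

parts = torch.rand(4, 256)               # e.g. head / torso / legs / feet features
combos, combo_feats = part_power_set_features(parts)
print(len(combos), combo_feats.shape)    # 15 torch.Size([15, 256])
```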
Transformer-based architectures have shown great success in image captioning, where object regions are encoded and then attended to form vectorial representations that guide caption decoding. However, such vectorial representations contain only region-level information without considering the global information reflecting the entire image, which limits the capability for complex multi-modal reasoning in image captioning. In this paper, we introduce a Global Enhanced Transformer (termed GET) to enable the extraction of a more comprehensive global representation that adaptively guides the decoder to generate high-quality captions. In GET, a Global Enhanced Encoder is designed for embedding the global feature, and a Global Adaptive Decoder is designed for guiding caption generation. The former models intra- and inter-layer global representations by taking advantage of the proposed Global Enhanced Attention and a layer-wise fusion module. The latter contains a Global Adaptive Controller that can adaptively fuse the global information into the decoder to guide caption generation. Extensive experiments on the MS COCO dataset demonstrate the superiority of our GET over many state-of-the-art methods.
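One plausible form of the Global Adaptive Controller is a gated fusion at each decoding step, sketched below; the gating formulation and dimensions are assumptions for illustration, not the released GET code.

```python
import torch
import torch.nn as nn

class GlobalAdaptiveController(nn.Module):
    """A sigmoid gate, conditioned on the decoder state and the region-attended
    context, decides how much of the global image vector to mix in."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, dec_state, region_ctx, global_feat):
        # dec_state, region_ctx, global_feat: (B, dim)
        g = torch.sigmoid(self.gate(torch.cat([dec_state, region_ctx], dim=-1)))
        return g * global_feat + (1.0 - g) * region_ctx

gac = GlobalAdaptiveController()
fused = gac(torch.rand(2, 512), torch.rand(2, 512), torch.rand(2, 512))  # (2, 512)
```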