Video-based person re-identification is an important research topic in computer vision that entails associating a pedestrian's identity across non-overlapping cameras. It suffers from severe temporal appearance misalignment and visual ambiguity problems. In this work, we propose a novel self-supervised human semantic parsing approach (SS-HSP) for video-based person re-identification. It employs self-supervised learning to adaptively segment the human body at the pixel level by estimating the motion of each body part between consecutive frames, and it explores complementary temporal relations to obtain reinforced appearance and motion representations. Specifically, a semantic segmentation network within SS-HSP is designed, which exploits self-supervised learning by constructing a pretext task of predicting future frames. The network learns precise human semantic parsing together with the motion field of each body part between consecutive frames, which permits the reconstruction of future frames with the aid of several customized loss functions. Locally aligned features of body parts are obtained according to the estimated human parsing. Moreover, an aggregation network is proposed to explore the correlation information across video frames for refining the appearance and motion representations. Extensive experiments on two video datasets demonstrate the effectiveness of the proposed approach.
Generating detailed captions comprehending text-rich visual content in images has received growing attention for Large Vision-Language Models (LVLMs). However, few studies have developed benchmarks specifically tailored to detailed captions that measure their accuracy and comprehensiveness. In this paper, we introduce a detailed caption benchmark, termed CompreCap, to evaluate the visual context from a directed scene graph view. Concretely, we first manually segment the image into semantically meaningful regions (i.e., semantic segmentation masks) according to a common-object vocabulary, while also distinguishing the attributes of objects within all those regions. Then, directional relation labels for these objects are annotated to compose a directed scene graph that can well encode the rich compositional information of the image. Based on our directed scene graph, we develop a pipeline to assess the detailed captions generated by LVLMs on multiple levels, including object-level coverage, the accuracy of attribute descriptions, the score of key relationships, etc. Experimental results on the CompreCap dataset confirm that our evaluation method aligns closely with human evaluation scores across LVLMs.
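To make the object-level coverage term concrete, the following is a minimal sketch in Python; the function name, the substring-based matching, and the annotation format are illustrative assumptions and are not taken from the CompreCap pipeline itself, which may ground object mentions more carefully.

```python
# Minimal sketch of an object-level coverage score, assuming the annotated
# directed scene graph provides a list of object names per image. The matching
# below is simple substring matching; the actual CompreCap pipeline is not
# reproduced here.

def object_coverage(caption: str, annotated_objects: list[str]) -> float:
    """Fraction of annotated objects that are mentioned in the caption."""
    caption_lc = caption.lower()
    mentioned = [obj for obj in annotated_objects if obj.lower() in caption_lc]
    return len(mentioned) / len(annotated_objects) if annotated_objects else 0.0

# Usage with hypothetical annotations.
caption = "A brown dog lies on a red sofa next to a wooden table."
objects = ["dog", "sofa", "table", "lamp"]
print(object_coverage(caption, objects))  # 0.75 -> the lamp is not covered
```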
For safety-related applications, it is crucial to produce trustworthy deep neural networks whose predictions are associated with a confidence that can represent the likelihood of correctness for subsequent decision-making. Existing dense binary classification models are prone to being over-confident. To improve model calibration, we propose Adaptive Stochastic Label Perturbation (ASLP), which learns a unique label perturbation level for each training image. ASLP employs our proposed Self-Calibrating Binary Cross Entropy (SC-BCE) loss, which unifies label perturbation processes, including stochastic approaches (like DisturbLabel) and label smoothing, to correct calibration while maintaining classification rates. ASLP follows the Maximum Entropy Inference of classic statistical mechanics to maximise prediction entropy with respect to missing information. It does so while (1) preserving classification accuracy on known data as a conservative solution, or (2) specifically improving model calibration by minimising the gap between the prediction accuracy and the expected confidence of the target training label. Extensive results demonstrate that ASLP can significantly improve the calibration of dense binary classification models on both in-distribution and out-of-distribution data. The code is available at https://github.com/Carlisle-Liu/ASLP.
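For illustration, a minimal sketch of an SC-BCE-style loss is given below, assuming the per-image perturbation level interpolates the hard binary label toward the maximally uncertain value 0.5; the function name, tensor shapes, and exact perturbation form are assumptions rather than the paper's definition.

```python
import torch
import torch.nn.functional as F

def sc_bce_sketch(logits, targets, alpha):
    """
    Minimal sketch of a self-calibrating BCE-style loss.

    logits:  (B, 1, H, W) raw predictions of a dense binary classifier
    targets: (B, 1, H, W) binary ground-truth labels in {0, 1}
    alpha:   (B, 1, 1, 1) per-image perturbation level in [0, 1]
             (ASLP learns one level per training image)

    The perturbed target interpolates between the hard label and the maximally
    uncertain label 0.5, which subsumes label smoothing as a special case; the
    exact SC-BCE formulation in the paper may differ.
    """
    perturbed = (1.0 - alpha) * targets + alpha * 0.5
    return F.binary_cross_entropy_with_logits(logits, perturbed)

# Hypothetical usage.
logits = torch.randn(2, 1, 64, 64)
targets = torch.randint(0, 2, (2, 1, 64, 64)).float()
alpha = torch.tensor([0.1, 0.3]).view(2, 1, 1, 1)
loss = sc_bce_sketch(logits, targets, alpha)
```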
Capturing and maintaining geometric interactions among different body parts is crucial for successful motion retargeting in skinned characters. Existing approaches often overlook body geometries or add a geometry correction stage after skeletal motion retargeting. This results in conflicts between skeleton interaction and geometry correction, leading to issues such as jitter, interpenetration, and contact mismatches. To address these challenges, we introduce a new retargeting framework, MeshRet, which directly models dense geometric interactions in motion retargeting. First, we establish dense mesh correspondences between characters using semantically consistent sensors (SCS), which are effective across diverse mesh topologies. We then develop a novel spatio-temporal representation called the dense mesh interaction (DMI) field. This field, a collection of interacting SCS feature vectors, captures both contact and non-contact interactions between body geometries. By aligning the DMI field during retargeting, MeshRet not only preserves motion semantics but also prevents self-interpenetration and ensures contact preservation. Extensive experiments on the public Mixamo dataset and our newly collected ScanRet dataset demonstrate that MeshRet achieves state-of-the-art performance. Code is available at https://github.com/abcyzj/MeshRet.
A key challenge in sensor-based fall prediction is that a fall event can occur in various configurations of fall poses, each with its own spatio-temporal dependencies. This leads us to define a spatio-temporal model that explicitly characterizes these internal configurations of poses. In particular, we introduce a graph neural network with a spatio-temporal topological structure to encode the latent relations among poses by capturing representative patterns in fall events. Moreover, a human body orientation estimator is devised to capture lower-limb information, so that the separately modeled pose dependencies remain globally consistent. Empirical evaluations on two benchmark datasets and one in-house dataset suggest that our approach significantly outperforms state-of-the-art methods.
Despite the fact that deep neural networks (DNNs) have achieved prominent performance in various applications, it is well known that DNNs are vulnerable to adversarial examples/samples (AEs) with imperceptible perturbations of clean/original samples. Existing defense methods against adversarial attacks damage the information in the original samples, leading to a decrease in the accuracy of the target classifier. To overcome this weakness, this paper presents IDFR, an enhanced method for countering adversarial attacks via Input Denoising and Feature Restoring. The proposed IDFR consists of an enhanced input denoiser (ID) and a hidden lossy feature restorer (FR) based on convex hull optimization. Extensive experiments conducted on benchmark datasets show that the proposed IDFR outperforms various state-of-the-art defense methods and is highly effective at protecting target models against various black-box and white-box adversarial attacks. \footnote{Source code is released at: \href{https://github.com/ID-FR/IDFR}{https://github.com/ID-FR/IDFR}}
Confidence-aware learning has proven to be an effective solution for preventing networks from becoming overconfident. We present a confidence-aware camouflaged object detection framework using dynamic supervision to produce both an accurate camouflage map and a meaningful "confidence" that represents the model's awareness of its current prediction. A camouflaged object detection network is designed to produce the camouflage prediction. We then concatenate it with the input image and feed it to the confidence estimation network to produce a one-channel confidence map. We generate dynamic supervision for the confidence estimation network, representing the agreement of the camouflage prediction with the ground-truth camouflage map. With the produced confidence map, we introduce confidence-aware learning, using the confidence map as guidance to pay more attention to the hard/low-confidence pixels in the loss function. We claim that, once trained, our confidence estimation network can evaluate the pixel-wise accuracy of the prediction without relying on the ground-truth camouflage map. Extensive results on four camouflaged object detection testing datasets illustrate the superior performance of the proposed model in explaining the camouflage prediction.
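A minimal sketch of such confidence-guided weighting is shown below, assuming a one-channel confidence map in [0, 1] produced by the confidence estimation network; the specific weighting scheme, names, and dynamic-supervision target are illustrative assumptions and may differ from the loss actually used.

```python
import torch
import torch.nn.functional as F

def confidence_aware_bce(pred_logits, gt, confidence, eps: float = 1e-6):
    """
    Minimal sketch of confidence-guided loss weighting for dense prediction.
    Hard / low-confidence pixels receive larger weights; the exact weighting
    in the paper is not reproduced here.
    """
    per_pixel = F.binary_cross_entropy_with_logits(pred_logits, gt, reduction="none")
    weights = 1.0 + (1.0 - confidence)  # emphasize low-confidence pixels
    return (weights * per_pixel).sum() / (weights.sum() + eps)

def dynamic_confidence_target(pred, gt):
    """
    One plausible dynamic-supervision signal: per-pixel agreement between the
    camouflage prediction (after sigmoid) and the ground-truth map.
    """
    return 1.0 - (pred - gt).abs()
```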
Existing methods for privacy-preserving action recognition (PPAR) mainly focus on frame-level (spatial) privacy removal through 2D CNNs. Unfortunately, they have two major drawbacks. First, they may compromise temporal dynamics in input videos, which are critical for accurate action recognition. Second, they are vulnerable to practical attack scenarios in which attackers probe for privacy from an entire video rather than individual frames. To address these issues, we propose a novel framework, STPrivacy, that performs video-level PPAR. For the first time, we introduce vision Transformers into PPAR by treating a video as a tubelet sequence, and accordingly design two complementary mechanisms, i.e., sparsification and anonymization, to remove privacy from a spatio-temporal perspective. Specifically, our privacy sparsification mechanism applies adaptive token selection to abandon action-irrelevant tubelets. Then, our anonymization mechanism implicitly manipulates the remaining action tubelets to erase privacy in the embedding space through adversarial learning. These mechanisms provide significant advantages in terms of privacy preservation for human eyes and action-privacy trade-off adjustment during deployment. We additionally contribute the first two large-scale PPAR benchmarks, VP-HMDB51 and VP-UCF101, to the community. Extensive evaluations on them, as well as on two other tasks, validate the effectiveness and generalization capability of our framework.
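The following sketch illustrates the adaptive token-selection idea with a simple hard top-k over tubelet relevance scores; the function name and scoring interface are hypothetical, and the actual STPrivacy mechanism (including any differentiable relaxation used for end-to-end training) is not reproduced here.

```python
import torch

def sparsify_tubelets(tokens, scores, keep_ratio: float = 0.5):
    """
    Minimal sketch of adaptive token selection for privacy sparsification.

    tokens: (B, N, D) tubelet embeddings from a video Transformer
    scores: (B, N)    predicted action-relevance score per tubelet
    Keeps the top-k action-relevant tubelets and discards the rest.
    """
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                        # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])   # (B, k, D)
    return tokens.gather(1, idx)

# Hypothetical usage: keep half of 196 tubelets per clip.
tokens = torch.randn(4, 196, 384)
scores = torch.randn(4, 196)
kept = sparsify_tubelets(tokens, scores)  # (4, 98, 384)
```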
Radio frequency (RF) fingerprinting is a promising technique for device authentication due to its advantages of being difficult to tamper with and to imitate. Rather than employing hand-crafted features, deep learning-based methods that learn low-level signal representations from raw signals have been explored recently. This letter proposes a novel RF fingerprinting scheme using a learnable short-time Fourier transform (STFT) and a convolutional neural network (CNN). Instead of representing radio signals by spectrograms from a fixed STFT and processing signal representation and classification separately, this approach integrates a parameterized STFT-based signal representation module and a CNN classifier into a single framework. The signal representation module is jointly trained with the CNN classifier using a single identification loss, transforming the input radio signals into the spectrograms desired by the CNN classifier. Experimental results show that the RF fingerprint identification accuracy of the proposed scheme with learnable signal representation is significantly improved compared to baseline schemes with traditional hand-crafted signal representations.
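As a rough illustration of a parameterized STFT front-end, the sketch below makes the analysis window a learnable parameter optimized jointly with a downstream CNN through the identification loss; the exact parameterization in the letter (e.g., which STFT parameters are learned and how complex I/Q samples are handled) may differ.

```python
import torch
import torch.nn as nn

class LearnableSTFT(nn.Module):
    """
    Minimal sketch of a parameterized STFT front-end. Here only the analysis
    window is learnable; gradients flow from the CNN classifier back into the
    window, so the spectrogram adapts to the identification task.
    """
    def __init__(self, n_fft: int = 256, hop_length: int = 64):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # Initialize with a Hann window and let training update it.
        self.window = nn.Parameter(torch.hann_window(n_fft))

    def forward(self, x):  # x: (B, T) real-valued signal
        spec = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
                          window=self.window, return_complex=True)
        return spec.abs().unsqueeze(1)  # (B, 1, F, T') spectrogram for the CNN

# Hypothetical usage: front-end followed by any 2D CNN classifier.
frontend = LearnableSTFT()
spectrogram = frontend(torch.randn(8, 4096))  # (8, 1, 129, 65)
```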
Owing to the advantages of a centralized critic for estimating the Q-function and decentralized actors for optimizing the agents' policies, counterfactual multi-agent policy gradients (COMA) stands out among multi-agent reinforcement learning (MARL) algorithms. Sharing policy parameters can improve sampling efficiency and learning effectiveness, but it may lead to a lack of policy diversity. Hence, balancing parameter sharing and diversity among agents in COMA has been a persistent research topic. In this paper, an effective training method for the COMA policy network based on a differential evolution (DE) algorithm is proposed, named DE-COMA. DE-COMA treats the individuals in a population as computational units that construct the policy network through mutation, crossover, and selection operations. The average return serves as the fitness function, and the best individual of the policy network is chosen for the next generation. By maintaining parameter sharing while enhancing parameter diversity, the multi-agent strategies become more exploratory. To validate the effectiveness of DE-COMA, experiments were conducted in the StarCraft II environment with the 2s_vs_1sc, 2s3z, 3m, and 8m battle scenarios. Experimental results demonstrate that DE-COMA significantly outperforms traditional COMA and most other multi-agent reinforcement learning algorithms in terms of win rate and convergence speed.
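To ground the evolutionary component, a minimal DE/rand/1/bin sketch over flattened policy-network parameters with the average return as the fitness is given below; the `evaluate` interface, the hyperparameters, and the omission of COMA's centralized critic are all simplifying assumptions rather than the DE-COMA procedure itself.

```python
import numpy as np

def de_policy_search(evaluate, dim, pop_size=20, F=0.5, CR=0.9, generations=100):
    """
    Minimal DE/rand/1/bin sketch over a flattened policy parameter vector.
    `evaluate(theta)` is assumed to run episodes with parameters theta and
    return the average episode return (the fitness to maximize).
    """
    pop = np.random.randn(pop_size, dim) * 0.1
    fitness = np.array([evaluate(p) for p in pop])
    for _ in range(generations):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[np.random.choice(others, 3, replace=False)]
            mutant = a + F * (b - c)                       # mutation
            cross = np.random.rand(dim) < CR               # binomial crossover
            cross[np.random.randint(dim)] = True           # ensure at least one gene
            trial = np.where(cross, mutant, pop[i])
            f_trial = evaluate(trial)
            if f_trial > fitness[i]:                       # greedy selection
                pop[i], fitness[i] = trial, f_trial
    return pop[np.argmax(fitness)]                         # best individual

# Hypothetical usage with a toy fitness (stands in for average return).
best = de_policy_search(lambda theta: -np.sum(theta ** 2), dim=10, generations=50)
```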