Abstract: Extracellular vesicles (EVs) are particles enclosed by a phospholipid bilayer membrane and released by cells into the extracellular matrix; they can carry DNA, RNA, proteins, and metabolites from their cells of origin. EVs can be obtained from various body fluids, including blood, saliva, cerebrospinal fluid, and urine. EV-mediated transfer of biologically active molecules has been shown to be crucial for various physiological and pathological processes, and extensive investigations have begun to explore the diagnostic and prognostic potential of EVs. While research has continued to establish the critical roles of nucleic acids and proteins in EVs, our understanding of the metabolites carried by these nanoparticles is still in its infancy. We therefore summarize recent research on the metabolomics of EVs with a view toward potential clinical applications, and discuss the outstanding problems and challenges, to provide guidance for the future development of this field. Keywords: extracellular vesicles, metabolomics, metabolites, clinical application
Speech-driven gesture generation is highly challenging due to the random jitter of human motion and the inherently asynchronous relationship between human speech and gestures. To tackle these challenges, we introduce a novel quantization-based and phase-guided motion-matching framework. Specifically, we first present a gesture VQ-VAE module that learns a codebook summarizing meaningful gesture units. With each code representing a unique gesture, the random jittering problem is alleviated effectively. We then use Levenshtein distance to align diverse gestures with different speech: computed over quantized audio, it serves as a similarity metric between the speech associated with candidate gestures and the input speech, helping match more appropriate gestures to the speech and resolving the speech-gesture alignment problem. Moreover, we introduce phase to guide the optimal gesture matching based on the semantics of the context or the rhythm of the audio; phase determines when text-based or speech-based gestures should be performed, making the generated gestures more natural. Extensive experiments show that our method outperforms recent approaches to speech-driven gesture generation. Our code, database, pre-trained models, and demos are available at https://github.com/YoungSeng/QPGesture.
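As an illustration of the matching idea described above, the following is a minimal sketch of Levenshtein distance computed over sequences of discrete code indices (e.g., audio quantized against a learned codebook) and turned into a similarity score; the function names and normalization are assumptions for illustration, not the released QPGesture code.

```python
# Levenshtein distance between two sequences of discrete code indices,
# e.g. audio frames quantized against a VQ-VAE codebook.
def levenshtein(a: list, b: list) -> int:
    """Edit distance between two code-index sequences (single-row DP)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))                 # dp[j] = dist(a[:i], b[:j]) for current i
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + cost)        # substitution
            prev = cur
    return dp[n]

def code_similarity(a: list, b: list) -> float:
    """Normalized similarity in [0, 1]; 1 means identical code sequences."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Pick the candidate gesture whose associated speech codes best match the query speech.
query = [3, 7, 7, 12, 5]
candidates = {"gesture_A": [3, 7, 12, 5], "gesture_B": [1, 1, 9, 9, 9]}
best = max(candidates, key=lambda k: code_similarity(query, candidates[k]))
```

In a motion-matching setting, such a score would rank database gestures by how close their associated speech codes are to the incoming speech codes.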
Automatic fake news detection is an important yet very challenging topic. Traditional methods using lexical features have had only limited success. This paper proposes a novel method to incorporate speaker profiles into an attention-based LSTM model for fake news detection. Speaker profiles contribute to the model in two ways: one is to include them in the attention model, the other is to feed them as additional input data. By adding speaker profiles such as party affiliation, speaker title, location, and credit history, our model outperforms the state-of-the-art method by 14.5% in accuracy on a benchmark fake news detection dataset. This demonstrates that speaker profiles provide valuable information for validating the credibility of news articles.
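The two ways of using speaker profiles could look roughly like the following PyTorch sketch, where the profile vector is both concatenated to the word embeddings and injected into the attention scoring; the dimensions and module names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ProfileAttnLSTM(nn.Module):
    """Attention-based LSTM that uses a speaker-profile vector in two ways:
    (1) concatenated to the token embeddings, (2) inside the attention scorer."""
    def __init__(self, vocab_size, emb_dim=100, profile_dim=32, hidden=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + profile_dim, hidden, batch_first=True)
        self.attn_w = nn.Linear(hidden + profile_dim, 1)    # profile-aware attention
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, tokens, profile):
        # tokens: (B, T) word ids; profile: (B, profile_dim) encoded speaker metadata
        x = self.embed(tokens)                              # (B, T, emb_dim)
        p = profile.unsqueeze(1).expand(-1, x.size(1), -1)  # (B, T, profile_dim)
        h, _ = self.lstm(torch.cat([x, p], dim=-1))         # (B, T, hidden)
        scores = self.attn_w(torch.cat([h, p], dim=-1))     # (B, T, 1)
        alpha = torch.softmax(scores, dim=1)                # attention over time steps
        context = (alpha * h).sum(dim=1)                    # (B, hidden)
        return self.out(context)                            # (B, n_classes)

model = ProfileAttnLSTM(vocab_size=20000)
logits = model(torch.randint(0, 20000, (4, 30)), torch.randn(4, 32))
```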
Noise in the class labels of any training set can lead to poor classification results no matter what machine learning method is used. In this paper, we first present the problem of binary classification in the presence of random noise on the class labels, which we call class noise. To model class noise, a class noise rate is normally defined as a small, independent probability of a class label being flipped, applied over the whole training set. We then propose a method to estimate the class noise rate at the level of individual samples in real data. Based on the estimated rates, we propose two approaches to handle class noise: the first modifies a given surrogate loss function, and the second eliminates class noise by sampling. Furthermore, we prove that with both approaches the optimal hypothesis on the noisy distribution approximates the optimal hypothesis on the clean distribution. Our methods achieve over 87% accuracy on a synthetic non-separable dataset even when 40% of the labels are inverted. Comparisons with other algorithms show that our methods outperform state-of-the-art approaches on several benchmark datasets from different domains under different noise rates.
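For the first approach, one standard way to modify a surrogate loss under class-conditional label noise is the unbiased estimator of Natarajan et al. (2013); the sketch below shows that correction for the logistic loss and is an illustrative stand-in, since the paper's exact correction and per-sample noise rates are not spelled out here.

```python
import numpy as np

def logistic_loss(score, y):
    """Standard logistic surrogate loss for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * score))

def noise_corrected_loss(score, y_noisy, rho_pos, rho_neg):
    """Unbiased noise-corrected surrogate (Natarajan et al., 2013):
    its expectation under the noisy labels equals the clean-label loss.
    rho_pos / rho_neg are the flip probabilities of true +1 / -1 labels."""
    rho_y = rho_pos if y_noisy == 1 else rho_neg      # flip prob for the observed label's class
    rho_other = rho_neg if y_noisy == 1 else rho_pos  # flip prob for the opposite class
    corrected = ((1 - rho_other) * logistic_loss(score, y_noisy)
                 - rho_y * logistic_loss(score, -y_noisy))
    return corrected / (1.0 - rho_pos - rho_neg)

# Example: a confident positive score under a noisy positive label, 40% symmetric noise.
print(noise_corrected_loss(score=2.0, y_noisy=1, rho_pos=0.4, rho_neg=0.4))
```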
Speech sentiments are inherently dynamic. However, most existing methods for speech sentiment analysis follow the paradigm of assigning a single hard label to an entire utterance, so fine-grained, segment-level sentiment information is unavailable and sentiment dynamics cannot be modeled. In addition, because sentiments are ambiguous, annotating an utterance at the segment level is difficult and time-consuming. In this work, to alleviate these issues, we propose sentiment profiles (SPs), a time series of segment-level soft labels that captures fine-grained sentiment cues across an utterance. To obtain a large amount of data with segment-level annotations, we propose a segment-level cross-modal knowledge transfer method that transfers facial expression knowledge from images to audio segments. Further, we propose the sentiment profile refinery (SPR), which iteratively updates the sentiment profiles to improve their accuracy and to mitigate the label noise introduced by cross-modal knowledge transfer. Our experiments on the CH-SIMS dataset, with the iQIYI-VID dataset as unlabeled data, show that our method can effectively exploit additional unlabeled audio-visual data and achieves state-of-the-art performance.
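The refinery step could be sketched roughly as follows: soft labels obtained from the visual modality are repeatedly blended with the predictions of a model trained on the current profiles. The blending rule, weights, and function names here are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def refine_profiles(profiles, train_and_predict, segments, alpha=0.5, rounds=3):
    """profiles: (N_segments, n_classes) soft labels from cross-modal transfer.
    train_and_predict: callable that trains a speech model on (segments, profiles)
    and returns (N_segments, n_classes) predicted class probabilities."""
    profiles = np.asarray(profiles, dtype=float)
    for _ in range(rounds):
        preds = train_and_predict(segments, profiles)        # model's segment-level probs
        profiles = alpha * profiles + (1 - alpha) * preds    # blend old labels with predictions
        profiles /= profiles.sum(axis=1, keepdims=True)      # keep each row a distribution
    return profiles
```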
Offline Multiple Appropriate Facial Reaction Generation (OMAFRG) aims to predict the reactions of different listeners to a given speaker, which is useful in human-computer interaction and social media analysis. In recent years, the Offline Facial Reaction Generation (OFRG) task has been explored in various ways, but most studies focus only on deterministic listener reactions. The non-deterministic setting (i.e., OMAFRG) has received insufficient attention, and existing results are far from satisfactory. Compared with deterministic OFRG, the OMAFRG task is closer to real-world conditions but is more difficult because it requires modeling both stochasticity and context. In this paper, we propose a new model named FRDiff to tackle this issue. Our model builds on the diffusion model architecture with modifications that enhance its ability to aggregate context features, and the inherent stochasticity of diffusion models enables it to generate multiple reactions. We conduct experiments on the dataset provided by the ACM Multimedia REACT2023 challenge and obtain second place on the leaderboard, which demonstrates the effectiveness of our method.
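Independent of FRDiff's specific architecture, the following generic DDPM-style sampling loop illustrates why a diffusion model naturally yields multiple appropriate reactions: every call starts from fresh Gaussian noise, so repeated sampling with the same speaker context produces different outputs. The denoiser interface, noise schedule, and tensor shapes are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_reaction(denoiser, context, T=50, shape=(1, 64)):
    """denoiser(x_t, t, context) is assumed to predict the noise added at step t."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                        # different initial noise -> different reaction
    for t in reversed(range(T)):
        eps = denoiser(x, torch.tensor([t]), context)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise   # one reverse diffusion step
    return x

# Drawing several samples with the same context yields multiple plausible reactions:
# reactions = [sample_reaction(denoiser, speaker_context) for _ in range(5)]
```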
Audio-driven co-speech human gesture generation has made remarkable advances recently. However, most previous works focus only on single-person audio-driven gesture generation. We aim to solve the problem of conversational co-speech gesture generation that considers multiple participants in a conversation, a novel and challenging task due to the difficulty of simultaneously incorporating semantic information and other relevant features from both the primary speaker and the interlocutor. To this end, we propose CoDiffuseGesture, a diffusion model-based approach for speech-driven interaction gesture generation that models bilateral conversational intention, emotion, and semantic context. Our method synthesizes appropriate, interactive, speech-matched, high-quality gestures for conversational motion through an intention perception module and an emotion reasoning module that operate at the sentence level using a pretrained language model. Experimental results demonstrate the promising performance of the proposed method.
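Sentence-level conditioning on both participants could be sketched as below: the speaker's and interlocutor's utterances are encoded with a pretrained language model and fused into a single conditioning vector for the gesture generator. The choice of BERT, the [CLS] pooling, and the fusion layer are illustrative assumptions, not the CoDiffuseGesture implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModel.from_pretrained("bert-base-uncased")
fuse = nn.Linear(2 * lm.config.hidden_size, 512)        # joint conversational context

def sentence_embedding(text: str) -> torch.Tensor:
    """Sentence-level embedding from the pretrained LM ([CLS] token)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = lm(**inputs)
    return out.last_hidden_state[:, 0]                   # (1, hidden_size)

speaker_emb = sentence_embedding("I think we should try the new plan.")
listener_emb = sentence_embedding("Are you sure that will work?")
condition = fuse(torch.cat([speaker_emb, listener_emb], dim=-1))  # (1, 512)
# `condition` would then serve as the conditioning signal for the
# diffusion-based gesture generator alongside the audio features.
```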