The similarity matrix is critical to the performance of spectral clustering. Mercer kernels have become popular largely due to their success in kernel methods such as kernel PCA. A novel spectral clustering method based on local neighborhoods in kernel space (SC-LNK) is proposed, which assumes that each data point can be linearly reconstructed from its neighbors. The SC-LNK algorithm projects the data into a feature space via a Mercer kernel and then learns a sparse matrix through linear reconstruction, which serves as the similarity graph for spectral clustering. Experiments on synthetic and real-world data sets show that spectral clustering based on linear reconstruction in kernel space outperforms conventional spectral clustering and two other algorithms, especially on real-world data sets.
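The abstract only names the two steps of SC-LNK (kernel projection, then linear reconstruction of each point from its neighbors); a minimal sketch of that general recipe, assuming an RBF kernel and a locally-linear-embedding-style reconstruction solve (these choices are assumptions, not the paper's exact formulation), might look like:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

def lnk_similarity(X, n_neighbors=10, gamma=1.0, reg=1e-3):
    """Sparse similarity graph via linear reconstruction in RBF-kernel space."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma=gamma)          # Gram matrix of the feature space
    W = np.zeros((n, n))
    for i in range(n):
        # nearest neighbours of x_i in kernel space = largest kernel values
        idx = np.argsort(K[i])[::-1][1:n_neighbors + 1]
        # local Gram matrix of reconstruction residuals in feature space:
        # G_jk = k(xi,xi) - k(xi,xj) - k(xi,xk) + k(xj,xk)
        G = K[i, i] - K[i, idx][:, None] - K[i, idx][None, :] + K[np.ix_(idx, idx)]
        G += reg * np.trace(G) / n_neighbors * np.eye(n_neighbors)  # regularise
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, idx] = w / w.sum()             # reconstruction weights sum to one
    return 0.5 * (np.abs(W) + np.abs(W).T)  # symmetrise for spectral clustering

X = np.random.RandomState(0).randn(60, 2)
X[30:] += 5.0                               # two well-separated blobs
S = lnk_similarity(X)
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
```

Because each point is reconstructed only from its kernel-space neighbors, `S` is sparse and block-structured, which is what the spectral step exploits.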
Ontologies are widely used in the biological domain for data annotation, integration, and analysis. Several representation learning methods have been proposed to learn entity representations that support intelligent applications such as knowledge discovery. However, most of them neglect the class information of entities in the ontology. In this article, we propose a unified framework, named ERCI, which jointly optimizes a knowledge graph embedding model and a self-supervised learning objective. In this way, we can generate embeddings of bio-entities that fuse in class information. Moreover, ERCI is a pluggable framework that can easily be combined with any knowledge graph embedding model. We validate ERCI in two ways. First, we use the protein embeddings learned by ERCI to predict protein-protein interactions on two different datasets. Second, we leverage the gene and disease embeddings generated by ERCI to predict gene-disease associations. In addition, we create three datasets that simulate the long-tail scenario and evaluate ERCI on them. Experimental results show that ERCI outperforms state-of-the-art methods on all metrics.
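The abstract does not specify ERCI's objective; as a purely illustrative sketch of joint optimization (TransE as the pluggable embedding model and a triplet-style class-information term are my assumptions, not the paper's formulation), the combined loss could be set up as:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_ent, n_rel = 16, 100, 5
E = rng.normal(size=(n_ent, dim))    # entity embeddings
R = rng.normal(size=(n_rel, dim))    # relation embeddings

def transe_loss(h, r, t, t_neg, margin=1.0):
    """Margin-based TransE triple loss: ||h + r - t|| vs. a corrupted tail."""
    pos = np.linalg.norm(E[h] + R[r] - E[t])
    neg = np.linalg.norm(E[h] + R[r] - E[t_neg])
    return max(0.0, margin + pos - neg)

def class_loss(anchor, same_class, other_class, margin=1.0):
    """Self-supervised class term: pull same-class entities together,
    push entities of other classes apart."""
    pos = np.linalg.norm(E[anchor] - E[same_class])
    neg = np.linalg.norm(E[anchor] - E[other_class])
    return max(0.0, margin + pos - neg)

lam = 0.5                            # weight of the class-information term
joint = transe_loss(0, 1, 2, 3) + lam * class_loss(0, 4, 5)
```

Minimizing `joint` over `E` and `R` would train both objectives at once; swapping `transe_loss` for another embedding model's loss is what makes such a framework pluggable.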
Proteins are the primary agents of biological activity, and accurately predicting their functions helps us better understand the mechanisms of life. With the rapid development of high-throughput technologies, an abundance of proteins has been discovered. However, the gap between the number of proteins and the number of function annotations remains huge. To accelerate protein function prediction, computational methods that take advantage of multiple data sources have been proposed. Among them, deep-learning-based methods are currently the most popular because of their ability to learn from raw data automatically. However, due to the diversity of and differences in scale between data sources, it is challenging for existing deep learning methods to capture related information from different sources effectively. In this paper, we introduce DeepAF, a deep learning method that adaptively learns information from protein sequences and biomedical literature. DeepAF first extracts the two kinds of information with separate extractors, which are built on pre-trained language models and capture rudimentary biological knowledge. Then, to integrate this information, it applies an adaptive fusion layer based on a cross-attention mechanism that models the mutual interactions between the two information sources. Finally, DeepAF applies logistic regression to the fused information to obtain prediction scores. Experimental results on datasets from two species (human and yeast) show that DeepAF outperforms other state-of-the-art approaches.
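As an illustration of the cross-attention fusion the abstract describes (the single-head form and all dimensions are assumptions; DeepAF's actual layer is not specified here), sequence-token features attending over literature-token features can be sketched as:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(seq_feats, lit_feats, Wq, Wk, Wv):
    """One cross-attention head: queries come from the protein sequence,
    keys and values come from the literature, so each sequence token is
    re-represented as a mixture of relevant literature features."""
    Q, K, V = seq_feats @ Wq, lit_feats @ Wk, lit_feats @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product scores
    return softmax(scores) @ V                # fused, sequence-aligned features

rng = np.random.default_rng(0)
d = 32
seq = rng.normal(size=(50, d))                # protein-sequence token features
lit = rng.normal(size=(80, d))                # literature token features
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
fused = cross_attention(seq, lit, *W)
```

The output keeps the sequence's length but carries literature information, which is the sense in which cross-attention "fuses" the two modalities.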
Syntactic knowledge has been widely employed in existing research to enhance relation extraction, guiding models' semantic understanding and text representation. However, most studies do not exploit syntactic knowledge exhaustively and often lack fine-grained noise reduction, leading to confusion in relation classification. In this paper, we propose an attention generator that considers both syntactic dependency type information and syntactic position information to weigh the importance of different dependency connections. Additionally, we integrate positional information, dependency type information, and word representations to provide location-enhanced syntactic knowledge that guides biomedical relation extraction. Experimental results on three widely used English benchmark datasets in the biomedical domain show that our approach consistently outperforms a range of baseline models, demonstrating that it not only makes full use of syntactic knowledge but also effectively reduces the impact of noisy words.
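The attention generator scores dependency connections from their type and position; a hypothetical single-layer sketch (the embedding sizes and the tanh scorer are my assumptions, introduced only to make the idea concrete) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dep_types, d = 40, 16
dep_type_emb = rng.normal(size=(n_dep_types, d))  # one vector per dependency label
pos_emb = rng.normal(size=(512, d))               # syntactic-position embeddings
v = rng.normal(size=2 * d)                        # attention scoring vector

def edge_attention(dep_types, positions):
    """Score each dependency connection from its type and syntactic position,
    then normalize so noisy connections receive small weights."""
    feats = np.concatenate([dep_type_emb[dep_types], pos_emb[positions]], axis=-1)
    scores = np.tanh(feats) @ v
    e = np.exp(scores - scores.max())
    return e / e.sum()                            # attention over connections

# four dependency edges of a sentence: (type id, syntactic position)
att = edge_attention(np.array([3, 7, 3, 12]), np.array([0, 1, 2, 3]))
```

Down-weighting connections this way is one concrete form of the fine-grained noise reduction the abstract argues for.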
Abstract Background The recognition of pharmacological substances, compounds, and proteins is essential for biomedical relation extraction, knowledge graph construction, drug discovery, and medical question answering. Although considerable efforts have been made to recognize biomedical entities in English texts, to date only a few limited attempts have been made to recognize them in biomedical texts in other languages. PharmaCoNER is a named entity recognition challenge for recognizing pharmacological entities in Spanish texts. Given the abundant resources now available in natural language processing, how best to leverage them for the PharmaCoNER challenge is a meaningful question. Methods Inspired by the success of deep learning with language models, we compare and explore various representative BERT models to advance the PharmaCoNER task. Results The experimental results show that deep learning with language models can effectively improve performance on the PharmaCoNER dataset. Our method achieves state-of-the-art performance on the PharmaCoNER dataset, with a maximum F1-score of 92.01%. Conclusion For the BERT models on the PharmaCoNER dataset, biomedical domain knowledge has a greater impact on model performance than the native language (i.e., Spanish). The BERT models obtain competitive performance by using WordPiece tokenization to alleviate the out-of-vocabulary limitation, and performance can be improved further by constructing a domain-specific vocabulary. Moreover, character casing also has a certain impact on model performance.
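To illustrate why WordPiece alleviates the out-of-vocabulary limitation: an unseen word is split greedily into the longest in-vocabulary subwords rather than mapped to a single unknown token (the vocabulary below is a toy example, not the actual PharmaCoNER vocabulary):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece tokenization.
    Non-initial pieces carry the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:                # take the longest matching piece
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]                  # no subword matched at all
    return pieces

vocab = {"para", "##ceta", "##mol"}
print(wordpiece("paracetamol", vocab))        # → ['para', '##ceta', '##mol']
```

A drug name absent from the vocabulary thus still yields informative subword units, which is why a domain-specific vocabulary can improve performance further.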
Dialogue summarization aims to condense dialogues while retaining their salient information. However, because dialogues come from different domains, the format of the reference summary varies, e.g., QA pairs in customer service and SOAP notes in the medical field. To address the challenges common to these fields and to alleviate the format differences during generation, we introduce a novel unified topic-guided dialogue summarization framework that first captures the topic structure of a conversation and then leverages it to guide summary generation. This framework is the first to model the fine-grained topic structure of a dialogue, posing its identification as a Seq2Seq task, and to introduce topic-guided segment-wise attention that produces the final summary segment by segment, following the specific format of each domain. This concise but effective method avoids the trouble of customizing decoding schemes while preserving the topic structure of a dialogue in its summary as much as possible. Comprehensive experiments on four benchmark datasets from different domains show that our method achieves better performance and generalization than the baselines.
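Segment-wise attention, as named in the abstract, can be read as masking attention so that the decoder only sees the tokens of the current topic segment; a minimal sketch under that assumption (the masking scheme is my interpretation, not the paper's stated mechanism):

```python
import numpy as np

def segment_attention(query, keys, values, topic_ids, topic):
    """Attend only to dialogue tokens belonging to the current topic segment:
    tokens of other segments are masked out with -inf before the softmax."""
    mask = np.where(topic_ids == topic, 0.0, -np.inf)
    scores = keys @ query / np.sqrt(len(query)) + mask
    e = np.exp(scores - scores[topic_ids == topic].max())
    w = e / e.sum()                       # weights are zero outside the segment
    return w @ values, w

rng = np.random.default_rng(0)
d, n = 8, 10
keys, values = rng.normal(size=(n, d)), rng.normal(size=(n, d))
topic_ids = np.array([0] * 4 + [1] * 6)   # two topic segments in the dialogue
ctx, w = segment_attention(rng.normal(size=d), keys, values, topic_ids, 1)
```

Generating each summary segment against its own topic's tokens is what lets one decoder follow domain-specific formats without a custom decoding scheme.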