Inter-protein (interfacial) contact prediction is very useful for in silico structural characterization of protein-protein interactions. Although deep learning has been applied to this problem, its accuracy is not as good as that of intra-protein contact prediction. We propose a new deep learning method, GLINTER (Graph Learning of INTER-protein contacts), for interfacial contact prediction of dimers, leveraging a rotationally invariant representation of protein tertiary structures and a pretrained language model of multiple sequence alignments. Tested on the 13th and 14th CASP-CAPRI datasets, the average top L/10 precision achieved by GLINTER is 54% on the homodimers and 52% on all the dimers, much higher than the 30% obtained by the latest deep learning method DeepHomo on the homodimers and the 15% obtained by BIPSPI on all the dimers. Our experiments show that GLINTER-predicted contacts help improve selection of docking decoys. The software is available at https://github.com/zw2x/glinter. The datasets are available at https://github.com/zw2x/glinter/data. Supplementary data are available at Bioinformatics online.
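The top L/10 precision reported above can be illustrated with a minimal sketch. It assumes predictions arrive as a mapping from inter-chain residue pairs to scores and that the caller supplies L; the function name `top_k_precision` and the data layout are hypothetical and are not taken from the GLINTER code base.

```python
def top_k_precision(scores, true_contacts, L, divisor=10):
    """Fraction of the top L/divisor scored residue pairs that are true contacts.

    scores: dict mapping (i, j) inter-chain residue pairs to predicted scores.
    true_contacts: set of (i, j) pairs that are in contact in the native complex.
    L: sequence length used for the cutoff (how L is chosen is left to the caller).
    """
    k = max(1, L // divisor)
    # Rank pairs by descending predicted score and keep the top k.
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)


# Toy example: 4 scored pairs, L = 20, so the top 2 pairs are evaluated.
example_scores = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.7}
example_truth = {(0, 0), (1, 1)}
example_precision = top_k_precision(example_scores, example_truth, L=20)
```

In the toy example only one of the two top-ranked pairs is a true contact, so the precision is 0.5.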
AlphaFold-Multimer has greatly improved protein complex structure prediction, but its accuracy also depends on the quality of the multiple sequence alignment (MSA) formed by the interacting homologs (i.e. interologs) of the complex under prediction. Here we propose a novel method, ESMPair, that can identify interologs of a complex using protein language models. We show that ESMPair can generate better interologs than the default MSA generation method in AlphaFold-Multimer. Our method results in better complex structure prediction than AlphaFold-Multimer by a large margin (+10.7% in terms of the Top-5 best DockQ), especially when the predicted complex structures have low confidence. We further show that by combining several MSA generation methods, we can achieve even better complex structure prediction accuracy than AlphaFold-Multimer (+22% in terms of the Top-5 best DockQ). By systematically analyzing the factors that affect our algorithm, we find that the diversity of the MSA of interologs significantly affects the prediction accuracy. Moreover, we show that ESMPair performs particularly well on complexes in eukaryotes.
In user targeting and audience expansion for new shows on a video platform, the key point is how their embeddings are generated. The embeddings should be personalized from the perspective of both users and shows. Furthermore, the pursuit of both instant (click) and long-term (view time) rewards, and the cold-start problem for new shows, bring additional challenges. Such a problem is well suited to heterogeneous graph models because of the natural graph structure of the data. But real-world networks usually have billions of nodes and various types of edges. Few existing methods focus on handling large-scale data and exploiting different types of edges, especially the latter. In this paper, we propose a two-stage audience expansion scheme based on an edge-prompted heterogeneous graph network which can take different double-sided interactions and features into account. In the offline stage, to construct the graph, user IDs and specific side information combinations of the shows are chosen to be the nodes, and click/co-click relations and view time are used to build the edges. Embeddings and clustered user groups are then calculated. When new shows arrive, their embeddings and subsequent matching users can be produced within a consistent space. In the online stage, posterior data including click/view users are employed as seeds to look for similar users. The results on public datasets and our billion-scale data demonstrate the accuracy and efficiency of our approach.
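The online stage described above, in which seed users are used to look for similar users, can be sketched as a nearest-neighbour lookup in the embedding space. The abstract does not state the similarity measure or the lookup procedure, so cosine similarity to the seed centroid is an assumption here, and the names `cosine` and `expand_audience` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_audience(embeddings, seed_ids, top_n):
    """Return the top_n non-seed users closest to the centroid of the seed users.

    embeddings: dict mapping user id -> embedding vector (list of floats).
    seed_ids: set of user ids that clicked or viewed the new show.
    """
    seeds = [embeddings[s] for s in seed_ids]
    dim = len(seeds[0])
    # Assumption: the seed group is summarised by its mean embedding.
    centroid = [sum(vec[d] for vec in seeds) / len(seeds) for d in range(dim)]
    candidates = [
        (uid, cosine(vec, centroid))
        for uid, vec in embeddings.items()
        if uid not in seed_ids
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [uid for uid, _ in candidates[:top_n]]


# Toy example with 2-dimensional embeddings and a single seed user "a".
toy_embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.05]}
lookalikes = expand_audience(toy_embeddings, {"a"}, top_n=2)
```

At billion-user scale an exhaustive scan like this would be replaced by an approximate nearest-neighbour index or by the precomputed user clusters mentioned in the offline stage; the sketch only shows the matching logic.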
Many biological pathways have been created to represent different types of knowledge, such as genetic interactions, metabolic reactions, and gene-regulating and physical-binding relationships. Biologists are using a wide range of omics data to elaborately construct various context-specific differential molecular networks. However, they cannot easily gain insight into unfamiliar gene networks with the tools that are currently available for pathway resources and network analysis. They would benefit from the development of a standardized tool to compare functions of multiple biological networks quantitatively and promptly. To address this challenge, we developed NFPscanner, a web server for deciphering gene networks with pathway associations. Adapted from a recently reported knowledge-based framework called network fingerprint, NFPscanner integrates the annotated pathways of 7 databases, 4 algorithms, and 2 graphical visualization modules into a web tool. It implements 3 types of network analysis. Users can upload gene networks to NFPscanner through the web interface and then interactively explore the networks’ functions. NFPscanner is open-source software for non-commercial use, freely accessible at http://biotech.bmi.ac.cn/nfs.
The leaves of Chinese cabbage, one of the most widely consumed vegetables, come in a wide variety of colors. Leaves that are dark green can promote photosynthesis, effectively improving crop yield, and therefore hold important application and cultivation value. In this study, we selected nine inbred lines of Chinese cabbage displaying slight differences in leaf color, and graded the leaf color using the reflectance spectra. We clarified the differences in gene sequences and the protein structure of ferrochelatase 2 (BrFC2) among the nine inbred lines, and used qRT-PCR to analyze the expression differences of photosynthesis-related genes in inbred lines with minor variations in dark-green leaves. We found expression differences among the inbred lines of Chinese cabbage in photosynthesis-related genes involved in porphyrin and chlorophyll metabolism, as well as in the photosynthesis and photosynthesis-antenna protein pathways. Chlorophyll b content was significantly positively correlated with the expression of PsbQ, LHCA1_1 and LHCB6_1, while chlorophyll a content was significantly negatively correlated with the expression of PsbQ, LHCA1_1 and LHCA1_2. Our results provide an empirical basis for the precise identification of candidate genes and a better understanding of the molecular mechanisms responsible for the production of dark-green leaves in Chinese cabbage.
AlphaFold-Multimer has greatly improved protein complex structure prediction, but its accuracy also depends on the quality of the multiple sequence alignment (MSA) formed by the interacting homologs (i.e., interologs) of the complex under prediction. Here we propose a novel method, denoted as ESMPair, that can identify interologs of a complex by making use of protein language models (PLMs). We show that ESMPair can generate better interologs than the default MSA generation method in AlphaFold-Multimer. Our method results in better complex structure prediction than AlphaFold-Multimer by a large margin (+10.7% in terms of the Top-5 best DockQ), especially when the predicted complex structures have low confidence. We further show that by combining several MSA generation methods, we can achieve even better complex structure prediction accuracy than AlphaFold-Multimer (+22% in terms of the Top-5 best DockQ). We systematically analyze the factors that affect our algorithm and find that the diversity of the MSA of interologs significantly affects the prediction accuracy. Moreover, we show that ESMPair performs particularly well on complexes in eukaryotes.
Performance of algorithm and parameter combinations on the same input data set. Table S2. AUC value of networks fingerprint results for 73 KEGG diseases datasets. Figure S1. Average ROC curves derived from Table S2. Table S3. Fingerprints data file in Case Studies. (ZIP 95 kb)
Enhancer elements are noncoding stretches of DNA that play key roles in controlling gene expression programmes. Despite major efforts to develop accurate enhancer prediction methods, identifying enhancer sequences continues to be a challenge in the annotation of mammalian genomes. One of the major issues is the lack of large, sufficiently comprehensive and experimentally validated enhancer sets for humans or other species. Thus, there is an urgent need to develop computational methods that work from limited experimentally validated enhancers and to decipher the transcriptional regulatory code encoded in the enhancer sequences. We present a deep-learning-based hybrid architecture, BiRen, which predicts enhancers using the DNA sequence alone. Our results demonstrate that BiRen can learn common enhancer patterns directly from the DNA sequence and exhibits superior accuracy, robustness and generalizability in enhancer prediction relative to other state-of-the-art enhancer predictors based on sequence characteristics. BiRen will enable researchers to acquire a deeper understanding of the regulatory code of enhancer sequences. BiRen can be freely accessed at https://github.com/wenjiegroup/BiRen. Contact: shuwj@bmi.ac.cn or boxc@bmi.ac.cn. Supplementary data are available at Bioinformatics online.
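A model that predicts from the DNA sequence alone needs a numeric representation of that sequence. The abstract does not describe BiRen's input encoding, so the one-hot scheme below is only a common, generic choice shown as an illustration; the function name `one_hot_dna` is hypothetical.

```python
def one_hot_dna(seq):
    """Encode a DNA string as a list of 4-element one-hot rows in A, C, G, T order.

    Any other symbol (for example N) is mapped to an all-zero row.
    """
    table = {
        "A": (1, 0, 0, 0),
        "C": (0, 1, 0, 0),
        "G": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
    }
    unknown = (0, 0, 0, 0)
    # Build a fresh list per base so rows can be modified independently.
    return [list(table.get(base, unknown)) for base in seq.upper()]


encoded = one_hot_dna("acgn")
```

The resulting matrix has one row per base and can be fed to a convolutional or recurrent network as a sequence of 4-channel vectors.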