Abstract Background In-silico identification of potential disease genes has become an essential aspect of drug target discovery. Recent studies suggest that one powerful way to identify successful targets is through the use of genetic and genomic information. Given a known disease gene, leveraging intermolecular connections via networks and pathways seems a natural way to identify other genes and proteins that are involved in similar biological processes, and that can therefore be analysed as additional targets. Results Here, we systematically tested the ability of 12 varied network-based algorithms to identify target genes and cross-validated these using gene-disease data from Open Targets on 22 common diseases. We considered two biological networks and six performance metrics, and compared two types of input gene-disease association scores. We also compared several cross-validation schemes and showed that different choices had a remarkable impact on the performance estimates. When seeding biological networks with known drug targets, we found that machine learning and diffusion-based methods are able to find novel targets, showing around 2-4 true hits in the top 20 suggestions. Seeding the networks with genes genetically associated with disease resulted in poorer performance, below 1 true hit on average. We also observed that the use of a larger network, although noisier, improved overall performance. Conclusions We conclude that machine learning and diffusion-based prioritisers are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large effect of several factors on prediction performance, especially the validation strategy, input biological network, and definition of seed disease genes.
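The "true hits in the top 20 suggestions" figure quoted above can be read as a simple ranking metric. The following is a minimal sketch of such a count, assuming a dictionary of prioritisation scores and a set of held-out true targets; the function name `hits_at_k` and the gene identifiers are hypothetical and are not taken from the study.

```python
def hits_at_k(scores, true_targets, k=20):
    """Count how many of the k best-scored genes are known true targets.

    scores: dict mapping gene identifier -> prioritisation score (higher is better)
    true_targets: set of gene identifiers held out as validation targets
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for gene in ranked[:k] if gene in true_targets)


# Hypothetical toy data: five genes, three of which are held-out targets.
toy_scores = {"G1": 0.9, "G2": 0.8, "G3": 0.4, "G4": 0.3, "G5": 0.1}
toy_targets = {"G2", "G4", "G5"}
hits = hits_at_k(toy_scores, toy_targets, k=3)
```

With real data, the same count would be averaged over diseases and cross-validation folds, which is where the choice of validation scheme enters.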
One of the most widespread experimental techniques in biological research and analytical chemistry is liquid chromatography-mass spectrometry (LC/MS), whose output reports on the compounds present in the samples by means of a physical separation technique coupled to a separation by mass-to-charge ratio. Pathway enrichment techniques are valued in the treatment of large data sets, since they translate this compound-level information into metabolic pathways while reducing statistical noise. Metabolic pathways are a source of knowledge because of their close relationship with biological mechanisms.
This work proposes a new enrichment technique for LC/MS data, built as a two-block strategy. The first block represents the Kyoto Encyclopedia of Genes and Genomes database as interpretable graphs. The second applies heat diffusion and PageRank algorithms on these graphs in order to carry out the enrichment. These procedures were applied to a real case study, and their results agree with those of the functional validation.
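To illustrate the second block, the sketch below runs a generic personalised PageRank by power iteration on a tiny graph. It is an assumption-laden toy, not the implementation used in this work: the node names (`cpd_A`, `rxn_1`, `path_X`, and so on), the damping factor and the graph itself are hypothetical, and every node is assumed to have at least one neighbour.

```python
def personalised_pagerank(adj, seeds, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank with restarts concentrated on `seeds`.

    adj: dict mapping each node to the list of its neighbours (undirected graph,
         every node must have at least one neighbour)
    seeds: set of nodes carrying the input signal, e.g. metabolites of interest
    """
    nodes = sorted(adj)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    score = dict(restart)
    for _ in range(max_iter):
        # Each node keeps a share of the restart mass...
        new = {v: (1.0 - damping) * restart[v] for v in nodes}
        # ...and receives an equal split of each neighbour's damped score.
        for u in nodes:
            share = damping * score[u] / len(adj[u])
            for v in adj[u]:
                new[v] += share
        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < tol:
            break
    return score


# Hypothetical toy graph: compounds linked to pathways through reactions.
adj = {
    "cpd_A": ["rxn_1"],
    "cpd_B": ["rxn_1", "rxn_2"],
    "rxn_1": ["cpd_A", "cpd_B", "path_X"],
    "rxn_2": ["cpd_B", "path_Y"],
    "path_X": ["rxn_1"],
    "path_Y": ["rxn_2"],
}
scores = personalised_pagerank(adj, {"cpd_A"})
```

In this toy, the pathway closer to the seed compound (`path_X`) ends up with a higher score than the more distant one (`path_Y`), which is the behaviour an enrichment based on propagation relies on.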
Abstract Summary Label propagation and diffusion over biological networks are a common mathematical formalism in computational biology for giving context to molecular entities and prioritizing novel candidates in the area of study. There are several choices in conceiving the diffusion process—involving the graph kernel, the score definitions and the presence of a posterior statistical normalization—which have an impact on the results. This manuscript describes diffuStats, an R package that provides a collection of graph kernels and diffusion scores, as well as a parallel permutation analysis for the normalized scores, that eases the computation of the scores and their benchmarking for an optimal choice. Availability and implementation The R package diffuStats is publicly available in Bioconductor, https://bioconductor.org, under the GPL-3 license. Supplementary information Supplementary data are available at Bioinformatics online.
The most common preclinical, in vivo model to study lung fibrosis is the bleomycin-induced lung fibrosis model in 2- to 3-mo-old mice. Although this model resembles key aspects of idiopathic pulmonary fibrosis (IPF), there are limitations in its predictability for the human disease. One of the main differences is the juvenile age of animals that are commonly used in experiments, resembling humans of around 20 yr. Because IPF patients are usually older than 60 yr, aging appears to play an important role in the pathogenesis of lung fibrosis. Therefore, we compared young (3-mo-old) and old (21-mo-old) mice 21 days after intratracheal bleomycin instillation. Analyzing lung transcriptomics (mRNAs and miRNAs) and proteomics, we found most pathways to be similarly regulated in young and old mice. However, old mice show imbalanced protein homeostasis as well as an increased inflammatory state in the fibrotic phase compared to young mice. Comparisons with published human transcriptomic data sets (GSE47460, GSE32537, and GSE24206) revealed that the gene signature of old animals correlates significantly better with IPF patients; in an approach based on predictive disease modeling, it was also better at shifting healthy human individuals toward an "IPF patient" profile. Both young and old animals show similar molecular hallmarks of IPF in the bleomycin-induced lung fibrosis model, although old mice more closely resemble several features associated with IPF in comparison to young animals.
Abstract Summary High-throughput screening yields vast amounts of biological data which can be highly challenging to interpret. In response, knowledge-driven approaches emerged as possible solutions to analyze large datasets by leveraging prior knowledge of biomolecular interactions represented in the form of biological networks. Nonetheless, given their size and complexity, their manual investigation quickly becomes impractical. Thus, computational approaches, such as diffusion algorithms, are often employed to interpret and contextualize the results of high-throughput experiments. Here, we present MultiPaths, a framework consisting of two independent Python packages for network analysis. While the first package, DiffuPy, comprises numerous state-of-the-art diffusion algorithms applicable to any generic network, the second, DiffuPath, enables the application of these algorithms on multi-layer biological networks. To facilitate its usability, the framework includes a command line interface, reproducible examples, and documentation. To demonstrate the framework, we conducted several diffusion experiments on three independent multi-omics datasets over disparate networks generated from pathway databases, thus highlighting the ability of multi-layer networks to integrate multiple modalities. Finally, the results of these experiments demonstrate how the generation of harmonized networks from disparate databases can improve predictive performance with respect to individual resources. Availability DiffuPy and DiffuPath are publicly available under the Apache License 2.0 at https://github.com/multipaths. Contact sergi.picart@upc.edu and daniel.domingo.fernandez@scai.fraunhofer.de
The advent of high-throughput technologies and their decreasing cost have fostered the creation of a rich ecosystem of public database resources. In an era of affordable data acquisition, the core challenge has shifted to improving data interpretation, in order to understand normal and disease states. To that end, leveraging the current contextual knowledge in the form of annotations and biological networks is a powerful data amplifier to elucidate novel hypotheses. Label propagation and diffusion are the linchpin of the state of the art in network algorithms. In its simplest form, label propagation predicts the labels of a given node (for instance a gene, protein or metabolite) using those of its interactors. More elaborate approaches propagate beyond direct interactors, with robust performance in many computational biology domains. It has been pointed out that the topological structure of biological networks can bias propagation algorithms. Poorly known entities are overlooked and harder to link to experimental findings, which in turn keeps them barely annotated. Some efforts try to break this circularity by statistically normalising the topological bias, but the properties of the bias and the real benefit of its removal are yet to be carefully examined. This thesis covers two blocks. First, a characterisation of the bias in diffusion-based algorithms, with the implementation of statistical normalisations. Second, the application of such normalisation in classical computational biology problems: pathway analysis for metabolomics data and target gene prediction for drug development. In the first block, the presence of the bias is confirmed and linked to the network topology, albeit dependent on which nodes have labels. Equivalences are proven between diffusion processes with variations on their definitions, thus easing the choice among them. 
Closed forms on the first and second statistical moments of the null distributions of the diffusion scores are provided and linked to the spectral features of the network. The normalisation can be detrimental if the bias favours nodes with positive labels. An ad-hoc study of the data and the expected properties of the findings is recommended for an optimal choice. To that end, this thesis contributes the diffuStats software package, easing the computation and benchmark of several normalised and unnormalised diffusion scores. The second block starts with pathway analysis for metabolomics data. This choice is driven by the relative lack of computational solutions for metabolomics, whose output still requires an effortful interpretation. Here, a knowledge graph is conceived to connect the metabolites to the biological pathways through intermediate entities, like reactions and enzymes. Given the metabolites of interest, a propagation process is run to prioritise a relevant sub-network, suitable for manual inspection. The statistical normalisation is required due to the network design and properties. The usefulness of this approach is proven not only regarding pathway findings, but also examining the metabolites and reactions within the suggested sub-networks. The knowledge network construction and the propagation algorithm are distributed in the FELLA software package. The second practical application is the prediction of plausible gene targets in disease. Besides benchmarking the effect of the statistical normalisation, particular care is put into obtaining meaningful performance estimates for practical drug development. Target data is usually known at the protein complex level, which leads to performance over-estimation if ignored. Here, this effect is corrected in a varied comparison of prioritisation algorithms, networks, performance metrics and diseases. The results support that the statistical normalisation has a small but negative impact. 
After correcting for the protein complex structure, network-based algorithms are still deemed useful for drug discovery. The advent of high-throughput experimental technologies has fostered a rich ecosystem of databases that gather all kinds of molecular annotations. Given the growing ease of acquiring data at several molecular levels, the central challenge of computational biology has shifted towards interpreting that volume of data. Understanding the processes of normality and disease behind the changes observed in experimental studies is the engine that pushes the frontier of human knowledge. To that end, it is essential to leverage the legacy of prior knowledge, collected in databases as annotations and biological networks, and to mine it for new patterns and hypotheses. The most widespread algorithms for extracting knowledge from biological networks are the so-called propagation and diffusion methods. They rest on the guilt-by-association principle, which posits that biological entities that are related or interact are more likely to share functions and properties. These algorithms exploit known interactions, in network form, to predict properties of nodes (for example genes, proteins or metabolites) from the properties of their interactors. There is evidence that the topological structure of networks biases propagation algorithms, so that better-described nodes enjoy a systematic advantage. Lesser-known nodes are at a disadvantage: discovering their involvement in experiments is hindered, which in turn perpetuates our poor knowledge of them. The literature offers some studies in which this effect is normalised, but the intrinsic properties of the bias and the real benefit of such normalisation require a more detailed study. The aim of this thesis is twofold. 
First, the statistical characterisation of the bias in propagation algorithms, the design of statistical normalisations and their distribution as scientific software. Second, the application of such normalisation to classical problems in computational biology: specifically, pathway analysis for metabolomics data and the prediction of genes as therapeutic targets in drug development. Both problems can be tackled with propagation techniques and are therefore potentially sensitive to the topological bias. In the first block, the existence of the bias is confirmed, along with its dependence not only on the network structure but also on the nodes on which the propagation is defined. Mathematical equivalences are proven between certain variations in the definition of the propagation, thereby easing the choice among them. Closed-form expressions are provided for the statistical moments of the diffusion, and a connection is found with the spectral properties of the networks. An important point is that normalisation does not always help; its applicability depends on each particular case and on the hypotheses about the topology of the nodes to be discovered. To that end, this thesis delivers diffuStats, a software package available in a public repository, which computes and compares propagation under several variants, with or without normalisation. In the second block, pathway analysis in metabolomics is chosen given the relative youth of metabolomics studies and, consequently, their lack of dedicated software tools. Classical pathway analysis starts from a list of metabolites of interest, usually derived from a study, and reports a list of metabolic pathways or processes statistically related to them. Some variants use metabolite networks to provide more biological context, but interpreting the data still requires extensive manual effort. 
The contribution of this thesis is a knowledge network that links metabolites to pathways through the annotated intermediate entities, such as reactions and enzymes. Propagation algorithms are applied on this network to identify the entities most related to the metabolites of interest. Statistical normalisation is necessary, given the structure and characteristics of the network. Not only the proposed metabolic pathways but also the prioritised metabolites and reactions are shown to be coherent. The FELLA software package makes the knowledge network construction and the diffusion algorithm available to the scientific community. FELLA comes with six application cases in human and animal studies. Separately, the thesis addresses the prediction of therapeutic target genes through biological networks. Besides testing the effect of statistical normalisation, emphasis is placed on estimating the real performance expected in a drug development scenario. Therapeutic target data are usually known not at the protein level but at the level of the protein complex or family. Most studies do not take this into account and arrive at optimistic estimates of the expected performance. This thesis proposes a comprehensive study that corrects for the effect of protein complexes, compares propagation algorithms under several performance metrics according to their informativeness, and explores the role of the biological network and of the disease in question. It is shown that statistical normalisation has little effect on performance and that, in general, propagation methods remain useful in drug development after correcting the optimistic estimates of their performance.
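The over-estimation caused by protein complexes can be pictured as a leakage problem in cross-validation: if two members of one complex land in different folds, one can trivially recover the other through the network. One common way to avoid this, shown below as a hedged sketch, is to assign whole complexes to folds. This is not necessarily the correction used in the thesis; the function name `complex_aware_folds`, the gene identifiers and the complex labels are all hypothetical.

```python
def complex_aware_folds(genes, complex_of, n_folds=3):
    """Split genes into folds so that members of one complex share a fold.

    genes: list of gene identifiers to split
    complex_of: dict mapping a gene to its complex label (genes absent from
                the dict are treated as singletons)
    """
    groups = {}
    for gene in genes:
        # Tuple keys keep complex labels and singleton genes from colliding.
        key = ("complex", complex_of[gene]) if gene in complex_of else ("single", gene)
        groups.setdefault(key, []).append(gene)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


# Hypothetical toy data: two complexes and two singleton genes.
genes = ["G1", "G2", "G3", "G4", "G5", "G6", "G7"]
complex_of = {"G1": "cA", "G2": "cA", "G3": "cA", "G4": "cB", "G5": "cB"}
folds = complex_aware_folds(genes, complex_of, n_folds=3)
fold_of = {gene: index for index, fold in enumerate(folds) for gene in fold}
```

The greedy balancing is a design convenience only; any assignment that keeps each complex intact serves the same purpose, at the price of folds that may differ in size.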
Indication expansion aims to find new indications for existing targets in order to accelerate the process of launching a new drug for a disease on the market. The rapid increase in data types and data sources for computational drug discovery has fostered the use of semantic knowledge graphs (KGs) for indication expansion through target-centric approaches, or in other words, target repositioning. Previously, we developed a novel method to construct a KG for indication expansion studies, with the aim of finding and justifying alternative indications for a target gene of interest. In contrast to other KGs, ours combines human-curated full-text literature and gene expression data from biomedical databases to encode relationships between genes, diseases, and tissues. Here, we assessed the suitability of our KG for explainable target-disease link prediction using a glass-box approach. To evaluate the predictive power of our KG, we applied shortest-path (with tissue information) and embedding-based prediction methods to a graph constructed with information published before or during 2010. We also obtained random baselines by applying the shortest-path predictive methods to KGs with randomly shuffled node labels. Then, we evaluated the accuracy of the top predictions using gene-disease links reported after 2010. In addition, we investigated the contribution of the KG's tissue expression entity to the prediction performance. Our experiments showed that shortest-path-based methods significantly outperform the random baselines and that embedding-based methods outperform the shortest-path predictions. Importantly, removing the tissue expression entity from the KG severely impacts the quality of the predictions, especially those produced by the embedding approaches. Finally, since the interpretability of the predictions is crucial in indication expansion, we highlight the advantages of our glass-box model through the examination of example candidate target-disease predictions.
Abstract Motivation Network diffusion and label propagation are fundamental tools in computational biology, with applications like gene-disease association, protein function prediction and module discovery. More recently, several publications have introduced a permutation analysis after the propagation process, due to concerns that network topology can bias diffusion scores. This opens the question of the statistical properties and the presence of bias of such diffusion processes in each of their applications. In this work, we characterised some common null models behind the permutation analysis and the statistical properties of the diffusion scores. We benchmarked seven diffusion scores on three case studies: synthetic signals on a yeast interactome, simulated differential gene expression on a protein-protein interaction network and prospective gene set prediction on another interaction network. For clarity, all the datasets were based on binary labels, but we also present theoretical results for quantitative labels. Results Diffusion scores starting from binary labels were affected by the label codification, and exhibited a problem-dependent topological bias that could be removed by the statistical normalisation. Parametric and non-parametric normalisation addressed both points by being codification-independent and by equalising the bias. We identified and quantified two sources of bias (mean value and variance) that yielded performance differences when normalising the scores. We provided closed formulae for both and showed how the null covariance is related to the spectral properties of the graph. Although none of the proposed scores systematically outperformed the others, normalisation was preferred when the sought positive labels were not aligned with the bias. We conclude that the decision on bias removal should be problem and data-driven, i.e. based on a quantitative analysis of the bias and its relation to the positive entities. 
Availability The code is publicly available at https://github.com/b2slab/diffuBench Contact sergi.picart@upc.edu
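As a rough illustration of the permutation analysis discussed above, the sketch below normalises raw diffusion scores f = K y into z-scores using the exact mean and variance of f under uniform permutation of the labels, and checks them against exhaustive enumeration. It assumes the simplest null model (all labels permuted over all nodes); the formulae are the standard moments of a linear statistic under sampling without replacement and may differ from the paper's own null models and notation. The kernel values are made up.

```python
from itertools import permutations
from math import sqrt


def raw_scores(kernel, labels):
    """Raw diffusion scores f = K y for a square kernel and a label vector."""
    return [sum(k * y for k, y in zip(row, labels)) for row in kernel]


def null_moments(kernel, labels):
    """Exact (mean, variance) of each f_i when labels are permuted uniformly."""
    n = len(labels)
    y_mean = sum(labels) / n
    y_var = sum((y - y_mean) ** 2 for y in labels) / n  # population variance
    moments = []
    for row in kernel:
        s1 = sum(row)
        s2 = sum(k * k for k in row)
        mean = y_mean * s1
        var = y_var * n / (n - 1) * (s2 - s1 * s1 / n)
        moments.append((mean, var))
    return moments


def z_scores(kernel, labels):
    """Normalised scores: raw score minus null mean, over null standard deviation."""
    f = raw_scores(kernel, labels)
    return [(fi - m) / sqrt(v) if v > 0 else 0.0
            for fi, (m, v) in zip(f, null_moments(kernel, labels))]


def brute_force_moments(kernel, labels):
    """Same moments by enumerating every permutation (feasible for tiny n only)."""
    n = len(labels)
    sums, sq_sums, count = [0.0] * n, [0.0] * n, 0
    for perm in permutations(labels):
        for i, fi in enumerate(raw_scores(kernel, perm)):
            sums[i] += fi
            sq_sums[i] += fi * fi
        count += 1
    return [(s / count, q / count - (s / count) ** 2)
            for s, q in zip(sums, sq_sums)]


# Made-up symmetric kernel in which node 0 behaves like a hub.
K = [[2.0, 0.8, 0.8, 0.8],
     [0.8, 1.0, 0.3, 0.3],
     [0.8, 0.3, 1.0, 0.3],
     [0.8, 0.3, 0.3, 1.0]]
y = [0, 1, 0, 0]
exact = null_moments(K, y)
brute = brute_force_moments(K, y)
z = z_scores(K, y)
```

In practice the enumeration is replaced by Monte Carlo permutations or by the closed forms themselves; the per-node mean and variance are exactly the two sources of bias that the normalisation equalises.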