In recent years, machine learning (ML) researchers have shifted their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have made omic data available for training these algorithms. To study the state of the art, this review covers the main works that have used ML with TCGA data. First, the principal discoveries made by the TCGA consortium are presented. With this groundwork established, we turn to the main objective of this study: the identification and discussion of works that have used TCGA data to train different ML approaches. After reviewing more than 100 papers, we classified them according to three pillars: the type of tumour, the type of algorithm and the predicted biological problem. One conclusion of this work is the high density of studies based on two major algorithms: Random Forest and Support Vector Machines. We also observe a rise in the use of deep artificial neural networks. The increase in integrative models for multi-omic data analysis is also worth emphasizing. The different biological conditions are a consequence of molecular homeostasis, driven by protein-coding regions, regulatory elements and the surrounding environment. Notably, a large number of works use gene expression data, which has been the data type preferred by researchers for training the different models. The biological problems addressed have been classified into five types: prognosis prediction, tumour subtypes, microsatellite instability (MSI), immunological aspects and certain pathways of interest. A clear trend was detected in the prediction of these conditions according to the type of tumour. This explains why a greater number of works have focused on the BRCA cohort, while works on survival, for example, centred on the GBM cohort because of its large number of events. This review examines in depth the works and methodologies used to study TCGA cancer data. Finally, this work is intended to serve as a basis for future research in this field.
Transcriptome analysis plays a central role in understanding complex, heterogeneous and multifactorial diseases such as cancer. It is used as a tool for characterizing and understanding phenotypic alterations and molecular biology. Transcriptomic profiles are used to search for genes whose expression levels differ in association with a particular response. RNA-seq data allow researchers to study millions of short reads from mRNA samples sequenced on Next Generation Sequencing (NGS) platforms. In general, such volumes of data are difficult to interpret, and there is no optimal analysis protocol for every individual analysis. On the one hand, classical statistical approaches are available in various R packages (for example DESeq or edgeR, among others). On the other hand, in medicine we can use Machine Learning algorithms for the differential expression analysis of a particular response variable (for example, healthy versus diseased), selecting the genes that are most relevant and that best discriminate between the two categories, while taking into account biological pathway information and gene relationships, or using integrative approaches to include all the available information from different data sources. The main objective of this presentation is to introduce a Machine Learning-based approach to gene expression analysis with RNA-seq in the study of cancer.
Inflammatory bowel disease (IBD) is a chronic disease with unknown pathophysiological mechanisms. There is evidence that microorganisms play a role in the development of this disease. Thanks to open access to multiple omics datasets, it is possible to develop predictive models of the course and development of the disease. The interpretability of these models, and the study of the variables they use, allows the identification of biological aspects of great importance in the development of the disease. In this work we generated a metagenomic signature with the predictive capacity to identify IBD from fecal samples. Different Machine Learning models were trained, obtaining high performance measures. The predictive capacity of the identified signature was validated in two external cohorts: one containing samples from patients suffering from Ulcerative Colitis and another from patients suffering from Crohn's Disease, the two major subtypes of IBD. The results obtained in this validation (AUC = 0.74 and AUC = 0.76, respectively) show that our signature generalizes to both subtypes. The study of the variables within the model, together with a correlation study based on text mining, identified different genera that play an important and common role in the development of these two subtypes.
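As an illustration of how an AUC such as those reported for the external cohorts can be computed, the following is a minimal sketch using the rank-based (Mann-Whitney) definition. The function name `roc_auc`, the labels and the scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical external-cohort data: 1 = IBD, 0 = control.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small validation cohort; a sort-based implementation would be preferable for large ones.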
With the falling cost of high-throughput sequencing and the rise of computing technologies capable of analyzing huge amounts of data, the two fields now stand to benefit from each other. Transcriptomics is the branch of biology focused on the study of mRNA molecules, among others. Quantifying these molecules tells us how strongly a gene is being expressed at a given moment. Having expression information for the approximately 20,000 genes carried by human beings is a valuable resource for the study of certain conditions and pathologies. In this work, patient expression omic data have been used to offer a new analysis methodology based on Machine Learning. The results of this methodology were compared with those of a conventional methodology to observe where they differed and where they agreed. These techniques therefore offer a new mechanism for the search for genetic signatures involved, in this case, in cancer.
Screening and in silico modeling are critical activities for reducing experimental costs. They also speed up research notably and strengthen the theoretical framework, allowing researchers to quantify numerically the importance of a particular subset of information. For example, in fields such as cancer and other highly prevalent diseases, having a reliable prediction method is crucial. The objective of this paper is to classify peptide sequences according to their anti-angiogenic activity via machine learning in order to understand the underlying principles. First, the peptide sequences were converted into three types of numerical molecular descriptors based on the amino acid composition. We performed different experiments with the descriptors and merged them to obtain baseline results for the performance of the models, particularly for each molecular descriptor subset. A feature selection process was applied to reduce the dimensionality of the problem and remove noisy features, which are common in biological problems. Under a robust machine learning experimental design with equal conditions (nested resampling, cross-validation, hyperparameter tuning and different runs), a generalized linear model fitted via coordinate descent (glmnet) outperformed the best previously published anti-angiogenic model with statistical significance, achieving a mean AUC greater than 0.96 and an accuracy of 0.86 with 200 molecular descriptors mixed from the three groups. A final analysis of the top-40 discriminative anti-angiogenic activity peptides is presented along with a discussion of the feature selection process and the individual importance of each molecular descriptor. According to our findings, anti-angiogenic activity peptides are strongly associated with amino acid sequences SP, LSL, PF, DIT, PC, GH, RQ, QD, TC, SC, AS, CLD, ST, MF, GRE, IQ, CQ and HG.
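The nested resampling mentioned above can be sketched as follows. This is only an assumed, generic layout of outer and inner folds; the helper names, the fold counts and the seed are hypothetical and do not reproduce the paper's actual experimental design.

```python
import random


def k_fold_indices(indices, k, seed):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    inner_splits holds (inner_train, inner_val) pairs drawn only from
    outer_train, so the outer test fold never influences feature selection
    or hyperparameter tuning.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, seed + i + 1)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for q, fold in enumerate(inner_folds) if q != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# Hypothetical dataset of 20 peptides.
splits = list(nested_cv_splits(n_samples=20, outer_k=5, inner_k=3))
for outer_train, outer_test, inner_splits in splits:
    print(len(outer_train), len(outer_test), [len(val) for _, val in inner_splits])
```

The inner folds would be used for tuning and feature selection, and the outer test fold only for the final performance estimate of each run.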
Heart failure (HF) is a major public health problem. Early identification of at-risk individuals could allow for interventions that reduce morbidity or mortality. The community-based FINRISK Microbiome DREAM challenge (synapse.org/finrisk) evaluated the use of machine learning approaches on shotgun metagenomics data obtained from fecal samples to predict incident HF risk over 15 years in a population cohort of 7231 Finnish adults (FINRISK 2002, n=559 incident HF cases). Challenge participants used synthetic data for model training and testing. Final models submitted by seven teams were evaluated on the real data. The two highest-scoring models were both based on Cox regression but used different feature selection approaches. We aggregated their predictions to create an ensemble model. Additionally, we refined the models after the DREAM challenge by eliminating phylum information. Models were also evaluated at intermediate timepoints, and they predicted 10-year incident HF more accurately than models for 5- or 15-year incidence. We found that bacterial species, especially those linked to inflammation, are predictive of incident HF. This highlights the role of the gut microbiome as a potential driver of inflammation in HF pathophysiology. Our results provide insights into potential modeling strategies for microbiome data in prospective cohort studies. Overall, this study provides evidence that incorporating microbiome information into incident risk models can provide important biological insights into the pathogenesis of HF.
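The abstract does not state how the predictions of the two models were aggregated. One common option, shown here purely as an assumption, is to average the per-individual ranks of each model's risk scores so that models on different scales contribute equally. All names and values below are hypothetical.

```python
def to_ranks(scores):
    """Return average ranks (1 = lowest score); tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all entries tied with the current score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def rank_average_ensemble(*model_scores):
    """Average the rank-transformed risk scores of several models."""
    n = len(model_scores[0])
    if any(len(s) != n for s in model_scores):
        raise ValueError("all models must score the same individuals")
    ranked = [to_ranks(s) for s in model_scores]
    return [sum(r[i] for r in ranked) / len(ranked) for i in range(n)]


# Hypothetical risk scores from two Cox-type models for four individuals.
model_a = [0.2, 1.5, 0.7, 0.7]
model_b = [10.0, 30.0, 5.0, 20.0]
ensemble = rank_average_ensemble(model_a, model_b)
print(ensemble)
```

Rank averaging preserves only the ordering of individuals, which is sufficient for ranking-based metrics but discards any calibration of the original risk estimates.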
Previous works have reported different bacterial strains and genera as the cause of different clinical pathological conditions. In our approach, using the fecal metagenomic profiles of newborns, we generated a machine learning-based model capable of discriminating between patients affected by type 1 diabetes and controls. A random forest algorithm achieved an AUROC of 0.915. Automating processes and supporting clinical decision making with metagenomic variables of interest may lower experimental costs in the diagnosis of complex diseases that are highly prevalent worldwide.
Heart failure (HF) is characterized by severely reduced cardiac function and tissue remodeling, driven by complex multicellular regulatory processes. Extensive studies have generated molecular profiles at both bulk and single-cell levels; however, systematic integration that describes tissue-wide changes as a function of cell type coordination remains challenging. This disconnect hampers our understanding of the complex multicellular interactions driving heart failure, limiting our ability to translate molecular insights into actionable therapeutic strategies. Here, we integrated bulk and single-cell transcriptional profiles from cardiac tissues of HF and control patients across 25 studies, covering 1,524 individuals and seven cell types, to delineate consensus multicellular transcriptional changes associated with cardiac remodeling. Our analyses revealed conserved cellular coordination events involving fibrotic, metabolic, inflammatory, and hypertrophic mechanisms, with fibroblasts playing a central role in predicting cardiomyocyte stress. Further analysis of fibroblast populations suggested that their activation in HF represents a broad phenotypic shift rather than solely an accumulation of distinct cell states. The integration of bulk and single-cell data within our data collection indicated that transcriptional responses to HF across cell types occur independently of tissue composition. Mapping independent data onto our consensus programs demonstrated that recovery after left ventricular assist device implantation aligns with molecular recovery, highlighting the clinical relevance of the multicellular molecular state. Overall, our work synthesizes independent cardiac transcriptomics studies and makes the conserved HF-associated insights available, establishing a reference for detailed exploration of HF-related multicellular molecular events.
Colon cancer is the second most common cause of cancer death worldwide. Despite advances in the development of new molecular strategies for stratifying patients with colon cancer, many of these patients do not respond adequately to the standard of care. While previous studies have focused on the development of prognostic gene expression signatures, the exploration of predictive signatures to inform treatment decisions remains incomplete. In this study, we leveraged public gene expression datasets to design and experimentally validate a 37-gene expression signature for prognosis in colon cancer patients. We obtained a C-index of 0.732 (0.610-0.853) in four independent studies. Specifically, we discovered that the signature is associated with the mitotic phase of the cell cycle. Furthermore, the signature identified a population of colon cancer patients sensitive to tubulin inhibitor drugs. In particular, we validated in vitro and in vivo the efficacy of paclitaxel, a commonly used tubulin inhibitor in breast cancer treatment, in patient-derived preclinical models. These results highlight the importance of incorporating gene expression signatures to identify new therapeutic options for colon cancer treatment. Furthermore, the identification of alternative treatment options with potentially improved efficacy holds promise for the development of new clinical trials, and reshapes the biomarker-based treatment strategy for second line and refractory colon cancer patients.
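For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It is a simplified version that does not treat tied survival times specially, and the patient data are hypothetical rather than taken from the study.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the higher-risk patient fails first.

    times       -- follow-up time for each patient
    events      -- 1 if the event was observed, 0 if the patient was censored
    risk_scores -- model output, where a higher value means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: the third patient is censored at time 12.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk)
print(f"C-index = {c_index:.3f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.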
In silico prediction of metabolic activities is crucial for exploring research possibilities without incurring excessive experimental costs. In cancer research in particular, predicting certain activities can greatly help in the discovery of new treatments. This work proposes using Machine Learning to predict the anti-angiogenic activity of peptides, a property currently being exploited in cancer treatment with promising results. From a list of peptide sequences, three types of molecular descriptors were obtained (AAC, DC and TC), which were used to train different ML algorithms. After a Feature Selection process, different models were obtained whose predictive performance surpassed the current state of the art. These results show that ML is useful for the classification and prediction of the activity of new peptides, making experimental screening cheaper and faster.
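The descriptor step can be illustrated with a short sketch. It assumes that AAC, DC and TC stand for amino acid, dipeptide and tripeptide composition, and that each descriptor is the relative frequency of a k-mer in the sequence; both the interpretation and the normalization are assumptions, and the function names and example peptide are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Relative frequency of every possible k-mer over the 20 standard amino acids."""
    sequence = sequence.upper()
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    total = max(len(sequence) - k + 1, 0)  # number of sliding windows
    for i in range(total):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing non-standard residues
            counts[kmer] += 1
    return {kmer: (count / total if total else 0.0) for kmer, count in counts.items()}


def peptide_descriptors(sequence):
    """Compute the three descriptor groups (assumed meanings of AAC, DC and TC)."""
    return {
        "AAC": kmer_composition(sequence, 1),  # 20 features
        "DC": kmer_composition(sequence, 2),   # 400 features
        "TC": kmer_composition(sequence, 3),   # 8000 features
    }


# Hypothetical peptide used only to demonstrate the output shape.
descriptors = peptide_descriptors("SPLSLPF")
print({name: len(values) for name, values in descriptors.items()})
```

The tripeptide group alone contributes 8000 mostly zero-valued features for a short peptide, which is one reason a feature selection step is useful before model training.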