Deep Learning (DL) methods have dramatically increased in popularity in recent years, with significant growth in their application to various supervised learning problems. However, the greater prevalence and complexity of missing data in such datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of deeply learned generalized linear models, a supervised DL architecture for regression and classification problems. We propose a new architecture, dlglm, that is one of the first to flexibly account for both ignorable and non-ignorable patterns of missingness in input features and response at training time. We demonstrate through statistical simulation that our method outperforms existing approaches for supervised learning tasks when data are missing not at random (MNAR). We conclude with a case study of the Bank Marketing dataset from the UCI Machine Learning Repository, in which we predict whether clients subscribed to a product based on phone survey data. Supplementary materials for this article are available online.
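The distinction between ignorable and non-ignorable missingness can be illustrated with a small simulation. The following is a minimal sketch and is not the dlglm method: the variables `z` and `x`, the missingness probabilities, and the function name `stratum_means` are all hypothetical choices made for illustration. It compares complete-case means of `x` within one stratum of an always-observed variable `z`, once when missingness depends only on `z` (missing at random, ignorable given `z`) and once when it depends on `x` itself (MNAR).

```python
import random
import statistics


def stratum_means(n=20000, seed=0):
    """Compare complete-case means of x within the stratum z > 0.

    Hypothetical setup: z is always observed; x = z + noise may be missing.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [zi + rng.gauss(0.0, 1.0) for zi in z]

    def observed(driver):
        # Assumed mechanism: missing with probability 0.6 when the driver
        # is positive and 0.1 otherwise.
        p_missing = 0.6 if driver > 0 else 0.1
        return rng.random() >= p_missing

    pairs = [(xi, zi) for xi, zi in zip(x, z) if zi > 0]
    full = [xi for xi, _ in pairs]
    # MAR: missingness of x depends only on the observed z.
    mar = [xi for xi, zi in pairs if observed(zi)]
    # MNAR: missingness of x depends on the (possibly unobserved) x itself.
    mnar = [xi for xi, _ in pairs if observed(xi)]
    return {
        "full": statistics.fmean(full),
        "mar": statistics.fmean(mar),
        "mnar": statistics.fmean(mnar),
    }


if __name__ == "__main__":
    for name, value in stratum_means().items():
        print(f"{name}: {value:.3f}")
```

Within the stratum, the MAR complete cases should recover the full-data mean up to sampling noise, whereas the MNAR complete cases are pulled downward because large values of `x` are preferentially removed. Conditioning on observed variables alone cannot undo that second kind of bias, which is why the missingness mechanism has to be modeled.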
In DAE (DNA After Enrichment)-seq experiments, genomic regions related to certain biological processes are enriched/isolated by an assay and are then sequenced on a high-throughput sequencing platform to determine their genomic positions. Statistical analysis of DAE-seq data aims to detect genomic regions with significant aggregations of isolated DNA fragments ("enriched regions") versus all the other regions ("background"). However, many confounding factors may influence DAE-seq signals. In addition, the signals in adjacent genomic regions may exhibit strong correlations, which invalidate the independence assumption employed by many existing methods. To mitigate these issues, we develop a novel Autoregressive Hidden Markov Model (AR-HMM) to account for covariate effects and violations of the independence assumption. We demonstrate that our AR-HMM leads to improved performance in identifying enriched regions in both simulated and real datasets, especially in epigenetic datasets with broader regions of DAE-seq signal enrichment. We also introduce a variable selection procedure in the context of the HMM/AR-HMM where the observations are not independent and the mean value of each state-specific emission distribution is modeled by some covariates. We study the theoretical properties of this variable selection procedure and demonstrate its efficacy in simulated and real DAE-seq data. In summary, we develop several practical approaches for DAE-seq data analysis that are also applicable to more general problems in statistics.
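To make the idea of an autoregressive HMM concrete, the sketch below computes the log-likelihood of a sequence with the forward algorithm when each state's emission mean depends on the previous observation. It is a generic illustration under stated assumptions, not the model described above: Gaussian emissions, a first-order autoregressive term, no covariates, and the names `ar_hmm_loglik`, `intercept`, `phi`, and `sd` are all hypothetical.

```python
import math


def normal_logpdf(y, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (y - mean) ** 2 / (2 * sd * sd)


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def emission_logpdf(y, y_prev, state, params):
    # Assumed AR(1) emission: the state-specific mean shifts with the
    # previous observation, which relaxes conditional independence.
    mean = params["intercept"][state] + params["phi"][state] * y_prev
    return normal_logpdf(y, mean, params["sd"][state])


def ar_hmm_loglik(y, init, trans, params):
    """Return log p(y[1:] | y[0]) via the forward algorithm in log space.

    init[s] is the probability of state s at the first modeled position,
    trans[r][s] is the probability of moving from state r to state s.
    """
    if len(y) < 2:
        raise ValueError("need at least two observations")
    k = len(init)
    alpha = [
        math.log(init[s]) + emission_logpdf(y[1], y[0], s, params)
        for s in range(k)
    ]
    for t in range(2, len(y)):
        alpha = [
            logsumexp([alpha[r] + math.log(trans[r][s]) for r in range(k)])
            + emission_logpdf(y[t], y[t - 1], s, params)
            for s in range(k)
        ]
    return logsumexp(alpha)
```

With two states, one can read state 0 as "background" and state 1 as "enriched"; setting every `phi` to zero reduces the sketch to an ordinary HMM with independent emissions given the state.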
In the genomic era, the identification of gene signatures associated with disease is of significant interest. Such signatures are often used to predict clinical outcomes in new patients and aid clinical decision-making. However, recent studies have shown that gene signatures are often not replicable. This occurrence has practical implications regarding the generalizability and clinical applicability of such signatures. To improve replicability, we introduce a novel approach that selects, from multiple datasets, gene signatures whose effects are consistently nonzero while accounting for between-study heterogeneity. We build our model upon rank-based quantities, facilitating integration over different genomic datasets. A high-dimensional penalized generalized linear mixed model is used to select gene signatures and address data heterogeneity. We compare our method to commonly used strategies that select gene signatures while ignoring between-study heterogeneity. We provide asymptotic results justifying the performance of our method and demonstrate its advantage in the presence of heterogeneity through thorough simulation studies. Lastly, we motivate our method through a case study subtyping pancreatic cancer patients from four gene expression studies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Predictive and prognostic gene signatures derived from interconnectivity among genes can tailor clinical care to patients in cancer treatment. We identified gene interconnectivity as a transcriptomic causal network by integrating germline genotyping and tumor RNA-seq data from 1,165 patients with metastatic colorectal cancer (CRC). The patients were enrolled in a clinical trial with randomized treatment, either cetuximab or bevacizumab in combination with chemotherapy. We linked the network to overall survival (OS) and detected novel biomarkers by controlling for confounding genes. Our data-driven approach discerned sets of genes, each of which collectively stratifies patients based on OS. Two signatures under the cetuximab treatment were related to wound healing and macrophages. The signature under the bevacizumab treatment was related to cytotoxicity, and we replicated its effect on OS using an external cohort. We also showed that the genes influencing OS within the signatures are downregulated in CRC tumor vs. normal tissue using another external cohort. Furthermore, the corresponding proteins encoded by the genes within the signatures interact with each other and are functionally related. In conclusion, this study identified a group of genes that collectively stratified patients based on OS and uncovered promising novel prognostic biomarkers for personalized treatment of CRC using transcriptomic causal networks.
Screening of an inhibitor library targeting kinases and epigenetic regulators identified several molecules having antiproliferative synergy with bromodomain and extraterminal domain (BET) bromodomain (BD) inhibitors (JQ1, OTX015) in triple-negative breast cancer (TNBC). GSK2801, an inhibitor of the BDs of BAZ2A/B (of the imitation switch chromatin remodeling complexes) and of BRD9 (of the SWI/SNF complex), demonstrated synergy independent of BRD4 control of P-TEFb-mediated pause-release of RNA polymerase II. GSK2801 or RNAi knockdown of BAZ2A/B with JQ1 selectively displaced BRD2 at promoters/enhancers of ETS-regulated genes. Additional displacement of BRD2 from rDNA in the nucleolus coincided with decreased 45S rRNA, revealing a function of BRD2 in regulating RNA polymerase I transcription. In 2D cultures, enhanced displacement of BRD2 from chromatin by combination drug treatment induced senescence. In spheroid cultures, combination treatment induced cleaved caspase-3 and cleaved PARP characteristic of apoptosis in tumor cells. Thus, GSK2801 blocks BRD2-driven transcription in combination with a BET inhibitor and induces apoptosis of TNBC. IMPLICATIONS: Synergistic inhibition of BDs encoded in BAZ2A/B, BRD9, and BET proteins induces apoptosis of TNBC by a combinatorial suppression of ribosomal DNA transcription and ETS-regulated genes.
Deep Learning (DL) methods have dramatically increased in popularity in recent years. While their initial success was demonstrated in the classification and manipulation of image data, there has been significant growth in the application of DL methods to problems in the biomedical sciences. However, the greater prevalence and complexity of missing data in biomedical datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of Variational Autoencoders (VAEs), a popular unsupervised DL architecture commonly utilized for dimension reduction, imputation, and learning latent representations of complex data. We propose a new VAE architecture, NIMIWAE, that is one of the first to flexibly account for both ignorable and non-ignorable patterns of missingness in input features at training time. Following training, samples drawn from the approximate posterior distribution of the missing data can be used for multiple imputation, facilitating downstream analyses on high-dimensional incomplete datasets. We demonstrate through statistical simulation that our method outperforms existing approaches in unsupervised learning tasks and in imputation accuracy. We conclude with a case study of an electronic health record (EHR) dataset pertaining to 12,000 ICU patients containing a large number of diagnostic measurements and clinical outcomes, where many features are only partially observed.
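One common way to carry out a downstream analysis on multiply imputed data is to fit the same model to each completed dataset and combine the results with Rubin's rules. The sketch below shows that pooling step only. It is a standard technique offered here as an illustration, not something the abstract states NIMIWAE uses, and the function name `pool_rubin` and its inputs are hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool one scalar parameter across m imputed datasets (Rubin's rules).

    estimates[i] is the point estimate from completed dataset i and
    variances[i] is its estimated sampling variance.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching lists with at least two imputations")
    pooled = statistics.fmean(estimates)
    # Within-imputation variance: average of the per-dataset variances.
    within = statistics.fmean(variances)
    # Between-imputation variance: sample variance of the point estimates.
    between = statistics.variance(estimates)
    # Total variance inflates the between part by (1 + 1/m).
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total
```

The between-imputation term is what carries the uncertainty due to missing data: if every imputed dataset gave the same estimate, the total variance would reduce to the within-imputation variance.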