Deep Learning (DL) methods have dramatically increased in popularity in recent years. While their initial successes were demonstrated in the classification and manipulation of image data, there has been significant growth in the application of DL methods to problems in the biomedical sciences. However, the greater prevalence and complexity of missing data in biomedical datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of Variational Autoencoders (VAEs), a popular unsupervised DL architecture commonly utilized for dimension reduction, imputation, and learning latent representations of complex data. We propose a new VAE architecture, NIMIWAE, that is one of the first to flexibly account for both ignorable and non-ignorable patterns of missingness in input features at training time. Following training, samples drawn from the approximate posterior distribution of the missing data can be used for multiple imputation, facilitating downstream analyses on high-dimensional incomplete datasets. We demonstrate through statistical simulation that our method outperforms existing approaches in unsupervised learning tasks and imputation accuracy. We conclude with a case study of an EHR dataset pertaining to 12,000 ICU patients containing a large number of diagnostic measurements and clinical outcomes, where many features are only partially observed.
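As a concrete illustration of the multiple-imputation step described above, the following sketch shows how several draws from a VAE's approximate posterior can be decoded into completed datasets. This is a minimal PyTorch toy, not the NIMIWAE implementation: the layer sizes, the zero-filling of missing inputs, and the assumption of an already-trained model are simplifications for illustration only.

```python
# Minimal sketch (not the NIMIWAE implementation) of multiple imputation with a
# VAE-style model: missing entries are zero-filled at the input, and several
# draws from the approximate posterior over the latent code are decoded to
# produce multiple completed datasets. Training is omitted; a fitted model is assumed.
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    def __init__(self, p, d=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(p, 32), nn.ReLU(), nn.Linear(32, 2 * d))
        self.dec = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, p))

    def posterior(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, torch.exp(0.5 * logvar))

def multiple_impute(model, x, mask, m=5):
    """Return m completed copies of x; mask is 1 where observed, 0 where missing."""
    x_in = x * mask                      # zero-fill unobserved features
    q_z = model.posterior(x_in)
    imputations = []
    for _ in range(m):
        z = q_z.rsample()                # one posterior draw per imputation
        x_hat = model.dec(z)
        imputations.append(mask * x + (1 - mask) * x_hat)
    return torch.stack(imputations)

# Usage: 100 samples, 10 features, roughly 30% of entries missing.
x = torch.randn(100, 10)
mask = (torch.rand(100, 10) > 0.3).float()
model = ToyVAE(p=10)
completed = multiple_impute(model, x, mask, m=5)   # shape (5, 100, 10)
```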
Deep Learning (DL) methods have dramatically increased in popularity in recent years, with significant growth in their application to supervised learning problems. However, the greater prevalence and complexity of missing data in such datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of deeply learned generalized linear models, a supervised DL architecture for regression and classification problems. We propose a new architecture, dlglm, that is one of the first to flexibly account for both ignorable and non-ignorable patterns of missingness in the input features and response at training time. We demonstrate through statistical simulation that our method outperforms existing approaches for supervised learning tasks in the presence of missing not at random (MNAR) missingness. We conclude with a case study of the Bank Marketing dataset from the UCI Machine Learning Repository, in which we predict whether clients subscribed to a product based on phone survey data. Supplementary materials for this article are available online.
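To make the idea of a mask-aware supervised architecture concrete, the sketch below shows a small PyTorch classifier that receives both the zero-filled features and the missingness mask, with a logistic link for a binary response as in the Bank Marketing case study. This is not the dlglm architecture; the mask-concatenation strategy, layer sizes, and loss handling are assumptions made only for illustration.

```python
# Illustrative sketch only (not dlglm): a neural-network "deeply learned GLM"
# for binary classification that is given the zero-filled features together
# with the missingness mask, so the network can react to which inputs are
# unobserved. All sizes are arbitrary toy choices.
import torch
import torch.nn as nn

class MaskAwareClassifier(nn.Module):
    def __init__(self, p, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * p, hidden), nn.ReLU(),   # features + mask as input
            nn.Linear(hidden, 1),                  # linear predictor, GLM-style
        )

    def forward(self, x, mask):
        eta = self.net(torch.cat([x * mask, mask], dim=-1))
        return torch.sigmoid(eta)                  # logistic link for a binary response

# Usage: one gradient step with binary cross-entropy on fully observed responses.
x = torch.randn(64, 20)
mask = (torch.rand(64, 20) > 0.2).float()
y = torch.randint(0, 2, (64, 1)).float()
model = MaskAwareClassifier(p=20)
loss = nn.functional.binary_cross_entropy(model(x, mask), y)
loss.backward()
```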
Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application of this in biomedical research is in delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori which genes may be informative in discriminating between clusters and what the optimal number of clusters is. Few methods exist for performing unsupervised clustering of RNA-seq samples, and none currently adjust for between-sample global normalization factors, select cluster-discriminatory genes, or account for potential confounding variables during clustering. To address these issues, we propose Feature Selection and Clustering of RNA-seq (FSCseq): a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and the quadratic penalty method with a smoothly clipped absolute deviation (SCAD) penalty. Maximization is performed via a penalized Classification EM algorithm, allowing us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership, even in the presence of batch effects. Based on simulations and real data analysis, we show the advantages of our method relative to competing approaches.
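The sketch below illustrates the Classification EM idea (alternating hard cluster assignment with parameter updates) in a deliberately simplified Gaussian form. It is not FSCseq: it omits the FMR likelihood for counts, the SCAD feature-selection penalty, normalization factors, and confounder adjustment, but it shows how posterior probabilities of cluster membership arise and can be reused for subtype prediction.

```python
# Bare-bones Classification EM (CEM) toy for model-based clustering of
# log-scale expression values. Gaussian analogue only; the FSCseq-specific
# components (penalized FMR, normalization factors, confounders) are omitted.
import numpy as np
from scipy.stats import norm

def cem_cluster(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    labels = rng.integers(K, size=n)                        # random initial partition
    for _ in range(n_iter):
        for k in range(K):
            if not np.any(labels == k):
                labels[rng.integers(n)] = k                 # keep every cluster non-empty
        mus = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
        sds = np.stack([X[labels == k].std(axis=0) + 1e-6 for k in range(K)])
        pis = np.array([(labels == k).mean() for k in range(K)])
        # E-step: unnormalized log posterior probability of each cluster per sample
        logpost = np.stack([
            np.log(pis[k]) + norm.logpdf(X, mus[k], sds[k]).sum(axis=1)
            for k in range(K)
        ], axis=1)
        labels = logpost.argmax(axis=1)                     # C-step: hard assignment
    return labels, logpost - np.logaddexp.reduce(logpost, axis=1, keepdims=True)

# Usage: 60 samples on 100 features, two simulated groups.
X = np.vstack([np.random.normal(0, 1, (30, 100)), np.random.normal(1, 1, (30, 100))])
labels, log_posteriors = cem_cluster(X, K=2)
```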
Many patients discontinue endocrine therapy for breast cancer due to intolerance. Identification of patients at risk for discontinuation is challenging. The minimal important difference (MID) is the smallest change in a score on a patient-reported outcome (PRO) that is clinically significant. We evaluated the association between treatment-emergent symptoms, detected as worsening PRO scores in units equal to the MID, and discontinuation. We enrolled females with stage 0–III breast cancer initiating endocrine therapy in a prospective cohort. Participants completed PROs at baseline and at 3, 6, 12, 24, 36, 48, and 60 months. Measures included PROMIS pain interference, fatigue, depression, anxiety, physical function, and sleep disturbance; the Endocrine Subscale of the FACT-ES; and the MOS Sexual Problems scale (MOS-SP). We evaluated associations between continuous PRO scores, in units corresponding to MIDs (PROMIS: 4 points; FACT-ES: 5 points; MOS-SP: 8 points), and time to endocrine therapy discontinuation using Cox proportional hazards models. Among 321 participants, 140 (43.6%) initiated tamoxifen and 181 (56.4%) initiated an aromatase inhibitor (AI). The cumulative probability of discontinuation was 23% (95% CI 18–27%) at 48 months. For every 5-point worsening in endocrine symptoms and every 4-point worsening in sleep disturbance, participants were 13% and 14% more likely, respectively, to discontinue endocrine therapy (endocrine symptoms HR 1.13, 95% CI 1.02–1.25, p = 0.02; sleep disturbance HR 1.14, 95% CI 1.01–1.29, p = 0.03). AI treatment was associated with a greater likelihood of discontinuation than tamoxifen. Treatment-emergent endocrine symptoms and sleep disturbance are associated with endocrine therapy discontinuation. Monitoring for worsening scores meeting or exceeding the MID on PROs may identify patients at risk for discontinuation.
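For readers interested in the modeling idea, the following sketch (hypothetical column names and toy data, not the study's analysis code) shows how PRO scores can be rescaled by their MIDs so that a Cox proportional hazards model, fit here with the lifelines package, returns hazard ratios interpretable per one-MID worsening.

```python
# Hedged sketch of the MID-scaling idea: divide each PRO score by its MID so
# the Cox hazard ratio corresponds to a one-MID change, then fit a
# proportional hazards model for time to discontinuation. All column names
# and data below are hypothetical; the real analysis used repeated PRO measures.
import pandas as pd
from lifelines import CoxPHFitter

MID = {"promis_sleep": 4.0, "fact_es_endocrine": 5.0, "mos_sp": 8.0}

df = pd.DataFrame({
    "months_to_event":   [12, 24, 48, 6, 36, 60, 18, 30],
    "discontinued":      [1,  0,  1,  1, 0,  0,  1,  0],
    "fact_es_endocrine": [10, 11, 3,  12, 6, 1,  8,  4],   # worsening from baseline
    "promis_sleep":      [5,  1,  7,  9,  2, 0,  3,  6],
})
for col, mid in MID.items():
    if col in df:
        df[col] = df[col] / mid                   # hazard ratio per MID-unit change

cph = CoxPHFitter()
cph.fit(df, duration_col="months_to_event", event_col="discontinued")
print(cph.summary[["exp(coef)", "p"]])            # per-MID hazard ratios
```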
Electronic Health Records (EHRs) are commonly used to investigate relationships between patient health information and outcomes. Deep learning methods are emerging as powerful tools to learn such relationships, given the characteristically high dimension and large sample size of EHR datasets. The Physionet 2012 Challenge involves an EHR dataset pertaining to 12,000 ICU patients, in which researchers investigated the relationships between clinical measurements and in-hospital mortality. However, the prevalence and complexity of missing data in the Physionet data present significant challenges for the application of deep learning methods, such as Variational Autoencoders (VAEs). Although a rich literature exists regarding the treatment of missing data in traditional statistical models, it is unclear how this extends to deep learning architectures. To address these issues, we propose a novel extension of the Importance-Weighted Autoencoder (IWAE), a variant of the VAE, to flexibly handle Missing Not At Random (MNAR) patterns in the Physionet data. Our proposed method models the missingness mechanism using an embedded neural network, eliminating the need to specify the exact form of the missingness mechanism a priori. We show that the use of our method leads to more realistic imputed values relative to the state-of-the-art, as well as significant differences in fitted downstream models for mortality.
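The sketch below conveys the core idea in simplified form: an importance-weighted bound that adds a learned term for the missingness mechanism, parameterized by a small neural network, so the MNAR mechanism need not be written down in closed form. It is a conceptual PyTorch toy rather than the proposed architecture; the single-Gaussian decoder, network sizes, and zero-filled encoder input are simplifying assumptions.

```python
# Conceptual sketch (not the proposed method) of an importance-weighted bound
# with a learned missingness model p(R | x): a small network outputs logits for
# each feature's observation indicator, so no closed-form MNAR mechanism is needed.
import math
import torch
import torch.nn as nn
from torch.distributions import Normal, Bernoulli

p, d, K = 10, 4, 20                                   # features, latent dim, importance samples
encoder  = nn.Sequential(nn.Linear(p, 32), nn.ReLU(), nn.Linear(32, 2 * d))
decoder  = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, p))
miss_net = nn.Sequential(nn.Linear(p, 16), nn.ReLU(), nn.Linear(16, p))  # logits of p(R | x)

def iwae_mnar_bound(x, mask):
    x_in = x * mask                                   # zero-fill missing entries for the encoder
    mu, logvar = encoder(x_in).chunk(2, dim=-1)
    q_z = Normal(mu, torch.exp(0.5 * logvar))
    z = q_z.rsample((K,))                             # (K, n, d) importance samples
    x_hat = decoder(z)                                # decoder mean; unit variance assumed
    x_fill = mask * x + (1 - mask) * x_hat            # completed data fed to the missingness model
    log_px = (Normal(x_hat, 1.0).log_prob(x) * mask).sum(-1)            # observed features only
    log_pr = Bernoulli(logits=miss_net(x_fill)).log_prob(mask).sum(-1)  # missingness mechanism term
    log_pz = Normal(0.0, 1.0).log_prob(z).sum(-1)
    log_qz = q_z.log_prob(z).sum(-1)
    log_w = log_px + log_pr + log_pz - log_qz         # (K, n) log importance weights
    return (torch.logsumexp(log_w, dim=0) - math.log(K)).mean()

# Usage: one gradient step on a toy batch with ~30% missingness.
x = torch.randn(32, p)
mask = (torch.rand(32, p) > 0.3).float()
loss = -iwae_mnar_bound(x, mask)                      # maximize the bound
loss.backward()
```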