The Effective Sample Size of EHR-Derived Cohorts Under Biased Sampling

Rebecca A. Hubbard, Carolyn Lou, Blanca E. Himes

2021

Electronic Health Records (EHRs) have become a popular data source for conducting observational studies of health outcomes. One advantage of using EHR-derived data for biomedical and epidemiologic research is the ability to efficiently construct large cohorts, providing access to “big data” in healthcare. For example, the U.S. Food and Drug Administration’s Sentinel System, which is composed of EHR and administrative claims data, includes over 100 million people, approximately one-third of the U.S. population. Although EHR-derived cohorts can be very large, EHR data arise through a complex, non-random sampling process that can induce bias when such data are used to obtain parameter estimates meant to be representative of an underlying population. In the U.S., where most health insurance is employment-based, insured populations are often not representative of uninsured populations; insurance status, health literacy, and healthcare-seeking behavior are therefore all associated with representation in EHRs. As a result, parameter estimates derived from EHR-based studies can differ substantially from the underlying population parameters. Here, we derive formulas for the mean-squared error of estimates from an EHR-derived sample as a function of the strength of association between a health outcome of interest, the sampling process, and an underlying unobserved covariate. We also provide a formula for the effective sample size of an EHR-derived cohort, defined as the size of a simple random sample whose mean-squared error equals that of an EHR-derived sample arising from a biased sampling mechanism. The effective sample size allows for assessment of the advantage of using an EHR-derived sample as opposed to conducting a more traditional, designed observational study, taking into account both the number of patients and the biased sampling mechanism.
Through simulation studies, we demonstrate the magnitude of bias induced in EHR-based parameter estimates under varying sample selection mechanisms, and we demonstrate how the effective sample size can be used to compute confidence intervals that account for the biased sampling scheme. We conclude that attention to biased sampling is necessary to avoid erroneous inference due to the large sample size and complex, non-random provenance of EHR-derived data, when the goal of a study is to use EHR-derived data to capture parameter estimates that are representative of an underlying population.
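
As an illustration only, and not the authors' derivation, the definition of effective sample size above can be sketched for the simplest case: estimating a population mean. A simple random sample of size m has mean-squared error (population variance) / m, so setting this equal to the mean-squared error of the biased sample mean (squared bias plus sampling variance) and solving for m gives an effective sample size. The function names, the selection probabilities, and the outcome model below are all hypothetical choices made for this sketch; the abstract does not specify them.

```python
import random
import statistics


def effective_sample_size(pop_var, bias, sample_var, n):
    """Size of a simple random sample whose MSE (pop_var / n_eff) equals
    the MSE of a biased sample mean (bias**2 + sample_var / n)."""
    mse_biased = bias ** 2 + sample_var / n
    return pop_var / mse_biased


def simulate(n_pop=50_000, seed=0):
    """Toy population in which an unobserved covariate u drives both the
    outcome y and the chance of appearing in the (hypothetical) EHR sample."""
    rng = random.Random(seed)
    outcomes, selected = [], []
    for _ in range(n_pop):
        u = rng.gauss(0, 1)                   # unobserved covariate
        y = u + rng.gauss(0, 1)               # outcome depends on u
        p_select = 0.5 if u > 0 else 0.2      # assumed selection probabilities
        outcomes.append(y)
        if rng.random() < p_select:
            selected.append(y)

    pop_var = statistics.pvariance(outcomes)
    # Realized bias of the selected-sample mean, used here as a stand-in
    # for the expected bias of the sampling mechanism.
    bias = statistics.fmean(selected) - statistics.fmean(outcomes)
    sample_var = statistics.pvariance(selected)
    n = len(selected)
    return n, effective_sample_size(pop_var, bias, sample_var, n)


if __name__ == "__main__":
    n, n_eff = simulate()
    print(f"selected sample size: {n}, effective sample size: {n_eff:.1f}")
```

With no bias and equal variances the function returns n itself. Once the bias is non-zero, the squared-bias term does not shrink as n grows, so under these assumptions the effective sample size is bounded by (population variance) / bias² however large the cohort becomes.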
