When no answer is better than a wrong answer: a causal perspective on batch effects
Eric Bridgeford, Michael Powell, Gregory Kiar, Stephanie Noble, Jaewon Chung, Sambit Panda, Ross Lawrence, Ting Xu, Michael P. Milham, Brian Caffo, Joshua T. Vogelstein
Citations: 5 | References: 92 | Related Papers: 10
Abstract:
Batch effects, undesirable sources of variability across multiple experiments, present significant challenges for scientific and clinical discoveries. Batch effects can (i) produce spurious signals and/or (ii) obscure genuine signals, contributing to the ongoing reproducibility crisis. Because batch effects are typically modeled as classical statistical effects, existing approaches often cannot differentiate sources of variability due to confounding biases, which may lead them to erroneously conclude that batch effects are (or are not) present. We formalize batch effects as causal effects and introduce algorithms leveraging causal machinery to address these concerns. Simulations illustrate that when non-causal methods provide the wrong answer, our methods either produce more accurate answers or "no answer", meaning they assert that the data are inadequate to confidently determine whether a batch effect is present. Applying our causal methods to 27 neuroimaging datasets yields qualitatively similar results: in situations where it is unclear whether batch effects are present, non-causal methods confidently identify (or fail to identify) batch effects, whereas our causal methods assert that it is unclear whether there are batch effects or not. In instances where batch effects should be discernible, our techniques produce different results from prior art, each of which produces results more qualitatively similar to not applying any batch-effect correction to the data at all. This work therefore provides a causal framework for understanding the potential capabilities and limitations of multi-site data analysis.
Keywords:
Spurious relationship
Overconfidence effect
Causal model
Causal analysis
Causality
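The core idea in the abstract above, returning "no answer" when the data cannot support a conclusion, can be illustrated with a toy Python sketch. This is not the authors' algorithm; it simply declines to test for a batch effect when the two sites' covariate distributions barely overlap, in which case any apparent site difference could be confounded. All function names and thresholds below are hypothetical.

```python
# Toy illustration (not the authors' method): refuse to answer when the two
# batches' covariate distributions barely overlap, since any apparent "batch
# effect" could then be confounded by the covariate (e.g., age).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def batch_effect_or_no_answer(y1, x1, y2, x2, min_overlap=0.2, alpha=0.05):
    """Return a verdict on a batch effect, or 'no answer' if data are inadequate.

    y1, y2: outcomes measured at site 1 and site 2
    x1, x2: a covariate (e.g., age) at each site
    min_overlap: minimal fraction of each site's samples that must fall in the
                 shared covariate range before we are willing to answer.
    """
    lo, hi = max(x1.min(), x2.min()), min(x1.max(), x2.max())
    overlap1 = np.mean((x1 >= lo) & (x1 <= hi))
    overlap2 = np.mean((x2 >= lo) & (x2 <= hi))
    if min(overlap1, overlap2) < min_overlap:
        return "no answer"  # covariate distributions too dissimilar to deconfound
    # Restrict to the shared covariate range and compare outcomes there.
    y1_o, y2_o = y1[(x1 >= lo) & (x1 <= hi)], y2[(x2 >= lo) & (x2 <= hi)]
    p = stats.mannwhitneyu(y1_o, y2_o).pvalue
    return "batch effect" if p < alpha else "no evidence of a batch effect"

# Site 1 scans younger participants, site 2 older ones; outcomes vary with age.
x1, x2 = rng.uniform(20, 40, 200), rng.uniform(55, 75, 200)
y1, y2 = 0.05 * x1 + rng.normal(size=200), 0.05 * x2 + rng.normal(size=200)
print(batch_effect_or_no_answer(y1, x1, y2, x2))  # -> "no answer"
```

Here a naive two-sample comparison of y1 and y2 would report a large "batch effect", even though the sites differ only in the ages they sampled; the sketch instead reports that the question cannot be answered from these data.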
Many causal models of interest in epidemiology involve longitudinal exposures, confounders, and mediators. In practice, however, repeated measurements are not always available, and practitioners then tend to overlook the time-varying nature of exposures and work under over-simplified causal models. Our objective here was to assess whether, and how, the causal effect identified under such misspecified causal models relates to true causal effects of interest. We focus on two situations regarding the type of exposure data available: (i) "instantaneous" levels measured at inclusion in the study, or (ii) summary measures of exposure levels up to inclusion in the study. In each situation, we derive sufficient conditions ensuring that the quantities estimated in practice under over-simplified causal models can be expressed as true longitudinal causal effects of interest, or as weighted averages thereof. Unsurprisingly, these sufficient conditions are very restrictive, and our results show that inference based on either "instantaneous" levels or summary measures usually returns quantities that do not directly relate to any causal effect of interest and should be interpreted with caution. These findings underscore the need for repeated measurements and/or sensitivity analyses when such data are not available.
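In symbols (the notation here is ours, not the authors'), the claim that the quantity identified under an over-simplified model is a weighted average of longitudinal effects can be written as:

```latex
\theta_{\text{simple}} \;=\; \sum_{t} w_t \, \tau_t,
\qquad w_t \ge 0, \quad \sum_{t} w_t = 1,
```

where each \tau_t denotes a true longitudinal causal effect (for example, the effect of exposure at time t). When the sufficient conditions fail, \theta_{\text{simple}} need not equal any such weighted average, and may not correspond to any causal effect of interest.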
Causal model
Causality
Longitudinal data
Citations (1)
We introduce DoWhy-GCM, an extension of the DoWhy Python library that leverages graphical causal models. Unlike existing causality libraries, which mainly focus on effect estimation, DoWhy-GCM lets users ask a wide range of additional causal questions, such as identifying the root causes of outliers and distributional changes, learning causal structure, attributing causal influences, and diagnosing causal structures. To this end, DoWhy-GCM users first model cause-effect relations between the variables in a system under study through a graphical causal model, then fit the causal mechanisms of the variables, and finally ask the causal question. All these steps take only a few lines of code in DoWhy-GCM. The library is available at https://github.com/py-why/dowhy.
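As a rough sketch of that three-step workflow (the toy data, graph, and question below are ours, assuming a recent DoWhy release; they are not taken from the paper):

```python
# Sketch of the DoWhy-GCM workflow described above: build a graphical causal
# model, fit causal mechanisms, then ask a causal question. The toy data and
# variable names (X, Y, Z) are illustrative only.
import numpy as np
import pandas as pd
import networkx as nx
from dowhy import gcm

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = 2 * X + rng.normal(size=1000)
Z = 3 * Y + rng.normal(size=1000)
data = pd.DataFrame({"X": X, "Y": Y, "Z": Z})

# Step 1: model cause-effect relations as a graphical causal model (X -> Y -> Z).
causal_model = gcm.StructuralCausalModel(nx.DiGraph([("X", "Y"), ("Y", "Z")]))

# Step 2: fit the causal mechanism of each variable.
gcm.auto.assign_causal_mechanisms(causal_model, data)
gcm.fit(causal_model, data)

# Step 3: ask a causal question, e.g., how strong each arrow into Z is.
print(gcm.arrow_strength(causal_model, "Z"))
```

Other questions mentioned in the abstract (root-cause attribution for outliers, distribution-change attribution) follow the same pattern: fit the graphical causal model once, then call the corresponding query function.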
Causal model
Graphical model
Causality
Causal structure
Python
Statistical graphics
Citations (14)
Some argue that scale is all that is needed to achieve AI, covering even causal models. We make it clear that large language models (LLMs) cannot be causal and explain why it may sometimes seem otherwise. To this end, we define and exemplify a new subgroup of structural causal models (SCMs) that we call meta SCMs, which encode causal facts about other SCMs within their variables. We conjecture that, in the cases where LLMs succeed at causal inference, an underlying meta SCM exposed correlations between causal facts in the natural-language data on which the LLM was ultimately trained. If our hypothesis holds true, this would imply that LLMs are like parrots in that they simply recite the causal knowledge embedded in their training data. Our empirical analysis provides supporting evidence that current LLMs are, at best, weak 'causal parrots.'
Causality
Causal model
Causal analysis
Causal reasoning
Causal structure
Citations (6)
Suboptimal diet is one of the most important controllable risk factors for non-communicable diseases. However, it is difficult to quantify the causal association between specific dietary factors and health outcomes through randomized controlled trials. In recent years, the rapid development of causal inference has provided robust theoretical and methodological tools for making full use of observational research data and producing high-quality nutritional epidemiologic evidence. The causal graph model visualizes a complex system of causal relationships by integrating a large amount of prior knowledge, and it provides a basic framework for identifying confounding and determining causal effect estimation strategies. Depending on the causal graph, different analysis strategies can be chosen, such as adjusting for confounders, using instrumental variables, or conducting mediation analysis. This paper introduces the idea of the causal graph model, the characteristics of the various analysis strategies, and their application in nutritional epidemiology research, aiming to promote the use of causal graph models in nutrition and to provide references and suggestions for future research.
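As a hypothetical numeric illustration of the confounder-adjustment strategy mentioned above (the variables, effect sizes, and code are invented for illustration and do not come from the paper), suppose the causal graph says that age affects both diet and blood pressure; the graph then dictates adjusting for age:

```python
# Toy illustration of graph-based confounder adjustment (hypothetical variables):
# age -> diet, age -> blood pressure, diet -> blood pressure.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
age = rng.normal(50, 10, n)                         # confounder
diet = 0.05 * age + rng.normal(size=n)              # exposure, partly driven by age
bp = -0.5 * diet + 0.3 * age + rng.normal(size=n)   # outcome; true diet effect = -0.5

def ols(y, X_cols):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y))] + list(X_cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols(bp, [diet])          # ignores the graph: confounded by age
adjusted = ols(bp, [diet, age])  # adjusts for age, as the causal graph requires
print(f"naive diet coefficient:    {naive[1]:+.2f}")     # biased (sign even flips)
print(f"adjusted diet coefficient: {adjusted[1]:+.2f}")  # close to the true -0.5
```

The same graph would instead call for an instrumental-variable or mediation analysis if the assumed causal structure were different, which is the point the abstract makes.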
Causal model
Causality
Causal structure
Instrumental variable
Randomized experiment
Causal analysis
Citations (0)
Causal inference is fundamental to empirical scientific discoveries in natural and social sciences; however, in the process of conducting causal inference, data management problems can lead to false discoveries. Two such problems are (i) not having all attributes required for analysis, and (ii) misidentifying which attributes are to be included in the analysis. Analysts often only have access to partial data, and they critically rely on (often unavailable or incomplete) domain knowledge to identify attributes to include for analysis, which is often given in the form of a causal DAG. We argue that data management techniques can surmount both of these challenges. In this work, we introduce the Causal Data Integration (CDI) problem, in which unobserved attributes are mined from external sources and a corresponding causal DAG is automatically built. We identify key challenges and research opportunities in designing a CDI system, and present a system architecture for solving the CDI problem. Our preliminary experimental results demonstrate that solving CDI is achievable and pave the way for future research.
Causal model
Causal analysis
Causal reasoning
Causality
Citations (5)
Many causal models of interest in epidemiology involve longitudinal exposures, confounders, and mediators. However, repeated measurements are not always available or used in practice, leading analysts to overlook the time-varying nature of exposures and work under over-simplified causal models. Our objective is to assess whether, and how, causal effects identified under such misspecified causal models relate to true causal effects of interest. We derive sufficient conditions ensuring that the quantities estimated in practice under over-simplified causal models can be expressed as weighted averages of longitudinal causal effects of interest. Unsurprisingly, these sufficient conditions are very restrictive, and our results state that the quantities estimated in practice should be interpreted with caution in general, as they usually do not relate to any longitudinal causal effect of interest. Our simulations further illustrate that the bias between the quantities estimated in practice and the weighted averages of longitudinal causal effects of interest can be substantial. Overall, our results confirm the need for repeated measurements to conduct proper analyses and/or for sensitivity analyses when such measurements are not available.
Causal model
Marginal structural model
Causality
Citations (1)
Based on a structural causal model, this study derived causal graphs showing the causal relationships among the factors predicting the teaching competency of lower secondary school teachers in South Korea, the UK (England), and Finland, and compared and analyzed the causal paths to teaching competency in each country. To this end, data from lower secondary school teachers and principals who participated in TALIS 2018 in Korea, the UK (England), and Finland were analyzed. First, the top 20 factors predicting teaching competency in each country were extracted by applying a mixed-effects random forest, accounting for the multilevel structure of the data. Then, causal graphs were derived by applying a causal discovery algorithm based on a structural causal model to the extracted predictors. The top 20 predictors extracted from each country's data included both common and distinguishing factors, and the causal paths to teaching competency were compared and analyzed in each country's context based on its causal graph. Finally, the potential of causal inference based on structural causal models in educational research was discussed, and the limitations and implications of this study were presented.
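A rough sketch of this two-stage pipeline on synthetic data is given below; a plain scikit-learn random forest stands in for the mixed-effects random forest, the PC algorithm from the causal-learn package stands in for the paper's causal discovery step, and all variable names and settings are illustrative assumptions rather than details from the study:

```python
# Sketch of the two-stage pipeline described above, on synthetic data:
# (1) rank predictors by importance with a random forest, (2) run a causal
# discovery algorithm on the top-ranked predictors plus the outcome.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from causallearn.search.ConstraintBased.PC import pc  # assumed import path; check causal-learn docs

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame(rng.normal(size=(n, 10)),
                  columns=[f"item_{i}" for i in range(10)])
# Toy outcome: "competency" depends on two of the survey items plus noise.
df["competency"] = 0.8 * df["item_0"] + 0.5 * df["item_3"] + rng.normal(size=n)

# Stage 1: rank predictors by random forest importance and keep the top k.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(df.drop(columns="competency"), df["competency"])
top = (pd.Series(rf.feature_importances_, index=df.columns[:-1])
         .nlargest(3).index.tolist())

# Stage 2: causal discovery (PC algorithm) on the selected variables + outcome.
selected = df[top + ["competency"]]
cg = pc(selected.to_numpy(), alpha=0.05)
print(top)
print(cg.G.graph)  # adjacency matrix of the estimated causal graph
```

In the study itself, the random forest additionally models the nesting of teachers within schools (the mixed-effects part) and the variables are TALIS 2018 survey factors; this sketch only conveys the shape of the pipeline.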
Causal model
Causal analysis
Causal reasoning
Citations (0)
Abstract We used a new method to assess how people can infer unobserved causal structure from patterns of observed events. Participants were taught to draw causal graphs, and then shown a pattern of associations and interventions on a novel causal system. Given minimal training and no feedback, participants in Experiment 1 used causal graph notation to spontaneously draw structures containing one observed cause, one unobserved common cause, and two unobserved independent causes, depending on the pattern of associations and interventions they saw. We replicated these findings with less‐informative training (Experiments 2 and 3) and a new apparatus (Experiment 3) to show that the pattern of data leads to hidden causal inferences across a range of prior constraints on causal knowledge.
Causal structure
Causal model
Causality
Causal reasoning
Causal analysis
Citations (47)
Although a number of investigators have attempted to identify empirically a process of political development, substantial controversy still surrounds the determination of the causal factors involved. It is my contention that this state of affairs results from inadequacies inherent in traditional techniques of causal modeling, which are aggravated when multicollinear variables are involved. To resolve this problem, I first review a technique capable of reducing the confounding effects of multicollinearity. I then illustrate the use of this technique, as well as a strategy for inferring causal relationships, by means of a reanalysis of published data used to construct models of political development. The strategy for causal inference utilized herein is derived from knowledge of the effects of model specification errors. On the basis of these findings, a new causal model of political development, which is both theoretically and empirically consistent, is presented.
Causal model
Causal analysis
Citations (17)