A visual analytics approach to high-dimensional logistic regression modeling and its application to an environmental health study
2016
In epidemiology, logistic regression modeling is widely used to explain the relationships between explanatory variables and dichotomous outcome variables. However, logistic regression modeling faces challenges such as overfitting, confounding, and multicollinearity when the number of explanatory variables is large. For example, in the birth defect study presented in this paper, it is difficult to select variables for high-quality models that identify risk factors from among hundreds of pollutant variables. To address this problem, we propose a novel visual analytics approach to logistic regression modeling for high-dimensional datasets. It augments the traditional modeling pipeline with (1) intuitive visualizations for inspecting statistical indicators and the relationships among the variables and (2) a seamless, effective dimension reduction pipeline for selecting variables for inclusion in high-quality logistic regression models. A fully working prototype of this approach has been developed and successfully applied to the birth defect study, which illustrates its effectiveness and efficiency. Its application in an insurance policy study and feedback from domain experts further demonstrate its usefulness.
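To make the multicollinearity problem concrete, the following is a minimal sketch of one common screening step, not the pipeline described in this paper: it computes variance inflation factors (VIFs) as the diagonal of the inverse correlation matrix and greedily drops the variable with the largest VIF until all remaining values fall below a threshold. The function names, the synthetic `pollutant_*` variables, and the threshold of 10 (a frequently used rule of thumb) are all assumptions made for illustration.

```python
import numpy as np


def vif(X):
    """Variance inflation factors, taken as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def screen_collinear(X, names, threshold=10.0):
    """Greedily drop the variable with the largest VIF until all VIFs are <= threshold.

    Hypothetical helper for illustration; the threshold is an assumed rule of thumb.
    """
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)  # remove the most collinear column
        del names[worst]
    return names


# Synthetic data (an assumption): pollutant_c is nearly a copy of pollutant_a.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.01 * rng.normal(size=500)
X = np.column_stack([a, b, c])

kept = screen_collinear(X, ["pollutant_a", "pollutant_b", "pollutant_c"])
print(kept)
```

A screen like this only addresses multicollinearity; overfitting and confounding still need separate checks before the surviving variables are passed to a logistic regression model.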