A Framework for Analyzing the Impact of Missing Data in Predictive Models

Fabiola Santore, Eduardo Cunha de Almeida, Wagner Hugo Bonat, Eduardo H. M. Pena, Luiz S. Oliveira

2020

We propose a stochastic framework for evaluating the impact of missing data on the performance of predictive models. The framework gives full control over key aspects of the data set structure: the number and type of input variables, the correlation between the input variables, their overall predictive power, and the sample size. Missingness is generated from a multivariate Bernoulli distribution, which allows us to simulate missing-data patterns corresponding to the MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random) mechanisms. Although the framework can be applied to virtually any type of predictive model, in this article we focus on logistic regression and use accuracy as the performance measure. The simulation results show that, as expected, the effects of missing data vanish for large sample sizes. In contrast, accuracy decreases as the number of input variables increases, mainly for binary inputs.
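The three missingness mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes two correlated Gaussian inputs, a logistic outcome model, and independent Bernoulli draws per row (a simplification of the multivariate Bernoulli distribution used in the paper). The function names `simulate_rows` and `missing_mask`, and the coefficients `base_logit` and `slope`, are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_rows(n, rng):
    """Draw n rows (x1, x2, y): two correlated inputs and a binary outcome.

    The coefficients below are illustrative assumptions, not values
    taken from the paper.
    """
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = 0.5 * x1 + rng.gauss(0.0, 1.0)  # x2 is correlated with x1
        p_y = sigmoid(x1 + x2)               # logistic outcome model
        y = 1 if rng.random() < p_y else 0
        rows.append((x1, x2, y))
    return rows


def missing_mask(rows, mechanism, rng, base_logit=-1.0, slope=2.0):
    """Return one boolean per row: True where x2 would be set to missing.

    MCAR: the probability of missingness is constant.
    MAR:  it depends on x1, which is always observed.
    MNAR: it depends on x2, the value that goes missing.
    """
    mask = []
    for x1, x2, _y in rows:
        if mechanism == "MCAR":
            p_miss = sigmoid(base_logit)
        elif mechanism == "MAR":
            p_miss = sigmoid(base_logit + slope * x1)
        elif mechanism == "MNAR":
            p_miss = sigmoid(base_logit + slope * x2)
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        mask.append(rng.random() < p_miss)  # one Bernoulli draw per row
    return mask
```

Under these assumptions, calling `missing_mask(rows, "MNAR", rng)` on rows from `simulate_rows` tends to mark large values of `x2` as missing, whereas `"MCAR"` removes values at a constant rate regardless of the data. A predictive model such as logistic regression could then be fitted to the masked data and its accuracy compared across mechanisms and sample sizes.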
