    Missing Data in a Long Food Frequency Questionnaire
    Abstract:
    Missing data are a common problem in nutritional epidemiology. Little is known of the characteristics of these missing data, which makes it difficult to conduct appropriate imputation. We telephoned, at random, 20% of subjects (n = 2091) from the Adventist Health Study-2 cohort who had any of 80 key variables missing from a dietary questionnaire. We were able to obtain responses for 92% of the missing variables. We found a consistent excess of "zero" intakes in the filled-in data that were initially missing. However, for frequently consumed foods, most missing data were not zero, and these were usually not distinguishable from a random sample of nonzero data. Older, black, and less-well-educated subjects had more missing data. Missing data are more likely to be true zeroes in older subjects and those with more missing data. Zero imputation for missing data may create little bias except for more frequently consumed foods, in which case zero imputation will be suboptimal if there is more than 5%-10% missing. Although some missing data represent true zeroes, much of it does not, and data are usually not missing at random. Automatic imputation of zeroes for missing data will usually be incorrect, although there is little bias unless the foods are frequently consumed. Certain identifiable subgroups have greater amounts of missing data, and require greater care in making imputations.
    Keywords:
    Imputation (statistics)
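The paper's finding suggests a frequency-dependent rule for filling in food frequency questionnaire (FFQ) data. A minimal sketch in Python (the heuristic, threshold, and function name are all illustrative, not the authors' procedure):

```python
import numpy as np
import pandas as pd

def impute_ffq(column, zero_threshold=0.5):
    """Impute missing FFQ frequencies: zero for rarely consumed foods
    (where missing values are likely true zeroes), otherwise the median
    of the observed nonzero intakes (a crude stand-in for model-based
    imputation of frequently consumed foods)."""
    observed = column.dropna()
    zero_share = (observed == 0).mean()
    if zero_share >= zero_threshold:
        fill = 0.0
    else:
        fill = observed[observed > 0].median()
    return column.fillna(fill)

# toy column: weekly intake frequency for a rarely consumed food
intakes = pd.Series([0, 0, np.nan, 0, 1, 0, np.nan, 0])
print(impute_ffq(intakes).tolist())  # missing entries filled with 0.0
```

Because most observed values here are zero, both missing entries are filled with zero; a mostly-nonzero column would instead receive the observed nonzero median.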
    Abstract This paper evaluates the error measures of missing-value imputations in medical research. Several imputation techniques have been designed and implemented; however, the evaluation of how far the imputed values deviate from the original values has not been given adequate attention. Predictive Mean Matching Imputation (PMMI) and K-Nearest Neighbour Imputation (KNNI) techniques were implemented to impute a fertility dataset. The implementation covered three missing-value mechanisms: Missing At Random (MAR), Missing Completely At Random (MCAR) and Missing Not At Random (MNAR). The results were evaluated by mean square error (MSE), root mean square error (RMSE) and mean absolute error (MAE). PMMI performed better than KNNI in all the results. MSE, for example, had a ratio of 0.0260/2.8555 (PMMI/KNNI) for 1-10% MAR – a 99.09% reduced error rate; 0.1108/3.0120 (PMMI/KNNI) for 30-40% MCAR – a 96.32% reduced error rate; and 0.0642/3.7187 (PMMI/KNNI) for 40-50% MNAR – a 98.27% reduced error rate. MCAR was the most consistent missingness mechanism across the evaluations. Density distributions of the imputed dataset were compared with the original dataset; the distribution plots of the imputed missing data followed the curve of the original dataset.
    Imputation (statistics)
    Mean square
    Mean absolute error
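PMM itself is not in scikit-learn, but the KNNI side of the comparison and the evaluation protocol (mask known values, impute, score only the masked entries) can be sketched on simulated MCAR data; all names below are illustrative, not the paper's code:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
complete = rng.normal(size=(200, 4))        # stand-in for the fertility data
mask = rng.random(complete.shape) < 0.1     # ~10% MCAR missingness
with_missing = complete.copy()
with_missing[mask] = np.nan

imputed = KNNImputer(n_neighbors=5).fit_transform(with_missing)

# score only the entries that were actually removed
err = imputed[mask] - complete[mask]
mse = float(np.mean(err ** 2))
rmse = float(np.sqrt(mse))
mae = float(np.mean(np.abs(err)))
print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f}")
```

The same masked-entry scoring works unchanged for any other imputer, which is what makes head-to-head comparisons like PMMI vs. KNNI possible.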
    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
    Imputation (statistics)
    Caliber
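scikit-learn's IterativeImputer is a MICE-style chained-equations imputer that accepts an arbitrary regressor, so the linear-vs-random-forest contrast can be sketched on simulated data with a nonlinear dependence (not the CALIBER cohort):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 ** 2 + rng.normal(scale=0.1, size=500)   # x2 depends nonlinearly on x1
data = np.column_stack([x1, x2])

miss = data.copy()
holes = rng.random(500) < 0.3                    # ~30% of x2 set missing
miss[holes, 1] = np.nan

# default chained equations (linear BayesianRidge) vs a random-forest imputer
linear = IterativeImputer(random_state=0).fit_transform(miss)
forest = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    random_state=0,
).fit_transform(miss)

def rmse(imp):
    return float(np.sqrt(np.mean((imp[holes, 1] - data[holes, 1]) ** 2)))

print(rmse(linear), rmse(forest))  # the forest should track the quadratic far better
```

With a purely quadratic relationship the best linear predictor is nearly a constant, so the linear imputer's error stays large while the forest recovers the curve; this is the same mechanism the simulation study exploits.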
    Citations (567)
    Background: Monitoring of environmental contaminants is a critical part of exposure science and epidemiological research. Missing data are often encountered when performing short-term monitoring (<24 hr) of air pollutants with real-time monitors, especially in resource-limited areas. Approaches for handling consecutive periods of missing and incomplete data in this context remain unclear. Our aim is to evaluate existing imputation methods for handling missing data for real-time monitors operating for short durations. Methods: In a current field study, real-time particulate monitors were placed outside of 20 households for 24 hours. Missing data were simulated at four consecutive periods of missingness (20%, 40%, 60%, 80%). Univariate (Mean, Median, Last Observation Carried Forward, Kalman Filter, Random, Markov) and multivariate time-series (Predictive Mean Matching, Row Mean Method) methods were used to impute missing concentrations, and performance was evaluated using five error metrics (Absolute Bias, Percent Absolute Error in Means, R2 Coefficient of Determination, Root Mean Square Error, Mean Absolute Error). Results: Univariate methods of Markov, random, and mean imputation performed best, yielding 24-hour mean concentrations with low error and high R2 values across all levels of missingness. When evaluating error metrics minute-by-minute, Kalman filter, median, and Markov methods performed well at low levels of missingness (20-40%). However, at higher levels of missingness (60-80%), Markov, random, median, and mean imputation performed best on average. Multivariate imputation methods performed worst across all levels of missingness. Conclusion: Epidemiological studies often report pollutant concentrations in relation to their potential health effects by averaging minute or hourly concentrations over 24 hours. However, when more than 25% of the data are missing, daily average pollutant concentrations cannot be reliably computed. Univariate imputation may provide a reasonable solution for addressing missing data in short-term monitoring of air pollutants. Further efforts are needed to evaluate imputation methods that are generalizable across a diverse range of study environments.
    Imputation (statistics)
    Univariate
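Two of the univariate methods above (last observation carried forward and mean imputation) reduce to one-liners in pandas; a toy sketch with a consecutive gap of the kind the study simulates:

```python
import numpy as np
import pandas as pd

# one-minute particulate readings with a consecutive gap of missing values
pm = pd.Series([12.0, 15.0, np.nan, np.nan, np.nan, 14.0, 13.0])

locf = pm.ffill()                  # Last Observation Carried Forward
mean_fill = pm.fillna(pm.mean())   # mean of the observed readings (13.5)

print(locf.tolist())       # [12.0, 15.0, 15.0, 15.0, 15.0, 14.0, 13.0]
print(mean_fill.tolist())  # gap filled with 13.5
```

Note how LOCF flattens the whole gap at the last observed value, which is exactly why its relative performance degrades as the consecutive missing period grows.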
    Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. The purpose of this work was, first, to develop the Weighted Regime Switching Mean and Regression Imputation (WRSMRI) for missing data estimation and, secondly, to compare its estimation efficiency and the statistical power of a test under Missing Completely At Random (MCAR) and simple random sampling with other methods, namely: Mean Imputation (MI), Regression Imputation (RI), Regime Switching Mean Imputation (RSMI), Regime Switching Regression Imputation (RSRI) and Average of Regime Switching Mean and Regression Imputation (ARSMRI). Using simulated data, the comparisons were made under the following conditions: (i) three sample sizes (100, 200 and 500), (ii) three levels of correlation between variables (low, moderate and high) and (iii) four levels of percentage of missing data (5, 10, 15 and 20%). The best imputation under MSE and the estimated sample correlation was obtained using the WRSMRI method; under MAE, MAPE, and the power of the test, the best estimated sample mean and variance were obtained using RSRI.
    Imputation (statistics)
    When study variances are not reported, or "missing", it is common practice in meta-analysis to assume that the missing variances are missing completely at random (MCAR). In practice, however, it is possible that the variances are missing not at random (MNAR). In this paper, we examine, analytically, the biases introduced in meta-analysis estimates when the missing study variances arise from a non-random missing mechanism (MNAR), namely, when the magnitudes of the missing variances are mostly larger than those that are reported. In meta-analysis, this is more likely to occur in studies which carry larger variances. We looked at two common approaches to handling this problem: the missing variances are imputed using mean imputation, or the studies with missing study variances are omitted from the analysis. The results suggest that, for the estimate of the variance of the effect size, if the magnitudes of the study variances that are missing are mostly larger than those that are reported, the variance of the effect size will be underestimated. Thus, under MNAR, mean imputation gives a false impression of precision, as the estimated variance of the overall effect is too small. On the other hand, if the missing variances are mostly smaller, the variance will be overestimated.
    Imputation (statistics)
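The false-precision effect under MNAR can be illustrated with hypothetical numbers: mean-imputing two unreported (and actually large) study variances shrinks the fixed-effect pooled variance below what it would be with the true values.

```python
import numpy as np

# hypothetical meta-analysis: five studies, two of which do not report a variance
variances = np.array([0.04, 0.05, np.nan, np.nan, 0.03])
true_missing = np.array([0.20, 0.25])   # suppose the unreported ones are large (MNAR)

def pooled_variance(v):
    # fixed-effect inverse-variance pooling: Var(pooled) = 1 / sum(1 / v_i)
    return 1.0 / np.sum(1.0 / v)

v_imputed = variances.copy()
v_imputed[np.isnan(v_imputed)] = np.nanmean(variances)   # mean imputation (0.04)

v_actual = variances.copy()
v_actual[np.isnan(v_actual)] = true_missing

print(pooled_variance(v_imputed) < pooled_variance(v_actual))  # True: precision overstated
```

Because the imputed variances (0.04) are much smaller than the true ones (0.20, 0.25), those studies receive inflated inverse-variance weights and the pooled variance is understated, which is the bias the paper derives analytically.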
    This paper investigates three MICE methods: Predictive Mean Matching (PMM), Quantile Regression-based Multiple Imputation (QR-based MI) and Simple Random Sampling Imputation (SRSI) at imputation numbers 5, 15, 20 and 30 with 5% and 20% missing values, to ascertain which produces imputed values that best match the observed values and to compare model fit based on the AIC and MSE. The results show that QR-based MI produced more imputed values that did not match the observed values; SRSI produced imputed values that matched the observed values better as the number of imputations increased; while PMM produced imputed values that matched the observed values at all numbers of imputations and levels of missingness considered. The model-fit results for 5% missingness showed that QR-based MI produced the best results in terms of MSE except for M = 15, while the AIC results showed that PMM produced the best result for M = 5, QR-based MI for M = 15, and SRSI for M = 20 and 30. The model-fit results for 20% missingness show that PMM produced the best results at all numbers of imputations considered for both AIC and MSE, except the AIC at M = 15, where SRSI produced the best results. It is concluded that, in comparison, PMM is most suited when missingness is 20%, but for 5% missingness the model fit is best with QR-based MI.
    Imputation (statistics)
    Quantile
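Whatever method generates the M completed data sets, the M completed-data estimates are typically combined with Rubin's rules; a minimal sketch with hypothetical coefficient estimates (not data from this paper):

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Combine M completed-data estimates via Rubin's rules."""
    m = len(estimates)
    q_bar = np.mean(estimates)        # pooled point estimate
    w = np.mean(variances)            # within-imputation variance
    b = np.var(estimates, ddof=1)     # between-imputation variance
    t = w + (1 + 1 / m) * b           # total variance
    return q_bar, t

# hypothetical regression coefficients from M = 5 imputed data sets
est = np.array([0.51, 0.48, 0.55, 0.50, 0.46])
var = np.array([0.010, 0.011, 0.009, 0.010, 0.012])
q, t = rubin_pool(est, var)
print(q, t)
```

The (1 + 1/M) factor is why a larger number of imputations M, as varied in the study above, tightens the pooled variance when the estimates disagree across imputations.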
    This study compares four methods (Mean, Part-mean, Regression, and K-nearest neighbor (KNN)) for imputing the missing response values in a central composite design (CCD). Four test functions are used to cover all possible cases of a single missing response in a CCD with two factors. Performance was measured using the mean-squared error and the correlation coefficient for the three parts of the CCD (factorial, center, or axial) by comparing the imputed and actual values. The results show that the influence of a missing response value on the affected part of the CCD (factorial, center, axial) cannot be neglected, and that the Regression imputation method was superior to the other three for an imputed value in the factorial or axial parts of the CCD. Furthermore, for missing values in the center part, the results show that the Part-mean imputation method was equally as good as the more complex imputation methods of Regression and KNN.
    Imputation (statistics)
    Abstract Background Multiple imputation is frequently used to address missing data when conducting statistical analyses. There is a paucity of research into the performance of multiple imputation when the prevalence of missing data is very high. Our objective was to assess the performance of multiple imputation when estimating a logistic regression model when the prevalence of missing data for predictor variables is very high. Methods Monte Carlo simulations were used to examine the performance of multiple imputation when estimating a multivariable logistic regression model. We varied the size of the analysis samples (N = 500, 1,000, 5,000, 10,000, and 25,000) and the prevalence of missing data (5-95% in increments of 5%). Results In general, multiple imputation performed well across the range of scenarios. The exceptions were scenarios in which the sample size was 500 or 1,000 and the prevalence of missing data was at least 90%. In these scenarios, the estimated standard errors of the log-odds ratios were very large and did not accurately estimate the standard deviation of the sampling distribution of the log-odds ratio. Furthermore, in these settings, estimated confidence intervals tended to be conservative. In all other settings (i.e., sample sizes > 1,000 or a prevalence of missing data below 90%), multiple imputation allowed for accurate estimation of a logistic regression model. Conclusions Multiple imputation can be used in many scenarios with a very high prevalence of missing data.
    Imputation (statistics)