Different methods to complete datasets used for capture-recapture estimation: Estimating the number of usual residents in the Netherlands
Abstract:
We are interested in an estimate of the number of usual residents in the Netherlands. Capture-recapture estimation with three registers enables us to estimate the size of the total population, of which the usual residents are a part. However, usual residence cannot be used as a covariate because it is not available in one of the registers. We approach this as a missing data problem. There are different methods available to handle missing data. In this manuscript we use the Expectation-Maximization (EM) algorithm and Predictive Mean Matching (PMM). The EM algorithm is often used in categorical data analysis, but PMM has the advantage of flexibility in the choice of the specific part of the observed data used for the imputation of the missing data. Four scenarios have been identified in which the missing data are completed via either the EM algorithm or PMM imputation, resulting in different population size estimates for usual residence. It was found that the different scenarios lead to different population size estimates; even small changes in the completed data lead to different estimates. In this study, PMM imputation performs best with regard to flexibility, and it is theoretically better motivated.
Keywords: Imputation (statistics); Categorical variable; Mark and recapture
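The PMM step can be sketched in a few lines, assuming a single numeric predictor and a least-squares model; the function and data below are illustrative, not the authors' implementation:

```python
import random

def pmm_impute(x, y, k=3, seed=0):
    """Predictive mean matching: fit a least-squares line on the complete
    cases, then for each missing y draw a donor among the k observed cases
    whose predicted values are closest, and copy the donor's observed y."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]

    # Ordinary least squares on the complete cases: y ~ a + b * x
    mx = sum(x[i] for i in obs) / len(obs)
    my = sum(y[i] for i in obs) / len(obs)
    b = (sum((x[i] - mx) * (y[i] - my) for i in obs)
         / sum((x[i] - mx) ** 2 for i in obs))
    a = my - b * mx
    pred = [a + b * xi for xi in x]

    imputed = list(y)
    for i in mis:
        donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:k]
        imputed[i] = y[rng.choice(donors)]  # an observed value, never a raw prediction
    return imputed
```

Because every imputed value is a genuinely observed one, PMM never produces impossible values, and the donor pool (here `k`) is exactly the "specific part of the observed data" whose choice gives the method its flexibility.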
Related papers:
Multiple imputation can be a good solution to handling missing data if data are missing at random. However, this assumption is often difficult to verify. We describe an application of multiple imputation that makes this assumption plausible. This procedure requires contacting a random sample of subjects with incomplete data to fill in the missing information, and then adjusting the imputation model to incorporate the new data. Simulations with missing data that were decidedly not missing at random showed, as expected, that the method restored the original beta coefficients, whereas other methods of dealing with missing data failed. Using a dataset with real missing data, we found that different approaches to imputation produced moderately different results. Simulations suggest that filling in 10% of data that was initially missing is sufficient for imputation in many epidemiologic applications, and should produce approximately unbiased results, provided there is a high response on follow-up from the subsample of those with some originally missing data. This response can probably be achieved if this data collection is planned as an initial approach to dealing with the missing data, rather than at later stages, after further attempts that leave only data that is very difficult to complete.
Missing data are a common problem in nutritional epidemiology. Little is known of the characteristics of these missing data, which makes it difficult to conduct appropriate imputation. We telephoned, at random, 20% of subjects (n = 2091) from the Adventist Health Study-2 cohort who had any of 80 key variables missing from a dietary questionnaire. We were able to obtain responses for 92% of the missing variables. We found a consistent excess of "zero" intakes in the filled-in data that were initially missing. However, for frequently consumed foods, most missing data were not zero, and these were usually not distinguishable from a random sample of nonzero data. Older, black, and less-well-educated subjects had more missing data. Missing data are more likely to be true zeroes in older subjects and those with more missing data. Zero imputation for missing data may create little bias except for more frequently consumed foods, in which case zero imputation will be suboptimal if there is more than 5%-10% missing. Although some missing data represent true zeroes, much of it does not, and data are usually not missing at random. Automatic imputation of zeroes for missing data will usually be incorrect, although there is little bias unless the foods are frequently consumed. Certain identifiable subgroups have greater amounts of missing data, and require greater care in making imputations.
Missing data is a major problem in real-world datasets, which hinders the performance of data analytics. Conventional data imputation schemes such as univariate single imputation replace missing values in each column with the same approximated value. These univariate single imputation techniques underestimate the variance of the imputed values. On the other hand, multivariate imputation explores the relationships between different columns of data, to impute the missing values. Reinforcement Learning (RL) is a machine learning paradigm where the agent learns by taking actions and receiving rewards in response, to achieve its goal. In this work, we propose an RL-based approach to impute missing data by learning a policy to impute data through an action-reward-based experience. Our approach imputes missing values in a column by working only on the same column (similar to univariate single imputation) but imputes the missing values in the column with different values thus keeping the variance in the imputed values. We report superior performance of our approach, compared with other imputation techniques, on a number of datasets.
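The variance point can be illustrated with a toy univariate example; this contrasts the mean-imputation baseline the paper criticizes with a simple variance-preserving alternative (random draws from the observed column), not the paper's RL method:

```python
import random
import statistics

def mean_impute(col):
    """Univariate single imputation: every gap gets the observed mean."""
    m = statistics.mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]

def draw_impute(col, seed=0):
    """Univariate imputation that keeps spread: each gap gets a random
    draw from the observed values of the same column."""
    rng = random.Random(seed)
    pool = [v for v in col if v is not None]
    return [rng.choice(pool) if v is None else v for v in col]

col = [1.0, None, 5.0, None, 9.0, None, 3.0]
```

With `mean_impute`, every filled-in cell is the constant 4.5, so the imputed entries contribute no spread; `draw_impute` fills the same cells with different observed values and so retains variance in the column.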
In medical research missing data are sometimes inevitable. Different missingness mechanisms can be distinguished: (a) missing completely at random; (b) missing by design; (c) missing at random, and (d) missing not at random. If participants with missing data are excluded from statistical analyses, this can lead to biased study results and loss of statistical power. Imputation methods can be applied to estimate missing values; multiple imputation gives a good idea of the inaccuracy of the reconstructed measurements. The most common imputation methods assume that missing data are missing at random. Multiple imputation contributes greatly to the efficiency and reliability of estimates because maximum use is made of the data collected. Imputation is not meant to obviate low-quality data.
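Three of these mechanisms are easy to mimic in a small simulation (a hedged sketch with invented variables: `age` is always observed, `income` is subject to missingness; "missing by design" is a sampling decision and is not simulated here):

```python
import random

rng = random.Random(1)
age = [rng.gauss(50, 10) for _ in range(1000)]       # fully observed covariate
income = [1.5 * a + rng.gauss(0, 5) for a in age]    # variable that goes missing

# MCAR: every value has the same 20% chance of being missing.
mcar = [None if rng.random() < 0.2 else v for v in income]

# MAR: missingness depends only on the observed covariate (older -> more gaps).
mar = [None if a > 55 and rng.random() < 0.5 else v
       for a, v in zip(age, income)]

# MNAR: missingness depends on the value that is itself missing.
mnar = [None if v > 80 and rng.random() < 0.5 else v for v in income]
```

Standard imputation methods that assume MAR can use `age` to reconstruct the MAR gaps, while the MNAR gaps cannot be explained by anything that remains observed.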
In this article, we first review the literature on dealing with missing values on a covariate in randomized studies and summarize what has been done and what is lacking to date. We then investigate the situation with a continuous outcome and a missing binary covariate in more details through simulations, comparing the performance of multiple imputation (MI) with various simple alternative methods. This is finally extended to the case of time‐to‐event outcome. The simulations consider five different missingness scenarios: missing completely at random (MCAR), at random (MAR) with missingness depending only on the treatment, and missing not at random (MNAR) with missingness depending on the covariate itself (MNAR1), missingness depending on both the treatment and covariate (MNAR2), and missingness depending on the treatment, covariate and their interaction (MNAR3). Here, we distinguish two different cases: (1) when the covariate is measured before randomization (best practice), where only MCAR and MNAR1 are plausible, and (2) when it is measured after randomization but before treatment (which sometimes occurs in nonpharmaceutical research), where the other three missingness mechanisms can also occur. The proposed methods are compared based on the treatment effect estimate and its standard error. The simulation results suggest that the patterns of results are very similar for all missingness scenarios in case (1) and also in case (2) except for MNAR3. Furthermore, in each scenario for continuous outcome, there is at least one simple method that performs at least as well as MI, while for time‐to‐event outcome MI is best.
Databases for machine learning and data mining often have missing values. How to develop effective methods for missing-value imputation is an important problem in the field of machine learning and data mining. In this paper, several methods for dealing with missing values in incomplete data are reviewed, and a new method for missing-value imputation based on iterative learning is proposed. The proposed method rests on a basic assumption: there exist cause-effect connections among condition attribute values, so missing values can be induced from known values. In the process of imputation, a part of the missing values are filled in first and converted to known values, which are then used in the next step of imputation. The iterative learning process continues until the incomplete dataset is entirely converted into a complete one. The paper also presents an example to illustrate the framework of iterative learning for missing-value imputation.
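A minimal sketch of such an iterative scheme, with nearest-donor matching standing in for the paper's cause-effect induction; it assumes at least one complete row to seed the process and at least one observed value per row:

```python
def iterative_impute(rows):
    """Iteratively complete a dataset: fill the rows with the fewest gaps
    first, copying values from the closest already-complete row; newly
    completed rows then serve as donors in later passes."""
    def dist(a, b):
        # squared distance over the attributes both rows have observed
        shared = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
        return sum((u - v) ** 2 for u, v in shared) / len(shared)

    rows = [list(r) for r in rows]
    while any(None in r for r in rows):
        complete = [r for r in rows if None not in r]  # assumed non-empty
        # easiest row first: the one with the fewest missing values
        target = min((r for r in rows if None in r), key=lambda r: r.count(None))
        donor = min(complete, key=lambda c: dist(target, c))
        for j, v in enumerate(target):
            if v is None:
                target[j] = donor[j]
    return rows
```

The key property mirrored from the paper is that each pass enlarges the pool of "known" values available to later passes, so the loop terminates with a fully complete dataset.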
When we analyze incomplete data, i.e., data with missing values, we need a treatment for the missing values. A common way to deal with this problem is to delete the cases with missing values, but various other methods have been developed. Among them are the EM algorithm and the regression algorithm, which estimate missing values and impute the missing elements with the estimated values. In this paper, we introduce the multiple-imputation software SOLAS, which generates multiple completed data sets for analysis.
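As an illustration of the EM idea mentioned above, here is a textbook sketch for a bivariate normal model with `x` fully observed and some `y` missing (this is not SOLAS's implementation; the E-step computes expected sufficient statistics, the M-step re-estimates the moments):

```python
def em_impute(x, y, iters=50):
    """EM for (x, y) bivariate normal with some y missing.
    E-step: expected y and y^2 given x under current parameters.
    M-step: re-estimate the moments from those expectations."""
    n = len(x)
    obs = [i for i in range(n) if y[i] is not None]
    mux = sum(x) / n
    sxx = sum((xi - mux) ** 2 for xi in x) / n
    # initialize the y-moments from the complete cases
    muy = sum(y[i] for i in obs) / len(obs)
    syy = sum((y[i] - muy) ** 2 for i in obs) / len(obs)
    sxy = sum((x[i] - mux) * (y[i] - muy) for i in obs) / len(obs)
    for _ in range(iters):
        slope = sxy / sxx
        cond_var = max(syy - sxy ** 2 / sxx, 0.0)  # guard against rounding
        # E-step: expected y (and y^2) for every case
        ey = [y[i] if y[i] is not None else muy + slope * (x[i] - mux)
              for i in range(n)]
        ey2 = [v * v + (cond_var if y[i] is None else 0.0)
               for i, v in enumerate(ey)]
        # M-step: update the moments from the expected statistics
        muy = sum(ey) / n
        syy = sum(ey2) / n - muy ** 2
        sxy = sum(x[i] * ey[i] for i in range(n)) / n - mux * muy
    return [y[i] if y[i] is not None else muy + sxy / sxx * (x[i] - mux)
            for i in range(n)]
```

Unlike single regression imputation on the complete cases only, the EM iterations let the filled-in expectations feed back into the parameter estimates until the two are mutually consistent.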
Missing values are a common occurrence in condition monitoring datasets. To effectively improve the integrity of data, many data imputation methods have been developed to replace the missing values with the estimated values. However, these methods do not always perform well in datasets containing different types of missing values. Three types of missing data are defined, namely isolated missing value, continuous missing variable, and continuous missing sample. A three‐step data imputation method is proposed to sequentially impute these missing values following the principle from easy to difficult. The original time series data is first to split into different segments according to the positions of continuous missing samples. Then, interpolation and space‐based methods are applied to sequentially estimate isolated missing values and continuous missing variables in each segment. Finally, a stepwise extrapolation prediction model based on the long short‐term memory network is established to repair continuous missing samples between each segment. Two application examples are implemented on different dissolved gas analysis datasets and load datasets. Compared with state‐of‐the‐art techniques, the proposed three‐step data imputation method is general and can be applied to many scenarios because it establishes a rational data recovery sequence to accurately repair both stationary and non‐stationary condition monitoring data.
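The "easy" first step of that easy-to-hard order, repairing isolated missing values by interpolation, can be sketched as follows (illustrative, not the authors' code; longer gaps are deliberately left for the later, model-based steps):

```python
def fill_isolated(series):
    """Linearly interpolate only *isolated* missing values: a single gap
    with observed neighbours on both sides. Runs of two or more missing
    values are left untouched for later repair steps."""
    out = list(series)
    for i in range(1, len(out) - 1):
        if out[i] is None and out[i - 1] is not None and out[i + 1] is not None:
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out
```

Because a run of two or more gaps never has observed values on both sides of its first element, the function leaves such runs intact, which is what allows the harder cases to be handled separately.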