Handling the Missing Data Problem in Electronic Health Records for Cancer Prediction

2020 
Electronic health records (EHRs) contain patients' clinical information. EHRs are widely used in disease diagnosis and therapy because of the large amount of valuable medical information they hold. However, missing data in EHRs hinders their use. Replacing missing values with the mean is one approach to data imputation, but it weakens feature importance. In this study, we use the expectation-maximization (EM) algorithm to impute the missing data in EHRs. Several machine learning models, including an artificial neural network, logistic regression, a support vector machine, and random forests, are used to evaluate the effectiveness of the imputation. The experimental results show that these models predict cancer more accurately on EHRs imputed by the EM algorithm than on EHRs imputed by mean values, which indicates that the EM algorithm provides accurate estimates when imputing EHR data.
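The abstract does not state which EM formulation the study uses. As an illustration only, the sketch below assumes the features are numeric and roughly follow a multivariate Gaussian, which is a common setting for EM imputation but may differ from the authors' method. The function name `em_impute` and the parameters `n_iter`, `tol`, and `ridge` are hypothetical. Missing entries are expected as NaN, the data should have at least two columns, and every column needs at least one observed value.

```python
import numpy as np


def em_impute(X, n_iter=100, tol=1e-6, ridge=1e-6):
    """Impute NaNs by EM under an assumed multivariate Gaussian model.

    Hypothetical helper for illustration; not the paper's implementation.
    Returns the completed matrix plus the estimated mean and covariance.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)

    # Start from mean imputation, the baseline the abstract compares against.
    mu = np.nanmean(X, axis=0)
    filled = np.where(missing, mu, X)
    sigma = np.cov(filled, rowvar=False, bias=True) + ridge * np.eye(d)

    for _ in range(n_iter):
        # E-step: replace each missing block with its conditional mean
        # given the observed block, and accumulate conditional covariances.
        cond_cov_sum = np.zeros((d, d))
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Fully missing row: fall back to the current mean.
                filled[i] = mu
                cond_cov_sum += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            # coef = Sigma_mo @ inv(Sigma_oo), computed via a linear solve.
            coef = np.linalg.solve(s_oo, s_mo.T).T
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            cond_cov_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T

        # M-step: re-estimate mean and covariance from the completed data.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        # The small ridge term is an assumption added for numerical stability.
        new_sigma = (centered.T @ centered + cond_cov_sum) / n + ridge * np.eye(d)

        delta = np.max(np.abs(new_mu - mu)) + np.max(np.abs(new_sigma - sigma))
        mu, sigma = new_mu, new_sigma
        if delta < tol:
            break

    return filled, mu, sigma
```

Calling `filled, mu, sigma = em_impute(X)` on a feature matrix leaves observed values unchanged and fills only the NaN entries. Unlike mean imputation, each filled value depends on the other features observed for the same patient, so correlations between features are preserved rather than flattened. Categorical EHR fields would need a different model or an encoding step before this sketch applies.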