Reliability of Supervised Machine Learning Using Synthetic Data in Healthcare: A Model to Preserve Privacy for Data Sharing.

2020 
BACKGROUND The exploitation of synthetic data in healthcare is at an early stage. Synthetic data generation could unlock the vast potential within healthcare datasets that are too sensitive for release due to privacy concerns. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalisability are scarce.

OBJECTIVE This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.

METHODS A total of 19 open healthcare datasets containing both categorical and numerical data were selected for experimental work. Synthetic data is generated using three popular synthetic data generators that apply Classification and Regression Trees (CART), parametric, and Bayesian network approaches. Real and synthetic data are used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest and support vector machine. Models are tested only on real data to determine whether a model developed by training on synthetic data can be put into use by healthcare departments and used to accurately classify new, real examples. Evaluation metrics are computed and differentials in these scores are compared. The impact of statistical disclosure control on model performance is also assessed.

RESULTS The accuracy of ML models trained on synthetic data is lower than that of models trained on real data in 92% of cases. Tree-based models trained on synthetic data deviate in accuracy from models trained on real data by 17.7-19.3%, whilst other models show lower deviations of 5.8-7.2%. The winning classifier when trained and tested on real data matches the winning classifier trained on synthetic data and tested on real data in 26.3% of cases for CART and parametric synthetic data, and in 21.1% of cases for Bayesian network generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 94.7% of cases; this is not the case for models trained on synthetic data. When tree-based models are excluded, the winning classifier for real and synthetic data is matched in 73.7%, 52.6% and 68.4% of cases for CART, parametric and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.

CONCLUSIONS The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared to models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers show some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of its robustness. Synthetic data must preserve both individual privacy and data utility in order to instil confidence in healthcare departments when utilising such data to inform policy decision-making.
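The evaluation protocol described in METHODS (train on synthetic data, test only on real data, compare accuracy against a real-data baseline) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the study used synthpop (CART and parametric) and a Bayesian network generator on 19 healthcare datasets, whereas here a naive per-class Gaussian sampler stands in as a hypothetical parametric generator, and a bundled scikit-learn dataset stands in for the healthcare data.

```python
# Sketch of the train-on-synthetic / test-on-real comparison protocol.
# ASSUMPTIONS: the per-class Gaussian generator and the breast-cancer
# dataset are stand-ins, not the generators/datasets used in the study.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

def parametric_synth(X, y, n):
    """Naive parametric generator: fit a multivariate Gaussian per class
    and sample n synthetic records with the original class proportions."""
    Xs, ys = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = int(round(n * len(Xc) / len(X)))
        mu, cov = Xc.mean(axis=0), np.cov(Xc, rowvar=False)
        Xs.append(rng.multivariate_normal(mu, cov, size=k))
        ys.append(np.full(k, c))
    return np.vstack(Xs), np.concatenate(ys)

X_syn, y_syn = parametric_synth(X_tr, y_tr, len(X_tr))

# The five supervised models evaluated in the study.
models = {
    "sgd": SGDClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(random_state=0),
}
for name, model in models.items():
    # Baseline: train on real data, test on held-out real data.
    acc_real = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    # Comparison: train on synthetic data, test on the SAME real test set.
    acc_syn = accuracy_score(y_te, model.fit(X_syn, y_syn).predict(X_te))
    print(f"{name}: real={acc_real:.3f} synthetic={acc_syn:.3f} "
          f"delta={acc_real - acc_syn:+.3f}")
```

The key design point is that both models are scored on the same real test set, so the delta isolates the utility lost by substituting synthetic training data rather than differences in test distribution.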