Evaluating a Longitudinal Synthetic Data Generator using Real World Data

2021 
Synthetic data offer a number of advantages over using ground truth data when working with private and personal information about individuals. Firstly, the risk of identifying individuals is reduced considerably, which enables the sharing of data for analysis amongst more organisations. Secondly, the fine-tuning of synthetic datapoints to suit particular modelling and analyses could help to build more suitable models that can avoid biases found in the original ground truth data. In this paper we explore how a probabilistic synthetic data generator can be used to model data with high enough fidelity that it can be used to develop and validate state-of-the-art machine learning models. In particular, we use a Bayesian network model trained on gestational diabetes data, generated by a mobile health app and collected from a number of health trusts in the UK. The synthetic data are used to train and test an established machine learning model developed by Sensyne Health using real-world data, and the resulting performance is compared to performance on ground truth data. In addition, a clinical validation is undertaken to explore whether human experts can differentiate real patients from synthetic ones. We demonstrate that the Bayesian network synthetic data generator is able to mimic the ground truth closely enough to make it difficult for a human expert to distinguish between the two. We show that the data generator captures the interactions between features and the multivariate distributions closely enough to enable classifiers to be inferred that imitate the key performance characteristics of models inferred from ground truth data. What is more, we demonstrate that the misclassifications found when testing using the synthetic data are as informative as those found when testing using ground truth data.
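The general idea of fitting a Bayesian network to ground truth data and sampling synthetic records from it can be illustrated with a minimal sketch. This is not the generator evaluated in the paper: it assumes a fixed two-node structure (A -> B), static rather than longitudinal data, and toy binary features whose names (`high`, `normal`, `treated`, `untreated`) are hypothetical. The helper names `fit_chain`, `sample_chain`, `joint` and `total_variation` are likewise illustrative. Fidelity is checked here with the total variation distance between the real and synthetic joint distributions, which is only one of many possible measures.

```python
import random
from collections import Counter, defaultdict


def fit_chain(records):
    """Estimate P(A) and P(B | A) by counting, for a fixed structure A -> B."""
    n = len(records)
    a_counts = Counter(a for a, _ in records)
    p_a = {a: c / n for a, c in a_counts.items()}
    b_given_a = defaultdict(Counter)
    for a, b in records:
        b_given_a[a][b] += 1
    p_b_given_a = {
        a: {b: c / a_counts[a] for b, c in counts.items()}
        for a, counts in b_given_a.items()
    }
    return p_a, p_b_given_a


def sample_chain(p_a, p_b_given_a, n, rng):
    """Ancestral sampling: draw A from P(A), then B from P(B | A)."""
    out = []
    for _ in range(n):
        a = rng.choices(list(p_a), weights=list(p_a.values()))[0]
        cpt = p_b_given_a[a]
        b = rng.choices(list(cpt), weights=list(cpt.values()))[0]
        out.append((a, b))
    return out


def joint(records):
    """Empirical joint distribution over (A, B) pairs."""
    n = len(records)
    return {k: c / n for k, c in Counter(records).items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


rng = random.Random(0)

# Toy stand-in for ground truth: hypothetical features, not real patient data.
real = []
for _ in range(5000):
    a = "high" if rng.random() < 0.3 else "normal"
    p_b = 0.7 if a == "high" else 0.2
    b = "treated" if rng.random() < p_b else "untreated"
    real.append((a, b))

p_a, p_b_given_a = fit_chain(real)
synthetic = sample_chain(p_a, p_b_given_a, 5000, rng)
tv = total_variation(joint(real), joint(synthetic))
print(f"total variation distance: {tv:.3f}")
```

A small distance indicates that the sampled records reproduce the dependency between the two features. A fuller evaluation along the lines described above would also train a classifier on the synthetic records and compare its performance with one trained on the ground truth.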