Distribution-Driven, Embedded Synthetic Data Generation System and Tool for RDBMS

2019 
Many self-managing relational database management systems (RDBMS) need to programmatically generate synthetic data to train machine learning models. This paper proposes the concept of a shadow database, together with a framework for deriving a shadow database from a production database so that it matches the distribution properties of the source data. Moreover, we have designed and implemented an embedded synthetic data generation tool that takes a data distribution profile as input and generates a shadow database according to the histograms of the source data. The distribution profile is passed to the tool either through an export-import mechanism or as a JSON string. The shadow database can be scaled to be larger or smaller than the original database and can serve as a testbed for training learning models. Unlike most other data generation tools, our tool is implemented as SQL procedures that can be embedded in the underlying RDBMS.
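As a rough illustration of histogram-driven generation, the following Python sketch draws synthetic column values from a JSON distribution profile and scales the row count. It is not the tool described above, which is implemented as SQL procedures. The profile layout (`row_count`, `columns`, `buckets` as `[low, high, count]` triples), the table and column names, and the functions `generate_column` and `generate_shadow_table` are all hypothetical, chosen only for this example. The sketch also assumes that columns are independent and that values are uniform within a bucket.

```python
import json
import random

# Hypothetical profile format (an assumption, not the tool's actual schema):
# each column carries a histogram given as [low, high, count] buckets.
PROFILE_JSON = """
{
  "table": "orders",
  "row_count": 1000,
  "columns": {
    "amount": {"buckets": [[0, 10, 700], [10, 50, 250], [50, 100, 50]]}
  }
}
"""


def generate_column(buckets, n_rows, rng):
    """Pick a bucket in proportion to its count, then sample uniformly inside it."""
    weights = [count for _, _, count in buckets]
    chosen = rng.choices(buckets, weights=weights, k=n_rows)
    return [rng.uniform(low, high) for low, high, _ in chosen]


def generate_shadow_table(profile, scale=1.0, seed=0):
    """Return {column_name: values} with round(row_count * scale) rows per column."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    n_rows = round(profile["row_count"] * scale)
    return {
        name: generate_column(spec["buckets"], n_rows, rng)
        for name, spec in profile["columns"].items()
    }


if __name__ == "__main__":
    profile = json.loads(PROFILE_JSON)
    shadow = generate_shadow_table(profile, scale=2.0, seed=42)
    print(len(shadow["amount"]), "synthetic rows for column 'amount'")
```

A scale factor above 1 yields a shadow table larger than the source and a factor below 1 yields a smaller one; in both cases the share of rows falling in each bucket stays close to the source histogram.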