Distribution-Driven, Embedded Synthetic Data Generation System and Tool for RDBMS

Joseph W. Hu, Ivan T. Bowman, Anisoara Nica, Anil Kumar Goel


2019

Many self-managing relational database management systems (RDBMS) need to programmatically generate synthetic data to train machine learning models. This paper proposes the concept of a shadow database and a framework for deriving a shadow database from a production database so that it matches the distribution properties of the source data. Moreover, we have designed and implemented an embedded synthetic data generation tool that takes a data distribution profile as input and generates a shadow database according to the histograms of the source data. The distribution profile is passed to the tool either through an export-import mechanism or as a JSON string. The shadow database can be scaled to be larger or smaller than the original database and can serve as a testbed for training learning models. Unlike most other data generation tools, our tool is implemented as SQL procedures that can be embedded in the underlying RDBMS.
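
The idea of histogram-driven generation can be illustrated with a minimal sketch. It is written in Python for readability, whereas the tool described above is implemented as SQL procedures. The JSON layout (`table`, `row_count`, `columns`, `buckets`), the function names, and the assumption that values are uniform within a bucket are all hypothetical and chosen for illustration only; the abstract does not specify the actual profile schema or sampling method.

```python
import json
import random

# Hypothetical profile format (not the tool's real schema): one histogram per
# column, each bucket given as [low, high, row_count].
PROFILE_JSON = """
{
  "table": "orders",
  "row_count": 1000,
  "columns": {
    "amount": {"buckets": [[0, 10, 600], [10, 100, 300], [100, 1000, 100]]}
  }
}
"""


def generate_column(buckets, n_rows, rng):
    """Draw n_rows values whose bucket frequencies follow the histogram."""
    weights = [count for _, _, count in buckets]
    chosen = rng.choices(buckets, weights=weights, k=n_rows)
    # Assumption: values are spread uniformly inside each bucket.
    return [rng.uniform(low, high) for low, high, _ in chosen]


def generate_shadow_table(profile_json, scale=1.0, seed=0):
    """Build a synthetic table from a JSON profile, scaled by `scale`."""
    profile = json.loads(profile_json)
    rng = random.Random(seed)
    n_rows = round(profile["row_count"] * scale)
    columns = {
        name: generate_column(spec["buckets"], n_rows, rng)
        for name, spec in profile["columns"].items()
    }
    return {"table": profile["table"], "row_count": n_rows, "columns": columns}


if __name__ == "__main__":
    shadow = generate_shadow_table(PROFILE_JSON, scale=2.0)
    print(shadow["table"], shadow["row_count"])
```

Passing `scale=2.0` or `scale=0.5` yields a table twice or half the size of the profiled one while keeping the same relative bucket frequencies, which mirrors the scaling property described in the abstract.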
