Sampling to Maintain Approximate Probability Distribution Under Chi-Square Test

2019 
In data management centers, it is sometimes necessary to provide a subset that shows the characteristics of the data, among which the probability distribution is an important one. Sampling is a fundamental method for generating data subsets, but how to sample a minimum subset whose probability distribution approximates the original within a fixed approximation ratio remains an open problem. In this paper, we define the approximation ratio as the significance level of the chi-square test and use this test to formulate the sampling problem. We decompose the probability distribution into conditional probabilities based on Bayesian networks and propose a heuristic search algorithm that generates the subset using two scoring functions, based on the chi-square test and on likelihood functions, respectively. Experiments on four types of datasets of size 60000 show that, with the significance level \(\alpha \) set to 0.05, the algorithm can exclude \(99.5\%\), \(97.5\%\), \(84.8\%\) and \(90.8\%\) of the samples based on their Bayesian networks, respectively.
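As a rough illustration of the acceptance criterion described above, the following minimal sketch checks whether a candidate subset preserves the distribution of a single categorical variable under Pearson's chi-square test at \(\alpha = 0.05\). It is not the paper's algorithm: the Bayesian-network decomposition and the heuristic search are omitted, and the data, the function names `chi_square_statistic` and `keeps_distribution`, and the four-category setup are all assumptions made for the example. The critical value 7.815 is the standard table value for 3 degrees of freedom.

```python
from collections import Counter

# Chi-square critical value for alpha = 0.05 and 3 degrees of freedom
# (4 categories - 1), taken from standard statistical tables.
CRITICAL_VALUE_DF3_ALPHA_005 = 7.815


def chi_square_statistic(sample, reference_probs):
    """Pearson statistic: sum over categories of (observed - expected)^2 / expected."""
    n = len(sample)
    observed = Counter(sample)
    stat = 0.0
    for category, p in reference_probs.items():
        expected = n * p
        stat += (observed.get(category, 0) - expected) ** 2 / expected
    return stat


def keeps_distribution(sample, reference_probs, critical_value):
    """True if the test does not reject 'sample follows reference_probs'."""
    return chi_square_statistic(sample, reference_probs) <= critical_value


# Hypothetical full dataset with one categorical variable (illustrative only).
full_data = ["a"] * 400 + ["b"] * 300 + ["c"] * 200 + ["d"] * 100
reference_probs = {k: v / len(full_data) for k, v in Counter(full_data).items()}

# Two hypothetical candidate subsets of size 100.
proportional_subset = ["a"] * 40 + ["b"] * 30 + ["c"] * 20 + ["d"] * 10
skewed_subset = ["a"] * 70 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10

for name, subset in [("proportional", proportional_subset), ("skewed", skewed_subset)]:
    ok = keeps_distribution(subset, reference_probs, CRITICAL_VALUE_DF3_ALPHA_005)
    print(name, round(chi_square_statistic(subset, reference_probs), 3), ok)
```

In this toy setting the proportional subset passes the test and the skewed one is rejected. A search for a minimum subset would, under these assumptions, keep shrinking the sample only while a check of this kind still passes.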