Efficient estimation of inclusion coefficient using HyperLogLog sketches

2018 
Efficiently estimating the inclusion coefficient (the fraction of values of one column that are contained in another column) is useful for tasks such as data profiling and foreign-key detection. We present a new estimator, BML, for the inclusion coefficient based on HyperLogLog sketches that results in significantly lower error than the state-of-the-art approach, which uses Bottom-k sketches. We evaluate the error of the BML estimator using experiments on industry benchmarks such as TPC-H and TPC-DS, and on several real-world databases. As an independent contribution, we show how HyperLogLog sketches can be maintained incrementally under data deletions using only a constant amount of additional memory.
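To make the quantity being estimated concrete, the following is a minimal sketch of the exact inclusion coefficient, computed by brute force over full columns. It is not the BML estimator and uses no sketches. It assumes the coefficient is taken over distinct values, and that an empty column yields 0.0 by convention; the function name `inclusion_coefficient` and the sample columns are hypothetical.

```python
from typing import Hashable, Iterable


def inclusion_coefficient(x: Iterable[Hashable], y: Iterable[Hashable]) -> float:
    """Exact fraction of distinct values of column x that also occur in column y.

    Assumption: duplicates are ignored (distinct-value semantics).
    Assumption: an empty x returns 0.0 to avoid division by zero.
    """
    distinct_x = set(x)
    if not distinct_x:
        return 0.0
    distinct_y = set(y)
    # Count the distinct values of x that are contained in y.
    return len(distinct_x & distinct_y) / len(distinct_x)


# Hypothetical columns: every orders.customer_id appears in customers.id,
# which is the pattern a foreign-key detector would look for.
orders_customer_id = [10, 11, 11, 12]
customers_id = [10, 11, 12, 13]

fk_score = inclusion_coefficient(orders_customer_id, customers_id)
reverse_score = inclusion_coefficient(customers_id, orders_customer_id)
```

Note that the measure is asymmetric: here `fk_score` is 1.0 while `reverse_score` is 0.75, since the value 13 does not occur in the orders column. The exact computation needs both columns in memory, which is the cost that sketch-based estimation is meant to avoid.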