Efficient estimation of inclusion coefficient using hyperloglog sketches
2018
Efficiently estimating the inclusion coefficient - the fraction of values of one column that are contained in another column - is useful for tasks such as data profiling and foreign-key detection. We present a new estimator, BML, for inclusion coefficient based on Hyperloglog sketches that results in significantly lower error compared to the state-of-the art approach that uses Bottom-k sketches. We evaluate the error of the BML estimator using experiments on industry benchmarks such as TPC-H and TPC-DS, and several real-world databases. As an independent contribution, we show how Hyperloglog sketches can be maintained incrementally with data deletions using only a constant amount of additional memory.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
24
References
6
Citations
NaN
KQI