A Framework for Estimating Stream Expression Cardinalities

Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, Justin Thaler

2015

Given $m$ distributed data streams $A_1, \dots, A_m$, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over $A_1, \dots, A_m$. We identify a broad class of algorithms for solving this problem and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate its generality, we describe several new sampling algorithms in our class and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.
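
To make the problem statement concrete, the following is a minimal sketch of one simple way to estimate the cardinality of a set expression over streams. It is an illustration only and is not taken from the paper: it assumes a fixed sampling threshold `theta` and a hash function shared by all streams, and the names `unit_hash`, `sample_stream`, and `estimate` are hypothetical. Each stream keeps the identifiers whose hash falls below `theta`, the set expression is evaluated on the samples, and the size of the result is scaled by `1 / theta`. If the hash behaves like a uniform random function, each identifier survives with probability `theta`, so the scaled count is an unbiased estimate.

```python
import hashlib


def unit_hash(item: str) -> float:
    """Map an identifier to a pseudo-random point in the unit interval.

    Assumption: SHA-256 stands in for an idealized uniform hash function.
    """
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sample_stream(stream, theta: float) -> set:
    """Keep the distinct identifiers whose hash value falls below theta."""
    return {x for x in stream if unit_hash(x) < theta}


def estimate(sampled_result: set, theta: float) -> float:
    """Scale the sampled count by 1/theta to estimate the true cardinality."""
    return len(sampled_result) / theta


# Two overlapping streams of identifiers (hypothetical data).
theta = 0.1
a1 = [f"user{i}" for i in range(0, 60000)]
a2 = [f"user{i}" for i in range(40000, 100000)]

# Each stream is sampled independently, as if on separate machines.
s1 = sample_stream(a1, theta)
s2 = sample_stream(a2, theta)

# Set expressions are evaluated on the samples, then scaled.
est_union = estimate(s1 | s2, theta)         # true value: 100000
est_intersection = estimate(s1 & s2, theta)  # true value: 20000
est_difference = estimate(s1 - s2, theta)    # true value: 40000
```

Because every stream uses the same hash function and the same threshold, an identifier is either sampled in every stream that contains it or in none of them, which is why set operations on the samples agree with sampling the result of the set expression directly. A fixed threshold leaves the sample size unbounded; controlling space usage requires more careful threshold rules than this sketch uses.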

Keywords:

Data stream mining
Expression (mathematics)
Unique identifier
Discrete mathematics
Gibbs sampling
Cardinality
Estimator
Generality
Mathematics
