Consistent Sampling Through Extremal Process

2021 
The1 Jaccard similarity has been widely used in search and machine learning, especially in industrial practice. For binary (0/1) data, the Jaccard similarity is often called the “resemblance” and the method of minwise hashing has been the standard tool for computing resemblances in massive data. For general weighted data, the commonly used sampling algorithm for computing the (weighted) Jaccard similarity is the Consistent Weighted Sampling (CWS). A convenient (and perhaps also mysterious) implementation of CWS is the so-called “0-bit CWS” published in KDD 2015 [31], which, in this paper, we refer to as the “relaxed CWS” and was purely an empirical observation without theoretical justification. The difficulty in the analysis of the “relaxed CWS” is due to the complicated probability problem, which we could not resolve at this point. In this paper, we propose using extremal processes to generate samples for estimating the Jaccard similarity. Surprisingly, the proposed “extremal sampling” (ES) scheme makes it possible to analyze the “relaxed ES” variant. Through some novel probability endeavours, we are able to rigorously compute the bias of the “relaxed ES” which, to a good extent, explains why the “relaxed ES” works so well and when it does not in extreme corner cases. Interestingly, compared with CWS, the resultant algorithm only involves counting and does not need sophisticated mathematical operations (as required by CWS). It is therefore not surprising that the proposed ES scheme is actually noticeably faster than CWS. Although ES is different from CWS (and other algorithms in the literature for estimating the Jaccard similarity), in retrospect ES is indeed closely related to CWS. This paper provides the much needed insight which connects CWS with extremal processes. This insight may help understand CWS (and variants), and might help develop new algorithms for similarity estimation, in future research.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    47
    References
    1
    Citations
    NaN
    KQI
    []