QJoin: A Q-Sample-Based Method for Large-Scale String Similarity Joins

2018 
Similarity joins have received significant attention over the last three decades because they are an essential operation in data integration and data cleaning. The MapReduce framework is often employed to design algorithms for large-scale similarity joins, but the large number of signatures incurs both a high shuffle cost and a high transmission cost. To decrease join time by reducing these costs, we propose a new q-sample-based algorithm, called QJoin, to support efficient string similarity joins. Experimental results on real-world datasets show that our algorithm achieves high performance and outperforms state-of-the-art approaches except when the edit distance is 0.
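To make the terms above concrete, the following is a minimal single-machine sketch of a string similarity join under an edit distance threshold, using non-overlapping q-grams (q-samples) as signatures. It is not the QJoin algorithm and involves no MapReduce; the function names `q_samples` and `similarity_join`, the parameter `tau`, and the sample data are assumptions made for illustration. The filter relies on a standard pigeonhole argument: one edit can damage at most one of the disjoint samples, so if a string has more than `tau` samples and none of them occurs in the other string, the pair cannot be within distance `tau`.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def q_samples(s: str, q: int) -> list[str]:
    """Non-overlapping q-grams of s, taken at positions 0, q, 2q, ..."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)]


def similarity_join(strings: list[str], tau: int, q: int = 2) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose edit distance is at most tau.

    Illustrative nested-loop join (hypothetical helper, not the paper's method).
    """
    results = []
    for (i, s), (j, t) in combinations(enumerate(strings), 2):
        # Length filter: the distance is at least the length difference.
        if abs(len(s) - len(t)) > tau:
            continue
        # Signature filter: tau edits can damage at most tau disjoint samples,
        # so with more than tau samples at least one must survive in t.
        samples = q_samples(s, q)
        if len(samples) > tau and not any(x in t for x in samples):
            continue
        # Verification: compute the exact distance for surviving candidates.
        if edit_distance(s, t) <= tau:
            results.append((i, j))
    return results


words = ["kitten", "sitten", "sitting", "mitten", "banana"]
pairs = similarity_join(words, tau=1)
print(pairs)
```

In a distributed setting, each signature would typically become a key that routes strings to the same worker, which is why the number of signatures per string drives the shuffle and transmission costs mentioned above.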