MLS-Join: An Efficient MapReduce-Based Algorithm for String Similarity Self-joins with Edit Distance Constraint

2018 
String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity self-joins with an edit distance constraint and propose a MapReduce-based algorithm, called MLS-Join, to support them. The proposed self-join algorithm is a filter-verify method. In the filter stage, the existing multi-match-aware substring selection scheme is improved to reduce the number of generated signatures and to eliminate redundant string pairs, including self-to-self pairs and duplicate pairs. In the verify stage, the dataset is read only once by using the techniques of positive/reversed pairs and a combined key. Experimental results on real-world datasets show that our algorithm significantly outperforms state-of-the-art approaches.
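The filter-verify idea can be illustrated with a small single-process sketch. It is not the MLS-Join algorithm: it assumes a plain segment-based (pigeonhole) filter rather than the improved multi-match-aware scheme, it runs in memory rather than on MapReduce, and all function names (`edit_distance`, `segment_lengths`, `similarity_self_join`) are hypothetical. The filter relies on the fact that if two strings are within edit distance tau, then splitting one of them into tau + 1 segments leaves at least one segment appearing unchanged as a substring of the other. Self-to-self and duplicate pairs are dropped by keeping only ordered id pairs in a set.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def segment_lengths(length, tau):
    """Lengths of the tau + 1 near-equal segments of a string of this length."""
    n = tau + 1
    base, extra = divmod(length, n)
    # Assumption of this sketch: the last `extra` segments get one more character.
    return [base + (1 if i >= n - extra else 0) for i in range(n)]


def similarity_self_join(strings, tau):
    """Return sorted id pairs (i, j), i < j, with edit distance <= tau."""
    # Filter, step 1: index every segment as a signature.
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        pos = 0
        for i, seg_len in enumerate(segment_lengths(len(s), tau)):
            index[(len(s), i, s[pos:pos + seg_len])].append(sid)
            pos += seg_len

    # Filter, step 2: probe with substrings; the set removes duplicate pairs.
    candidates = set()
    for rid, r in enumerate(strings):
        for length in range(max(0, len(r) - tau), len(r) + tau + 1):
            for i, seg_len in enumerate(segment_lengths(length, tau)):
                for p in range(len(r) - seg_len + 1):
                    for sid in index.get((length, i, r[p:p + seg_len]), ()):
                        if sid != rid:  # drop self-to-self pairs
                            candidates.add((min(sid, rid), max(sid, rid)))

    # Verify: compute the exact edit distance for surviving candidates only.
    return sorted(pair for pair in candidates
                  if edit_distance(strings[pair[0]], strings[pair[1]]) <= tau)


words = ["kitten", "sitten", "sitting", "mitten", "apple"]
print(similarity_self_join(words, 1))
```

In a MapReduce setting, the indexing step would roughly correspond to a map phase that emits signatures as keys, and the candidate grouping to a reduce phase; how MLS-Join arranges this with positive/reversed pairs and a combined key is not shown here.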