FSampleJoin: A Fixed-Sample-Based Method for String Similarity Joins Using MapReduce

2019 
Data integration and data cleaning have received significant attention in the last three decades, and similarity joins is a basic operation in these areas. In this paper, a new fixed-sample-based algorithm, called FSampleJoin, is proposed to do string similarity joins using MapReduce. Our algorithm employs a filter-verify based framework. In filter stage, a fixed-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs. In verify stage, a secondary filter is employed to eliminate the dissimilar string pairs further, and the remaining candidate pairs are verified with length-aware verification method. Experimental results show that our algorithm outperforms state-of-the-art approaches though they are similar in condition of edit distance zero.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    16
    References
    0
    Citations
    NaN
    KQI
    []