FSampleJoin: A Fixed-Sample-Based Method for String Similarity Joins Using MapReduce
2019
Data integration and data cleaning have received significant attention over the last three decades, and the similarity join is a basic operation in both areas. In this paper, a new fixed-sample-based algorithm, called FSampleJoin, is proposed to perform string similarity joins using MapReduce. Our algorithm employs a filter-verify framework. In the filter stage, a fixed-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs. In the verify stage, a secondary filter is employed to further eliminate dissimilar string pairs, and the remaining candidate pairs are verified with a length-aware verification method. Experimental results show that our algorithm outperforms state-of-the-art approaches, although performance is comparable when the edit distance threshold is zero.
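To make the filter-verify idea concrete, the following is a minimal single-machine sketch in Python. It is not the fixed-sample partition scheme of FSampleJoin, whose details the abstract does not give, and it does not use MapReduce. It assumes a generic pigeonhole filter instead: if two strings are within edit distance `tau`, then at least one of the `tau + 1` contiguous segments of one string must appear unchanged as a substring of the other, so no true pair is lost. All function names (`segments`, `candidate_pairs`, `within_edit_distance`, `similarity_join`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def segments(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    out, pos = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        out.append(s[pos:pos + size])
        pos += size
    return out


def candidate_pairs(strings, tau):
    """Filter stage (assumed pigeonhole filter, not the paper's scheme)."""
    # Signature index: segment text -> ids of strings that own that segment.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for seg in segments(s, tau):
            index[seg].add(sid)
    cands = set()
    for rid, r in enumerate(strings):
        # Probe the index with every substring of r (including the empty one).
        subs = {r[i:j] for i in range(len(r) + 1) for j in range(i, len(r) + 1)}
        for sub in subs:
            for sid in index.get(sub, ()):
                # Length filter: lengths may differ by at most tau.
                if sid != rid and abs(len(strings[sid]) - len(r)) <= tau:
                    cands.add((min(rid, sid), max(rid, sid)))
    return cands


def within_edit_distance(a, b, tau):
    """Verify stage: thresholded Levenshtein distance with early exit."""
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        # Row minima never decrease, so the pair can be rejected early.
        if min(cur) > tau:
            return False
        prev = cur
    return prev[-1] <= tau


def similarity_join(strings, tau):
    """Return all pairs of strings whose edit distance is at most tau."""
    return sorted(
        (strings[i], strings[j])
        for i, j in candidate_pairs(strings, tau)
        if within_edit_distance(strings[i], strings[j], tau)
    )


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "banana"]
    for pair in similarity_join(words, 1):
        print(pair)
```

In a MapReduce setting, the signature index would plausibly correspond to the shuffle key, with strings sharing a signature meeting in the same reducer for verification; that mapping is an assumption about the general approach rather than a description of the paper's implementation.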