QJoin: A Q-Sample-Based Method for Large-Scale String Similarity Joins
2018
Similarity joins have received significant attention over the last three decades because they are an essential operation in data integration and data cleaning. The MapReduce framework is often employed to design algorithms for large-scale similarity joins, but the large number of signatures incurs both a high shuffle cost and a high transmission cost. To decrease join time by reducing these costs, we propose a new q-sample-based algorithm, called QJoin, to support efficient string similarity joins. Experimental results on real-world datasets show that our algorithm achieves high performance and outperforms state-of-the-art approaches except when the edit-distance threshold is 0.
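The abstract does not spell out how QJoin builds its q-samples, so the following is not the authors' algorithm. It is a minimal single-machine sketch of a generic signature-based edit-distance self-join, included only to illustrate why signatures drive cost: every emitted signature is a key that a MapReduce job would have to shuffle. The scheme assumed here is the standard pigeonhole partition (split each string into tau + 1 disjoint segments; any string within edit distance tau must contain at least one of them intact). All function names are hypothetical.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def segments(s, tau):
    """Split s into tau + 1 disjoint, contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    out, pos = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        out.append(s[pos:pos + length])
        pos += length
    return out


def substrings(s, lengths):
    """All substrings of s whose length is in `lengths` (probe-side signatures)."""
    return {s[i:i + n] for n in lengths for i in range(len(s) - n + 1)}


def similarity_self_join(strings, tau):
    """Return sorted index pairs (i, j), i < j, with edit distance <= tau."""
    # "Map" phase (assumed): index every string under each of its segments.
    index = defaultdict(set)
    for rid, r in enumerate(strings):
        for seg in segments(r, tau):
            index[seg].add(rid)
    seg_lengths = {len(seg) for seg in index}

    # "Shuffle" phase (assumed): strings meeting on a shared key become candidates.
    candidates = set()
    for sid, s in enumerate(strings):
        for sub in substrings(s, seg_lengths):
            for rid in index.get(sub, ()):
                if rid != sid:
                    candidates.add((min(rid, sid), max(rid, sid)))

    # "Reduce" phase (assumed): verify each candidate with the exact distance.
    return sorted((i, j) for i, j in candidates
                  if edit_distance(strings[i], strings[j]) <= tau)
```

Calling `similarity_self_join(["kitten", "sitten", "sitting", "mitten", "banana"], 1)` returns the index pairs of strings within one edit of each other. The probe side emits many substrings per string, which is exactly the kind of signature volume that a sampling-based method would aim to cut down.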