Approximate String Similarity Join using Hashing Techniques under Edit Distance Constraints

2014 
The string similarity join, which is employed to find similar string pairs from string sets, has received extensive attention in database and information retrieval fields. To this problem, the filter-and-refine framework is usually adopted by the existing research work firstly, and then various filtering methods have been proposed. Recently, tree based index techniques with the edit distance constraint are effectively employed for evaluating the string similarity join. However, they do not scale well with large distance threshold. In this paper, we propose an efficient framework for approximate string similarity join based on Min-Hashing locality sensitive hashing and trie-based index techniques under string edit distance constraints. We show that our framework is flexible between trading the efficiency and performance. An empirical study using the real datasets demonstrates that our framework is more efficient and scales better.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    38
    References
    3
    Citations
    NaN
    KQI
    []