A New Efficient Data Cleansing Method

2002 
One of the most important tasks in data cleansing is to detect and remove duplicate records. This task has two main components: detection and comparison. A detection method decides which records will be compared, and a comparison method determines whether two compared records are duplicates. Comparisons account for a large share of data cleansing time. We find that if a comparison method satisfies certain properties, many unnecessary and expensive comparisons can be avoided. In this paper, we first propose a new comparison method, LCSS, based on the longest common subsequence, and show that it possesses the desired properties. We then propose two new detection methods, SNM-IN and SNM-INOUT, which are variants of the popular detection method SNM. A performance study on real and synthetic databases shows that integrating SNM-IN (SNM-INOUT) with LCSS saves about 39% (56%) of comparisons.
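To make the detection/comparison split concrete, the following is a minimal sketch, not the paper's implementation. It pairs a basic sorted-neighborhood pass (sort, then compare records inside a sliding window) with a similarity score derived from the longest common subsequence. The function names, the normalization by the longer string's length, the window size, and the threshold are all illustrative assumptions; the abstract does not specify how LCSS is defined, and SNM-IN and SNM-INOUT are not reproduced here.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length over the longer string's length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def snm_pairs(records, key, window):
    """Detection step (plain SNM): sort by key, then pair each record
    with the next window - 1 records in sorted order."""
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            yield rec, other


def find_duplicates(records, key=lambda r: r, window=3, threshold=0.8):
    """Comparison step: keep candidate pairs whose similarity meets
    the (hypothetical) threshold."""
    return [
        (a, b)
        for a, b in snm_pairs(records, key, window)
        if lcs_similarity(a, b) >= threshold
    ]


names = ["John Smith", "Jon Smith", "Mary Jones", "Alice Wong"]
duplicates = find_duplicates(names)
```

In this sketch every pair produced by the window is compared in full. The savings reported in the abstract come from skipping some of those comparisons when the comparison method has suitable properties, which this baseline does not attempt.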