Deduplication in Databases using Locality Sensitive Hashing and Bloom filter

2018 
Duplicates in databases represent an important data quality challenge today, one that leads to poor decisions. Deduplication is a capacity-optimization technology used to dramatically improve storage efficiency. Large databases sometimes contain tens of thousands of duplicates, which makes automatic deduplication a necessity; eliminating duplicate copies of data reduces storage costs. In this paper, we propose an effective duplicate detection method for the automatic deduplication of text files and repeated strings: a similarity-based data deduplication scheme that integrates a Bloom filter with Locality Sensitive Hashing (LSH), significantly reducing computation overhead by performing deduplication operations only on similar texts. The proposed system checks whether strings or texts in the repository are similar; if they are, it removes the duplicates and keeps only one copy of the data. The LSH and Bloom filter methods yield better results than known methods, with lower complexity.
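The combination described above can be illustrated with a minimal sketch (not the paper's actual implementation): a Bloom filter screens out probable exact duplicates cheaply, while MinHash signatures with LSH banding find near-duplicate texts, so expensive comparisons run only on similar candidates. All function names, parameters (band/row counts, the 0.8 similarity threshold), and the use of MD5 as the hash family are illustrative assumptions.

```python
import hashlib

def shingles(text, k=3):
    # Represent a text as its set of character k-grams.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_hashes):
    # MinHash signature: for each seed, the minimum hash over all shingles.
    # Fraction of matching positions estimates Jaccard similarity.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

class BloomFilter:
    # Simple Bloom filter: k hash positions per item, no false negatives.
    def __init__(self, size=8192, hashes=4):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size
    def _positions(self, item):
        for i in range(self.hashes):
            h = int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16)
            yield h % self.size
    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True
    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def deduplicate(texts, bands=20, rows=5, threshold=0.8):
    # Keep one copy of each (near-)duplicate text; illustrative parameters.
    seen = BloomFilter()
    buckets = {}  # LSH band key -> signatures of kept texts
    kept = []
    for t in texts:
        if t in seen:  # probable exact duplicate: skip cheaply
            continue
        seen.add(t)
        sig = minhash(shingles(t), bands * rows)
        # Banding: texts sharing any full band of the signature are candidates.
        keys = [tuple(sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        is_dup = any(
            sum(a == b for a, b in zip(sig, other)) / len(sig) >= threshold
            for key in keys
            for other in buckets.get(key, [])
        )
        if not is_dup:
            kept.append(t)
            for key in keys:
                buckets.setdefault(key, []).append(sig)
    return kept
```

In this sketch the Bloom filter handles exact repeats without any signature computation, and the banding step means similarity is estimated only for texts that already collide on at least one band, which is where the computational savings claimed for the scheme would come from.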