File-level Deduplication by using text files – Hive integration

2021 
With the enormous increase in data size, finding duplicate data is recognized as one of the significant challenges. Elimination of duplicate data is an essential step in data cleaning, as redundant data can degrade a system's performance during data processing. To address this, a deduplication technique is used to eliminate duplicated data at the file or content level so that only one copy of each file is stored in the database. In this paper, a technique is proposed to address both storage and deduplication issues: the Hadoop Distributed File System (HDFS) is used to handle large-scale data storage, and the SHA-256 cryptographic hash algorithm is used to identify duplicate data. Finally, HBase, a non-relational distributed database, together with Hive integration, is used for data retrieval. A dataset containing counts of COVID-19 tests and results, taken from Data.gov, is used for experimentation. The experimental results show an increase in deduplication ratio, reduced processing time, and a gain in storage space.
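The core idea of file-level deduplication described above can be illustrated with a minimal sketch: hash each file with SHA-256 and store a copy only when its digest has not been seen before. This is an illustrative assumption, not the paper's implementation; in the paper's setup the digest lookup would be backed by HBase and the unique files written to HDFS, whereas here an in-memory dictionary and the local filesystem stand in for both, and the directory names are hypothetical.

```python
# Sketch of SHA-256 file-level deduplication (local stand-in for HDFS/HBase).
import hashlib
import shutil
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate(input_dir: Path, store_dir: Path) -> dict:
    """Store only one copy of each unique file; return digest -> stored path."""
    store_dir.mkdir(parents=True, exist_ok=True)
    seen = {}
    for path in sorted(input_dir.glob("*.txt")):
        key = sha256_of_file(path)
        if key not in seen:               # first occurrence: keep one copy
            target = store_dir / path.name
            shutil.copy2(path, target)
            seen[key] = target
        # later files with the same digest are duplicates and are skipped
    return seen

if __name__ == "__main__":
    unique = deduplicate(Path("incoming"), Path("dedup_store"))
    print(f"{len(unique)} unique files stored")
```

The deduplication ratio reported in the paper can then be read off such a run as the number of input files divided by the number of unique files retained.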