A Duplication Reduction Approach for Unstructured Data Using Machine Learning Method

2019 
With the development of Internet technology, data volumes keep growing and occupy more and more storage space. Although storage is much cheaper than before, the physical space and electric power it consumes remain large, placing a heavy load on operations. In this paper, we propose a new method for reducing duplicated unstructured data. First, we compute features for all of our unstructured data and then compare them with those of the other files in the filesystem. Among files with high similarity, we remove the duplicates and keep only one file. In this way, we can eliminate many duplicate files. We conduct experiments on real-world data. The results suggest the effectiveness of our method.
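The pipeline described in the abstract (compute features per file, compare them pairwise, keep one file from each group of highly similar files) can be illustrated with a minimal sketch. The abstract does not specify the features, the similarity measure, or the threshold, so everything here is an assumption made for illustration: character 5-gram shingles stand in for the learned features, Jaccard similarity stands in for the comparison, and the 0.8 threshold and the function names `shingles`, `jaccard`, and `select_files_to_keep` are hypothetical.

```python
def shingles(text: str, k: int = 5) -> frozenset:
    """Stand-in feature set: all overlapping k-character substrings.

    Assumption: the paper's actual features are not given in the abstract.
    """
    if len(text) <= k:
        return frozenset({text})
    return frozenset(text[i:i + k] for i in range(len(text) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|, a value in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_files_to_keep(files: dict, threshold: float = 0.8) -> list:
    """Greedy pass: keep a file only if it is not highly similar
    to any file that has already been kept.

    `files` maps a file name to its text content. The threshold is a
    hypothetical value, not one taken from the paper.
    """
    features = {name: shingles(body) for name, body in files.items()}
    kept = []
    for name in files:
        is_duplicate = any(
            jaccard(features[name], features[other]) >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(name)
    return kept
```

Calling `select_files_to_keep` on a mapping of file names to contents returns the names that survive; every other file is treated as a near-duplicate of a kept one. The greedy pass compares each file only against the files already kept, which is quadratic in the worst case; a real filesystem-scale deployment would likely need an index or hashing scheme, which this sketch does not attempt.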