Duplicate Detection for Chinese Texts Based on Semantic Fingerprint and LCS

2014 
Traditional duplicate detection algorithms for Chinese content often suffer from low accuracy. To address this issue, this paper proposes a novel method based on semantic fingerprints and the longest common subsequence (LCS). After pre-processing the text, the method first extracts an abstract of the article and then applies the tf-idf algorithm to obtain one feature vector for the content and one for the abstract. Using these two vectors as input, it computes fingerprints of both the content and the abstract with the simhash method. For a pair of texts, the Hamming distance is computed separately between the two content fingerprints and between the two abstract fingerprints, and the two distances are combined by the formula proposed in this paper to give the fingerprint similarity of the two texts. The method uses the fingerprint as a preliminary filter and then determines the similarity more precisely with the LCS algorithm. With this two-level selection, the method avoids false matches and achieves better accuracy. In addition, this paper evaluates the method by comparing its results with those of other widely used algorithms, such as LCS alone and simhash alone. Experiments show that the method improves both accuracy and running speed, and that it performs better on large-scale data.
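The two-level selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes several assumptions that do not come from the paper: character bigrams stand in for Chinese word segmentation, plain term frequency stands in for tf-idf (which needs a corpus), the fingerprint is 64 bits derived from MD5, and the thresholds `max_hamming` and `min_lcs_ratio` are hypothetical values. The paper's formula for combining the content and abstract distances is not given in the abstract, so the sketch fingerprints only the full content.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width (assumption; the abstract does not state it)


def term_weights(text):
    """Character-bigram term frequencies, a stand-in for segmentation + tf-idf."""
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return dict(Counter(bigrams))


def simhash(weights):
    """Build a simhash fingerprint from a {term: weight} mapping."""
    totals = [0.0] * BITS
    for term, weight in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            # Each set bit votes +weight, each clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def lcs_length(a, b):
    """Length of the longest common subsequence, using two DP rows."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_duplicate(text_a, text_b, max_hamming=3, min_lcs_ratio=0.8):
    """Two-level check: cheap fingerprint filter first, then LCS confirmation.

    Both thresholds are hypothetical defaults, not values from the paper.
    """
    fp_a = simhash(term_weights(text_a))
    fp_b = simhash(term_weights(text_b))
    if hamming_distance(fp_a, fp_b) > max_hamming:
        return False  # rejected by the preliminary selection
    longest = max(len(text_a), len(text_b), 1)
    return lcs_length(text_a, text_b) / longest >= min_lcs_ratio
```

Because the LCS step costs time proportional to the product of the two text lengths, running it only on pairs that pass the fingerprint filter is what keeps the overall check cheap on large collections.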