Identifying Duplicate and Contradictory Information in Wikipedia

2015 
In this paper, we identify sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. This is accomplished with a MapReduce implementation of minhash to identify sentences with high Jaccard similarity, followed by a pass to generate sentence clusters. Based on manual examination, we discovered that these clusters can be categorized into six different types: templates, identical sentences, copyediting, factual drift, references, and other. Two of these categories are particularly interesting: identical sentences quantify the extent to which content in Wikipedia is copied and pasted, and near-duplicate sentences that state contradictory facts point to quality issues in Wikipedia.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    16
    References
    8
    Citations
    NaN
    KQI
    []