Identifying Duplicate and Contradictory Information in Wikipedia

Sarah Weissman, Samet Ayhan, Joshua Bradley, Jimmy J. Lin

2015

In this paper, we identify sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages. This is accomplished with a MapReduce implementation of minhash to identify sentences with high Jaccard similarity, followed by a pass to generate sentence clusters. Based on manual examination, we discovered that these clusters can be categorized into six different types: templates, identical sentences, copyediting, factual drift, references, and other. Two of these categories are particularly interesting: identical sentences quantify the extent to which content in Wikipedia is copied and pasted, and near-duplicate sentences that state contradictory facts point to quality issues in Wikipedia.
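The pipeline described above (minhash signatures to approximate Jaccard similarity, followed by a clustering pass) can be illustrated with a small single-machine sketch. This is not the authors' MapReduce implementation: the word-level shingle size `k`, the number of hash functions `num_hashes`, the similarity `threshold`, the all-pairs comparison, and every function name below are assumptions chosen only for illustration.

```python
import hashlib
from itertools import combinations


def shingles(sentence, k=3):
    """Return the set of word k-grams of a sentence (k is an assumed setting)."""
    words = sentence.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    # One deterministic 64-bit hash per seed, standing in for a hash family.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """For each hash function, keep the minimum hash value over the set."""
    return tuple(min(_hash(seed, x) for x in items) for seed in range(num_hashes))


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def cluster(sentences, threshold=0.6, num_hashes=64):
    """Group sentence indices whose estimated similarity reaches the threshold.

    Uses union-find over all pairs, which is quadratic and only suitable
    for small inputs; a large-scale system would avoid comparing all pairs.
    """
    sigs = [minhash_signature(shingles(s), num_hashes) for s in sentences]
    parent = list(range(len(sentences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(sentences)), 2):
        if estimate_jaccard(sigs[i], sigs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sentences)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster` on a list of sentences returns lists of sentence indices, one list per cluster. Identical sentences always share a signature and so land in the same cluster, while near-duplicates are grouped only when their estimated similarity reaches the chosen threshold.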

Keywords:

Jaccard index
Web page
Data mining
Information retrieval
MinHash
Computer science
Sentence
