Code Similarity in Clone Detection

Jens Krinke, Chaiyong Ragkhitwetsagul

2021

Clone detection is one application of measuring the similarity of code. However, clone and plagiarism detectors use very different representations of source code and different techniques to identify similar code fragments. This chapter investigates the impact of source code representation (i.e. tokenisation and renaming of identifiers and literals) and the impact of similarity measurements (e.g. Jaccard index or Kondrak’s distance over n-grams) on measuring source code similarity, using two known datasets. A comparison against dedicated clone and plagiarism detectors, using average precision at k, shows that simple similarity measurements like Kondrak’s distance using n-grams over tokenised source code usually outperform specialised tools for the detection of similar, cloned, plagiarised or duplicated code.
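
As a rough illustration of the kind of measurement the abstract describes, the following minimal sketch computes a Jaccard index over token n-grams, with optional renaming of identifiers and literals. It is not the chapter's implementation: the regular-expression tokeniser, the tiny `KEYWORDS` set, the placeholder symbols `V` and `L`, and all function names are assumptions made for this example, and string literals and comments are not handled. Kondrak's distance is not shown.

```python
import re

# Hypothetical, deliberately small keyword set (an assumption for this sketch);
# a real tokeniser would follow the grammar of the language being analysed.
KEYWORDS = {"int", "return", "if", "else", "for", "while"}

# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenise(code, rename=True):
    """Split code into tokens; optionally rename identifiers to 'V' and numbers to 'L'."""
    tokens = TOKEN_RE.findall(code)
    if not rename:
        return tokens
    renamed = []
    for tok in tokens:
        if tok[0].isdigit():
            renamed.append("L")  # numeric literal placeholder
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            renamed.append("V")  # identifier placeholder
        else:
            renamed.append(tok)  # keywords and punctuation are kept as-is
    return renamed


def ngrams(tokens, n=3):
    """Return the set of contiguous token n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(code_a, code_b, n=3, rename=True):
    """Jaccard similarity of two code fragments over token n-grams."""
    return jaccard(ngrams(tokenise(code_a, rename), n),
                   ngrams(tokenise(code_b, rename), n))
```

With renaming enabled, `int sum = a + b;` and `int total = x + y;` map to the same token sequence and are therefore scored as identical; with `rename=False` they share no 3-gram. This shows, on a toy scale, why the choice of representation matters as much as the choice of similarity measure.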

Keywords:

Source code
Clone
Detector
Pattern recognition
Code (cryptography)
Computer science
Artificial intelligence
Jaccard index
Identifier
Similarity (network science)
Representation (mathematics)
