Code Similarity in Clone Detection

2021 
Clone detection is one application of measuring the similarity of code. However, clone and plagiarism detectors use very different representations of source code and different techniques to identify similar code fragments. This chapter investigates the impact of source code representation (i.e. tokenisation and renaming of identifiers and literals) and the impact of similarity measurements (e.g. Jaccard index or Kondrak’s distance over n-grams) for measuring source code similarity on two known datasets. A comparison using average precision at k with dedicated clone and plagiarism detectors shows that simple similarity measurements like Kondrak’s distance using n-grams over tokenised source code usually outperform specialised tools for the detection of similar, cloned, plagiarised or duplicated code.
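To make the terms in the abstract concrete, the following is a minimal sketch of the kind of pipeline described: tokenise the source, optionally rename identifiers and literals, build token n-grams, and compare the n-gram sets with the Jaccard index. It is an illustration under stated assumptions, not the chapter's implementation. The tokeniser regex, the small keyword set, the placeholder names `ID` and `LIT`, and all function names are hypothetical. Kondrak's distance is not shown. The average precision at k helper uses one common variant of the measure (normalising by the number of relevant items found in the top k), which may differ from the definition used in the chapter.

```python
import re

# Tiny illustrative keyword subset (assumption); a real tokeniser would use
# the full keyword list of the target language.
KEYWORDS = {"int", "return", "if", "else", "for", "while"}

# Identifiers/keywords, numbers, double-quoted strings, or any single symbol.
TOKEN_RE = re.compile(r'[A-Za-z_]\w*|\d+(?:\.\d+)?|"[^"\n]*"|\S')


def tokenise(code, rename=True):
    """Split code into tokens; optionally rename identifiers and literals."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if rename and (tok[0].isdigit() or tok[0] == '"'):
            tok = "LIT"  # hypothetical placeholder for literals
        elif rename and (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            tok = "ID"   # hypothetical placeholder for identifiers
        tokens.append(tok)
    return tokens


def ngrams(tokens, n=3):
    """Return the set of token n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(code_a, code_b, n=3, rename=True):
    """Jaccard index over token n-gram sets: |A & B| / |A | B|."""
    a = ngrams(tokenise(code_a, rename), n)
    b = ngrams(tokenise(code_b, rename), n)
    if not a and not b:
        return 1.0  # convention chosen here for two empty inputs
    return len(a & b) / len(a | b)


def average_precision_at_k(relevant_flags, k):
    """AP@k over a ranked list of booleans (True = relevant result)."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant_flags[:k], start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


if __name__ == "__main__":
    original = "int sum(int a, int b) { return a + b; }"
    renamed = "int add(int x, int y) { return x + y; }"
    print(jaccard(original, renamed, rename=True))
    print(jaccard(original, renamed, rename=False))
```

In this sketch the two fragments differ only in identifier names, so they become identical token sequences once renaming is enabled, while the score without renaming is lower. That is the effect of source code representation that the abstract refers to.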