Assessing lexical similarity between short sentences of source code based on granularity

2018 
Detecting similarity between two source code bases, or within a single code base, has many applications in plagiarism detection and in identifying reused code that is a candidate for refactoring. In this paper, which extends previously published research, several state-of-the-art techniques are investigated on source code strings: Levenshtein distance, cosine similarity, Hamming distance, ASCII-based hashing, and Rabin–Karp rolling hashing. The experiments show that Rabin–Karp hashing performs better than the other techniques in terms of running time, accuracy, and the types of clones detected. All of the techniques share one problem, namely that similarity search time increases linearly with database size, but Rabin–Karp hashing handled this issue efficiently. Moreover, the Rabin–Karp rolling hash method reported the fewest false positives, and it can also handle multiple patterns at a time.
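To illustrate the multi-pattern property mentioned above, the following is a minimal Python sketch of Rabin–Karp rolling hashing over a source code string. It is not the implementation evaluated in the paper: the function name `rabin_karp_multi`, the base 256, the modulus 1,000,000,007, and the restriction to patterns of equal length are all assumptions made for this example.

```python
def rabin_karp_multi(text, patterns, base=256, mod=1_000_000_007):
    """Find every occurrence of several equal-length patterns in text.

    Returns a dict mapping each pattern to its list of start indices.
    base and mod are illustrative choices, not values from the paper.
    """
    patterns = list(dict.fromkeys(patterns))  # drop duplicates, keep order
    hits = {p: [] for p in patterns}
    if not patterns:
        return hits
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("this sketch assumes all patterns share one length")
    if m == 0 or m > len(text):
        return hits

    def full_hash(s):
        # Polynomial hash of s, computed from scratch.
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Index the patterns by hash so one window lookup covers all of them.
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(full_hash(p), []).append(p)

    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    h = full_hash(text[:m])
    for i in range(len(text) - m + 1):
        if h in by_hash:
            window = text[i:i + m]
            for p in by_hash[h]:
                # Compare the strings to rule out hash collisions.
                if p == window:
                    hits[p].append(i)
        if i + m < len(text):
            # Roll the window: remove text[i], append text[i + m].
            h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits


code = "int a = b + c; int d = b + c;"
result = rabin_karp_multi(code, ["b + c", "int a", "int x"])
```

Each window hash is updated in constant time from the previous one, and the explicit string comparison on a hash match is what keeps collisions from being reported as false positives.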