Style analysis for source code plagiarism detection

2015 
The enormous growth in online code resources has created new challenges for detecting plagiarism in program source code. Several software applications can detect source-code similarity using different detection methods. However, few current tools detect every kind of plagiarism attack. The aim of this thesis is, therefore, to enhance methods for plagiarism detection in source code using a style analysis approach that has previously been used for authorship detection. Very few large source-code datasets are suitable for research purposes; two of them are the BlackBox dataset and the SOCO (Detection of SOurce COde) dataset. SOCO is a benchmark dataset that contains groups of similar source-code files that can be considered plagiarised, and it has been used in authorship and plagiarism detection competitions. In the first part of the thesis, the suitability of BlackBox as a source of datasets for testing plagiarism detection is explored. The files in BlackBox were analysed and visualised in order to evaluate its suitability for this research. The analysis aimed to identify similar source-code files, and therefore to detect groups of Java files within BlackBox that can be used for evaluating the performance of source-code plagiarism detection methods. In the second part of the thesis, a plagiarism detection framework, "the Metric-File Matrix Framework" (MFM), is proposed. The MFM framework is designed to overcome some of the limitations of existing plagiarism detection methods by 1) proposing a new set of metrics which consider structural and stylistic similarities; and 2) using Singular Value Decomposition as a technique to remove noise and to reduce the dimensionality of the data, in order to enhance similarity detection. The MFM framework was implemented and its performance was evaluated using the proposed metrics.
For the evaluations, the SOCO dataset was adopted and the performance of the proposed framework was compared against other state-of-the-art plagiarism detection tools, including JPlag.
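
As a rough illustration of the general idea behind a metric-file matrix reduced with Singular Value Decomposition, the following minimal sketch compares files in a truncated SVD space. It is not the thesis implementation: the matrix orientation (rows as metrics, columns as files), the metric counts, the chosen rank, the use of cosine similarity, and the function name `reduced_file_similarity` are all assumptions made for this example.

```python
import numpy as np


def reduced_file_similarity(metric_file_matrix, rank):
    """Pairwise cosine similarity between files after a rank-k SVD truncation.

    Assumed layout: rows are metrics, columns are files.
    """
    A = np.asarray(metric_file_matrix, dtype=float)
    # Thin SVD: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(rank, len(s))
    # Each row of `files` is one file expressed in the k-dimensional space.
    files = (s[:k, None] * Vt[:k, :]).T
    norms = np.linalg.norm(files, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero files as zero vectors
    unit = files / norms
    return unit @ unit.T


# Hypothetical metric counts: 4 metrics x 3 files.
# File 1 is file 0 with every count doubled; file 2 is unrelated.
matrix = [
    [4, 8, 1],
    [2, 4, 7],
    [6, 12, 0],
    [1, 2, 5],
]
sim = reduced_file_similarity(matrix, rank=2)
print(np.round(sim, 3))
```

In this toy input, files 0 and 1 have proportional metric vectors, so their similarity is close to 1, while file 2 scores noticeably lower against both. How many singular values to keep is a tuning decision that this sketch does not address.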