Using Code Evolution Information to Improve the Quality of Labels in Code Smell Datasets

2018 
Several approaches have been proposed to detect code smells. An important group of them is based on machine learning algorithms, which first require source code labeled with code smells as training data. Labeling is commonly done manually or with tools, but it is difficult for current approaches to produce reliable large-scale datasets. In this paper, we propose an approach that uses the evolution information of source code to obtain large-scale and more reliable training datasets for machine-learning-based code smell detection. Our approach analyzes how code smells, first labeled by a tool, evolve from the baseline version to the contrastive version of a software system, and then constructs training datasets based on those "changed smells". Experiments on three open source software projects for detecting four types of code smells (Data Class, God Class, Brain Class, and Brain Method) show that models trained on changed-smell datasets perform better at code smell detection than those trained on unchanged-smell datasets (with an average improvement rate of 7.8% and a maximum increase of 30%). The experimental results indicate that the evolution information of source code can be used to construct more reliable training datasets for machine-learning-based code smell detection.
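
The version-comparison step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a "changed smell" is one that the tool reports on an entity in the baseline version but no longer reports on the same entity in the contrastive version; the abstract does not spell out this criterion. The function name `split_by_evolution` and the entity names are hypothetical.

```python
from typing import Dict, Set, Tuple

Label = Tuple[str, str]  # (entity name, smell type)


def split_by_evolution(
    baseline: Dict[str, Set[str]],
    contrastive: Dict[str, Set[str]],
) -> Tuple[Set[Label], Set[Label]]:
    """Partition tool-labeled baseline smells into changed and unchanged.

    Assumption (not taken from the paper): a smell counts as "changed"
    when it is absent from the same entity in the contrastive version,
    which includes the case where the entity itself no longer exists.
    """
    changed: Set[Label] = set()
    unchanged: Set[Label] = set()
    for entity, smells in baseline.items():
        later = contrastive.get(entity, set())
        for smell in smells:
            if smell in later:
                unchanged.add((entity, smell))
            else:
                changed.add((entity, smell))
    return changed, unchanged


# Hypothetical tool output for two versions of the same system.
baseline = {
    "Order": {"Data Class"},
    "Manager": {"God Class", "Brain Class"},
    "Util.parse": {"Brain Method"},
}
contrastive = {
    "Order": {"Data Class"},
    "Manager": {"Brain Class"},
}

changed, unchanged = split_by_evolution(baseline, contrastive)
```

Under this assumption, the `changed` set would supply the labels for the changed-smell training dataset and the `unchanged` set those for the comparison dataset.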