Software Fault Proneness Prediction with Group Lasso Regression: On Factors that Affect Classification Performance

2019 
Machine learning algorithms have been used extensively for software fault proneness prediction. This paper presents the first application of Group Lasso Regression (G-Lasso) for software fault proneness classification and compares its performance to six widely used machine learning algorithms. Furthermore, we explore the effects of two factors on prediction performance: the effect of imbalance treatment using the Synthetic Minority Over-sampling Technique (SMOTE), and the effect of the datasets used in building the prediction models. Our experimental results are based on 22 datasets extracted from open source projects. The main findings include: (1) G-Lasso is robust to imbalanced data and significantly outperforms the other machine learning algorithms with respect to Recall and G-Score, i.e., the harmonic mean of Recall and (1 − False Positive Rate). (2) Even though SMOTE improved the performance of all learners, it did not have a statistically significant effect on G-Lasso's Recall and G-Score. Random Forest was in the top-performing group of learners for all performance metrics, while Naive Bayes performed the worst of all learners. (3) When using the same change metrics as features, the choice of dataset had no effect on the performance of most learners, including G-Lasso. Naive Bayes was the most affected, especially when balanced datasets were used.
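The G-Score used throughout the abstract is defined as the harmonic mean of Recall and (1 − False Positive Rate). A minimal sketch of that computation (the function name and example values are illustrative, not from the paper):

```python
def g_score(recall, fpr):
    """G-Score: harmonic mean of Recall and (1 - False Positive Rate).

    Rewards classifiers that detect faulty modules (high recall)
    without raising many false alarms (low FPR).
    """
    inverse_fpr = 1.0 - fpr
    denominator = recall + inverse_fpr
    if denominator == 0:
        return 0.0  # degenerate case: recall = 0 and FPR = 1
    return 2.0 * recall * inverse_fpr / denominator


# Illustrative values: a learner with Recall = 0.8 and FPR = 0.2
# scores g_score(0.8, 0.2) = 0.8, since both components equal 0.8.
print(g_score(0.8, 0.2))
```

Because it is a harmonic mean, G-Score is dragged toward the weaker of the two components, which is why a learner that achieves high recall only by flooding false positives scores poorly on it.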