Inside Commits: An Empirical Study on Commits in Open-Source Software

2021 
GitHub is currently the most popular open-source software hosting platform, containing about 20 million public repositories. Many studies have relied on data mined from GitHub repositories, especially commits. However, not knowing the characteristics of commits may introduce biases and threats in those studies. This work presents an empirical study that characterizes commits in terms of three aspects: the categories of activities performed in commits; the co-occurrence of activities within commits; and the size of commits by category. We analyzed 1M commits from the 24 most popular and most active Java-based projects hosted on GitHub. The main findings are that reengineering is the most frequent activity; 30% of commits involve more than one type of activity; and the most common co-occurrence of activities is reengineering with forwarding and corrective reengineering, although at a low rate of only 8%. Empirical studies should take these results into account to avoid threats and biases when working with commit data.