DOI: 10.1145/2901739.2903496
Short paper

The relationship between commit message detail and defect proneness in Java projects on GitHub

Published: 14 May 2016

Abstract

Just-In-Time (JIT) defect prediction models aim to predict the commits that will introduce defects in the future. Traditionally, JIT defect prediction models are trained using metrics that are primarily derived from aspects of the code change itself (e.g., the size of the change, the author's prior experience). In addition to the code that is submitted during a commit, authors write commit messages, which describe the commit for archival purposes. It is our position that the level of detail in these commit messages can provide additional explanatory power to JIT defect prediction models. Hence, in this paper, we analyze the relationship between the defect proneness of commits and both commit message volume (i.e., the length of the commit message) and commit message content (approximated using spam filtering technology). Through analysis of JIT models that were trained using 342 GitHub repositories, we find that our JIT models outperform random guessing models, achieving AUC and Brier scores that range between 0.63-0.96 and 0.01-0.21, respectively. Furthermore, our metrics that are derived from commit message detail provide a statistically significant boost to the explanatory power of the JIT models in 43%-80% of the studied systems, accounting for up to 72% of the explanatory power. Future JIT studies should consider adding commit message detail metrics.
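To make the setup concrete, the sketch below illustrates how commit-message-detail metrics (message volume and a spam-filter-style content score) could sit alongside classic change metrics in a JIT model that is scored with AUC and the Brier score. It is a minimal illustration, not the authors' implementation: the feature names, the synthetic data, and the use of scikit-learn's logistic regression are assumptions made purely for demonstration.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-commit feature matrix (synthetic data for illustration):
#   column 0: lines added                 -- classic change-size metric
#   column 1: author's prior commit count -- classic experience metric
#   column 2: words in the commit message -- "message volume" metric
#   column 3: spam-filter content score   -- "message content" metric
X = rng.random((500, 4))
y = (rng.random(500) < 0.2).astype(int)  # 1 = commit later linked to a defect fix

model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)[:, 1]

# In-sample scores only; a real study would validate out-of-sample.
print("AUC:  ", roc_auc_score(y, probs))     # discrimination; 0.5 is random guessing
print("Brier:", brier_score_loss(y, probs))  # calibration; lower is better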

Published In

MSR '16: Proceedings of the 13th International Conference on Mining Software Repositories
May 2016
544 pages
ISBN: 9781450341868
DOI: 10.1145/2901739

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICSE '16