Just-in-time software defect prediction via bi-modal change representation learning

Published: 01 January 2025

Abstract

To predict software defects at an early stage, researchers have proposed just-in-time defect prediction (JIT-DP) to identify potential defects in code commits. The prevailing approaches train models to represent code changes in historical commits and use the learned representations to predict the presence of defects in the latest commit. However, existing models learn only the edits to source code, without considering the natural-language intentions behind the changes. This limitation hinders their ability to capture deeper semantics. To address this, we introduce a novel bi-modal change pre-training model called BiCC-BERT. BiCC-BERT is pre-trained on a code change corpus to learn bi-modal semantic representations. To incorporate commit messages from the corpus, we design a new pre-training objective called Replaced Message Identification (RMI), which learns the semantic association between commit messages and code changes. We then integrate BiCC-BERT into JIT-DP and propose a new defect prediction approach — JIT-BiCC. By leveraging the bi-modal representations from BiCC-BERT, JIT-BiCC captures deeper change semantics. We train JIT-BiCC on 27,391 code changes and compare its performance with 8 state-of-the-art JIT-DP approaches. The results demonstrate that JIT-BiCC outperforms all baselines, achieving a 10.8% improvement in F1-score. This highlights its effectiveness in learning bi-modal semantics for JIT-DP.
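The RMI objective described above can be illustrated with a minimal data-construction sketch. This is an illustrative assumption of how such instances might be built, not the authors' implementation; the function name, replacement probability, and labeling scheme are hypothetical:

```python
import random

def build_rmi_instances(pairs, replace_prob=0.5, seed=0):
    """Sketch of Replaced Message Identification (RMI) instance construction.

    For each (code_change, commit_message) pair, the message is either kept
    (label 1, "matching") or swapped with a message drawn from a different
    commit (label 0, "replaced"). A model trained to classify this label
    must learn whether the message semantically describes the change.
    """
    rng = random.Random(seed)
    messages = [msg for _, msg in pairs]
    instances = []
    for change, msg in pairs:
        if rng.random() < replace_prob and len(pairs) > 1:
            # Negative sample: a commit message from another change.
            other = rng.choice([m for m in messages if m != msg])
            instances.append((change, other, 0))
        else:
            # Positive sample: the change keeps its own message.
            instances.append((change, msg, 1))
    return instances
```

In a BERT-style setup, each instance would be encoded as `[CLS] message [SEP] code_change [SEP]` and the `[CLS]` state classified against the 0/1 label, analogous to next-sentence prediction.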

Highlights

A novel pre-training model, BiCC-BERT, that extracts bi-modal semantic information from code changes.
A novel pre-training objective, Replaced Message Identification (RMI), that lets the model explicitly learn the semantic association between commit messages and code changes.
An approach for just-in-time defect prediction (JIT-DP) based on semantic representations extracted by the pre-trained BiCC-BERT.
JIT-BiCC significantly outperforms the state-of-the-art approaches.
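The final step of the pipeline, scoring a commit from its learned representation, can be sketched as a standard binary-classification head. The vector, weights, and sigmoid head below are generic assumptions about fine-tuning a pre-trained encoder, not the paper's exact architecture:

```python
import math

def predict_defect(change_embedding, weights, bias):
    """Score a commit's defect probability from its bi-modal representation.

    `change_embedding` stands in for the pooled vector a bi-modal encoder
    (e.g. the [CLS] state over commit message + code change) would produce.
    A linear layer plus sigmoid is the usual binary fine-tuning head.
    """
    logit = sum(w * x for w, x in zip(weights, change_embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # probability the commit is defective
```

In practice the weights would be learned jointly with the encoder during fine-tuning on labeled defect-inducing commits.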


Published In

Journal of Systems and Software  Volume 219, Issue C
Jan 2025
734 pages

Publisher

Elsevier Science Inc.

United States


Author Tags

  1. JIT software defect prediction
  2. PLM for code changes
  3. Replaced message identification

Qualifiers

  • Research-article
