DOI: 10.1145/3540250.3549149
research-article

Do bugs lead to unnaturalness of source code?

Published: 09 November 2022

Abstract

Texts in natural languages are highly repetitive and predictable, a property known as naturalness. Recent research has shown that source code in programming languages is also repetitive and predictable, and that naturalness is an inherent property of source code. It has also been reported that buggy code is significantly less natural than bug-free code, and that bug fixing substantially improves the naturalness of the source code involved. In this paper, we revisit the naturalness of buggy code and investigate the effect of bug fixing on the naturalness of source code. Unlike the existing investigation, we leverage two large-scale, high-quality bug repositories in which bug-irrelevant changes in bug-fixing commits have been explicitly excluded. Our evaluation results confirm that buggy lines are often less natural than bug-free ones. However, fixing bugs does not significantly improve the naturalness of the code lines involved: fixed lines are, on average, as unnatural as buggy ones. Consequently, bugs are not the root cause of the unnaturalness of source code, and identifying buggy lines solely by their naturalness could be inaccurate. Our evaluation results suggest that naturalness-based buggy line detection yields extremely low precision (less than one percent).
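To make "naturalness" concrete, it is commonly quantified as the cross-entropy of a line under a statistical language model trained on a code corpus: the higher the entropy, the less predictable (less natural) the line. The following is a minimal sketch only, assuming a bigram model with add-one smoothing and a whitespace tokenizer. The abstract does not specify the language model used in the paper, and the names `BigramModel`, `tokenize`, and `line_entropy` are hypothetical.

```python
import math
from collections import Counter


def tokenize(line):
    # Assumption: whitespace tokenization; a real study would use a language lexer.
    return line.split()


class BigramModel:
    """Toy bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus_lines):
        self.context_counts = Counter()  # how often a token appears as a context
        self.bigram_counts = Counter()   # how often (previous, current) appears
        self.vocab = set()
        for line in corpus_lines:
            tokens = ["<s>"] + tokenize(line)  # "<s>" marks the start of a line
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev, cur):
        # The extra +1 in the vocabulary size reserves mass for unseen tokens.
        v = len(self.vocab) + 1
        return (self.bigram_counts[(prev, cur)] + 1) / (self.context_counts[prev] + v)

    def line_entropy(self, line):
        """Average negative log2 probability per token (bits per token)."""
        tokens = ["<s>"] + tokenize(line)
        if len(tokens) < 2:
            return 0.0  # empty line: nothing to predict
        total = -sum(
            math.log2(self.prob(prev, cur))
            for prev, cur in zip(tokens, tokens[1:])
        )
        return total / (len(tokens) - 1)


corpus = ["int i = 0 ;", "int j = 0 ;", "int k = 0 ;"]
model = BigramModel(corpus)
familiar = model.line_entropy("int i = 0 ;")
unfamiliar = model.line_entropy("foo ( bar ) ;")
print(familiar < unfamiliar)  # the unfamiliar line scores as less natural
```

Under such a measure, a naturalness-based detector would flag the highest-entropy lines as suspicious. The abstract's finding is that fixed lines score about as high as buggy ones, so a high score alone is weak evidence of a bug.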


Published In

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2022
1822 pages
ISBN:9781450394130
DOI:10.1145/3540250
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Bug Fixing
  2. Bugs
  3. Code Entropy
  4. Naturalness
  5. Source Code

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • CCF-Huawei Formal Verification Innovation Research Plan

Conference

ESEC/FSE '22
Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%
