Do bugs lead to unnaturalness of source code?

Authors:

Lu Zhang

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Pages 1085 - 1096

https://doi.org/10.1145/3540250.3549149

Published: 09 November 2022

Abstract

Texts in natural languages are highly repetitive and predictable, a property referred to as naturalness. Recent research has shown that source code in programming languages is also repetitive and predictable, and that naturalness is an inherent property of source code. It has also been reported that buggy code is significantly less natural than bug-free code, and that bug fixing substantially improves the naturalness of the involved source code. In this paper, we revisit the naturalness of buggy code and investigate the effect of bug fixing on the naturalness of source code. Unlike existing investigations, we leverage two large-scale, high-quality bug repositories in which bug-irrelevant changes in bug-fixing commits have been explicitly excluded. Our evaluation results confirm that buggy lines are often less natural than bug-free ones. However, fixing bugs does not significantly improve the naturalness of the involved code lines: fixed lines are, on average, as unnatural as buggy ones. Consequently, bugs are not the root cause of the unnaturalness of source code, and identifying buggy code lines solely by their naturalness could be inaccurate. Our evaluation results suggest that naturalness-based buggy-line detection yields extremely low precision (less than one percent).
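The abstract does not say how naturalness is computed. As a rough illustration only, the sketch below assumes the common convention of scoring a line by its cross-entropy under an n-gram language model (higher entropy means less natural), then flagging the highest-entropy lines as suspicious and measuring precision. The bigram order, add-one smoothing, whitespace tokenization, and the names `train_bigram`, `line_entropy`, and `precision_at_k` are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def train_bigram(lines):
    """Collect bigram statistics from whitespace-tokenized lines (hypothetical helper)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for line in lines:
        tokens = ["<s>"] + line.split()
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])             # tokens that act as a left context
        bigrams.update(zip(tokens, tokens[1:]))  # (previous, current) pairs
    # +1 reserves probability mass for tokens never seen in training
    return contexts, bigrams, len(vocab) + 1


def line_entropy(line, model):
    """Average negative log2 probability per token, with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + line.split()
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total -= math.log2(p)
    return total / max(len(tokens) - 1, 1)


def precision_at_k(scored_lines, buggy, k):
    """Fraction of the k highest-entropy lines that are labeled buggy."""
    top = sorted(scored_lines, key=lambda pair: pair[1], reverse=True)[:k]
    return sum(1 for line, _ in top if line in buggy) / k
```

With a scheme like this, a line can score as unnatural simply because it uses rare identifiers or an unusual idiom, which is consistent with the abstract's observation that fixed lines remain about as unnatural as the buggy lines they replace.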

Cited By

Yang C., Chen J., Jiang J., and Huang Y. 2024. Dependency-Aware Code Naturalness. Proceedings of the ACM on Programming Languages, 8, OOPSLA2 (2024), 2355–2377. Online publication date: 8-Oct-2024. https://dl.acm.org/doi/10.1145/3689794

Published In

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

November 2022

1822 pages

ISBN:9781450394130

DOI:10.1145/3540250

General Chair: Abhik Roychoudhury (National University of Singapore, Singapore)

Program Chairs: Cristian Cadar (Imperial College London, UK), Miryung Kim (University of California at Los Angeles, USA)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
CCF-Huawei Formal Verification Innovation Research Plan

Conference

ESEC/FSE '22

ESEC/FSE '22: 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

November 14 - 18, 2022

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%
