skip to main content
research-article

Demystifying regular expression bugs: A comprehensive study on regular expression bug causes, fixes, and testing

Published: 01 January 2022 Publication History

Abstract

Regular expressions cause string-related bugs and open security vulnerabilities for DOS attacks. However, beyond ReDoS (Regular expression Denial of Service), little is known about the extent to which regular expression issues affect software development and how these issues are addressed in practice. We conduct an empirical study of 356 regex-related bugs from merged pull requests in Apache, Mozilla, Facebook, and Google GitHub repositories. We identify and classify the nature of the regular expression problems, the fixes, and the related changes in the test code. The most important findings in this paper are as follows: 1) incorrect regular expression semantics is the dominant root cause of regular expression bugs (165/356, 46.3%). The remaining root causes are incorrect API usage (9.3%) and other code issues that require regular expression changes in the fix (29.5%), 2) fixing regular expression bugs is nontrivial as it takes more time and more lines of code to fix them compared to the general pull requests, 3) most (51%) of the regex-related pull requests do not contain test code changes. Certain regex bug types (e.g., compile error, performance issues, regex representation) are less likely to include test code changes than others, and 4) the dominant type of test code changes in regex-related pull requests is test case addition (75%). The results of this study contribute to a broader understanding of the practical problems faced by developers when using, fixing, and testing regular expressions.

References

[1]
Amann S, Nadi S, Nguyen HA, Nguyen TN, Mezini M (2016) Mubench: A benchmark for api-misuse detectors. In: Proceedings of the 13th international conference on mining software repositories, pp. 464–467
[2]
Amann S, Nguyen HA, Nadi S, Nguyen TN, and Mezini M A systematic evaluation of static api-misuse detectors IEEE Trans Softw Eng 2018 45 12 1170-1188
[4]
Bae S, Cho H, Lim I, Ryu S (2014) Safewapi: Web api misuse detector for web applications. In: Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering. pp 507–517
[5]
Bai GR, Clee B, Shrestha N, Chapman C, Wright C, Stolee KT (2019) Exploring tools and strategies used during regular expression composition tasks. In: Proceedings of the 27th international conference on program comprehension. IEEE Press, pp 197–208
[6]
Bartoli A, De Lorenzo A, Medvet E, and Tarlao FInference of regular expressions for text extraction from examplesIEEE Trans Knowl Data Eng20162851217-1230https://doi.org/10.1109/TKDE.2016.2515587
[7]
Beller M, Gousios G, Zaidman A (2017) Oops, my tests broke the build: An explorative analysis of travis ci with github. In: 2017 IEEE/ACM 14th international conference on mining software repositories (MSR). IEEE, pp 356–367
[8]
Brown WH, Malveau RC, McCormick HW, Mowbray TJ (1998) AntiPatterns: refactoring software, architectures, and projects in crisis
[9]
Chapman C, Stolee KT (2016) Exploring regular expression usage and context in python. In: Proceedings of the 25th international symposium on software testing and analysis. ACM, pp 282–293
[10]
Chapman C, Wang P, Stolee KT (2017) Exploring regular expression comprehension. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering. IEEE Press, pp 405–416
[11]
Chen Q, Wang X, Ye X, Durrett G, Dillig I (2020) Multi-modal synthesis of regular expressions. In: Proceedings of the 41st ACM SIGPLAN conference on programming language design and implementation. pp 487–502
[12]
Cody-Kenny B, Fenton M, Ronayne A, Considine E, McGuire T, O’Neill M (2017) A search for improved performance in regular expressions. In: Proceedings of the genetic and evolutionary computation conference. ACM, pp 1280–1287
[13]
Coelho R, Almeida L, Gousios G, van Deursen A (2015) Unveiling exception handling bug hazards in android based on github and google code issues. In: 2015 IEEE/ACM 12th working conference on mining software repositories. IEEE, pp 134–145
[18]
Davis JC, Coghlan CA, Servant F, Lee D (2018) The impact of regular expression denial of service (ReDoS) in practice: an empirical study at the ecosystem scale. In: The ACM joint European software engineering conference and symposium on the foundations of software engineering (ESEC/FSE)
[19]
Davis JC, Michael IV, Louis G, Coghlan CA, Servant F (2019) Why aren’t regular expressions a lingua franca? an empirical study on the re-use and portability of regular expressions. In: Proceedings of the 2019 27th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. ACM, pp 443–454
[20]
Davis JC, Moyer D, Kazerouni AM, Lee D (2019) Testing regex generalizability and its implications: A large-scale many-language measurement study. In: 2019 34th IEEE/ACM international conference on automated software engineering (ASE). IEEE, pp 427–439
[21]
Davis JC, Servant F, Lee D (2021) Using selective memoization to defeat regular expression denial of service (redos). In: 2021 IEEE symposium on security and privacy (SP)
[22]
Di Franco A, Guo H, Rubio-González C (2017) A comprehensive study of real-world numerical bug characteristics. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering. IEEE Press, pp 509–519
[23]
Dig D and Johnson R How do apis evolve? a story of refactoring J Softw Maint Evol Res Pract 2006 18 2 83-107
[24]
Eghbali A, Pradel M (2020) No strings attached: An empirical study of string-related software bugs. In: 2020 35th IEEE/ACM international conference on automated software engineering (ASE). IEEE, pp 956–967
[26]
Fowler M Refactoring: improving the design of existing code 2018 Boston Addison-Wesley Professional
[27]
Githut - programming languages and github (2014) https://githut.info/
[29]
Gousios G, Pinzger M, Deursen AV (2014) An exploratory study of the pull-based software development model. In: Proceedings of the 36th international conference on software engineering. ACM, pp 345–355
[30]
Gousios G, Zaidman A (2014) A dataset for pull-based development research. In: Proceedings of the 11th working conference on mining software repositories. ACM, pp 368–371
[32]
Gu Z, Wu J, Liu J, Zhou M, Gu M (2019) An empirical study on api-misuse bugs in open-source c programs. In: 2019 IEEE 43rd annual computer software and applications conference (COMPSAC), vol 1. IEEE, pp 11–20
[33]
Gyimesi P, Vancsics B, Stocco A, Mazinanian D, Beszédes A., Ferenc R, Mesbah A (2019) Bugsjs: a benchmark of javascript bugs. In: 2019 12th IEEE conference on software testing, validation and verification (ICST). IEEE, pp 90–101
[34]
Herzig K, Just S, Zeller A (2013) It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In: Proceedings of the 2013 international conference on software engineering. IEEE Press, pp 392–401
[36]
unit testing - what is the convention for javascript test files? - stack overflow (2021) https://stackoverflow.com/questions/49632743/what-is-the-convention-for-javascript-test-files
[37]
Just R, Jalali D, Ernst MD (2014) Defects4j: A database of existing faults to enable controlled testing studies for java programs. In: Proceedings of the 2014 international symposium on software testing and analysis, pp. 437–440
[38]
Kabinna S, Bezemer CP, Shang W, Hassan AE (2016) Logging library migrations: A case study for the apache software foundation projects. In: 2016 IEEE/ACM 13th working conference on mining software repositories (MSR). IEEE, pp 154–164
[39]
Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014) The promises and perils of mining github. In: Proceedings of the 11th working conference on mining software repositories. pp. 92–101
[40]
Kapur P, Cossette B, Walker RJ (2010) Refactoring references for library migration. In: Proceedings of the ACM international conference on Object oriented programming systems languages and applications. pp. 726–738
[41]
Kechagia M, Devroey X, Panichella A, Gousios G, van Deursen A (2019) Effective and efficient api misuse detection via exception propagation and search-based testing. In: Proceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis. pp 192–203
[42]
Khomh F, Di Penta M, Gueheneuc YG (2009) An exploratory study of the impact of code smells on software change-proneness. In: 2009 16th working conference on reverse engineering. IEEE, pp 75–84
[43]
Kim M, Cai D, Kim S (2011) An empirical investigation into the role of api-level refactorings during software evolution. In: Proceedings of the 33rd international conference on software engineering. pp 151–160
[44]
Ko D, Ma K, Park S, Kim S, Kim D, Le Traon Y (2014) Api document quality for resolving deprecated apis. In: 2014 21st Asia-Pacific software engineering conference, vol 2. IEEE, pp 27–30
[45]
Kochhar PS, Bissyandé T. F., Lo D, Jiang L (2013) Adoption of software testing in open source projects–a preliminary study on 50,000 projects. In: 2013 17th european conference on software maintenance and reengineering. IEEE, pp 353–356
[46]
Larson E, Kirk A (2016) Generating evil test strings for regular expressions. In: 2016 IEEE international conference on software testing, verification and validation (ICST). IEEE, pp 309–319
[47]
Locascio N, Narasimhan K, DeLeon E, Kushman N, Barzilay R (2016) Neural generation of regular expressions from natural language with minimal domain knowledge. arXiv:1608.03000
[48]
Lou Y, Chen Z, Cao Y, Hao D, Zhang L (2020) Understanding build issue resolution in practice: Symptoms and fix patterns. In: Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, ESEC/FSE 2020. Association for Computing Machinery, New York, pp 617–628.
[49]
Lu J, Chen L, Li L, Feng X (2019) Understanding node change bugs for distributed systems. In: 2019 IEEE 26th international conference on software analysis, evolution and reengineering (SANER). IEEE, pp 399–410
[50]
Lu S, Park S, Seo E, Zhou Y (2008) Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. In: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems. pp 329–339
[51]
Ma W, Chen L, Zhang X, Zhou Y, Xu B (2017) How do developers fix cross-project correlated bugs? a case study on the github scientific python ecosystem. In: 2017 IEEE/ACM 39th international conference on software engineering (ICSE). IEEE, pp 381–392
[52]
Maalej W, Nabil H (2015) Bug report, feature request, or simply praise? on automatically classifying app reviews. In: 2015 IEEE 23rd international requirements engineering conference (RE). IEEE, pp 116–125
[53]
Madeiral F, Urli S, Maia M, Monperrus M (2019) Bears: An extensible java bug benchmark for automatic program repair studies. In: 2019 IEEE 26th international conference on software analysis, evolution and reengineering (SANER). IEEE, pp 468–478
[54]
Majumder S, Chakraborty J, Agrawal A, Menzies T (2019) Why software projects need heroes (lessons learned from 1100+ projects). arXiv:1904.09954
[56]
Marsavina C, Romano D, Zaidman A (2014) Studying fine-grained co-evolution patterns of production and test code. In: 2014 IEEE 14th international working conference on source code analysis and manipulation. IEEE, pp 195–204
[57]
Michael LG, Donohue J, Davis JC, Lee D, Servant F (2019) Regexes are hard: Decision-making, difficulties, and risks in programming regular expressions. In: ACM international conference on automated software engineering (ASE). ACM
[58]
Mileva YM, Dallmeier V, Burger M, Zeller A (2009) Mining trends of library usage. In: Proceedings of the joint international and annual ERCIM workshops on Principles of software evolution (IWPSE) and software evolution (Evol) workshops. pp 57–62
[59]
Mileva YM, Dallmeier V, Zeller A (2010) Mining api popularity. In: International academic and industrial conference on practice and research techniques. Springer, pp 173–180
[60]
Moha N, Gueheneuc YG, Duchien L, and Le Meur AF Decor: A method for the specification and detection of code and design smells IEEE Trans Softw Eng 2009 36 1 20-36
[61]
Møller A (2017) dk.brics.automaton – finite-state automata and regular expressions for Java. http://www.brics.dk/automaton/
[63]
Munaiah N, Kroh S, Cabrey C, and Nagappan M Curating github for engineered software projects Empir Softw Eng 2017 22 6 3219-3253
[64]
Ohira M, Yoshiyuki H, Yamatani Y (2016) A case study on the misclassification of software performance issues in an issue tracking system. In: 2016 IEEE/ACIS 15th international conference on computer and information science (ICIS). IEEE, pp 1–6
[65]
Palomba F, Bavota G, Di Penta M, Oliveto R, De Lucia A, Poshyvanyk D (2013) Detecting bad smells in source code using change history information. In: Proceedings of the 28th IEEE/ACM international conference on automated software engineering. IEEE Press, pp 268–278
[66]
Park JU, Ko SK, Cognetta M, Han YS (2019) Softregex: Generating regex from natural language descriptions using softened regex equivalence. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). pp 6426–6432
[67]
Pearson’s moment coefficient of skewness | a blog on probability and statistics (2015) https://probabilityandstats.wordpress.com/tag/pearsons-moment-coefficient-of-skewness/
[68]
Perkins JH (2005) Automatically generating refactorings to support api evolution. In: Proceedings of the 6th ACM SIGPLAN-SIGSOFT workshop on program analysis for software tools and engineering. pp 111–114
[69]
Pham R, Singer L, Liskin O, Figueira Filho F, Schneider K (2013) Creating a shared understanding of testing culture on a social coding site. In: 2013 35th international conference on software engineering (ICSE). IEEE, pp 112–121
[70]
Pingclasai N, Hata H, Matsumoto KI (2013) Classifying bug reports to bugs and other requests using topic modeling. In: 2013 20Th asia-pacific software engineering conference (APSEC), vol 2. IEEE, pp 13–18
[71]
Pinto LS, Sinha S, Orso A (2012) Understanding myths and realities of test-suite evolution. In: Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering, pp. 1–11
[75]
Stack overflow (2021) c# - ∖ d less efficient than [0-9] - stack overflow. https://stackoverflow.com/questions/16621738/d-less-efficient-than-0-9
[76]
[79]
Rahman MM, Roy CK (2014) An insight into the pull requests of github. In: Proc. MSR, vol 14
[80]
Saha RK, Lyu Y, Lam W, Yoshida H, Prasad MR (2018) Bugs.jar: A large-scale, diverse dataset of real-world java bugs. In: Proceedings of the 15th international conference on mining software repositories. pp 10–13
[81]
Selakovic M, Pradel M (2016) Performance issues and optimizations in javascript: an empirical study. In: Proceedings of the 38th international conference on software engineering. ACM, pp 61–72
[82]
Sharma T, Fragkoulis M, Spinellis D (2016) Does your configuration code smell?. In: 2016 IEEE/ACM 13th working conference on mining software repositories (MSR). IEEE, pp 189–200
[83]
Shen Y, Jiang Y, Xu C, Yu P, Ma X, Lu J (2018) Rescue: Crafting regular expression dos attacks. In: 2018 33rd IEEE/ACM international conference on automated software engineering (ASE). IEEE, pp 225–235
[84]
Shi L, Zhong H, Xie T, Li M (2011) An empirical study on evolution of api documentation. In: International conference on fundamental approaches to software engineering. Springer, pp 416–431
[86]
Spishak E, Dietl W, Ernst MD (2012) A type system for regular expressions. In: Proceedings of the 14th workshop on formal techniques for java-like programs. ACM, pp 20–26
[87]
Staicu CA, Pradel M (2018) Freezing the web: A study of redos vulnerabilities in javascript-based web servers. In: 27th {USENIX} Security Symposium ({USENIX} Security 18). pp 361–376
[88]
Tan L, Liu C, Li Z, Wang X, Zhou Y, and Zhai C Bug characteristics in open source software Empir Softw Eng 2014 19 6 1665-1705
[89]
Teyton C, Falleri JR, Blanc X (2012) Mining library migration graphs. In: 2012 19th working conference on reverse engineering. IEEE, pp 289–298
[90]
Thung F, Haryono SA, Serrano L, Muller G, Lawall J, Lo D, Jiang L (2020) Automated deprecated-api usage update for android apps: How far are we?. In: 2020 IEEE 27th international conference on software analysis, evolution and reengineering (SANER). IEEE, pp 602–611
[91]
Tufano M, Palomba F, Bavota G, Oliveto R, Di Penta M, De Lucia A, Poshyvanyk D (2015) When and why your code starts to smell bad. In: Proceedings of the 37th international conference on software engineering-volume 1. IEEE Press, pp. 403–414
[92]
Vahabzadeh A, Fard AM, Mesbah A (2015) An empirical study of bugs in test code. In: 2015 IEEE international conference on software maintenance and evolution (ICSME). IEEE, pp 101–110
[93]
Veanes M, De Halleux P, Tillmann N (2010) Rex: Symbolic regular expression explorer. In: 2010 Third international conference on software testing, verification and validation. IEEE, pp 498–507
[94]
Wan Z, Lo D, Xia X, Cai L (2017) Bug characteristics in blockchain systems: a large-scale empirical study. In: 2017 IEEE/ACM 14th international conference on mining software repositories (MSR). IEEE, pp 413–424
[95]
Wang P, Brown C, Jennings JA, Stolee KT (2020) An empirical study on regular expression bugs. In: Proceedings of the 17th international conference on mining software repositories. pp 103–113
[96]
Wang P, Gina R, Stolee KT (2019) Exploring regular expression evolution. In: 2019 IEEE international conference on software analysis, evolution and reengineering (SANER). IEEE, pp 502–513
[97]
Wang P, Stolee KT (2018) How well are regular expressions tested in the wild?. In: Proceedings of the 2018 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. ACM, pp 668–678
[98]
Wang X, Hong Y, Chang H, Park K, Langdale G, Hu J, Zhu H (2019) Hyperscan: a fast multi-pattern regex matcher for modern cpus. In: 16th {USENIX} symposium on networked systems design and implementation ({NSDI} 19). pp 631–648
[100]
Widyasari R, Sim SQ, Lok C, Qi H, Phan J, Tay Q, Tan C, Wee F, Tan JE, Yieh Y et al (2020) Bugsinpy: a database of existing bugs in python programs to enable controlled testing and debugging studies. In: Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. pp 1556–1560
[101]
Wu L, Wu Q, Liang G, Wang Q, Jin Z (2015) Transforming code with compositional mappings for api-library switching. In: 2015 IEEE 39th annual computer software and applications conference, vol. 2. IEEE, pp 316–325
[102]
Wüstholz V., Olivo O, Heule MJ, Dillig I (2017) Static detection of dos vulnerabilities in programs that use regular expressions. In: International conference on tools and algorithms for the construction and analysis of systems. Springer, pp 3–20
[103]
Ye X, Chen Q, Wang X, Dillig I, Durrett G (2019) Sketch-driven regular expression generation from natural language and examples. arXiv:1908.05848
[104]
Yin Z, Yuan D, Zhou Y, Pasupathy S, Bairavasundaram L (2011) How do fixes become bugs?. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering. ACM, pp 26–36
[105]
Yu S, Xu L, Zhang Y, Wu J, Liao Z, Li Y (2018) Nbsl: A supervised classification model of pull request in github. In: 2018 IEEE international conference on communications (ICC). IEEE, pp 1–6
[106]
Zaidman A, Van Rompaey B, Demeyer S, Van Deursen A (2008) Mining software repositories to study co-evolution of production & test code. In: 2008 1st international conference on software testing, verification, and validation. IEEE, pp 220–229
[107]
Zhang Y, Chen Y, Cheung SC, Xiong Y, Zhang L (2018) An empirical study on tensorflow program bugs. In: Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis. ACM, pp 129–140
[108]
Zhang Z, Yang Y, Xia X, Lo D, Ren X, Grundy J (2021) Unveiling the mystery of api evolution in deep learning frameworks a case study of tensorflow 2. In: 2021 IEEE/ACM 43rd international conference on software engineering: software engineering in practice (ICSE-SEIP). IEEE, pp. 238–247
[109]
Zhong H, Su Z (2015) An empirical study on real bug fixes. In: Proceedings of the 37th international conference on software engineering-volume 1. IEEE Press, pp 913–923
[110]
Zhong H, Xie T, Zhang L, Pei J, Mei H (2009) Mapo: Mining and recommending api usage patterns. In: European conference on object-oriented programming. Springer, pp 318–343
[111]
Zhong Z, Guo J, Yang W, Peng J, Xie T, Lou JG, Liu T, Zhang D (2018) Semregex: A semantics-based approach for generating regular expressions from natural language specifications. In: Proceedings of the 2018 conference on empirical methods in natural language processing

Cited By

View all

Index Terms

  1. Demystifying regular expression bugs: A comprehensive study on regular expression bug causes, fixes, and testing
            Index terms have been assigned to the content through auto-classification.

            Recommendations

            Comments

            Information & Contributors

            Information

            Published In

            cover image Empirical Software Engineering
            Empirical Software Engineering  Volume 27, Issue 1
            Jan 2022
            985 pages

            Publisher

            Kluwer Academic Publishers

            United States

            Publication History

            Published: 01 January 2022
            Accepted: 01 July 2021

            Author Tags

            1. Regular expression bug characteristics
            2. Pull requests
            3. Bug fixes
            4. Test code

            Qualifiers

            • Research-article

            Funding Sources

            Contributors

            Other Metrics

            Bibliometrics & Citations

            Bibliometrics

            Article Metrics

            • Downloads (Last 12 months)0
            • Downloads (Last 6 weeks)0
            Reflects downloads up to 28 Jan 2025

            Other Metrics

            Citations

            Cited By

            View all

            View Options

            View options

            Figures

            Tables

            Media

            Share

            Share

            Share this Publication link

            Share on social media