Research article
DOI: 10.1145/3377811.3381749

A study on the lifecycle of flaky tests

Published: 01 October 2020

Abstract

During regression testing, developers rely on the pass or fail outcomes of tests to check whether changes broke existing functionality. Thus, flaky tests, which nondeterministically pass or fail on the same code, are problematic because they provide misleading signals during regression testing. Although flaky tests are the focus of several existing studies, none of them studies (1) the reoccurrence, runtimes, and time-before-fix of flaky tests, or (2) flaky tests in depth on proprietary projects.
This paper fills this knowledge gap about flaky tests and investigates whether prior categorization work on flaky tests also applies to proprietary projects. Specifically, we study the lifecycle of flaky tests in six large-scale proprietary projects at Microsoft. We find, as in prior work, that asynchronous calls are the leading cause of flaky tests in these Microsoft projects. We therefore propose the first automated solution, called Flakiness and Time Balancer (FaTB), to reduce the frequency of flaky-test failures caused by asynchronous calls. Our evaluation of five such flaky tests shows that FaTB can reduce their running times by up to 78% without empirically affecting the frequency of their flaky-test failures. Lastly, our study finds several cases where developers claim they "fixed" a flaky test, but our experiments show that their changes neither fix the test nor reduce its frequency of flaky-test failures. Future studies should be more cautious when basing their results on changes that developers claim to be "fixes".
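To make the asynchronous-call scenario concrete, here is a minimal sketch in Java (with hypothetical names such as fetchStatusAsync; it is illustrative only, not code from the paper, and it does not show FaTB itself). A test that waits a fixed amount of time for an asynchronous result fails whenever the call happens to take longer than the wait, so shortening the wait speeds up the test but raises the chance of a flaky failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of an async-wait flaky test; hypothetical names, not code from the paper.
public class AsyncFlakyTestSketch {

    // Stands in for a backend call whose latency varies from run to run.
    static CompletableFuture<String> fetchStatusAsync() {
        return CompletableFuture.supplyAsync(() -> {
            try {
                // Fake nondeterministic latency between 0 and 150 ms.
                Thread.sleep((long) (Math.random() * 150));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "READY";
        });
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> result = fetchStatusAsync();
        // A 100 ms wait is sometimes shorter than the call's latency, so this
        // get() nondeterministically times out: the test is flaky. A much longer
        // wait (say 1000 ms) would almost always pass but inflate the test's
        // running time; tuning this value is the runtime-versus-flakiness
        // trade-off described in the abstract.
        String status = result.get(100, TimeUnit.MILLISECONDS);
        if (!"READY".equals(status)) {
            throw new AssertionError("expected READY but got " + status);
        }
        System.out.println("test passed");
    }
}
```

Rerunning this program a few times should show both outcomes; a tool in the spirit of FaTB would search for the smallest wait that keeps the failure frequency at a level the developers tolerate.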

Published In

ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering
June 2020
1640 pages
ISBN: 9781450371216
DOI: 10.1145/3377811

In-Cooperation

  • KIISE: Korean Institute of Information Scientists and Engineers
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. empirical study
  2. flaky test
  3. lifecycle

Conference

ICSE '20

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
