DOI: 10.1145/2635868.2635920

An empirical analysis of flaky tests

Published: 11 November 2014

Abstract

Regression testing is a crucial part of software development. It checks that software changes do not break existing functionality. An important assumption of regression testing is that test outcomes are deterministic: an unmodified test is expected to either always pass or always fail for the same code under test. Unfortunately, in practice, some tests, often called flaky tests, have non-deterministic outcomes. Such tests undermine regression testing, as they make it difficult to rely on test results. We present the first extensive study of flaky tests. We study in detail a total of 201 commits that likely fix flaky tests in 51 open-source projects. We classify the most common root causes of flaky tests, identify approaches that could manifest flaky behavior, and describe common strategies that developers use to fix flaky tests. We believe that our insights and implications can help guide future research on the important topic of (avoiding) flaky tests.
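To make the notion of a non-deterministic test outcome concrete, the minimal JUnit sketch below illustrates asynchronous waiting, one of the root-cause categories examined in the paper: the first test sleeps for a fixed interval and then asserts on work done by a background thread, so it passes or fails depending on timing; the second test shows a polling-style fix of the kind developers commonly apply. The `Downloader` class, its methods, and the timing constants are hypothetical and are not taken from the studied projects.

```java
import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class DownloaderTest {

    // Hypothetical class used only for this sketch.
    static class Downloader {
        private volatile boolean done = false;

        void startAsyncDownload() {
            new Thread(() -> {
                try {
                    Thread.sleep(150); // simulated background work
                } catch (InterruptedException ignored) {
                }
                done = true;
            }).start();
        }

        boolean isDone() {
            return done;
        }
    }

    // Flaky version: assumes the background work always finishes within 100 ms.
    // Its outcome depends on scheduling and machine load, not on the code under test.
    @Test
    public void flakyAsyncWait() throws InterruptedException {
        Downloader d = new Downloader();
        d.startAsyncDownload();
        Thread.sleep(100); // fixed sleep is not a real synchronization point
        assertTrue(d.isDone());
    }

    // Deterministic version: poll until the condition holds, with an upper bound
    // so the test cannot hang forever if the download never completes.
    @Test
    public void fixedAsyncWait() throws InterruptedException {
        Downloader d = new Downloader();
        d.startAsyncDownload();
        long deadline = System.currentTimeMillis() + 5_000;
        while (!d.isDone() && System.currentTimeMillis() < deadline) {
            Thread.sleep(10); // waitFor-style polling instead of a fixed sleep
        }
        assertTrue(d.isDone());
    }
}
```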




Published In

FSE 2014: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering
November 2014
856 pages
ISBN: 9781450330565
DOI: 10.1145/2635868

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Empirical study
  2. flaky tests
  3. non-determinism

Qualifiers

  • Research-article

Conference

SIGSOFT/FSE'14

Acceptance Rates

Overall acceptance rate: 17 of 128 submissions, 13%


Article Metrics

  • Downloads (last 12 months): 299
  • Downloads (last 6 weeks): 34
Reflects downloads up to 02 Feb 2025

