Research article
Open access

A large-scale longitudinal study of flaky tests

Published: 13 November 2020

Abstract

Flaky tests are tests that can non-deterministically pass or fail for the same code version. These tests undermine regression testing efficiency, because developers cannot easily identify whether a test fails due to their recent changes or due to flakiness. Ideally, one would detect flaky tests right when flakiness is introduced, so that developers can then immediately remove the flakiness. Some software organizations, e.g., Mozilla and Netflix, run some tools—detectors—to detect flaky tests as soon as possible. However, detecting flaky tests is costly due to their inherent non-determinism, so even state-of-the-art detectors are often impractical to be used on all tests for each project change. To combat the high cost of applying detectors, these organizations typically run a detector solely on newly added or directly modified tests, i.e., not on unmodified tests or when other changes occur (including changes to the test suite, the code under test, and library dependencies). However, it is unclear how many flaky tests can be detected or missed by applying detectors in only these limited circumstances.
To better understand this problem, we conduct a large-scale longitudinal study of flaky tests to determine when flaky tests become flaky and what changes cause them to become flaky. We apply two state-of-the-art detectors to 55 Java projects, identifying a total of 245 flaky tests that can be compiled and run in the code version where each test was added. We find that 75% of flaky tests (184 out of 245) are flaky when added, indicating substantial potential value for developers to run detectors specifically on newly added tests. However, running detectors solely on newly added tests would still miss detecting 25% of flaky tests. The percentage of flaky tests that can be detected does increase to 85% when detectors are run on newly added or directly modified tests. The remaining 15% of flaky tests become flaky due to other changes and can be detected only when detectors are always applied to all tests. Our study is the first to empirically evaluate when tests become flaky and to recommend guidelines for applying detectors in the future.
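
To make the notion of flakiness concrete, below is a minimal, hypothetical JUnit 4 test that is flaky because of a timing dependence. The class under test, the method names, and the 50 ms bound are illustrative only and are not taken from any of the 55 studied projects.

```java
import static org.junit.Assert.assertTrue;

import java.util.concurrent.CompletableFuture;
import org.junit.Test;

public class CacheWarmupTest {

    @Test
    public void warmupFinishesQuickly() throws Exception {
        CompletableFuture<Void> warmup =
                CompletableFuture.runAsync(() -> new CacheWarmup().run());

        // Flaky: the assertion races against the background task. On a fast,
        // idle machine the task usually finishes within 50 ms and the test
        // passes; under load it may not, and the test fails even though the
        // code under test is unchanged.
        Thread.sleep(50);
        assertTrue(warmup.isDone());
    }

    // Minimal stand-in for the code under test.
    static class CacheWarmup {
        void run() { /* e.g., populate an in-memory cache */ }
    }
}
```

A common fix is to wait on the future directly (for example, warmup.get(5, TimeUnit.SECONDS)) rather than sleeping for a fixed interval.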

Supplementary Material

Auxiliary Presentation Video (oopsla20main-p314-p-video.mp4)
This is a presentation video of Wing's talk at OOPSLA 2020 on the paper, "A Large-Scale Longitudinal Study of Flaky Tests", accepted in the research track. Flaky tests are tests that can non-deterministically pass or fail for the same code version. Ideally, one would detect flaky tests right when flakiness is introduced, so that developers can then immediately remove the flakiness. However, detecting flaky tests is costly due to their inherent non-determinism, so even state-of-the-art flaky-test detection tools (detectors) are often impractical to be used on all tests for each project change. To better understand when detectors should be run, we conduct a study to determine when flaky tests become flaky and what changes cause them to become flaky. We find that 85% of flaky tests are flaky when they are newly added or directly modified. Our study is the first to empirically evaluate when tests become flaky and to recommend guidelines for applying detectors in the future.
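
As a rough illustration of what a detector must do, the sketch below reruns a single JUnit 4 test on an unchanged code version and reports it as flaky if the outcomes disagree. The test class name, method name, and rerun count are placeholders, and this is not the implementation of the two detectors used in the study.

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Request;
import org.junit.runner.Result;

public class RerunDetector {

    public static void main(String[] args) throws ClassNotFoundException {
        // Placeholder test class and method; any JUnit 4 test could be used.
        Class<?> testClass = Class.forName("CacheWarmupTest");
        Request request = Request.method(testClass, "warmupFinishesQuickly");

        int passes = 0;
        int failures = 0;
        for (int i = 0; i < 100; i++) {  // 100 reruns of the same test
            Result result = new JUnitCore().run(request);
            if (result.wasSuccessful()) {
                passes++;
            } else {
                failures++;
            }
        }

        // Inconsistent outcomes on the same code version indicate flakiness.
        if (passes > 0 && failures > 0) {
            System.out.printf("FLAKY: %d passes, %d failures%n", passes, failures);
        } else {
            System.out.println("No flakiness observed in 100 reruns");
        }
    }
}
```

Rerunning every test this many times on every change is exactly the cost the paper highlights, which is why practical detectors try to be smarter about which tests to rerun and in what order.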

Published In

Proceedings of the ACM on Programming Languages, Volume 4, Issue OOPSLA
November 2020, 3108 pages
EISSN: 2475-1421
DOI: 10.1145/3436718
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 November 2020
Published in PACMPL Volume 4, Issue OOPSLA


Author Tags

  1. flaky test
  2. regression testing

Qualifiers

  • Research-article
