Assessing Effectiveness of Test Suites: What Do We Know and What Should We Do?

Published: 17 April 2024

Abstract

Background. Software testing is a critical activity for ensuring the quality and reliability of software systems. To evaluate the effectiveness of different test suites, researchers have developed a variety of metrics.

Problem. Comparing these metrics is challenging because there is no standardized evaluation framework that accounts for all relevant factors. As a result, researchers often focus on a single factor (e.g., test suite size), which ultimately leads to divergent or even contradictory conclusions. After comparing dozens of studies in detail, we identified the two problems that most trouble our community: (1) researchers tend to oversimplify the description of the ground truth they use, and (2) data involving real defects are not amenable to analysis with traditional statistical indicators.

Objective. We aim to scrutinize the entire process of comparing test suites for our community.

Method. To this end, we propose ASSENT (evAluating teSt Suite EffectiveNess meTrics), a framework to guide follow-up research on evaluating test suite effectiveness metrics. ASSENT consists of three fundamental components: ground truth, benchmark test suites, and an agreement indicator. It works as follows. First, users specify the ground truth that determines the true order in effectiveness among test suites. Second, users generate a set of benchmark test suites and derive their ground-truth order in effectiveness. Third, users apply the metric under evaluation to derive an order in effectiveness for the same test suites. Finally, users compute the agreement indicator between the metric-derived order and the ground-truth order.

Result. With ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative metrics, including mutation score and code coverage metrics. Our results show that, with real faults as the ground truth, mutation score and subsuming mutation score are the best metrics for quantifying test suite effectiveness. Meanwhile, when mutants are used in place of real faults, test suite effectiveness is overestimated by more than 20%.

Conclusion. We recommend that the standardized evaluation framework ASSENT be used to evaluate and compare test effectiveness metrics in future work.
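To make the workflow concrete, below is a minimal sketch of ASSENT's final step in Python. Everything here is illustrative rather than taken from the paper: it assumes the ground truth is the number of real faults each benchmark suite detects, the metric under evaluation is mutation score, and Kendall's tau serves as a stand-in agreement indicator (the paper defines its own indicator, which may differ).

# Minimal sketch of ASSENT's agreement computation (hypothetical data).
# Assumptions, not from the paper: ground truth = real faults detected per
# suite; metric under evaluation = mutation score; agreement indicator =
# Kendall's tau as a stand-in.
from scipy.stats import kendalltau

def assent_agreement(ground_truth, metric_values):
    """Agreement between the ground-truth effectiveness order and the order
    induced by a candidate metric, over the same benchmark test suites.
    Both arguments are parallel sequences with one value per suite."""
    tau, _p_value = kendalltau(ground_truth, metric_values)
    return tau

# Five hypothetical benchmark test suites:
real_faults_detected = [12, 9, 15, 7, 11]         # ground truth
mutation_scores = [0.80, 0.70, 0.90, 0.55, 0.65]  # metric under evaluation

print(f"Agreement (Kendall's tau): "
      f"{assent_agreement(real_faults_detected, mutation_scores):.2f}")

An agreement of 1.0 means the metric perfectly preserves the ground-truth order, while values near 0 mean it carries almost no ordering information. In this hypothetical data, one pair of suites is ranked discordantly, so the sketch prints 0.80.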

Published In

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 4
May 2024, 940 pages
EISSN: 1557-7392
DOI: 10.1145/3613665
Editor: Mauro Pezzè

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2024
Online AM: 05 December 2023
Accepted: 22 November 2023
Revised: 08 November 2023
Received: 30 April 2023
Published in TOSEM Volume 33, Issue 4

Author Tags

  1. Test suite effectiveness
  2. coverage testing
  3. mutation testing
  4. order preservation
  5. statistical indicators

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Jiangsu Province
  • NJU-Huawei Software New Technology Joint Laboratory Fund
