Assessing Effectiveness of Test Suites: What Do We Know and What Should We Do?

Published: 17 April 2024

Abstract

Background. Software testing is a critical activity for ensuring the quality and reliability of software systems. To evaluate the effectiveness of different test suites, researchers have developed a variety of metrics.

Problem. Comparing these metrics is challenging because there is no standardized evaluation framework that accounts for all relevant factors. As a result, researchers often focus on a single factor (e.g., test suite size), which ultimately leads to divergent or even contradictory conclusions. After comparing dozens of studies in detail, we identified the two problems that most trouble our community: (1) researchers tend to oversimplify the description of the ground truth they use, and (2) data involving real defects are not amenable to analysis with traditional statistical indicators.

Objective. We aim to scrutinize the entire process of comparing test suites for our community.

Method. To this end, we propose ASSENT (evAluating teSt Suite EffectiveNess meTrics), a framework to guide follow-up research on evaluating test suite effectiveness metrics. ASSENT consists of three fundamental components: ground truth, benchmark test suites, and an agreement indicator. It works as follows. First, users specify the ground truth that determines the true order in effectiveness among test suites. Second, users generate a set of benchmark test suites and derive their ground-truth order in effectiveness. Third, users apply the metric under evaluation to derive an order in effectiveness for the same test suites. Finally, users compute the agreement indicator between the metric-derived order and the ground-truth order.

Result. With ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative metrics, including mutation score and code coverage metrics. Our results show that, with real faults as the ground truth, mutation score and subsuming mutation score are the best metrics for quantifying test suite effectiveness. Meanwhile, when mutants are used in place of real faults, test suite effectiveness is overestimated by more than 20%.

Conclusion. We recommend that the standardized evaluation framework ASSENT be used to evaluate and compare test effectiveness metrics in future work.
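To make the workflow concrete, below is a minimal sketch of ASSENT's final step in Python. Everything here is illustrative rather than taken from the paper: it assumes the ground truth is the number of real faults each benchmark suite detects, the metric under evaluation is mutation score, and Kendall's tau serves as a stand-in agreement indicator (the paper defines its own indicator, which may differ).

# Minimal sketch of ASSENT's agreement computation (hypothetical data).
# Assumptions, not from the paper: ground truth = real faults detected per
# suite; metric under evaluation = mutation score; agreement indicator =
# Kendall's tau as a stand-in.
from scipy.stats import kendalltau

def assent_agreement(ground_truth, metric_values):
    """Agreement between the ground-truth effectiveness order and the order
    induced by a candidate metric, over the same benchmark test suites.
    Both arguments are parallel sequences with one value per suite."""
    tau, _p_value = kendalltau(ground_truth, metric_values)
    return tau

# Five hypothetical benchmark test suites:
real_faults_detected = [12, 9, 15, 7, 11]         # ground truth
mutation_scores = [0.80, 0.70, 0.90, 0.55, 0.65]  # metric under evaluation

print(f"Agreement (Kendall's tau): "
      f"{assent_agreement(real_faults_detected, mutation_scores):.2f}")

An agreement of 1.0 means the metric perfectly preserves the ground-truth order, while values near 0 mean it carries almost no ordering information. In this hypothetical data, one pair of suites is ranked discordantly, so the sketch prints 0.80.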

Published In

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 4
May 2024, 940 pages
EISSN: 1557-7392
DOI: 10.1145/3613665
Editor: Mauro Pezzè

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2024
Online AM: 05 December 2023
Accepted: 22 November 2023
Revised: 08 November 2023
Received: 30 April 2023
Published in TOSEM Volume 33, Issue 4

Author Tags

  1. Test suite effectiveness
  2. coverage testing
  3. mutation testing
  4. order preservation
  5. statistical indicators

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Jiangsu Province
  • NJU-Huawei Software New Technology Joint Laboratory Fund
