DOI: 10.1145/2950290.2950324

Can testedness be effectively measured?

Published: 01 November 2016

Abstract

Among the major questions that a practicing tester faces are deciding where to focus additional testing effort, and deciding when to stop testing. "Test the least-tested code, and stop when all code is well-tested" is a reasonable answer. Many measures of "testedness" have been proposed; unfortunately, we do not know whether these are truly effective. In this paper we propose a novel evaluation of two of the most important and widely-used measures of test suite quality. The first measure is statement coverage, the simplest and best-known code coverage measure. The second measure is mutation score, a supposedly more powerful, though expensive, measure.
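For concreteness, a minimal sketch of how these two measures are typically computed is shown below. It is not taken from the paper; the TestResults container and its fields are hypothetical names used only for illustration.

```python
# Illustrative-only sketch of the two measures discussed above.
# TestResults and its fields are hypothetical names, not an API from the paper or from PIT.
from dataclasses import dataclass

@dataclass
class TestResults:
    executed_statements: set  # statements hit by at least one test
    all_statements: set       # all executable statements in the program element
    killed_mutants: int       # mutants detected ("killed") by the test suite
    total_mutants: int        # mutants generated for the program element

def statement_coverage(r: TestResults) -> float:
    """Fraction of executable statements exercised by the test suite."""
    return len(r.executed_statements) / len(r.all_statements)

def mutation_score(r: TestResults) -> float:
    """Fraction of generated mutants that the test suite kills."""
    return r.killed_mutants / r.total_mutants
```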
We evaluate these measures using the actual criteria of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug-fixes than a "poorly tested" element. If not, then it seems likely that we are not effectively measuring testedness. Using a large number of open source Java programs from GitHub and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug-fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements for various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug-fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.
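The following sketch illustrates the shape of such an evaluation, assuming per-element testedness scores and counts of later bug-fixes are already available. The data is invented, and the statistical tests shown (Spearman rank correlation, Mann-Whitney U) merely stand in for whatever analysis the paper actually performs.

```python
# Illustrative sketch only: correlate a testedness measure with later bug-fixes,
# and compare elements with any coverage against elements with none.
# The numbers below are invented.
from scipy.stats import spearmanr, mannwhitneyu

# Hypothetical per-method data: statement coverage and future bug-fix count.
coverage  = [0.0, 0.0, 0.2, 0.5, 0.7, 0.8, 0.9, 1.0]
bug_fixes = [3,   2,   2,   1,   1,   0,   1,   0]

# Rank correlation between the measure and future bug-fixes.
rho, p = spearmanr(coverage, bug_fixes)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

# Binary criterion: covered by any test case vs. not covered at all.
covered   = [b for c, b in zip(coverage, bug_fixes) if c > 0]
uncovered = [b for c, b in zip(coverage, bug_fixes) if c == 0]
stat, p = mannwhitneyu(uncovered, covered, alternative="greater")
print(f"Mann-Whitney U = {stat:.1f} (p = {p:.3f})")
```

A weak negative rho combined with a significant difference for the binary split would mirror the pattern the abstract reports.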

Published In

FSE 2016: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering
November 2016
1156 pages
ISBN:9781450342186
DOI:10.1145/2950290

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. coverage criteria
  2. mutation testing
  3. statistical analysis
  4. test suite evaluation

Qualifiers

  • Research-article

Conference

FSE'16

Acceptance Rates

Overall acceptance rate: 17 of 128 submissions (13%)
