Automatic Unit Test Generation for Machine Learning Libraries: How Far Are We?

Published: 05 November 2021 (DOI: 10.1109/ICSE43902.2021.00138)

Abstract

Automatic unit test generation, which explores the input space and produces effective test cases for a given program, has been studied for decades, and many tools that generate unit tests with high structural coverage have been developed. However, existing test generation tools are evaluated mainly on general-purpose software. This calls into question their practical effectiveness and usefulness for machine learning libraries, which are statistically oriented and differ fundamentally in nature and construction from general software projects.
In this paper, we set out to investigate the effectiveness of existing unit test generation techniques on machine learning libraries. To this end, we conducted an empirical study on five widely used machine learning libraries with two popular unit test generation tools, EvoSuite and Randoop. We find that (1) most of the machine learning libraries do not maintain a high-quality unit test suite with respect to commonly applied quality metrics such as code coverage (34.1% on average) and mutation score (21.3% on average); (2) the test generation tools lead to clear improvements in code coverage and mutation score, but the improvement is limited; and (3) common patterns exist in the uncovered code across the five machine learning libraries that can be used to improve unit test generation.
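To make the testing challenge concrete, the sketch below shows a minimal hand-written JUnit 4 test of the kind these libraries would need. It is a hypothetical example, not taken from the study: Weka's J48 decision tree and the dataset path are assumptions chosen purely for illustration.

// Hypothetical JUnit 4 test for a Java ML library class.
// Weka's J48 and the ARFF path are illustrative assumptions,
// not artifacts from the paper's study.
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48SmokeTest {

    @Test
    public void trainedTreePredictsValidLabels() throws Exception {
        // Load a small dataset; the last attribute is the class label.
        Instances data = new DataSource("data/iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Train the decision tree on the full dataset.
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Weak oracle: every prediction must be a valid class index.
        // Asserting exact labels would tie the test to one version's
        // pruning and tie-breaking behavior.
        for (int i = 0; i < data.numInstances(); i++) {
            double predicted = tree.classifyInstance(data.instance(i));
            assertTrue(predicted >= 0 && predicted < data.numClasses());
        }
    }
}

Note how weak the available oracle is: without a known-correct model to compare against, the test can only check that predictions are well formed, not that they are right. A generation tool such as EvoSuite or Randoop, pointed at the same class, can explore constructors and method sequences to raise coverage, but it has no better oracle to offer, which is consistent with the limited improvement in mutation score that the study reports.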

      Published In

      ICSE '21: Proceedings of the 43rd International Conference on Software Engineering
      May 2021
      1768 pages
      ISBN:9781450390859

      Publisher

      IEEE Press

      Author Tags

      1. Empirical software engineering
      2. test case generation
      3. testing machine learning libraries
