
Neuron importance-aware coverage analysis for deep neural network testing

Published: 25 July 2024

Abstract

Deep Neural Network (DNN) models are widely used in cutting-edge domains such as medical diagnostics and autonomous driving, and the need to test them thoroughly has become increasingly urgent. Recent research proposes various structural and non-structural coverage criteria to measure test adequacy. Structural coverage criteria quantify the degree to which a test suite covers the internal elements of a DNN model; however, they convey little information about individual inputs and exhibit limited correlation with defect detection. Existing non-structural coverage criteria, in turn, are unaware of each neuron's importance to decision-making. This paper addresses these limitations by proposing novel non-structural coverage criteria. By tracing neurons' cumulative contribution to the final decision over the training set, this paper identifies the important neurons of a DNN model. A novel metric is proposed to quantify the difference in important-neuron behavior between a test input and the training set, providing a measure at the granularity of individual test inputs. Building on this metric, two non-structural coverage criteria are introduced that quantify test adequacy by examining differences in important-neuron behavior between the test set and the training set. An empirical evaluation on image datasets demonstrates that the proposed metric outperforms existing non-structural adequacy metrics, improving accuracy in capturing error-revealing test inputs by up to 14.7%. Compared with state-of-the-art coverage criteria, the proposed criteria are also more sensitive to errors, including natural errors and adversarial examples.
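The two core ideas in the abstract, ranking neurons by their cumulative contribution on the training set and then measuring how a test input's important-neuron activations deviate from training-set behavior, can be illustrated with a minimal sketch. This is not the paper's actual algorithm (the paper traces contributions to the final decision, e.g. via relevance-propagation-style attribution); here, mean absolute activation stands in as a simple contribution proxy, and the function names `important_neurons` and `behavior_difference` are hypothetical.

```python
import numpy as np

def important_neurons(train_activations, top_k):
    """Return indices of the top_k neurons with the largest cumulative
    (absolute) activation over the training set.

    train_activations: array of shape (n_samples, n_neurons) holding one
    layer's activations for every training input.
    """
    contribution = np.abs(train_activations).sum(axis=0)
    # Sort descending by cumulative contribution, keep the top_k indices.
    return np.argsort(contribution)[::-1][:top_k]

def behavior_difference(test_activation, train_activations, idx):
    """Distance between a single test input's important-neuron activations
    and the training set's mean activations on those neurons (one simple
    choice of distance; the paper's metric may differ)."""
    train_mean = train_activations[:, idx].mean(axis=0)
    return np.linalg.norm(test_activation[idx] - train_mean)

# Toy example: 2 training inputs, a 3-neuron layer.
train = np.array([[1., 0., 3.],
                  [2., 0., 1.]])
idx = important_neurons(train, top_k=2)      # neurons 2 and 0 dominate
score = behavior_difference(np.array([1., 5., 2.]), train, idx)
```

A large `behavior_difference` score would flag a test input whose important neurons behave unlike anything seen in training, which is the intuition behind using such scores both to prioritize individual inputs and to aggregate them into a coverage criterion.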



Information

Published In

Empirical Software Engineering  Volume 29, Issue 5
Sep 2024
1352 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 25 July 2024
Accepted: 02 July 2024

Author Tags

  1. Deep neural network testing
  2. Test adequacy
  3. Testing coverage criteria
  4. Non-structural coverage

Qualifiers

  • Research-article
