
Neuron importance-aware coverage analysis for deep neural network testing

Published: 25 July 2024

Abstract

Deep Neural Network (DNN) models are widely used in cutting-edge domains such as medical diagnostics and autonomous driving, and the need to test them thoroughly has become increasingly urgent. Recent research proposes various structural and non-structural coverage criteria to measure test adequacy. Structural coverage criteria quantify the degree to which a test suite covers the internal elements of a DNN model; however, they convey little information about individual inputs and exhibit limited correlation with defect detection. Existing non-structural coverage criteria, in turn, are unaware of each neuron's importance to decision-making. This paper addresses these limitations by proposing novel non-structural coverage criteria. By tracing neurons' cumulative contribution to the final decision over the training set, this paper identifies the important neurons of a DNN model. A novel metric is proposed to quantify the difference in important-neuron behavior between a test input and the training set, providing a measure at the granularity of individual test inputs. Building on this metric, two non-structural coverage criteria are introduced that quantify test adequacy by examining differences in important-neuron behavior between the test set and the training set. An empirical evaluation on image datasets demonstrates that the proposed metric outperforms existing non-structural adequacy metrics, improving accuracy in capturing error-revealing test inputs by up to 14.7%. Compared with state-of-the-art coverage criteria, the proposed criteria are also more sensitive to errors, including natural errors and adversarial examples.
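The two core ideas in the abstract, ranking neurons by their cumulative contribution on the training set and then measuring how a test input's important-neuron activations deviate from training-set behavior, can be illustrated with a minimal sketch. This is not the paper's actual algorithm (the paper traces contributions to the final decision, e.g. via relevance-propagation-style attribution); here, mean absolute activation stands in as a simple contribution proxy, and the function names `important_neurons` and `behavior_difference` are hypothetical.

```python
import numpy as np

def important_neurons(train_activations, top_k):
    """Return indices of the top_k neurons with the largest cumulative
    (absolute) activation over the training set.

    train_activations: array of shape (n_samples, n_neurons) holding one
    layer's activations for every training input.
    """
    contribution = np.abs(train_activations).sum(axis=0)
    # Sort descending by cumulative contribution, keep the top_k indices.
    return np.argsort(contribution)[::-1][:top_k]

def behavior_difference(test_activation, train_activations, idx):
    """Distance between a single test input's important-neuron activations
    and the training set's mean activations on those neurons (one simple
    choice of distance; the paper's metric may differ)."""
    train_mean = train_activations[:, idx].mean(axis=0)
    return np.linalg.norm(test_activation[idx] - train_mean)

# Toy example: 2 training inputs, a 3-neuron layer.
train = np.array([[1., 0., 3.],
                  [2., 0., 1.]])
idx = important_neurons(train, top_k=2)      # neurons 2 and 0 dominate
score = behavior_difference(np.array([1., 5., 2.]), train, idx)
```

A large `behavior_difference` score would flag a test input whose important neurons behave unlike anything seen in training, which is the intuition behind using such scores both to prioritize individual inputs and to aggregate them into a coverage criterion.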



Information

Published In

Empirical Software Engineering  Volume 29, Issue 5
Sep 2024
1352 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 25 July 2024
Accepted: 02 July 2024

Author Tags

  1. Deep neural network testing
  2. Test adequacy
  3. Testing coverage criteria
  4. Non-structural coverage

Qualifiers

  • Research-article
