DOI: 10.1145/3652620.3688201
Research Article
Open Access

Contract-based Validation of Conceptual Design Bugs for Engineering Complex Machine Learning Software

Published: 31 October 2024

Abstract

Context. Modern software systems increasingly contain one or more machine learning (ML) components. Current development practices are largely trial-and-error, posing a significant risk of introducing bugs. One type of bug is the "conceptual design bug": a mismatch between the properties of the input data and the prerequisites imposed by ML algorithms (e.g., using unscaled data with a scale-sensitive algorithm). These bugs are challenging to test at design time and manifest at runtime as crashes, as noticeably poor model performance, or not at all, threatening the system's robustness and transparency.

Objective. In this work, I propose the line of research I intend to pursue during my PhD, addressing conceptual design bugs in complex ML software from a prevention-oriented perspective. I intend to build open-source tooling that ML engineers can use to detect conceptual design bugs, enabling them to make quality assurances about the robustness of their system design.

Approach. We need to understand conceptual design bugs beyond the status quo, identifying their types, prevalence, impacts, and structural elements in the code. We will operationalize this knowledge into a tool that detects such bugs at design time, allowing ML engineers to resolve them before running their code and wasting resources. We anticipate this tool will leverage contract-based validation applied to partial models of ML software.

Evaluation. We plan to evaluate the tool in two ways using professional (industrial) ML software. First, we will study its effectiveness at detecting bugs at design time, assessing whether it fulfills its functional objective. Second, we will study its usability, assessing whether ML engineers benefit when such tools are introduced into their ML engineering workflow.
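To make the abstract's running example concrete, the following minimal sketch illustrates the kind of contract-based check the proposal envisions. It is illustrative only and not the proposed tool: the function name require_scaled_features, the tolerance value, and the choice of an RBF-kernel SVM as the scale-sensitive algorithm are assumptions introduced here. The idea is that a precondition-style contract flags the mismatch between unscaled input data and a scale-sensitive algorithm before any training resources are spent.

```python
# Hedged sketch (not the paper's tool): a contract-style precondition that catches
# the abstract's example conceptual design bug -- unscaled data fed to a
# scale-sensitive algorithm -- before training starts.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC  # RBF-kernel SVMs are sensitive to feature scale

def require_scaled_features(X, tol=5.0):
    """Precondition: each feature should be roughly zero-mean with bounded spread."""
    if np.any(np.abs(X.mean(axis=0)) > tol) or np.any(X.std(axis=0) > tol):
        raise ValueError("Conceptual design bug: features appear unscaled; "
                         "scale them (e.g., with StandardScaler) before fitting an SVM.")

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200),             # well-scaled feature
                     rng.normal(50_000, 10_000, 200)])  # feature in the tens of thousands
y = (X[:, 0] > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)
require_scaled_features(X_scaled)    # contract satisfied
SVC(kernel="rbf").fit(X_scaled, y)   # safe to train

try:
    require_scaled_features(X)       # unscaled features: contract violated
except ValueError as err:
    print(f"Caught before training: {err}")
```

A design-time tool, as proposed in the abstract, would evaluate such contracts statically over a partial model of the ML pipeline rather than executing them at runtime; the runtime check above only serves to make the contract's intent tangible.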



Published In

MODELS Companion '24: Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems
September 2024
1261 pages
ISBN:9798400706226
DOI:10.1145/3652620
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. machine learning
  2. software bugs
  3. software design
  4. knowledge mining
  5. software contracts
  6. empirical software engineering

Qualifiers

  • Research-article

Conference

MODELS Companion '24

Acceptance Rates

Overall Acceptance Rate 144 of 506 submissions, 28%

