DOI: 10.1145/3652620.3688201
Research Article
Open Access

Contract-based Validation of Conceptual Design Bugs for Engineering Complex Machine Learning Software

Published: 31 October 2024

Abstract

Context. Modern software systems increasingly contain one or more machine learning (ML) components. Current development practices are largely trial-and-error, posing a significant risk of introducing bugs. One type of bug is the "conceptual design bug": a mismatch between the properties of the input data and the prerequisites imposed by ML algorithms (e.g., using unscaled data with a scale-sensitive algorithm). These bugs are challenging to test at design time and manifest at runtime as crashes, as noticeably poor model performance, or not at all, threatening the system's robustness and transparency.

Objective. In this work, I propose the line of research I intend to pursue during my PhD, addressing conceptual design bugs in complex ML software from a prevention-oriented perspective. I intend to build open-source tooling that ML engineers can use to detect conceptual design bugs, enabling them to make quality assurances about the robustness of their system design.

Approach. We need to understand conceptual design bugs beyond the status quo, identifying their types, prevalence, impacts, and structural elements in the code. We will operationalize this knowledge into a tool that detects such bugs at design time, allowing ML engineers to resolve them before running their code and wasting resources. We anticipate this tool will leverage contract-based validation applied to partial models of ML software.

Evaluation. We plan to evaluate the tool in two ways using professional (industrial) ML software. First, we will study its effectiveness at detecting bugs at design time, assessing whether it fulfills its functional objective. Second, we will study its usability, assessing whether ML engineers benefit when such tools are introduced into their ML engineering workflow.
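To make the abstract's running example concrete, the following minimal sketch illustrates the kind of contract-based check the proposal envisions. It is illustrative only and not the proposed tool: the function name require_scaled_features, the tolerance value, and the choice of an RBF-kernel SVM as the scale-sensitive algorithm are assumptions introduced here. The idea is that a precondition-style contract flags the mismatch between unscaled input data and a scale-sensitive algorithm before any training resources are spent.

```python
# Hedged sketch (not the paper's tool): a contract-style precondition that catches
# the abstract's example conceptual design bug -- unscaled data fed to a
# scale-sensitive algorithm -- before training starts.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC  # RBF-kernel SVMs are sensitive to feature scale

def require_scaled_features(X, tol=5.0):
    """Precondition: each feature should be roughly zero-mean with bounded spread."""
    if np.any(np.abs(X.mean(axis=0)) > tol) or np.any(X.std(axis=0) > tol):
        raise ValueError("Conceptual design bug: features appear unscaled; "
                         "scale them (e.g., with StandardScaler) before fitting an SVM.")

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200),             # well-scaled feature
                     rng.normal(50_000, 10_000, 200)])  # feature in the tens of thousands
y = (X[:, 0] > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)
require_scaled_features(X_scaled)    # contract satisfied
SVC(kernel="rbf").fit(X_scaled, y)   # safe to train

try:
    require_scaled_features(X)       # unscaled features: contract violated
except ValueError as err:
    print(f"Caught before training: {err}")
```

A design-time tool, as proposed in the abstract, would evaluate such contracts statically over a partial model of the ML pipeline rather than executing them at runtime; the runtime check above only serves to make the contract's intent tangible.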



Published In

MODELS Companion '24: Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems
September 2024
1261 pages
ISBN:9798400706226
DOI:10.1145/3652620
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. machine learning
  2. software bugs
  3. software design
  4. knowledge mining
  5. software contracts
  6. empirical software engineering

Qualifiers

  • Research-article

Conference

MODELS Companion '24

Acceptance Rates

Overall Acceptance Rate 144 of 506 submissions, 28%

