DOI: 10.1145/3597503.3639196
research-article

Mining Pull Requests to Detect Process Anomalies in Open Source Software Development

Published: 12 April 2024

Abstract

Trustworthy Open Source Software (OSS) development processes underpin the long-term trustworthiness of software projects and products. To investigate the trustworthiness of the Pull Request (PR) process, the common model of collaborative development in the OSS community, we use process mining to identify and analyze the normal and anomalous patterns of PR processes. We propose an approach that identifies anomalies from both control-flow and semantic aspects, then analyzes and synthesizes the root causes of the identified anomalies. We analyze 17,531 PRs from 18 OSS projects on GitHub, extracting 26 root causes of control-flow anomalies and 19 root causes of semantic anomalies. We find that PRs rarely contain both semantic and control-flow anomalies, and that projects' internal custom rules may be the key cause of the identified anomalous PRs. We further discover and analyze the patterns of normal PR processes. We find that PRs in the non-fork model (42%) are far more likely than PRs in the fork model (5%) to bypass the review process, indicating a higher potential risk. We also analyze nine poisoned projects and find that their PR practices were indeed worse. Given the complex and diverse PR processes in the OSS community, the proposed approach can help identify and understand not only anomalous PRs but also normal PRs, offering early risk indications of suspicious incidents (such as poisoning) in the OSS supply chain.
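
The control-flow side of such an analysis can be illustrated with a minimal sketch. It is not the authors' approach: it assumes each PR has already been reduced to an ordered list of activity names, and the names (`opened`, `reviewed`, `merged`, `closed`), the sample traces, and both helper functions are hypothetical. It flags PRs whose exact activity sequence is rare, and separately checks whether a PR was merged without a preceding review.

```python
from collections import Counter

# Hypothetical PR event traces: PR id -> ordered list of activity names.
# The activity names are illustrative assumptions, not the paper's event schema.
TRACES = {
    1: ["opened", "reviewed", "merged"],
    2: ["opened", "reviewed", "merged"],
    3: ["opened", "reviewed", "merged"],
    4: ["opened", "merged"],
    5: ["opened", "reviewed", "closed"],
    6: ["opened", "reviewed", "merged"],
}


def rare_variants(traces, min_share):
    """Return ids of PRs whose exact activity sequence covers
    less than min_share of all traces (a simple frequency heuristic)."""
    counts = Counter(tuple(t) for t in traces.values())
    total = len(traces)
    return sorted(
        pr for pr, t in traces.items() if counts[tuple(t)] / total < min_share
    )


def bypasses_review(trace):
    """True if the PR was merged with no 'reviewed' event before the merge."""
    if "merged" not in trace:
        return False
    return "reviewed" not in trace[: trace.index("merged")]


if __name__ == "__main__":
    print("rare variants:", rare_variants(TRACES, min_share=0.2))
    print("review bypassed:", [pr for pr, t in TRACES.items() if bypasses_review(t)])
```

A frequency threshold alone would also flag legitimate but uncommon paths, such as PR 5 above, which was reviewed and then closed. That limitation is one reason a root-cause analysis of flagged PRs, as the abstract describes, is still needed.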

Information & Contributors

Information

Published In

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering
May 2024
2942 pages
ISBN:9798400702174
DOI:10.1145/3597503
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • Faculty of Engineering of University of Porto

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. open source software development
  2. process mining
  3. pull request

Qualifiers

  • Research-article

Conference

ICSE '24
Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
