Towards Causal Analysis of Empirical Software Engineering Data: The Impact of Programming Languages on Coding Competitions

Authors:

Carlo A. Furia,

Richard Torkar,

Robert Feldt

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 1

Article No.: 13, Pages 1 - 35

https://doi.org/10.1145/3611667

Published: 24 November 2023

Abstract

There is abundant observational data in the software engineering domain, whereas running large-scale controlled experiments is often practically impossible. Thus, most empirical studies can only report statistical correlations—instead of potentially more insightful and robust causal relations.

To support analyzing purely observational data for causal relations and to assess any differences between purely predictive and causal models of the same data, this article discusses some novel techniques based on structural causal models (such as directed acyclic graphs of causal Bayesian networks). Using these techniques, one can rigorously express, and partially validate, causal hypotheses and then use the causal information to guide the construction of a statistical model that captures genuine causal relations—such that correlation does imply causation.
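As a rough illustration of what "partially validating" a causal hypothesis can look like, the sketch below checks one testable implication of a small directed acyclic graph. The chain `practice -> skill -> score`, the coefficients, and the simulated data are all hypothetical and are not taken from the article; the check also uses a plain partial correlation rather than the Bayesian machinery the article relies on. In a linear Gaussian setting, the chain implies that `practice` and `score` are uncorrelated once `skill` is controlled for.

```python
import math
import random

# Hypothetical chain-shaped DAG (an assumption for illustration only):
#   practice -> skill -> score
# The graph implies: practice is independent of score given skill.

def simulate_chain(n=20000, seed=1):
    """Draw synthetic data that follows the hypothetical chain."""
    rng = random.Random(seed)
    practice, skill, score = [], [], []
    for _ in range(n):
        p = rng.gauss(0.0, 1.0)
        s = 0.8 * p + rng.gauss(0.0, 1.0)  # skill depends only on practice
        y = 0.8 * s + rng.gauss(0.0, 1.0)  # score depends only on skill
        practice.append(p)
        skill.append(s)
        score.append(y)
    return practice, skill, score

def pearson(xs, ys):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after linearly controlling for zs."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

practice, skill, score = simulate_chain()
marginal = pearson(practice, score)
conditional = partial_corr(practice, score, skill)
print(f"corr(practice, score)         = {marginal:.3f}")
print(f"corr(practice, score | skill) = {conditional:.3f}")
```

If real data showed a clearly nonzero partial correlation here, that would count as evidence against the hypothesized graph. A value near zero is compatible with the graph but does not prove it, which is why such validation is only partial.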

We apply these ideas to analyzing public data about programmer performance in Code Jam, a large world-wide coding contest organized by Google every year. Specifically, we look at the impact of different programming languages on a participant’s performance in the contest. While the overall effect associated with programming languages is weak compared to other variables—regardless of whether we consider correlational or causal links—we found considerable differences between a purely associational and a causal analysis of the very same data.

The takeaway message is that even an imperfect causal analysis of observational data can help answer the salient research questions more precisely and more robustly than purely predictive techniques, in which genuine causal effects may be confounded.
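The gap between an associational and a causal reading of the same data can be sketched with a small simulation. Everything below is a hypothetical assumption rather than the article's data or model: a single confounder `experience` drives both the choice of `language` and the `score`, and the language itself has a true effect of zero. Ordinary least squares stands in for the article's statistical models, and adjusting for the confounder plays the role of the adjustment that a causal graph would recommend.

```python
import random

def simulate(n=20000, seed=0):
    """Synthetic contest data under a hypothetical confounded design."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        experience = rng.gauss(0.0, 1.0)  # confounder
        # Experienced participants are assumed more likely to pick language A (coded 1.0).
        p_lang_a = 0.8 if experience > 0 else 0.2
        language = 1.0 if rng.random() < p_lang_a else 0.0
        # The true causal effect of language on score is set to zero.
        score = 2.0 * experience + 0.0 * language + rng.gauss(0.0, 1.0)
        rows.append((experience, language, score))
    return rows

def naive_effect(rows):
    """Difference in mean score between the two language groups (association only)."""
    a = [s for _, lang, s in rows if lang == 1.0]
    b = [s for _, lang, s in rows if lang == 0.0]
    return sum(a) / len(a) - sum(b) / len(b)

def adjusted_effect(rows):
    """OLS coefficient of language in the regression score ~ language + experience."""
    n = len(rows)
    me = sum(e for e, _, _ in rows) / n
    ml = sum(lang for _, lang, _ in rows) / n
    ms = sum(s for _, _, s in rows) / n
    see = sum((e - me) ** 2 for e, _, _ in rows)
    sll = sum((lang - ml) ** 2 for _, lang, _ in rows)
    sel = sum((e - me) * (lang - ml) for e, lang, _ in rows)
    ses = sum((e - me) * (s - ms) for e, _, s in rows)
    sls = sum((lang - ml) * (s - ms) for _, lang, s in rows)
    # Solve the 2x2 normal equations for the language coefficient.
    return (sls * see - sel * ses) / (sll * see - sel ** 2)

rows = simulate()
print(f"naive difference in means: {naive_effect(rows):.3f}")
print(f"adjusted for experience:   {adjusted_effect(rows):.3f}")
```

In this setup the naive comparison reports a sizable advantage for language A, while the adjusted estimate stays close to the true value of zero. The adjustment is only as good as the assumed graph: an unmeasured confounder, or adjusting for the wrong variable, would still bias the estimate.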

Published In

ACM Transactions on Software Engineering and Methodology Volume 33, Issue 1

January 2024

933 pages

EISSN:1557-7392

DOI:10.1145/3613536

Editor:
Mauro Pezzè
USI Università della Svizzera italiana and SIT Schaffhausen Institute of Technology, Switzerland

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 November 2023

Online AM: 19 August 2023

Accepted: 07 July 2023

Revised: 27 June 2023

Received: 22 January 2023

Published in TOSEM Volume 33, Issue 1
