research-article

Towards Causal Analysis of Empirical Software Engineering Data: The Impact of Programming Languages on Coding Competitions

Published: 24 November 2023

Abstract

There is abundant observational data in the software engineering domain, whereas running large-scale controlled experiments is often practically impossible. Thus, most empirical studies can only report statistical correlations—instead of potentially more insightful and robust causal relations.
To support analyzing purely observational data for causal relations and to assess any differences between purely predictive and causal models of the same data, this article discusses some novel techniques based on structural causal models (such as directed acyclic graphs of causal Bayesian networks). Using these techniques, one can rigorously express, and partially validate, causal hypotheses and then use the causal information to guide the construction of a statistical model that captures genuine causal relations—such that correlation does imply causation.
We apply these ideas to analyzing public data about programmer performance in Code Jam, a large world-wide coding contest organized by Google every year. Specifically, we look at the impact of different programming languages on a participant’s performance in the contest. While the overall effect associated with programming languages is weak compared to other variables—regardless of whether we consider correlational or causal links—we found considerable differences between a purely associational and a causal analysis of the very same data.
The takeaway message is that even an imperfect causal analysis of observational data can help answer the salient research questions more precisely and more robustly than purely predictive techniques, where genuine causal effects may be confounded.
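To make the contrast between an associational and a causal estimate concrete, the following minimal sketch applies backdoor adjustment to a small, entirely hypothetical data set. It is not the article's analysis or data: the variable names, the cell counts, the mean scores, and the assumed causal graph (experience influences both language choice and score, and language influences score) are illustrative assumptions only.

```python
# Hypothetical aggregated data, NOT taken from the article.
# Key: (experience, language) with experience in {0, 1} and language in {0, 1}.
# Value: (number of participants, mean score in that cell).
cells = {
    (0, 0): (400, 50.0),
    (0, 1): (100, 52.0),
    (1, 0): (100, 70.0),
    (1, 1): (400, 72.0),
}


def naive_difference(cells):
    """Associational estimate: E[score | language=1] - E[score | language=0]."""
    def mean_for(language):
        total_n = sum(n for (_, x), (n, _) in cells.items() if x == language)
        total_y = sum(n * m for (_, x), (n, m) in cells.items() if x == language)
        return total_y / total_n

    return mean_for(1) - mean_for(0)


def backdoor_adjusted_difference(cells):
    """Causal estimate under the assumed graph, adjusting for experience:
    sum over z of P(z) * (E[score | language=1, z] - E[score | language=0, z]).
    """
    total = sum(n for n, _ in cells.values())
    strata = sorted({z for z, _ in cells})
    effect = 0.0
    for z in strata:
        n_z = cells[(z, 0)][0] + cells[(z, 1)][0]
        within_stratum_diff = cells[(z, 1)][1] - cells[(z, 0)][1]
        effect += (n_z / total) * within_stratum_diff
    return effect


if __name__ == "__main__":
    print("naive difference:   ", naive_difference(cells))
    print("adjusted difference:", backdoor_adjusted_difference(cells))
```

With these made-up numbers, the naive difference is 14 points while the adjusted difference is 2 points: most of the apparent language effect comes from experienced participants preferring language 1. The adjustment is only valid if the assumed graph is correct and experience is the sole confounder, which is exactly the kind of hypothesis a structural causal model makes explicit.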

Published In

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 1
January 2024
933 pages
EISSN: 1557-7392
DOI: 10.1145/3613536
Editor: Mauro Pezzè

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 November 2023
Online AM: 19 August 2023
Accepted: 07 July 2023
Revised: 27 June 2023
Received: 22 January 2023
Published in TOSEM Volume 33, Issue 1

Author Tags

  1. Causality analysis
  2. statistical analysis
  3. empirical software engineering
  4. programming contests

Qualifiers

  • Research-article
