research-article

Open access

Data Analysts and Their Software Practices: A Profile of the Sabermetrics Community and Beyond

Authors:

Justin Middleton,

Emerson Murphy-Hill,

Kathryn T. Stolee

Proceedings of the ACM on Human-Computer Interaction, Volume 4, Issue CSCW1

Article No.: 52, Pages 1 - 27

https://doi.org/10.1145/3392859

Published: 29 May 2020

Abstract

For modern data analytics, practices from software development are increasingly necessary to manage data, but they must be incorporated alongside other statistical and scientific skills. Therefore, we ask: how does a community recontextualize software development through the unique pressures of their work? To answer this, we explore the analytic community around baseball, or sabermetrics. To discover software development's place in the search for robust statistical insight in sports, we interview 10 participants in the sabermetric community and survey over 120 more data analysts, both in baseball and not. We explore how their work lives at the intersection of science and entertainment, and as a consequence, baseball data serves as an accessible yet deep subject to practice analytic skills. Software development exists within an iterative research process that cycles between defining rigorous statistical methods and preserving the flexibility to chase interesting problems. In this question-driven process, members of the community inhabit several overlapping roles of intentional work, in which software development can become the priority to support research and statistical infrastructure, and we discuss the way that the community can foster the balance of these skills.

Supplementary Material

ZIP File (v4cscw052aux.zip), 1.00 MB

Cited By

Smith M, Cito J, Lu K, Veeramachaneni K. (2021). Enabling Collaborative Data Science Development with the Ballet Framework. Proceedings of the ACM on Human-Computer Interaction 5:CSCW2 (1-39). Online publication date: 18-Oct-2021
https://dl.acm.org/doi/10.1145/3479575
Shrestha N, Barik T, Parnin C. (2021). Remote, but Connected. Proceedings of the ACM on Human-Computer Interaction 5:CSCW1 (1-31). Online publication date: 22-Apr-2021
https://dl.acm.org/doi/10.1145/3449126

Index Terms

Data Analysts and Their Software Practices: A Profile of the Sabermetrics Community and Beyond
1. Human-centered computing
  1. Collaborative and social computing
    1. Collaborative and social computing systems and tools
    2. Empirical studies in collaborative and social computing
2. Software and its engineering
  1. Software creation and management
    1. Collaboration in software development

Published In

Proceedings of the ACM on Human-Computer Interaction Volume 4, Issue CSCW1

CSCW

May 2020

1285 pages

EISSN:2573-0142

DOI:10.1145/3403424

Editors:
Cliff Lampe (University of Michigan),
Jeff Nichols (Google),
Karrie Karahalios (University of Illinois Urbana-Champaign),
Geraldine Fitzpatrick (Vienna University of Technology),
Uichin Lee (KAIST),
Andres Monroy-Hernandez (Microsoft Research),
Wolfgang Stuerzlinger (Simon Fraser University, Canada)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation
