research-article
Open access

Data Analysts and Their Software Practices: A Profile of the Sabermetrics Community and Beyond

Published: 29 May 2020

Abstract

For modern data analytics, practices from software development are increasingly necessary to manage data, but they must be incorporated alongside other statistical and scientific skills. Therefore, we ask: how does a community recontextualize software development through the unique pressures of their work? To answer this, we explore the analytic community around baseball, or sabermetrics. To discover software development's place in the search for robust statistical insight in sports, we interview 10 participants in the sabermetric community and survey over 120 more data analysts, both in baseball and not. We explore how their work lives at the intersection of science and entertainment, and as a consequence, baseball data serves as an accessible yet deep subject to practice analytic skills. Software development exists within an iterative research process that cycles between defining rigorous statistical methods and preserving the flexibility to chase interesting problems. In this question-driven process, members of the community inhabit several overlapping roles of intentional work, in which software development can become the priority to support research and statistical infrastructure, and we discuss the way that the community can foster the balance of these skills.

Supplementary Material

ZIP File (v4cscw052aux.zip)

Published In

Proceedings of the ACM on Human-Computer Interaction  Volume 4, Issue CSCW1
CSCW
May 2020
1285 pages
EISSN:2573-0142
DOI:10.1145/3403424
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data analysts
  2. end-user communities
  3. end-user software engineering
  4. software process
