
Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Published: 11 November 2022

Abstract

Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advocate for increased data documentation and have proposed several data documentation frameworks. However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners, who both create and consume datasets. To address this gap, we set out to understand ML practitioners' data documentation perceptions, needs, challenges, and desiderata, with the ultimate goal of deriving design requirements that can inform future data documentation frameworks. We conducted a series of semi-structured interviews with 14 ML practitioners at a single large, international technology company, asking each of them to answer a list of questions taken from datasheets for datasets [19]. Our findings show that current approaches to data documentation are largely ad hoc and myopic in nature. Participants expressed needs for data documentation frameworks to be adaptable to their contexts, integrated into their existing tools and workflows, and automated wherever possible. Although data documentation frameworks are often motivated from the perspective of responsible AI, participants did not make the connection between the questions that they were asked to answer and their responsible AI implications. In addition, participants often had difficulties prioritizing the needs of dataset consumers and providing information that someone unfamiliar with their datasets might need to know. Based on these findings, we derive seven design requirements for future data documentation frameworks, such as more actionable guidance on how the characteristics of datasets might result in harms and how these harms might be mitigated, more explicit prompts for reflection, automated adaptation to different contexts, and integration into ML practitioners' existing tools and workflows.
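To make the last two design requirements concrete, the sketch below shows one hypothetical way datasheet-style prompts could be embedded directly in an ML workflow: descriptive statistics are filled in automatically, and empty free-text answers block the pipeline as an explicit prompt for reflection. This is an illustrative example, not an artifact from the paper; the field names loosely follow the question categories of datasheets for datasets [19], and all class and function names are invented for this sketch.

    # Hypothetical sketch: embedding datasheet-style prompts in an ML workflow.
    # Field names loosely follow the question categories in datasheets for
    # datasets [19]; everything else here is invented for illustration.
    from dataclasses import dataclass, asdict
    import json


    @dataclass
    class Datasheet:
        """Structured answers to a small subset of datasheet-style questions."""
        motivation: str          # Why was the dataset created?
        composition: str         # What do the instances represent?
        collection_process: str  # How was the data collected, and by whom?
        known_limitations: str   # Known gaps, skews, or potential harms
        recommended_uses: str    # Uses the creators consider appropriate
        num_instances: int = 0   # Filled in automatically below


    def document_dataset(records, sheet, path):
        """Validate and save documentation alongside a dataset.

        Quantitative fields are computed from the data itself (automation),
        while empty free-text answers raise an error (a prompt for
        reflection) instead of silently shipping an incomplete datasheet.
        """
        sheet.num_instances = len(records)  # automated, not hand-entered
        missing = [k for k, v in asdict(sheet).items() if v == ""]
        if missing:
            raise ValueError(f"Datasheet incomplete; please answer: {missing}")
        with open(path, "w") as f:
            json.dump(asdict(sheet), f, indent=2)


    # Example usage with a toy dataset.
    records = [{"text": "example", "label": 0}]
    sheet = Datasheet(
        motivation="Toy example for illustration.",
        composition="Short English text snippets with binary labels.",
        collection_process="Hand-written by the dataset creator.",
        known_limitations="Not representative of any real population.",
        recommended_uses="Demonstration only.",
    )
    document_dataset(records, sheet, "datasheet.json")

Coupling the datasheet to the code that builds or loads the dataset is one way the "integration into existing tools and workflows" requirement could be realized, since the documentation then travels with the data rather than drifting out of date in a separate document.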

Supplementary Material

PDF File (v6cscw2340aux.pdf)
This is the form provided to participants in the study presented in the CSCW 2022 paper "Understanding ML Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata" by Amy K. Heger et al.

References

[1]
Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 291--300.
[2]
Matthew Arnold, Rachel KE Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilović, Ravi Nair, K Natesan Ramamurthy, Alexandra Olteanu, David Piorkowski, et al. 2019. FactSheets: Increasing trust in AI services through supplier's declarations of conformity. IBM Journal of Research and Development, Vol. 63, 4/5 (2019), 6:1--6:13.
[3]
Lora Aroyo and Chris Welty. 2015. Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation. AI Magazine, Vol. 36, 1 (2015), 15--24.
[4]
Solon Barocas, Asia J. Biega, Benjamin Fish, Jędrzej Niklas, and Luke Stark. 2020. When Not to Design, Build, or Deploy (CRAFT session). In Proceedings of the Conference on Fairness, Accountability, and Transparency.
[5]
Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, Duncan Wadsworth, and Hanna Wallach. 2021. Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs. In Proceedings of the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES).
[6]
Eric P. S. Baumer and M. Six Silberman. 2011. When the Implication Is Not to Design (Technology). In Proceedings of the ACM International Conference on Human Factors in Computing Systems (CHI). 2271--2274.
[7]
Emily M Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, Vol. 6 (2018), 587--604.
[8]
Margarita Boyarskaya, Alexandra Olteanu, and Kate Crawford. 2020. Overcoming Failures of Imagination in AI Infused System Development and Deployment. In NeurIPS Workshop on Navigating the Broader Impacts of AI Research.
[9]
Karen Boyd. 2020. Ethical Sensitivity in Machine Learning Development. In Conference Companion Publication of the 2020 on Computer Supported Cooperative Work and Social Computing. 87--92.
[10]
Karen L. Boyd. 2021. Datasheets for Datasets Help ML Engineers Notice and Understand Ethical Issues in Training Data. Proceedings of the ACM on Human-Computer Interaction, Vol. 5, CSCW2, Article 438, 27 pages.
[11]
Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology, Vol. 3, 2 (2006), 77--101. https://doi.org/10.1191/1478088706qp063oa
[12]
Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the Conference on Fairness, Accountability and Transparency (FAT*). 77--91.
[13]
Yang Trista Cao and Hal Daumé III. 2020. Toward Gender-Inclusive Coreference Resolution. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).
[14]
Kasia S. Chmielinski, Sarah Newman, Matt Taylor, Josh Joseph, Kemi Thomas, Jessica Yurkofsky, and Yue Chelsea Qiu. 2020. The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence. In NeurIPS Workshop on Dataset Curation and Security.
[15]
Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, Vol. 5, 2 (2017), 153--163.
[16]
Kevin Crowston, Jeffery S Saltz, Amira Rezgui, Yatish Hegde, and Sangseok You. 2019. MIDST: A System to Support Stigmergic Coordination in Data-Science Teams. In Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing. 5--8.
[17]
Terrance de Vries, Ishan Misra, Changhan Wang, and Laurens van der Maaten. 2019. Does Object Recognition Work for Everyone? In CVPR Workshop on Computer Vision for Global Challenges. 52--59.
[18]
Casey Fiesler. 2021. Innovating Like an Optimist, Preparing Like a Pessimist: Ethical Speculation and the Legal Imagination. Colorado Technology Law Journal, Vol. 19, 1 (2021).
[19]
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Commun. ACM, Vol. 64, 12 (December 2021), 86--92.
[20]
Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. 2020. Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence, Vol. 2 (2020), 665--673.
[21]
Zoltán Gócza. 2015. Myth #21: People can tell you what they want. https://uxmyths.com/post/746610684/myth-21-people-can-tell-you-what-they-want/
[22]
Philipp Hacker. 2018. Teaching fairness to artificial intelligence: existing and novel strategies against algorithmic discrimination under EU law. Common Market Law Review, Vol. 55, 4 (2018).
[23]
MD Romael Haque, Katherine Weathington, and Shion Guha. 2019. Exploring the Impact of (Not) Changing Default Settings in Algorithmic Crime Mapping--A Case Study of Milwaukee, Wisconsin. In Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing. 206--210.
[24]
Fred Hohman, Kanit Wongsuphasawat, Mary Beth Kery, and Kayur Patel. 2020. Understanding and Visualizing Data Iteration in Machine Learning. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. ACM, 1--13.
[25]
Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. The dataset nutrition label: A framework to drive higher data quality standards. arXiv preprint arXiv:1805.03677 (2018).
[26]
Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé III, Miro Dudik, and Hanna Wallach. 2019. Improving fairness in machine learning systems: What do industry practitioners need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1--16.
[27]
Youyang Hou and Dakuo Wang. 2017. Hacking with NPOs: collaborative analytics and broker roles in civic data hackathons. Proceedings of the ACM on Human-Computer Interaction, Vol. 1, CSCW (2017), 1--16.
[28]
Christoph Hube, Besnik Fetahu, and Ujwal Gadiraju. 2019. Understanding and Mitigating Worker Biases in the Crowdsourced Collection of Subjective Judgments. In Proceedings of the ACM International Conference on Human Factors in Computing Systems (CHI). 1--12.
[29]
Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina Greer, Oddur Kjartansson, Parker Barnes, and Margaret Mitchell. 2021. Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure. In ACM Conference on Fairness, Accountability and Transparency (FAccT). 560--575.
[30]
Anna Jobin, Marcello Ienca, and Effy Vayena. 2019. The global landscape of AI ethics guidelines. Nature Machine Intelligence, Vol. 1 (2019), 389--399.
[31]
Yannis Katsis and Christine T Wolf. 2019. ModelLens: An Interactive System to Support the Model Improvement Practices of Data Science Teams. In Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing. 9--13.
[32]
PM Krafft, Meg Young, Michael Katell, Karen Huang, and Ghislain Bugingo. 2020. Defining AI in policy versus practice. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 72--78.
[33]
Charlotte P Lee. 2007. Boundary negotiating artifacts: Unbinding the routine of boundary objects and embracing chaos in collaborative work. Computer Supported Cooperative Work (CSCW), Vol. 16, 3 (2007), 307--339.
[34]
Min Kyung Lee and Kate Rich. 2021. Who Is Included in Human Perceptions of AI?: Trust and Perceived Fairness around Healthcare AI and Cultural Mistrust. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1--14.
[35]
Michael A Madaio, Luke Stark, Jennifer Wortman Vaughan, and Hanna Wallach. 2020. Co-designing checklists to understand organizational challenges and opportunities around fairness in AI. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1--14.
[36]
Yaoli Mao, Dakuo Wang, Michael Muller, Kush R Varshney, Ioana Baldini, Casey Dugan, and Aleksandra Mojsilović. 2019. How data scientists work together with domain experts in scientific collaborations: To find the right answer or to ask the right question? Proceedings of the ACM on Human-Computer Interaction, Vol. 3, GROUP (2019), 1--23.
[37]
Joseph A Maxwell. 2012. Qualitative research design: An interactive approach. Vol. 41. Sage Publications.
[38]
Milagros Miceli, Martin Schuessler, and Tianling Yang. 2020. Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision. Proceedings of the ACM on Human-Computer Interaction, Vol. 4, CSCW2 (2020), 1--25.
[39]
Milagros Miceli, Tianling Yang, Laurens Naudts, Martin Schuessler, Diana Serbanescu, and Alex Hanna. 2021. Documenting Computer Vision Datasets: An Invitation to Reflexive Data Practices. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 161--172.
[40]
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 220--229.
[41]
Brent Mittelstadt. 2019. Principles Alone Cannot Guarantee Ethical AI. Nature Machine Intelligence, Vol. 1 (2019), 501--507.
[42]
Michael Muller, Ingrid Lange, Dakuo Wang, David Piorkowski, Jason Tsay, Q Vera Liao, Casey Dugan, and Thomas Erickson. 2019. How data science workers work with data: Discovery, capture, curation, design, creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1--15.
[43]
Andrew B Neang, Will Sutherland, Michael W Beach, and Charlotte P Lee. 2021. Data Integration as Coordination: The Articulation of Data Work in an Ocean Science Collaboration. Proceedings of the ACM on Human-Computer Interaction, Vol. 4, CSCW3 (2021), 1--25.
[44]
Jakob Nielsen. 2001. First Rule of Usability? Don't Listen to Users. https://www.nngroup.com/articles/first-rule-of-usability-dont-listen-to-users/
[45]
The Partnership on AI. 2019. Annotation and Benchmarking on Understanding and Transparency of Machine Learning Lifecycles (ABOUT ML), Version 0. Technical Report.
[46]
Samir Passi and Steven J Jackson. 2018. Trust in data science: Collaboration, translation, and accountability in corporate data science projects. Proceedings of the ACM on Human-Computer Interaction, Vol. 2, CSCW (2018), 1--28.
[47]
Kayur Patel, Naomi Bancroft, Steven M Drucker, James Fogarty, Andrew J Ko, and James Landay. 2010. Gestalt: Integrated support for implementation and analysis in machine learning. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. 37--46.
[48]
Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. 2020. Data and its (dis)contents: A survey of dataset development and use in machine learning research. In NeurIPS Workshop on Machine Learning Retrospectives, Surveys, and Meta-analyses.
[49]
Bogdana Rakova, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. 2021. Where Responsible AI meets Reality: Practitioner Perspectives on Enablers for Shifting Organizational Practices. Proceedings of the ACM on Human-Computer Interaction, Vol. 5, CSCW1, Article 7 (2021).
[50]
Kjeld Schmidt. 2008. Taking CSCW Seriously: Supporting Articulation Work (1992). In Cooperative Work and Coordinative Practices. Springer, 45--71.
[51]
Andrew D Selbst. 2017. Disparate impact in big data policing. Ga. L. Rev., Vol. 52 (2017), 109.
[52]
M. Six Silberman, Bill Tomlinson, Rochelle LaPlante, Joel Ross, Lilly Irani, and Andrew Zaldivar. 2018. Responsible Research with Crowds: Pay Crowdworkers at Least Minimum Wage. Commun. ACM, Vol. 61, 3 (March 2018), 39--41.
[53]
Susan Leigh Star and James R Griesemer. 1989. Institutional ecology, 'translations' and boundary objects: Amateurs and professionals in Berkeley's Museum of Vertebrate Zoology, 1907--39. Social Studies of Science, Vol. 19, 3 (1989), 387--420.
[54]
Luke Stark. 2019. Facial Recognition is the Plutonium of AI. ACM XRDS, Vol. 25, 3 (2019), 50--55.
[55]
Anselm Strauss. 1988. The articulation of project work: An organizational process. Sociological Quarterly, Vol. 29, 2 (1988), 163--178.
[56]
Michael Veale, Max Van Kleek, and Reuben Binns. 2018. Fairness and accountability design needs for algorithmic support in high-stakes public sector decision-making. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1--14.
[57]
Jessica Vitak, Katie Shilton, and Zahra Ashktorab. 2016. Beyond the Belmont principles: Ethical challenges, practices, and beliefs in the online data research community. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing. 941--953.
[58]
Leo Yeykelis. 2018. Why It's Wrong To Ask Users What They Want (And What To Ask Instead). Forbes (May 2018). https://www.forbes.com/sites/leoyeykelis/2018/05/10/why-its-wrong-to-ask-users-what-they-want-and-what-to-ask-instead/?sh=1449b7c91f22
[59]
John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, and Eric Karl Oermann. 2018. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Medicine, Vol. 15, 11 (2018).
[60]
Amy X Zhang, Michael Muller, and Dakuo Wang. 2020. How do data science workers collaborate? Roles, workflows, and tools. Proceedings of the ACM on Human-Computer Interaction, Vol. 4, CSCW1 (2020), 1--23.


Published In

Proceedings of the ACM on Human-Computer Interaction, Volume 6, Issue CSCW2 (CSCW). November 2022. 8205 pages.
EISSN: 2573-0142
DOI: 10.1145/3571154
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 November 2022
Published in PACMHCI Volume 6, Issue CSCW2


Author Tags

  1. datasets
  2. documentation
  3. machine learning
  4. responsible AI

Qualifiers

  • Research-article
