DOI: 10.1145/3650203.3663326
research-article
Open access

Croissant: A Metadata Format for ML-Ready Datasets

Published: 09 June 2024

Abstract

Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.


Information

Published In

DEEM '24: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning
June 2024
89 pages
ISBN: 9798400706110
DOI: 10.1145/3650203
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ML datasets
  2. discoverability
  3. reproducibility
  4. responsible AI

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • AI4EUROPE
  • ECSEL Joint Undertaking (JU)
  • HORIZON EUROPE

Conference

SIGMOD/PODS '24

Acceptance Rates

DEEM '24 Paper Acceptance Rate 12 of 17 submissions, 71%;
Overall Acceptance Rate 44 of 67 submissions, 66%

Article Metrics

  • Downloads (last 12 months): 1,297
  • Downloads (last 6 weeks): 174
Reflects downloads up to 29 Jan 2025
