
Communications of the ACM

Review articles

Datasheets for Datasets

[Illustration: individuals moving word blocks. Credit: GoodStudio]

Data plays a critical role in machine learning. Every machine learning model is trained and evaluated using data, quite often in the form of static datasets. The characteristics of these datasets fundamentally influence a model's behavior: a model is unlikely to perform well in the wild if its deployment context does not match its training or evaluation datasets, or if these datasets reflect unwanted societal biases. Such mismatches can have especially severe consequences when machine learning models are used in high-stakes domains, such as criminal justice,1,13,24 hiring,19 critical infrastructure,11,21 and finance.18 Even in other domains, mismatches may lead to loss of revenue or public relations setbacks. Of particular concern are recent examples showing that machine learning models can reproduce or amplify unwanted societal biases reflected in training datasets.4,5,12 For these and other reasons, the World Economic Forum suggests all entities should document the provenance, creation, and use of machine learning datasets to avoid discriminatory outcomes.25


Although data provenance has been studied extensively in the databases community,3,8 it is rarely discussed in the machine learning community. Documenting the creation and use of datasets has received even less attention. Despite the importance of data to machine learning, there is currently no standardized process for documenting machine learning datasets.
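To make the idea of standardized dataset documentation concrete, the questions a datasheet asks (about motivation, composition, collection, and intended use) can be captured in a small machine-readable record that travels with the dataset. The sketch below is a hypothetical, minimal structure for illustration only; the field names and the example values are assumptions, not the template proposed in this article.

```python
# A minimal, hypothetical datasheet record -- an illustrative sketch,
# not the authors' actual datasheet template.
from dataclasses import dataclass, field, asdict


@dataclass
class Datasheet:
    """Context that should accompany a dataset wherever it goes."""
    name: str
    motivation: str            # Why was the dataset created?
    composition: str           # What do the instances represent?
    collection_process: str    # How, when, and from where was it collected?
    recommended_uses: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)


# Example values are invented for demonstration purposes.
sheet = Datasheet(
    name="example-faces-v1",
    motivation="Benchmark face detection under varied lighting.",
    composition="10,000 JPEG images with bounding-box labels.",
    collection_process="Gathered from public photo archives in 2019.",
    recommended_uses=["face detection research"],
    known_limitations=["skewed toward daylight, outdoor scenes"],
)

# Serialize the record so it can be stored alongside the data files,
# letting downstream users check for deployment-context mismatches.
record = asdict(sheet)
print(record["name"])
```

Keeping such a record as plain structured data (rather than free-form prose alone) means consumers can programmatically check, for example, whether a dataset's known limitations conflict with an intended deployment.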


David Lippert

I'm curious if/how the FAIR guiding principles for scientific data management and stewardship align with datasheets for datasets.

Also, there seems to be an adversarial relationship between DAIR and industry. Are there opportunities for industry to work with DAIR to help eliminate unwanted societal biases?

Gary Marchionini

This is interesting and timely work. It is crucial that rich context (metadata++) is available for datasets. The 50+ questions are daunting for an author/creator, but the time and effort given to adding this value will accrue on multiple levels: findability; credit for concept development; understandability; and computability/inferencing/linking to advance progress. Libraries have long developed metadata standards for individual objects to serve these value adds, and archives create finding aids that richly contextualize collections. This work illustrates how to move forward with curating datasets.

Stephen Gilbert

This is a great idea, and I wonder if it could be integrated with the Open Science Framework so that people who post their datasets there would follow one of your templates. I also wondered if this effort could learn from the research reporting guidelines established for medical research. If you want to submit a medical research article, the journal will demand that you also submit a completed research report such as CONSORT (or any of hundreds of others). Maybe data journals could do the same with your template (but not the hundreds part).

