Resolving the Human-Subjects Status of ML’s Crowdworkers

At what point do ML crowdworkers constitute human subjects for ethical and regulatory purposes?

By Divyansh Kaushik, Zachary C. Lipton, and Alex John London

Posted Mar 26 2024

As the focus of machine learning (ML) has shifted toward settings characterized by massive datasets, researchers have become reliant on crowdsourcing platforms.¹³,²⁵ Just for the natural language processing (NLP) task of passage-based question answering (QA), more than 15 new datasets containing at least 50k annotations have been introduced since 2016. Prior to that, available QA datasets contained orders of magnitude fewer examples.

The ability to construct such enormous resources derives mostly from the liquid market for temporary labor on crowdsourcing platforms such as Amazon Mechanical Turk. These practices, however, have raised ethical concerns, including low wages;⁵,²⁶ disparate access, benefits, and harms of developed applications;¹,²⁰ reproducibility of proposed methods;⁴,²¹ and potential for unfairness and discrimination in the resulting technologies.⁹,¹⁴

This article looks at what ethical framework should govern the interaction of ML researchers and crowdworkers, and the unique challenges in regulating ML research. Researchers typically lack expertise in human-subjects research and require guidance on how to classify the role crowdworkers play to comply with relevant ethical and regulatory requirements. Unfortunately, clear guidance is lacking: Some institutions and a 2021 paper by Shmueli et al. suggest all ML crowdworkers constitute human subjects;²³ others suggest that ML crowdworkers rarely constitute human subjects.¹⁰ Confusion surrounding ML crowdworkers is grounded in the following factors:

Novel relationships. The U.S. Common Rule was developed in the wake of abuses in biomedical and behavioral research and reflects the need to distinguish clinical research from medical practice.¹⁵ Because the distinction between employees on a research team and study participants is less ambiguous in medical contexts, little attention has been paid to criteria for distinguishing research staff from study participants.
Novel methods. In biomedical or social sciences, data is collected to answer questions that have been specified in advance, while ML often involves a dynamic workflow in which data is collected in an open-ended fashion and research questions are articulated in light of its analysis. Additionally, ML researchers often release rich data resources, where much of the data is not analyzed.
Ambiguity under the Common Rule. Whether an individual is a human subject hinges on whether the data collected, and later analyzed, is about that individual. As Shmueli et al. have noted, crowdworkers can fill such diverse roles in ML research that it becomes difficult to draw a line between data collected about the crowdworkers and data collected merely from them (but about something else).²³
Scale. NLP research produces hundreds of crowdsourcing papers per year, with 703 appearing at the top venues alone from 2015–2020.²³
Inexperience. Crowdsourcing-intensive ML/NLP papers seldom discuss ethical considerations that would otherwise be central to human-subjects research, and they rarely discuss whether institutional review board (IRB) approval or exemption was sought—only 14 (about 2%) of the aforementioned 703 papers described IRB review or exemption.²³

Current Regulatory Framework

In the U.S., the regulations governing the treatment of humans in scientific research, detailed in the Code of Federal Regulations (CFR), are known as the Common Rule. Falling under the auspices of the Office for Human Research Protections (OHRP) of the U.S. Department of Health and Human Services, these regulations apply only to institutions that accept federal funds or have agreed to abide by these rules. Two important criteria determine whether a person constitutes a research participant: those that define research and those that define a human subject.

Research is defined, in part, as “a systematic investigation, including research development, testing, and evaluation, designed to develop or contribute to generalizable knowledge.”

A human subject is defined as “a living individual about whom an investigator (whether professional or student) conducting research: (i) obtains information or biospecimens through intervention or interaction with the individual, and uses, studies, or analyzes the information or biospecimens; or (ii) obtains, uses, studies, analyzes, or generates identifiable private information or identifiable biospecimens.”

For simplicity, this discussion is limited to the production of information, rather than a discussion of specimens.

Two points of clarification: First, to satisfy the definition of a human subject in the CFR, researchers must obtain data about an individual. This does not imply that the study focuses on the individual; rather, the study aims to generate generalizable knowledge. For example, in biomedicine, individual measurements are used to produce knowledge about a wider population. Defining what information is about an individual can be challenging for ML researchers dealing with crowdworkers.

Second, conditions (i) and (ii) in the CFR lump together a range of cases that vary in substantive ways. Condition (i) is a combination of two conjuncts. The first concerns the way that information is produced: from either intervention or interaction. These terms are defined as:

Intervention includes both physical procedures by which information or biospecimens are gathered (for example, venipuncture) and manipulations of the subject or the subject’s environment that are performed for research purposes.
Interaction includes communication or interpersonal contact between investigator and subject.

Of these, interaction is the weaker condition. Interventions can be understood as the subset of interactions that produce a change in either the individual (for example, administering a drug or drawing blood) or their environment (for example, placing an individual in an imaging device). In contrast, interactions include communication or interpersonal contact that generates information without necessarily bringing about a change to the individual or their environment. For example, a study might divide participants into two groups: one to test an intervention alongside usual care; one to receive just the usual care. The group receiving only usual care is still part of the study’s social interaction that generates data to control for confounding, thus aiding in creating generalizable knowledge.

The second conjunct in condition (i) requires that information arising in one of these two ways—intervention or interaction—is then used, studied, or analyzed. Of these, use is the broadest category, as there may be myriad ways that information from a social interaction is used in research. In contrast, study and analysis constitute a stricter subset of uses in which data are analyzed or evaluated, presumably to generate the generalizable knowledge that defines the study in question.

The accompanying table lists combinations of these categories that form different research paradigms. Among these, the intervention + analysis condition is the narrowest: a person becomes a study subject through targeted interventions and subsequent analysis. In contrast, the interaction + use criterion is broader, holding that a person is a human subject if, during research, researchers interact with them in a way that produces information used to further the goals of the research.

Table. Examples of research interactions with the crowd.

	Studies/Analyzes	Uses
Intervention	Identifying better crowdsourcing strategies via a randomized study	Training an ML model on data collected in a gamification environment
Interaction	Analyzing data collected via surveys	Training an ML model on an annotated dataset

Condition (ii) of the CFR’s definition of human subject applies when researchers obtain, use, study, analyze, or generate identifiable private information about a living individual, even if direct interaction is absent. It covers research involving datasets containing personal information or studies generating such information from noninclusive datasets.
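
Read together, the two conditions behave like a simple predicate. The following is a minimal Python sketch of that reading, not an official or complete test. The class `Activity`, its field names, and the function names are hypothetical and introduced only for illustration. The field `info_is_about_individual` stands in for the about-versus-from judgment, which is exactly the part that requires human interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical summary of one research activity (illustrative fields only)."""
    is_research: bool                # systematic investigation aimed at generalizable knowledge
    intervention: bool               # procedures or manipulations performed for research purposes
    interaction: bool                # communication or interpersonal contact with the individual
    uses_studies_or_analyzes: bool   # the resulting information is used, studied, or analyzed
    info_is_about_individual: bool   # information is about the individual, not merely from them
    identifiable_private_info: bool = False  # identifiable private information is involved


def condition_i(a: Activity) -> bool:
    # First conjunct: how the information arises. Second conjunct: what is done with it.
    return (a.intervention or a.interaction) and a.uses_studies_or_analyzes


def condition_ii(a: Activity) -> bool:
    # Applies even when there is no direct interaction.
    return a.identifiable_private_info


def is_human_subject(a: Activity) -> bool:
    # Assumption: the "about whom" requirement gates both conditions.
    if not (a.is_research and a.info_is_about_individual):
        return False
    return condition_i(a) or condition_ii(a)


randomized_workflow_study = Activity(
    is_research=True, intervention=True, interaction=True,
    uses_studies_or_analyzes=True, info_is_about_individual=True,
)
label_annotation = Activity(
    is_research=True, intervention=False, interaction=True,
    uses_studies_or_analyzes=True, info_is_about_individual=False,
)

print(is_human_subject(randomized_workflow_study))  # True
print(is_human_subject(label_annotation))           # False
```

The sketch encodes a strict reading of "about whom". An institution that reads the definition more expansively would effectively drop the `info_is_about_individual` gate, and the second example would then also be classified as involving human subjects.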

These definitions demarcate which set of ethical and regulatory requirements applies to an activity. Activities not involving human subjects are not governed by regulations for human subject research, making IRB review unnecessary. Research involving human participants, however, necessitates adherence to specific moral and regulatory responsibilities, including mandatory IRB review.

This last claim might come as a surprise to some familiar with the Common Rule, since a significant portion of ML research, and NLP research in particular, is likely to be classified as exempt. Per §46.104(d)(3)(i) of the Common Rule, research involving benign behavioral interventions in conjunction with the collection of information from an adult subject through verbal or written responses or audiovisual recording can qualify for exempt status if the subject prospectively agrees to the intervention and information collection and at least one of the following criteria is met:

The information obtained is recorded by the investigator in such a manner that the identity of the human subjects cannot readily be ascertained, directly or through identifiers linked to the subjects.
Any disclosure of the human subjects’ responses outside the research would not reasonably place the subjects at risk of criminal or civil liability or be damaging to the subjects’ financial standing, employability, educational advancement, or reputation.

However, a researcher cannot unilaterally declare their research to be exempt from IRB review.

Rather, exempt is a regulatory status that must be determined by an IRB (§46.109(a)). This may seem paradoxical, as for a study to qualify for exempt status, researchers are obligated to offer comprehensive details regarding their project to the IRB. The board assesses this information to ensure all applicable Common Rule standards are met. This is common in administrative rulemaking, as well as judicial review; courts may determine whether something is in their jurisdiction, but a plaintiff must provide information to enable a court to make that determination. Exempt status usually entails less effort and receives faster approval than a full IRB review. A researcher at an institution governed by the Common Rule would violate regulatory obligations by commencing human subject research without prior IRB review, even if the research would have been exempt.
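
The exemption logic described above can be sketched in a few lines of Python. This is an illustrative assumption-laden sketch: `ExemptionFacts`, its fields, and both functions are hypothetical names, and only the two criteria quoted in this article are modeled. The point it illustrates is that both branches end at the IRB, because the researcher cannot make the determination alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExemptionFacts:
    """Hypothetical facts a researcher would report to an IRB (illustrative only)."""
    benign_behavioral_intervention: bool
    adult_subject: bool
    prospective_agreement: bool
    identity_not_readily_ascertainable: bool   # first criterion listed above
    disclosure_poses_no_reasonable_risk: bool  # second criterion listed above


def may_qualify_for_exemption(f: ExemptionFacts) -> bool:
    # All preconditions must hold, plus at least one of the listed criteria.
    preconditions = (
        f.benign_behavioral_intervention
        and f.adult_subject
        and f.prospective_agreement
    )
    at_least_one_criterion = (
        f.identity_not_readily_ascertainable
        or f.disclosure_poses_no_reasonable_risk
    )
    return preconditions and at_least_one_criterion


def next_step(f: ExemptionFacts) -> str:
    # Either way the IRB decides; the researcher's own assessment only shapes the request.
    if may_qualify_for_exemption(f):
        return "submit to IRB for an exemption determination"
    return "submit to IRB for review"


anonymous_annotation_study = ExemptionFacts(True, True, True, True, False)
print(next_step(anonymous_annotation_study))
```

Note that `next_step` never returns anything like "no review needed": for human-subjects research at an institution governed by the Common Rule, the self-assessment is an input to the IRB, not a substitute for it.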

Common Rule and ML Research

Based on the preceding analysis, there is a large subset of ML research in which crowdworkers are clearly human subjects. These cases fit squarely into the paradigm of research, familiar in biomedicine and social science, where researchers interact with crowdworkers to produce data about those individuals, and then analyze that data to produce generalized knowledge about a population from which those individuals are considered representative samples.

In some studies, researchers assign crowdworkers at random to interventions to produce data that can be analyzed to generate generalizable knowledge about best practices for using crowdworkers. Here, crowdworkers are clearly human subjects. They are the target of an intervention designed specifically to capture data about them and their performance.

For example, Khashabi et al. engaged crowdworkers to investigate which workflows result in higher-quality QA datasets.¹² They recruited one set of crowdworkers to write questions given a passage, while another group of crowdworkers were shown a passage along with a suggested question and were tasked with minimally editing this question to generate new questions. In these settings, the data was about the workers themselves, as was the analysis.

Similarly, Kaushik et al. also examined different workflows to create QA datasets.¹¹ They asked one set of crowdworkers to write five questions after reading a passage, and another to write questions that elicit incorrect predictions from a pretrained QA model. Through this study, they derived insights about how each setup influenced crowdworker behavior, and then trained various QA models on these datasets.

Human-subjects research in NLP is not limited to studies aimed at dataset quality. Hayati et al. paired two crowdworkers in a conversational setting and asked one to recommend a movie to the other.⁷ They analyzed the outputs to identify what communication strategies led to successful recommendations and used these insights to train automated dialog systems.

Perez-Rosas et al. asked crowdworkers to each write seven truths and seven plausible lies on topics of their own choosing and collected demographic attributes (such as age and gender) for each crowdworker.²² They analyzed how attributes of deceptive behavior relate to gender and age, and then trained classifiers to predict deception, gender, and age. In these cases, the researchers interacted with crowdworkers to produce data about the crowdworkers that was then analyzed to answer research hypotheses, which created generalizable knowledge.

Cases where the human-subjects designation is problematic. Many ML crowdsourcing studies do not fit neatly in the paradigm of research common elsewhere. For example, crowdworkers are often recruited not as objects of study but to perform tasks that could have been—and sometimes are—performed by the researchers. In these cases, the researchers interact with crowdworkers and produce data that is then used to produce generalizable knowledge. Moreover, some of the collected data is about the worker (for example, to facilitate payment). In these cases, however, data analyzed to produce generalizable knowledge is not about the crowdworkers in any meaningful sense.

In the most common use of crowdsourcing in ML research (for example, Hovy et al.⁸) workers are hired to label datasets used for model training. While such research might seemingly satisfy the interaction and use criteria from the Common Rule, it meets these through information not directly about the worker. Crowdworkers perform tasks that would typically be performed by the research team when dealing with smaller datasets. For example, Kovashka et al. described computer vision papers where researchers provided their own labels.¹³ Addressing the same task, DeYoung et al. recruited crowdworkers to provide annotation,³ while Zaidan et al. did the annotations themselves.³⁰ All these tasks involved interacting with crowdworkers and using the generated data.

On a strict reading of the claim that a human subject is a living individual “about whom” researchers obtain information that is used or analyzed to produce generalizable knowledge, crowdworkers in these cases would not be classified as human subjects. This reading is consistent with the practice of some IRBs.

For example, Whittier College states:

Information-gathering interviews with questions that focus on things, products, or policies rather than people or their thoughts about themselves may not meet the definition of human-subjects research. Example: interviewing students about campus cafeteria menus or managers about travel reimbursement policy.²⁷

In contrast, other IRBs adopt a far more expansive reading of the Common Rule. Loyola University says:

In making a determination about whether an activity constitutes research involving human subjects, ask yourself the following questions:
Will the data collected be publicly presented or published?
AND
Do my research methods involve a) direct and/or indirect interaction with participants via interviews, assessments, surveys, or observations, or b) access to identifiable private information about individuals, for example, information that is not in the public domain?
If the answer to both these questions is “yes,” a project is considered research with human subjects and is subject to federal regulations.¹⁸

Note this interpretation does not distinguish whether the information is about an individual or just obtained via a direct and/or indirect interaction. This view appears to be shared by other IRBs as well.²

How does information about versus merely from impact human-subjects determination? Traditionally, research ethics has not had to worry about who is a member of the research team and who is a research participant. This ambiguity arises in cases of self-experimentation, but such cases are rare and fit into the intervention + analysis category from the Common Rule. The scope of the effort required to produce data that can be used in ML research has engendered new forms of interaction between researchers and the public. Without explicit guidance from federal authorities, individual IRBs must grapple with this issue on their own.

In the problematic cases discussed here, we contend that crowdworkers are best understood as augmenting the labor capacity of researchers rather than participating as human subjects in that research. This argument has two parts:

The first part of the argument is based on symmetry. Within a division of labor, if a task can be performed by more than one person, the categorization of that task should depend on its substantive features, not the identity of the individual performing it. (The potential counterargument citing unionized and nonunionized workers or independent contractors and employees shows that individual identity and related features may influence workplace protections, even for the same type of work. Pre-existing agreements modifying agent entitlements, however, do not change the nature of the activity—be it work or research.)

Therefore, if the same task is performed by a researcher and then by crowdworkers, the categorization should be consistent across both instances. Consequently, symmetry implies that either both the crowdworker and the researcher are part of the research team, or both are human subjects.

The second part of the argument offers additional factors encouraging the categorization of both as part of the research team. First, when conducting tasks pertinent to the study, researchers do not self-experiment; they are not study subjects.

Second, this position reflects the understanding that these interactions generate useful information contributing to the development of generalizable knowledge. This information, however, should be seen as originating from the people who perform the task, not as being about them.

Third, researchers interact as a team to generate tools, materials, and metrics used in research. But this interaction and use creates the means of generating new knowledge; it does not constitute the data whose study or analysis produces new knowledge.

Finally, ignoring the distinction between data about a person versus from them and considering both researchers and crowdworkers as human subjects would excessively broaden the regulatory category. This would categorize every research team member, even in biomedical and social science, as human subjects since they regularly interact with their teams to generate information for general knowledge.

Loopholes in research oversight. The preceding analysis underscores an ethical quandary in ML research. Ethical oversight in studies involving human participants safeguards their interests, which can be at risk because of interactions, interventions, or subsequent data usage. The concept of an oversight loophole, where a researcher can evade oversight requirements without affecting the applied research procedures,¹⁷ constitutes an ethical concern. It infringes on the principle of equal treatment: If data is collected from individuals for research to produce generalizable knowledge, their interests should receive the same level of oversight and concern regardless of how labor is distributed during the process. Nevertheless, two aspects of ML research render it susceptible to oversight loopholes: the way the data collection and analysis workload is partitioned, and the way research questions often surface after data collection.

SCENARIO 1

The Common Rule envisions several divisions of labor in research. In traditional biomedical or social science research, it is common for the same researchers to both collect and analyze data. This approach is affirmed by 45 CFR 46.102(e)(1)(i), which states that a researcher who “[o]btains information or biospecimens through intervention or interaction with the individual, and uses, studies, or analyzes the information or biospecimens,” is engaging in human-subjects research. Here, the ethical review assesses whether interactions respect participant autonomy and welfare, and information obtained from these interactions is used in ways that respect individuals’ rights and welfare.

Data or biospecimens are often collected during medical care or other health services. Such interactions are governed by medical ethics and professional norms rather than requirements of research. Hence, research ethics review assesses if identifiable private information is contained in the data or specimens and whether its use respects individuals’ rights and welfare.

It is not clear whether the Common Rule accounts for cases where researchers collect data for research goals but don’t analyze it themselves. This differs from secondary use of research data, where initial data collection already considers participants’ welfare and rights, ensuring adequate oversight. Subsequent oversight would thus evaluate additional use of that data.

In contrast, many ML researchers gather large datasets for research purposes, without defined hypotheses, often to support future research in broad fields.²⁸,³¹ For example, Williams et al. compiled a dataset for textual entailment recognition and released it (along with anonymized crowdworker identifiers) for future research.²⁸ Similarly, Mihaylov et al. and Talmor et al. created and released QA datasets with anonymized identifiers for further research.¹⁹,²⁴ These studies, involving only interaction with and using or analyzing data from crowdworkers, may not necessitate IRB review.

In a subsequent study, Geva et al. analyzed information about crowdworkers using these anonymized sets.⁶ They assessed how ML models trained on data from one group of crowdworkers generalize to data from another group, and trained models to predict the authoring crowdworkers for respective documents. Given that they studied only existing anonymous datasets and didn’t directly interact with the workers, it’s questionable whether their work would require IRB oversight. If the researchers who collected the initial data had also conducted this analysis, however, IRB review would have been compulsory to ensure proper protections of participants’ welfare.

While much of ML research poses minimal risk to participants, cases do exist where interventions or interactions are less benign. For example, Xu et al. asked crowdworkers to prompt unsafe responses from a chatbot, using this data to create safer response models.²⁹ These individuals may not inherently be considered human subjects, as their input doesn’t pertain directly to them. In this study, however, the researchers also established an offensive language taxonomy for classifying human utterances, paving the way for its application in future research. Thus, inferences could potentially be drawn about the proclivities or proficiency of some crowdworkers to use offensive language of particular types.

In each of these cases, datasets were collected that contain information from crowdworkers for the purposes of producing generalizable knowledge that can include information about the crowdworkers. A research oversight loophole is created as 45 CFR 46.102(e)(1)(i) considers individuals as human subjects only if their information is obtained and used in the same study. To be clear, releasing such a dataset with identifiable private information for research purposes would fall under clause (ii) from 45 CFR 46.102(e)(1) (discussed earlier). Subsequent research on this dataset is also subject to this clause, if the identifiable information remains.

The method where researchers collect data from individuals to create generalizable knowledge, anonymize it, and pass it to another team for analysis could be viewed as a loophole. This process, unlike when the researchers themselves analyze the data, would not be subject to oversight aimed at respecting individual autonomy and welfare.¹⁵ Even though anonymization lessens harm from exposure of sensitive details, it doesn’t assure respect for individual autonomy and well-being in the data collection, due to absence of oversight.

One way to address loopholes of this type would be to amend 45 CFR 46.102(e)(1)(i) to explicitly include the release of data alongside its use, study, or analysis.

SCENARIO 2

Revision of 45 CFR 46.102(e)(1)(i) to include data release might not prevent loopholes. For example, a research team collecting data directly from crowdworkers and about them (an approach fitting standard research) might divide the process into two protocols to avoid IRB approval requirements. In the first protocol, they collect data but analyze only the data that is from the crowdworkers and not about them. They then anonymize all collected data and, in a second protocol, analyze data about crowdworkers. The second protocol avoids the need for IRB approval because it involves neither personal interaction nor the use of identifiable private information.

In this scenario, a single study that would require IRB approval could avoid research ethics oversight by being decomposed into separate studies. As a result, the determination of whether an ML project constitutes research with human participants might need to be made at a higher level than the individual study protocol.

In the context of drug development, for example, a trial portfolio has been defined as a “series of trials interrelated by a common set of objectives.”¹⁶ It might be beneficial to apply this portfolio-level approach in ML research: that is, to consider how the data generated and the questions investigated across interlinked studies relate to crowdworkers. Successful portfolio-level reviews require researchers to specify in advance the kind, scope, and nature of the data they are collecting and the possible inquiries across various studies. As new research questions often arise after data collection because of the dynamic nature of ML research, researchers may need to consult with IRBs to clarify when a proposed portfolio of studies should be classified as human-subjects research.
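
The difference between protocol-level and portfolio-level determination can be illustrated with a small Python sketch. Everything here is hypothetical: the `Protocol` class, its fields, and the simplified `flagged` rule are assumptions made for illustration and do not reproduce how any IRB actually evaluates submissions. The sketch only shows why splitting one study in two can change a per-protocol outcome but not a portfolio-level one.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Protocol:
    """Hypothetical description of one study protocol (illustrative fields only)."""
    name: str
    interacts_with_workers: bool
    analyzes_data_about_workers: bool
    uses_identifiable_private_info: bool = False


def flagged(interacts: bool, analyzes_about: bool, identifiable: bool) -> bool:
    # Simplified rule: interaction plus analysis about the workers, or identifiable private info.
    return (interacts and analyzes_about) or identifiable


def per_protocol_flags(protocols: List[Protocol]) -> Dict[str, bool]:
    # Each protocol is judged in isolation.
    return {
        p.name: flagged(
            p.interacts_with_workers,
            p.analyzes_data_about_workers,
            p.uses_identifiable_private_info,
        )
        for p in protocols
    }


def portfolio_flag(protocols: List[Protocol]) -> bool:
    # The interlinked protocols are judged as one unit.
    return flagged(
        any(p.interacts_with_workers for p in protocols),
        any(p.analyzes_data_about_workers for p in protocols),
        any(p.uses_identifiable_private_info for p in protocols),
    )


portfolio = [
    Protocol("collect-and-anonymize", interacts_with_workers=True,
             analyzes_data_about_workers=False),
    Protocol("analyze-workers", interacts_with_workers=False,
             analyzes_data_about_workers=True),
]

print(per_protocol_flags(portfolio))  # {'collect-and-anonymize': False, 'analyze-workers': False}
print(portfolio_flag(portfolio))      # True
```

Neither protocol is flagged on its own, yet the portfolio is, which mirrors the decomposition described in Scenario 2.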

Discussion

There is considerable confusion about when ML’s crowdworkers constitute human subjects for ethical and regulatory purposes. While some sources suggest treating all crowdworkers as human subjects,²³ our analysis makes a more nuanced proposal, identifying:

Clear-cut cases of human-subjects research, which require IRB consultation, even if only to confirm they belong to an exempt category.
Crowdsourcing studies that do not constitute human-subjects research because the analyses do not involve data about the workers.
Difficult cases, where the distinctive features of ML’s crowdworking studies combine with ambiguities in the Common Rule to create uncertainty about how to apply existing requirements.
Loopholes, whereby researchers might elude the human-subjects designation without making substantive changes to the research performed.

The spirit of research oversight is to safeguard the rights and interests of individuals involved in research. Individuals who are not research participants can still be exposed to risks to their well-being and threats to their autonomy. This is particularly true of employment interactions, as employers often have access to sensitive, private, identifiable information (such as Social Security numbers and background-check reports) about their employees.

The solution is not necessarily to redefine all crowdworkers as human subjects, but rather to clarify the parameters for their classification as such, ensuring due oversight when applicable. In other instances, their rights should be upheld via ethical and regulatory frameworks guiding labor practices and workplace safety.

Our recommendations:

ML researchers must work proactively with IRBs to determine which, if any, information they will generate is about versus merely from crowdworkers. They must discern whether their intended portfolio of studies involving this data constitutes human-subjects research. They should also recognize that as the questions they investigate change, the status of the research they are conducting may change. Consequently, researchers must consult IRBs to understand when a new submission or a protocol modification is necessary for the ongoing research.
IRBs should not reflexively classify all ML research involving crowdworkers as human-subjects research. Rather, IRBs should establish clear procedures for evaluating portfolios of research to address the possibility of loopholes in research oversight. They should communicate with ML researchers about the conditions under which the classifications might change and the conditions under which a revised protocol would be required.
OHRP should offer precise guidance about what it means for information or analysis to be “about” a set of individuals. We also recommend that OHRP revise the Common Rule so that 45 CFR 46.102(e)(1) condition (i) reads: “Obtains information or biospecimens through intervention or interaction with the individual, and uses, studies, analyzes, or releases the information or biospecimens.” This modification would require that an original investigator who collects data through interaction with humans and plans to release a dataset (even if anonymized) that could be used to ask questions about those individuals must secure IRB approval for the research in which that data is gathered. Subsequent studies using the anonymized data would not be counted as human-subjects research unless they aim to re-identify individuals. This change resolves one loophole identified here. OHRP also has a role to play in offering guidance to ML researchers. This could be achieved by issuing an agency Dear Colleague letter or an FAQ document.

Acknowledgments

We thank Sina Fazelpour, Holly Fernandez Lynch, and I. Glenn Cohen for their constructive feedback. We also thank Mozilla, the Carnegie Mellon University Block Center, the Carnegie Mellon University PwC Center, University of Pittsburgh Medical Center, Abridge, Meta Research, and Amazon Research for the grants and fellowships that made this work possible.

DOI: 10.1145/3641858
Communications of the ACM, May 2024 Issue, Vol. 67 No. 5, Pages 52-59