
Prevalence and Prevention of Large Language Model Use in Crowd Work

Crowd workers often use LLMs, but this can have a homogenizing effect on their output. How can we—and should we—prevent LLM use in crowd work?


Crowd work platforms, such as Prolific and Amazon Mechanical Turk, play an important part in academia and industry, empowering the creation, annotation, and summarization of data,11 as well as surveys and experiments.21 At the same time, large language models (LLMs), such as ChatGPT, Gemini, and Claude, promise similar capabilities. They are remarkable data annotators10 and can, in some cases, accurately simulate human behavior, enabling in-silico experiments and surveys that yield human-like results.2 Yet, if crowd workers were to start using LLMs, this could threaten the validity of data generated using crowd work platforms. Sometimes, researchers seek to observe unaided human responses (even if LLMs could provide a good proxy), and LLMs still often fail to accurately simulate human behavior.22 Further, LLM-generated data may degrade subsequent models trained on it.23 Here, we investigate the extent to which crowd workers use LLMs in a text-production task and whether targeted mitigation strategies can prevent LLM use.

Study 1: Prevalence of LLM Use

To estimate LLM use on Prolific, a research-oriented crowd work platform, we asked n = 161 workers to summarize scientific abstracts (following Ribeiro et al.;15 see Appendix in Supplemental Materials). We chose this task because it is laborious for humans but easily done by LLMs17 and because it allowed us to use pre-LLM summaries from prior work15 as “human ground truth.” We detected whether a summary had been generated using LLMs with a fine-tuned e5-base classifier28 trained on human, pre-LLM summaries15 and on summaries generated by GPT-4 and ChatGPT. The model was then run on each of the 161 new summaries to estimate its probability of being LLM-generated. In this study, we gave participants no instructions regarding LLM use; thus, we captured a baseline of LLM use for uninstructed participants doing a task for which LLMs have a considerable advantage over human labor.
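
As a rough illustration (not the authors’ exact pipeline), the sketch below fine-tunes a binary “LLM-or-not” classifier from the public intfloat/e5-base checkpoint with Hugging Face transformers; the two toy training examples and all hyperparameters are placeholders.

```python
# Minimal sketch of an "LLM-or-not" detector fine-tuned from e5-base.
# Assumptions: the public "intfloat/e5-base" checkpoint and the toy labeled
# examples below stand in for the paper's actual training data and setup.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = [
    "The study finds that summaries shrink as they are passed along.",       # toy human summary
    "This paper investigates how information is distorted in cascades.",     # toy LLM summary
]
labels = [0, 1]  # 0 = human-written, 1 = LLM-generated

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base")
model = AutoModelForSequenceClassification.from_pretrained("intfloat/e5-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

ds = Dataset.from_dict({"text": texts, "labels": labels}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm_detector", num_train_epochs=3,
                           per_device_train_batch_size=8, logging_steps=1),
    train_dataset=ds,
)
trainer.train()
# At inference time, a softmax over the two logits gives each new summary's
# probability of being LLM-generated.
```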

Following a study on Mechanical Turk that took place four to six weeks before ours,27 we used three approaches to aggregate the probabilities of LLM use (henceforth “LLM probabilities”), obtaining similar (but slightly lower) estimates:

  • Classify-and-count, considering as synthetic any summary with an LLM probability above 50% (prevalence estimate: 33.3%; 95% CI [25.9%, 40.1%])

  • Probabilistic classify-and-count, where we calibrated the model6 (see Appendix) and then averaged the LLM probabilities (estimate: 35.2% [29.8%, 40.6%])

  • Corrected classify-and-count, adjusting for the type I and type II error rates estimated on the training data18 (estimate: 35.4% [27.8%, 43.0%]).
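
For concreteness, here is a minimal sketch of the three aggregation rules, applied to a hypothetical array of per-summary LLM probabilities; the calibration procedure6 and the error rates used for the correction18 are assumed rather than taken from the paper.

```python
import numpy as np

def classify_and_count(p, threshold=0.5):
    # Share of summaries whose LLM probability exceeds the threshold.
    return float(np.mean(np.asarray(p) > threshold))

def probabilistic_classify_and_count(p_calibrated):
    # Mean of the (calibrated) LLM probabilities.
    return float(np.mean(p_calibrated))

def corrected_classify_and_count(p, fpr, fnr, threshold=0.5):
    # Adjust the raw positive rate for estimated type I (fpr) and type II (fnr) errors.
    raw = float(np.mean(np.asarray(p) > threshold))
    return (raw - fpr) / (1.0 - fpr - fnr)

# Hypothetical per-summary probabilities and error rates, for illustration only.
probs = np.array([0.92, 0.08, 0.61, 0.03, 0.77, 0.44])
print(classify_and_count(probs))
print(probabilistic_classify_and_count(probs))
print(corrected_classify_and_count(probs, fpr=0.05, fnr=0.10))
```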

We validated our results by analyzing crowd workers’ copy-pasting behavior (see Appendix), finding that 55% of the summaries where workers had copy-pasted text were classified as synthetic (that is, LLM probability above 50%) vs. only 9% when workers had not copy-pasted text. As no information about copy-pasting was used in the “LLM-or-not” classifier, this result strengthens our confidence in it. Interestingly, far fewer crowd workers used copy-pasting on Prolific (53%) in Study 1, compared with a previous study27 on Amazon Mechanical Turk (89%).
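
As an illustration of this validation step, the sketch below cross-tabulates a copy-paste flag against the classifier’s binary label; the data and column names are invented.

```python
import pandas as pd

# Hypothetical per-summary flags: whether the worker copy-pasted text, and
# whether the classifier scored the summary above 50% LLM probability.
df = pd.DataFrame({
    "copy_pasted": [True, True, False, False, True, False],
    "synthetic":   [True, False, False, False, True, False],
})
# Share of synthetic summaries within each copy-pasting group.
print(pd.crosstab(df["copy_pasted"], df["synthetic"], normalize="index"))
```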

Study 2: Prevention of LLM Use

Next, we analyzed whether targeted strategies can curb LLM use. Specifically, we studied two different mitigation approaches: 1) explicitly asking crowd workers not to use LLMs (henceforth the “request” strategy) and 2) imposing hurdles that deter LLM use (the “hurdle” strategy). We considered two variations for each: For the request strategy, we asked individuals either directly or indirectly not to use LLMs (see Appendix), and for the hurdle strategy, we either converted the original abstract text to an image or disabled copy-pasting entirely. As the two strategies are independent, we investigated all combinations (alongside a no-restriction condition) in a 3 x 3 factorial design (see Table 1).
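
A small sketch of how the nine cells of such a design can be enumerated and assigned (condition labels are ours, not the paper’s):

```python
import random
from itertools import product

requests = ["no_request", "direct_request", "indirect_request"]
hurdles = ["no_hurdle", "abstract_as_image", "copy_paste_disabled"]

conditions = list(product(requests, hurdles))  # 3 x 3 = 9 cells
assert len(conditions) == 9

def assign_condition(worker_id, seed=0):
    # Toy deterministic random assignment of a worker to one of the nine cells.
    rng = random.Random(hash((seed, worker_id)))
    return rng.choice(conditions)

print(assign_condition(worker_id=42))
```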

Table 1.  LLM use across experimental conditions, estimated using three methods: a) probabilistic classify-and-count (“Classifier”); b) self-reported use (“Self-reported”); and c) high-precision heuristics (“Heuristics”). All estimates indicate that the interventions significantly reduced LLM use, albeit not completely.

Using the same task as in Study 1, Study 2 randomly split n = 720 workers across the nine conditions. Upon completion, they were redirected to a follow-up survey asking (Q1) how often they used ChatGPT in their daily lives, (Q2) whether they had used ChatGPT for the task, and (Q3) whether they knew of studies tracking ChatGPT use on crowd work platforms (see Appendix for exact phrasing). We measured LLM use in three ways: with the probabilistic classify-and-count classifier, through self-reports captured by Q2, and with high-precision (and likely low-recall) heuristics indicative of LLM use (see Materials and Methods).
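
The snippet below shows what a high-precision, low-recall heuristic could look like in principle; the telltale phrases and time threshold are purely illustrative and are not the rules used in the paper, which are described in its Materials and Methods.

```python
# Purely illustrative heuristics: flag a submission only when a signal is
# near-unambiguous, trading recall for precision. These rules and thresholds
# are assumptions, not the paper's actual heuristics.
LLM_TELLTALES = ("as an ai language model", "i cannot browse the internet")

def heuristic_llm_flag(summary: str, seconds_on_task: float) -> bool:
    text = summary.lower()
    telltale_phrase = any(phrase in text for phrase in LLM_TELLTALES)
    implausibly_fast = seconds_on_task < 30  # hypothetical threshold
    return telltale_phrase or implausibly_fast

print(heuristic_llm_flag("As an AI language model, I cannot ...", 120.0))  # True
```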

Effectiveness of preventive measures.  Table 1 shows the estimated LLM use across the different mitigation strategies. For example, when workers were directly requested not to use LLMs and shown the text to be summarized as an image (thus preventing copy-pasting), LLM use almost halved, dropping from 27.6% to 15.9% as measured by the probabilistic classify-and-count method (Table 1a). Similar results were obtained using crowd workers’ self-reported use (Q2) and using high-precision heuristics (Tables 1b and 1c; see Materials and Methods). Comparing high-precision heuristics with self-reports revealed that only 11 of the 31 workers flagged as using LLMs by the high-precision heuristics admitted to doing so, whereas 31 of the 689 workers whom neither the heuristics nor the classifier marked as synthetic admitted to LLM use.

We further disentangled the effect of each specific strategy and variation with a linear model (see Appendix), finding three out of the four tested interventions to significantly reduce LLM use (considering the LLM use predicted by the classifier; see the figure). Notably, asking crowd workers indirectly (“Please do your best to summarize the abstract in your own words”) was the least effective strategy across all measures of LLM use and the only non-significant intervention when considering the classifier-based outcome (“Indirect”; 2% decrease; p = 0.38). This hints at the complexity of preventing LLM use, as crowd workers may choose to ignore requests if it is in their best interest financially.
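
The kind of model described above can be sketched as an ordinary least-squares regression of the classifier’s LLM probability on treatment-coded factors for the request and hurdle variants; the data and variable names below are hypothetical, and the paper’s exact specification is in the Appendix.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-worker data: classifier probability plus the assigned
# request and hurdle variants (reference levels: no request, no hurdle).
df = pd.DataFrame({
    "llm_prob": [0.82, 0.10, 0.33, 0.05, 0.71, 0.12, 0.40, 0.09],
    "request":  ["none", "direct", "indirect", "direct", "none", "indirect", "none", "direct"],
    "hurdle":   ["none", "image", "none", "no_copy_paste", "image", "none", "no_copy_paste", "none"],
})
fit = smf.ols("llm_prob ~ C(request, Treatment('none')) + C(hurdle, Treatment('none'))",
              data=df).fit()
print(fit.params)  # each coefficient estimates one intervention's effect on LLM probability
```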

Figure.  Estimated effect sizes for interventions to prevent LLM use considering three different measures of LLM use as the outcome variable: 1) probabilistic classify-and-count, 2) self-reported use, 3) high-precision heuristics. Error bars represent 95% confidence intervals; n = 720.

Correlates of LLM use.  We studied the relationship between LLM use and 1) the age of crowd workers and 2) their answers to two of the post-survey questions (Q1: LLM use in general; Q3: awareness of studies measuring LLM use), using a simple linear model and considering both self-reports and the classifier’s LLM-probability estimates as outcomes (see Appendix). We found that younger individuals were significantly more likely to use LLMs (−0.18% in estimated LLM probability per year of age; p = 0.014) and that workers who reported using LLMs “often” in daily life were 18.7% more likely to use them for the task (p < 0.001). Awareness of studies measuring LLM use did not significantly affect use (+1.6%; p = 0.55). Results were similar when considering self-reported use as the outcome variable.
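
A minimal sketch of this correlate analysis, assuming hypothetical column names and toy data rather than the study’s actual dataset:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: classifier probability, worker age, and binarized survey
# answers (Q1: uses ChatGPT "often"; Q3: aware of studies tracking LLM use).
df = pd.DataFrame({
    "llm_prob":   [0.91, 0.06, 0.44, 0.12, 0.70, 0.03, 0.55, 0.08],
    "age":        [22, 51, 34, 45, 27, 60, 30, 41],
    "uses_often": [1, 0, 0, 0, 1, 0, 1, 0],
    "aware":      [0, 1, 0, 1, 0, 0, 1, 0],
})
fit = smf.ols("llm_prob ~ age + uses_often + aware", data=df).fit()
print(fit.params["age"] * 100)  # percentage-point change in LLM probability per year of age
```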

Additionally, we analyzed the relationship between LLM use and time spent on the task, finding that preventive measures (that is, hurdles and requests) seem to mediate the relationship. Workers who self-reported LLM use spent 21.9% less time completing the task (relative decrease; p = 0.002) than those who did not, pooling across experimental conditions. Using a simple linear model with log-transformed time spent as the outcome (see Appendix), we further analyzed this relative change across different proxy metrics for LLM use and experimental conditions (see Table 2). Across proxy metrics, we found that the time reduction is never statistically significant when hurdles are employed. When only requests are applied, however, results differed: The relative decreases were not statistically significant considering the classifier but remained statistically significant considering self-reports. We hypothesize this may be because workers who use LLMs but lightly edit their output spend more time on the task and are less likely to self-report use.
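
To make the log-transformed specification concrete, the sketch below regresses log time-on-task on a (hypothetical) LLM-use flag; exponentiating the coefficient and subtracting one recovers the relative change in time spent.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: seconds spent on the task and a self-reported LLM-use flag.
df = pd.DataFrame({
    "seconds":  [180, 420, 150, 510, 200, 390, 240, 450],
    "used_llm": [1, 0, 1, 0, 1, 0, 1, 0],
})
fit = smf.ols("np.log(seconds) ~ used_llm", data=df).fit()
relative_change = np.exp(fit.params["used_llm"]) - 1  # e.g., -0.22 means ~22% less time
print(relative_change)
```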

Table 2.  Relative differences in time spent between instances where we detected LLM use and where we did not. We report differences across the nine experimental conditions and determine LLM use using three methods. (Note that time spent is one of our heuristics for detecting ChatGPT use.)

Content-level analysis.  Analyzing the text of crowd workers’ summaries, we found that summaries labeled as synthetic by the classifier were significantly more “homogeneous” than those labeled as human, according to a previously proposed homogeneity metric20 and BERTScore30 (details in Appendix). We estimated a homogeneity score of 45.6% (43.2%, 48.2%) for synthetic texts, vs. 27.1% (26.8%, 27.4%) for human texts, and a BERTScore of 91.4 (91.0, 91.8) for synthetic texts vs. 87.4 (87.2, 87.3) for human texts.
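
As a rough sketch of this kind of content-level analysis, the snippet below computes the mean pairwise BERTScore30 F1 over a toy set of summaries as a homogeneity proxy; the paper’s dedicated homogeneity metric20 is not reproduced here.

```python
from itertools import combinations
from bert_score import score

# Toy set of summaries; the mean pairwise BERTScore F1 serves as a rough
# homogeneity proxy (higher = more similar to one another).
summaries = [
    "The paper shows that summaries lose detail as they are re-shared.",
    "The study finds that re-shared summaries lose detail over time.",
    "An experiment on message distortion in information cascades.",
]
idx_pairs = list(combinations(range(len(summaries)), 2))
cands = [summaries[i] for i, _ in idx_pairs]
refs = [summaries[j] for _, j in idx_pairs]
P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(float(F1.mean()))
```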

In the original study whose human summaries we reused,15 the authors measured the retention of keywords from the original abstract corresponding to essential information, finding it to be highly correlated with human evaluations of quality. Using this metric as a proxy for quality, we found that summaries labeled as synthetic preserved more keywords (40.1% [36.9%, 43.2%]) than summaries labeled as human (31.2% [29.9%, 32.6%]). We found a similar effect when using self-reports and high-precision heuristics instead of the classifier’s labels.
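
A simple sketch of such a keyword-retention proxy, with an invented keyword list:

```python
def keyword_retention(summary: str, keywords: list[str]) -> float:
    # Fraction of the abstract's essential keywords that survive in the summary.
    text = summary.lower()
    kept = sum(1 for kw in keywords if kw.lower() in text)
    return kept / len(keywords)

print(keyword_retention(
    "Crowd workers often delegate summarization to language models.",
    ["crowd workers", "summarization", "language models", "validity"],
))  # 0.75
```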

But how do the interventions affect the above content-level metrics? We repeated the analysis shown in the figure but using homogeneity, BERTScore, and keyword retention as outcomes (see Section G.1 in the Appendix for details). We found no significant effect of the interventions on content-level outcomes, with one exception: Directly requesting workers not to use LLMs decreased keyword retention by 5.8% (p = 0.003). We hypothesize that the reduction in keyword retention may be caused by crowd workers’ hesitancy to use extractive summarization when prompted not to use LLMs. (Results were similar when considering only summaries classified by us as being human-made.)

Discussion

The results suggest that LLMs pervade current crowd work on text-production tasks. Although strict mitigation approaches reduced LLM use by nearly 50%, they could not prevent it entirely. While text-production tasks are particularly suitable for LLM use, we argue that these findings are broadly applicable to crowdsourcing, as crowd workers will likely use LLMs on other kinds of tasks (for example, image segmentation, multiple-choice questions) in the near future, if they are not already doing so. There are several reasons for this. First, the models are increasingly capable of doing other tasks; for instance, while writing this article, ChatGPT was updated to receive images as input,19 which could allow its use on tasks such as image tagging or classification.29 Second, crowd workers have incentives to use them; even in the absence of LLMs, there are widespread attempts to “game the system” to make money, to the extent that an extensive body of work has been developed around ensuring the quality of responses.7 Third, crowd workers, who are often tech-savvy5,12 and frequently rely on plug-ins and Web services to boost their performance and earnings,9,16 are capable of integrating these models into their pipelines. Even without coding, tools to automate ChatGPT use are plentiful (for example, IFTTT, Zapier).

Synthetic data may harm the utility of crowd work platforms, as researchers often care about human behavior or preferences; for example, the authors of the paper whose human summaries we borrowed15 wanted to know how people summarized, instead of merely obtaining good summaries. While some preliminary studies suggest that synthetic data may capture certain viewpoints,2 it still often fails to do so, and research using crowd work may inadvertently capture the behavior and preferences of LLMs, not humans. Even if LLMs can capture average behavior or preferences, the homogeneity of their responses may result in losing the long tail of human behavior and preferences that is vital to researchers24 and, according to recent work, important to training capable LLMs.23 In that context, our results indicating that LLM-generated summaries are more homogeneous than human-generated summaries suggest that LLM use may be particularly harmful when the goal of crowdsourcing is to capture the diversity of human preferences, behaviors, or opinions.

To foresee the potential harm of LLM use in crowdsourcing, one may consider a topic that has received increased attention in the social sciences in the past few years: climate change.8,26 Social scientists often use crowdsourcing to study attitudes toward climate change.4,25 Yet, recent work has shown that, when prompted to answer multiple-choice questions, LLMs’ opinions are better aligned with liberal, wealthy individuals and exhibit pro-environmental bias.13,22 Therefore, it may be expected that LLM use could harm the validity of social scientists’ studies on behavioral interventions and assessments of global stances toward climate change. We stress that this is not particular to climate change: Social scientists use crowdsourcing for various topics,3 and LLMs are non-representative of the samples of interest in various ways.13,22

We must be careful not to conflate LLM use with cheating. Depending on the study, it could be beneficial if LLMs assist crowd workers. Further, as LLMs become intertwined with how people write and accomplish everyday tasks, the distinction between “synthetic” and “human” data may blur. For example, is text generated with the help of a spellchecker synthetic? Thus, we expect both the threshold for concern and what “LLM use” even means to shift dramatically over the coming months and years, as LLMs become more ubiquitous in everyday productivity tasks. In that context, a fruitful future direction is to explore the landscape of how crowd workers use LLMs. There are many ways of integrating these models into crowd workers’ workflows, and different approaches may affect downstream research output differently.

We found that stricter mitigation approaches can significantly reduce LLM use. These measures may, however, backfire when detection is critical. Stricter measures may limit the number of participants using LLMs, but they may also make those participants more reluctant to admit ex post that they used them, or make their use harder to detect, because the prevention measure removes a key indicator of LLM use. For example, disabling copy-pasting makes LLMs harder to use, limiting use, but it also deprives researchers of copy-pasting as a feature for detecting who used them. Further, mitigation approaches can reduce overall response quality: As we found empirically, workers explicitly told not to use LLMs produced lower-quality summaries.

LLM-based tools and LLM users are co-evolving in ways that limit the temporal validity of our specific findings and estimates. In the past few months alone, tools have evolved to interpret images and to call LLMs without the need to copy-paste (for example, by simply selecting text). This does not diminish the value of our work; on the contrary, it is critical to establish baselines and ongoing measurements as this co-evolution progresses, and our work establishes such baselines. Further, we are confident that our high-level interpretations and guidance will translate across this evolution, and we hope this work helps establish a regularly updated program of study serving crowd work platforms and researchers.

To conclude, in light of our findings, we propose practical guidelines for researchers to use crowdsourcing in the era of large language models. First, researchers should assess the impact of LLMs on their research by asking themselves: Is the point of crowdsourcing to obtain data representative of human behavior, preferences, and opinions? And if so, is capturing the diversity of these human responses important? We argue that crowdsourcing will be most affected when the answer to both questions is yes, as we found that LLM responses differ from human responses and are more homogeneous. Second, if large language models are likely to harm the utility of crowdsourcing, our findings indicate that researchers can actively diminish LLM use by requesting that workers not use them and creating hurdles that decrease the incentives for using them. Notably, hurdles should be adapted as models become more capable and better integrated into people’s lives.

    References

    • 1. Akiba, T. et al. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD Intern. Conf. on Knowledge Discovery & Data Mining. ACM, (2019), 2623–2631.
    • 2. Argyle, L.P. et al. Out of one, many: Using language models to simulate human samples. Political Analysis 31, 3 (2023), 337–351.
    • 3. Bohannon, J. Mechanical Turk upends social sciences. Science 352, 6291 (2016).
    • 4. Bouman, T., Steg, L., and Zawadzki, S.J. The value of what others value: When perceived biospheric group values influence individuals’ pro-environmental engagement. J. of Environmental Psychology 71, 101470 (2020).
    • 5. Brewer, R., Morris, M.R., and Piper, A.M. Why would anybody do this?: Older adults’ understanding of and experiences with crowd work. In Proceedings of the 2016 CHI Conf. on Human Factors in Computing Systems. ACM, (2016), 2246–2257.
    • 6. Card, D. and Smith, N.A. The importance of calibration for estimating proportions from annotations. In Proceedings of the 2018 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, (2018), 1636–1646.
    • 7. Daniel, F. et al. Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions. ACM Computing Surveys 51, 1 (2018), 1–40.
    • 8. Dietz, T., Shwom, R.L., and Whitley, C.T. Climate change and society. Annual Rev. of Sociology 46 (2020), 135–158.
    • 9. El Maarry, K., Milland, K., and Balke, W-T. A fair share of the work? The evolving ecosystem of crowd workers. In Proceedings of the 10th ACM Conf. on Web Science. ACM, (2018), 145–152.
    • 10. Gilardi, F., Alizadeh, M., and Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120 (2023).
    • 11. Gray, M.L. and Suri, S. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Eamon Dolan Books, (2019).
    • 12. Guess, A.M. and Munger, K. Digital literacy and online political behavior. Political Science Research and Methods 11, 1 (2023), 110–128.
    • 13. Hartmann, J., Schwenzow, J., and Witte, M. The political ideology of conversational AI: Converging evidence on ChatGPT’s pro-environmental, left-libertarian orientation. arXiv preprint arXiv:2301.01768, (2023).
    • 14. Hausman, J.A., Abrevaya, J., and Scott-Morton, F.M. Misclassification of the dependent variable in a discrete-response setting. J. of Econometrics 87, 2 (1998), 239–269.
    • 15. Ribeiro, M.H., Gligoric, K., and West, R. Message distortion in information cascades. In The World Wide Web Conf. ACM, (2019), 681–692.
    • 16. Irani, L.C. and Silberman, M.S. Turkopticon: Interrupting worker invisibility in Amazon Mechanical Turk. In Proceedings of the SIGCHI Conf. on Human Factors in Computing Systems. ACM, (2013), 611–620.
    • 17. Luo, Z., Xie, Q., and Ananiadou, S. ChatGPT as a factual inconsistency evaluator for text summarization. arXiv preprint arXiv:2303.15621, (2023).
    • 18. Meyer, B.D. and Mittag, N. Misclassification in binary choice models. J. of Econometrics 200, 2 (2017), 295–311.
    • 19. OpenAI. ChatGPT can now see, hear, and speak. (Sep. 25, 2023); https://openai.com/index/chatgpt-can-now-see-hear-and-speak/
    • 20. Padmakumar, V. and He, H. Does writing with language models reduce content diversity? In Proceedings of the 12th Intern. Conf. on Learning Representations (2024).
    • 21. Salganik, M.J. Bit by Bit: Social Research in the Digital Age. Princeton University Press, (2019).
    • 22. Santurkar, S. et al. Whose opinions do language models reflect? In Proceedings of the 40th Intern. Conf. on Machine Learning. PMLR, (2023).
    • 23. Shumailov, I. et al. Model dementia: Generated data makes models forget. Nature 631 (2024), 755–759; 10.1038/s41586-024-07566-y
    • 24. Song, Z. et al. Reward collapse in aligning large language models. arXiv preprint arXiv:2305.17608, (2023).
    • 25. Sparks, A.C. Climate change in your backyard: When climate is proximate, people become activists. Frontiers in Political Science 3, 666978 (2021).
    • 26. Steg, L. Psychology of climate change. Annual Rev. of Psychology 74 (2023), 391–421.
    • 27. Veselovsky, V., Ribeiro, M.H., and West, R. Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. arXiv preprint arXiv:2306.07899, (2023).
    • 28. Wang, L. et al. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, (2022).
    • 29. Yang, Z. et al. The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, (2023).
    • 30. Zhang, T. et al. BERTScore: Evaluating text generation with BERT. In Intern. Conf. on Learning Representations, (2020).
