The need for human-generated labels to train machine learning (ML) models is a key driver behind the rise of crowdsourcing5: outsourcing simple data annotation tasks to non-experts, who are available 24/7 and are paid per task, via dedicated platforms. Over the last 15 years, much research on crowdsourcing and human computation has been conducted, including methods to control the quality, cost, and speed of data labeling. Large corporations use crowdsourcing for tasks such as image annotation, relevance judgment, and content moderation. After many years of crowdsourcing being used in production, researchers envisioned a future of sustainable crowd work with long-term opportunities and managerial roles within hierarchical structures.7 In hindsight, crowd work has not changed much over the years, and it has increasingly been criticized as an exploitative process in which the labor behind modern ML-driven AI applications is hidden.1
Crowd work has always been technology-supported, with professional workers making use of scripts and tools to improve productivity.6 Now, with the wide availability of generative artificial intelligence (GenAI) tools such as large language models (LLMs), we ask a crucial question: How will workers use these tools? Will this disruptive technology eliminate the need for crowd work, or will it fundamentally reshape its nature? Considering, in particular, the case of data annotation, we take the stance that the incorporation of GenAI into crowd work is unavoidable, and that platforms, as well as requesters, must take steps to promote its responsible use. We explore data annotation specifically because it is the very process through which GenAI models have been built and continue to learn. As will be discussed, while both requesters and workers can benefit from the use of GenAI for annotation, unintended consequences (for example, labels that are too homogeneous or of low quality) must be anticipated and treated with care.
Can GenAI Replace All Crowdsourced Annotations?
Internet corporations currently employ armies of annotators. One experimental study suggested that a large proportion of workers on Amazon Mechanical Turk (MTurk) leverage ChatGPT in text summarization tasks.10 Furthermore, MTurkers have self-reported using GenAI in their tasks.2 While this is not surprising, it raises the question of whether crowdsourcing is still needed. Perhaps GenAI will replace humans in tasks such as moderating online content (especially content that is potentially harmful to people) and creating labeled data to train ML models, resulting in widespread AI-Sourcing.
Current LLMs work well for tasks such as text summarization, but it is not yet clear whether, for more subjective tasks (for example, sentiment classification, hate speech detection, or image labeling4), LLMs are ready to replace humans. With respect to using ChatGPT broadly in annotation tasks, Kocoń and colleagues described it as a “Jack of all trades … but master of none.”8 Evaluating it on 25 tasks, they concluded that it performs decently overall, but can be hit-or-miss on particular tasks relative to bespoke solutions. Beyond the discussion of LLM capabilities, it is important to explore scenarios in which humans work together with GenAI to complete tasks more effectively. But then, requesters may be unaware that annotators are using these tools as a co-pilot. Do they need to know? Should there be an option to collect natively human labels rather than hybrid human-machine annotations? Or is what matters the quality of the resulting labels, or of the predictions made by the models trained with such labels? The research strongly suggests a need for humans in the loop; however, the nature of human computation and crowdsourcing has fundamentally changed.
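If platforms were to offer requesters the choice between natively human and hybrid labels, one concrete mechanism is to record provenance alongside every label. The following is a minimal, hypothetical sketch; the field names and provenance categories are our own illustration, not an existing platform API.

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    """How a label was produced (illustrative categories)."""
    HUMAN = "human"                 # worker alone, no AI assistance
    HUMAN_AI_ASSISTED = "human+ai"  # worker used a GenAI co-pilot
    AI_HUMAN_AUDITED = "ai+audit"   # GenAI label verified by a worker
    AI = "ai"                       # GenAI alone, no human involved

@dataclass
class Annotation:
    item_id: str
    label: str
    worker_id: str | None  # None for fully automated labels
    provenance: Provenance

def natively_human(annotations: list[Annotation]) -> list[Annotation]:
    """Filter for labels produced without any GenAI assistance."""
    return [a for a in annotations if a.provenance is Provenance.HUMAN]
```

With such metadata in place, the question of whether requesters “need to know” becomes a filtering decision rather than a guessing game.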
Banning or Enabling the Use of GenAI in Crowdsourcing?
Early stances toward workers using GenAI have diverged. We have seen that MTurk workers have started to use ChatGPT.2 Prolific, on the other hand, does not allow workers to use LLMs, informing them: “please don’t use AI assistance tools or large language models (LLMs) when completing a study.”a In the end, the market may decide on the “best approach,” with requesters choosing the platform that suits their needs. However, with such divergent approaches to the management of GenAI in crowd work, we risk creating a digital divide: workers who are not allowed to use it may, over time, become disadvantaged. Moreover, even if a platform bans GenAI, enforcing the ban is likely to prove futile. Automatic detection of AI-generated work is an “arms race,” as the technology is improving rapidly.
To date, platforms have followed various approaches to managing GenAI tools. Ultimately, however, embracing human-AI collaboration in crowd work is the only viable path. Rather than a future in which annotation tasks are taken over by GenAI, we envision a collaborative environment in which workers may be more empowered and efficient, though this may signify a shift in the nature of the tasks and the effort required. We can consider a spectrum between the extremes of fully AI-driven and fully human annotation tasks. For instance, text summarization tasks might be completed by GenAI alone, common-sense knowledge assessment tasks could be done collaboratively, while the validation of AI annotations should be completed by humans.
If this collaborative approach is widely adopted, it becomes critical to train humans on the risks and opportunities of using these tools as a co-pilot. Increasing literacy on the impact of GenAI on human decisions, beyond in-depth knowledge of annotation guidelines, will be essential. It is not obvious what form this training should take, but effort toward increasing GenAI literacy in the community of human annotators, as well as among data requesters, is needed.
Embracing Human-AI Collaboration
We anticipate that humans will move toward the role of auditors of AI-generated annotations, without this signifying the end of human-oriented tasks. On the contrary, we envision a human-AI collaborative spectrum3 that encompasses both AI-supported human annotations (HA-AI) and human-audited AI annotations (AIA-HV). Figure 1 outlines the potential forms of collaboration.
Humans can assume the role of annotators, providing pure human annotations (HA). As GenAI advances, humans will shift to auditors of AI-generated annotations, verifying labels and/or validating their correctness. The final step in this evolution is for humans to become observers, shifting to pure AI annotations (AIA) and even AI validations/verifications (AIV). Currently, workers are far from reaching the role of observers, but they do experience different forms of collaboration with GenAI. They can receive support from AI tools during annotation (HA-AI), for example, a shortened list of possible tags for an image annotation task. They might act as both annotators and auditors within the same task (HA-HV), for example, performing sentiment analysis on a paragraph and verifying an AI-generated sentiment analysis of the same paragraph. Or they might move to the role of pure auditors, verifying/validating AI annotations (AIA-HV), for example, by validating the correctness of AI labels. This collaborative spectrum requires two types of effort from workers: producing annotations and auditing AI annotations (see Figure 2). We consider the roles of validator and verifier similar: clearly different from the current role of annotator, but not yet at the stage of observer. For this reason, we discuss these two roles together under the general role of auditor in what follows.
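To make the spectrum concrete, the collaboration modes named above can be encoded as a small taxonomy. This sketch is purely illustrative; it simply mirrors the acronyms used in the text and our reading of their ordering.

```python
from enum import Enum

class CollaborationMode(Enum):
    """Points on the human-AI collaborative spectrum (acronyms from the text)."""
    HA = "pure human annotation"
    HA_AI = "human annotation supported by AI suggestions"
    HA_HV = "human both annotates and verifies AI annotations"
    AIA_HV = "AI annotates; human verifies/validates (auditor)"
    AIA = "pure AI annotation (human as observer)"
    AIV = "AI annotation with AI validation/verification"

# Ordered roughly by decreasing human production effort: moving right,
# humans produce less and audit more, until they merely observe.
SPECTRUM = [
    CollaborationMode.HA, CollaborationMode.HA_AI, CollaborationMode.HA_HV,
    CollaborationMode.AIA_HV, CollaborationMode.AIA, CollaborationMode.AIV,
]
```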
Figure 2 analyzes how the balance between human and AI effort may evolve over time, providing examples of several types of annotation tasks. In particular, it considers the human effort dedicated to either original annotation or auditing when common tasks are facilitated by GenAI. We first picture a transition toward GenAI as a co-pilot for humans, and then a gradual shift in the human role: less original work to complete, but more responsibility as an auditor.
Tasks such as sentiment classification or image annotation traditionally require high human effort. For these tasks, humans may be required both to create original annotations and to audit samples of AI-generated annotations (HA-HV). Humans may need to audit low-confidence annotations, intervene on potentially biased AI outcomes, and handle higher-risk topics and decisions. Even in cases where human involvement is required only for auditing (AIA-HV), the effort and responsibility of the auditors is expected to be substantial, as shown in Figure 2. For tasks such as content creation, human effort is drastically reduced by the effective use of GenAI. This, however, underscores the necessity of human auditing tasks that demand increased responsibility on the side of the auditor. The actors enabling this type of human-AI collaboration must be mindful of this, ensuring that workers are trained and qualified in the use and auditing of AI-generated annotations. As GenAI advances, humans will assume tasks that carry more responsibility. This requires a reassessment of worker training, as well as of task design, to include tasks where GenAI alone does not perform well enough.
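As an illustration of this division of labor, consider how an annotation pipeline might decide when an AI-generated label must be routed to a human auditor. This is a hypothetical sketch based on the criteria just described; the fields (confidence, bias flag, risk flag) and the threshold are assumptions for illustration, not a prescribed design.

```python
from dataclasses import dataclass

@dataclass
class AILabel:
    item_id: str
    label: str
    confidence: float      # model's self-reported confidence in [0, 1]
    bias_flag: bool        # raised by an upstream bias/toxicity screen
    high_risk_topic: bool  # for example, health or moderation decisions

def needs_human_audit(ann: AILabel, min_confidence: float = 0.9) -> bool:
    """Route an AI label to a human auditor (AIA-HV) when it is
    low-confidence, potentially biased, or tied to a higher-risk decision."""
    return (ann.confidence < min_confidence
            or ann.bias_flag
            or ann.high_risk_topic)

batch = [
    AILabel("img-001", "cat", confidence=0.98, bias_flag=False, high_risk_topic=False),
    AILabel("txt-042", "hate_speech", confidence=0.71, bias_flag=True, high_risk_topic=True),
]
audit_queue = [a for a in batch if needs_human_audit(a)]  # only "txt-042"
```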
Crowdsourcing or AI-Sourcing?
There will always be a role for humans in AI pipelines, although GenAI is disrupting the crowdsourcing environment as we know it. Platforms need to set standards, supported by guidelines and templates, for its use. Workers have used tools and scripts for more than 10 years; GenAI can thus be viewed as a new tool they will add to their stack. This disruption does not mean the end of crowdsourcing, but there will be adjustments, including reevaluating quality control mechanisms and payment schemes. While auditing AI output might seem like an “easy” task requiring a yes-or-no answer, it may actually be a time-consuming process. These changes will have implications for the stakeholders in the current ecosystem: platforms, crowd workers, infrastructure providers, and customers. As GenAI becomes deployable locally rather than only in the cloud, it will become easier to perform annotation without crowd platforms. While running internal annotation pipelines will be an option, the need for human oversight and hardware will still make platform-based data annotation pipelines viable. Even now, many cloud providers offer fully managed services that include foundation models, MLOps, and human label collection to build, train, and deploy custom ML models at scale.
Involving humans helps us avoid the vicious loop of “AI training AI,” which poses significant risks. In fact, commercial search engines have started to use LLMs to generate relevance judgments and to evaluate and train retrieval models.9 After a few iterations of training AI on AI-generated labels, the quality of a model’s predictions could quickly drop (for example, due to a lack of variability); there is also a risk of reinforcing existing biases if human oversight is no longer required. The adoption of GenAI in annotation tasks will disrupt not only the way these tasks are completed but also the labeling outcome, by producing less diverse labels. The expertise and social knowledge of humans should be harnessed to ensure quality and appropriate diversity, producing labels that reflect what a broad spectrum of humans “knows.” Training AI on GenAI labels creates risks for the future of AI, where systems may be optimized for accuracy yet carry invisible issues of fairness and impartiality. Thus, there is a continued need for human oversight.
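One simple way to watch for the homogenization risk described above is to track the diversity of the label distribution across training rounds, for example via Shannon entropy: a collapse in entropy is a warning sign that labels are becoming too uniform. A minimal, illustrative sketch follows (the toy data is invented).

```python
import math
from collections import Counter

def label_entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of a label distribution; lower values
    mean less diverse, more homogeneous labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy illustration: human labels vs. labels after rounds of "AI training AI".
human_labels = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
ai_loop_labels = ["pos", "pos", "pos", "pos", "neg", "pos", "pos", "pos"]
print(label_entropy(human_labels))    # ~1.56 bits
print(label_entropy(ai_loop_labels))  # ~0.54 bits: diversity has collapsed
```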
Platforms should embrace GenAI as co-pilots, and thus they should: incorporate these tools into tasks when possible; take responsibility for training workers and requesters in their responsible use; provide a qualification system that recognizes worker expertise in the use of these tools; and promote transparency in the use of GenAI, so that trusted workers are identifiable. Such an approach will preserve the benefits of GenAI. In summary, we see a future where the data annotation ecosystem is significantly different, becoming mostly automated but still requiring humans in the loop to ensure the quality and diversity of AI-generated labels.