The presence of bias in data has prompted extensive research into its impact on machine learning (ML) models and data-driven decision-making systems.2 This research has focused on the fairness of decisions taken by models trained with biased data, and on designing methods that increase the transparency of automated decision-making processes so that possible bias issues can be easily spotted and "fixed" by removing bias.
Recent approaches in the literature first aim to understand the cause of the problem (for example, a subset of the population being underrepresented in the training dataset) and then propose and evaluate an ad hoc intervention to reduce or remove the bias from the system. One example is selecting which additional training data items to label in order to rebalance the dataset and increase equality (that is, a balanced representation of classes) rather than equity (that is, overrepresenting the disadvantaged subset of the population).
Examples of research on bias removal include work on removing bias from learned word embeddings. Bolukbasi et al.3 defined a methodology for modifying an embedding representation to remove gender stereotypes. This allows researchers to remove certain biases (for example, the association between the words 'receptionist' and 'female') while preserving other gender-related information (for example, the association between the words 'queen' and 'female'). Sutton et al.24 observed how gender bias, which is often present in professions, is also reflected in word embeddings and proposed to remove gender bias from embeddings by using a projection function. Schick et al.22 proposed an algorithm that reduces the probability of a language model generating problematic text. These are examples of methods that enable researchers and practitioners to intervene, subjectively, by making individual choices on how and where bias should be removed from models created from bias-reinforcing human-generated datasets. Additional examples include the removal of bias from prediction datasets, where researchers decided how to rebalance datasets to increase fairness across groups through data augmentation,17 feature augmentation,5 or adjustments to the metrics that measure bias.13 All these approaches include personal choices made by the researchers on how to detect bias and what to do with it.
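To make the idea of a projection-based intervention concrete, the following is a minimal sketch, not the method of any of the cited papers. It assumes a bias direction can be approximated by averaging the normalized difference vectors of a few definitional word pairs (the cited work uses more elaborate estimation), and it removes the component of a word vector along that direction. The three-dimensional vectors and all function names are made up for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def bias_direction(pairs):
    """Approximate a bias direction by averaging normalized pair differences.

    This averaging is a simplifying assumption made for this sketch.
    """
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for a, b in pairs:
        diff = normalize([x - y for x, y in zip(a, b)])
        total = [t + d for t, d in zip(total, diff)]
    return normalize(total)


def remove_component(vector, direction):
    """Subtract the projection of vector onto a unit-length direction."""
    scale = dot(vector, direction)
    return [x - scale * d for x, d in zip(vector, direction)]


# Toy embeddings, invented for illustration only.
she = [1.0, 0.2, 0.0]
he = [-1.0, 0.2, 0.0]
receptionist = [0.6, 0.5, 0.3]

g = bias_direction([(she, he)])
debiased = remove_component(receptionist, g)
```

Which words are passed to `remove_component` and which are left untouched (for example, 'queen') is exactly the kind of subjective choice discussed above; the code does not make that decision, the person calling it does.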
In contrast to this body of work on bias removal, in this column we propose a different perspective on the problem by introducing the task of bias management rather than removal. The rationale is that bias is inevitably present in human annotations. Thus, making a deliberate choice to remove bias, by deploying certain interventions or post-processing collected labels in a certain way, would itself introduce bias, as there may be multiple ways in which biased labels can be curated. Instead, we propose an alternative approach to managing bias in data-driven pipelines that increases data and bias transparency. This allows bias to be surfaced to end users and human decision makers, empowering them to take reparative actions and interventions themselves by leveraging common sense and contextual information. It also shifts responsibility away from data workers and system developers, who otherwise risk biasing the results according to their own perspective.
Consider the case where, to create an image dataset, a data scientist must collect manual annotations or label a set of images; it is reasonable to assume that such a dataset may then be used to train an automatic system to independently perform a specific object identification task in fresh, unseen images. Suppose a user issues the gender-neutral query "nurse" to an image search engine trained on this dataset. The user may see that the vast majority of images on the search engine result page are of female nurses. While this might appear to indicate that the ranking algorithm of the search engine has a gender bias issue, as it would be qualified by the vast majority of existing bias/fairness measures,1 it might also reflect the real gender distribution of people employed in this profession; that is, female nurses are statistically more frequent than male nurses. While a traditional approach might try to resolve this bias by forcing the algorithm to show a balanced result set with male and female nurses in similar percentages, we argue that an alternative, less-invasive algorithmic approach might be more useful to the end users.
The search engine might display on the result page a set of additional metadata, which may be useful to the user to have a complete understanding of the magnitude of bias in the search result set; for example, the search engine might show a label indicating "the search results appear to be highly imbalanced in terms of gender: in the top 1,000 results, 870 of them are of female nurses and 130 of them are of male nurses; this is however similar to the gender distribution in the nursing profession where official data from your country government shows that 89% of the nursing workforce in 2016 was female." This information makes the end user more informed and aware of the statistical distribution of the search results with respect to a specific group (in this case, gender) also increasing gender bias literacy and the understanding of current societal norms. Then, ideally, the user should be asked by the system if they would like to maintain the current search result or whether they would prefer to inspect the results after a fairness policy of their choice is applied to the data (for example, forcing the number of male and female nurses to be approximately the same in the search result list). These, or similar, remarks can be transferred from search systems to recommender systems.
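The scenario above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a description of any real search engine: the result records, the `gender` field, the `tolerance` threshold, and both function names are hypothetical. The first function computes the metadata that could back the label shown to the user; the second applies one possible fairness policy (interleaving groups) only when the user asks for it.

```python
from collections import Counter, defaultdict
from itertools import zip_longest


def describe_bias(results, attribute, reference_shares, tolerance=0.05):
    """Summarize the group distribution of a result list against reference data.

    `tolerance` (an assumed parameter) is the maximum absolute difference in
    share for the results to count as similar to the reference distribution.
    """
    counts = Counter(item[attribute] for item in results)
    total = len(results)
    observed = {group: n / total for group, n in counts.items()}
    similar = all(
        abs(observed.get(group, 0.0) - share) <= tolerance
        for group, share in reference_shares.items()
    )
    return {
        "counts": dict(counts),
        "observed_shares": observed,
        "reference_shares": reference_shares,
        "matches_reference": similar,
    }


def interleave_by_group(results, attribute):
    """Opt-in policy: alternate groups while keeping the original order within each."""
    buckets = defaultdict(list)
    for item in results:
        buckets[item[attribute]].append(item)
    merged = []
    for row in zip_longest(*buckets.values()):
        merged.extend(item for item in row if item is not None)
    return merged


# Synthetic top-1,000 result list matching the counts in the example (870 / 130).
results = [
    {"id": i, "gender": "female" if i % 100 < 87 else "male"} for i in range(1000)
]
# Reference share taken from the example label in the text (89% female).
label = describe_bias(results, "gender", {"female": 0.89, "male": 0.11})

# The default ranking is left untouched; the policy runs only on user request.
balanced = interleave_by_group(results, "gender")
```

The design choice worth noting is that `describe_bias` never modifies the ranking. The unmodified list plus its metadata is the default view, and `interleave_by_group` is just one of several policies a user could select.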
Research has looked at bias in search and at how gender-neutral queries may return gender-imbalanced results.15 Otterbacher et al.16 looked at how end users of search engines perceive biased search results. They studied how users perceive gender-imbalanced search results as compared to how they score on a scale of sexism. Their results show how different people perceive the results differently, thus confirming that a one-size-fits-all bias removal solution would not be appropriate for all users. A similar approach would apply to content in social media feeds.
We argue that employing an explicit but non-transparent bias removal intervention might be harmful to the user. In fact, if the task of the end user in our example scenario was to investigate something related to or influenced by the percentage of male and female nurses, the implicit application of the fairness policy as decided by the system designer might leave the user with an inaccurate perception of the real gender distribution in the nursing profession. Taking this concept to the extreme, the user might even erroneously think, somewhat paradoxically, that gender bias is not present in the nursing profession, and that male and female nurses are equally present in this job market. This triggers ethical questions related to how we should manage bias, which we discuss later.
A related study by Silberzahn et al.23 looked at how groups of data analysts reach different conclusions when working on the same dataset and trying to answer a given question (that is, 'Are soccer referees more likely to give red cards to dark-skin-toned players than light-skin-toned players?' in this specific piece of research). They found that different groups made very different observations, which led the authors to conclude that when analyzing complex data it may not be possible to avoid reaching diverse conclusions. This example supports the need for something different from a one-size-fits-all approach to bias.
Next, we look at some examples of bias in human annotations. These serve to explain how human annotations, which are often used to train and evaluate the performance of ML models, carry bias and stereotypes from the human annotators providing the labels.
Crowdsourcing is a popular way to collect input from human annotators and contributors. A common example of a successful crowdsourcing project is Wikipedia and its related projects. Wikipedia is known to have a gender-biased population of editors where the majority are men. As a research example, Sarasua et al.21 studied participation bias in Wikidata, showing not only that participation is very skewed, with very few editors contributing most of the content and a long tail of many editors who contribute little, but also that the way contributions are made by these two groups (that is, the head and the tail of the distribution) varies substantially.
Such a long-tailed distribution of participation in a crowdsourcing project has been shown to be common also in paid micro-task crowdsourcing,6 which is often used to collect labels to train supervised ML models. This means very few human annotators end up contributing the majority of the labels in the dataset. This common behavior pattern is also known as Nielsen's 90-9-1 participation rule, which states that in this type of project 90% of users are often observers, 9% are sporadic contributors, and 1% account for most of the contributions. We will discuss how this may be problematic from a data bias point of view.
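How concentrated a labeling effort is can be checked directly from an annotation log. The sketch below assumes the log is simply a list with one annotator identifier per collected label; the function name and the synthetic log are hypothetical and only illustrate the measurement.

```python
from collections import Counter


def top_share(annotator_ids, top_fraction):
    """Fraction of all labels contributed by the most active annotators.

    `annotator_ids` holds one entry per label; `top_fraction` selects the
    most active share of annotators (for example, 0.01 for the top 1%).
    """
    counts = sorted(Counter(annotator_ids).values(), reverse=True)
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / sum(counts)


# Synthetic log: 1 very active annotator, 9 occasional ones, 90 one-off ones.
log = ["a0"] * 500
for i in range(1, 10):
    log += [f"a{i}"] * 20
for i in range(10, 100):
    log += [f"a{i}"]

share_top_1_percent = top_share(log, 0.01)
share_top_10_percent = top_share(log, 0.10)
```

In this made-up log, a single annotator accounts for roughly two thirds of all labels, so any systematic tendency of that one person would dominate the resulting dataset.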
Thus, the data stored in crowdsourced knowledge graphs such as Wikidata is influenced by the (imbalanced) population of contributors. This then leads to data imbalance in terms of class representation. Luggen et al.12 have shown how to estimate class completeness in Wikidata and observed how certain classes may be more complete than others.
A similar approach may be used to measure the gender balance of entities in a knowledge graph such as Wikidata (for example, how many female astronauts are there in the dataset?). Once the measurements are done, intervention actions may be taken. For example, Wikidata editors may decide to only contribute new entities for the class Astronauts that are female until a balance is achieved between genders. Alternatively, editors may focus on having equal completeness rates for all genders (for example, 80% of all female astronauts and 80% of male astronauts rather than having a higher completion rate for one and lower for the other gender). While measurements can be done algorithmically, intervention decisions are human-made and who makes the adjustment decisions is another source of bias reflected in the underlying data.
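The measurement half of this process is easy to express in code, while the intervention remains a human decision. The sketch below assumes that an estimate of the true class size per gender is already available from some completeness estimator; all counts are invented for illustration and do not describe Wikidata. It computes the completeness rate per group and how many entities each group would need to reach the highest current rate.

```python
import math
from fractions import Fraction


def completeness(observed, estimated_total):
    """Completeness rate per group, kept as exact fractions."""
    return {g: Fraction(observed[g], estimated_total[g]) for g in observed}


def additions_for_equal_rate(observed, estimated_total):
    """Entities to add per group so that every group reaches the best current rate."""
    rates = completeness(observed, estimated_total)
    target = max(rates.values())
    return {
        g: max(0, math.ceil(target * estimated_total[g]) - observed[g])
        for g in observed
    }


# Invented numbers: entities present in the graph vs. estimated real-world totals.
observed = {"female": 40, "male": 450}
estimated_total = {"female": 80, "male": 500}

rates = completeness(observed, estimated_total)
needed = additions_for_equal_rate(observed, estimated_total)
```

Choosing equal completeness rates as the target, rather than equal absolute counts, is itself one of the human-made decisions the paragraph above describes; a different policy would need a different target function.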
Training supervised ML models with this kind of unbalanced training data is a typical cause of Unknown Unknown (UU) errors. These are errors made with high model confidence, showing that the model is unaware of the possibility of making classification mistakes. UUs often appear for classes that are poorly represented in the training data and may lead to fairness issues, where certain segments of the population that are underrepresented in the training data systematically receive more wrong decisions (for example, a minority population in loan decisions). While humans can be deployed to identify UUs,9 the common fix to the problem is the collection of further labels or the augmentation of existing training data for the underrepresented classes.
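Given predictions for which the correct answer is known, UUs can be flagged as wrong predictions made with high confidence. The sketch below assumes such ground truth is available (in practice this is where human effort is needed), and the record fields, the `threshold` value, and the toy loan data are all hypothetical.

```python
from collections import Counter


def is_unknown_unknown(prediction, threshold):
    """A wrong prediction made with confidence at or above the threshold."""
    return (
        prediction["predicted"] != prediction["actual"]
        and prediction["confidence"] >= threshold
    )


def uu_rate_by_group(predictions, group_key, threshold=0.9):
    """Share of each group's predictions that are high-confidence errors."""
    totals, errors = Counter(), Counter()
    for p in predictions:
        totals[p[group_key]] += 1
        if is_unknown_unknown(p, threshold):
            errors[p[group_key]] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Toy loan decisions, invented for illustration only.
predictions = [
    {"group": "majority", "predicted": "approve", "actual": "approve", "confidence": 0.97},
    {"group": "majority", "predicted": "deny", "actual": "approve", "confidence": 0.55},
    {"group": "majority", "predicted": "approve", "actual": "approve", "confidence": 0.92},
    {"group": "majority", "predicted": "deny", "actual": "deny", "confidence": 0.88},
    {"group": "minority", "predicted": "deny", "actual": "approve", "confidence": 0.95},
    {"group": "minority", "predicted": "deny", "actual": "deny", "confidence": 0.91},
]

rates = uu_rate_by_group(predictions, "group")
```

Note that the low-confidence error in the majority group is not counted: the model signaled its uncertainty there, which is what distinguishes an ordinary error from a UU.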
Vast amounts of labeled data are necessary to train large supervised ML models. This labeled data typically comes from human annotators. Relevant research questions are related to the way human annotators annotate data, how different annotators annotate differently, and how different-than-usual data gets annotated.
La Barbera et al.11 looked at how non-expert human annotators label misinformation and observed a political bias. That is, unsurprisingly, people perceive misinformation differently based on their political background. As shown in Figure 1, human annotators who vote Republican are more generous when labeling the level of truthfulness of statements by Republican politicians (and the same is true the other way around). We can see that, systematically at all levels of truthfulness, the scores assigned by the Republican crowd (in red) are higher than those given by annotators with a Democratic background (in blue). This example shows not only how different annotators provide different labels, but also the presence of systematic bias in human annotators.
Figure 1. Crowdsourced misinformation labels by nonexperts (y-axis) as compared to expert fact-checkers (x-axis) for statements made by Republican politicians (from La Barbera et al.11).
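A systematic difference of the kind shown in Figure 1 can be checked by comparing group means at each expert-assigned truthfulness level. The sketch below uses invented scores, not the data from the cited study, and the record fields and function names are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_group(judgments):
    """Mean crowd score for every (expert_level, annotator_group) pair."""
    buckets = defaultdict(list)
    for j in judgments:
        buckets[(j["expert_level"], j["group"])].append(j["score"])
    return {key: mean(scores) for key, scores in buckets.items()}


def systematic_gap(judgments, group_a, group_b):
    """Per expert level, mean score of group_a minus mean score of group_b."""
    means = mean_score_by_group(judgments)
    levels = sorted({level for level, _ in means})
    return {
        level: means[(level, group_a)] - means[(level, group_b)]
        for level in levels
        if (level, group_a) in means and (level, group_b) in means
    }


# Invented judgments: (expert level, annotator group, crowd score).
raw = [
    (0, "republican", 2), (0, "republican", 3), (0, "democrat", 1), (0, "democrat", 2),
    (1, "republican", 3), (1, "republican", 4), (1, "democrat", 3), (1, "democrat", 3),
    (2, "republican", 5), (2, "republican", 5), (2, "democrat", 4), (2, "democrat", 5),
]
judgments = [{"expert_level": l, "group": g, "score": s} for l, g, s in raw]

gaps = systematic_gap(judgments, "republican", "democrat")
```

A gap with the same sign at every level, as in this toy example, is what separates systematic bias from ordinary disagreement, where the sign would be expected to vary.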
Another relevant study by Fan et al.4 looked at how 'unusual' data gets manually annotated. They made use of short videos depicting people washing their hands in developing and developed countries, setting up a study that controls for the socioeconomic status of the people depicted in the videos. They collected labels from human annotators based in the U.S. and observed the presence of bias in the labels. They observed, for example, how videos showing people in Africa receive more negative annotations than those from Asia, how videos with higher-income families receive more positive annotations, and how high-income households receive more descriptive annotations. This example shows how data may receive different annotations that reflect bias and stereotypes present in human annotators. These examples of annotation bias are often the result of cognitive biases (confirmation bias in particular).14
One way to track the source of bias is to collect data about the annotation process and the annotators involved. Recent research has made use of logs of human annotation behavior to study the annotation process. For example, Han et al.7 looked at editing behavior in Wikidata, and Han et al.8 at how data scientists curate data. These two studies consistently showed that results differ depending on which individual provides the labels and makes the decisions.
These studies confirm the conclusion that different human annotators would provide diverse labels for the same dataset, and that models trained on such labels would provide different decisions.18 Thus, selecting the right mix of human annotators can lead to less-biased labels and, as a result, less biased automated decision-support systems.
Rather than either removing biased information or leaving it unaddressed, we believe it is a better option to keep track of bias and surface it to the end users. This would serve the purpose of increasing transparency over the entire data pipeline, rather than having an algorithmic fix to the bias problem. The example of "bias in search" presented previously shows how interventions may have consequences for end users and decision makers and how the interventions themselves may introduce bias.
Thus, our proposal to deal with bias and produce resilient data pipelines20 is based on the assumption that the algorithmic results should not be subjectively changed by potentially biased design choices, but rather enriched with metadata that can surface information about bias to the end users and information consumers. In this way, they would be empowered to make their own choices and interventions on the system.
A bias management pipeline. Our proposal, alternative to removal of bias, consists of five steps (see Figure 2).
Recent research has started to look at some of these steps. For example, on the front of measuring bias, Lum et al.13 have shown how current measures that try to quantify model performance differences for different parts of a population are themselves statistically biased estimators and new ways to measure bias are needed.
The ethics of adapting for bias. One important question that must be raised is about the appropriateness of surfacing bias metadata to end users. In certain situations, this may lead to negative behavior and potentially to harm. For example, not all people may be comfortable being exposed to evidence of discrimination existing in society, or in the dataset they are looking at, and could instead feel safer when presented with culturally aligned data, even if that potentially reinforces stereotypes.
When dealing with bias present in data and results, deciding about the most appropriate way to adapt for it is a culturally dependent, subjective decision and should be the result of each individual's preference. For example, when adapting for gender-biased search results, we might want to consider that women in STEM are the majority in many Iranian and Indian universities, but this may not be the case in other locations. Similar questions should be asked by system designers when applying personalization techniques. For example, Reinecke and Bernstein19 have shown in a controlled study how users were 22% faster when using a culturally adapted user interface.
In the end, the goal would be to empower end users and to provide them with informative, and potentially perspective-changing, adaptive strategies based on their own preferences on which adjustments should or should not be applied to any existing data bias. To deal with these challenges, we claim that such adaptations should be made available by data-driven systems but ultimately chosen by the end users according to their own preferences, beliefs, and comfort.
Bias is part of human nature, and it should be managed rather than removed as removal would introduce a different type of bias by the system designers and engineers making ad hoc choices.10 The bias management model detailed in this column envisions a different approach from the current bias and fairness research. We believe the ideas detailed here can lead to a more sound, informed, and transparent data-driven decision-making process that will impact future data pipeline design.
Figure. Watch the authors discuss this work in the exclusive Communications video. https://cacm.acm.org/videos/data-bias-management
1. Amigó, E. et al. A unifying and general account of fairness measurement in recommender systems. Information Processing and Management 60, 1 (2023), 103–115. 10.1016/j.ipm.2022.103115
9. Han, L., Dong, X., and Demartini, G. Iterative human-in-the-loop discovery of unknown unknowns in image datasets. In Proceedings of the AAAI Conf. on Human Computation and Crowdsourcing 9 (2021), 72–83.
15. Otterbacher, J., Bates, J., and Clough, P. Competent men and warm women: Gender stereotypes and backlash in image search results. In Proceedings of the 2017 CHI Conf. on Human Factors in Computing Systems. (2017), 6620–6631.
16. Otterbacher, J. et al. Investigating user perception of gender bias in image search: the role of sexism. In Proceedings of the 41st Intern. ACM SIGIR Conf. on Research and Development in Information Retrieval. (2018), 933–936.
17. Pastaltzidis, I. et al. Data augmentation for fairness-aware machine learning: Preventing algorithmic bias in law enforcement systems. In 2022 ACM Conference on Fairness, Accountability, and Transparency. (2022), 2302–2314.
18. Perikleous, P. et al. How does the crowd impact the model? A tool for raising awareness of social bias in crowdsourced training data. In Proceedings of the 31st ACM Intern. Conf. on Information & Knowledge Management. (2022), 4951–4954.
19. Reinecke, K. and Bernstein, A. Improving performance, perceived usability, and aesthetics with culturally adaptive user interfaces. ACM Transactions on Computer-Human Interaction (TOCHI) 18, 2 (2011), 1–29.
21. Sarasua, C. et al. The evolution of power and standard Wikidata editors: Comparing editing behavior over time to predict lifespan and volume of edits. Computer Supported Cooperative Work (CSCW) 28, 5 (2019), 843–882.
22. Schick, T., Udupa, S., and Schütze, H. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Transactions of the Association for Computational Linguistics 9 (2021), 1408–1424.
23. Silberzahn, R. et al. Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science 1, 3 (2018), 337–356.
© 2024 Copyright held by the owner/author(s).
The Digital Library is published by the Association for Computing Machinery. Copyright © 2024 ACM, Inc.