Opinion
Artificial Intelligence and Machine Learning

Confusing the Map for the Territory

The limits of datasets for culturally inclusive AI.

Image: bazaar in Turkey

Imagine you are a marketing professional prompting an artificial intelligence (AI) image generator to produce different images of Pakistani urban streetscapes. What if the model, despite prompting for specificity, produces Orientalist scene after scene of dusty streets, poverty, and chaos—missing the landmarks, social scenes, and human diversity that make a Pakistani city unique? This example illustrates a growing concern about the cultural inclusivity of AI: systems that fail to work for global populations and instead reinforce stereotypes, erasing whole swaths of particular populations from AI-generated output.8

To address such issues of cultural inclusion in AI, the field has attempted to incorporate cultural knowledge into models through a common tool in its arsenal: datasets. Datasets of, for instance, global values, offensive terms, and cultural artifacts are all attempts to incorporate cultural awareness into models.

But trying to capture culture in datasets is akin to believing you have captured everything important about the world in a map. A map is an abstracted and simplified two-dimensional representation of a multidimensional world. While a valuable tool, using maps effectively requires understanding the limits of their correspondence with the physical world. One must know, for example, how the Mercator projection map, created in the 1500s and adopted in the 1700s as the global standard for navigation, distorted the relative sizes of the continents. Confusing the abstraction for the reality has led to all sorts of trouble. Colonial powers used the Mercator projection maps of the physical world to demarcate social worlds—drawing lines through simplified representations on a map, separating communities and leading to decades of ethnic strife, all to make navigation supposedly more efficient.

Similarly, today’s datasets involve abstracting dynamic social concepts from their context. Previous experiences with using datasets for sociotechnical decisions show what gets lost in translation as we compress complex social worlds into datasets. For instance, in algorithmic content moderation, datasets of offensive terms are used to train classifiers to detect hate speech. Hate speech, however, is not an inherent property of a word or phrase, but, rather, is produced in the context of social interactions. 

A single word can be a neutral descriptor in one interaction and a harmful stereotype or offensive term in another, based on how a given word is used, by whom, to whom, and for what purpose. Because it is difficult to incorporate social interactions in a dataset, the culturally laden interpretations of words shared online get reified into datasets as objective truth. For instance, classification algorithms have mislabeled terms that communities have reclaimed (for example, “queer,” “dyke”) as hate speech, further marginalizing the speech of those very communities.9
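To make the failure mode concrete, consider a deliberately simplistic sketch in Python. The classifiers studied in the cited work are learned models rather than word lists, and the names below are our own hypothetical illustration; but the underlying problem is the same: when offensiveness is encoded in a dataset as a property of terms, the resulting label cannot depend on who is speaking, to whom, or why.

```python
# Toy illustration (not any production system): offensiveness treated as a
# property of the word itself, stripped of all interactional context.
OFFENSIVE_TERMS = {"queer", "dyke"}  # terms reclaimed by the communities they describe

def flag_as_hate_speech(text: str) -> bool:
    """Return True whenever a listed term appears, ignoring speaker, audience, and intent."""
    tokens = {token.strip(".,!?\"'").lower() for token in text.split()}
    return bool(tokens & OFFENSIVE_TERMS)

# Both utterances receive the same label, even though the first is
# self-identification and the second is targeted abuse.
print(flag_as_hate_speech("We are proud members of the queer community."))  # True
print(flag_as_hate_speech("Get lost, you queer."))                          # True
```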

Datasets also rely on taxonomizing at scale—squashing the richness of the social world into a categorization schema. But identity labels and categories, such as race, are social constructs that do not translate across cultural contexts. For instance, skin tones, often used as a proxy for race or ethnicity, are not relevant social markers in many parts of the world. As Chimamanda Ngozi Adichie writes, “I had never before thought of myself as ‘Black’; I did not need to … In Nigeria, I was Igbo and Roman Catholic.” But once she moved across the Atlantic, a racial category label was applied to her that did not meaningfully reflect her lived experience. Thus, we cannot address a lack of cultural diversity in AI through datasets that uncritically reinforce universalized, and thus flattening, assumptions of the relationship between visual proxies for identity (for example, skin color), social identities (for example, race), and social experiences.

One might believe that an expanded dataset would solve these problems—perhaps we should just use more data and categories. But as the work of Geoffrey Bowker and Susan Leigh Star reminds us, the act of categorization is always political. For instance, Arabic words like “shaheed” (that is, “martyr”) have been censored from social media platforms by being categorized as harmful. Yet, this word is used to describe a diverse range of deaths, including from illness. The perspective from which human annotators are categorizing a word thus changes how a system behaves in response.

As such, a dataset is never neutral—there is no “view from nowhere”—since even the most ‘complete’ dataset reflects particular perspectives and will always be a product of the time the data was collected. Like maps, datasets give an illusion of completeness while instantiating particular views of the world. For instance, imagine constructing a more ‘inclusive’ dataset of photographs of non-Western contexts to train a model. What happens if the photographs most easily scraped from the Internet were taken by tourists or come from museum collections,7 annotated with labels that amplify the perspective of colonial powers over the communities who produced the art? Failing to recognize these perspectives would replicate the very dominant point of view the dataset sought to move beyond.

So what might be done instead? Recognizing that maps can be useful abstractions, we invite you to imagine alternative visions for more culturally responsive AI development. At the center of our argument is the recognition that datasets are brittle artifacts that can only offer brief glimpses of social worlds. As time-bound, socially structured bits of information, data comes from multiple and tenuous ground truths. Thus, we must focus on unpacking the social nature of data and the social scientific work it takes to develop culturally inclusive training datasets (that is, not just seeing it as technical work). What might this mean?

To develop datasets that can offer more contextualized interpretations of the social world, we must recognize the interpretive processes of meaning-making that go into creating and annotating social datasets. We need to shift away from current annotation practices that mathematically aggregate individual annotators’ labels into a single value (for example, by majority vote). For socially complex data, which may be contested, a single ground-truth label will always work against cultural diversity. Instead, we need new methods for data annotation work that feature interpretation and deliberation of data’s social meanings. For instance, the field might more broadly adopt qualitative methods for annotation, such as workshops that enable deliberation among data workers or subject matter experts for socially contested concepts (for example, safety, culture, stereotypes), so that labels and annotations can be collectively constructed or debated.3
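As a sketch of what such an annotation record might look like (the schema and field names are ours, purely illustrative), consider a data structure that preserves every annotator’s judgment, rationale, and vantage point rather than collapsing them into one label:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """One annotator's interpretation, kept together with its rationale and vantage point."""
    annotator_id: str
    label: str             # for example, "reclaimed in-group usage" vs. "slur"
    rationale: str         # free-text reasoning captured during deliberation
    cultural_context: str  # the perspective from which the judgment was made

@dataclass
class ContestedItem:
    """A data item that stores all interpretations instead of a single 'ground truth'."""
    content: str
    annotations: list[Annotation] = field(default_factory=list)

    def label_distribution(self) -> dict[str, int]:
        """Expose disagreement to downstream consumers rather than averaging it away."""
        counts: dict[str, int] = {}
        for annotation in self.annotations:
            counts[annotation.label] = counts.get(annotation.label, 0) + 1
        return counts
```

A model trained on such records could be shown the full distribution of interpretations, not only the majority label.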

Expanding on the aforementioned methods for the interpretation and deliberation of social data is necessary, but not sufficient for grappling with the complexity of the social world. Results of annotation processes (even deliberative ones) depend on who is at, and who pays for, the ‘table’ of AI development. We thus need to recognize that annotating social data requires diverse backgrounds6 and forms of expertise. At a minimum, annotators should have lived experience with the cultural context of the data, as other work has proposed.4 Here, we do not mean to create an equivalence between identity categories and lived experiences; instead, we call on researchers to consider what specific (for example, cultural) expertise their dataset annotations require, beyond just randomly selecting anonymous annotators from a particular country. In addition, we should recruit annotators with relevant disciplinary training, such as subject matter experts in curating and organizing information, or in the politics of capturing information about the social world—experts in museum curation, archives, heritage preservation, media and communication, or climate science.5

It is important to recognize that even with more diverse data and annotation workers who better leverage social knowledge, a perfectly representative system of all cultures is not a sustainable, let alone desirable, goal. Cultures change over time, including in their expression and even in the meanings of individual words. As such, any dataset must be treated as a snapshot of a particular culture at a particular time. It will become stale and outdated, in myriad and subtle ways, as soon as it is collected and labeled. This is akin to the concept of dataset drift, where the distribution of data changes after a model is trained, but here applied to cultural data. Instead, we invite researchers to consider openness to cultural context, dynamic user agency, and change as design paradigms. Rather than, for expediency’s sake, settling for datasets that ignore contextually dependent, contested concepts like hate speech, cultural artifacts, or social values, what if we instead designed for modeling social dynamism? What might design paradigms look like that could support the polyvocality and multiple perspectives inherent in our complex, heterogeneous social world?
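One minimal illustration of treating cultural data as a dated snapshot (the labels, numbers, and threshold below are invented for illustration): periodically compare how interpretations of a contested term are distributed in the original snapshot versus freshly collected annotations, and flag when they diverge enough to warrant revisiting the labels with annotators.

```python
# Illustrative only: label distributions and the threshold are invented.
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two label distributions (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)

snapshot_2019 = {"slur": 0.7, "neutral descriptor": 0.2, "reclaimed in-group term": 0.1}
fresh_2024 = {"slur": 0.3, "neutral descriptor": 0.2, "reclaimed in-group term": 0.5}

drift = total_variation(snapshot_2019, fresh_2024)
if drift > 0.2:  # arbitrary threshold, chosen only to illustrate the check
    print(f"Cultural drift detected (TV distance = {drift:.2f}); revisit labels with annotators.")
```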

As one example, we call for novel approaches to dataset schema that might encode multiple narratives into each data object. The archiving project “100 histories of 100 worlds in one object” shows us what this might look like—where an object like an Akan drum can be labeled as an object of trauma when viewed through histories of forced migration and slave trade, or as a living object for social engagement when understood as a daily part of contemporary Akan culture. For AI, imagine a system explicitly trained on datasets that integrate labels, interpretations, and annotations from multiple cultural perspectives, so that when users query it about a sari, instead of giving a single answer about what a sari is, it invites users to clarify what kinds of saris they might want to learn about. Thus, instead of relying on singular datasets that supposedly capture global cultural values, such interfaces might become a supporting partner in a journey of cultural discovery.
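A hypothetical record structure (our own sketch, not the schema of the “100 histories” project or any existing dataset) hints at what encoding multiple narratives per object could look like:

```python
# Hypothetical schema; the object, perspectives, and field names are illustrative.
akan_drum = {
    "object_id": "akan-drum-001",
    "narratives": [
        {
            "perspective": "histories of forced migration and the slave trade",
            "description": "An object of trauma carried across the Atlantic.",
            "contributed_by": "diaspora historians",
        },
        {
            "perspective": "contemporary Akan cultural practice",
            "description": "A living instrument used in everyday social and ceremonial life.",
            "contributed_by": "Akan community members",
        },
    ],
}

def narratives_for(obj: dict, perspective_query: str) -> list[str]:
    """Return every narrative matching the query instead of a single 'correct' answer."""
    return [
        narrative["description"]
        for narrative in obj["narratives"]
        if perspective_query.lower() in narrative["perspective"].lower()
    ]

# An interface could use an empty or ambiguous result to ask the user which
# perspective they want, rather than asserting one answer.
print(narratives_for(akan_drum, "contemporary"))
```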

Given the contextual nature of data on social worlds, we want to avoid what Selbst et al.10 refer to as the “portability trap” in sociotechnical systems: when datasets constructed in one context are sold as ‘ground truth’ and inappropriately used in a radically different cultural context. Thus, we call for more robust documentation practices for datasets that build on existing proposals for data documentation, such as The Data Nutrition Project’s Labels, Datasheets, and Data Cards, in ways that might help developers explicitly document the sociocultural provenance of the dataset. While some datasets, such as the Dollar Street Dataset of household items, provide metadata like country and monthly income of the household, more robust social provenance of the data is needed, such as Andrews et al.2 recommend for responsibly curating computer vision datasets. Documenting who collected and annotated that data—such as if non-local photographers are taking pictures claimed to represent a particular geographical region7—how data objects were chosen, the contexts or use cases the dataset should (not) be used for, and more, might enable developers to make more appropriate choices for using datasets to improve how models reflect the social world.
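As a sketch of the kind of sociocultural provenance such documentation could carry (the field names are ours and not part of Datasheets, Data Cards, or the Data Nutrition Project’s labels), consider:

```python
from dataclasses import dataclass, field

@dataclass
class SocialProvenanceRecord:
    """Illustrative provenance stub to accompany a dataset's documentation."""
    dataset_name: str
    collection_period: str            # the snapshot in time the data represents
    collectors: list[str]             # who gathered the data
    annotator_backgrounds: list[str]  # lived experience and disciplinary expertise
    photographers_local_to_region: bool  # cf. non-local photographers discussed in [7]
    selection_process: str            # how data objects were chosen
    intended_uses: list[str] = field(default_factory=list)
    out_of_scope_uses: list[str] = field(default_factory=list)
```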

More broadly, we encourage the field of AI not to confuse the map for the territory, or as Agre argued, confuse the representation for the thing.1 Datasets are not the totality of the world. We are reminded of a character from a Lewis Carroll story who joyfully declares the creation of a map on the scale of a “mile to the mile,” where unrolling the map would cover the entire country—it is no longer clear if the map is being made for something useful or just to make a bigger map. Acknowledging the hubris of trying to capture the complexity of human cultural expression is a necessary first step. The next step is to stop the rapacious pursuit of more data and scale, to instead focus on approaches that foster, not flatten, the richness, joy, and wonder of our infinite social worlds.

    References

    • 1. Agre, P. Computation and Human Experience. Cambridge University Press (1997).
    • 2. Andrews, J. et al. Ethical considerations for responsible data curation. Advances in Neural Information Processing Systems 36 (2023).
    • 3. Bergman, S. et al. STELA: A community-centred approach to norm elicitation for AI alignment. Scientific Reports 14, 1 (2024).
    • 4. Diaz, M. et al. CrowdWorkSheets: Accounting for individual and collective identities underlying crowdsourced dataset annotation. In Proceedings of the 2022 ACM Conf. on Fairness, Accountability, and Transparency (2022).
    • 5. Jo, E.S. and Gebru, T. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 Conf. on Fairness, Accountability, and Transparency. (2020).
    • 6. Kapania, S., Taylor, A.S., and Wang, D. A hunt for the snark: Annotator diversity in data practices. In Proceedings of the 2023 CHI Conf. on Human Factors in Computing Systems (2023).
    • 7. Naggita, K., LaChance, J., and Xiang, A. Flickr Africa: Examining geo-diversity in large-scale, human-centric visual data. In Proceedings of the 2023 AAAI/ACM Conf. on AI, Ethics, and Society (2023).
    • 8. Qadri, R. et al. AI’s regimes of representation: A community-centered study of text-to-image models in South Asia. In Proceedings of the 2023 ACM Conf. on Fairness, Accountability, and Transparency (2023).
    • 9. Sap, M. et al. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019).
    • 10. Selbst, A.D. et al. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conf. on Fairness, Accountability, and Transparency (2019).
