
A Taxonomy of Challenges to Curating Fair Datasets

Dora Zhao*

Morgan Klaus Scheuerman

Pooja Chitre*

Jerone Andrews

Georgia Panagiotidou*

Shawn Walker*

Kathleen H. Pine*

Alice Xiang

* External authors

NeurIPS 2024

Abstract

Despite extensive efforts to create fairer machine learning (ML) datasets, there remains a limited understanding of the practical aspects of dataset curation. Drawing from interviews with 30 ML dataset curators, we present a comprehensive taxonomy of the challenges and trade-offs encountered throughout the dataset curation lifecycle. Our findings underscore overarching issues within the broader fairness landscape that impact data curation. We conclude with recommendations aimed at fostering systemic changes to better facilitate fair dataset curation practices.

Related Publications

Transphobia is in the Eye of the Prompter: Trans-Centered Perspectives on Large Language Models

CVPR, 2025
Morgan Klaus Scheuerman, Katy Weathington, Adrian Petterson, Dylan Thomas Doyle, Dipto Das, Michael Ann DeVito, Jed R. Brubaker

Large language models (LLMs) are the new hot trend being rapidly integrated into products and services—often, in chatbots. LLM-powered chatbots are expected to respond to any number of topics, including topics central to gender identity. In light of rising anti-trans discour…

Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective

EMNLP, 2024
Zhaotian Weng*, Zijun Gao*, Jerone Andrews, Jieyu Zhao*

Vision-language models (VLMs) pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with specific objects or scenarios. Current methods, which focus on modifying inputs and monitoring changes in the model's output probability scor…

Resampled Datasets Are Not Enough: Mitigating Societal Bias Beyond Single Attributes

EMNLP, 2024
Yusuke Hirota, Jerone Andrews, Dora Zhao*, Orestis Papakyriakopoulos*, Apostolos Modas, Yuta Nakashima*, Alice Xiang

We tackle societal bias in image-text datasets by removing spurious correlations between protected groups and image attributes. Traditional methods only target labeled attributes, ignoring biases from unlabeled ones. Using text-guided inpainting models, our approach ensures …
