
A Taxonomy of Challenges to Curating Fair Datasets

Dora Zhao*

Morgan Klaus Scheuerman

Pooja Chitre*

Jerone Andrews

Georgia Panagiotidou*

Shawn Walker*

Kathleen H. Pine*

Alice Xiang

* External authors

NeurIPS 2024

Abstract

Despite extensive efforts to create fairer machine learning (ML) datasets, there remains a limited understanding of the practical aspects of dataset curation. Drawing from interviews with 30 ML dataset curators, we present a comprehensive taxonomy of the challenges and trade-offs encountered throughout the dataset curation lifecycle. Our findings underscore overarching issues within the broader fairness landscape that impact data curation. We conclude with recommendations aimed at fostering systemic changes to better facilitate fair dataset curation practices.

Related Publications

Transphobia is in the Eye of the Prompter: Trans-Centered Perspectives on Large Language Models

CVPR, 2025
Morgan Klaus Scheuerman, Katy Weathington, Adrian Petterson, Dylan Thomas Doyle, Dipto Das, Michael Ann DeVito, Jed R. Brubaker

Large language models (LLMs) are the new hot trend being rapidly integrated into products and services—often, in chatbots. LLM-powered chatbots are expected to respond to any number of topics, including topics central to gender identity. In light of rising anti-trans discour…

Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective

EMNLP, 2024
Zhaotian Weng*, Zijun Gao*, Jerone Andrews, Jieyu Zhao*

Vision-language models (VLMs) pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with specific objects or scenarios. Current methods, which focus on modifying inputs and monitoring changes in the model's output probability scor…

Resampled Datasets Are Not Enough: Mitigating Societal Bias Beyond Single Attributes

EMNLP, 2024
Yusuke Hirota, Jerone Andrews, Dora Zhao*, Orestis Papakyriakopoulos*, Apostolos Modas, Yuta Nakashima*, Alice Xiang

We tackle societal bias in image-text datasets by removing spurious correlations between protected groups and image attributes. Traditional methods only target labeled attributes, ignoring biases from unlabeled ones. Using text-guided inpainting models, our approach ensures …
