Authors
- Morgan Klaus Scheuerman
- Dora Zhao*
- Jerone T. A. Andrews
- Abeba Birhane
- Q. Vera Liao*
- Georgia Panagiotidou*
- Pooja Chitre*
- Kathleen Pine
- Shawn Walker*
- Jieyu Zhao*
- Alice Xiang
* External authors
Venue
- ACM SIGCHI
Date
- 2025
Responsibly Training Foundation Models: Actualizing Ethical Principles for Curating Large-Scale Training Datasets in the Era of Massive AI Models
Abstract
AI technologies have become ubiquitous, influencing domains from healthcare to finance and permeating our daily lives. Concerns about the values underlying the creation and use of datasets to develop AI technologies are growing. Current dataset practices often disregard critical ethical issues, despite the fact that data represents and impacts real people. While progress has been made in establishing best practices for curating smaller datasets in a more ethical fashion, the unprecedented scale of training data in the era of foundation models presents unique hurdles that AI researchers and practitioners must now face. This workshop aims to unite interdisciplinary researchers and practitioners in an effort to identify the challenges unique to curating datasets for large-scale foundation models, and then begin to ideate best practices for tackling those challenges. Drawing from CSCW’s tradition of interdisciplinary exchange, our aim is to cultivate a diverse community of researchers and practitioners interested in defining the future of ethical responsibility in the composition, process, and release of large-scale datasets for foundation model training. We will disseminate the outcomes of this workshop to the HCI community and beyond by developing a conceptual framework of both the challenges and potential solutions associated specifically with curating datasets for foundation models.