Responsibly Training Foundation Models: Actualizing Ethical Principles for Curating Large-Scale Training Datasets in the Era of Massive AI Models

Morgan Klaus Scheuerman

Dora Zhao*

Jerone T. A. Andrews

Abeba Birhane

Q. Vera Liao*

Georgia Panagiotidou*

Pooja Chitre*

Kathleen Pine

Shawn Walker*

Jieyu Zhao*

Alice Xiang

* External authors

ACM SIGCHI

2025

Abstract

AI technologies have become ubiquitous, influencing domains from healthcare to finance and permeating our daily lives. Concerns about the values underlying the creation and use of datasets to develop AI technologies are growing. Current dataset practices often disregard critical ethical issues, despite the fact that data represents and impacts real people. While progress has been made in establishing best practices for curating smaller datasets in a more ethical fashion, the unprecedented scale of training data in the era of foundation models presents unique hurdles that AI researchers and practitioners must now face. This workshop aims to unite interdisciplinary researchers and practitioners in an effort to identify the challenges unique to curating datasets for large-scale foundation models, and then to begin ideating best practices for tackling those challenges. Drawing from CSCW’s tradition of interdisciplinary exchange, our aim is to cultivate a diverse community of researchers and practitioners interested in defining the future of ethical responsibility in the composition, process, and release of large-scale datasets for foundation model training. We will disseminate the outcomes of this workshop to the HCI community and beyond by developing a conceptual framework of both the challenges and potential solutions associated specifically with curating datasets for foundation models.

Related Publications

GenDataAgent: On-the-fly Dataset Augmentation with Synthetic Data

ICLR, 2026
Zhiteng Li, Lele Chen, Jerone Andrews, Yunhao Ba, Yulun Zhang, Alice Xiang

We propose a generative agent that augments training datasets with synthetic data for model fine-tuning. Unlike prior work, which uniformly samples synthetic data, our agent iteratively generates relevant samples on-the-fly, aligning with the target distribution. It prioritizes…

How Data Workers Shape Datasets: The Role of Positionality in Data Collection and Annotation for Computer Vision

ACM SIGCHI, 2025
Morgan Klaus Scheuerman, Allison Woodruff, Jed R. Brubaker

Data workers play a key role in the big data industry. Clients hire data workers to collect and annotate data with human identity concepts, like demographic categories or clothing items. Often, such workers are treated as computational—they are expected to quickly and object…

Transphobia is in the Eye of the Prompter: Trans-Centered Perspectives on Large Language Models

CVPR, 2025
Morgan Klaus Scheuerman, Katy Weathington, Adrian Petterson, Dylan Thomas Doyle, Dipto Das, Michael Ann DeVito, Jed R. Brubaker

Large language models (LLMs) are the new hot trend being rapidly integrated into products and services—often, in chatbots. LLM-powered chatbots are expected to respond to any number of topics, including topics central to gender identity. In light of rising anti-trans discour…

